I think text classification could be used for this purpose. You would have to extract text blocks from the HTML (for example, those enclosed in <td></td> or <div></div>), then compare each block against a previously trained model and discard the blocks whose score falls below a certain threshold (to get you started, see the 'Techniques' section at http://en.wikipedia.org/wiki/Text_classification, for instance). The best place to implement such a feature in Nutch is probably the HtmlParser class.
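As a rough sketch of the idea, here is a minimal, self-contained Java example. It pulls out <td>/<div> blocks with a regex and scores each one with a trivial text-to-markup ratio as a stand-in for a trained classifier; the class name, the scoring heuristic, and the 0.8 threshold are all my own illustrative choices, not anything from Nutch, and a real implementation would walk the DOM produced by HtmlParser rather than use regexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockFilter {
    // Matches non-nested <td>...</td> or <div>...</div> blocks.
    // A real implementation should traverse the parse tree instead.
    private static final Pattern BLOCK =
        Pattern.compile("<(td|div)[^>]*>(.*?)</\\1>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Stand-in for a trained model: score a block by the ratio of plain
    // text to total markup. Ad/navigation blocks tend to be mostly tags
    // (links, images), so they score low; story text scores high.
    static double score(String block) {
        if (block.length() == 0) return 0.0;
        String text = block.replaceAll("<[^>]+>", "").trim();
        return (double) text.length() / block.length();
    }

    // Return the plain text of blocks whose score meets the threshold.
    static List<String> extract(String html, double threshold) {
        List<String> kept = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            String inner = m.group(2);
            if (score(inner) >= threshold) {
                kept.add(inner.replaceAll("<[^>]+>", "").trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/ads\">Buy now</a></div>"
                    + "<div>Reuters reports that the long article text goes here.</div>";
        // The link-heavy ad block is dropped; the story block is kept.
        System.out.println(extract(html, 0.8));
    }
}
```

Swapping score() for a call into a proper classifier (naive Bayes, SVM, etc.) trained on labelled story/non-story blocks would give you the thresholding step described above.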
d e wrote:
> I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the
> junk *on each page*. I am indexing news sites. I want to harvest news
> STORIES, not the advertisements and other junk text around the outside of
> each page. Got suggestions for THAT problem?

- --
Best regards,
Bjoern Wilmsmann

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers