This is a Natural Language Processing problem, although you can also take hints from the URL graph structure and from host block lists. As far as I know, Nutch does not support this natively, but you can extend it to recognize and filter ads. Start by reading up on how to develop plugins, and look at the indexing plugin as well.
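As a rough illustration of the host block list idea, here is a minimal, standalone Java sketch. It does not use any Nutch API: the class name AdHostFilter, the method isBlocked, and the listed hosts are all made up for the example, and hooking the check into a Nutch plugin is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a URL points at a
// host that appears on an ad block list.
class AdHostFilter {
    // Placeholder entries (assumption); a real list would be loaded from a file.
    private static final Set<String> BLOCKED_HOSTS =
            new HashSet<>(Arrays.asList("ads.example.com", "tracker.example.net"));

    // Returns true when the URL's host, or any parent domain of it, is listed.
    static boolean isBlocked(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are left for other filters to handle
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Walk up the domain: img.ads.example.com -> ads.example.com -> ...
        while (true) {
            if (BLOCKED_HOSTS.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(isBlocked("http://img.ads.example.com/banner.gif")); // true
        System.out.println(isBlocked("http://news.example.org/story.html"));    // false
    }
}
```

A check like this only catches ad content that is served from a known host. Separating the article text from the surrounding junk on the same page is the harder NLP part.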
Regards,
Steve

> -----Original Message-----
> From: d e [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 09, 2007 6:49 PM
> To: nutch-dev@lucene.apache.org
> Subject: Indexing the Interesting Part Only...
>
> If I'm indexing a news article, I want to avoid getting the junk (other
> than the title, author and article) into the index. I want to avoid
> getting the advertisements, etc. How do I do that sort of thing?
>
> What parts of what manual should I be reading so I will know how to do
> this sort of thing?