I think if anyone here had the perfect answer for that one, they would have sold it to Google, Microsoft, or Yahoo for a ton of money. You will need an algorithm that can detect ads. I have not written ad filters myself, since my search engine currently uses a domain whitelist, but I can tell you that a whole-web crawl will definitely need one, since it can cut the pages in the index by 10-20%. If you do a whole-web crawl you will also need spam detection.
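I can sketch the domain-list idea, though. This is not code from my engine, just a rough illustration: the class name, the seed domains, and the subdomain check are all made up for the example, and a real deployment would load the list from a file and probably invert it into a whitelist as I do.

    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    /** Rough sketch: reject URLs whose host matches a known ad-serving domain. */
    public class AdDomainFilter {
        // Placeholder seed list; a real one would be loaded from a file.
        private static final Set<String> AD_DOMAINS = new HashSet<String>(
            Arrays.asList("doubleclick.net", "ads.example.com"));

        /** True if the URL's host is, or is a subdomain of, a listed ad domain. */
        public static boolean isAdUrl(String url) {
            try {
                String host = new URL(url).getHost().toLowerCase();
                for (String adDomain : AD_DOMAINS) {
                    if (host.equals(adDomain) || host.endsWith("." + adDomain)) {
                        return true;
                    }
                }
            } catch (Exception e) {
                // Unparseable URLs are treated as suspect; adjust to taste.
                return true;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(isAdUrl("http://ad.doubleclick.net/click?x=1")); // true
            System.out.println(isAdUrl("http://news.example.org/story.html"));  // false
        }
    }
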
I would recommend looking for some academic papers on the topic. Maybe use CiteSeer or something like that.

Steve

-----Original Message-----
From: d e [mailto:[EMAIL PROTECTED]
Sent: Saturday, March 10, 2007 3:07 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing the Interesting Part Only...

We plan to index many websites. Got any suggestions on how to drop the junk without having to do too much work for each such site? Know anyone who has a background in doing this sort of thing? What sorts of approaches would you recommend? Are there existing plug-ins I should consider using?

On 3/9/07, J. Delgado <[EMAIL PROTECTED]> wrote:
>
> You have to build a special HTML junk parser.
>
> 2007/3/9, d e <[EMAIL PROTECTED]>:
> >
> > If I'm indexing a news article, I want to avoid getting the junk (other
> > than the title, author, and article) into the index. I want to avoid
> > getting the advertisements, etc. How do I do that sort of thing?
> >
> > What parts of what manual should I be reading so I will know how to do
> > this sort of thing?
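To make the "HTML junk parser" idea above a bit more concrete: one common heuristic is to score each block-level chunk of the page by text length and link density, and keep only the long, link-poor blocks, since navigation bars and ad blocks tend to be short and link-heavy while article paragraphs are the opposite. The sketch below is plain Java, not an existing Nutch plug-in, and the thresholds (80 characters, one link per 40 characters) are guesses you would have to tune.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Rough sketch: keep only text blocks with low link density. */
    public class JunkStripper {

        /** Extracts the likely article text by scoring block-level chunks. */
        public static String extractMainText(String html) {
            // Drop script and style bodies entirely.
            html = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
            StringBuilder result = new StringBuilder();
            // Split on block-level tag boundaries; score each chunk on its own.
            for (String block : html.split("(?i)</?(p|div|td|li|h[1-6])[^>]*>")) {
                int links = countMatches(block, "(?i)<a\\s");
                String text = block.replaceAll("<[^>]+>", " ")
                                   .replaceAll("\\s+", " ").trim();
                // Guessed thresholds: long text, fewer than one link per 40 chars.
                if (text.length() > 80 && links * 40 < text.length()) {
                    result.append(text).append("\n");
                }
            }
            return result.toString();
        }

        private static int countMatches(String s, String regex) {
            Matcher m = Pattern.compile(regex).matcher(s);
            int n = 0;
            while (m.find()) n++;
            return n;
        }
    }

Something this crude will still let some junk through and occasionally drop a short real paragraph, which is why the academic work on page segmentation and boilerplate detection is worth a look before you commit to an approach.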