I think if anyone here had the perfect answer for that one, they would have
sold it to Google, Microsoft or Yahoo for a ton of money. You will need an
algorithm that can detect ads. I have not written ad filters, since my search
engine currently uses a domain whitelist. I can tell you that a whole-web
crawl will definitely need one, since it can cut the number of pages in the
index by 10-20%. If you do a whole-web crawl you will also need spam detection.
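
For what it's worth, the domain whitelist idea can be sketched in a few lines.
This is only a minimal illustration in Python, not the code my search engine
runs and not a Nutch API: the names ALLOWED_DOMAINS, is_whitelisted and
filter_urls are hypothetical, and the domains are placeholders. It only decides
which URLs get crawled; it does nothing about ads inside a page.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist: placeholder domains, not taken from any real setup.
ALLOWED_DOMAINS = {"example.com", "news.example.org"}


def is_whitelisted(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    # hostname is None for strings without a network location; treat as no match.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Match the domain itself or any subdomain, but not look-alikes such as
    # "notexample.com", which merely ends with the same characters.
    return any(host == d or host.endswith("." + d) for d in allowed)


def filter_urls(urls, allowed=ALLOWED_DOMAINS):
    """Keep only the URLs whose host passes the whitelist check, in order."""
    return [u for u in urls if is_whitelisted(u, allowed)]
```

Something like filter_urls(candidate_urls) would run before URLs are queued
for fetching, so pages from unlisted hosts never reach the index.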

I would recommend looking for some academic papers on the topic. Maybe use
CiteSeer or something like that.

Steve
-----Original Message-----
From: d e [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 10, 2007 3:07 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing the Interesting Part Only...

We plan to index many websites. Got any suggestions on how to drop the junk
without having to do too much work for each such site? Know anyone who has a
background in doing this sort of thing? What sorts of approaches would you
recommend?

Are there existing plugins I should consider using?


On 3/9/07, J. Delgado <[EMAIL PROTECTED]> wrote:
>
> You have to build a special HTML Junk parser.
>
> 2007/3/9, d e <[EMAIL PROTECTED]>:
> >
> > If I'm indexing a news article, I want to avoid getting the junk (anything
> > other than the title, author and article) into the index. I want to avoid
> > getting the advertisements, etc. How do I do that sort of thing?
> >
> > What parts of what manual should I be reading so I will know how to do
> > this sort of thing?
> >
>


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
