Berlin Brown wrote:
Yeah, you are right.  You have to have a constrained set of domains to
search, and to be honest, that works pretty well.  The only thing is, I
still get a lot of junk links.  I would say that about 30% are valid or
interesting links while the rest are pretty much worthless.  I guess it is
a matter of studying spam filters and weeding the junk out, but I have been
kind of lazy about doing so.

http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled

I have already built the site I am describing, based on a short
list of popular domains and using only the very basic aspects of Nutch.  You
can search at the link above and see what you think.  I had about 100k links
with my last crawl.
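
Constraining the crawl to a short whitelist of domains like that is mostly
a matter of the regex url filter configuration.  A minimal sketch, assuming
a default-style conf/crawl-urlfilter.txt (or regex-urlfilter.txt, depending
on how you run the crawl) and with placeholder domains, could look like this:

# skip urls with characters that usually mean session ids or dynamic junk
-[?*!@=]
# accept only hosts under the whitelisted domains (placeholder domains)
+^http://([a-z0-9\-]+\.)*example\.com/
+^http://([a-z0-9\-]+\.)*example\.org/
# reject everything else
-.

The rules are applied top to bottom and the first match wins, so the
final "-." line rejects anything that was not explicitly whitelisted.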

There are quite a few companies (that I know of) that maintain indexes of between 50 and 300 million pages. All of them have implemented their own strategy, specific to their needs, to solve this issue.

It's true that if you start crawling without any constraints, very quickly (~20-30 full cycles) your crawldb will be 90% junk, porn and spam. Some strategies to fight this are based on content analysis (detection of porn-related content), url analysis (presence of certain patterns in urls), and link analysis (analysis of the link neighborhood). There are a lot of research papers on these subjects, and many strategies can be implemented as Nutch plugins. A rough sketch of the url analysis approach follows below.
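
To make the url analysis idea concrete, here is a small, standalone sketch
in Java.  The spam patterns and thresholds are made up for illustration (a
real list would come from studying your own crawldb), and it only mimics the
convention of Nutch url filter plugins, where filter() returns the url to
keep it or null to reject it:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Sketch of url-based junk detection: reject a url when it matches known
 * spam patterns or has a suspicious structure (hyphen-stuffed hosts,
 * very long query strings).  Patterns and thresholds are illustrative only.
 */
public class JunkUrlFilter {

  // Hypothetical spam keywords; a production list would be much larger.
  private static final List<Pattern> SPAM_PATTERNS = Arrays.asList(
      Pattern.compile("(?i)(viagra|casino|porn|xxx)"),
      Pattern.compile("(?i)\\.(exe|zip)$"));

  /** Returns the url if it looks ok, or null to reject it. */
  public String filter(String url) {
    for (Pattern p : SPAM_PATTERNS) {
      if (p.matcher(url).find()) {
        return null;
      }
    }
    // Hosts stuffed with hyphens are often link farms.
    String host = url.replaceFirst("^https?://", "").split("/")[0];
    int hyphens = host.length() - host.replace("-", "").length();
    if (hyphens > 3) {
      return null;
    }
    // Very long query strings tend to be calendar or session-id junk.
    int q = url.indexOf('?');
    if (q >= 0 && url.length() - q > 100) {
      return null;
    }
    return url;
  }

  public static void main(String[] args) {
    JunkUrlFilter f = new JunkUrlFilter();
    System.out.println(f.filter("http://example.com/news/article.html"));      // kept
    System.out.println(f.filter("http://cheap-viagra-pills.example/buy-now")); // null
  }
}

The same kind of check could be wrapped as an actual Nutch url filter
plugin; content and link analysis take more machinery, but the idea of
rejecting junk as early as possible is the same.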


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
