Berlin Brown wrote:
Yeah, you are right. You have to have a constrained set of domains to
search, and to be honest, that works pretty well. The only thing is, I
still get a lot of junk links. I would say that about 30% are valid or
interesting links while the other 70% are pretty much worthless. I guess
it is a matter of studying spam filters and weeding those out, but I
have been kind of lazy about doing so.
http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
I have already built the site I am describing, based on a short list
of popular domains and using only the very basic features of Nutch. You
can try the search above and see what you think. My last crawl gave me
about 100k links.
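For reference, the domain whitelist in a setup like this usually lives
in Nutch's conf/regex-urlfilter.txt. A rough sketch (example.com and
example.org are just placeholders for whatever domains you actually
whitelist):

  # conf/regex-urlfilter.txt -- sketch only, placeholder domains
  # drop URLs with obvious junk patterns, whatever the domain
  -(?i)(viagra|casino|porn)
  # accept pages from the whitelisted domains
  +^http://([a-z0-9-]+\.)*example\.com/
  +^http://([a-z0-9-]+\.)*example\.org/
  # reject everything that did not match above
  -.

Rules are tried top to bottom and the first match wins, so the trailing
"-." line drops everything that is not explicitly whitelisted.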
There are quite a few companies (that I know of) that maintain indexes
of between 50 and 300 million pages. All of them have implemented their
own strategy (specific to their needs) to solve this issue.
It's true that if you start crawling without any constraints, very
quickly (~20-30 full cycles) your crawldb will contain 90% junk, porn,
and spam. Some strategies to fight this are based on content analysis
(detection of porn-related content), URL analysis (presence of certain
patterns in URLs), and link analysis (analysis of the link
neighborhood). There are a lot of research papers on these subjects, and
many of these strategies can be implemented as Nutch plugins.
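For example, the URL analysis part fits naturally into a URLFilter
plugin. A rough sketch, assuming the org.apache.nutch.net.URLFilter
interface from Nutch 0.9/1.x (the class name, package and patterns below
are made up for illustration, and details may differ between versions):

  // Drops URLs matching crude junk/spam patterns; everything else passes.
  package org.example.nutch;                      // hypothetical package

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class JunkURLFilter implements URLFilter {

    // toy examples of "certain patterns in urls" -- tune for your own crawl
    private static final Pattern JUNK = Pattern.compile(
        "(?i)(viagra|casino|porn|\\.xxx/|[?&](sid|sessionid)=)");

    private Configuration conf;

    /** Return the URL unchanged to keep it, or null to drop it. */
    public String filter(String urlString) {
      if (urlString == null || JUNK.matcher(urlString).find()) {
        return null;        // rejected
      }
      return urlString;     // accepted
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

Once packaged with a plugin.xml descriptor and listed in plugin.includes
in nutch-site.xml, such a filter is applied during injection and outlink
processing, so any URL for which filter() returns null never reaches the
fetch list.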
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com