Berlin Brown wrote:
Yeah, you are right. You have to have a constrained set of domains to
search, and to be honest, that works pretty well. The only thing is, I
still get a lot of junk links. I would say that about 30% are valid or
interesting links while the other 70% are pretty much worthless. I guess
it is a matter of studying spam filters and weeding those out, but I
have been kind of lazy about doing so.
http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
I have already built the site I am describing, based on a short list
of popular domains and using only the very basic features of Nutch. You
can try the search above and see what you think. My last crawl gave me
about 100k links.
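For reference, the domain whitelist in a setup like this usually lives
in Nutch's conf/regex-urlfilter.txt. A rough sketch (example.com and
example.org are just placeholders for whatever domains you actually
whitelist):

  # conf/regex-urlfilter.txt -- sketch only, placeholder domains
  # drop URLs with obvious junk patterns, whatever the domain
  -(?i)(viagra|casino|porn)
  # accept pages from the whitelisted domains
  +^http://([a-z0-9-]+\.)*example\.com/
  +^http://([a-z0-9-]+\.)*example\.org/
  # reject everything that did not match above
  -.

Rules are tried top to bottom and the first match wins, so the trailing
"-." line drops everything that is not explicitly whitelisted.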
There are quite a few companies (that I know of) that maintain indexes
of between 50 and 300 million pages. All of them have implemented their
own strategy (specific to their needs) to solve this issue.
It's true that if you start crawling without any constraints, very
quickly (~20-30 full cycles) your crawldb will contain 90% junk, porn,
and spam. Some strategies to fight this are based on content analysis
(detection of porn-related content), URL analysis (presence of certain
patterns in URLs), and link analysis (analysis of the link
neighborhood). There are a lot of research papers on these subjects, and
many of these strategies can be implemented as Nutch plugins.
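For example, the URL analysis part fits naturally into a URLFilter
plugin. A rough sketch, assuming the org.apache.nutch.net.URLFilter
interface from Nutch 0.9/1.x (the class name, package and patterns below
are made up for illustration, and details may differ between versions):

  // Drops URLs matching crude junk/spam patterns; everything else passes.
  package org.example.nutch;                      // hypothetical package

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class JunkURLFilter implements URLFilter {

    // toy examples of "certain patterns in urls" -- tune for your own crawl
    private static final Pattern JUNK = Pattern.compile(
        "(?i)(viagra|casino|porn|\\.xxx/|[?&](sid|sessionid)=)");

    private Configuration conf;

    /** Return the URL unchanged to keep it, or null to drop it. */
    public String filter(String urlString) {
      if (urlString == null || JUNK.matcher(urlString).find()) {
        return null;        // rejected
      }
      return urlString;     // accepted
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

Once packaged with a plugin.xml descriptor and listed in plugin.includes
in nutch-site.xml, such a filter is applied during injection and outlink
processing, so any URL for which filter() returns null never reaches the
fetch list.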
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com