I posted this before on the nutch-user list, but since then I have done
some additional testing and I feel this has more to do with the
developers.
I have about 450 seed sites (in the quality and environmental areas). I
used the crawl method (Nutch 0.7.1.x) to depth 4, then the whole-web
method to depth 6, and some sites (in this case not all) to depth 7.  I
limited outlinks to 50, used the default crawl-urlfilter
(+^http://([a-z0-9]*\.)*NAMEOFSITE/, one entry per site), and got about
523,000 pages.  Doing some searches I noticed that I got only a few
results for some terms. For instance "nureg", a document type used by the
Nuclear Regulatory Commission (NRC), yielded only a little more than 20
documents (there are more than 3,000 of them).  Then I tried
"site:www.nrc.gov http" and found only 82 pages.  This site has more
than 10,000 pages!  I tried "site:www.epa.gov http" and got only 2,413
pages (this site also has more than 10,000 pages). The results were
similar for other very large (and non-dynamic) sites.
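For reference, the per-site entries in conf/crawl-urlfilter.txt followed
the pattern above, one line per seed site plus the default catch-all
deny; with the two site names below standing in for the full list, it
looked roughly like this:

  +^http://([a-z0-9]*\.)*nrc.gov/
  +^http://([a-z0-9]*\.)*epa.gov/
  # ... one +^ line for each of the ~450 seed sites ...
  -.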
Experimenting further, I crawled only a few sites, one at a time, using
the crawl method to depth 7.  For instance, I crawled http://www.nrc.gov/
with the filter +^http://([a-z0-9]*\.)*nrc.gov/ (and -.), increasing
"http.max.delays" to 10 and "http.timeout" to 20000, and the results
were very poor: searching for "http" returned only 58 results.
Searching for "nureg" I found only 13 results, but for "adobe" (which
should have been blocked by the filter, though perhaps not by the
outlinks rule, I do not know) I got 4.  Performing the same test on
other sites, such as www.epa.gov, www.iaea.org and www.iso.org, the
results were very similar: a very, very small percentage of each site's
pages was indexed.
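The overrides for these single-site tests went into conf/nutch-site.xml
and were roughly as follows:

  <!-- Overrides for the defaults in nutch-default.xml -->
  <property>
    <name>http.max.delays</name>
    <!-- times a fetcher thread will wait on a busy host before giving up on the page -->
    <value>10</value>
  </property>
  <property>
    <name>http.timeout</name>
    <!-- HTTP timeout in milliseconds (20000 = 20 seconds) -->
    <value>20000</value>
  </property>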
So I am posting these results because they may, in themselves,
constitute an issue that perhaps should be dealt with.  Maybe this is
not a problem if you are trying to index the whole web, I don't know,
but for a niche crawl like mine it seems to be.

I think you're probably running into the limited # of domains problem that many vertical crawlers encounter.

The default Nutch settings are for a maximum of one fetcher thread per domain. This is the safe setting for polite crawling, unless you enjoy getting blacklisted :)

So if you have only a few domains (e.g. just one in your test case of nrc.gov), you're going to get a lot of retry timeout errors as threads "block" because another thread is already fetching a page from the same domain.
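If I remember the 0.7.x fetcher code correctly, a thread that finds a host busy waits fetcher.server.delay seconds and tries again, and after http.max.delays such attempts it gives up on that URL for the current segment. That's where most of your "missing" pages are going, and it's also why bumping http.max.delays up to 10 only buys you a little headroom.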

Which means that your effective throughput per domain is going to be limited to the rate at which individual pages can be downloaded, including the delay that your Nutch configuration specifies between each request.

If you assume a page takes 1 second to download (counting connection setup time), plus there's a 5 second delay between requests, you're getting 10 pages/minute from any given domain. If you have 10M domains, no problem, but if you only have a limited number of domains, you run into inefficiencies in how Nutch handles fetcher threads that will severely constrain your crawl performance.
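To put numbers on your nrc.gov test: at roughly 6 seconds per page, fetching all 10,000+ pages from that one host would take on the order of 1,000 minutes, i.e. about 17 hours of continuous fetching, and that's before counting any pages that get dropped because a thread gave up after too many busy retries. So it's not surprising that only a small fraction of a large single site makes it into the index.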

We're in the middle of a project to improve throughput in this kind of environment, but haven't yet finished.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

