I posted this before on the "nutch-user" list, but since then I have done some additional testing and I feel this has more to do with developers.

I have about 450 seed sites (in the quality and environmental areas). I used the crawl method (Nutch 0.7.1.x) to depth 4, then used the whole-web method to depth 6, and some sites (in this case not all) to depth 7. I restricted outlinks to 50, used the default crawl-urlfilter (+^http://([a-z0-9]*\.)*NAMEOFSITE/, one line per site) and got about 523,000 pages.

Doing some searches I noticed that I only got a few results for some terms. For instance "nureg", a document type used by the Nuclear Regulatory Commission (NRC), yielded only a little more than 20 documents (there are more than 3,000 of them). Then I tried "site:www.nrc.gov http" and found only 82 pages; this site has more than 10,000 pages! I tried "site:www.epa.gov http" and got only 2,413 pages (again, the site has more than 10,000 pages). The results were similar for other very large (and not dynamic) sites.

Experimenting further, I crawled with the crawl method, depth 7, only some sites, one at a time. For instance, http://www.nrc.gov/ with the filter +^http://([a-z0-9]*\.)*nrc.gov/ (and -.), increasing "http.max.delays" to 10 and "http.timeout" to 20000. The results were very poor: searching for "http" returned only 58 results. Searching for "nureg" I found only 13 results, but for "adobe" (which should be blocked by the filter, though perhaps not by the outlinks rule, I do not know) I got 4. Performing the same test on other sites, like www.epa.gov, www.iaea.org and www.iso.org, the results were very similar: a very, very small percentage of the sites' pages indexed.

So I am posting these results, which may constitute an issue in themselves that should perhaps be dealt with. Maybe this is not a problem if you try to index the whole web, I don't know, but for niche sites like mine it seems to be.
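For reference, the setup described above boils down to a handful of lines. This is only a sketch, assuming the stock file layout of a 0.7.x install (conf/crawl-urlfilter.txt for the filters, conf/nutch-site.xml for the property overrides); nrc.gov is just the example site from above, and the default values may differ in your release:

  # conf/crawl-urlfilter.txt -- one accept line per seed site, then reject everything else
  +^http://([a-z0-9]*\.)*nrc.gov/
  -.

  <!-- conf/nutch-site.xml -- overrides used for the single-site test -->
  <property>
    <name>http.max.delays</name>
    <value>10</value>   <!-- raised from the default, as described above -->
  </property>
  <property>
    <name>http.timeout</name>
    <value>20000</value>   <!-- milliseconds, i.e. 20 seconds -->
  </property>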
I think you're probably running into the limited # of domains problem that many vertical crawlers encounter.
The default Nutch settings are for a maximum of one fetcher thread per domain. This is the safe setting for polite crawling, unless you enjoy getting blacklisted :)
So if you have only a few domains (e.g. just one for your test case of just nrc.gov), you're going to get a lot of retry timeout errors as threads "block" because another thread is already fetching a page from the same domain.
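(For concreteness, this is the setting in question. In later Nutch releases it appears in nutch-default.xml as fetcher.threads.per.host; the name may differ in 0.7.x, so treat this as a sketch and check your version's defaults file:)

  <!-- nutch-default.xml: at most one fetcher thread may talk to a given host at a time.
       A thread that finds the host busy waits, and after http.max.delays waits it
       gives up on that page for the current fetch round. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>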
Which means that your effective throughput per domain is going to be limited to the rate at which individual pages can be downloaded, including the delay that your Nutch configuration specifies between each request.
If you assume a page takes 1 second to download (counting connection setup time), plus there's a 5 second delay between requests, you're getting 10 pages/minute from any given domain. If you have 10M domains, no problem, but if you only have a limited number of domains, you run into inefficiencies in how Nutch handles fetcher threads that will severely constrain your crawl performance.
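(The 5-second figure is the stock politeness delay. Assuming the usual property, fetcher.server.delay in nutch-default.xml, the back-of-the-envelope ceiling per host works out like this:)

  <!-- nutch-default.xml: seconds to wait between successive requests to the same host -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

  1 s download + 5 s delay  =  6 s per page
  60 s / 6 s                =  10 pages per minute per host
  10 * 60 * 24              =  ~14,400 pages per day per host, best case,
                               before any retries or timeouts are counted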
We're in the middle of a project to improve throughput in this kind of environment, but haven't yet finished.
-- Ken

Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
