I posted this before on the nutch-user list, but since then I have done some additional testing, and I feel that it has more to do with developers.

I have about 450 seed sites (in the quality and environmental areas). I used the crawl method (Nutch 0.7.1.x) to depth 4, then used the whole-web method to depth 6, and for some more sites (in this case not all) to depth 7. I limited the outlinks to 50, used the default crawl-urlfilter (+^http://([a-z0-9]*\.)*NAMEOFSITE/, one for every site) and got about 523,000 pages.

Doing some searches, I noticed that I got only a few results for some terms. For instance, "nureg", a document used by the Nuclear Regulatory Commission (NRC), yielded only a little more than 20 documents (there are more than 3,000 of them). Then I tried "site:www.nrc.gov http" and found only 82 pages. This site has more than 10,000 pages! I tried "site:www.epa.gov http" and got only 2413 pages (this site also has more than 10,000 pages). The results were similar for other very large (and not dynamic) sites.

Experimenting further, I crawled only some sites, one at a time, using the crawl method at depth 7. For instance, I crawled http://www.nrc.gov/ with the filter +^http://([a-z0-9]*\.)*nrc.gov/ (and -.), increasing "http.max.delays" to 10 and "http.timeout" to 20000. The results were very poor: searching for "http" returned only 58 results. Searching for "nureg" I found only 13 results, but for "adobe" (which should be blocked by the filter, though perhaps not by the "outlinks rule"; I do not know) I got 4.

Performing the same test on other sites, such as www.epa.gov, www.iaea.org and www.iso.org, gave very similar results: a very, very small percentage of each site's pages was indexed.

So I am posting these results because they may constitute, in themselves, an issue that should perhaps be dealt with. Maybe this is not a problem if you try to index the whole web, I don't know, but for niche sites like mine it seems to be.

Thanks
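P.S. To make the filter behaviour easier to discuss, here is a minimal Python sketch of how I understand the two rules above to be applied. It is not Nutch code. It assumes that rules are tried in file order, that the first pattern found in the URL decides, and that a URL matching no rule is dropped. The function name `accepted` and the sample path under www.nrc.gov are only illustrative.

```python
import re

# Rules in file order; "+" accepts, "-" rejects. Patterns copied from the post.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*nrc.gov/")),
    ("-", re.compile(r".")),
]


def accepted(url: str) -> bool:
    """Apply the first rule whose pattern is found in the URL (assumed semantics)."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


if __name__ == "__main__":
    samples = [
        "http://www.nrc.gov/",
        "http://www.nrc.gov/reading-rm/",   # illustrative path, not verified
        "http://www.adobe.com/",
    ]
    for url in samples:
        print(url, "->", "accepted" if accepted(url) else "rejected")
```

If this reading of the rules is right, pages on www.nrc.gov pass the filter and pages on other hosts do not, so the filter alone would not explain the small number of indexed pages.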
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
