I posted this before on the "nutch user" list, but since then I have
done some additional testing and I feel that this is more a matter for
the developers.
I have about 450 seed sites (in the quality and environmental areas). I
used the crawl method (Nutch 0.7.1.x) to depth 4, then the whole-web
method to depth 6, and for some of the sites (not all) to depth 7. I
limited the outlinks to 50, used the default crawl-urlfilter
(+^http://([a-z0-9]*\.)*NAMEOFSITE/, one line for every site) and
got about 523,000 pages.  Running some searches, I noticed that I got
only a few results for some terms. For instance "nureg", a type of
document used by the Nuclear Regulatory Commission (NRC), yielded only a
little more than 20 documents (there are more than 3,000 of them).  Then
I tried "site:www.nrc.gov http" and found only 82 pages.  This site has
more than 10,000 pages!  I tried "site:www.epa.gov http" and got only
2413 pages (this site also has more than 10,000 pages). The results were
similar for other very large (and not dynamic) sites.
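
To make the "one line for every site" setup concrete, here is a minimal Python sketch (not part of Nutch) that builds such filter lines from seed URLs using the template quoted above. The function names are hypothetical, and stripping a leading "www." is an assumption based on the www.nrc.gov / nrc.gov example later in this post. The closing "-." line is the catch-all reject rule mentioned there as well.

```python
from urllib.parse import urlparse

# Template from the post, with NAMEOFSITE replaced per seed.
TEMPLATE = "+^http://([a-z0-9]*\\.)*{site}/"


def filter_line(seed_url):
    """Build one accept rule for the site a seed URL belongs to."""
    host = urlparse(seed_url).hostname
    # Assumption: a leading "www." is dropped, so www.nrc.gov becomes nrc.gov.
    site = host[4:] if host.startswith("www.") else host
    return TEMPLATE.format(site=site)


def filter_file(seed_urls):
    """One '+' line per distinct site, then a final '-.' rejecting the rest."""
    lines = sorted({filter_line(u) for u in seed_urls})
    return "\n".join(lines + ["-."])


# Seeds are sites named in this post; a real run would use the full seed list.
seeds = ["http://www.nrc.gov/", "http://www.epa.gov/", "http://www.iaea.org/"]
print(filter_file(seeds))
```

Note that the dots in the site name are left unescaped, as in the template, so each one matches any character. That makes the rule slightly looser than intended but should not cause pages to be dropped.
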
Experimenting further, I crawled only a few sites, one at a time, using
the crawl method at depth 7.  For instance, http://www.nrc.gov/ with the
filter +^http://([a-z0-9]*\.)*nrc.gov/ (and -.), with
"http.max.delays" increased to 10 and "http.timeout" to 20000. The
results were very poor: searching for "http" returned only 58 results.
Searching for "nureg" I found only 13 results, but for "adobe" (which
should be blocked by the filter, though perhaps not by the "outlinks
rule", I do not know) I got 4.  Running the same test on other sites,
such as www.epa.gov, www.iaea.org and www.iso.org, gave very similar
results: a very, very small percentage of each site's pages was indexed.
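
As a sanity check on the filter itself, the following Python sketch (again not Nutch code) applies the two rules quoted above to a few URLs. It assumes the rules are tried in file order, that the first pattern found in the URL decides, and that a URL matching no rule is rejected. Python's re.search stands in for the Java regex matching Nutch would use. The URLs are illustrative, not taken from an actual crawl.

```python
import re

# Filter rules in file order, as (sign, pattern) pairs taken from the post.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*nrc.gov/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Apply the rules in order; the first pattern found in the URL decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


# Example URLs are illustrative, not taken from an actual crawl.
for url in (
    "http://www.nrc.gov/",
    "http://nrc.gov/index.html",
    "https://www.nrc.gov/",
    "http://www.adobe.com/",
    "http://some-host.nrc.gov/",
):
    print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions the first two URLs are accepted and the last three are rejected. The adobe.com URL is blocked, so the "adobe" hits may simply be nrc.gov pages that mention the word. The pattern also rejects https URLs and any host name containing a hyphen, because [a-z0-9]* does not match "-". Whether that accounts for any of the missing pages is not something this sketch can show.
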
So I am posting these results because they may, in themselves,
constitute an issue that should perhaps be dealt with.  Maybe this is
not a problem if you are trying to index the whole web, I don't know,
but for niche sites like mine it seems to be.
Thanks



