While indexing about 600 sites with Nutch 0.9, I noticed that at least one of them was showing fewer results than expected: www.nrc.gov. As a test, I tried to index only the NRC site, allowing only internal links in the "site.xml" conf file, using "crawl-urlfilter.txt" with "+^http://([a-z0-9]*\.)*www.nrc.gov/" and also "regex-urlfilter.txt" with "+^http\:\/\/www\.nrc\.gov\/" (to avoid indexing the Google site, which was still being fetched when only the crawl-urlfilter was in place).
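For reference, these are the exact filter lines I added (everything else in those files is the stock Nutch 0.9 default), plus the property I believe I used for the internal-links restriction (the property name is from memory, so please correct me if it is not the right one for 0.9):

  # conf/crawl-urlfilter.txt -- accept only the NRC host
  +^http://([a-z0-9]*\.)*www.nrc.gov/

  # conf/regex-urlfilter.txt -- accept only the NRC host
  +^http\:\/\/www\.nrc\.gov\/

  <!-- conf/nutch-site.xml: what I mean by "allowing only internal links";
       property name quoted from memory and may be wrong -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>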
I used the crawl method with a depth of 10, but when Nutch reached level 5 it reported that there were no more URLs to fetch. The total number of URLs in the crawldb was only 124. When I checked nrc.gov/robots.txt I found:

User-agent: *
Disallow: /acrs/
...
Disallow: /what-we-do/

So it seems that robots.txt could be blocking the fetch of pages in a lot of directories. But when I checked for a particular class of document on the NRC site, using the query "nureg site:www.nrc.gov", I found about 11,000 results in Google and about 7,000 in Gigablast. So I would like to get some help with this issue. Thanks
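P.S. This is roughly how I ran the crawl and then checked the crawldb afterwards (directory names here are just placeholders for what I actually used):

  # one-step crawl: seed list in urls/, depth 10
  bin/nutch crawl urls -dir crawl-nrc -depth 10

  # dump crawldb statistics (this is where I saw the 124 URLs)
  bin/nutch readdb crawl-nrc/crawldb -stats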
