One possibility is that nrc.gov is using the sitemap protocol, which
lets Google et al. discover pages that traditional link-following
crawling would miss:

http://www.nrc.gov/sitemapindex.xml
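
In case it helps anyone poking at this: a sitemap index is just XML
pointing at child sitemaps, each of which lists page URLs.  A rough
Python sketch of reading it (not Nutch code, just an illustration):

import urllib.request
import xml.etree.ElementTree as ET

# Standard namespace from sitemaps.org for sitemap/sitemap-index files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Fetch the index that nrc.gov publishes.
with urllib.request.urlopen("http://www.nrc.gov/sitemapindex.xml") as resp:
    root = ET.fromstring(resp.read())

# Each <sitemap><loc> entry points at a child sitemap that lists actual
# page URLs; a crawler that understands the protocol fetches those too.
for loc in root.findall("sm:sitemap/sm:loc", NS):
    print(loc.text.strip())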

I don't think Nutch supports the sitemap protocol.  It could be that
Aspseek supports sitemaps, that the link structure of nrc.gov has
changed, or that they have added more exclusions to their robots.txt
file.
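
If you want to rule out the robots.txt theory quickly, Python's stdlib
robot parser will tell you what a generic crawler may fetch (the
/reading-rm/ path below is just an illustrative example, pick any URL
you care about):

from urllib import robotparser

# Fetch and parse the live robots.txt for nrc.gov.
rp = robotparser.RobotFileParser("http://www.nrc.gov/robots.txt")
rp.read()

# "*" means any user agent; swap in Nutch's agent name to test its rules.
for url in ("http://www.nrc.gov/", "http://www.nrc.gov/reading-rm/"):
    print(url, "allowed" if rp.can_fetch("*", url) else "disallowed")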

Frank


On Sat, Feb 14, 2009 at 11:31 AM, consultas <consul...@qualidade.eng.br> wrote:
> I have indexed about 600 sites on some specific subjects, including the
> nuclear area, which have resulted in about 500,000 indexed pages.  One
> important "seed site" is www.nrc.gov, but no matter what (and this has
> been true since version 0.3 of Nutch) I am not able to index more than
> about 100 pages for this site.  If you go to Google or Yahoo, they show
> more than 20,000 results.  In past years I used another program, Aspseek,
> and with it I was able to index as many pages as I wanted.  I have looked
> at the source code of some of the nrc pages and could not find any
> mention of any robots rules.
> Any ideas about this behaviour?
>
> Thanks
