I have indexed about 600 sites on a few specific subjects, including the nuclear 
area, which has resulted in about 500,000 indexed pages.  One important "seed 
site" is www.nrc.gov, but no matter what I try (and this has been the case since 
version 0.3 of Nutch), I am not able to index more than about 100 pages for this 
site.  If you search Google or Yahoo, they show more than 20,000 results.  In past 
years I used another program, Aspseek, and with it I was able to index as many 
pages as I wanted.  I have looked at the source code of some of the NRC pages and 
could not find any mention of any robots rules.
Any ideas about this behaviour?

Thanks
