I have indexed about 600 sites on some specific subjects, including the nuclear area, which has resulted in about 500,000 indexed pages. One important "seed site" is www.nrc.gov, but no matter what I try (and this has been the case since version 0.3 of Nutch), I am not able to index more than about 100 pages for this site. Google and Yahoo show more than 20,000 results for it. In past years I used another program, Aspseek, and with it I was able to index as many pages as I wanted. I have looked at the source code of some of the NRC pages and could not find any mention of a robots rule. Any ideas about this behaviour?
Thanks