Thanks. I made the changes you suggested, but the problem persisted. After about 5 rounds of 1000 URLs, one site would "take over." I made the attached small change to get around this problem. It allows you to specify the maximum number of URLs you want from any single host. I now use -topN 1000 -maxSite 500 and things are going as I had hoped.
I like this idea and think it will make a useful addition to Nutch. However, the filtering should be done in the loop at line 478, not at line 400, right? That way you'd get the highest-scoring N pages from each site. If you agree, can you please modify the patch to work that way?
Thanks,
Doug
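For readers following along, the per-host cap being discussed could look roughly like the sketch below: walk the candidates in descending score order, count URLs per host, and skip any host that has already hit its cap until topN slots are filled. This is only an illustration of the idea, not the actual patch; the class, method names, and the crude host parsing are all hypothetical.

```java
import java.util.*;

public class MaxSitePerHost {
    // Hypothetical sketch: pick up to topN URLs, at most maxSite per host,
    // favoring the highest-scoring pages from each site.
    static List<String> select(List<Map.Entry<String, Float>> scored,
                               int topN, int maxSite) {
        // Sort by score, highest first, so each host's best pages win.
        scored.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        Map<String, Integer> perHost = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Float> e : scored) {
            if (out.size() >= topN) break;
            String host = hostOf(e.getKey());
            int n = perHost.getOrDefault(host, 0);
            if (n >= maxSite) continue;   // cap reached for this host
            perHost.put(host, n + 1);
            out.add(e.getKey());
        }
        return out;
    }

    // Crude host extraction, sufficient for the sketch only.
    static String hostOf(String url) {
        int i = url.indexOf("//");
        String rest = i >= 0 ? url.substring(i + 2) : url;
        int j = rest.indexOf('/');
        return j >= 0 ? rest.substring(0, j) : rest;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Float>> scored = new ArrayList<>(List.of(
            Map.entry("http://a.com/1", 5.0f),
            Map.entry("http://a.com/2", 4.0f),
            Map.entry("http://a.com/3", 3.0f),
            Map.entry("http://b.com/1", 2.0f)));
        // With topN=3, maxSite=2, a.com's third page is skipped
        // and b.com's page takes the last slot.
        System.out.println(select(scored, 3, 2));
    }
}
```

Capping inside the selection loop (rather than pre-filtering the input) is what guarantees each host contributes its highest-scoring pages.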
