Thanks. I made the changes you suggested, but the problem persisted: after about 5 rounds of 1000 URLs, one site would "take over." I made the attached small change to get around this. It allows you to specify the maximum number of URLs you want from any single host. I now use -topN 1000 -maxSite 500 and things are going as I had hoped.
I like this idea and think it will make a useful addition to Nutch. However, the filtering should be done in the loop at line 478, not at line 400, right? That way you'd get the highest-scoring pages from each site. If you agree, can you please modify the patch to work that way?
Thanks,
Doug
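For readers following along: the per-host cap being discussed can be sketched as a selection loop that walks candidates in descending score order and skips a URL once its host has reached the quota. This is only an illustrative sketch, not the actual patch; the class, record, and method names below are invented for the example, and only `-topN` and `-maxSite` come from the thread.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxSiteSelector {
    // Hypothetical candidate: a URL plus its fetch-list score.
    record Candidate(String url, float score) {}

    /**
     * Select up to topN candidates, allowing at most maxSite per host.
     * Because candidates are visited highest-score-first, each host
     * contributes its best-scoring pages, as suggested above.
     */
    static List<Candidate> select(List<Candidate> candidates, int topN, int maxSite) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Float.compare(b.score(), a.score())); // highest first

        Map<String, Integer> perHost = new HashMap<>();
        List<Candidate> out = new ArrayList<>();
        for (Candidate c : sorted) {
            if (out.size() >= topN) break;             // fetch list is full
            String host = URI.create(c.url()).getHost();
            int n = perHost.getOrDefault(host, 0);
            if (n >= maxSite) continue;                // host hit its quota; skip
            perHost.put(host, n + 1);
            out.add(c);
        }
        return out;
    }
}
```

With `-topN 1000 -maxSite 500`, no single host can fill more than half the fetch list, which matches the behavior described at the top of the thread.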
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
