Actually, there is already a property in conf for this: generate.max.per.host. So if you add a message in Generator.java at the appropriate place... you have what you wish. A sketch of such a check follows below.
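[A minimal sketch of the idea only: the class and names (HostLimitLogger, admit, counts) are illustrative stand-ins for the per-host counting Generator.java does when enforcing generate.max.per.host, not the actual Nutch source.]

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-host cap logic in Generator.java.
// None of these names come from the Nutch source; they only show where
// a log line for generate.max.per.host overflows could live.
public class HostLimitLogger {
  private final int maxPerHost;                  // value of generate.max.per.host
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  public HostLimitLogger(int maxPerHost) {
    this.maxPerHost = maxPerHost;
  }

  /** Returns true if the URL's host is still under the cap. */
  public boolean admit(String host) {
    int n = counts.containsKey(host) ? counts.get(host) : 0;
    counts.put(host, n + 1);
    if (maxPerHost > 0 && n + 1 > maxPerHost) {
      if (n + 1 == maxPerHost + 1) {             // report each host only once
        System.err.println("generate.max.per.host exceeded for " + host);
      }
      return false;                              // Generator would skip this entry
    }
    return true;
  }
}

Logging only on the first overflow per host keeps the output readable for hosts with thousands of skipped URLs.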
Gal

-----Original Message-----
From: Rod Taylor [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 08, 2006 7:28 PM
To: Nutch Developer List
Subject: Proposal for Avoiding Content Generation Sites

We've indexed several content generation sites that we want to eliminate. One had hundreds of thousands of sub-domains spread across several domains (up to 50M pages in total). Quite annoying.

First is to allow for cleaning up. This consists of a new option to "updatedb" which can scrub the database of all URLs that no longer match the URLFilter settings (regex-urlfilter.txt). This allows a change in the urlfilter to be reflected against Nutch's current dataset, something I think others have asked for in the past (see the scrub sketch after this message).

Second is to treat a sub-domain as being in the same bucket as its domain for the generator. This means that *.domain.com or *.domain.co.uk would create 2 buckets instead of one per hostname. There is a high likelihood that sub-domains will be on the same servers as the primary domain and should be rate-limited as such. generate.max.per.host would work more as generate.max.per.domain instead (see the bucketing sketch below).

Third is ongoing detection. I would like to add a feature to Nutch which could report anomalies during updatedb or generate. For example, if any given domain.com bucket during generate is found to have more than 5000 URLs to be downloaded, it should be flagged for manual review: write a record to a text file which can be picked up by a person to confirm that we haven't gotten into a garbage content generation site. If we are in a content generation site, the person would add that domain to the urlfilter and the next updatedb would clean out all URLs from that location (see the report sketch below).

Are there any thoughts or objections to this? One and two are pretty straightforward. Detection is not so easy.

--
Rod Taylor <[EMAIL PROTECTED]>
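[For the first proposal, a rough sketch of the scrub check itself. It assumes the URLFilters plugin front-end, whose filter(String) method returns null when any configured filter (e.g. regex-urlfilter.txt) rejects a URL. The ScrubFilter class and its wiring into the updatedb job are hypothetical, and on the older NutchConf-based trunk the constructor would differ.]

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;

// Sketch of the "scrub on updatedb" idea: pass every CrawlDb URL through
// the configured URLFilters and drop entries the filters now reject.
// Wiring this into the actual update job is left out; ScrubFilter is a
// made-up name.
public class ScrubFilter {
  private final URLFilters filters;

  public ScrubFilter(Configuration conf) {
    this.filters = new URLFilters(conf);
  }

  /** Returns true if the URL still passes regex-urlfilter.txt and friends. */
  public boolean keep(String url) {
    try {
      return filters.filter(url) != null;   // null means some filter rejected it
    } catch (Exception e) {
      return false;                         // treat filter errors as rejects
    }
  }
}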
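[For the second proposal, a sketch of collapsing hostnames into per-domain buckets. DomainBucket is a made-up name, and the two-level suffix set is a tiny illustrative sample; handling *.domain.co.uk correctly in general needs a full public-suffix list.]

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Collapse a hostname to its registered domain so that *.domain.com and
// *.domain.co.uk each form one generator bucket instead of one per host.
public class DomainBucket {
  private static final Set<String> TWO_LEVEL_SUFFIXES =
      new HashSet<String>(Arrays.asList("co.uk", "org.uk", "com.au"));

  public static String bucketFor(String host) {
    String[] parts = host.toLowerCase().split("\\.");
    int n = parts.length;
    if (n <= 2) return host;                       // already a bare domain
    String lastTwo = parts[n - 2] + "." + parts[n - 1];
    int keep = TWO_LEVEL_SUFFIXES.contains(lastTwo) ? 3 : 2;
    StringBuilder sb = new StringBuilder();
    for (int i = n - keep; i < n; i++) {
      if (sb.length() > 0) sb.append('.');
      sb.append(parts[i]);
    }
    return sb.toString();
  }
}

With this, bucketFor("foo.bar.domain.com") yields "domain.com" and bucketFor("www.domain.co.uk") yields "domain.co.uk", so the example above collapses to exactly 2 buckets.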
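[And for the third, a sketch of the manual-review report, using the 5000-URL figure from the example above. AnomalyReport and the tab-separated record format are assumptions; only the threshold-and-text-file idea comes from the proposal.]

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

// After counting generated URLs per domain bucket, append every bucket
// over the threshold to a text file for a person to review.
public class AnomalyReport {
  public static void write(Map<String, Integer> perDomainCounts,
                           int threshold, String reportFile) throws IOException {
    PrintWriter out = new PrintWriter(new FileWriter(reportFile, true));
    try {
      for (Map.Entry<String, Integer> e : perDomainCounts.entrySet()) {
        if (e.getValue() > threshold) {
          out.println(e.getKey() + "\t" + e.getValue());  // domain<TAB>count
        }
      }
    } finally {
      out.close();
    }
  }
}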
