Rod Taylor wrote:
First is to allow for cleaning up. This consists of a new option to "updatedb" which can scrub the database of all URLs which no longer match URLFilter settings (regex-urlfilter.txt). This allows a change in the urlfilter to be reflected against Nutch's current dataset, something I think others have asked for in the past.
Yes, this would be a welcome addition. Note that Andrzej recently committed a change that causes Generate to filter urls, which achieves the same effect, but w/o removing them from the database, so they're still consuming space & time.
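For illustration, a scrub pass along these lines might look roughly like the sketch below. It is standalone Java and does not use Nutch's own URLFilter classes; UrlScrubSketch and its methods are hypothetical names. It assumes rules in the "+regex" / "-regex" style of regex-urlfilter.txt, where the first matching rule decides and a URL that matches no rule is dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code: filter a list of URLs against
// '+regex' / '-regex' rules and keep only the accepted ones.
final class UrlScrubSketch {

    // One rule line: '+' means accept on match, '-' means reject on match.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Parse rule lines, skipping blank lines and '#' comments.
    static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue;
            }
            if (t.charAt(0) != '+' && t.charAt(0) != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + t);
            }
            rules.add(new Rule(t));
        }
        return rules;
    }

    // Assumption: the first rule whose pattern is found in the URL decides;
    // a URL that matches no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    // Return only the URLs that still pass the current rules.
    static List<String> scrub(List<Rule> rules, List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (accepts(rules, url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Rule> rules = parseRules(List.of(
                "# skip images",
                "-\\.(gif|jpg)$",
                "+^http://example\\.com/"));
        List<String> urls = List.of(
                "http://example.com/a.html",
                "http://example.com/b.gif",
                "http://other.org/");
        System.out.println(scrub(rules, urls));
    }
}
```

In a real updatedb option the same decision would be applied to each database entry, with rejected entries left out of the rewritten database rather than merely skipped at generate time.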
Second is to treat a subdomain as being in the same bucket as the domain for the generator. This means that *.domain.com or *.domain.co.uk would create 2 buckets instead of one per hostname. There is a high likelihood that subdomains will be on the same servers as the primary domain and should be rate-limited as such. generate.max.per.host would work more as generate.max.per.domain instead.
This could be implemented by adding a new plugin extension point for hostname normalization. The default implementation would be a no-op.
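As a rough sketch of what such a normalizer could look like: the HostNormalizer interface and both implementations below are hypothetical names, not an existing Nutch extension point. The domain version assumes a tiny hard-coded set of two-label suffixes such as co.uk; a real implementation would need a complete public suffix list.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a hostname normalization hook for the generator.
final class HostBucketSketch {

    // Hypothetical extension point: maps a hostname to its bucket key.
    interface HostNormalizer {
        String normalize(String host);
    }

    // Default behaviour: a no-op, so each hostname is its own bucket.
    static final HostNormalizer NO_OP = host -> host;

    // Assumption: illustrative subset only, not a full public suffix list.
    static final Set<String> TWO_LABEL_SUFFIXES = Set.of("co.uk", "org.uk", "com.au");

    // Alternative behaviour: collapse subdomains into the parent domain.
    static final HostNormalizer BY_DOMAIN = HostBucketSketch::bucketFor;

    static String bucketFor(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        // Leave dotted numeric addresses alone.
        if (h.matches("[0-9.]+")) {
            return h;
        }
        String[] labels = h.split("\\.");
        int keep = 2; // e.g. domain.com
        if (labels.length >= 2) {
            String lastTwo = labels[labels.length - 2] + "." + labels[labels.length - 1];
            if (TWO_LABEL_SUFFIXES.contains(lastTwo)) {
                keep = 3; // e.g. domain.co.uk
            }
        }
        if (labels.length <= keep) {
            return h;
        }
        return String.join(".", Arrays.copyOfRange(labels, labels.length - keep, labels.length));
    }

    public static void main(String[] args) {
        for (String host : new String[] {"www.domain.com", "a.b.domain.co.uk", "localhost"}) {
            System.out.println(host + " -> " + BY_DOMAIN.normalize(host));
        }
    }
}
```

With the no-op default, generate.max.per.host keeps its current per-hostname meaning; plugging in the domain version makes the same limit apply per domain.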
Third is ongoing detection. I would like to add a feature to Nutch which could report anomalies during updatedb or generate. For example, if any given domain.com bucket during generate is found to have more than 5000 URLs to be downloaded, it should be flagged for a manual review. Write a record to a text file which can be read and picked up by a person to confirm that we haven't gotten into a garbage content generation site.
A simple way to implement this would be to have the generator log each host that exceeds the limit. Then you can simply grep the logs for these messages. Good enough?
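Something like the following sketch is what I have in mind. It is standalone Java using java.util.logging rather than the generator's actual code; HostLimitLogSketch, hostsOverLimit and the message text are made up for illustration.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.logging.Logger;

// Hypothetical sketch: count URLs per host and log each host over a limit.
final class HostLimitLogSketch {

    private static final Logger LOG = Logger.getLogger("HostLimitLogSketch");

    // Returns the hosts whose URL count exceeds the limit, with their counts,
    // and writes one warning line per such host.
    static Map<String, Integer> hostsOverLimit(List<String> urls, int limit) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip URLs that do not parse
            }
            if (host == null) {
                continue; // skip URLs without a host part
            }
            counts.merge(host, 1, Integer::sum);
        }

        Map<String, Integer> over = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > limit) {
                // Made-up message format; any fixed phrase that grep can find will do.
                LOG.warning("host " + e.getKey() + " exceeds limit: "
                        + e.getValue() + " > " + limit);
                over.put(e.getKey(), e.getValue());
            }
        }
        return over;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://a.example.com/1",
                "http://a.example.com/2",
                "http://a.example.com/3",
                "http://b.example.com/1");
        System.out.println(hostsOverLimit(urls, 2));
    }
}
```

A manual review is then a matter of grepping the log for the fixed phrase, here "exceeds limit".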
Doug
