Rod Taylor wrote:
First is to allow for cleaning up.  This consists of a new option to
"updatedb" which can scrub the database of all URLs which no longer
match URLFilter settings (regex-urlfilter.txt). This allows a change in
the urlfilter to be reflected against Nutch's current dataset, something
I think others have asked for in the past.

Yes, this would be a welcome addition. Note that Andrzej recently committed a change that causes Generate to filter URLs, which achieves the same effect but without removing them from the database, so they still consume space and time.
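
To make the scrubbing idea concrete, here is a rough sketch in Python rather than Nutch's own Java. It assumes regex-urlfilter.txt rules are lines starting with "+" (accept) or "-" (reject), that the first matching rule decides, and that a URL matching no rule is dropped. The names load_rules, accepts and scrub are hypothetical, the database is modeled as a plain dict, and Python's regex dialect is not identical to Java's.

```python
import re


def load_rules(lines):
    """Parse filter rules: '+regex' accepts, '-regex' rejects, '#' is a comment."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches nothing is rejected (assumed)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def scrub(db, rules):
    """Return a copy of the URL -> record mapping without the filtered-out URLs."""
    return {url: record for url, record in db.items() if accepts(rules, url)}
```

Unlike filtering at generate time, the rejected entries are actually dropped here, so they stop taking up space in the database.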

Second is to treat a subdomain as being in the same bucket as the domain
for the generator.  This means that *.domain.com or *.domain.co.uk would
create 2 buckets instead of one per hostname. There is a high likelihood
that sub-domains will be on the same servers as the primary domain
and should be rate-limited as such.  generate.max.per.host would work
more as generate.max.per.domain instead.

This could be implemented by adding a new plugin extension point for hostname normalization. The default implementation would be a no-op.
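
As a sketch of what such a normalizer might do (in Python, not as an actual Nutch plugin), the two functions below show the no-op default and a domain-collapsing variant. The function names and the tiny MULTI_LABEL_SUFFIXES set are assumptions for illustration only; a real implementation would need a complete list of multi-label suffixes such as co.uk.

```python
# Illustrative only: a real normalizer needs a full public-suffix list.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def normalize_noop(host):
    """Default behaviour: every hostname is its own bucket."""
    return host


def normalize_to_domain(host):
    """Collapse sub-domains so that *.domain.com shares one bucket."""
    host = host.lower().rstrip(".")
    labels = host.split(".")
    # Leave dotted numeric addresses untouched.
    if all(label.isdigit() for label in labels):
        return host
    # Suffix is two labels for entries like co.uk, otherwise one.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    # Keep the suffix plus one registrable label in front of it.
    return ".".join(labels[-(suffix_len + 1):])
```

With the second function plugged in, generate.max.per.host would effectively count per domain, since www.domain.com and images.domain.com both normalize to domain.com.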

Third is ongoing detection. I would like to add a feature to Nutch which
could report anomalies during updatedb or generate. For example, if any
given domain.com bucket during generate is found to have more than 5000
URLs to be downloaded, it should be flagged for a manual review. Write a
record to a text file which can be read and picked up by a person to
confirm that we haven't gotten into a garbage content generation site.

A simple way to implement this would be to have the generator log each host that exceeds the limit. Then you can simply grep the logs for these messages. Good enough?
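
Roughly, and only as a Python sketch rather than the generator's real code, the logging approach could look like the following. The function name select_urls, the logger name and the message text are all made up for illustration; each host is logged once, the first time it goes over the limit.

```python
import logging
from collections import Counter

log = logging.getLogger("generator")  # hypothetical logger name


def select_urls(url_host_pairs, max_per_host):
    """Keep at most max_per_host URLs per host and log each host that exceeds it."""
    counts = Counter()
    flagged = set()
    selected = []
    for url, host in url_host_pairs:
        if counts[host] >= max_per_host:
            if host not in flagged:
                flagged.add(host)
                # One line per offending host keeps the log easy to grep.
                log.warning("Host %s has more than %d URLs, skipping the rest",
                            host, max_per_host)
            continue
        counts[host] += 1
        selected.append(url)
    return selected, flagged
```

Grepping the log for the warning text then gives the list of hosts to review by hand.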

Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers