Actually, there is a property in conf: generate.max.per.host

So if you add a message in Generator.java at the appropriate place, you
have what you wish.
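
For example, something along these lines in the per-host limit check (a
rough sketch only; hostCount, maxPerHost, and host stand in for whatever
the selector in your version of Generator.java actually uses):

  // hypothetical spot in Generator.java where a host's selected-URL
  // count is checked against generate.max.per.host
  if (hostCount > maxPerHost) {
    if (hostCount == maxPerHost + 1) {
      // log once per host when it first exceeds the cap, so runaway
      // content-generation domains show up in the logs
      LOG.info("Host " + host + " hit generate.max.per.host ("
          + maxPerHost + "); skipping the rest of its URLs");
    }
    continue; // drop this URL from the segment
  }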

Gal


-----Original Message-----
From: Rod Taylor [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 08, 2006 7:28 PM
To: Nutch Developer List
Subject: Proposal for Avoiding Content Generation Sites

We've indexed several content generation sites that we want to
eliminate. One had hundreds of thousands of sub-domains spread across
several domains (up to 50M pages in total). Quite annoying.

The first is to allow for cleanup. This consists of a new option to
"updatedb" which can scrub the database of all URLs that no longer
match the URLFilter settings (regex-urlfilter.txt). This allows a change
in the urlfilter to be reflected against Nutch's current dataset,
something I think others have asked for in the past.
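
As a sketch, the scrub could be a map pass over the crawldb that
re-applies the configured URLFilters and drops rejected entries (the
mapper below follows Nutch's Hadoop map/reduce style; the wiring into
updatedb and the configure() call that builds the filters are omitted):

  // re-filter every crawldb entry against the current URLFilters
  private URLFilters filters; // built as new URLFilters(conf) in configure()

  public void map(Text key, CrawlDatum value,
                  OutputCollector<Text, CrawlDatum> output,
                  Reporter reporter) throws IOException {
    String url;
    try {
      // URLFilters.filter() returns null when any filter rejects the URL
      url = filters.filter(key.toString());
    } catch (URLFilterException e) {
      url = null;
    }
    if (url != null) {
      // keep the entry; rejected URLs are simply not re-emitted
      output.collect(key, value);
    }
  }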

The second is to treat a subdomain as being in the same bucket as its
domain for the generator. This means that *.domain.com or *.domain.co.uk
would create two buckets in total instead of one per hostname. There is
a high likelihood that sub-domains sit on the same servers as the
primary domain and should be rate-limited as such. generate.max.per.host
would effectively become generate.max.per.domain.
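
The bucketing change is essentially a different key function. A naive
sketch (domainKey is a made-up helper, and the hard-coded two-level-TLD
set would need to be a real public-suffix table):

  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  // collapse "a.b.domain.co.uk" and "www.domain.com" down to their
  // registered domains, so all sub-domains share one generator bucket
  private static final Set<String> TWO_LEVEL_TLDS =
      new HashSet<String>(Arrays.asList("co.uk", "com.au", "co.jp"));

  static String domainKey(String host) {
    String[] parts = host.split("\\.");
    int n = parts.length;
    if (n <= 2) return host; // already a bare domain
    String lastTwo = parts[n - 2] + "." + parts[n - 1];
    int keep = TWO_LEVEL_TLDS.contains(lastTwo) ? 3 : 2;
    StringBuilder sb = new StringBuilder();
    for (int i = n - keep; i < n; i++) {
      if (sb.length() > 0) sb.append('.');
      sb.append(parts[i]);
    }
    return sb.toString();
  }

The generator would then key its per-host counters on domainKey(host)
instead of the raw hostname.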


The third is ongoing detection. I would like to add a feature to Nutch
that could report anomalies during updatedb or generate. For example, if
any given domain.com bucket during generate is found to contain more
than 5000 URLs to be downloaded, it should be flagged for manual review:
write a record to a text file that a person can pick up and review to
confirm we haven't wandered into a garbage content-generation site. If
we have, the person would add that domain to the urlfilter and the next
updatedb would clean out all URLs from that location.
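
The flagging step itself is small, assuming generate already tallies
selected URLs per domain bucket (reportAnomalies, the counts map it
consumes, and the tab-separated report format are all illustrative; 5000
is just the threshold from this proposal):

  import java.io.File;
  import java.io.FileWriter;
  import java.io.IOException;
  import java.io.PrintWriter;
  import java.util.Map;

  static final int REVIEW_THRESHOLD = 5000;

  // append one tab-separated line per suspect domain for manual review
  static void reportAnomalies(Map<String, Integer> urlsPerDomain,
                              File report) throws IOException {
    PrintWriter out = new PrintWriter(new FileWriter(report, true));
    try {
      for (Map.Entry<String, Integer> e : urlsPerDomain.entrySet()) {
        if (e.getValue().intValue() > REVIEW_THRESHOLD) {
          // a person reads this file and decides whether the domain
          // belongs in regex-urlfilter.txt
          out.println(e.getKey() + "\t" + e.getValue());
        }
      }
    } finally {
      out.close();
    }
  }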


Are there any thoughts or objections to this? The first two are pretty
straightforward; detection is not so easy.

-- 
Rod Taylor <[EMAIL PROTECTED]>


