Matt:

This is a great addition. We do this external to Nutch right now, but your
code is going to make it a breeze to integrate all the rules we have to keep
spam out!

Thanks

CC-

 

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Matt Kangas
Sent: Sunday, January 16, 2005 4:04 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Implementing geography-by-IP filtering?

Stefan and/or Doug,

Here's a followup to my Jan 3 diff. This time I added two hooks to the
Fetcher, for URLFilter and also for a new interface, ContentFilter.
These allow one to:
- filter out URLs prior to fetching, and
- filter out fetched content prior to writing to a segment

This should provide a lot of flexibility for people who don't want to index
the entire web. The only drawback I see is that the interface is too simple
to be leveraged from the command-line; you'd have to make your own custom
CrawlTool and plug in filters at the appropriate point in the crawl cycle.
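
To make the two hooks concrete, here is a rough sketch of how they might sit in a fetch loop. This is not the code from the diff: SketchFetcher, the ContentFilter signature, and the download stub are hypothetical, and the URLFilter contract (return the URL to accept it, null to reject it) is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical interfaces: names follow the email, signatures are assumed.
interface URLFilter {
    // Returns the URL to accept it, or null to reject it.
    String filter(String url);
}

interface ContentFilter {
    // Returns true to keep the fetched content, false to drop it.
    boolean accept(String url, String content);
}

// Hypothetical stand-in for the Fetcher, showing only where the hooks run.
class SketchFetcher {
    private final List<URLFilter> urlFilters = new ArrayList<>();
    private final List<ContentFilter> contentFilters = new ArrayList<>();

    void addURLFilter(URLFilter f) { urlFilters.add(f); }
    void addContentFilter(ContentFilter f) { contentFilters.add(f); }

    // Returns the URLs whose content would be written to the segment.
    List<String> fetchAll(List<String> urls, Function<String, String> download) {
        List<String> segment = new ArrayList<>();
        for (String url : urls) {
            if (!passesURLFilters(url)) continue;               // hook 1: before fetching
            String content = download.apply(url);
            if (!passesContentFilters(url, content)) continue;  // hook 2: before writing
            segment.add(url);
        }
        return segment;
    }

    private boolean passesURLFilters(String url) {
        for (URLFilter f : urlFilters) {
            if (f.filter(url) == null) return false;
        }
        return true;
    }

    private boolean passesContentFilters(String url, String content) {
        for (ContentFilter f : contentFilters) {
            if (!f.accept(url, content)) return false;
        }
        return true;
    }
}
```

A custom crawl tool would register its filters on the fetcher before starting the fetch step, which is the manual wiring described above.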

Speaking of CrawlTool, I think it'd be great if end users could customize
specific steps of the crawl cycle, in Java, w/o having to cut-and-paste the
whole class. Template method is the pattern I'm thinking of here. Does this
sound useful to anybody else?
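
A sketch of what I mean by template method, assuming a simplified crawl cycle. CrawlTemplate, FilteringCrawl, and the step names are hypothetical and only illustrate the pattern; they are not the existing CrawlTool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: the crawl order is fixed, the steps are overridable.
abstract class CrawlTemplate {
    // Records which steps ran, standing in for the real work.
    protected final List<String> log = new ArrayList<>();

    // The template method: final, so subclasses cannot reorder the cycle.
    public final void crawl(int depth) {
        inject();
        for (int round = 0; round < depth; round++) {
            generate(round);
            fetch(round);
            updateDb(round);
        }
        index();
    }

    // Default steps; a subclass overrides only the ones it cares about.
    protected void inject() { log.add("inject"); }
    protected void generate(int round) { log.add("generate " + round); }
    protected void fetch(int round) { log.add("fetch " + round); }
    protected void updateDb(int round) { log.add("updatedb " + round); }
    protected void index() { log.add("index"); }

    public List<String> steps() { return log; }
}

// A user customizes one step without copying the rest of the class.
class FilteringCrawl extends CrawlTemplate {
    @Override
    protected void fetch(int round) {
        // This is where custom filters would be plugged into the fetch.
        log.add("filtered-fetch " + round);
    }
}
```

Calling new FilteringCrawl().crawl(1) runs the default inject, generate, updatedb, and index steps around the overridden fetch.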

--Matt

On Wed, 12 Jan 2005 10:50:15 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Good point.  I meant thread-safe, not re-entrant.
> 
> Doug
> 
> Kragen Sitaker wrote:
> > On Fri, 2005-01-07 at 11:34 -0800, Doug Cutting wrote:
> >
> >>It's usually pretty easy to replace fields that must be synchronized 
> >>with ThreadLocals in order to make a class re-entrant.  Perhaps we 
> >>should do this to RegexURLFilter?
> >
> >
> > Nitpick --- as far as I know, ThreadLocals don't make things 
> > re-entrant, only thread-safe, which is a strictly weaker property.  
> > RegexURLFilter probably doesn't need to be re-entrant, because it's 
> > not very likely that it's going to call some client-provided code in 
> > the middle of filtering a URL and have that client-provided code 
> > call RegexURLFilter again --- right?
> >
> > I'd hate to have to argue with someone who thinks ThreadLocals make 
> > things re-entrant in some context where re-entrancy matters, having 
> > gotten the idea from a trusted source.
> >



