Matt: This is a great addition. We do this external to Nutch right now, but your code is going to make it a breeze to integrate all the rules we have to keep spam out!
Thanks
CC-

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Kangas
Sent: Sunday, January 16, 2005 4:04 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Implementing geography-by-IP filtering?

Stefan and/or Doug,

Here's a followup to my Jan 3 diff. This time I added two hooks to the Fetcher, for URLFilter and also for a new interface, ContentFilter. These allow one to:
- filter out URLs prior to fetching, and
- filter out fetched content prior to writing to a segment

This should provide a lot of flexibility for people who don't want to index the entire web. The only drawback I see is that the interface is too simple to be leveraged from the command-line; you'd have to make your own custom CrawlTool and plug in filters at the appropriate point in the crawl cycle.

Speaking of CrawlTool, I think it'd be great if end users could customize specific steps of the crawl cycle, in Java, w/o having to cut-and-paste the whole class. Template method is the pattern I'm thinking of here. Does this sound useful to anybody else?

--Matt

On Wed, 12 Jan 2005 10:50:15 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Good point. I meant thread-safe, not re-entrant.
>
> Doug
>
> Kragen Sitaker wrote:
> > On Fri, 2005-01-07 at 11:34 -0800, Doug Cutting wrote:
> >
> >>It's usually pretty easy to replace fields that must be synchronized
> >>with ThreadLocals in order to make a class re-entrant. Perhaps we
> >>should do this to RegexURLFilter?
> >
> >
> > Nitpick --- as far as I know, ThreadLocals don't make things
> > re-entrant, only thread-safe, which is a strictly weaker property.
> > RegexURLFilter probably doesn't need to be re-entrant, because it's
> > not very likely that it's going to call some client-provided code in
> > the middle of filtering a URL and have that client-provided code
> > call RegexURLFilter again --- right?
> >
> > I'd hate to have to argue with someone who thinks ThreadLocals make
> > things re-entrant in some context where re-entrancy matters, having
> > gotten the idea from a trusted source.
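
For readers following the ThreadLocal point in the quoted exchange, here is a minimal sketch of the idea, not Nutch's actual RegexURLFilter. The class name ThreadLocalUrlFilterSketch, the single deny pattern, and the example URLs are all assumptions made for illustration; the only thing borrowed from the thread is the suggestion to replace a field that would otherwise need synchronization with a ThreadLocal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only: this is NOT Nutch's RegexURLFilter.
final class ThreadLocalUrlFilterSketch {
    // A compiled Pattern is immutable, so sharing it across threads is safe.
    private final Pattern deny;

    // A Matcher is mutable. Giving each thread its own instance avoids
    // synchronizing on one shared Matcher.
    private final ThreadLocal<Matcher> matcher;

    ThreadLocalUrlFilterSketch(String denyRegex) {
        this.deny = Pattern.compile(denyRegex);
        this.matcher = ThreadLocal.withInitial(() -> this.deny.matcher(""));
    }

    // Returns the URL if it is accepted, or null if the deny pattern matches.
    // Thread-safe, but not re-entrant: if this method were somehow invoked
    // again on the same thread while a match was in progress, the inner call
    // would reset the very Matcher the outer call is still using.
    String filter(String url) {
        Matcher m = matcher.get();
        m.reset(url);
        return m.find() ? null : url;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadLocalUrlFilterSketch filter =
            new ThreadLocalUrlFilterSketch("\\.(gif|jpg|png)$");
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/logo.gif"
        };
        // Two threads share one filter; each gets its own Matcher lazily.
        Runnable task = () -> {
            for (String url : urls) {
                String verdict = filter.filter(url) == null ? "rejected" : "accepted";
                System.out.println(Thread.currentThread().getName()
                    + ": " + url + " -> " + verdict);
            }
        };
        Thread a = new Thread(task, "fetcher-1");
        Thread b = new Thread(task, "fetcher-2");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

The comment on filter() is where Kragen's nitpick shows up: per-thread state removes contention between threads, but it does nothing for a nested call on the same thread, which is why thread-safe is the weaker property.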
