Hi Kelvin,

Any change you'd like to make to Nutch will need the approval of a Nutch committer. I am not one of those folks, so my advice means little. (0.5 wink :-)

For the scale you're talking about, yes, any use of Regex or PrefixURLFilter will be cumbersome. I see at least two approaches that might work:

a) Modify URLFilter, and all of its callers, as you suggested. This will affect WebDBInjector, PruneIndexTool, UpdateDatabaseTool, and all plugins based off of it.

b) Intercept the fetch results before they are written to the segment and remove extraneous URLs there. Effectively this means hooking Fetcher.outputPage() and rewriting one of its arguments.

I have submitted patches myself for doing (b) with a new ContentFilter interface, but others promptly poked holes in my theory. :-) So if you want to take a stab at it, and try to propose a better interface for post-fetch filtering, by all means, go for it.

--Matt

On Thu, 17 Feb 2005 20:09:29 +0800, kelvin-lists <[EMAIL PROTECTED]> wrote:
> Hi Matt, for (2), feels like I'm jumping through hoops to
> get this done, and may scale for thousands of rules, but for
> millions, or tens of millions? I am looking at about a 50
> million URL database to index.
>
>
> ----- Original Message Follows -----
> From: Matt Kangas <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: Re: [Nutch-dev] Injecting URLs from database
> Date: Wed, 16 Feb 2005 10:42:22 -0500
>
> > Kelvin,
> >
> > (1) can be achieved by instantiating WebDBInjector and
> > calling addPage() repeatedly. This method is public in
> > CVS. (2) is best done with PrefixURLFilter; it uses a trie
> > datastructure, which scales much better for thousands of
> > rules.
> >
> > HTH,
> > --Matt
> >
> > On Tue, 15 Feb 2005 20:45:12 +0100, Kelvin Tan
> > <[EMAIL PROTECTED]> wrote:
> > > I'd like to
> > >
> > > 1) inject URLs from a database
> > > 2) add a RegexFilter for each URL such that only pages
> > > under each URL's TLD is indexed

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
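
For readers following the thread, the trie idea mentioned in the quoted reply can be sketched roughly as below. This is only an illustration of prefix matching with a character trie, not Nutch's actual PrefixURLFilter code; the class name PrefixTrie and its methods add() and matchesPrefix() are hypothetical names chosen for this example, and the example.com URLs are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not taken from Nutch's PrefixURLFilter. */
final class PrefixTrie {

    /** One trie node: outgoing edges keyed by character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored prefix ends exactly at this node
    }

    private final Node root = new Node();

    /** Stores one allowed URL prefix. */
    void add(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Returns true if any stored prefix is a prefix of the given URL. */
    boolean matchesPrefix(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            if (node.terminal) {
                return true; // a stored prefix ended before the URL did
            }
            node = node.children.get(url.charAt(i));
            if (node == null) {
                return false; // no stored prefix continues with this character
            }
        }
        return node.terminal; // the URL is exactly equal to a stored prefix
    }

    public static void main(String[] args) {
        PrefixTrie allowed = new PrefixTrie();
        allowed.add("http://example.com/");
        allowed.add("http://example.net/docs/");

        System.out.println(allowed.matchesPrefix("http://example.com/a/page.html"));
        System.out.println(allowed.matchesPrefix("http://example.org/"));
    }
}
```

In this sketch a lookup walks at most one node per character of the URL, so its cost depends on the URL length rather than on how many prefixes are stored. Memory use still grows with the number and length of stored prefixes, which is the part that would need measuring at the tens-of-millions scale discussed above.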
