Hi Kelvin,

Any change you'd like to make to Nutch will need the approval of a
Nutch committer. I am not one of those folks, so my advice means
little. (0.5 wink :-)

For the scale you're talking about, yes, using either RegexURLFilter or
PrefixURLFilter will be cumbersome. I see at least two approaches that
might work:

a) Modify URLFilter, and all of its callers, as you suggested. This
will affect WebDBInjector, PruneIndexTool, UpdateDatabaseTool, and all
plugins based on it.

b) Intercept the fetch results before they are written to the segment
and remove extraneous URLs there. Effectively this means hooking
Fetcher.outputPage() and rewriting one of its arguments.

I have submitted patches myself for doing (b) with a new ContentFilter
interface, but others promptly poked holes in my theory. :-) So if you
want to take a stab at it, and try to propose a better interface for
post-fetch filtering, by all means, go for it.
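
To make (b) a bit more concrete, here is a rough sketch of what a
post-fetch hook could look like. Everything in it is hypothetical: the
names OutlinkFilter and HostSetOutlinkFilter are made up for
illustration, this is not the ContentFilter interface from my patches,
and none of it is existing Nutch API. It assumes the caller can hand
the filter a page's outlinks as plain strings, and it uses a hash set
of hosts so the per-link cost stays constant no matter how many sites
are in your database.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical post-fetch hook; NOT an existing Nutch interface. */
interface OutlinkFilter {
    /**
     * Returns the outlinks that should be kept for the fetched page.
     * pageUrl is passed so implementations can use it as context.
     */
    List<String> filter(String pageUrl, List<String> outlinks);
}

/** Keeps only outlinks whose host is in a fixed set (one hash lookup per link). */
class HostSetOutlinkFilter implements OutlinkFilter {
    private final Set<String> allowedHosts = new HashSet<>();

    HostSetOutlinkFilter(Set<String> hosts) {
        // Normalize to lower case so host comparison is case-insensitive.
        for (String h : hosts) {
            allowedHosts.add(h.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public List<String> filter(String pageUrl, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String host = hostOf(link);
            if (host != null && allowedHosts.contains(host)) {
                kept.add(link);
            }
        }
        return kept;
    }

    /** Returns the lower-cased host of url, or null if it cannot be parsed. */
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null; // Malformed links are simply dropped.
        }
    }
}
```

The idea would be to load the set of hosts once from your database and
apply the filter to each page's outlinks before the page is written to
the segment. Whether an exact host match is the right rule for "pages
under each URL's TLD" is something you would have to decide.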

--Matt

On Thu, 17 Feb 2005 20:09:29 +0800, kelvin-lists
<[EMAIL PROTECTED]> wrote:
> Hi Matt, for (2), feels like I'm jumping through hoops to
> get this done, and may scale for thousands of rules, but for
> millions, or tens of millions? I am looking at about a 50
> million URL database to index.
> 
> 
> ----- Original Message Follows -----
> From: Matt Kangas <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: Re: [Nutch-dev] Injecting URLs from database
> Date: Wed, 16 Feb 2005 10:42:22 -0500
> 
> > Kelvin,
> >
> > (1) can be achieved by instantiating WebDBInjector and
> > calling addPage() repeatedly. This method is public in
> > CVS. (2) is best done with PrefixURLFilter; it uses a trie
> > datastructure, which scales much better for thousands of
> > rules.
> >
> > HTH,
> > --Matt
> >
> > On Tue, 15 Feb 2005 20:45:12 +0100, Kelvin Tan
> > <[EMAIL PROTECTED]> wrote:
> > > I'd like to
> > >
> > > 1) inject URLs from a database
> > > 2) add a RegexFilter for each URL such that only pages
> > > under each URL's TLD is indexed

