Hi Matt, for (2), it feels like I'm jumping through hoops to
get this done. It may scale for thousands of rules, but what
about millions, or tens of millions? I'm looking at roughly a
50 million URL database to index.
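To make the scaling concern concrete, here is a minimal, hypothetical sketch (not Nutch code) of why a trie-based prefix filter beats a linear regex scan: lookup cost grows with URL length, not with the number of rules, so 50 million prefixes cost no more per lookup than fifty. The class and method names are my own, for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: trie of URL prefixes. Adding a rule walks
// one path per character; matching a URL never touches more nodes
// than the URL has characters, regardless of how many rules exist.
public class PrefixTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean terminal; // a stored prefix ends at this node
    }

    private final Node root = new Node();

    public void addPrefix(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.terminal = true;
    }

    /** True if some stored prefix is a prefix of the given URL. */
    public boolean accepts(String url) {
        Node n = root;
        for (char c : url.toCharArray()) {
            if (n.terminal) return true;   // a shorter rule matched
            n = n.children.get(c);
            if (n == null) return false;   // fell off the trie
        }
        return n.terminal;                 // URL equals a rule exactly
    }

    public static void main(String[] args) {
        PrefixTrie t = new PrefixTrie();
        t.addPrefix("http://example.com/");
        t.addPrefix("http://nutch.org/docs/");
        System.out.println(t.accepts("http://example.com/page.html"));
        System.out.println(t.accepts("http://other.org/"));
    }
}
```

A regex filter with N rules does O(N) pattern matches per URL; the trie does one walk per URL, which is what makes the difference between thousands and tens of millions of rules.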

Taking a look at URLFilter: a filter that restricts URLs to
the TLD could be written if URLFilter accepted another param,
namely the page the link was found on.
UpdateDatabaseTool.pagecontentsChanged() would then pass in
oldPage (or oldPage.getURL()) as that param. It probably also
makes sense to introduce some kind of filter-chain mechanism.
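A minimal sketch of what I have in mind, assuming the extra parameter existed (this is not the actual Nutch URLFilter interface; the class and method names here are hypothetical, and I'm matching on host rather than literal TLD for illustration):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of the proposed two-argument filter: the
// candidate URL plus the URL of the page the link was found on.
// Rejection is signalled by returning null, acceptance by
// returning the URL unchanged.
public class SameHostFilter {
    public String filter(String urlString, String sourcePageUrl) {
        try {
            URL candidate = new URL(urlString);
            URL source = new URL(sourcePageUrl);
            // Keep only outlinks that stay on the source page's host.
            return candidate.getHost().equalsIgnoreCase(source.getHost())
                    ? urlString : null;
        } catch (MalformedURLException e) {
            return null; // reject anything unparseable
        }
    }

    public static void main(String[] args) {
        SameHostFilter f = new SameHostFilter();
        System.out.println(f.filter("http://example.com/a.html",
                                    "http://example.com/index.html"));
        System.out.println(f.filter("http://other.org/b.html",
                                    "http://example.com/index.html"));
    }
}
```

With a filter-chain mechanism, a filter like this would simply be one link in the chain, consulted after the existing regex/prefix filters.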

Is this a reasonable patch to Nutch? If so, I'm happy to
supply diffs. 

kelvin

----- Original Message Follows -----
From: Matt Kangas <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Injecting URLs from database
Date: Wed, 16 Feb 2005 10:42:22 -0500

> Kelvin,
> 
> (1) can be achieved by instantiating WebDBInjector and
> calling addPage() repeatedly. This method is public in
> CVS. (2) is best done with PrefixURLFilter; it uses a trie
> datastructure, which scales much better for thousands of
> rules.
> 
> HTH,
> --Matt
> 
> On Tue, 15 Feb 2005 20:45:12 +0100, Kelvin Tan
> <[EMAIL PROTECTED]> wrote:
> > I'd like to
> > 
> > 1) inject URLs from a database
> > 2) add a RegexFilter for each URL such that only pages
> > under each URL's TLD is indexed 
> > For the first, looking at the code, I suppose a way is
> > to subclass/customize WebDBInjector and add a method to
> > read URLs from the DB and call addFile() on each URL. So
> > that's ok. Is there a better way? I wish WebDBInjector
> > could be refactored into something a little more
> > extensible in terms of specifying different datasources,
> > like DmozURLSource and FileURLSource.
> > For the second, using RegexURLFilter to index a million
> > URLs at once quickly becomes untenable, since all filters
> > are stored in memory and every filter has to be matched
> > against every URL. An idea is to index the URLs one at a
> > time, adding a TLD regex rule for the currently indexed
> > URL and deleting the rule before the next URL starts. So
> > basically modifying the set of rules whilst indexing. Any
> > ideas on a smarter way to do this?
> > Thanks,
> > k
> 
> 
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products
> from real users. Discover which products truly live up to
> the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers

