Re: [Nutch-dev] Injecting URLs from database

Matt Kangas Wed, 16 Feb 2005 07:43:23 -0800

Kelvin,

(1) can be achieved by instantiating WebDBInjector and calling
addPage() once for each URL you read from your database. This method
is public in CVS.
(2) is best done with PrefixURLFilter; it uses a trie data structure,
which scales much better for thousands of rules than matching every
regex against every URL.
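
To illustrate why a trie helps for (2), here is a minimal sketch of
prefix matching. It is not the actual PrefixURLFilter source; the class
and method names (PrefixTrie, add, matchesAnyPrefix) are made up for
illustration. The point is only that a lookup walks the URL once, so its
cost depends on the URL length and not on how many prefixes are loaded.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative character trie for URL-prefix matching (hypothetical, not Nutch code). */
class PrefixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a registered prefix ends at this node
    }

    private final Node root = new Node();

    /** Registers an allowed prefix, e.g. "http://www.example.com/". */
    void add(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Returns true if any registered prefix is a prefix of the given URL. */
    boolean matchesAnyPrefix(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            if (node.terminal) {
                return true; // a registered prefix ended here
            }
            node = node.children.get(url.charAt(i));
            if (node == null) {
                return false; // no registered prefix continues with this character
            }
        }
        return node.terminal; // the URL itself is exactly a registered prefix
    }
}
```

You would call add() once per injected URL's site prefix, then call
matchesAnyPrefix() on each candidate URL during the crawl.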


HTH,
--Matt

On Tue, 15 Feb 2005 20:45:12 +0100, Kelvin Tan
<[EMAIL PROTECTED]> wrote:
> I'd like to
> 
> 1) inject URLs from a database
> 2) add a RegexFilter for each URL such that only pages under each URL's TLD 
> are indexed
> 
> For the first, looking at the code, I suppose a way is to subclass/customize 
> WebDBInjector and add a method to read URLs from the DB and call addFile() on 
> each URL. So that's ok. Is there a better way? I wish WebDBInjector could be 
> refactored into something a little more extensible in terms of specifying 
> different datasources, like DmozURLSource and FileURLSource.
> 
> For the second, using RegexURLFilter to index a million URLs at once quickly 
> becomes untenable since all filters are stored in-memory and every filter has 
> to be matched for every URL. An idea is to index the URLs one at a time, 
> adding a TLD regex rule for the currently indexed URL, and deleting the rule 
> before the next URL starts. So basically modifying the set of rules whilst 
> indexing. Any ideas on a smarter way to do this?
> 
> Thanks,
> k

