I'd like to

1) inject URLs from a database
2) add a RegexFilter for each URL such that only pages under each URL's TLD are 
indexed

For the first, looking at the code, I suppose one way is to subclass/customize 
WebDBInjector, adding a method that reads URLs from the DB and calls addFile() 
on each URL. That would work, but is there a better way? I wish WebDBInjector 
could be refactored into something a little more extensible, with pluggable 
data sources such as a DmozURLSource and a FileURLSource.
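
To make the idea concrete, here is a rough sketch of the kind of abstraction I 
have in mind. None of this exists in Nutch: URLSource, LineURLSource, 
JdbcURLSource and InjectorSketch are hypothetical names, and the injector is 
reduced to a method that just collects the URLs it is handed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: WebDBInjector has nothing like this today.
interface URLSource {
    /** Returns the next URL, or null when the source is exhausted. */
    String next();
}

// Reads one URL per line from any Reader (a flat file, a string, ...).
class LineURLSource implements URLSource {
    private final BufferedReader in;

    LineURLSource(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    public String next() {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    return line; // skip blank lines, return the first real one
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Reads URLs from the first column of an already-executed JDBC query.
class JdbcURLSource implements URLSource {
    private final ResultSet rows;

    JdbcURLSource(ResultSet rows) {
        this.rows = rows;
    }

    public String next() {
        try {
            return rows.next() ? rows.getString(1) : null;
        } catch (SQLException e) {
            throw new IllegalStateException("failed to read URL row", e);
        }
    }
}

// Stand-in for the injector: it only sees URLSource, never the backing store.
class InjectorSketch {
    static List<String> drain(URLSource source) {
        List<String> urls = new ArrayList<>();
        for (String url = source.next(); url != null; url = source.next()) {
            urls.add(url); // a real injector would add the page to the WebDB here
        }
        return urls;
    }
}
```

With something like this, the injector would not care whether the URLs come 
from a file, the DMOZ dump or a database.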

For the second, using RegexURLFilter to index a million URLs at once quickly 
becomes untenable, since all rules are stored in memory and every rule has to 
be tried against every URL. An idea is to index the URLs one at a time, adding 
a TLD regex rule for the currently indexed URL, and deleting the rule before 
the next URL starts. So basically modifying the set of rules whilst indexing. 
Any ideas on a smarter way to do this?
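
One alternative I can imagine, sketched below purely as an assumption on my 
part, is to drop the per-URL regexes and keep the seed hosts in a hash set, so 
each candidate URL costs one lookup instead of a million regex matches. 
HostSetFilter is a hypothetical class, not part of Nutch; its filter() method 
returns the URL to accept and null to reject, which is how I understand URL 
filters to behave. Note that it matches the exact host only, so subdomains of 
a seed would need extra handling.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter: accepts a URL only if its host was seen in a seed URL.
class HostSetFilter {
    private final Set<String> allowedHosts = new HashSet<>();

    /** Registers the host of a seed URL; unparsable seeds are ignored. */
    void allow(String seedUrl) {
        String host = hostOf(seedUrl);
        if (host != null) {
            allowedHosts.add(host);
        }
    }

    /** Returns the URL unchanged if its host is allowed, otherwise null. */
    String filter(String url) {
        String host = hostOf(url);
        return (host != null && allowedHosts.contains(host)) ? url : null;
    }

    // Extracts the lower-cased host, or null if the URL cannot be parsed.
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

The set still lives in memory, but a million host strings is far cheaper than 
a million compiled patterns, and the cost per URL no longer grows with the 
number of seeds.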

Thanks,
k



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
