Kelvin,

(1) can be achieved by instantiating WebDBInjector and calling addPage() repeatedly. This method is public in CVS.

(2) is best done with PrefixURLFilter; it uses a trie data structure, which scales much better for thousands of rules.
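
To make the scaling argument for (2) concrete, here is a minimal sketch of prefix matching with a trie. It is not the PrefixURLFilter source: the class PrefixTrieSketch, its methods addPrefix() and accepts(), and the example URLs are all made-up names for illustration, and the real filter's API may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT Nutch's PrefixURLFilter.
class PrefixTrieSketch {
    // One node per character; 'terminal' marks the end of a registered prefix.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    /** Registers a URL prefix, e.g. "http://www.example.com/". */
    void addPrefix(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Returns true if some registered prefix is a prefix of the given URL. */
    boolean accepts(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            if (node.terminal) {
                return true; // a complete registered prefix has been consumed
            }
            node = node.children.get(url.charAt(i));
            if (node == null) {
                return false; // no registered prefix continues with this character
            }
        }
        return node.terminal; // URL ended exactly on (or before) a prefix boundary
    }

    public static void main(String[] args) {
        PrefixTrieSketch filter = new PrefixTrieSketch();
        filter.addPrefix("http://www.example.com/");
        filter.addPrefix("http://lucene.apache.org/nutch/");

        System.out.println(filter.accepts("http://www.example.com/docs/index.html"));
        System.out.println(filter.accepts("http://www.example.org/"));
    }
}
```

A lookup visits at most one node per character of the URL, so its cost depends on the URL's length rather than on how many prefixes are loaded. A list of regex rules, by contrast, has to be tried rule by rule against every URL.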

HTH,
--Matt

On Tue, 15 Feb 2005 20:45:12 +0100, Kelvin Tan <[EMAIL PROTECTED]> wrote:
> I'd like to
>
> 1) inject URLs from a database
> 2) add a RegexFilter for each URL such that only pages under each URL's TLD
> is indexed
>
> For the first, looking at the code, I suppose a way is to subclass/customize
> WebDBInjector and add a method to read URLs from the DB and call addFile() on
> each URL. So that's ok. Is there a better way? I wish WebDBInjector could be
> refactored into something a little more extensible in terms of specifying
> different datasources, like DmozURLSource and FileURLSource.
>
> For the second, using RegexURLFilter to index a million URLs at once quickly
> becomes untenable since all filters are stored in-memory and every filter has
> to be matched for every URL. An idea is to index the URLs one at a time,
> adding a TLD regex rule for the currently indexed URL, and deleting the rule
> before the next URL starts. So basically modifying the set of rules whilst
> indexing. Any ideas on a smarter way to do this?
>
> Thanks,
> k

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
