On Tue, Feb 15, 2005 at 11:07:54PM -0800, David Spencer wrote:
> Kelvin Tan wrote:
>
> >I'd like to
> >
> >1) inject URLs from a database
> >2) add a RegexFilter for each URL such that only pages under each URL's
> >TLD is indexed
> >
> >For the first, looking at the code, I suppose a way is to
> >subclass/customize WebDBInjector and add a method to read URLs from the DB
> >and call addFile() on each URL. So that's ok. Is there a better way? I
> >wish WebDBInjector could be refactored into something a little more
> >extensible in terms of specifying different datasources, like
> >DmozURLSource and FileURLSource.
>
> Good timing, I've had the same basic question wrt database content (use
> case, dynamic database-fed web site), NNTP, and IMAP.
>
> I found the file plugin a bit bogus e.g.
> net.nutch.protocol.file.FileResponse.getDirAsHttpResponse() forms a web
> page on the fly representing a directory view, apparently so the crawler
> can parse a web page to find more URLs to crawl. More natural I think
> is if you can just directly drive a crawler and toss URLs at it.
>
> Question: Is there a way to directly drive the crawler with URLs (to be
> queued and then crawled)?

The current crawler (Fetcher.java) loops through the entries in ./fetchlist,
which is typically generated from ./db. However, you can certainly create
./fetchlist directly from your own URL sources.

> Last weekend I started on a NNTP plugin using the file plugin as a
> template. Things came together quickly enough but, as I used the
> jakarta-commons-net package for NNTP and it apparently doesn't provide a
> "stream protocol handler" for NNTP, any code that tries to new a URL
> with "nntp://" will get an exception (and I think the URL filtering does
> this).
>
> Question: Does this make sense that Nutch depends on URLs, thus any
> scheme not supported by the JVM (JVM supports http/https/file/ftp)
> will need a protocol handler?

Not necessarily. You can use other implementations. But, if I recall
correctly, you will have to extend URL.java to include nntp://

John

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
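
A footnote on the nntp:// exception discussed above: it comes from java.net.URL itself, which throws MalformedURLException for any scheme that has no stream handler. The sketch below shows one plain-JDK way around that, by passing a handler to the three-argument URL constructor. The class names NntpHandlerSketch and NntpStubHandler and the host news.example.org are made up for illustration. The handler only makes nntp:// URLs parseable and does not talk to a news server, and nothing here is taken from the Nutch code base, so whether Nutch's URL filtering can be pointed at such a handler is left open.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class NntpHandlerSketch {

    /** Hypothetical stub: lets java.net.URL parse nntp:// but cannot open connections. */
    static class NntpStubHandler extends URLStreamHandler {
        @Override
        protected int getDefaultPort() {
            return 119; // standard NNTP port
        }

        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            // A real plugin would delegate to an NNTP client library here.
            throw new IOException("no NNTP connection support in this sketch: " + u);
        }
    }

    /** Parses spec, supplying the stub handler when the scheme is nntp. */
    static URL parse(String spec) throws MalformedURLException {
        if (spec.regionMatches(true, 0, "nntp:", 0, 5)) {
            // A null context means spec is treated as an absolute URL.
            return new URL(null, spec, new NntpStubHandler());
        }
        return new URL(spec);
    }

    public static void main(String[] args) throws Exception {
        String spec = "nntp://news.example.org/comp.lang.java";

        // Without a handler the JDK rejects the scheme outright.
        try {
            new URL(spec);
            System.out.println("plain constructor accepted " + spec);
        } catch (MalformedURLException e) {
            System.out.println("plain constructor failed: " + e.getMessage());
        }

        // With the stub handler the same string parses into its parts.
        URL u = parse(spec);
        System.out.println(u.getProtocol() + " | " + u.getHost()
                + " | " + u.getDefaultPort() + " | " + u.getPath());
    }
}
```

The per-URL handler keeps the change local to the code that builds the URL. The JVM-wide alternative is URL.setURLStreamHandlerFactory, which can be called at most once per JVM, so it is only an option if nothing else in the process has already installed a factory.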
