On Tue, Feb 15, 2005 at 11:07:54PM -0800, David Spencer wrote:
> Kelvin Tan wrote:
> 
> >I'd like to 
> >
> >1) inject URLs from a database
> >2) add a RegexFilter for each URL such that only pages under each URL's 
> >TLD are indexed
> >
> >For the first, looking at the code, I suppose a way is to 
> >subclass/customize WebDBInjector and add a method to read URLs from the DB 
> >and call addFile() on each URL. So that's ok. Is there a better way? I 
> >wish WebDBInjector could be refactored into something a little more 
> >extensible in terms of specifying different datasources, like 
> >DmozURLSource and FileURLSource. 
> 
> Good timing, I've had the same basic question wrt database content (use 
> case, dynamic database-fed web site), NNTP, and IMAP.
> 
> I found the file plugin a bit bogus e.g.
> net.nutch.protocol.file.FileResponse.getDirAsHttpResponse() forms a web 
> page on the fly representing a directory view, apparently so the crawler 
> can parse a web page to find more URLs to crawl. More natural I think 
> is if you can just directly drive a crawler and toss URLs at it.
> 
> Question: Is there a way to directly drive the crawler with URLs (to be 
> queued and then crawled)?

The current crawler (Fetcher.java) loops through the entries in ./fetchlist,
which is typically generated from ./db. However, you can certainly
create ./fetchlist directly from your URL sources.

> 
> Last weekend I started on a NNTP plugin using the file plugin as a 
> template. Things came together quickly enough but, as I used the 
> jakarta-commons-net package for NNTP and it apparently doesn't provide a
> "stream protocol handler" for NNTP, any code that tries to construct a URL 
> with "nntp://" will get an exception (and I think the URL filtering does 
> this).
> 
> Question: Does it make sense that Nutch depends on URLs, and thus any 
> scheme not supported by the JVM (JVM supports http/https/file/ftp) 
> will need a protocol handler?

Not necessarily. You can use other implementations.
But, if I recall correctly, you will have to extend URL.java to
include nntp://
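
For reference, here is a rough sketch of the exception David describes and
of one way around it using only java.net. It is not taken from the Nutch
source and is a different route from extending URL.java: it passes a
parse-only URLStreamHandler straight to the URL constructor. The names
NntpUrlSketch and NntpHandler and the host news.example.org are made up
for illustration, and the handler deliberately does no fetching.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical sketch: none of these class names come from Nutch.
class NntpUrlSketch {

    // Parse-only handler: enough for java.net.URL to accept nntp://,
    // while the actual fetching stays in the protocol plugin.
    static class NntpHandler extends URLStreamHandler {
        @Override
        protected int getDefaultPort() {
            return 119; // standard NNTP port
        }

        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("not implemented: fetch via the plugin instead");
        }
    }

    // True if the JVM can parse the spec using only its registered handlers.
    static boolean parsesWithoutHandler(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Parses the spec with an explicit handler, so no handler lookup is needed.
    static URL parseWithHandler(String spec) {
        try {
            return new URL(null, spec, new NntpHandler());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        String spec = "nntp://news.example.org/comp.lang.java";
        System.out.println("parses without handler: " + parsesWithoutHandler(spec));
        URL u = parseWithHandler(spec);
        System.out.println(u.getProtocol() + " " + u.getHost() + " " + u.getDefaultPort());
    }
}
```

Any code that builds the URL from a bare string would still fail unless a
handler is registered JVM-wide (for example via URL.setURLStreamHandlerFactory),
so whether this is enough depends on where the URL filtering constructs its URLs.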

John



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
