Kelvin Tan wrote:

I'd like to

1) inject URLs from a database
2) add a RegexFilter for each URL such that only pages under each URL's TLD are indexed

For the first, looking at the code, I suppose one way is to subclass/customize WebDBInjector and add a method that reads URLs from the DB and calls addFile() on each one. So that's OK, but is there a better way? I wish WebDBInjector could be refactored into something a little more extensible in terms of specifying different data sources, like DmozURLSource and FileURLSource.

Good timing; I've had the same basic question with respect to database content (the use case being a dynamic, database-fed web site), NNTP, and IMAP.
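
For the database case, the least intrusive thing I've come up with so far is to sidestep WebDBInjector entirely and just dump the URLs to a flat file, one per line, and point the usual inject step at that. Rough sketch only; the JDBC URL, credentials, and the "urls"/"url" table and column names below are placeholders for whatever the real schema is:

// Sketch: pull URLs out of a database over JDBC and write them to a flat
// seed file. Everything schema-specific here is made up.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DumpDbUrls {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver");   // whichever driver applies
    Connection conn =
        DriverManager.getConnection("jdbc:mysql://localhost/crawl", "user", "pass");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT url FROM urls");
    BufferedWriter out = new BufferedWriter(new FileWriter("seed_urls.txt"));
    while (rs.next()) {
      out.write(rs.getString("url"));         // one URL per line
      out.newLine();
    }
    out.close();
    rs.close();
    stmt.close();
    conn.close();
  }
}

It doesn't make the injector any more extensible, but it keeps the database details out of Nutch entirely.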


I found the file plugin a bit bogus: for example, net.nutch.protocol.file.FileResponse.getDirAsHttpResponse() builds a web page on the fly representing a directory listing, apparently so the crawler can parse a web page to find more URLs to crawl. It would be more natural, I think, to be able to drive a crawler directly and toss URLs at it.
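
Just to illustrate what I mean, the plugin essentially does something of this shape: turn a directory into HTML links so the regular parser picks the entries up as outlinks. This is not the actual Nutch code, only the idea:

// Sketch of "directory as HTML" a la getDirAsHttpResponse(); names are my own.
import java.io.File;

public class DirAsHtml {
  public static String toHtml(File dir) {
    StringBuffer sb = new StringBuffer();
    sb.append("<html><head><title>").append(dir.getPath()).append("</title></head><body>\n");
    File[] entries = dir.listFiles();
    for (int i = 0; entries != null && i < entries.length; i++) {
      String name = entries[i].getName() + (entries[i].isDirectory() ? "/" : "");
      // each entry becomes a link the HTML parser will report as an outlink
      sb.append("<a href=\"").append(name).append("\">").append(name).append("</a><br>\n");
    }
    sb.append("</body></html>");
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toHtml(new File(args.length > 0 ? args[0] : ".")));
  }
}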


Question: Is there a way to directly drive the crawler with URLs (to be queued and then crawled)?

Last weekend I started on an NNTP plugin using the file plugin as a template. Things came together quickly enough, but since I used the jakarta-commons-net package for NNTP and it apparently doesn't provide a "stream protocol handler" for nntp, any code that tries to construct a new URL with "nntp://" will get an exception (and I think the URL filtering does exactly that).


Question: Does it make sense that Nutch depends on URL objects, so that any scheme not supported by the JVM (the JVM supports http/https/file/ftp) will need a protocol handler?
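
For what it's worth, it looks like the java.net side can be pacified by registering a URLStreamHandler for nntp so that constructing the URL stops throwing. A rough sketch of the mechanism; nothing Nutch-specific, and the handler here is deliberately a stub since the plugin (via commons-net) would do the real fetching:

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Registers a stub handler for "nntp" so new URL("nntp://...") no longer
// throws MalformedURLException.
public class NntpUrlStub implements URLStreamHandlerFactory {
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (!"nntp".equals(protocol)) {
      return null;                        // defer to the JVM's built-in handlers
    }
    return new URLStreamHandler() {
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("nntp is fetched by the plugin, not java.net");
      }
    };
  }

  public static void main(String[] args) throws Exception {
    URL.setURLStreamHandlerFactory(new NntpUrlStub());   // allowed once per JVM
    URL u = new URL("nntp://news.example.com/comp.lang.java");
    System.out.println(u.getHost());      // prints the host instead of throwing
  }
}

(The other standard route is the java.protocol.handler.pkgs system property pointing at a package containing an nntp Handler class, which may play nicer if other code also wants to register handlers.)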







For the second, using RegexURLFilter to index a million URLs at once quickly becomes untenable since all filters are stored in-memory and every filter has to be matched for every URL. An idea is to index the URLs one at a time, adding a TLD regex rule for the currently indexed URL, and deleting the rule before the next URL starts. So basically modifying the set of rules whilst indexing. Any ideas on a smarter way to do this?

I think this matches my goals too: I want to be able to drive the crawler by handing it URLs, and I already know those URLs should be crawled, otherwise I wouldn't be passing them on in the first place.
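
On the regex scaling point, rather than adding and deleting RegexURLFilter rules per URL, one idea (just a sketch, the class below is not anything in Nutch) is to reduce each seed to its host once, keep the hosts in a set, and accept a candidate only if its host, or a parent domain of it, is in that set. That turns a million regex matches per URL into a handful of hash lookups:

import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Sketch of a host-set alternative to per-seed regex rules.
public class SeedHostFilter {
  private final Set seedHosts = new HashSet();

  public void addSeed(String seedUrl) throws Exception {
    seedHosts.add(new URL(seedUrl).getHost().toLowerCase());
  }

  // If I recall the URL filter convention correctly: return the URL to
  // accept it, null to reject it.
  public String filter(String url) {
    try {
      String host = new URL(url).getHost().toLowerCase();
      while (host.length() > 0) {
        if (seedHosts.contains(host)) {
          return url;                     // under some seed's domain
        }
        int dot = host.indexOf('.');
        if (dot < 0) {
          break;
        }
        host = host.substring(dot + 1);   // try the parent domain next
      }
    } catch (Exception e) {
      // malformed URLs fall through and are rejected
    }
    return null;
  }
}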


thx,
 Dave



Thanks, k






