Kelvin Tan wrote:

I'd like to

1) inject URLs from a database
2) add a RegexFilter for each URL such that only pages under each URL's TLD are indexed

For the first, looking at the code, I suppose one way is to subclass/customize WebDBInjector and add a method that reads URLs from the DB and calls addFile() on each one. So that's OK, but is there a better way? I wish WebDBInjector could be refactored into something a little more extensible in terms of specifying different data sources, like DmozURLSource and FileURLSource.

Good timing; I've had the same basic question with respect to database content (the use case being a dynamic, database-fed web site), NNTP, and IMAP.
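
For the database case, the least intrusive thing I've come up with so far is to sidestep WebDBInjector entirely and just dump the URLs to a flat file, one per line, and point the usual inject step at that. Rough sketch only; the JDBC URL, credentials, and the "urls"/"url" table and column names below are placeholders for whatever the real schema is:

// Sketch: pull URLs out of a database over JDBC and write them to a flat
// seed file. Everything schema-specific here is made up.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DumpDbUrls {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver");   // whichever driver applies
    Connection conn =
        DriverManager.getConnection("jdbc:mysql://localhost/crawl", "user", "pass");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT url FROM urls");
    BufferedWriter out = new BufferedWriter(new FileWriter("seed_urls.txt"));
    while (rs.next()) {
      out.write(rs.getString("url"));         // one URL per line
      out.newLine();
    }
    out.close();
    rs.close();
    stmt.close();
    conn.close();
  }
}

It doesn't make the injector any more extensible, but it keeps the database details out of Nutch entirely.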


I found the file plugin a bit bogus: for example, net.nutch.protocol.file.FileResponse.getDirAsHttpResponse() builds a web page on the fly representing a directory listing, apparently so the crawler can parse a web page to find more URLs to crawl. It would be more natural, I think, to be able to drive a crawler directly and toss URLs at it.
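
Just to illustrate what I mean, the plugin essentially does something of this shape: turn a directory into HTML links so the regular parser picks the entries up as outlinks. This is not the actual Nutch code, only the idea:

// Sketch of "directory as HTML" a la getDirAsHttpResponse(); names are my own.
import java.io.File;

public class DirAsHtml {
  public static String toHtml(File dir) {
    StringBuffer sb = new StringBuffer();
    sb.append("<html><head><title>").append(dir.getPath()).append("</title></head><body>\n");
    File[] entries = dir.listFiles();
    for (int i = 0; entries != null && i < entries.length; i++) {
      String name = entries[i].getName() + (entries[i].isDirectory() ? "/" : "");
      // each entry becomes a link the HTML parser will report as an outlink
      sb.append("<a href=\"").append(name).append("\">").append(name).append("</a><br>\n");
    }
    sb.append("</body></html>");
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toHtml(new File(args.length > 0 ? args[0] : ".")));
  }
}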


Question: Is there a way to directly drive the crawler with URLs (to be queued and then crawled)?

Last weekend I started on an NNTP plugin using the file plugin as a template. Things came together quickly enough, but since I used the jakarta-commons-net package for NNTP and it apparently doesn't provide a "stream protocol handler" for nntp, any code that tries to construct a new URL with "nntp://" will get an exception (and I think the URL filtering does exactly that).


Question: Does it make sense that Nutch depends on URL objects, so that any scheme not supported by the JVM (the JVM supports http/https/file/ftp) will need a protocol handler?
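
For what it's worth, it looks like the java.net side can be pacified by registering a URLStreamHandler for nntp so that constructing the URL stops throwing. A rough sketch of the mechanism; nothing Nutch-specific, and the handler here is deliberately a stub since the plugin (via commons-net) would do the real fetching:

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Registers a stub handler for "nntp" so new URL("nntp://...") no longer
// throws MalformedURLException.
public class NntpUrlStub implements URLStreamHandlerFactory {
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (!"nntp".equals(protocol)) {
      return null;                        // defer to the JVM's built-in handlers
    }
    return new URLStreamHandler() {
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("nntp is fetched by the plugin, not java.net");
      }
    };
  }

  public static void main(String[] args) throws Exception {
    URL.setURLStreamHandlerFactory(new NntpUrlStub());   // allowed once per JVM
    URL u = new URL("nntp://news.example.com/comp.lang.java");
    System.out.println(u.getHost());      // prints the host instead of throwing
  }
}

(The other standard route is the java.protocol.handler.pkgs system property pointing at a package containing an nntp Handler class, which may play nicer if other code also wants to register handlers.)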







For the second, using RegexURLFilter to index a million URLs at once quickly becomes untenable since all filters are stored in-memory and every filter has to be matched for every URL. An idea is to index the URLs one at a time, adding a TLD regex rule for the currently indexed URL, and deleting the rule before the next URL starts. So basically modifying the set of rules whilst indexing. Any ideas on a smarter way to do this?

I think this matches my goals too: I want to be able to drive the crawler by handing it URLs, and I already know those URLs should be crawled, otherwise I wouldn't be passing them on in the first place.
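
On the regex scaling point, rather than adding and deleting RegexURLFilter rules per URL, one idea (just a sketch, the class below is not anything in Nutch) is to reduce each seed to its host once, keep the hosts in a set, and accept a candidate only if its host, or a parent domain of it, is in that set. That turns a million regex matches per URL into a handful of hash lookups:

import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Sketch of a host-set alternative to per-seed regex rules.
public class SeedHostFilter {
  private final Set seedHosts = new HashSet();

  public void addSeed(String seedUrl) throws Exception {
    seedHosts.add(new URL(seedUrl).getHost().toLowerCase());
  }

  // If I recall the URL filter convention correctly: return the URL to
  // accept it, null to reject it.
  public String filter(String url) {
    try {
      String host = new URL(url).getHost().toLowerCase();
      while (host.length() > 0) {
        if (seedHosts.contains(host)) {
          return url;                     // under some seed's domain
        }
        int dot = host.indexOf('.');
        if (dot < 0) {
          break;
        }
        host = host.substring(dot + 1);   // try the parent domain next
      }
    } catch (Exception e) {
      // malformed URLs fall through and are rejected
    }
    return null;
  }
}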


thx,
 Dave



Thanks, k






