David, it seems that you and I are both using Nutch primarily for its crawling ability. This is probably an obvious route for Lucene developers who need a crawler. I believe lucene-user also points people Nutch's way whenever someone asks whether there is a crawler they can use with Lucene.
There is obviously significant complexity in Nutch as a result of needing to support the goals stated on http://www.nutch.org/docs/en/about.html, much of which is probably unnecessary for folks who just need some plain, simple crawling done. So, just to throw the question out there: are there any plans for, or any possibility of, extracting the crawling part of Nutch into a library, not unlike what Lucene does for search?

k

----- Original Message Follows -----
From: David Spencer <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [email protected]
Subject: Re: [Nutch-dev] Injecting URLs from database
Date: Tue, 15 Feb 2005 23:07:54 -0800

> Kelvin Tan wrote:
>
> > I'd like to
> >
> > 1) inject URLs from a database
> > 2) add a RegexFilter for each URL such that only pages
> >    under each URL's TLD are indexed
> >
> > For the first, looking at the code, I suppose a way is to
> > subclass/customize WebDBInjector and add a method to read URLs
> > from the DB and call addFile() on each URL. So that's ok. Is
> > there a better way? I wish WebDBInjector could be refactored
> > into something a little more extensible in terms of specifying
> > different datasources, like DmozURLSource and FileURLSource.
>
> Good timing, I've had the same basic question wrt database
> content (use case: dynamic database-fed web site), NNTP, and IMAP.
>
> I found the file plugin a bit bogus, e.g.
> net.nutch.protocol.file.FileResponse.getDirAsHttpResponse()
> forms a web page on the fly representing a directory view,
> apparently so the crawler can parse a web page to find more URLs
> to crawl. More natural, I think, is if you can just directly
> drive a crawler and toss URLs at it.
>
> Question: Is there a way to directly drive the crawler with URLs
> (to be queued and then crawled)?
>
> Last weekend I started on an NNTP plugin using the file plugin
> as a template.
> Things came together quickly enough but, as I used the
> jakarta-commons-net package for NNTP and it apparently doesn't
> provide a "stream protocol handler" for NNTP, any code that tries
> to new a URL with "nntp://" will get an exception (and I think
> the URL filtering does this).
>
> Question: Does it make sense that Nutch depends on URLs, so that
> any scheme not supported by the JVM (the JVM supports
> http/https/file/ftp) will need a protocol handler?
>
> > For the second, using RegexURLFilter to index a million URLs
> > at once quickly becomes untenable since all filters are stored
> > in memory and every filter has to be matched for every URL. An
> > idea is to index the URLs one at a time, adding a TLD regex
> > rule for the currently indexed URL, and deleting the rule
> > before the next URL starts. So basically modifying the set of
> > rules whilst indexing. Any ideas on a smarter way to do this?
>
> I think this matches my goals too - I want to be able to drive
> the crawler by giving it URLs, and I know it should crawl the
> URLs, otherwise I wouldn't be passing the URLs on in the first
> place.
>
> thx,
>  Dave
>
> > Thanks,
> > k
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products
> > from real users. Discover which products truly live up
> > to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Nutch-developers mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
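As a footnote to the quoted point about "nntp://" URLs: the behaviour can be reproduced with the JDK alone. The sketch below is only an illustration, not Nutch code. The class name SchemeCheck, its two helper methods, and the news.example.org address are made up for the example. It assumes a stock JVM with no NNTP stream protocol handler registered, in which case constructing a java.net.URL fails while java.net.URI still parses the same string.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper class, for illustration only; not part of Nutch.
public class SchemeCheck {

    // Returns true if this JVM can construct a java.net.URL for the string,
    // i.e. a stream protocol handler exists for its scheme.
    // new URL(String) is deprecated in recent JDKs but still behaves this way.
    @SuppressWarnings("deprecation")
    static boolean urlConstructible(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // Thrown for an unknown protocol, among other causes.
            return false;
        }
    }

    // Returns the scheme as parsed by java.net.URI, or null if the string
    // is not a syntactically valid URI. No protocol handler is needed here.
    static String uriScheme(String spec) {
        try {
            return new URI(spec).getScheme();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String http = "http://www.nutch.org/docs/en/about.html";
        // Hypothetical newsgroup address, assumed for the example.
        String nntp = "nntp://news.example.org/comp.lang.java";

        System.out.println("URL ok for http: " + urlConstructible(http));
        System.out.println("URL ok for nntp: " + urlConstructible(nntp));
        System.out.println("URI scheme for nntp: " + uriScheme(nntp));
    }
}
```

One possible reading is that code which only needs to queue or filter addresses could carry them as URI objects or plain strings until a handler is registered, though whether that fits Nutch's filtering code is not something this sketch establishes.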
