David, seems like you and I are using Nutch primarily for
its crawling ability. This is probably an obvious route for
Lucene developers who need a crawler.
Yeah, I'm coming at this as an experienced Lucene person. As an aside, I've coded my own crawler (generic, but it can be plugged into Lucene), but I'm trying to learn about and use Nutch, in part to leverage what's already been done (e.g. I'd prefer to retire my crawler and use Nutch's).
I believe lucene-user also points people Nutch's way whenever someone asks if there is a crawler they can use with Lucene.
There is obviously significant complexity introduced into Nutch as a result of needing to support the goals stated on http://www.nutch.org/docs/en/about.html, complexity that is probably unnecessary for folks who just need some plain, simple crawling done.
Ahh, part of my context is also thinking of ways to set it up for "enterprise/intranet" search, things like:
-- more doc parsers: looking at 0.6, at least PPT is needed, plus the ability to burrow into zip and *.tar.gz archives; maybe "eml" and "pst" formats too for email... and does MIME (email) count as a separate format?
-- more protocols: imap, nntp, smb
Some of these may not seem related to the goals at the link you gave above, but I think they're consistent with the Nutch direction and will help Nutch take over the world :)
So just to throw the question out there: are there any
plans/possibilities of extracting the crawling part of Nutch
into a library, not unlike what Lucene does for search?
k
----- Original Message Follows -----
From: David Spencer <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [email protected]
Subject: Re: [Nutch-dev] Injecting URLs from database
Date: Tue, 15 Feb 2005 23:07:54 -0800
Kelvin Tan wrote:
I'd like to
1) inject URLs from a database
2) add a RegexFilter for each URL such that only pages under each URL's TLD are indexed

For the first, looking at the code, I suppose a way is to subclass/customize WebDBInjector and add a method to read URLs from the DB and call addFile() on each URL. So that's ok. Is there a better way? I wish WebDBInjector could be refactored into something a little more extensible in terms of specifying different datasources, like DmozURLSource and FileURLSource.
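Roughly the shape I have in mind, as a sketch only: I'm assuming addFile(String) is the per-URL call on WebDBInjector, and the JDBC URL, table, and column names below are made up, not anything in Nutch.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
// plus whatever import WebDBInjector needs (net.nutch.tools, I believe)

// Sketch of a DatabaseURLSource in the spirit of the refactoring above:
// read seed URLs out of a database and hand each one to the injector.
public class DatabaseURLSource {

    public static void inject(WebDBInjector injector, String jdbcUrl)
            throws Exception {
        Connection conn = DriverManager.getConnection(jdbcUrl);
        try {
            Statement stmt = conn.createStatement();
            // "seed_urls"/"url" are placeholder names for this sketch
            ResultSet rs = stmt.executeQuery("SELECT url FROM seed_urls");
            while (rs.next()) {
                injector.addFile(rs.getString("url"));
            }
        } finally {
            conn.close();
        }
    }
}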
Good timing; I've had the same basic question wrt database content (use case: a dynamic, database-fed web site), NNTP, and IMAP.
I found the file plugin a bit bogus; e.g. net.nutch.protocol.file.FileResponse.getDirAsHttpResponse() forms a web page on the fly representing a directory view, apparently so the crawler can parse a web page to find more URLs to crawl. More natural, I think, would be the ability to directly drive a crawler and toss URLs at it.
Question: Is there a way to directly drive the crawler with URLs (to be queued and then crawled)?
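The closest thing I've found is the indirect route from the tutorial: write the URLs to a flat file and hand it to the injector. A trivial sketch below; the paths and seed URLs are placeholders, and I'm going from memory on the inject command.

import java.io.FileWriter;
import java.io.PrintWriter;

// Sketch: dump seed URLs to a flat file, then queue them with the
// injector ("bin/nutch inject db -urlfile urls.txt", if memory serves).
// Not the "toss URLs directly at the crawler" API I'm asking about.
public class SeedFileWriter {
    public static void main(String[] args) throws Exception {
        String[] seeds = {
            "http://example.com/",   // placeholder seeds
            "http://example.org/"
        };
        PrintWriter out = new PrintWriter(new FileWriter("urls.txt"));
        for (String seed : seeds) {
            out.println(seed);
        }
        out.close();
    }
}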
Last weekend I started on an NNTP plugin using the file plugin as a template. Things came together quickly enough, but since I used the jakarta-commons-net package for NNTP and it apparently doesn't provide a "stream protocol handler" for NNTP, any code that tries to construct a java.net.URL with "nntp://" gets an exception (and I think the URL filtering does this).
Question: Does it make sense that Nutch depends on URLs, so any scheme not supported by the JVM (the JVM supports http/https/file/ftp) will need a protocol handler?
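If so, I suppose the workaround is registering a stream handler so that new URL("nntp://...") at least parses. A minimal sketch using the standard java.net factory hook (untested against Nutch; the class name is mine):

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Sketch: register a minimal stream handler so that
// new URL("nntp://news.example.com/...") parses instead of throwing
// MalformedURLException. The real fetching would still be done by the
// Nutch protocol plugin; this handler only satisfies java.net.URL.
public class NntpHandlerFactory implements URLStreamHandlerFactory {

    public URLStreamHandler createURLStreamHandler(String protocol) {
        if ("nntp".equals(protocol)) {
            return new URLStreamHandler() {
                protected URLConnection openConnection(URL u) throws IOException {
                    throw new IOException("nntp fetching is handled by the plugin");
                }
            };
        }
        return null; // unknown here: fall back to the JVM's built-in handlers
    }

    public static void register() {
        // Note: the JVM allows this factory to be set only once.
        URL.setURLStreamHandlerFactory(new NntpHandlerFactory());
    }
}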
For the second, using RegexURLFilter to index a million URLs at once quickly becomes untenable, since all filters are stored in memory and every filter has to be matched against every URL. One idea is to index the URLs one at a time: add a TLD regex rule for the URL currently being indexed, then delete the rule before the next URL starts, so basically modifying the set of rules while indexing. Any ideas on a smarter way to do this?
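One shape a smarter way might take: since the rules are really per-host, skip regexes entirely and keep the allowed hosts in a hash set, which makes each check O(1). A sketch only: the class and method names are mine, matching on host is an approximation of the per-seed "TLD" rule, and wiring it in as an actual Nutch URLFilter plugin is left out.

import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Sketch of a host-set filter: one hash lookup per URL instead of a
// million regex matches.
public class HostSetFilter {

    private final Set<String> allowedHosts = new HashSet<String>();

    /** Call once per seed URL before the crawl. */
    public void allow(String seedUrl) throws Exception {
        allowedHosts.add(new URL(seedUrl).getHost().toLowerCase());
    }

    /** Returns the URL if its host was seeded, or null to reject it. */
    public String filter(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            return allowedHosts.contains(host) ? urlString : null;
        } catch (Exception e) {
            return null; // unparseable URLs are rejected
        }
    }
}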
I think this matches my goals too: I want to be able to drive the crawler by giving it URLs, and I know it should crawl them, otherwise I wouldn't be passing them on in the first place.
thx, Dave
Thanks, k
