David, seems like you and I are using Nutch primarily for
its crawling ability. This is probably an obvious route for
Lucene developers who need a crawler.
Yeah, I'm coming at this as an experienced Lucene person. As an aside, I've coded my own crawler (generic, but it can be plugged into Lucene), but I'm trying to learn about and use Nutch, in part to leverage what's already been done (e.g. I'd prefer to retire my crawler and use Nutch's).
I believe lucene-user also points people Nutch's way whenever someone asks if there is a crawler they can use with Lucene.
There is obviously significant complexity introduced into Nutch as a result of needing to support the goals stated on http://www.nutch.org/docs/en/about.html, complexity that is probably unnecessary for folks who just need some plain, simple crawling done.
Ahh, part of my context is also thinking of ways to set it up for "enterprise/intranet" search, things like:
-- more doc parsers: looking at 0.6, at least PPT is needed, plus the ability to burrow into zip and *.tar.gz archives; maybe "eml" and "pst" formats too for email... and does MIME (email) count as a separate format?
-- more protocols: imap, nntp, smb
Some of these may not seem related to the goals at the link you gave above, but I think they're consistent with the Nutch direction and will help Nutch take over the world :)
So just to throw the question out there: are there any
plans/possibilities of extracting the crawling part of Nutch
into a library, not unlike what Lucene does for search?
k
----- Original Message Follows -----
From: David Spencer <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [email protected]
Subject: Re: [Nutch-dev] Injecting URLs from database
Date: Tue, 15 Feb 2005 23:07:54 -0800
Kelvin Tan wrote:
I'd like to
1) inject URLs from a database
2) add a RegexFilter for each URL such that only pages under each URL's TLD are indexed

For the first, looking at the code, I suppose a way is to subclass/customize WebDBInjector and add a method to read URLs from the DB and call addFile() on each URL. So that's ok. Is there a better way? I wish WebDBInjector could be refactored into something a little more extensible in terms of specifying different datasources, like DmozURLSource and FileURLSource.
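Roughly the shape I have in mind, as a sketch only: I'm assuming addFile(String) is the per-URL call on WebDBInjector, and the JDBC URL, table, and column names below are made up, not anything in Nutch.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
// plus whatever import WebDBInjector needs (net.nutch.tools, I believe)

// Sketch of a DatabaseURLSource in the spirit of the refactoring above:
// read seed URLs out of a database and hand each one to the injector.
public class DatabaseURLSource {

    public static void inject(WebDBInjector injector, String jdbcUrl)
            throws Exception {
        Connection conn = DriverManager.getConnection(jdbcUrl);
        try {
            Statement stmt = conn.createStatement();
            // "seed_urls"/"url" are placeholder names for this sketch
            ResultSet rs = stmt.executeQuery("SELECT url FROM seed_urls");
            while (rs.next()) {
                injector.addFile(rs.getString("url"));
            }
        } finally {
            conn.close();
        }
    }
}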
Good timing; I've had the same basic question wrt database content (use case: a dynamic, database-fed web site), NNTP, and IMAP.
I found the file plugin a bit bogus; e.g. net.nutch.protocol.file.FileResponse.getDirAsHttpResponse() forms a web page on the fly representing a directory view, apparently so the crawler can parse a web page to find more URLs to crawl. More natural, I think, would be the ability to directly drive a crawler and toss URLs at it.
Question: Is there a way to directly drive the crawler with URLs (to be queued and then crawled)?
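The closest thing I've found is the indirect route from the tutorial: write the URLs to a flat file and hand it to the injector. A trivial sketch below; the paths and seed URLs are placeholders, and I'm going from memory on the inject command.

import java.io.FileWriter;
import java.io.PrintWriter;

// Sketch: dump seed URLs to a flat file, then queue them with the
// injector ("bin/nutch inject db -urlfile urls.txt", if memory serves).
// Not the "toss URLs directly at the crawler" API I'm asking about.
public class SeedFileWriter {
    public static void main(String[] args) throws Exception {
        String[] seeds = {
            "http://example.com/",   // placeholder seeds
            "http://example.org/"
        };
        PrintWriter out = new PrintWriter(new FileWriter("urls.txt"));
        for (String seed : seeds) {
            out.println(seed);
        }
        out.close();
    }
}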
Last weekend I started on an NNTP plugin using the file plugin as a template. Things came together quickly enough, but since I used the jakarta-commons-net package for NNTP and it apparently doesn't provide a "stream protocol handler" for NNTP, any code that tries to construct a java.net.URL with "nntp://" gets an exception (and I think the URL filtering does this).
Question: Does it make sense that Nutch depends on URLs, so any scheme not supported by the JVM (the JVM supports http/https/file/ftp) will need a protocol handler?
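If so, I suppose the workaround is registering a stream handler so that new URL("nntp://...") at least parses. A minimal sketch using the standard java.net factory hook (untested against Nutch; the class name is mine):

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Sketch: register a minimal stream handler so that
// new URL("nntp://news.example.com/...") parses instead of throwing
// MalformedURLException. The real fetching would still be done by the
// Nutch protocol plugin; this handler only satisfies java.net.URL.
public class NntpHandlerFactory implements URLStreamHandlerFactory {

    public URLStreamHandler createURLStreamHandler(String protocol) {
        if ("nntp".equals(protocol)) {
            return new URLStreamHandler() {
                protected URLConnection openConnection(URL u) throws IOException {
                    throw new IOException("nntp fetching is handled by the plugin");
                }
            };
        }
        return null; // unknown here: fall back to the JVM's built-in handlers
    }

    public static void register() {
        // Note: the JVM allows this factory to be set only once.
        URL.setURLStreamHandlerFactory(new NntpHandlerFactory());
    }
}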
For the second, using RegexURLFilter to index a million URLs at once quickly becomes untenable, since all filters are stored in memory and every filter has to be matched against every URL. One idea is to index the URLs one at a time: add a TLD regex rule for the URL currently being indexed, then delete the rule before the next URL starts, so basically modifying the set of rules while indexing. Any ideas on a smarter way to do this?
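One shape a smarter way might take: since the rules are really per-host, skip regexes entirely and keep the allowed hosts in a hash set, which makes each check O(1). A sketch only: the class and method names are mine, matching on host is an approximation of the per-seed "TLD" rule, and wiring it in as an actual Nutch URLFilter plugin is left out.

import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Sketch of a host-set filter: one hash lookup per URL instead of a
// million regex matches.
public class HostSetFilter {

    private final Set<String> allowedHosts = new HashSet<String>();

    /** Call once per seed URL before the crawl. */
    public void allow(String seedUrl) throws Exception {
        allowedHosts.add(new URL(seedUrl).getHost().toLowerCase());
    }

    /** Returns the URL if its host was seeded, or null to reject it. */
    public String filter(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            return allowedHosts.contains(host) ? urlString : null;
        } catch (Exception e) {
            return null; // unparseable URLs are rejected
        }
    }
}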
I think this matches my goals too: I want to be able to drive the crawler by giving it URLs, and I know it should crawl them, otherwise I wouldn't be passing them on in the first place.
thx, Dave
Thanks, k
