Re: [Nutch-dev] Injecting URLs from database

kelvin-lists Thu, 17 Feb 2005 04:09:13 -0800

David, it seems you and I are both using Nutch primarily for
its crawling ability. This is probably an obvious route for
Lucene developers who need a crawler. I believe lucene-user
also points people Nutch's way whenever someone asks if
there is a crawler they can use with Lucene.


There is obviously significant complexity introduced into
Nutch as a result of needing to support the goals stated on
http://www.nutch.org/docs/en/about.html, complexity which is
probably unnecessary for folks who just need some plain,
simple crawling done.

So just to throw the question out there: are there any
plans/possibilities of extracting the crawling part of Nutch
into a library, not unlike what Lucene does for search?

k

----- Original Message Follows -----
From: David Spencer <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [email protected]
Subject: Re: [Nutch-dev] Injecting URLs from database
Date: Tue, 15 Feb 2005 23:07:54 -0800

> Kelvin Tan wrote:
>
> > I'd like to
> >
> > 1) inject URLs from a database
> > 2) add a RegexFilter for each URL such that only pages
> > under each URL's TLD are indexed
> >
> > For the first, looking at the code, I suppose a way is
> > to subclass/customize WebDBInjector and add a method to
> > read URLs from the DB and call addFile() on each URL. So
> > that's ok. Is there a better way? I wish WebDBInjector
> > could be refactored into something a little more
> > extensible in terms of specifying different datasources,
> > like DmozURLSource and FileURLSource.
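The pluggable-datasource idea in the paragraph above could look roughly like the sketch below. Everything here is hypothetical: UrlSource, LineUrlSource, JdbcUrlSource and SeedCollector are illustrative names, not Nutch classes, and SeedCollector only stands in for whatever the injector would do with each URL. The JDBC variant assumes a suitable driver is on the classpath and that the query returns the URL in its first column.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over "where seed URLs come from"; not a Nutch interface. */
interface UrlSource {
    List<String> urls();
}

/** Reads one URL per line from any Reader, skipping blank lines and '#' comments. */
class LineUrlSource implements UrlSource {
    private final Reader in;

    LineUrlSource(Reader in) { this.in = in; }

    @Override
    public List<String> urls() {
        List<String> out = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(in)) {
            String line;
            while ((line = r.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) out.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}

/** Reads URLs from the first column of a query result (assumed schema). */
class JdbcUrlSource implements UrlSource {
    private final String jdbcUrl;
    private final String query;

    JdbcUrlSource(String jdbcUrl, String query) {
        this.jdbcUrl = jdbcUrl;
        this.query = query;
    }

    @Override
    public List<String> urls() {
        List<String> out = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(jdbcUrl);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(query)) {
            while (rs.next()) out.add(rs.getString(1));
        } catch (SQLException e) {
            throw new IllegalStateException("could not read seed URLs", e);
        }
        return out;
    }
}

/** Stand-in for the injector: it accepts URLs from any source. */
class SeedCollector {
    private final List<String> seeds = new ArrayList<>();

    void inject(UrlSource source) { seeds.addAll(source.urls()); }

    List<String> seeds() { return seeds; }
}
```

With this shape the injector never needs to know whether the URLs came from a file, a DMOZ dump or a database table; each of those would just be another UrlSource.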
>
> Good timing, I've had the same basic question wrt database
> content (use case: a dynamic, database-fed web site), NNTP,
> and IMAP.
>
> I found the file plugin a bit bogus, e.g.
> net.nutch.protocol.file.FileResponse.getDirAsHttpResponse()
> forms a web page on the fly representing a directory view,
> apparently so the crawler can parse a web page to find more
> URLs to crawl. More natural, I think, is if you can just
> directly drive a crawler and toss URLs at it.
>
> Question: Is there a way to directly drive the crawler
> with URLs (to be queued and then crawled)?
>
> Last weekend I started on an NNTP plugin using the file
> plugin as a template. Things came together quickly enough,
> but as I used the jakarta-commons-net package for NNTP
> and it apparently doesn't provide a "stream protocol
> handler" for NNTP, any code that tries to new a URL with
> "nntp://" will get an exception (and I think the URL
> filtering does this).
>
> Question: Does it make sense that Nutch depends on URLs,
> so that any scheme not supported by the JVM (the JVM
> supports http/https/file/ftp) will need a protocol handler?
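For reference, the exception described above can be reproduced with plain java.net classes, and there are at least two ways around it. This is only a sketch: NntpUrlDemo and PARSE_ONLY are made-up names, the host name is a placeholder, and the handler deliberately refuses to open connections, on the assumption that the actual fetching would be done by the protocol plugin rather than by java.net.URL.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class NntpUrlDemo {
    /** Lets "nntp:" URLs be constructed and parsed; it never opens a connection. */
    static final URLStreamHandler PARSE_ONLY = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("fetching is left to the protocol plugin: " + u);
        }
    };

    /** True if new URL(spec) succeeds, i.e. the JVM has a handler for the scheme. */
    static boolean plainUrlAccepts(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /** Option 1: pass an explicit handler so the URL can be built anyway. */
    static URL withHandler(String spec) {
        try {
            return new URL(null, spec, PARSE_ONLY);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(spec, e);
        }
    }

    /** Option 2: use java.net.URI, which parses any scheme without a handler. */
    static String schemeOf(String spec) {
        return URI.create(spec).getScheme();
    }
}
```

Here plainUrlAccepts("nntp://news.example.org/comp.lang.java") fails with "unknown protocol" on a stock JVM, while both withHandler() and schemeOf() handle the same string. Which of the two fits Nutch better depends on how much of its code takes java.net.URL rather than a string.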
>
> >
> > For the second, using RegexURLFilter to index a million
> > URLs at once quickly becomes untenable since all filters
> > are stored in-memory and every filter has to be matched
> > for every URL. An idea is to index the URLs one at a time,
> > adding a TLD regex rule for the currently indexed URL, and
> > deleting the rule before the next URL starts. So basically
> > modifying the set of rules whilst indexing. Any ideas on a
> > smarter way to do this?
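One possible alternative to a million regex rules, offered only as a sketch: if each rule really just means "stay on the seed's own site", the rules collapse into a set of host names, and each candidate URL needs one hash lookup instead of a pass over every pattern. This reads "TLD" above as the seed's host name, which is an interpretation. SeedHostFilter is a hypothetical class and is not wired into Nutch's filter plugin mechanism.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Accepts a URL only if its host matches the host of some seed URL. */
class SeedHostFilter {
    private final Set<String> hosts = new HashSet<>();

    /** Registers the host of a seed URL; unparseable seeds are ignored. */
    void addSeed(String seedUrl) {
        String host = hostOf(seedUrl);
        if (host != null) hosts.add(host);
    }

    /** One hash lookup per URL, independent of the number of seeds. */
    boolean accepts(String url) {
        String host = hostOf(url);
        return host != null && hosts.contains(host);
    }

    /** Lower-cased host of the URL, or null if it cannot be parsed. */
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Memory is then one string per distinct host, and nothing has to be added or removed while the crawl is running. Matching on a registered domain instead of the exact host would need extra logic that is not shown here.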
>
> I think this matches my goals too - I want to be able to
> drive the crawler by giving it URLs, and I know it should
> crawl the URLs, otherwise I wouldn't be passing the URLs
> on in the first place.
>
> thx,
>   Dave
>
>
> >
> > Thanks,
> > k
> >
> >
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products
> > from real users. Discover which products truly live up
> > to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Nutch-developers mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
>


