Hello,

I have a database (RDBMS) with URLs that I need to periodically fetch in
order to determine things like page language, character set, HTTP status
code, and size, and eventually to index the content (not in one big
index, but in a number of small ones).  I am not interested in using
Nutch to build one big index of fetched pages.

I am wondering if I could make use of Nutch for this, or at least some
of Nutch's functionality.  

I believe I could dump URLs from my RDBMS and create a WebDB using
WebDBInjector (bin/nutch inject ...).

Next, I believe I could generate a fetch list containing all URLs in my
WebDB, and have the Fetcher download them all.
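Concretely, I imagine the two steps above looking roughly like the sketch below.  The `bin/nutch` command names follow the Nutch 0.x tutorial style and are an assumption on my part -- the exact subcommands and options may differ by version, so they are left as comments to check against `bin/nutch` usage output:

```shell
# Dump URLs from the RDBMS into a flat file, one URL per line
# (simulated here with a couple of echo lines; in practice this would
# be a SELECT piped out of the database).
echo "http://www.simpy.com/" > urls.txt
echo "http://lucene.apache.org/" >> urls.txt

# Create a WebDB and inject the URLs (WebDBInjector under the hood).
# Command names are assumed from the 0.x tutorial -- verify locally:
# bin/nutch admin db -create
# bin/nutch inject db -urlfile urls.txt

# Generate a fetchlist covering the WebDB, fetch it, and fold the
# results back into the WebDB:
# bin/nutch generate db segments
# s=`ls -d segments/2* | tail -1`
# bin/nutch fetch $s
# bin/nutch updatedb db $s
```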

Is the above correct?

I am not clear about what follows, and especially about the new
plugins.  Where/when do downloaded pages get processed by the plugins,
and where do plugins write their output?

I have a number of indices in my application (think lots of users, each
with their own Lucene index -- see http://www.simpy.com/ ), so I need to
do something like this:

1. for each user in my RDBMS
2.   get that user's URLs from my RDBMS
3.   for each URL, get its lang, size, etc. from Nutch (WebDB? Fetcher
     output? plugin output?)
4.     add this + the text of the fetched URL to the user's index
5.     update some RDBMS columns
6.   end
7. end
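In shell terms, the loop above would look something like this.  Everything here is a hypothetical stub: `get_page_info` and the user/URL lists stand in for the RDBMS queries and for whatever Nutch call actually produces the metadata -- where that data comes from is exactly the open question:

```shell
# Stub standing in for "ask Nutch about this URL"; returns fake values.
get_page_info() {
  echo "lang=en size=1024 status=200"
}

for user in alice bob; do                       # 1. for each user in the RDBMS
  for url in http://example.com/a http://example.com/b; do  # 2. that user's URLs
    info=$(get_page_info "$url")                # 3. lang, size, status from Nutch (?)
    echo "add [$info] + text of $url to $user's index"      # 4. per-user Lucene index
    # 5. update RDBMS columns for this URL here
  done                                          # 6. end
done                                            # 7. end
```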

The unclear step is 3.  Where do I get all the data I need (page size,
HTTP status code, language, and the text of the page)?

If I missed a relevant Wiki page, please point me to it.

Thanks,
Otis



_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers