On Fri, Jul 23, 2004 at 09:50:35AM -0700, Otis Gospodnetic wrote:
> Hello,
> 
> I have a database (RDBMS) with URLs I need to periodically fetch in
> order to determine things like: page language, character set, HTTP
> status code, size, and eventually to index the content (although not in
> 1 big index, but a number of small ones).  I am not interested in using
> Nutch to build 1 big index of fetched pages.
> 
> I am wondering if I could make use of Nutch for this, or at least some
> of Nutch's functionality.  
> 
> I believe I could dump URLs from my RDBMS and create a WebDB using
> WebDBInjector (bin/nutch inject ...).
> 
> Next, I believe I could generate a fetch list containing all URLs in my
> WebDB, and have fetcher download them all.
> 
> Is the above correct?
> 
> I am not clear about what follows, and especially about the new
> plugins.  Where/when do downloaded pages get processed by the plugins,
> and where do plugins write their output?
> 
> I have a number of indices in my application (think lots of users, each
> with its own Lucene index -- see http://www.simpy.com/ ), so I need to
> do something like this:
> 
> 1. for each user in my RDBMS
> 2.   get all URLs from my RDBMS
> 3.   for each URL get its lang,size,etc... from Nutch (WebDB?/Fetcher
> output?/plugins output?)
> 4.     add this + the text of the fetched URL to user's index
> 5.     update some RDBMS columns
> 6.   end
> 7. end
> 
> The step that is unclear is 3.  Where do I get all that data I need
> (page size, HTTP status code, language, and text from the page)?

Hi, Otis,

Each Nutch segment contains these directories:
./fetchlist: URLs to fetch
./fetcher: FetcherOutput
./content: raw Content
./parse_data: meta info about each downloaded page, e.g., Content-Length, etc.
./parse_text: plain text parsed from the raw content.
I am not sure whether the HTTP status code and language are stored in
./parse_data yet, but they would be easy to add.
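
As a quick sanity check on a segment, the layout above can be verified with
plain Java before reading anything out of it. This is only a minimal sketch
that uses java.nio.file and no Nutch classes; the class name SegmentLayout and
the method missingDirs are made up for illustration, and the directory names
are just the ones listed above, not checked against the source tree.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: checks a segment directory
// for the subdirectories described in this message.
final class SegmentLayout {

    // Assumption: names taken from the list in this message.
    static final List<String> EXPECTED_DIRS = Arrays.asList(
        "fetchlist", "fetcher", "content", "parse_data", "parse_text");

    // Returns the expected subdirectories that do not exist under segmentDir.
    static List<String> missingDirs(Path segmentDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED_DIRS) {
            if (!Files.isDirectory(segmentDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The segment path is supplied by the caller; defaults to the current directory.
        Path segment = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingDirs(segment);
        if (missing.isEmpty()) {
            System.out.println("All expected directories are present in " + segment);
        } else {
            System.out.println("Missing in " + segment + ": " + missing);
        }
    }
}
```

Run it with the path to one segment as the only argument. A segment that has
been fetched but not yet parsed would presumably report parse_data and
parse_text as missing, which tells you whether step 3 can be served from it.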

Check classes under ./src/java/net/nutch/{fetcher,parse}.

I have a tool called DumpSegment.java. I plan to check it in tomorrow
after some cleanup. You can consult it to see which file holds what.

John


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers