Re: How to get Text and Parse data for URL

Doug Cutting Tue, 25 Apr 2006 15:27:17 -0700

Dennis Kubes wrote:

I think that I am not fully understanding the role the segments directory and its contents play.

A segment is simply a set of urls fetched in the same round, and data associated with these urls. The content subdirectory contains the raw http content. The parse-text subdirectory contains the extracted text, used when indexing and when building snippets for hits. The index subdirectory holds a Lucene index of the pages in the segment. Etc. It is an independent chunk of Nutch data.
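
To make the layout concrete, here is a rough sketch (not Nutch code) that reports which of the subdirectories named above are present in a segment directory. The function name `describe_segment` and the `SEGMENT_SUBDIRS` table are hypothetical, and the table only lists the three subdirectories mentioned in this message; a real segment holds others.

```python
from pathlib import Path

# Subdirectory names and roles as described in the message above.
# This table is illustrative and incomplete: a real segment has more entries.
SEGMENT_SUBDIRS = {
    "content": "raw http content",
    "parse-text": "extracted text, used for indexing and hit snippets",
    "index": "Lucene index of the pages in the segment",
}


def describe_segment(segment_dir):
    """Return {name: role} for each known subdirectory that exists in segment_dir."""
    segment = Path(segment_dir)
    return {
        name: role
        for name, role in SEGMENT_SUBDIRS.items()
        if (segment / name).is_dir()
    }
```

Calling `describe_segment` on a segment path returns only the entries actually found on disk, so a segment that has been fetched but not yet indexed would simply lack the `index` key.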

In 0.8, each segment subdirectory is further split into parts, the result of distributed processing. The parts are split by the hash of the url.
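
As an illustration of splitting by hash, the sketch below assigns each url to one of a fixed number of parts. It is only a sketch of the idea: `zlib.crc32` is a stand-in chosen because it is stable across runs, not the hash function Nutch actually uses, and `part_for_url` and `split_into_parts` are hypothetical names.

```python
import zlib


def part_for_url(url, num_parts):
    """Pick a part number in [0, num_parts) from a hash of the url.

    crc32 is an assumption made for this sketch; it is not Nutch's hash.
    """
    return zlib.crc32(url.encode("utf-8")) % num_parts


def split_into_parts(urls, num_parts):
    """Group urls into num_parts lists, one list per part."""
    parts = [[] for _ in range(num_parts)]
    for url in urls:
        parts[part_for_url(url, num_parts)].append(url)
    return parts
```

The useful property is that a given url always lands in the same part for a given number of parts, so the data for one url can be found without scanning every part.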


Does that help?

Doug
