Dennis Kubes wrote:
I think that I am not fully understanding the role the segments directory and its contents play.

A segment is simply a set of URLs fetched in the same round, plus the data associated with those URLs. The content subdirectory contains the raw HTTP content. The parse-text subdirectory contains the extracted text, used when indexing and when building snippets for hits. The index subdirectory holds a Lucene index of the pages in the segment. And so on. Each segment is an independent chunk of Nutch data.

In 0.8, each segment subdirectory is further split into parts, a result of distributed processing. Entries are assigned to parts by the hash of their URL.
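
To make the hash split concrete, here is a rough sketch of how a URL could be mapped to a part. It is not Nutch's actual code: it assumes a Hadoop-style "non-negative hash modulo number of parts" rule and part-NNNNN naming, and the function names and example URLs are made up for illustration.

```python
from collections import defaultdict


def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() as a signed 32-bit value.

    Matches Java for characters in the Basic Multilingual Plane,
    which covers ordinary ASCII URLs.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def part_for_url(url: str, num_parts: int) -> int:
    """Assumed rule: clear the sign bit, then take the hash modulo the part count."""
    return (java_string_hash(url) & 0x7FFFFFFF) % num_parts


def part_name(index: int) -> str:
    """Assumed naming: zero-padded part directories such as part-00000."""
    return f"part-{index:05d}"


def group_by_part(urls, num_parts):
    """Group URLs under the part each one hashes to."""
    parts = defaultdict(list)
    for url in urls:
        parts[part_name(part_for_url(url, num_parts))].append(url)
    return dict(parts)


if __name__ == "__main__":
    # Hypothetical URLs, only to show the grouping.
    example_urls = [
        "http://example.com/",
        "http://example.com/about.html",
        "http://example.org/index.html",
    ]
    for name, members in sorted(group_by_part(example_urls, 4).items()):
        print(name, members)
```

The point is only that the same URL always lands in the same part, so a lookup for a given URL needs to consult just one part of each subdirectory.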

Does that help?

Doug
