The segment is where all the content is stored. It contains the raw HTML of every page Nutch has crawled, plus the parsed content (the text with HTML tags stripped) used by Lucene. How much data it holds depends on which plug-ins you choose to run. Try this out on a small segment:

  nutch readseg -dump <segment_dir> <output>

It dumps the segment as plain text so you can browse through it yourself and see what's in there.
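
To make the split between segment and index concrete, here is a toy Python model. It is only a sketch of the idea: ToySegment, ToyIndex and snippet are made-up names, not Nutch or Lucene APIs, and the tag stripping is deliberately crude. The point it tries to show is that the index can answer "which pages match?" without storing any page text, while the snippet is cut from the parsed text kept in the segment.

```python
# Toy model only: all names here are hypothetical, not Nutch or Lucene APIs.
import re
from collections import defaultdict


class ToySegment:
    """Holds raw HTML and parsed text per URL, loosely like a crawl segment."""

    def __init__(self):
        self.raw = {}
        self.parsed = {}

    def add(self, url, html):
        self.raw[url] = html
        # Crude tag stripping stands in for a real HTML parser.
        text = re.sub(r"<[^>]+>", " ", html)
        self.parsed[url] = " ".join(text.split())


class ToyIndex:
    """Inverted index mapping terms to URLs; it stores no page text."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, url, text):
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(url)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def snippet(segment, url, term, width=20):
    """Cut a snippet from the segment's parsed text, not from the index."""
    text = segment.parsed[url]
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return text[start:end]


segment = ToySegment()
index = ToyIndex()
url = "http://example.com/"
segment.add(url, "<html><body><p>Nutch keeps page content in segments.</p></body></html>")
index.add(url, segment.parsed[url])

hits = index.search("segments")      # the index only returns matching URLs
print(hits)
print(snippet(segment, hits[0], "segments"))  # the text comes from the segment
```

In this sketch the index stays small because it holds only terms and URLs, which is the trade-off discussed in the thread below.
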
Basically, the segment is where data is stored and manipulated before Lucene gets involved. It does not necessarily have to be indexed to be useful. It all depends on what you're trying to accomplish. :)

On 10/11/07, Ravish Bhagdev <[EMAIL PROTECTED]> wrote:
> Ah, I see, didn't know that, Thanks!
>
> Interesting that nutch stores it in a different structure (segments)
> and doesn't reuse Lucene strategy of storing within index. Any
> particular reason why? Is there any other use of "Segments" data
> structure except to return snippets?
>
> Cheers,
> Ravish
>
> On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:
> > Hi Ravish.
> >
> > You are correct that Nutch does not store document content in the
> > Lucene index. The content *is* stored in the Nutch segment, which is
> > where snippets come from.
> >
> > Hope this helps.
> >
> > -J
> >
> >
> > On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
> >
> > > Hey All,
> > >
> > > Am I right in believing that in Lucene/Nutch, to be able to return
> > > content or snippet to a search query, the field to be returned has to
> > > be stored?
> > >
> > > AFAIK, by default, Nutch does not store the document field, am I
> > > right? If so, how does it manage to return snippets? Wouldn't the
> > > index be quite huge if nutch were storing document field by default?
> > >
> > > I will appreciate any help/comments as I'm bit lost with this.
> > >
> > > Ravi
