The segment is where all the content is stored.  It contains the raw
HTML of the pages Nutch has crawled and the parsed content (the text
without HTML tags) used by Lucene.  It can contain more or less data
depending on which plug-ins you choose to run.  Try this out on a small
segment: nutch readseg -dump <segment_dir> <output>.  It will write
the segment out as text so you can browse through it yourself and
see what's in there.
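
If the dump is too large to read by eye, a short script can summarize
it.  Here is a minimal Python sketch.  It assumes the dump text starts
each record with "Recno:: N" and "URL:: ..." lines and introduces each
part with a header line such as "Content::" or "ParseText::"; check
that against your own output first.  summarize_dump is a made-up
helper name, not part of Nutch.

```python
import re
from collections import Counter

# ASSUMPTION: layout of the text written by "readseg -dump".
# Each record is expected to carry a "URL:: <url>" line, followed by
# part headers on their own lines. Verify against a real dump.
URL_LINE = re.compile(r"^URL:: (.+)$")
PART_HEADER = re.compile(r"^(CrawlDatum|Content|ParseData|ParseText)::\s*$")


def summarize_dump(lines):
    """Return (set of URLs, Counter of part headers) found in the dump lines."""
    urls = set()
    parts = Counter()
    for raw in lines:
        line = raw.rstrip("\r\n")
        url_match = URL_LINE.match(line)
        if url_match:
            urls.add(url_match.group(1).strip())
            continue
        part_match = PART_HEADER.match(line)
        if part_match:
            parts[part_match.group(1)] += 1
    return urls, parts


# Tiny hand-written sample in the assumed layout, not real crawl data.
SAMPLE = """
Recno:: 0
URL:: http://example.com/

Content::
<html><body>Hello</body></html>

ParseText::
Hello
"""

if __name__ == "__main__":
    found_urls, found_parts = summarize_dump(SAMPLE.splitlines())
    print(sorted(found_urls))
    print(dict(found_parts))
```

To run it on a real dump, pass an open file instead of the sample,
e.g. summarize_dump(open(path)), where path is wherever readseg wrote
its text output.  The part counts give a quick view of what the
segment actually holds (raw content, parse text, and so on).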

Basically, the segment is where data is stored and manipulated before
Lucene gets involved.  A segment does not have to be indexed to
be useful.  It all depends on what you're trying to accomplish. :)

On 10/11/07, Ravish Bhagdev <[EMAIL PROTECTED]> wrote:
> Ah, I see, didn't know that, Thanks!
>
> Interesting that Nutch stores it in a different structure (segments)
> and doesn't reuse Lucene's strategy of storing within the index.  Any
> particular reason why?  Is there any other use of the "Segments" data
> structure except to return snippets?
>
> Cheers,
> Ravish
>
> On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:
> > Hi Ravish.
> >
> > You are correct that Nutch does not store document content in the
> > Lucene index. The content *is* stored in the Nutch segment, which is
> > where snippets come from.
> >
> > Hope this helps.
> >
> > -J
> >
> >
> > On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
> >
> > > Hey All,
> > >
> > > Am I right in believing that in Lucene/Nutch, to be able to return
> > > content or a snippet for a search query, the field to be returned has
> > > to be stored?
> > >
> > > AFAIK, by default, Nutch does not store the document field, am I
> > > right?  If so, how does it manage to return snippets?  Wouldn't the
> > > index be quite huge if Nutch were storing the document field by default?
> > >
> > > I will appreciate any help/comments as I'm a bit lost with this.
> > >
> > > Ravi
> >
> >
>
