Re: lucene indexing and merge process

Michael McCandless Fri, 19 Oct 2007 04:18:28 -0700

It seems like there are (at least) two angles here for getting better
performance from FieldCache:


  1) Be incremental: with reopen() we should only have to update a
     subset of the array in the FieldCache, according to the changed
     segments.  This is what Hoss is working on and Mark was referring
     to and I think it's very important!

  2) Parsing is slow (?): I'm guessing one of the reasons that John
     added the _X.udt file was because it's much faster to load an
     array of already-parsed ints than to ask FieldCache to populate
     itself.
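
A rough sketch of angle 1, to make the idea concrete. This is plain Java,
not Lucene code: PerSegmentUidCache, SegmentLoader and loadCount are
hypothetical names, and the loader stands in for whatever expensive
per-segment work fills the array today. It assumes a segment's values do
not change once the segment is written (deletions are ignored here), so an
array loaded for a given segment name can be reused across reopens.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names are Lucene classes.
class PerSegmentUidCache {
    /** Stand-in for the expensive per-segment load. */
    interface SegmentLoader {
        int[] load(String segmentName);
    }

    private final SegmentLoader loader;
    private Map<String, int[]> bySegment = new HashMap<>();
    private int loads = 0;

    PerSegmentUidCache(SegmentLoader loader) {
        this.loader = loader;
    }

    /** Builds the full array for the given segments, loading only unseen ones. */
    int[] reopen(List<String> segmentNames) {
        Map<String, int[]> next = new HashMap<>();
        int total = 0;
        for (String name : segmentNames) {
            int[] values = bySegment.get(name);
            if (values == null) {          // new segment: pay the load cost once
                values = loader.load(name);
                loads++;
            }
            next.put(name, values);
            total += values.length;
        }
        bySegment = next;                  // forget segments that were merged away

        // Concatenate per-segment arrays in segment order.
        int[] all = new int[total];
        int offset = 0;
        for (String name : segmentNames) {
            int[] values = next.get(name);
            System.arraycopy(values, 0, all, offset, values.length);
            offset += values.length;
        }
        return all;
    }

    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        // Fake loader: every segment "contains" two values derived from its name.
        PerSegmentUidCache cache = new PerSegmentUidCache(
            name -> new int[] { name.hashCode(), name.hashCode() + 1 });
        cache.reopen(Arrays.asList("_0", "_1"));
        cache.reopen(Arrays.asList("_0", "_1", "_2")); // only _2 is loaded here
        System.out.println("segment loads: " + cache.loadCount()); // prints: segment loads: 3
    }
}
```

With this shape, a reopen that adds one new segment only pays for loading
that segment, and entries for segments that were merged away are dropped.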

Even if we do #1, I think #2 could be a big win (in addition)?  John,
do you have any numbers on how much faster it is to load the array of
ints from the _X.udt file vs. having FieldCache populate itself?
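
To illustrate what I mean by #2, here is a small plain-Java sketch of the
two ways of getting the same int[]. It is not the _X.udt format (I have
not seen it) and not FieldCache's code: PreParsedInts, encode, decode and
parse are made-up names, the byte[] stands in for file contents, and the
big-endian 4-bytes-per-int layout is simply an assumption. It shows the
difference in work per value and is not a benchmark.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: names and byte layout are assumptions, not Lucene's.
class PreParsedInts {
    /** Writes the ints as fixed-width 4-byte values (big-endian). */
    static byte[] encode(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        buf.asIntBuffer().put(values);
        return buf.array();
    }

    /** Reads the ints back with a bulk copy; no per-value text parsing. */
    static int[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] values = new int[buf.remaining() / Integer.BYTES];
        buf.asIntBuffer().get(values);
        return values;
    }

    /** The alternative: parse one string term per document. */
    static int[] parse(String[] terms) {
        int[] values = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            values[i] = Integer.parseInt(terms[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] uids = {7, -1, 123456789};
        int[] loaded = decode(encode(uids));
        int[] parsed = parse(new String[] {"7", "-1", "123456789"});
        System.out.println(Arrays.equals(loaded, parsed)); // prints: true
    }
}
```

Both paths end with the same array; the first is a bulk copy of
fixed-width values, the second does a string-to-int conversion per
document.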

Also on the original question of "can we open up SegmentReader,
FieldsWriter, etc.", I think that's a good idea?  At least we can make
things protected instead of private/final?

Mike

"Ning Li" <[EMAIL PROTECTED]> wrote:
> I see what you mean by 2) now. What Mark said should work for you when
> it's done.
> 
> Cheers,
> Ning
> 
> On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> > Hi Ning:
> >     That is essentially what FieldCache does. Doing this for each docid in
> > the result set will be slow if the result set is large. But loading it into
> > memory when opening the index can also be slow if the index is large and is
> > updated often.
> >
> > Thanks
> >
> > -John
> >
> > On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
> > >
> > > Make all documents have a term, say "ID:UID", and for each document,
> > > store its UID in the term's payload. You can read off this posting
> > > list to create your array. Will this work for you, John?
> > >
> > > Cheers,
> > > Ning
> > >
> > >
> > > On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> > > > Forwarding this to java-dev per request.  Seems like the best place
> > > > to discuss this topic.
> > > >
> > > >         Erik
> > > >
> > > >
> > > > Begin forwarded message:
> > > >
> > > > > From: "John Wang" <[EMAIL PROTECTED]>
> > > > > Date: October 17, 2007 5:43:29 PM EDT
> > > > > To: [EMAIL PROTECTED]
> > > > > Subject: lucene indexing and merge process
> > > > >
> > > > > Hi Erik:
> > > > >
> > > > > >     We are revamping our search system here at LinkedIn, and we are
> > > > > > using Lucene.
> > > > >
> > > > > >     One issue we ran across is that we store a UID in Lucene which
> > > > > > we map to the DB storage. So given a docid, to look up its UID, we
> > > > > > have the following solutions:
> > > > >
> > > > > 1) Index it as a Stored field and get it from reader.document (very
> > > > > slow if recall is large)
> > > > > 2) Load/Warmup the FieldCache (for large corpus, loading up the
> > > > > indexreader can be slow)
> > > > > > 3) Construct it using the FieldCache and persist it on disk
> > > > > > every time the index changes (not suitable for real-time indexing,
> > > > > > e.g. this process will degrade as the number of documents gets large)
> > > > >
> > > > > >     None of the above solutions turned out to be adequate for our
> > > > > > requirements.
> > > > >
> > > > > >      What we ended up doing was modifying the Lucene code, changing the
> > > > > > SegmentReader, DocumentWriter, and FieldWriter classes to take
> > > > > > advantage of the Lucene segment/merge process. E.g.:
> > > > >
> > > > > >      For each segment, we store a .udt file, which holds an int[]
> > > > > > array (by changing the FieldWriter class).
> > > > >
> > > > >      And SegmentReader will load the .udt file into an array.
> > > > >
> > > > > >      And merge happens seamlessly.
> > > > >
> > > > > >      Because of the tight encapsulation around these classes, e.g.
> > > > > > private and final methods, it is very difficult to extend Lucene
> > > > > > without branching into our own version. Is there a way we can
> > > > > > open up and make these classes extensible? We'd be happy to
> > > > > > contribute what we have done.
> > > > >
> > > > > >      I guess to tackle the problem from a different angle: is there
> > > > > > a way to incorporate FieldCache into the segments (it is strictly
> > > > > > in memory now), and build disk versions while indexing?
> > > > >
> > > > >
> > > > >      Hope I am making sense.
> > > > >
> > > > > >     I did not send this out to the mailing list because I wasn't
> > > > > > sure if this is a dev question or a user question. Feel free to
> > > > > > either forward it to the right mailing list or let me know and I
> > > > > > can forward it.
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > -John
> > > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> 
> 

