Hi Mike: This is an excellent analysis.
To do 2), we tried computing the field cache at indexing time to avoid "parsing" at search time. But what we found was that this degrades indexing (because it computes the entire FieldCache, not per segment), which was not acceptable to our project either. I can try to get some numbers for loading an int[] array vs FieldCache.getInts().
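Roughly, what I plan to time is something like the sketch below (Lucene 2.x APIs; the "uid" field name and the path to a pre-built file of ints are just placeholders for our setup, not anything in Lucene itself):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class UidLoadTiming {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);   // index directory

    // Path A: let FieldCache parse the indexed "uid" terms into an int[].
    long t0 = System.currentTimeMillis();
    int[] fromCache = FieldCache.DEFAULT.getInts(reader, "uid");
    long t1 = System.currentTimeMillis();

    // Path B: read a pre-computed int[] straight off disk (the .udt idea),
    // one int per document, written in docid order at indexing time.
    DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(args[1])));
    int[] fromFile = new int[reader.maxDoc()];
    for (int i = 0; i < fromFile.length; i++) {
      fromFile[i] = in.readInt();
    }
    in.close();
    long t2 = System.currentTimeMillis();

    System.out.println("FieldCache.getInts(): " + (t1 - t0) + " ms, " + fromCache.length + " docs");
    System.out.println("raw int[] load:       " + (t2 - t1) + " ms, " + fromFile.length + " docs");
    reader.close();
  }
}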
Thanks

-John

On 10/19/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> It seems like there are (at least) two angles here for getting better
> performance from FieldCache:
>
> 1) Be incremental: with reopen() we should only have to update a
> subset of the array in the FieldCache, according to the changed
> segments. This is what Hoss is working on and Mark was referring
> to, and I think it's very important!
>
> 2) Parsing is slow (?): I'm guessing one of the reasons that John
> added the _X.udt file was because it's much faster to load an
> array of already-parsed ints than to ask FieldCache to populate
> itself.
>
> Even if we do #1, I think #2 could be a big win (in addition)? John,
> do you have any numbers on how much faster it is to load the array of
> ints from the _X.udt file vs having FieldCache populate itself?
>
> Also, on the original question of "can we open up SegmentReader,
> FieldsWriter, etc.", I think that's a good idea? At least we can make
> things protected instead of private/final?
>
> Mike
>
> "Ning Li" <[EMAIL PROTECTED]> wrote:
> > I see what you mean by 2) now. What Mark said should work for you when
> > it's done.
> >
> > Cheers,
> > Ning
> >
> > On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> > > Hi Ning:
> > > That is essentially what FieldCache does. Doing this for each docid in
> > > the result set will be slow if the result set is large. But loading it in
> > > memory when opening the index can also be slow if the index is large and
> > > updates often.
> > >
> > > Thanks
> > >
> > > -John
> > >
> > > On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Make all documents have a term, say "ID:UID", and for each document,
> > > > store its UID in the term's payload. You can read off this posting
> > > > list to create your array. Will this work for you, John?
> > > >
> > > > Cheers,
> > > > Ning
> > > >
> > > > On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> > > > > Forwarding this to java-dev per request. Seems like the best place
> > > > > to discuss this topic.
> > > > >
> > > > > Erik
> > > > >
> > > > > Begin forwarded message:
> > > > >
> > > > > > From: "John Wang" <[EMAIL PROTECTED]>
> > > > > > Date: October 17, 2007 5:43:29 PM EDT
> > > > > > To: [EMAIL PROTECTED]
> > > > > > Subject: lucene indexing and merge process
> > > > > >
> > > > > > Hi Erik:
> > > > > >
> > > > > > We are revamping our search system here at LinkedIn, and we are
> > > > > > using Lucene.
> > > > > >
> > > > > > One issue we ran across is that we store a UID in Lucene which
> > > > > > we map to the DB storage. So given a docid, to look up its UID, we
> > > > > > have the following solutions:
> > > > > >
> > > > > > 1) Index it as a stored field and get it from reader.document()
> > > > > > (very slow if recall is large)
> > > > > > 2) Load/warm up the FieldCache (for a large corpus, loading up the
> > > > > > IndexReader can be slow)
> > > > > > 3) Construct it using the FieldCache and persist it on disk
> > > > > > every time the index changes (not suitable for real-time indexing,
> > > > > > e.g. this process will degrade as the # of documents gets large)
> > > > > >
> > > > > > None of the above solutions turned out to be adequate for our
> > > > > > requirements.
> > > > > >
> > > > > > What we ended up doing is modifying the Lucene code, changing the
> > > > > > SegmentReader, DocumentWriter, and FieldWriter classes to take
> > > > > > advantage of the Lucene segment/merge process. E.g.:
> > > > > >
> > > > > > For each segment, we store a .udt file, which is an int[]
> > > > > > array (by changing the FieldWriter class).
> > > > > >
> > > > > > And SegmentReader will load the .udt file into an array.
> > > > > >
> > > > > > And merging happens seamlessly.
> > > > > >
> > > > > > Because of the tight encapsulation around these classes, e.g.
> > > > > > private and final methods, it is very difficult to extend Lucene
> > > > > > while avoiding branching into our own version. Is there a way we can
> > > > > > open up and make these classes extensible? We'd be happy to
> > > > > > contribute what we have done.
> > > > > >
> > > > > > I guess to tackle the problem from a different angle: is there
> > > > > > a way to incorporate FieldCache into the segments (it is strictly
> > > > > > in memory now), and build disk versions while indexing?
> > > > > >
> > > > > > Hope I am making sense.
> > > > > >
> > > > > > I did not send this out to the mailing list because I wasn't
> > > > > > sure if this is a dev question or a user question; feel free to
> > > > > > either forward it to the right mailing list or let me know and I
> > > > > > can forward it.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > -John
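P.S. For completeness, the reading side of Ning's payload idea would look roughly like this with the 2.2 payload APIs (a sketch only: it assumes every document was indexed with a single ID:UID token whose payload holds the UID as four big-endian bytes; the field and term names just follow Ning's example):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PayloadUidLoader {
  // Builds a docid -> UID array by walking the ID:UID posting list.
  public static int[] loadUids(IndexReader reader) throws Exception {
    int[] uids = new int[reader.maxDoc()];
    TermPositions tp = reader.termPositions(new Term("ID", "UID"));
    byte[] buf = new byte[4];
    try {
      while (tp.next()) {
        tp.nextPosition();                    // payloads are per position
        if (tp.isPayloadAvailable()) {
          byte[] b = tp.getPayload(buf, 0);   // may return a new array
          uids[tp.doc()] = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                         | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        }
      }
    } finally {
      tp.close();
    }
    return uids;
  }
}

This still pays the cost of walking a posting list every time a reader is opened, which is why the per-segment .udt file, combined with the incremental reopen() work in 1), still looks attractive to us.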