Re: Per-document Payloads (was: Re: lucene indexing and merge process)

John Wang Fri, 19 Oct 2007 20:29:04 -0700

Hi Michael:
     Thanks for the info.

     I haven't played with payloads yet. Can you give me an example, or point
me to how they are used to solve this problem?


Thanks

-John

On 10/19/07, Michael Busch <[EMAIL PROTECTED]> wrote:
>
> John Wang wrote:
> >
> >      I can try to get some numbers for loading an int[] array vs
> > FieldCache.getInts().
>
> I've had a similar performance problem when I used the FieldCache. The
> loading performance is apparently so slow because each value is stored
> as a term in the dictionary. For loading the cache it is necessary to
> iterate over all terms for the field in the dictionary. And for each
> term its posting list is opened to check which documents have that value.
>
> If you store unique docIds, then no two documents share the same
> value. That means that each value gets its own entry in the dictionary,
> and to load each value it is necessary to perform two random I/O seeks
> (one for the term lookup + one to open the posting list).
>
> In my app, filling the cache like that took several minutes for a big
> index.
>
> To speed things up I did essentially what Ning suggested. Now I store
> the values as payloads in the posting list of an artificial term. To
> fill my cache it's only necessary to perform a couple of I/O seeks for
> opening the posting list of the specific term, then it is just a
> sequential scan to load all values. With this approach the time for
> filling the cache went down from minutes to seconds!
>
> Now this approach is already much better than the current field cache
> implementation, but it still can be improved. In fact, we already have a
> mechanism for doing that: the norms. Norms are stored with a fixed size,
> which means both random access and sequential scan are optimal. Norms
> are also cached in memory, and filling that cache is much faster
> compared to the current FieldCache approach.
>
> I was therefore thinking about adding per-document payloads to Lucene
> (we can also call it document-metadata). The API could look like this:
>
> Document d = new Document();
> byte[] uidValue = ...
> d.addMetadata("uid", uidValue);
>
> And on the retrieval side all values could either be loaded into the
> field cache, or, if the index is too big, a new API can be used:
>
> IndexReader reader = IndexReader.open(...);
> DocumentMetadataIterator it = reader.metadataIterator("uid");
>
> where DocumentMetadataIterator is an interface similar to TermDocs:
>
> interface DocumentMetadataIterator {
>   void seek(String name);
>   boolean next();
>   boolean skipTo(int doc);
>
>   int doc();
>   byte[] getMetadata();
> }
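
To make the retrieval side concrete, here is a minimal sketch of how a cache could be filled from the proposed iterator. The interface is copied from the proposal above and does not exist in Lucene; ArrayMetadataIterator and UidCacheLoader are hypothetical names, and the sketch assumes each value is a 4-byte big-endian int.

```java
import java.nio.ByteBuffer;

// Interface as proposed in the message above; not an existing Lucene API.
interface DocumentMetadataIterator {
    void seek(String name);
    boolean next();
    boolean skipTo(int doc);

    int doc();
    byte[] getMetadata();
}

// Hypothetical in-memory stand-in so the sketch runs without an index.
// docs must be sorted ascending; values[i] belongs to docs[i].
final class ArrayMetadataIterator implements DocumentMetadataIterator {
    private final int[] docs;
    private final byte[][] values;
    private int pos = -1;

    ArrayMetadataIterator(int[] docs, byte[][] values) {
        this.docs = docs;
        this.values = values;
    }

    // Only one field is modelled here, so seek() just rewinds.
    public void seek(String name) { pos = -1; }

    public boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    // Advances at least once, then until doc() >= target.
    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (docs[pos] < target);
        return true;
    }

    public int doc() { return docs[pos]; }
    public byte[] getMetadata() { return values[pos]; }
}

final class UidCacheLoader {
    // One sequential pass over the iterator fills an array indexed by docid.
    // Documents without a value keep the default 0 (an assumption of this sketch).
    static int[] load(DocumentMetadataIterator it, int maxDoc) {
        int[] uids = new int[maxDoc];
        it.seek("uid");
        while (it.next()) {
            uids[it.doc()] = ByteBuffer.wrap(it.getMetadata()).getInt();
        }
        return uids;
    }
}
```

The loader never seeks per document; it only calls next(), which is the access pattern that makes the sequential scan cheap.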
>
> The next question would be how to store the per-doc payloads (PDP). If
> all values have the same length (as the unique docIds), then we should
> store them as efficiently as possible, like the norms. However, we still
> want to offer the flexibility of having variable-length values. For this
> case we could use a new data structure similar to our posting list.
>
> PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
> FixedLengthPDPList    --> <Payload>^SegSize
> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> Payload               --> Byte^PayloadLength
> PayloadLength         --> VInt
> SkipList              --> see frq.file
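
As an illustration of the two layouts above, here is a rough sketch of how they could be serialized. PdpEncodingSketch and its method names are hypothetical; the VInt encoding follows the one used elsewhere in Lucene's file formats. As a simplification, the sketch always writes PayloadLength (the grammar marks it optional) and leaves out the SkipList.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class PdpEncodingSketch {
    // VInt: 7 data bits per byte, low-order bits first, high bit set
    // when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // FixedLengthPDPList: one payload per document, back to back.
    // No docids or lengths are stored; the position implies the docid.
    static byte[] encodeFixed(byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : payloads) {
            out.write(p, 0, p.length);
        }
        return out.toByteArray();
    }

    // Random access into a fixed-length list: the value of document `doc`
    // starts at offset doc * size.
    static byte[] readFixed(byte[] list, int size, int doc) {
        return Arrays.copyOfRange(list, doc * size, (doc + 1) * size);
    }

    // VariableLengthPDPList: <DocDelta, PayloadLength, Payload> per entry.
    // docs must be sorted ascending; the first delta is relative to 0.
    static byte[] encodeVariable(int[] docs, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            writeVInt(out, docs[i] - lastDoc);
            lastDoc = docs[i];
            writeVInt(out, payloads[i].length);
            out.write(payloads[i], 0, payloads[i].length);
        }
        return out.toByteArray();
    }
}
```

The fixed layout gives norms-like random access at the cost of requiring a value for every document, while the variable layout can only be read sequentially (or via a skip list).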
>
> Because we don't have global field semantics, Lucene should automatically
> pick the "right" data structure. This could work like this: When the
> DocumentsWriter writes a segment it checks whether all values of a PDP
> have the same length. If yes, it stores them as a FixedLengthPDPList; if
> not, as a VariableLengthPDPList.
> When the SegmentMerger merges two or more segments it checks if all
> segments have a FixedLengthPDPList with the same length for a PDP. If
> not, it writes a VariableLengthPDPList to the new segment.
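
The selection rule described above could be sketched roughly as follows. PdpLayoutChooser, VARIABLE, and both method names are hypothetical, and the sketch assumes every document in a segment has a value for the PDP; how to treat missing values is not settled in the proposal.

```java
final class PdpLayoutChooser {
    // Marker for "use a VariableLengthPDPList" (hypothetical constant).
    static final int VARIABLE = -1;

    // At flush time: returns the common payload length if all values in
    // the segment have the same length, otherwise VARIABLE.
    static int chooseForFlush(byte[][] payloads) {
        if (payloads.length == 0) {
            return VARIABLE; // assumption: nothing to store fixed-length
        }
        int length = payloads[0].length;
        for (byte[] p : payloads) {
            if (p.length != length) {
                return VARIABLE;
            }
        }
        return length;
    }

    // At merge time: each input is the layout chosen for one source
    // segment (a fixed length, or VARIABLE). The merged segment stays
    // fixed-length only if every source segment is fixed with the same length.
    static int chooseForMerge(int[] segmentLayouts) {
        if (segmentLayouts.length == 0) {
            return VARIABLE;
        }
        int length = segmentLayouts[0];
        for (int layout : segmentLayouts) {
            if (layout == VARIABLE || layout != length) {
                return VARIABLE;
            }
        }
        return length;
    }
}
```

Note that under this rule a merge can only move a PDP from fixed to variable, never back, unless the merger re-inspects the individual values.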
>
> I think this would be a nice new feature for Lucene. We could then have
> user-defined and Lucene-specific PDPs. For example, norms would be in
> the latter category (this way we would get rid of the special code for
> norms, as they could be handled as PDPs). It would also be easy to add
> new features in the future, like splitting the norms into two values: a
> norm and a boost value.
>
> OK, lots of thoughts, I'm sure I'll get lots of comments too ... ;)
>
> - Michael
>
> >
> > Thanks
> >
> > -John
> >
> > On 10/19/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> >>
> >> It seems like there are (at least) two angles here for getting better
> >> performance from FieldCache:
> >>
> >>   1) Be incremental: with reopen() we should only have to update a
> >>      subset of the array in the FieldCache, according to the changed
> >>      segments.  This is what Hoss is working on and Mark was referring
> >>      to and I think it's very important!
> >>
> >>   2) Parsing is slow (?): I'm guessing one of the reasons that John
> >>      added the _X.udt file was because it's much faster to load an
> >>      array of already-parsed ints than to ask FieldCache to populate
> >>      itself.
> >>
> >> Even if we do #1, I think #2 could be a big win (in addition)?  John
> >> do you have any numbers of how much faster it is to load the array of
> >> ints from the _X.udt file vs having FieldCache populate itself?
> >>
> >> Also on the original question of "can we open up SegmentReader,
> >> FieldsWriter, etc.", I think that's a good idea?  At least we can make
> >> things protected instead of private/final?
> >>
> >> Mike
> >>
> >> "Ning Li" <[EMAIL PROTECTED]> wrote:
> >>> I see what you mean by 2) now. What Mark said should work for you when
> >>> it's done.
> >>>
> >>> Cheers,
> >>> Ning
> >>>
> >>> On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> >>>> Hi Ning:
> >>>>     That is essentially what field cache does. Doing this for each docid
> >>>> in the result set will be slow if the result set is large. But loading it
> >>>> in memory when opening the index can also be slow if the index is large
> >>>> and updates often.
> >>>>
> >>>> Thanks
> >>>>
> >>>> -John
> >>>>
> >>>> On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
> >>>>> Make all documents have a term, say "ID:UID", and for each document,
> >>>>> store its UID in the term's payload. You can read off this posting
> >>>>> list to create your array. Will this work for you, John?
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>>
> >>>>> On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >>>>>> Forwarding this to java-dev per request.  Seems like the best place
> >>>>>> to discuss this topic.
> >>>>>>
> >>>>>>         Erik
> >>>>>>
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>>> From: "John Wang" <[EMAIL PROTECTED]>
> >>>>>>> Date: October 17, 2007 5:43:29 PM EDT
> >>>>>>> To: [EMAIL PROTECTED]
> >>>>>>> Subject: lucene indexing and merge process
> >>>>>>>
> >>>>>>> Hi Erik:
> >>>>>>>
> >>>>>>>     We are revamping our search system here at LinkedIn, and we are
> >>>>>>> using Lucene.
> >>>>>>>
> >>>>>>>     One issue we ran across is that we store a UID in Lucene, which
> >>>>>>> we map to the DB storage. So given a docid, to look up its UID, we
> >>>>>>> have the following solutions:
> >>>>>>>
> >>>>>>> 1) Index it as a Stored field and get it from reader.document (very
> >>>>>>> slow if recall is large)
> >>>>>>> 2) Load/warm up the FieldCache (for a large corpus, loading up the
> >>>>>>> IndexReader can be slow)
> >>>>>>> 3) Construct it using the FieldCache and persist it on disk
> >>>>>>> every time the index changes (not suitable for real-time indexing,
> >>>>>>> e.g. this process will degrade as the # of documents gets large)
> >>>>>>>
> >>>>>>>     None of the above solutions turned out to be adequate for our
> >>>>>>> requirements.
> >>>>>>>
> >>>>>>>      What we ended up doing is modifying the Lucene code, changing the
> >>>>>>> SegmentReader, DocumentWriter, and FieldWriter classes to take
> >>>>>>> advantage of the Lucene segment/merge process. E.g.:
> >>>>>>>
> >>>>>>>      For each segment, we store a .udt file, which is an int[]
> >>>>>>> array (by changing the FieldWriter class).
> >>>>>>>
> >>>>>>>      And SegmentReader will load the .udt file into an array.
> >>>>>>>
> >>>>>>>      And merge happens seamlessly.
> >>>>>>>
> >>>>>>>      Because of the tight encapsulation around these classes, e.g.
> >>>>>>> private and final methods, it is very difficult to extend Lucene
> >>>>>>> while avoiding branching into our own version. Is there a way we can
> >>>>>>> open up and make these classes extensible? We'd be happy to
> >>>>>>> contribute what we have done.
> >>>>>>>
> >>>>>>>      I guess to tackle the problem from a different angle: is there
> >>>>>>> a way to incorporate FieldCache into the segments (it is strictly
> >>>>>>> in memory now), and build disk versions while indexing?
> >>>>>>>
> >>>>>>>
> >>>>>>>      Hope I am making sense.
> >>>>>>>
> >>>>>>>     I did not send this out to the mailing list because I wasn't
> >>>>>>> sure if this is a dev question or a user question. Feel free to
> >>>>>>> either forward it to the right mailing list or let me know and I
> >>>>>>> can forward it.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>> -John
> >>>>>>>
> >>>>>>
> >>>>>>
> >> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
> >
>
>
>
>
