Hi Michael: I took your program and benchmarked it against my setup; here are some numbers comparing it to the other options:
Setup: 2M docs with only the id, indexed in various ways for each method;
randomly selected 50,000 of the docids and did a lookup for each.

Comparing 3 methods:

1) load int[] from the FieldCache; new Field(fldname, uid, Store.NO,
   Index.TOKENIZED); Lucene 2.2
2) using payloads (your impl); Lucene 2.2
3) using our custom-built Lucene, described in my earlier email; no fields
   added, just doc.setUserID(uid), since we are adding our own segment file
   (custom Lucene 2.0)

Since all three methods load the docids into an int[], the lookup time is the
same for all three; what differs are the load times:

1) 16.5 seconds, 43 MB
2) 590 milliseconds, 32.5 MB
3) 186 milliseconds, 26 MB

I think the payload method is good enough that we don't need to diverge from
the Lucene code base. However, I feel that being able to customize the
indexing process and store our own file is still more efficient, in both load
time and index size.

Thanks

-John

On 10/20/07, Michael Busch <[EMAIL PROTECTED]> wrote:
>
> John Wang wrote:
> > Hi Michael:
> > Thanks for the info.
> >
> > I haven't played with payloads. Can you give me an example or point me
> > to how it is used to solve this problem?
> >
>
> Hi John,
>
> I (quickly) put together a class that is able to store UIDs as payloads.
> I believe the type of your UIDs is Integer?
>
> To add the UID to a document use this method:
>
> /**
>  * Adds a unique docId (UID) to a document as a payload.
>  */
> public void addUID(Document doc, int uid);
>
> You can either load the UID from disk or use a cache:
>
> /** Returns the UID for the passed-in docId.
>  *  If you use this method in a HitCollector and your Query
>  *  contains OR-terms, then try to set
>  *  BooleanQuery.setAllowDocsOutOfOrder(false) to improve performance.
>  */
> public int getUID(IndexReader reader, int docId) throws IOException;
>
> /** Fills the passed-in array with UID values.
>  *  If the given array is null or too small, then a new array
>  *  is created with cache.length = reader.maxDoc()
>  */
> public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
>     throws IOException;
>
> I put a little test program in main and it seems to work fine. However,
> it's not thoroughly tested yet...
>
> You might want to try it without using the cache first. The performance
> might be good enough for your needs. If not, try using the cache; it
> should fill up much faster compared to the FieldCache.
>
> Another comment: if you're using Lucene 2.2, you need to replace the
> lines "tp.seek(UID_TERM);" (see comments in the code below). Or use the
> latest trunk version, which has a fix for this bug.
>
> Let me know please if this improves your performance! Have fun...
> - Michael
>
> And here is the code:
>
> import java.io.IOException;
>
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Payload;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermPositions;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public class PerDocPayloadReaderWriter {
>   public static final Term UID_TERM = new Term("_ID", "_UID");
>   private SinglePayloadTokenStream singlePayloadTokenStream =
>       new SinglePayloadTokenStream();
>   private TermPositions tp;
>   private byte[] payloadBuffer = new byte[4];
>
>   public static void main(String args[]) throws IOException {
>     PerDocPayloadReaderWriter pdp = new PerDocPayloadReaderWriter();
>     Directory dir = new RAMDirectory();
>     IndexWriter writer = new
>         IndexWriter(dir, new StandardAnalyzer());
>     for (int i = 0; i < 10; i++) {
>       Document d = new Document();
>       // create dummy doc
>       d.add(new Field("test", "This is a test.", Store.NO,
>           Index.TOKENIZED));
>       pdp.addUID(d, 100 + i);
>
>       writer.addDocument(d);
>     }
>     writer.close();
>
>     IndexReader reader = IndexReader.open(dir);
>     int[] uids = pdp.getCachedIds(reader, null, 0);
>
>     System.out.println("Caching:");
>     System.out.println("ID --> UID");
>
>     for (int i = 0; i < uids.length; i++) {
>       System.out.println(i + " --> " + uids[i]);
>     }
>
>     System.out.println("\nDirect access:");
>     System.out.println("ID --> UID");
>
>     for (int i = 0; i < uids.length; i++) {
>       System.out.println(i + " --> " + pdp.getUID(reader, i));
>     }
>     reader.close();
>   }
>
>   /** Fills the passed-in array with UID values. If the given array is
>    *  null or too small, then a new array is created with
>    *  cache.length = reader.maxDoc()
>    */
>   public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
>       throws IOException {
>     int maxDoc = reader.maxDoc();
>     // allocate a new array if the passed-in one is null or too small
>     if (cache == null || cache.length - offset < maxDoc) {
>       cache = new int[maxDoc];
>       offset = 0;
>     }
>
>     if (tp == null) {
>       tp = reader.termPositions(UID_TERM);
>     } else {
>       // if using Lucene 2.2 replace the following line with
>       // tp = reader.termPositions(UID_TERM);
>       tp.seek(UID_TERM);
>     }
>
>     while (tp.next()) {
>       assert tp.doc() < maxDoc;
>       if (!reader.isDeleted(tp.doc())) {
>         tp.nextPosition();
>         tp.getPayload(payloadBuffer, 0);
>         cache[tp.doc() + offset] = bytesToInt(payloadBuffer);
>       }
>     }
>
>     return cache;
>   }
>
>   /** Returns the UID for the passed-in docId.
>    *  If you use this method in a HitCollector and your Query contains
>    *  OR-terms, then try to set
>    *  BooleanQuery.setAllowDocsOutOfOrder(false) to improve performance.
>    */
>   public int getUID(IndexReader reader, int docId) throws IOException {
>     if (tp == null) {
>       tp = reader.termPositions(UID_TERM);
>     } else if (tp.doc() > docId) {
>       // if using Lucene 2.2 replace the following line with
>       // tp = reader.termPositions(UID_TERM);
>       tp.seek(UID_TERM);
>     }
>
>     tp.skipTo(docId);
>     tp.nextPosition();
>     tp.getPayload(payloadBuffer, 0);
>
>     return bytesToInt(payloadBuffer);
>   }
>
>   /**
>    * Adds a unique docId (UID) to a document as a payload.
>    */
>   public void addUID(Document doc, int uid) {
>     singlePayloadTokenStream.setUID(uid);
>     doc.add(new Field(UID_TERM.field(), singlePayloadTokenStream));
>   }
>
>   private int bytesToInt(byte[] bytes) {
>     return ((bytes[3] & 0xFF) << 24) | ((bytes[2] & 0xFF) << 16)
>         | ((bytes[1] & 0xFF) << 8) | (bytes[0] & 0xFF);
>   }
>
>   private static class SinglePayloadTokenStream extends TokenStream {
>     private Token token = new Token(UID_TERM.text(), 0, 0);
>     private byte[] buffer = new byte[4];
>     private boolean returnToken = false;
>
>     void setUID(int uid) {
>       buffer[0] = (byte) (uid);
>       buffer[1] = (byte) (uid >> 8);
>       buffer[2] = (byte) (uid >> 16);
>       buffer[3] = (byte) (uid >> 24);
>       token.setPayload(new Payload(buffer));
>       returnToken = true;
>     }
>
>     public Token next() throws IOException {
>       if (returnToken) {
>         returnToken = false;
>         return token;
>       } else {
>         return null;
>       }
>     }
>   }
> }
>
>
> > Thanks
> >
> > -John
> >
> > On 10/19/07, Michael Busch <[EMAIL PROTECTED]> wrote:
> >> John Wang wrote:
> >>> I can try to get some numbers for loading an int[] array vs.
> >>> FieldCache.getInts().
> >> I've had a similar performance problem when I used the FieldCache. The
> >> loading performance is apparently so slow because each value is stored
> >> as a term in the dictionary. For loading the cache it is necessary to
> >> iterate over all terms for the field in the dictionary. And for each
> >> term, its posting list is opened to check which documents have that
> >> value.
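The byte packing used by setUID and bytesToInt in the listing above can be exercised in isolation; a minimal standalone round trip of that little-endian encoding (class and method names here are illustrative, not part of Michael's class):

```java
// Round-trips the little-endian int encoding used for the UID payload:
// setUID writes the low byte first, bytesToInt masks with 0xFF to undo
// the sign extension that happens when a byte is widened to an int.
public class UidBytes {
    static byte[] intToBytes(int uid) {
        return new byte[] {
            (byte) uid,          // low byte first (little-endian)
            (byte) (uid >> 8),
            (byte) (uid >> 16),
            (byte) (uid >> 24)
        };
    }

    static int bytesToInt(byte[] b) {
        return ((b[3] & 0xFF) << 24) | ((b[2] & 0xFF) << 16)
             | ((b[1] & 0xFF) << 8) | (b[0] & 0xFF);
    }

    public static void main(String[] args) {
        int[] samples = { 0, 100, 65535, -1,
                          Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int uid : samples) {
            int back = bytesToInt(intToBytes(uid));
            System.out.println(uid + " -> " + back);
        }
    }
}
```

The 0xFF masks matter: dropping them corrupts any UID whose intermediate bytes have the high bit set, because Java sign-extends bytes during promotion.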
> >>
> >> If you store unique docIds, then there are no two documents that share
> >> the same value. That means that each value gets its own entry in the
> >> dictionary, and to load each value it is necessary to perform two
> >> random I/O seeks (one for the term lookup + one to open the posting
> >> list).
> >>
> >> In my app it took several minutes to fill the cache like that for a
> >> big index.
> >>
> >> To speed things up I did essentially what Ning suggested. Now I store
> >> the values as payloads in the posting list of an artificial term. To
> >> fill my cache it's only necessary to perform a couple of I/O seeks to
> >> open the posting list of that specific term; then it is just a
> >> sequential scan to load all values. With this approach the time for
> >> filling the cache went down from minutes to seconds!
> >>
> >> Now this approach is already much better than the current FieldCache
> >> implementation, but it can still be improved. In fact, we already have
> >> a mechanism for doing that: the norms. Norms are stored with a fixed
> >> size, which means both random access and sequential scan are optimal.
> >> Norms are also cached in memory, and filling that cache is much faster
> >> compared to the current FieldCache approach.
> >>
> >> I was therefore thinking about adding per-document payloads to Lucene
> >> (we could also call it document metadata). The API could look like
> >> this:
> >>
> >> Document d = new Document();
> >> byte[] uidValue = ...
> >> d.addMetadata("uid", uidValue);
> >>
> >> And on the retrieval side all values could either be loaded into the
> >> field cache, or, if the index is too big, a new API could be used:
> >>
> >> IndexReader reader = IndexReader.open(...);
> >> DocumentMetadataIterator it = reader.metadataIterator("uid");
> >>
> >> where DocumentMetadataIterator is an interface similar to TermDocs:
> >>
> >> interface DocumentMetadataIterator {
> >>   void seek(String name);
> >>   boolean next();
> >>   boolean skipTo(int doc);
> >>
> >>   int doc();
> >>   byte[] getMetadata();
> >> }
> >>
> >> The next question is how to store the per-doc payloads (PDPs). If all
> >> values have the same length (as the unique docIds do), then we should
> >> store them as efficiently as possible, like the norms. However, we
> >> still want to offer the flexibility of having variable-length values.
> >> For this case we could use a new data structure similar to our posting
> >> list:
> >>
> >> PDPList               --> FixedLengthPDPList |
> >>                           <VariableLengthPDPList, SkipList>
> >> FixedLengthPDPList    --> <Payload>^SegSize
> >> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> >> Payload               --> Byte^PayloadLength
> >> PayloadLength         --> VInt
> >> SkipList              --> see .frq file
> >>
> >> Because we don't have global field semantics, Lucene should
> >> automatically pick the "right" data structure. This could work like
> >> this: when the DocumentsWriter writes a segment it checks whether all
> >> values of a PDP have the same length. If yes, it stores them as a
> >> FixedLengthPDPList; if not, then as a VariableLengthPDPList.
> >> When the SegmentMerger merges two or more segments it checks whether
> >> all segments have a FixedLengthPDPList with the same length for a PDP.
> >> If not, it writes a VariableLengthPDPList to the new segment.
> >>
> >> I think this would be a nice new feature for Lucene. We could then
> >> have user-defined and Lucene-specific PDPs.
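The write-time and merge-time selection between the fixed- and variable-length layouts described above can be sketched standalone. This is a hypothetical illustration of the decision rule only, not Lucene code; the names (PdpLayout, choose, merge) are made up, and a negative length is used here to mark a variable-length segment:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed encoding decision: a segment gets the
// fixed-length layout only if every per-doc payload in it has the
// same length; a merged segment stays fixed only if all inputs are
// fixed with the same length.
public class PdpLayout {
    enum PdpEncoding { FIXED, VARIABLE }

    // DocumentsWriter side: inspect all payloads of one segment.
    static PdpEncoding choose(List<byte[]> payloads) {
        if (payloads.isEmpty()) return PdpEncoding.FIXED;
        int len = payloads.get(0).length;
        for (byte[] p : payloads) {
            if (p.length != len) return PdpEncoding.VARIABLE;
        }
        return PdpEncoding.FIXED;
    }

    // SegmentMerger side: lengths[i] is segment i's fixed payload
    // length, or -1 if that segment is variable-length.
    static PdpEncoding merge(int[] lengths) {
        int first = lengths[0];
        for (int l : lengths) {
            if (l < 0 || l != first) return PdpEncoding.VARIABLE;
        }
        return PdpEncoding.FIXED;
    }

    public static void main(String[] args) {
        List<byte[]> uids  = Arrays.asList(new byte[4], new byte[4]);
        List<byte[]> mixed = Arrays.asList(new byte[4], new byte[7]);
        System.out.println(choose(uids));   // all 4-byte UIDs
        System.out.println(choose(mixed));  // lengths differ
        System.out.println(merge(new int[] { 4, 4, 4 }));
        System.out.println(merge(new int[] { 4, -1 }));
    }
}
```

Note how this makes the choice purely local to a segment, which is exactly what the "no global field semantics" constraint requires: a field can be fixed-length in one segment and variable-length in another, and only merging forces them to agree.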
> >> For example, norms would be in the latter category (this way we would
> >> get rid of the special code for norms, as they could be handled as
> >> PDPs). It would also be easy to add new features in the future, like
> >> splitting the norms into two values: a norm and a boost value.
> >>
> >> OK, lots of thoughts; I'm sure I'll get lots of comments too ... ;)
> >>
> >> - Michael
> >>
> >>> Thanks
> >>>
> >>> -John
> >>>
> >>> On 10/19/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> >>>> It seems like there are (at least) two angles here for getting
> >>>> better performance from the FieldCache:
> >>>>
> >>>> 1) Be incremental: with reopen() we should only have to update a
> >>>>    subset of the array in the FieldCache, according to the changed
> >>>>    segments. This is what Hoss is working on and Mark was referring
> >>>>    to, and I think it's very important!
> >>>>
> >>>> 2) Parsing is slow (?): I'm guessing one of the reasons that John
> >>>>    added the _X.udt file is that it's much faster to load an array
> >>>>    of already-parsed ints than to ask the FieldCache to populate
> >>>>    itself.
> >>>>
> >>>> Even if we do #1, I think #2 could be a big win (in addition)? John,
> >>>> do you have any numbers on how much faster it is to load the array
> >>>> of ints from the _X.udt file vs. having the FieldCache populate
> >>>> itself?
> >>>>
> >>>> Also, on the original question of "can we open up SegmentReader,
> >>>> FieldsWriter, etc.": I think that's a good idea? At least we can
> >>>> make things protected instead of private/final?
> >>>>
> >>>> Mike
> >>>>
> >>>> "Ning Li" <[EMAIL PROTECTED]> wrote:
> >>>>> I see what you mean by 2) now. What Mark said should work for you
> >>>>> when it's done.
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>> On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> >>>>>> Hi Ning:
> >>>>>> That is essentially what the field cache does. Doing this for each
> >>>>>> docid in the result set will be slow if the result set is large.
> >>>>>> But loading
But loading > >> it > >>>> in > >>>>>> memory when opening index can also be slow if the index is large > and > >>>> updates > >>>>>> often. > >>>>>> > >>>>>> Thanks > >>>>>> > >>>>>> -John > >>>>>> > >>>>>> On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote: > >>>>>>> Make all documents have a term, say "ID:UID", and for each > document, > >>>>>>> store its UID in the term's payload. You can read off this posting > >>>>>>> list to create your array. Will this work for you, John? > >>>>>>> > >>>>>>> Cheers, > >>>>>>> Ning > >>>>>>> > >>>>>>> > >>>>>>> On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote: > >>>>>>>> Forwarding this to java-dev per request. Seems like the best > >>>> place > >>>>>>>> to discuss this topic. > >>>>>>>> > >>>>>>>> Erik > >>>>>>>> > >>>>>>>> > >>>>>>>> Begin forwarded message: > >>>>>>>> > >>>>>>>>> From: "John Wang" <[EMAIL PROTECTED]> > >>>>>>>>> Date: October 17, 2007 5:43:29 PM EDT > >>>>>>>>> To: [EMAIL PROTECTED] > >>>>>>>>> Subject: lucene indexing and merge process > >>>>>>>>> > >>>>>>>>> Hi Erik: > >>>>>>>>> > >>>>>>>>> We are revamping our search system here at LinekdIn. And we > >>>> are > >>>>>>>>> using Lucene. > >>>>>>>>> > >>>>>>>>> One issue we ran across is that we store an UID in Lucene > >>>> which > >>>>>>>>> we map to the DB storage. So given a docid, to lookup its UID, > >>>> we > >>>>>>>>> have the following solutions: > >>>>>>>>> > >>>>>>>>> 1) Index it as a Stored field and get it from reader.document > (very > >>>>>>>>> slow if recall is large) > >>>>>>>>> 2) Load/Warmup the FieldCache (for large corpus, loading up the > >>>>>>>>> indexreader can be slow) > >>>>>>>>> 3) construct it using the FieldCache and persist it on disk > >>>>>>>>> everytime the index changes. (not suitable for real time > >>>> indexing, > >>>>>>>>> e.g. this process will degrade as # of documents get large) > >>>>>>>>> > >>>>>>>>> None of the above solutions turn out to be adequate for our > >>>>>>>>> requirements. 
> >>>>>>>>>
> >>>>>>>>> What we ended up doing is modifying the Lucene code, changing
> >>>>>>>>> the SegmentReader, DocumentWriter, and FieldWriter classes to
> >>>>>>>>> take advantage of the Lucene segment/merge process. E.g.:
> >>>>>>>>>
> >>>>>>>>> For each segment, we store a .udt file, which is an int[]
> >>>>>>>>> array (by changing the FieldWriter class).
> >>>>>>>>>
> >>>>>>>>> And SegmentReader will load the .udt file into an array.
> >>>>>>>>>
> >>>>>>>>> And merging happens seamlessly.
> >>>>>>>>>
> >>>>>>>>> Because of the tight encapsulation around these classes, e.g.
> >>>>>>>>> private and final methods, it is very difficult to extend
> >>>>>>>>> Lucene without branching into our own version. Is there a way
> >>>>>>>>> we can open up and make these classes extensible? We'd be happy
> >>>>>>>>> to contribute what we have done.
> >>>>>>>>>
> >>>>>>>>> I guess to tackle the problem from a different angle: is there
> >>>>>>>>> a way to incorporate the FieldCache into the segments (it is
> >>>>>>>>> strictly in memory now) and build disk versions while indexing?
> >>>>>>>>>
> >>>>>>>>> Hope I am making sense.
> >>>>>>>>>
> >>>>>>>>> I did not send this out to the mailing list because I wasn't
> >>>>>>>>> sure if this is a dev question or a user question; feel free to
> >>>>>>>>> either forward it to the right mailing list or let me know and
> >>>>>>>>> I can forward it.
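The per-segment .udt idea described above amounts to persisting an already-parsed int[] so the reader can load it without walking the term dictionary. A minimal standalone sketch of that serialization, with the caveat that the file name and layout (a count followed by raw ints) are assumptions for illustration, not LinkedIn's actual format:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Writes and reads a ".udt"-style sidecar file: one int count,
// then the raw UID values. Loading is a single sequential read,
// which is why it beats repopulating a FieldCache from terms.
public class UdtFile {
    static void write(File f, int[] uids) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(uids.length);
            for (int uid : uids) out.writeInt(uid);
        }
    }

    static int[] read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            int[] uids = new int[in.readInt()];
            for (int i = 0; i < uids.length; i++) uids[i] = in.readInt();
            return uids;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("seg", ".udt");
        write(f, new int[] { 100, 101, 102 });
        System.out.println(Arrays.toString(read(f)));
        f.delete();
    }
}
```

This also illustrates McCandless's point #2 above: the file stores already-parsed ints, so "loading" is just a bulk read rather than term iteration plus string-to-int parsing per value.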
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> -John
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]