John Wang wrote:
> Hi Michael:
>      Thanks for the info.
> 
>      I haven't played with payloads. Can you give me an example or point
> me to how they can be used to solve this problem?
> 

Hi John,

I (quickly) put together a class that can store UIDs as payloads.
I believe your UIDs are of type Integer?

To add the UID to a document, use this method:
  /**
   *  Adds a unique docId (UID) to a document as a payload.
   */
  public void addUID(Document doc, int uid);

You can either load the UID from the disk or use a cache:

  /** Returns the UID for the passed-in docId.
   *  If you use this method in a HitCollector and your Query
   *  contains OR-terms, then try setting
   *  BooleanQuery.setAllowDocsOutOfOrder(false) to improve performance.
   */
  public int getUID(IndexReader reader, int docId) throws IOException;

  /** Fills the passed-in array with UID-values. If the given array is
   *  null or too small, then a new array is created with
   *  cache.length = reader.maxDoc()
   */
  public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
   throws IOException;
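
For example, using getUID from a HitCollector could look roughly like this
(just a sketch; pdp, reader, searcher and query are assumed to already exist
in your code, e.g. as final locals or fields):

  // Sketch only: pdp is a PerDocPayloadReaderWriter; reader, searcher and
  // query are an IndexReader, IndexSearcher and Query you already have.
  BooleanQuery.setAllowDocsOutOfOrder(false); // see the note above
  searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
      try {
        int uid = pdp.getUID(reader, doc); // docId -> UID via the payload
        // ... use uid, e.g. to look up the record in your DB
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  });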

I put a little test program in main and it seems to work fine. However,
it's not thoroughly tested yet...

You might want to try it without using the cache first; the performance
might be good enough for your needs. If not, try using the cache, which
should fill up much faster than the FieldCache.
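
If you do use the cache, the docId -> UID lookup becomes a plain array
access (again only a sketch, with made-up variable names):

  // Sketch only: fill the cache once per (re)opened IndexReader...
  int[] uids = pdp.getCachedIds(reader, null, 0);

  // ...then each lookup is a simple array access, with no I/O per hit:
  int uid = uids[docId];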

Another comment: if you're using Lucene 2.2, you need to replace the
"tp.seek(UID_TERM);" lines (see the comments in the code below), or use
the latest trunk version, which has a fix for this bug.

Please let me know if this improves your performance! Have fun...
- Michael

And here is the code:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Payload;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class PerDocPayloadReaderWriter {
  public static final Term UID_TERM = new Term("_ID", "_UID");
  private SinglePayloadTokenStream singlePayloadTokenStream =
      new SinglePayloadTokenStream();
  private TermPositions tp;
  private byte[] payloadBuffer = new byte[4];

  public static void main(String args[]) throws IOException {
    PerDocPayloadReaderWriter pdp = new PerDocPayloadReaderWriter();
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer());
    for (int i = 0; i < 10; i++) {
      Document d = new Document();
      // create dummy doc
      d.add(new Field("test", "This is a test.", Store.NO,
Index.TOKENIZED));
      pdp.addUID(d, 100 + i);

      writer.addDocument(d);
    }
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    int[] uids = pdp.getCachedIds(reader, null, 0);

    System.out.println("Caching:");
    System.out.println("ID --> UID");

    for (int i = 0; i < uids.length; i++) {
      System.out.println(i + "  --> " + uids[i]);
    }

    System.out.println("\nDirect access:");
    System.out.println("ID --> UID");

    for (int i = 0; i < uids.length; i++) {
      System.out.println(i + "  --> " + pdp.getUID(reader, i));
    }
    reader.close();
  }


  /** Fills the passed-in array with UID-values. If the given array is
   *  null or too small, then a new array is created with
   *  cache.length = reader.maxDoc()
   */
  public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
      throws IOException {
    int maxDoc = reader.maxDoc();
    // create a new array if the passed-in one is missing or too small
    if (cache == null || cache.length - offset < maxDoc) {
      cache = new int[maxDoc];
      offset = 0;
    }

    if (tp == null) {
      tp = reader.termPositions(UID_TERM);
    } else {
      // if using Lucene 2.2 replace the following line with
      // tp = reader.termPositions(UID_TERM);
      tp.seek(UID_TERM);
    }

    while (tp.next()) {
      assert tp.doc() < maxDoc;
      if (!reader.isDeleted(tp.doc())) {
        tp.nextPosition();
        tp.getPayload(payloadBuffer, 0);
        cache[tp.doc() + offset] = bytesToInt(payloadBuffer);
      }
    }

    return cache;
  }

  /** Returns the UID for the passed-in docId.
   *  If you use this method in a HitCollector and your Query contains
   *  OR-terms, then try setting BooleanQuery.setAllowDocsOutOfOrder(false)
   *  to improve performance.
   */
  public int getUID(IndexReader reader, int docId) throws IOException {
    if (tp == null) {
      tp = reader.termPositions(UID_TERM);
    } else if (tp.doc() > docId) {
      // if using Lucene 2.2 replace the following line with
      // tp = reader.termPositions(UID_TERM);
      tp.seek(UID_TERM);
    }

    tp.skipTo(docId);
    tp.nextPosition();
    tp.getPayload(payloadBuffer, 0);

    return bytesToInt(payloadBuffer);
  }

  /**
   * Adds a unique docId (UID) to a document as a payload.
   */
  public void addUID(Document doc, int uid) {
    singlePayloadTokenStream.setUID(uid);
    doc.add(new Field(UID_TERM.field(), singlePayloadTokenStream));
  }

  private int bytesToInt(byte[] bytes) {
    // decode the little-endian encoding written by SinglePayloadTokenStream.setUID()
    return ((bytes[3] & 0xFF) << 24) | ((bytes[2] & 0xFF) << 16)
         | ((bytes[1] & 0xFF) <<  8) |  (bytes[0] & 0xFF);
  }

  private static class SinglePayloadTokenStream extends TokenStream {
    private Token token = new Token(UID_TERM.text(), 0, 0);
    private byte[] buffer = new byte[4];
    private boolean returnToken = false;

    void setUID(int uid) {
      // encode the uid in little-endian byte order
      buffer[0] = (byte) (uid);
      buffer[1] = (byte) (uid >> 8);
      buffer[2] = (byte) (uid >> 16);
      buffer[3] = (byte) (uid >> 24);
      token.setPayload(new Payload(buffer));
      returnToken = true;
    }

    public Token next() throws IOException {
      if (returnToken) {
        returnToken = false;
        return token;
      } else {
        return null;
      }
    }
  }
}



> Thanks
> 
> -John
> 
> On 10/19/07, Michael Busch <[EMAIL PROTECTED]> wrote:
>> John Wang wrote:
>>>      I can try to get some numbers for loading an int[] array vs
>>> FieldCache.getInts().
>> I've had a similar performance problem when I used the FieldCache. The
>> loading performance is apparently so slow because each value is stored
>> as a term in the dictionary. For loading the cache it is necessary to
>> iterate over all terms for the field in the dictionary, and for each
>> term its posting list is opened to check which documents have that value.
>>
>> If you store unique docIds, then no two documents share the same value.
>> That means that each value gets its own entry in the dictionary, and to
>> load each value it is necessary to perform two random I/O seeks (one for
>> the term lookup + one to open the posting list).
>>
>> In my app, filling the cache like that took several minutes for a big
>> index.
>>
>> To speed things up I did essentially what Ning suggested. Now I store
>> the values as payloads in the posting list of an artificial term. To
>> fill my cache it's only necessary to perform a couple of I/O seeks to
>> open the posting list of that specific term; after that it is just a
>> sequential scan to load all values. With this approach the time for
>> filling the cache went down from minutes to seconds!
>>
>> Now this approach is already much better than the current field cache
>> implementation, but it still can be improved. In fact, we already have a
>> mechanism for doing that: the norms. Norms are stored with a fixed size,
>> which means both random access and sequential scan are optimal. Norms
>> are also cached in memory, and filling that cache is much faster
>> compared to the current FieldCache approach.
>>
>> I was therefore thinking about adding per-document payloads to Lucene
>> (we can also call it document-metadata). The API could look like this:
>>
>> Document d = new Document();
>> byte[] uidValue = ...
>> d.addMetadata("uid", uidValue);
>>
>> And on the retrieval side all values could either be loaded into the
>> field cache, or, if the index is too big, a new API can be used:
>>
>> IndexReader reader = IndexReader.open(...);
>> DocumentMetadataIterator it = reader.metadataIterator("uid");
>>
>> where DocumentMetadataIterator is an interface similar to TermDocs:
>>
>> interface DocumentMetadataIterator {
>>   void seek(String name);
>>   boolean next();
>>   boolean skipTo(int doc);
>>
>>   int doc();
>>   byte[] getMetadata();
>> }
>>
>> The next question would be how to store the per-doc payloads (PDP). If
>> all values have the same length (as the unique docIds), then we should
>> store them as efficiently as possible, like the norms. However, we still
>> want to offer the flexibility of having variable-length values. For this
>> case we could use a new data structure similar to our posting list.
>>
>> PDPList               --> FixedLengthPDPList |
>>                           <VariableLengthPDPList, SkipList>
>> FixedLengthPDPList    --> <Payload>^SegSize
>> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
>> Payload               --> Byte^PayloadLength
>> PayloadLength         --> VInt
>> SkipList              --> see frq.file
>>
>> Because we don't have global field semantics, Lucene should automatically
>> pick the "right" data structure. This could work like this: when the
>> DocumentsWriter writes a segment, it checks whether all values of a PDP
>> have the same length. If yes, it stores them as a FixedLengthPDPList; if
>> not, then as a VariableLengthPDPList.
>> When the SegmentMerger merges two or more segments, it checks if all
>> segments have a FixedLengthPDPList with the same length for a PDP. If
>> not, it writes a VariableLengthPDPList to the new segment.
>>
>> I think this would be a nice new feature for Lucene. We could then have
>> user-defined and Lucene-specific PDPs. For example, norms would be in
>> the latter category (this way we would get rid of the special code for
>> norms, as they could be handled as PDPs). It would also be easy to add
>> new features in the future, like splitting the norms into two values: a
>> norm and a boost value.
>>
>> OK, lots of thoughts; I'm sure I'll get lots of comments too ... ;)
>>
>> - Michael
>>
>>> Thanks
>>>
>>> -John
>>>
>>> On 10/19/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
>>>> It seems like there are (at least) two angles here for getting better
>>>> performance from FieldCache:
>>>>
>>>>   1) Be incremental: with reopen() we should only have to update a
>>>>      subset of the array in the FieldCache, according to the changed
>>>>      segments.  This is what Hoss is working on and Mark was referring
>>>>      to, and I think it's very important!
>>>>
>>>>   2) Parsing is slow (?): I'm guessing one of the reasons that John
>>>>      added the _X.udt file was because it's much faster to load an
>>>>      array of already-parsed ints than to ask FieldCache to populate
>>>>      itself.
>>>>
>>>> Even if we do #1, I think #2 could be a big win (in addition)?  John,
>>>> do you have any numbers on how much faster it is to load the array of
>>>> ints from the _X.udt file vs having FieldCache populate itself?
>>>>
>>>> Also on the original question of "can we open up SegmentReader,
>>>> FieldsWriter, etc.", I think that's a good idea?  At least we can make
>>>> things protected instead of private/final?
>>>>
>>>> Mike
>>>>
>>>> "Ning Li" <[EMAIL PROTECTED]> wrote:
>>>>> I see what you mean by 2) now. What Mark said should work for you when
>>>>> it's done.
>>>>>
>>>>> Cheers,
>>>>> Ning
>>>>>
>>>>> On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
>>>>>> Hi Ning:
>>>>>>     That is essentially what the FieldCache does. Doing this for each
>>>>>> docid in the result set will be slow if the result set is large. But
>>>>>> loading it in memory when opening the index can also be slow if the
>>>>>> index is large and is updated often.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
>>>>>>> Make all documents have a term, say "ID:UID", and for each document,
>>>>>>> store its UID in the term's payload. You can read off this posting
>>>>>>> list to create your array. Will this work for you, John?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ning
>>>>>>>
>>>>>>>
>>>>>>> On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>>>>>>>> Forwarding this to java-dev per request.  Seems like the best
>>>>>>>> place to discuss this topic.
>>>>>>>>
>>>>>>>>         Erik
>>>>>>>>
>>>>>>>>
>>>>>>>> Begin forwarded message:
>>>>>>>>
>>>>>>>>> From: "John Wang" <[EMAIL PROTECTED]>
>>>>>>>>> Date: October 17, 2007 5:43:29 PM EDT
>>>>>>>>> To: [EMAIL PROTECTED]
>>>>>>>>> Subject: lucene indexing and merge process
>>>>>>>>>
>>>>>>>>> Hi Erik:
>>>>>>>>>
>>>>>>>>>     We are revamping our search system here at LinkedIn, and we
>>>>>>>>> are using Lucene.
>>>>>>>>>
>>>>>>>>>     One issue we ran across is that we store a UID in Lucene
>>>>>>>>> which we map to the DB storage. So given a docid, to look up its
>>>>>>>>> UID, we have the following solutions:
>>>>>>>>>
>>>>>>>>> 1) Index it as a stored field and get it from reader.document()
>>>>>>>>> (very slow if recall is large)
>>>>>>>>> 2) Load/warm up the FieldCache (for a large corpus, loading up the
>>>>>>>>> IndexReader can be slow)
>>>>>>>>> 3) Construct it using the FieldCache and persist it on disk every
>>>>>>>>> time the index changes (not suitable for real-time indexing, e.g.
>>>>>>>>> this process will degrade as the number of documents gets large)
>>>>>>>>>
>>>>>>>>>     None of the above solutions turn out to be adequate for our
>>>>>>>>> requirements.
>>>>>>>>>
>>>>>>>>>      What we ended up doing is modifying the Lucene code, changing
>>>>>>>>> the SegmentReader, DocumentWriter, and FieldWriter classes to take
>>>>>>>>> advantage of the Lucene segment/merge process. E.g.:
>>>>>>>>>
>>>>>>>>>      For each segment, we store a .udt file, which is an int[]
>>>>>>>>> array (by changing the FieldWriter class).
>>>>>>>>>
>>>>>>>>>      And SegmentReader will load the .udt file into an array.
>>>>>>>>>
>>>>>>>>>      And merging happens seamlessly.
>>>>>>>>>
>>>>>>>>>      Because of the tight encapsulation around these classes, e.g.
>>>>>>>>> private and final methods, it is very difficult to extend Lucene
>>>>>>>>> without branching into our own version. Is there a way we can
>>>>>>>>> open up these classes and make them extensible? We'd be happy to
>>>>>>>>> contribute what we have done.
>>>>>>>>>
>>>>>>>>>      I guess to tackle the problem from a different angle: is
>>>>>>>>> there a way to incorporate the FieldCache into the segments (it is
>>>>>>>>> strictly in memory now) and build disk versions while indexing?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Hope I am making sense.
>>>>>>>>>
>>>>>>>>>     I did not send this out to the mailing list because I wasn't
>>>>>>>>> sure if this is a dev question or a user question. Feel free to
>>>>>>>>> either forward it to the right mailing list or let me know and I
>>>>>>>>> can forward it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>
>>
>>
> 

