On Oct 19, 2007, at 6:53 PM, Michael Busch wrote:
John Wang wrote:
I can try to get some numbers for loading an int[] array vs. FieldCache.getInts().
I've had a similar performance problem when I used the FieldCache. The loading is so slow because each value is stored as a term in the dictionary. To load the cache it is necessary to iterate over all terms for the field in the dictionary, and for each term its posting list is opened to check which documents have that value.
If you store unique docIds, then no two documents share the same value. That means each value gets its own entry in the dictionary, and loading each value takes two random I/O seeks (one for the term lookup plus one to open the posting list). In my app it took several minutes to fill the cache like that for a big index.
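To make the cost concrete, here is a simplified sketch of the loop the cache fill has to run today (Lucene 2.x APIs, error handling omitted; the real FieldCache code also deals with custom parsers):

  int[] loadInts(IndexReader reader, String field) throws IOException {
    int[] values = new int[reader.maxDoc()];
    TermEnum termEnum = reader.terms(new Term(field, ""));
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = termEnum.term();
        if (term == null || !term.field().equals(field)) break;
        int value = Integer.parseInt(term.text());
        termDocs.seek(termEnum);        // random seek per term
        while (termDocs.next()) {       // walk that term's posting list
          values[termDocs.doc()] = value;
        }
      } while (termEnum.next());
    } finally {
      termDocs.close();
      termEnum.close();
    }
    return values;
  }

With one unique value per document that is maxDoc() iterations of the outer loop, each paying the term-lookup and posting-list seeks.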
To speed things up I did essentially what Ning suggested. Now I store the values as payloads in the posting list of an artificial term. To fill my cache it's only necessary to perform a couple of I/O seeks to open the posting list of that specific term; after that it is just a sequential scan to load all values. With this approach the time for filling the cache went down from minutes to seconds!
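The reading side is then just a single scan (a sketch; the field and term names here are made up for illustration, and the values are assumed to be 4-byte ints):

  Term marker = new Term("_uid", "_all");  // artificial term, one posting per doc
  TermPositions tp = reader.termPositions(marker);
  int[] uids = new int[reader.maxDoc()];
  byte[] buf = new byte[4];
  while (tp.next()) {          // one sequential scan over one posting list
    tp.nextPosition();         // the position must be consumed before the payload
    tp.getPayload(buf, 0);
    uids[tp.doc()] = ((buf[0] & 0xFF) << 24) | ((buf[1] & 0xFF) << 16)
                   | ((buf[2] & 0xFF) << 8)  |  (buf[3] & 0xFF);
  }
  tp.close();

On the indexing side each document gets one token for that artificial term, with the encoded value attached via Token.setPayload().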
Now this approach is already much better than the current field cache implementation, but it can still be improved. In fact, we already have a mechanism for doing that: the norms. Norms are stored with a fixed size, which means both random access and sequential scan are optimal. Norms are also cached in memory, and filling that cache is much faster compared to the current FieldCache approach.
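(For comparison, today's norms API is just:

  byte[] norms = reader.norms("field");  // one byte per doc, length == maxDoc()

i.e. one sequential read, with no term iteration at all.)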
I was therefore thinking about adding per-document payloads to Lucene (we can also call it document-metadata). The API could look like this:

  Document d = new Document();
  byte[] uidValue = ...
  d.addMetadata("uid", uidValue);

And on the retrieval side all values could either be loaded into the field cache, or, if the index is too big, a new API can be used:

  IndexReader reader = IndexReader.open(...);
  DocumentMetadataIterator it = reader.metadataIterator("uid");
where DocumentMetadataIterator is an interface similar to TermDocs:

  interface DocumentMetadataIterator {
    void seek(String name);
    boolean next();
    boolean skipTo(int doc);

    int doc();
    byte[] getMetadata();
  }
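Usage would look something like this (again, just a sketch of the proposed API; none of this exists yet):

  DocumentMetadataIterator it = reader.metadataIterator("uid");
  while (it.next()) {
    int doc = it.doc();
    byte[] uid = it.getMetadata();
    // process (doc, uid) without materializing the whole field in RAM
  }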
The next question would be how to store the per-doc payloads (PDP). If all values have the same length (as the unique docIds), then we should store them as efficiently as possible, like the norms. However, we still want to offer the flexibility of having variable-length values. For this case we could use a new data structure similar to our posting list.
  PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
  FixedLengthPDPList    --> <Payload>^SegSize
  VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
  Payload               --> Byte^PayloadLength
  PayloadLength         --> VInt
  SkipList              --> see .frq file
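One plausible way to write a VariableLengthPDPList entry would borrow the trick from the .prx format of only storing the length when it changes (the writer class around this method is hypothetical):

  void writeEntry(IndexOutput out, int docDelta, byte[] payload,
                  boolean lengthChanged) throws IOException {
    // the low bit of the delta flags an explicit PayloadLength,
    // which is how the PayloadLength? in the grammar stays optional
    out.writeVInt((docDelta << 1) | (lengthChanged ? 1 : 0));
    if (lengthChanged) {
      out.writeVInt(payload.length);
    }
    out.writeBytes(payload, payload.length);
  }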
Because we don't have global field semantics, Lucene should automatically pick the "right" data structure. This could work like this: when the DocumentsWriter writes a segment it checks whether all values of a PDP have the same length. If yes, it stores them as a FixedLengthPDPList; if not, as a VariableLengthPDPList.

When the SegmentMerger merges two or more segments it checks if all segments have a FixedLengthPDPList with the same length for a PDP. If not, it writes a VariableLengthPDPList to the new segment.
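In code the merge-time check could be as simple as (method and variable names invented for illustration; nothing exposes this today):

  int length = readers[0].payloadLength("uid");   // -1 means variable-length
  boolean fixed = length != -1;
  for (int i = 1; fixed && i < readers.length; i++) {
    fixed = readers[i].payloadLength("uid") == length;
  }
  // fixed == true  -> write a FixedLengthPDPList
  // fixed == false -> fall back to a VariableLengthPDPList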
I think this would be a nice new feature for Lucene. We could then have user-defined and Lucene-specific PDPs. For example, norms would be in the latter category (this way we would get rid of the special code for norms, as they could be handled as PDPs). It would also be easy to add new features in the future, like splitting the norms into two values: a norm and a boost value.
Some randomly pieced together thoughts (I may not even be fully awake yet :-), so feel free to tell me I'm not understanding this correctly):
My first thought was: how is this different from just having a binary field? But if I understand correctly, it is to be stored in a separate file?

Now you are proposing a faster storage mechanism for them, essentially, since they are to be stored separately from the Documents themselves? But the other key is they are all stored next to each other, right, so the scan is a lot faster?
I think one of the questions that will come up from users is: when should I use addMetadata and when should I use addField? Why make the distinction to the user? Fields have always represented metadata; all you're doing is optimizing their internal storage.

So from an interface side of things, I would just make it a new Field type. Essentially what we are doing is creating a two-level document store, right? The first level contains all of the small metadata that is likely to be accessed on every hit; the second level contains all of the non-essential fields, right? Perhaps in this way, if users were willing to commit to fixed-length fields for the first level, we could also make updating these types of fields possible without having to reindex?
Btw, I've thought ever since we added payloads that we should find a way to hook in scoring on the binary fields, and I would presume people would eventually want scoring of metadata too, just like the FunctionQuery stuff does.
And yes, to Nicholas's point, it starts to sound like flexible indexing. :-) Which I still would like to get to sometime in my lifetime...
Cheers,
Grant