Re: Per-document Payloads

Michael Busch Sat, 20 Oct 2007 12:51:17 -0700

Grant Ingersoll wrote:
> 
> Some randomly pieced together thoughts (I may not even be fully awake
> yet :-)  so feel free to tell me I'm not understanding this correctly)
> 
> My first thought was how is this different from just having a binary
> field, but if I understand correctly it is to be stored in a separate file?
> 
> Now you are proposing a faster storage mechanism for them, essentially,
> since they are to be stored separately from the Documents themselves?  
> But the other key is they are all stored next to each other, right, so
> the scan is a lot faster?
>


Yes, scanning and skipping would be much faster, comparable to a posting
list. In fact, what I'm proposing is a new kind of posting list. Since
you mentioned the magic term "flexible indexing" already ;), let's take
a look at http://wiki.apache.org/lucene-java/FlexibleIndexing. Four
kinds of posting lists are proposed there:

a. <doc>+

b. <doc, boost>+

c. <doc, freq, <position>+ >+

d. <doc, freq, <position, boost>+ >+

Today, we have c. and d. already. c. is the original Lucene format, and
d. can be achieved by storing the boost as a payload.

The new format I'm proposing actually covers a. and b. If you don't
store a payload, it's basically a binary posting list without freq and
positions (a.). If you store the boost as a payload, then you have b.
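
To make formats a. and b. a bit more concrete, here is a minimal sketch
of such a posting list. It is not Lucene code: the class PayloadPostings
and its methods are hypothetical names, and the encoding (delta-coded
doc ids as variable-length ints, followed by a payload length and the
payload bytes) is just one assumed layout. With empty payloads it
degenerates to a., with a one-byte boost per document it is b.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: a posting list of <doc, payload>
// entries without freq or positions.
class PayloadPostings {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int lastDoc = 0;

    // Append one entry; doc ids must be added in increasing order.
    void add(int doc, byte[] payload) {
        writeVInt(doc - lastDoc);               // delta-encoded doc id
        writeVInt(payload.length);              // length 0 = no payload (format a.)
        out.write(payload, 0, payload.length);  // e.g. one boost byte (format b.)
        lastDoc = doc;
    }

    byte[] toBytes() {
        return out.toByteArray();
    }

    // Variable-length int: 7 data bits per byte, high bit set on all but the last byte.
    private void writeVInt(int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Returns {value, position after the value}.
    private static int[] readVInt(byte[] data, int pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return new int[] { value, pos };
    }

    // Scan the list sequentially and return all doc ids, skipping payload bytes.
    static List<Integer> docs(byte[] data) {
        List<Integer> result = new ArrayList<>();
        int pos = 0;
        int doc = 0;
        while (pos < data.length) {
            int[] delta = readVInt(data, pos);
            int[] length = readVInt(data, delta[1]);
            doc += delta[0];
            result.add(doc);
            pos = length[1] + length[0];
        }
        return result;
    }

    // Scan forward to the target doc and return its payload, or null if the doc is absent.
    static byte[] payload(byte[] data, int target) {
        int pos = 0;
        int doc = 0;
        while (pos < data.length) {
            int[] delta = readVInt(data, pos);
            int[] length = readVInt(data, delta[1]);
            doc += delta[0];
            if (doc == target) {
                byte[] result = new byte[length[0]];
                System.arraycopy(data, length[1], result, 0, length[0]);
                return result;
            }
            if (doc > target) return null;
            pos = length[1] + length[0];
        }
        return null;
    }
}
```

Everything is read front to back, which is what makes the scan cheap. A
real implementation would add skip data on top of this.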


> I think one of the questions that will come up from users is when should
> I use addMetadata and when should I use addField?  Why make the
> distinction to the user?  Fields have always represented metadata, all

I'd like to make a distinction because IMO these are two different use
cases. Not necessarily in terms of functionality, but in terms of
performance. You are right, you can store everything today as stored
fields, but if you want to use e.g. a stored value for scoring, then
performance is terrible. This is simply the nature of the store: it is
optimized for returning all stored fields for a document. Even a
FieldSelector doesn't help much, unless the docs contain very big
fields that you don't want to return. The reason is that two random
I/Os are necessary to find the stored fields of a document; after that,
only sequential I/O has to be performed. And the overhead of loading
e.g. 10KB instead of 2KB is not big, much less than two random I/Os, I
believe.
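
A toy model of that access pattern, in case it helps. This is not
Lucene code: StoredFieldsModel is a hypothetical class, and it assumes
a layout with a per-document offset table plus one data blob, with the
seeks counter standing in for random I/Os.

```java
import java.util.Arrays;

// Hypothetical model, not Lucene code: an offset table plus a data blob,
// with a counter for the simulated random seeks.
class StoredFieldsModel {
    private final int[] offsets;  // "index file": start offset of each document
    private final byte[] data;    // "data file": stored fields of all documents
    int seeks = 0;

    StoredFieldsModel(byte[][] docs) {
        offsets = new int[docs.length + 1];
        for (int i = 0; i < docs.length; i++) {
            offsets[i + 1] = offsets[i] + docs[i].length;
        }
        data = new byte[offsets[docs.length]];
        for (int i = 0; i < docs.length; i++) {
            System.arraycopy(docs[i], 0, data, offsets[i], docs[i].length);
        }
    }

    // Load all stored fields of one document.
    byte[] load(int doc) {
        seeks++;                       // random I/O 1: read the offset table entry
        int start = offsets[doc];
        int end = offsets[doc + 1];
        seeks++;                       // random I/O 2: jump into the data file
        return Arrays.copyOfRange(data, start, end);  // sequential read from there
    }
}
```

In this model the seek count is two per document no matter how many
bytes the document holds, so touching a stored value for every hit
during scoring pays that price once per hit.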

Payloads are also much better in terms of cache utilization. Since they
are stored next to each other, it's very likely that big portions of
that posting list will be in the cache if they are accessed frequently
(in every search).

So the answer to the question of when to use a stored field and when to
use a payload should be: use payloads when you access the data during
query evaluation/scoring, use stored fields when you need the data to
construct a search result from a hit.

> fields, right?  Perhaps in this way, if users were willing to commit to
> fixed length fields for the first level, we could also make field
> updating of these types of fields possible w/o having to reindex?????
> 

Yes, I was thinking the same. Just like norms.
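
A rough sketch of why fixed length makes this possible, assuming one
fixed-width slot per document (norms use one byte per document per
field). FixedWidthValues is a hypothetical class, not an existing or
proposed Lucene API.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene code: one fixed-width slot per document.
class FixedWidthValues {
    private final byte[] values;
    private final int width;

    FixedWidthValues(int maxDoc, int width) {
        this.values = new byte[maxDoc * width];
        this.width = width;
    }

    // Overwrite the slot of one document in place; no other slot moves.
    void set(int doc, byte[] value) {
        if (value.length != width) {
            throw new IllegalArgumentException("value must be " + width + " bytes");
        }
        System.arraycopy(value, 0, values, doc * width, width);
    }

    // Read the slot of one document; its position is simply doc * width.
    byte[] get(int doc) {
        return Arrays.copyOfRange(values, doc * width, (doc + 1) * width);
    }
}
```

Because every slot sits at doc * width, an update is a plain overwrite
and nothing after it has to be rewritten. With variable-length values
the following entries would shift, which is why the fixed-length
commitment matters.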


