On Oct 19, 2007, at 6:53 PM, Michael Busch wrote:
John Wang wrote:
I can try to get some numbers for loading an int[] array vs. FieldCache.getInts().
I've had a similar performance problem when I used the FieldCache. The loading is so slow because each value is stored as a term in the dictionary. To load the cache it is necessary to iterate over all terms for the field in the dictionary, and for each term its posting list is opened to check which documents have that value.
If you store unique docIds, then no two documents share the same value. That means each value gets its own entry in the dictionary, and loading each value takes two random I/O seeks (one for the term lookup plus one to open the posting list). In my app it took several minutes to fill the cache like that for a big index.
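To make the cost concrete, here is a simplified sketch of the loop the cache fill has to run today (Lucene 2.x APIs, error handling omitted; the real FieldCache code also deals with custom parsers):

  int[] loadInts(IndexReader reader, String field) throws IOException {
    int[] values = new int[reader.maxDoc()];
    TermEnum termEnum = reader.terms(new Term(field, ""));
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = termEnum.term();
        if (term == null || !term.field().equals(field)) break;
        int value = Integer.parseInt(term.text());
        termDocs.seek(termEnum);        // random seek per term
        while (termDocs.next()) {       // walk that term's posting list
          values[termDocs.doc()] = value;
        }
      } while (termEnum.next());
    } finally {
      termDocs.close();
      termEnum.close();
    }
    return values;
  }

With one unique value per document that is maxDoc() iterations of the outer loop, each paying the term-lookup and posting-list seeks.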
To speed things up I did essentially what Ning suggested. Now I store the values as payloads in the posting list of an artificial term. To fill my cache it's only necessary to perform a couple of I/O seeks to open the posting list of that specific term; after that it is just a sequential scan to load all values. With this approach the time for filling the cache went down from minutes to seconds!
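The reading side is then just a single scan (a sketch; the field and term names here are made up for illustration, and the values are assumed to be 4-byte ints):

  Term marker = new Term("_uid", "_all");  // artificial term, one posting per doc
  TermPositions tp = reader.termPositions(marker);
  int[] uids = new int[reader.maxDoc()];
  byte[] buf = new byte[4];
  while (tp.next()) {          // one sequential scan over one posting list
    tp.nextPosition();         // the position must be consumed before the payload
    tp.getPayload(buf, 0);
    uids[tp.doc()] = ((buf[0] & 0xFF) << 24) | ((buf[1] & 0xFF) << 16)
                   | ((buf[2] & 0xFF) << 8)  |  (buf[3] & 0xFF);
  }
  tp.close();

On the indexing side each document gets one token for that artificial term, with the encoded value attached via Token.setPayload().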
Now this approach is already much better than the current field cache implementation, but it can still be improved. In fact, we already have a mechanism for doing that: the norms. Norms are stored with a fixed size, which means both random access and sequential scan are optimal. Norms are also cached in memory, and filling that cache is much faster compared to the current FieldCache approach.
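(For comparison, today's norms API is just:

  byte[] norms = reader.norms("field");  // one byte per doc, length == maxDoc()

i.e. one sequential read, with no term iteration at all.)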
I was therefore thinking about adding per-document payloads to Lucene (we can also call it document-metadata). The API could look like this:

  Document d = new Document();
  byte[] uidValue = ...
  d.addMetadata("uid", uidValue);

And on the retrieval side all values could either be loaded into the field cache, or, if the index is too big, a new API can be used:

  IndexReader reader = IndexReader.open(...);
  DocumentMetadataIterator it = reader.metadataIterator("uid");
where DocumentMetadataIterator is an interface similar to TermDocs:

  interface DocumentMetadataIterator {
    void seek(String name);
    boolean next();
    boolean skipTo(int doc);

    int doc();
    byte[] getMetadata();
  }
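Usage would look something like this (again, just a sketch of the proposed API; none of this exists yet):

  DocumentMetadataIterator it = reader.metadataIterator("uid");
  while (it.next()) {
    int doc = it.doc();
    byte[] uid = it.getMetadata();
    // process (doc, uid) without materializing the whole field in RAM
  }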
The next question would be how to store the per-doc payloads (PDP). If all values have the same length (as the unique docIds), then we should store them as efficiently as possible, like the norms. However, we still want to offer the flexibility of having variable-length values. For this case we could use a new data structure similar to our posting list.
  PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
  FixedLengthPDPList    --> <Payload>^SegSize
  VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
  Payload               --> Byte^PayloadLength
  PayloadLength         --> VInt
  SkipList              --> see .frq file
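One plausible way to write a VariableLengthPDPList entry would borrow the trick from the .prx format of only storing the length when it changes (the writer class around this method is hypothetical):

  void writeEntry(IndexOutput out, int docDelta, byte[] payload,
                  boolean lengthChanged) throws IOException {
    // the low bit of the delta flags an explicit PayloadLength,
    // which is how the PayloadLength? in the grammar stays optional
    out.writeVInt((docDelta << 1) | (lengthChanged ? 1 : 0));
    if (lengthChanged) {
      out.writeVInt(payload.length);
    }
    out.writeBytes(payload, payload.length);
  }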
Because we don't have global field semantics, Lucene should automatically pick the "right" data structure. This could work like this: when the DocumentsWriter writes a segment it checks whether all values of a PDP have the same length. If yes, it stores them as a FixedLengthPDPList; if not, as a VariableLengthPDPList.

When the SegmentMerger merges two or more segments it checks if all segments have a FixedLengthPDPList with the same length for a PDP. If not, it writes a VariableLengthPDPList to the new segment.
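In code the merge-time check could be as simple as (method and variable names invented for illustration; nothing exposes this today):

  int length = readers[0].payloadLength("uid");   // -1 means variable-length
  boolean fixed = length != -1;
  for (int i = 1; fixed && i < readers.length; i++) {
    fixed = readers[i].payloadLength("uid") == length;
  }
  // fixed == true  -> write a FixedLengthPDPList
  // fixed == false -> fall back to a VariableLengthPDPList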
I think this would be a nice new feature for Lucene. We could then have user-defined and Lucene-specific PDPs. For example, norms would be in the latter category (this way we would get rid of the special code for norms, as they could be handled as PDPs). It would also be easy to add new features in the future, like splitting the norms into two values: a norm and a boost value.
Some randomly pieced together thoughts (I may not even be fully awake yet :-), so feel free to tell me I'm not understanding this correctly):
My first thought was: how is this different from just having a binary field? But if I understand correctly, it is to be stored in a separate file?

Now you are proposing a faster storage mechanism for them, essentially, since they are to be stored separately from the Documents themselves? But the other key is they are all stored next to each other, right, so the scan is a lot faster?
I think one of the questions that will come up from users is: when should I use addMetadata and when should I use addField? Why make the distinction to the user? Fields have always represented metadata; all you're doing is optimizing their internal storage.

So from an interface side of things, I would just make it a new Field type. Essentially what we are doing is creating a two-level document store, right? The first level contains all of the small metadata that is likely to be accessed on every hit; the second level contains all of the non-essential fields, right? Perhaps in this way, if users were willing to commit to fixed-length fields for the first level, we could also make updating these types of fields possible without having to reindex?
Btw, I've thought ever since we added payloads that we should find a way to hook in scoring on the binary fields, and I would presume people would eventually want scoring of metadata too, just like the FunctionQuery stuff does.
And yes, to Nicholas's point, it starts to sound like flexible indexing. :-) Which I still would like to get to sometime in my lifetime...
Cheers,
Grant