Michael McCandless wrote:
>
> Michael, are you thinking that the storage would/could be non-sparse
> (like norms), and loaded/cached once in memory, especially for fixed
> size fields?  EG a big array of ints of length maxDocID?  In John's
> original case, every doc has this UID int field; I think this is
> fairly common.
>
Yes, I agree, this is a common use case. In my first mail in this thread
I suggested having a flexible format: non-sparse, like norms, in case
every document has exactly one value and all values have the same fixed
size; sparse and with a skip list if one or both of those conditions are
false. The DocumentsWriter would have to check whether both conditions
hold, in which case it would store the values non-sparse. The
SegmentMerger would only write the non-sparse format for the new segment
if all of the source segments also had the non-sparse format with the
same value size. I think this would give users the most flexibility.

> I think many apps have no trouble loading the array-of-ints entirely
> into RAM, either because there are not that many docs or because
> throwing RAM at the problem is fine (eg on a 64-bit JVM).
>
> From John's tests, the "load int[] directly from disk" took 186 msec
> vs the payload approach (using today's payloads API) took 430 msec.
>
> This is a sizable performance difference (2.3 X faster) and for
> interactive indexing apps, where minimizing cost of re-opening readers
> is critical, this is significant. Especially combining this with the
> ideas from LUCENE-831 (incrementally updating the FieldCache; maybe
> distributing the FieldCache down into sub-readers) should make
> re-opening + re-warming much faster than today.
>

Yes, definitely. I was planning to add a FieldCache implementation that
uses these per-doc payloads - it's one of the most obvious use cases.
However, I think providing an iterator in addition, like TermDocs, makes
sense too. People might have very big indexes, store values longer than
4-byte ints, or use more than one per-doc payload. In some tests I found
that performance is often still acceptable even if the values are not
cached (it's like having one more AND term in the query, since one more
"posting list" has to be processed).

> If so, wouldn't this API just fit under FieldCache? Ie "getInts(...)"
> would look at FieldInfo, determine that this field is stored
> column-stride, and load it as one big int array?
>

So I think a TermDocs-like iterator plus a new FieldCache implementation
would make sense? We could further make these fields updateable, like
norms?

-Michael
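
A rough sketch of the write-time decision described above (the class and
method names are made up purely for illustration; nothing like this
exists in Lucene yet):

// Rough sketch only: made-up names, just to illustrate the write-time
// decision described above; nothing here is an existing Lucene class or API.
final class ColumnStrideFormat {

  // The norms-like, non-sparse layout is only possible if every document in
  // the segment has exactly one value and all values share the same fixed
  // size; a value can then be addressed as docID * valueSize, no skip list.
  static boolean canUseNonSparse(int docsWithValue, int maxDoc,
                                 int minValueSize, int maxValueSize) {
    return docsWithValue == maxDoc && minValueSize == maxValueSize;
  }

  // A merged segment can only keep the non-sparse layout if every source
  // segment already used it with the same value size; otherwise the merger
  // has to fall back to the sparse, skip-list based layout.
  static boolean canMergeNonSparse(boolean[] sourceNonSparse,
                                   int[] sourceValueSizes) {
    for (boolean nonSparse : sourceNonSparse) {
      if (!nonSparse) {
        return false;
      }
    }
    for (int size : sourceValueSizes) {
      if (size != sourceValueSizes[0]) {
        return false;
      }
    }
    return true;
  }
}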
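
And a rough sketch of the proposed TermDocs-like iterator plus a
FieldCache-style getInts() built on top of it (again, these interfaces do
not exist; all names are hypothetical placeholders for the proposal):

import java.io.IOException;

// Rough sketch only: a hypothetical TermDocs-style iterator over per-doc
// values, and a FieldCache-style getInts() built on top of it.
interface ColumnStrideFieldIterator {
  boolean next() throws IOException;             // advance to the next doc that has a value
  boolean skipTo(int target) throws IOException; // like TermDocs.skipTo(), may use the skip list
  int doc();                                     // docID of the current value
  long longValue();                              // current value, assuming fixed-size numeric data
}

final class ColumnStrideFieldCache {

  // The non-sparse / load-into-RAM use case: read every value once into one
  // big array indexed by docID, similar in spirit to FieldCache.getInts()
  // but fed from the per-doc values instead of indexed terms.
  static int[] getInts(ColumnStrideFieldIterator values, int maxDoc)
      throws IOException {
    int[] cache = new int[maxDoc];
    while (values.next()) {
      cache[values.doc()] = (int) values.longValue();
    }
    return cache;
  }
}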