"Michael Busch" <[EMAIL PROTECTED]> wrote: > Michael McCandless wrote: > > > > Michael, are you thinking that the storage would/could be non-sparse > > (like norms), and loaded/cached once in memory, especially for fixed > > size fields? EG a big array of ints of length maxDocID? In John's > > original case, every doc has this UID int field; I think this is > > fairly common. > > > > Yes I agree, this is a common use case. In my first mail in this thread > I suggested to have a flexible format. Non-sparse, like norms, in case > every document has one value and all values have the same fixed size. > Sparse and with a skip list if one or both conditions are false. > > The DocumentsWriter would have to check whether both conditions are > true, in which case it would store the values non-sparse. The > SegmentMerger would only write the non-sparse format for the new segment > if all of the source segments also had the non-sparse format with the > same value size. > > This would provide the most flexibility for the users I think.
OK, got it.  So in the case where I always put a field "UID" on every
document, always a 4-byte binary field, Lucene will "magically" store
this as a non-sparse column-stride field for every segment.  But I
still have to mark the Field as "column-stride storage", right?

Even if some docs do not have the field, it is still beneficial to
store it non-sparse up to a point.  EG the logic in
BitVector.isSparse() does a similar calculation.  This is only possible
when the field, when set on a document, always has the same length in
bytes.  Maybe we should also allow users to explicitly state how they
want this field stored (sparse or non-sparse), rather than having
Lucene choose?

New question: how would we handle a "boolean" type column-stride
stored field?  It seems like we should always use BitVector, since it
already handles the sparse/non-sparse storage decision "under the
hood"?

> > I think many apps have no trouble loading the array-of-ints entirely
> > into RAM, either because there are not that many docs or because
> > throwing RAM at the problem is fine (eg on a 64-bit JVM).
> >
> > From John's tests, the "load int[] directly from disk" approach took
> > 186 msec vs 430 msec for the payload approach (using today's
> > payloads API).
> >
> > This is a sizable performance difference (2.3X faster), and for
> > interactive indexing apps, where minimizing the cost of re-opening
> > readers is critical, this is significant.  Especially combining this
> > with the ideas from LUCENE-831 (incrementally updating the
> > FieldCache; maybe distributing the FieldCache down into sub-readers)
> > should make re-opening + re-warming much faster than today.
>
> Yes, definitely. I was planning to add a FieldCache implementation
> that uses these per-doc payloads - it's one of the most obvious use
> cases. However, I think providing an iterator in addition, like
> TermDocs, makes sense too.
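The "non-sparse up to a point" tradeoff mentioned above could be modeled as a simple size comparison, in the spirit of what BitVector.isSparse() does for deleted docs. This sketch is an assumption for illustration only: the 4-byte per-entry sparse overhead and the cost model are invented here, not Lucene's actual numbers.

```java
// Hedged sketch of an isSparse()-style size tradeoff.  The cost model
// and the 4-byte per-entry sparse overhead are assumptions, chosen
// only to illustrate the shape of the decision.
class SparseHeuristic {
    // Dense (non-sparse) storage costs maxDoc * valueSize bytes,
    // regardless of how many docs actually carry the field.
    static long denseBytes(int maxDoc, int valueSize) {
        return (long) maxDoc * valueSize;
    }

    // Sparse storage pays a per-entry overhead (docID delta + skip
    // data) on top of each value; 4 bytes/entry is an assumption.
    static long sparseBytes(int docsWithField, int valueSize) {
        return (long) docsWithField * (valueSize + 4);
    }

    // Store non-sparse as long as it is no larger than the sparse form.
    static boolean preferDense(int docsWithField, int maxDoc, int valueSize) {
        return denseBytes(maxDoc, valueSize)
            <= sparseBytes(docsWithField, valueSize);
    }
}
```

With 4-byte values this prefers the dense form whenever at least half the docs carry the field, which matches the intuition that a mostly-full column is cheaper stored flat.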
> People might have very big indexes, store values longer than 4-byte
> ints, or use more than one per-doc payload. In some tests I found
> that the performance is still often acceptable, even if the values
> are not cached. (It's like having one more AND-term in the query, as
> one more "posting list" has to be processed.)
>
> > If so, wouldn't this API just fit under FieldCache?  Ie
> > "getInts(...)" would look at FieldInfo, determine that this field is
> > stored column-stride, and load it as one big int array?
>
> So I think a TermDocs-like iterator plus a new FieldCache
> implementation would make sense?

OK, I agree: we should have an iterator API as well, so that you can
process this posting list "document at a time", just like all other
terms in the query.

> We could further make these fields updateable, like norms?

Agreed, though how would the API work (if indeed we are just adding
"column-stride[-non]-sparse" options to Field)?  Because if the Field
is also indexed, we can't update the indexed part.  I think I can see
why you wanted to make a new API here :)

Mike
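The "document at a time" iterator agreed on above might look something like the following. This is a hypothetical sketch, not an actual Lucene interface: the names mirror the TermDocs next()/doc() style discussed in the thread, and the array-backed implementation stands in for a real on-disk reader.

```java
// Illustrative TermDocs-style iterator over per-document values, as
// discussed above.  The interface and the array-backed implementation
// are hypothetical, not an actual Lucene API.
interface DocValuesIterator {
    boolean next();   // advance to the next doc that has a value
    int doc();        // current docID
    int intValue();   // current value, as an int
}

class ArrayDocValuesIterator implements DocValuesIterator {
    private final int[] values;  // -1 marks "no value for this doc"
    private int docID = -1;

    ArrayDocValuesIterator(int[] values) { this.values = values; }

    public boolean next() {
        // Skip docs that carry no value, like a sparse posting list.
        while (++docID < values.length) {
            if (values[docID] != -1) return true;
        }
        return false;
    }

    public int doc() { return docID; }
    public int intValue() { return values[docID]; }
}
```

Processed this way, the field behaves like one more posting list being AND-ed into the query, which is exactly the cost model described in the quoted text.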