Hi,

On Tue, Apr 9, 2013 at 5:22 PM, Wei Wang <welshw...@gmail.com> wrote:
> DocValues makes fast per doc value lookup possible, which is nice. But it
> brings other interesting issues.
>
> Assume there are 100M docs and 200 NumericDocValuesFields, this ends up
> with huge number of disk and memory usage, even if there are just thousands
> of values for each field. I guess this is because Lucene stores a value for
> each DocValues field of each document, with variable-length codec.

The default codec stores numeric doc values in blocks of 4096 values, and
each block has its own number of bits per value. If most of these blocks end
up empty, doc values will require little space. But in the worst case, where
each block contains a single value, it is true that memory and disk usage
will be very inefficient.

> So in such scenario, is it possible only store values for the DocValues
> field of the document that actually has a value for that field? Or does
> Lucene has a column storage mechanism sort of like hash map for DocValues:
>
> key: the docId that has a value for the DocValues field
> value: the value of the DocValues field

Lucene doesn't have HashMap-like storage for doc values, although it would
be doable to build a DocValuesFormat that works this way. However, for your
problem, I would recommend that you encode your numeric data on top of
BinaryDocValues. Unlike NumericDocValues, BinaryDocValues require very
little space for missing values. All you need is conversion methods between
your numeric data and byte arrays.

--
Adrien
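
P.S. Here is a rough sketch of what the conversion methods between numeric
data and byte arrays could look like. The NumericBytes class and its method
names are made up for illustration and are not part of Lucene. The
variable-length (zigzag + 7 bits per byte) encoding is just one possible
choice, picked here to keep small values short. A fixed 8-byte encoding
would work as well.

```java
// Hypothetical helper, not part of Lucene: converts between long values and
// compact byte arrays that could be stored in a binary doc values field.
final class NumericBytes {

    private NumericBytes() {}

    // Encodes a long as 1 to 10 bytes. Zigzag mapping comes first so that
    // small negative numbers stay small, then 7 payload bits are written per
    // byte, with the high bit set on every byte except the last one.
    static byte[] encode(long value) {
        long zigzag = (value << 1) ^ (value >> 63);
        byte[] buffer = new byte[10];
        int length = 0;
        while ((zigzag & ~0x7FL) != 0) {
            buffer[length++] = (byte) ((zigzag & 0x7F) | 0x80);
            zigzag >>>= 7;
        }
        buffer[length++] = (byte) zigzag;
        byte[] result = new byte[length];
        System.arraycopy(buffer, 0, result, 0, length);
        return result;
    }

    // Decodes bytes[offset .. offset + length) as produced by encode().
    // Callers should treat length == 0 as "no value for this document"
    // before calling this method.
    static long decode(byte[] bytes, int offset, int length) {
        long zigzag = 0;
        for (int i = 0; i < length; i++) {
            zigzag |= (long) (bytes[offset + i] & 0x7F) << (7 * i);
        }
        return (zigzag >>> 1) ^ -(zigzag & 1);
    }
}
```

At index time the encoded array would be wrapped in a BytesRef and added as a
BinaryDocValuesField, only for the documents that actually have a value. At
search time you would pass the offset and length of the returned BytesRef to
decode(). I am assuming here that documents without a value come back as an
empty BytesRef, so please check that against the version you are using.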