Hi,

On Tue, Apr 9, 2013 at 5:22 PM, Wei Wang <welshw...@gmail.com> wrote:
> DocValues makes fast per doc value lookup possible, which is nice. But it
> brings other interesting issues.
>
> Assume there are 100M docs and 200 NumericDocValuesFields, this ends up
> with huge number of disk and memory usage, even if there are just thousands
> of values for each field. I guess this is because Lucene stores a value for
> each DocValues field of each document, with variable-length codec.

The default codec stores numeric doc values in blocks of 4096 values,
each block having its own number of bits per value. If most of these
blocks end up empty, doc values will require little space, but in the
worst case, where each block contains a single value, it is true that
memory and disk usage will be very inefficient.
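
To make the best and worst cases concrete, here is a rough sketch of
how block-wise packing behaves. It is a simplified model, not the
codec's actual code: the class and method names are made up for
illustration, missing values are assumed to be stored as 0, each block
is assumed to cost bitsRequired(max - min) bits per value, and
per-block headers are ignored.

```java
// Simplified model of block-wise bit packing (hypothetical names, not Lucene code).
final class BlockPackingEstimate {
    // Block size taken from the description above.
    static final int BLOCK_SIZE = 4096;

    // Number of bits needed to represent a non-negative range; 0 when range == 0.
    static int bitsRequired(long range) {
        return 64 - Long.numberOfLeadingZeros(range);
    }

    // Estimated payload size in bits for one field.
    // values[doc] holds the value of the doc, or 0 when it is missing (assumption).
    // Assumes max - min does not overflow a long within a block.
    static long estimateBits(long[] values) {
        long total = 0;
        for (int start = 0; start < values.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, values.length);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            // Every slot in the block pays the same width, present or not.
            int bits = bitsRequired(max - min);
            total += (long) bits * (end - start);
        }
        return total;
    }
}
```

Under this model a block in which every document is missing costs
nothing, while a block with a single non-zero value makes all 4096
slots pay the width of that value.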

> So in such a scenario, is it possible to store values only for the DocValues
> field of the documents that actually have a value for that field? Or does
> Lucene have a column storage mechanism, sort of like a hash map, for DocValues:
>
> key: the docId that has a value for the DocValues field
> value: the value of the DocValues field

Lucene doesn't have a HashMap-like storage for doc values, although it
would be doable to build a DocValuesFormat that would work this way.

However, for your problem, I would recommend that you encode your
numeric data on top of BinaryDocValues. Unlike NumericDocValues,
BinaryDocValues require very little space for missing values. All you
need is conversion methods between your numeric data and byte arrays.
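
As a minimal sketch of such conversion methods, assuming a fixed
8-byte big-endian encoding (any reversible encoding would do): the
class and method names below are hypothetical, and the Lucene wiring
is left out. The idea would be to wrap the resulting bytes in a
BytesRef, add them with a BinaryDocValuesField only to documents that
have a value, and decode on the read side.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: converts between long and byte[].
final class LongBytesCodec {

    // Encodes a long as 8 big-endian bytes (ByteBuffer's default byte order).
    static byte[] toBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Decodes a long from a slice of a byte array. The (bytes, offset, length)
    // triple mirrors how a byte slice is usually handed back by the reader.
    static long fromBytes(byte[] bytes, int offset, int length) {
        if (length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + length);
        }
        return ByteBuffer.wrap(bytes, offset, length).getLong();
    }
}
```

A length other than 8 is rejected here, so the caller has to decide
how to treat documents without a value, for example by checking for an
empty slice before decoding.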

-- 
Adrien
