On 2/13/2013 2:42 AM, Adrien Grand wrote:
Doc values are like FieldCache except that you don't need to uninvert
values from the inverted index whenever you open a new Reader. I think
there are two reasons why you would like to turn doc values on:

Confession -- that's almost gibberish to me! At my current level of understanding, the pieces make some semblance of sense, but the whole thing falls apart before my head grasps it. My fault, not yours. :)

  - if you are indexing a field only for faceting, sorting or grouping
(not searching), setting indexed=false and docValues=true will provide
the same functionality and be lighter, both at indexing time (no need
to invert the field) and when opening a new IndexReader (no need to
uninvert the field),

I have some fields that mostly get used for sorting. The most common field used for sorting is a seconds-since-epoch timestamp simply stored as a long (source is MySQL bigint). We have another copy of it in tdate format that we use for date range searches. I'll need to ask whether they are using it for searching or filtering before I make the long version indexed=false.
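
As a rough sketch of what that change might look like in schema.xml (the field names here are made up, and the "long" and "tdate" types are assumed to be the ones from the stock example schema):

```xml
<!-- Hypothetical names. Sort-only copy of the timestamp: not inverted, doc values on. -->
<!-- stored="true" is an assumption; set it only if the value is returned in results. -->
<field name="post_epoch" type="long" indexed="false" stored="true" docValues="true"/>

<!-- Searchable copy used for date range queries; stored="false" is also an assumption. -->
<field name="post_date" type="tdate" indexed="true" stored="false"/>
```

Depending on the release, docValues fields may carry extra constraints (for example around required or default values), so this would need checking against the version in use.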

  - if the field is also used for searching, turning doc values on will
give your Lucene index a little more work at indexing time (not a big
deal in my opinion) but it will be faster to open (especially
interesting if you're doing near-realtime search) and likely more
memory-efficient.

I have a lot more index headroom thanks to stored/termvector compression. My indexes fit entirely in available RAM now! Even before the upgrade, not all of the index data was being cached and I still had free RAM, so I have plenty of room for index growth. I just have to convince them to start using the upgraded index copy so I can upgrade the other one.

However, doc values are useless for searching, so there is no need to
turn them on for a field that is used solely for searching.

Similarly to stored fields, doc values could help you retrieve the
value of a field, but the trade-off is very different: stored fields
are better at retrieving many fields of a single document efficiently
while doc values are good at retrieving one field for a lot of
documents efficiently. So if you want to get a field's value in the
response, you should keep setting stored=true. There might be
optimizations in the future for example if you're only asking for a
single field which has doc values, but this will be transparent to
you.

This suggests that adding doc values to the uniqueKey field would be a good idea for distributed searching in general, since the first phase of a distributed search only retrieves that field and the score. That assumes, of course, that doc values are actually used for retrieving fields during that initial phase.
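
For illustration only, assuming the uniqueKey field is a plain string field named id (as in the example schema), that would amount to something like:

```xml
<!-- Assumed uniqueKey definition with doc values added; the field name and type are guesses. -->
<field name="id" type="string" indexed="true" stored="true" required="true" docValues="true"/>
<uniqueKey>id</uniqueKey>
```

Whether the first phase would actually read the key from doc values rather than from stored fields is the part I have not verified.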

Generally when we search, we retrieve all stored fields, so I will keep those around. We already don't store every field, and advances we've made on the client side will probably allow me to stop storing more of them, further reducing our index size.

Thanks,
Shawn

