+1 to that idea. Maybe a shorter-term possibility would be to only do
this compression on a field when the user has explicitly configured
index sorting on the field (can we hackishly peek at it and tell?)
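
To make that gating concrete, here is a rough standalone sketch (plain
Java, not Lucene code). It does the sortedness check in the same pass as
the gcd computation, but only picks the monotonic encoding when the caller
says the field is the index-sort field. The class, the enum and the
isIndexSortField flag are all hypothetical names; how the consumer would
actually learn about the index sort is exactly the open question above.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class MonotonicGate {

    enum Encoding { PLAIN, GCD, MONOTONIC }

    // Euclid's algorithm on absolute values; gcd(0, x) == |x|.
    static long gcd(long a, long b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // One pass over the values: track the gcd of (v - first) and whether the
    // sequence is non-decreasing. Assumes the differences fit in a long.
    static Encoding choose(long[] values, boolean isIndexSortField) {
        if (values.length == 0) {
            return Encoding.PLAIN;
        }
        long first = values[0];
        long prev = first;
        long g = 0;
        boolean sorted = true;
        for (int i = 1; i < values.length; i++) {
            long v = values[i];
            if (v < prev) {
                sorted = false;
            }
            g = gcd(g, v - first);
            prev = v;
        }
        // Gate: a segment that merely happens to be sorted (e.g. one doc)
        // does not switch encodings unless the field is the index-sort field.
        if (sorted && isIndexSortField) {
            return Encoding.MONOTONIC;
        }
        return g > 1 ? Encoding.GCD : Encoding.PLAIN;
    }
}
```

With this gate, a single-document segment of an ordinary field stays on
the normal path even though it is trivially sorted, which is the mixing
problem Robert describes below.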

On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> I believe that this sort of optimization would be more effective and robust 
> if we made doc values look more like postings, with relatively small blocks 
> of values that would get compressed independently and decompressed in bulk. 
> This way, we wouldn't require data to be sorted across entire segments for 
> this optimization to kick in, and we would be less likely to slow down the 
> normal case.
>
> On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>> We did this monotonic detection/compression before in older times, but
>> had to remove it because it caused too many slowdowns.
>>
>> I think it easily causes too much type pollution. For example, for a
>> typical large index with an unsorted docvalues field, big segments
>> won't be sorted, tiny segments with a few values might happen to be
>> sorted (depending on chance/luck), and tiny tiny ones with e.g. a single
>> document are sorted. Now we have a mix of monotonic and non-monotonic
>> over the same field.
>>
>> On the other hand, the optimization is very fragile and rare: even for
>> these log users actually sorting on that field at index-time, it will
>> just apply to one field out of the somehow typical dozens/hundreds
>> that they like to have. But it may destroy performance of all the other
>> fields, and overall it causes more harm than good.
>>
>> On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xugan...@icloud.com.invalid> wrote:
>> >
>> > Hi,
>> >
>> > In Lucene80DocValuesConsumer#writeValues(FieldInfo field,
>> > DocValuesProducer valuesProducer), all NumericDocValues are visited to
>> > calculate the gcd. In the meantime, we could check whether all values
>> > are sorted. If so, maybe we could use DirectMonotonicWriter to store
>> > them. DirectMonotonicWriter can achieve impressive compression.
>> >
>> > In addition, when I use Elasticsearch to store numeric field types, at
>> > the Lucene level the data is always stored at least as
>> > NumericDocValues/SortedNumericDocValues. So when indexing sorted
>> > values like IDs or timestamps, maybe the above optimization is applicable.
>> >
>> > Could I have some suggestions?
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>
>
> --
> Adrien
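
For the block-based idea quoted above, a rough standalone sketch of what
"compress small blocks independently" could look like at the decision
level (plain Java, not Lucene code). Each block picks delta encoding only
if that block is non-decreasing and the largest consecutive delta needs
fewer bits than the block's value range. The class name, the method and
the block size are assumptions for illustration, and the cost of storing
each block's first value is ignored.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class BlockwiseSketch {

    // Bits needed to represent a non-negative value (0 needs 0 bits here).
    static int bits(long nonNegative) {
        return 64 - Long.numberOfLeadingZeros(nonNegative);
    }

    // For each block of blockSize values, decide whether delta encoding
    // would beat plain offset-from-min encoding for that block alone.
    static boolean[] useDeltaPerBlock(long[] values, int blockSize) {
        int numBlocks = (values.length + blockSize - 1) / blockSize;
        boolean[] useDelta = new boolean[numBlocks];
        for (int b = 0; b < numBlocks; b++) {
            int start = b * blockSize;
            int end = Math.min(start + blockSize, values.length);
            long min = values[start];
            long max = values[start];
            long maxDelta = 0;
            boolean sorted = true;
            for (int i = start + 1; i < end; i++) {
                long v = values[i];
                if (v < values[i - 1]) {
                    sorted = false;
                } else {
                    maxDelta = Math.max(maxDelta, v - values[i - 1]);
                }
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
            // Assumes max - min does not overflow a long.
            useDelta[b] = sorted && bits(maxDelta) < bits(max - min);
        }
        return useDelta;
    }
}
```

Calling useDeltaPerBlock on a mostly unsorted field with a block size of,
say, 128 would flag only the runs that happen to be sorted, so sortedness
never has to hold across a whole segment.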
