Re: Use DirectMonotonicWriter store sorted NumericDocValues

LuXugang Wed, 16 Jun 2021 18:48:43 -0700

Thanks, Robert and Adrien. Your replies are helpful to me.

> On Jun 15, 2021, at 10:19 PM, Robert Muir <rcm...@gmail.com> wrote:
> 
> Well it definitely wouldn't be as useful as changing to a
> postings-style approach. That would bring a lot more benefits to
> general cases, e.g. use of PFOR and so on.
> 
> But it is also easier to implement right now, to accelerate cases
> where fields are sorted, without hurting other things.
> 
> On Tue, Jun 15, 2021 at 9:53 AM Adrien Grand <jpou...@gmail.com> wrote:
>> 
>> SegmentWriteState has a reference to SegmentInfo, which itself has the index 
>> sort, so I believe that it would be possible.
>> 
>> I wonder how useful it would be in practice. E.g. in the Elasticsearch case, 
>> even though we store lots of time-based data and have been looking into 
>> index sorting for storage/query efficiency reasons, the index sorts that we 
>> are interested in in practice look more like `host.name ASC, @timestamp 
>> DESC` than just `@timestamp DESC`. The reason for sorting by `host` first is 
>> that it helps a lot with storage/query efficiency of metadata that is tied 
>> to the host (e.g. IP addresses, operating system, etc.), and then because 
>> `host.name` is usually a low-cardinality field, queries by descending 
>> timestamp remain super efficient thanks to LUCENE-9280. So we'd be more 
>> interested in an optimization that would support piecewise monotonic fields.
>> 
>> On Tue, Jun 15, 2021 at 3:33 PM Robert Muir <rcm...@gmail.com> wrote:
>>> 
>>> +1 to that idea. Maybe a shorter-term possibility would be to only do
>>> this compression on a field when the user has explicitly configured
>>> index sorting on the field (can we hackishly peek at it and tell?)
>>> 
>>> On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpou...@gmail.com> wrote:
>>>> 
>>>> I believe that this sort of optimization would be more effective and 
>>>> robust if we made doc values look more like postings, with relatively 
>>>> small blocks of values that would get compressed independently and 
>>>> decompressed in bulk. This way, we wouldn't require data to be sorted 
>>>> across entire segments for this optimization to kick in, and we would be 
>>>> less likely to slow down the normal case.
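A rough sketch of why small, independently compressed blocks help, written as standalone Java rather than Lucene code. The class `BlockwiseCost` and its methods are hypothetical names made up for this illustration. It only estimates payload bits per block and ignores per-block headers (first value, minimum delta, bit width), so the numbers are a lower bound and not what any real codec would produce.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of Lucene. */
final class BlockwiseCost {

  /** Bits needed to represent a non-negative value (0 needs 0 bits here). */
  static int bitsRequired(long x) {
    return 64 - Long.numberOfLeadingZeros(x);
  }

  /**
   * Payload bits per value for one block, taking the cheaper of two strategies:
   * offsets from the block minimum, or deltas from the previous value with the
   * smallest delta subtracted. Assumes differences do not overflow a long.
   */
  static int bitsPerValue(long[] block) {
    if (block.length < 2) {
      return 0; // a single value lives entirely in the (ignored) header
    }
    long min = block[0], max = block[0];
    long minDelta = Long.MAX_VALUE, maxDelta = Long.MIN_VALUE;
    for (int i = 1; i < block.length; i++) {
      min = Math.min(min, block[i]);
      max = Math.max(max, block[i]);
      long delta = block[i] - block[i - 1];
      minDelta = Math.min(minDelta, delta);
      maxDelta = Math.max(maxDelta, delta);
    }
    int offsetBits = bitsRequired(max - min);
    int deltaBits = bitsRequired(maxDelta - minDelta);
    return Math.min(offsetBits, deltaBits);
  }

  /** Sum of payload bits when values are cut into blocks of blockSize. */
  static long totalBits(long[] values, int blockSize) {
    long total = 0;
    for (int start = 0; start < values.length; start += blockSize) {
      int end = Math.min(values.length, start + blockSize);
      long[] block = Arrays.copyOfRange(values, start, end);
      total += (long) bitsPerValue(block) * block.length;
    }
    return total;
  }
}
```

With made-up data that is only piecewise monotonic, e.g. descending timestamps for one host followed by descending timestamps for another (`1000, 991, 979, 970, 5000, 4990, 4981, 4970`), `totalBits(values, 4)` gives 16 payload bits while treating all eight values as one block gives 96, because the jump between the two runs inflates the bit width for everything in the larger block.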
>>>> 
>>>> On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcm...@gmail.com> wrote:
>>>>> 
>>>>> We did this monotonic detection/compression before in older times, but
>>>>> had to remove it because it caused too many slowdowns.
>>>>> 
>>>>> I think it easily causes too much type pollution. For example, in a
>>>>> typical large index with an unsorted docvalues field, big segments
>>>>> won't be sorted, tiny segments with a few values might happen to be
>>>>> sorted (depending on chance/luck), and tiny tiny ones with e.g. a single
>>>>> document are sorted. Now we have a mix of monotonic and non-monotonic
>>>>> encodings over the same field.
>>>>> 
>>>>> On the other hand, the optimization is very fragile and rare: even for
>>>>> log users who actually sort on that field at index time, it will
>>>>> apply to just one field out of the typical dozens or hundreds
>>>>> that they like to have. But it may destroy performance of all the other
>>>>> fields, and overall it causes more harm than good.
>>>>> 
>>>>> On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xugan...@icloud.com.invalid> 
>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> In Lucene80DocValuesConsumer#writeValues(FieldInfo field, 
>>>>>> DocValuesProducer valuesProducer), all NumericDocValues are visited 
>>>>>> to calculate the gcd. In the meantime, we could check whether all values 
>>>>>> are sorted. If so, maybe we could use DirectMonotonicWriter to store them. 
>>>>>> DirectMonotonicWriter can achieve impressive compression.
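A minimal sketch of the proposal as stated: one pass over the values that tracks the gcd and, at the same time, whether the values are non-decreasing. This is standalone Java, not the actual Lucene80DocValuesConsumer code. The class `ValueStats` and its members are hypothetical names, and the sketch assumes the differences between values do not overflow a long.

```java
/** Hypothetical illustration only; not part of Lucene. */
final class ValueStats {
  long min = Long.MAX_VALUE;
  long max = Long.MIN_VALUE;
  long gcd = 0;                  // gcd of (value - first value); 0 if all values are equal
  boolean nonDecreasing = true;  // stays true only if every value >= its predecessor
  long count = 0;

  private long first;
  private long previous;

  /** Feed one value; every statistic is updated in the same pass. */
  void accept(long v) {
    if (count == 0) {
      first = v;
    } else {
      if (v < previous) {
        nonDecreasing = false;
      }
      // Assumption: v - first does not overflow.
      gcd = euclid(gcd, v - first);
    }
    min = Math.min(min, v);
    max = Math.max(max, v);
    previous = v;
    count++;
  }

  /** Greatest common divisor of |a| and |b| by Euclid's algorithm. */
  static long euclid(long a, long b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** Convenience factory for small examples. */
  static ValueStats of(long... values) {
    ValueStats stats = new ValueStats();
    for (long v : values) {
      stats.accept(v);
    }
    return stats;
  }
}
```

For example, `ValueStats.of(100, 110, 110, 150)` reports `nonDecreasing == true` and `gcd == 10`, whereas `ValueStats.of(30, 10, 20)` reports `nonDecreasing == false`. The extra comparison per value is cheap; the cost discussed in the replies comes from ending up with different encodings for the same field across segments, which this sketch does not address.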
>>>>>> 
>>>>>> In addition, when I use Elasticsearch to store numeric field types, at 
>>>>>> the Lucene level the data is always stored at least as 
>>>>>> NumericDocValues/SortedNumericDocValues. So when indexing sorted 
>>>>>> values such as IDs or timestamps, maybe the optimization above is applicable.
>>>>>> 
>>>>>> Could I have some suggestions?
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Adrien
>>> 
>>> 
>> 
>> 
>> --
>> Adrien
> 
>


