Re: Use DirectMonotonicWriter store sorted NumericDocValues

Robert Muir Tue, 15 Jun 2021 07:19:22 -0700

Well it definitely wouldn't be as useful as changing to a
postings-style approach. That would bring a lot more benefits to
general cases, e.g. use of PFOR and so on.
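
To make the comparison concrete, here is a minimal sketch of what a
postings-style layout could mean for doc values: values are cut into small
fixed-size blocks, and each block picks its own bit width from its own min
and max. The class name, the block size of 128, and the method names are
assumptions for illustration only. This is not Lucene codec code, it ignores
per-block header overhead, and it leaves out the exception handling that
PFOR adds for outliers.

```java
/** Hypothetical illustration only; not part of Lucene. */
final class BlockBitsSketch {
  static final int BLOCK_SIZE = 128; // assumed block size

  /** Bits needed for any offset in [0, range]; range is treated as unsigned. */
  static int bitsRequired(long range) {
    return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
  }

  /** Total bits for values[start, end) stored as (value - min) with one bit width. */
  static long bitsForRange(long[] values, int start, int end) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (int i = start; i < end; i++) {
      min = Math.min(min, values[i]);
      max = Math.max(max, values[i]);
    }
    return (long) bitsRequired(max - min) * (end - start);
  }

  /** One bit width for the whole segment. */
  static long totalBitsWholeSegment(long[] values) {
    return values.length == 0 ? 0 : bitsForRange(values, 0, values.length);
  }

  /** Each block is encoded independently, so a locally narrow range pays fewer bits. */
  static long totalBitsBlockwise(long[] values) {
    long total = 0;
    for (int start = 0; start < values.length; start += BLOCK_SIZE) {
      int end = Math.min(values.length, start + BLOCK_SIZE);
      total += bitsForRange(values, start, end);
    }
    return total;
  }

  public static void main(String[] args) {
    // Made-up data: 512 timestamps, one second apart, already in sorted order.
    long[] values = new long[512];
    for (int i = 0; i < values.length; i++) {
      values[i] = 1_000_000L + 1000L * i;
    }
    System.out.println("whole segment bits: " + totalBitsWholeSegment(values));
    System.out.println("blockwise bits:     " + totalBitsBlockwise(values));
  }
}
```

Because every block is sized on its own, this kind of layout does not need
the data to be sorted across the entire segment before it starts saving
space.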


But it is also easier to implement right now, to accelerate cases
where fields are sorted, without hurting other things.
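
As a rough sketch of that short-term variant: one pass over the values can
track the GCD and whether the values are ordered, and the monotonic encoding
is only chosen when the caller says the field is the configured index sort.
Everything here (class, method names, the boolean flag) is hypothetical and
written for illustration; it is not the code in Lucene80DocValuesConsumer,
and it ignores overflow in the subtraction.

```java
/** Hypothetical illustration only; not Lucene's doc values consumer. */
final class SortedScanSketch {

  /** What a single pass over one field's values would collect. */
  static final class Stats {
    long gcd;                  // GCD of (value - first value); 0 if all values are equal
    boolean ascending = true;  // values never decrease
    boolean descending = true; // values never increase
  }

  static long gcd(long a, long b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** Single pass: GCD and sortedness are computed together. */
  static Stats scan(long[] values) {
    Stats s = new Stats();
    for (int i = 1; i < values.length; i++) {
      s.gcd = gcd(s.gcd, values[i] - values[0]); // overflow ignored in this sketch
      if (values[i] < values[i - 1]) s.ascending = false;
      if (values[i] > values[i - 1]) s.descending = false;
    }
    return s;
  }

  /**
   * Only pick the monotonic encoding when the user explicitly sorted the index
   * by this field. How a descending run would actually be encoded is left open.
   */
  static boolean useMonotonicEncoding(Stats s, boolean fieldIsIndexSort) {
    return fieldIsIndexSort && (s.ascending || s.descending);
  }

  public static void main(String[] args) {
    Stats s = scan(new long[] {10, 20, 30, 50});
    System.out.println("gcd=" + s.gcd + " ascending=" + s.ascending);
    System.out.println("encode monotonic: " + useMonotonicEncoding(s, true));
  }
}
```

Note that a segment with a single document reports both flags as true, so
relying on detection alone would flip encodings from segment to segment;
gating on the configured index sort is what keeps the choice stable per
field.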

On Tue, Jun 15, 2021 at 9:53 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> SegmentWriteState has a reference to SegmentInfo, which itself has the index 
> sort, so I believe that it would be possible.
>
> I wonder how useful it would be in practice. E.g. in the Elasticsearch case, 
> even though we store lots of time-based data and have been looking into index 
> sorting for storage/query efficiency reasons, the index sorts that we are 
> interested in in practice look more like `host.name ASC, @timestamp DESC` 
> than just `@timestamp DESC`. The reason for sorting by `host` first is that 
> it helps a lot with storage/query efficiency of metadata that is tied to the 
> host (e.g. IP addresses, operating system, etc.), and then because 
> `host.name` is usually a low-cardinality field, queries by descending 
> timestamp remain super efficient thanks to LUCENE-9280. So we'd be more 
> interested in an optimization that would support piecewise monotonic fields.
>
> On Tue, Jun 15, 2021 at 3:33 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>> +1 to that idea. Maybe a shorter-term possibility would be to only do
>> this compression on a field when the user has explicitly configured
>> index sorting on the field (can we hackishly peek at it and tell?)
>>
>> On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >
>> > I believe that this sort of optimization would be more effective and 
>> > robust if we made doc values look more like postings, with relatively 
>> > small blocks of values that would get compressed independently and 
>> > decompressed in bulk. This way, we wouldn't require data to be sorted 
>> > across entire segments for this optimization to kick in, and we would be 
>> > less likely to slow down the normal case.
>> >
>> > On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcm...@gmail.com> wrote:
>> >>
>> >> We did this monotonic detection/compression before in older times, but
>> >> had to remove it because it caused too many slowdowns.
>> >>
>> >> I think it easily causes too much type pollution. For example, in a
>> >> typical large index with an unsorted docvalues field, big segments
>> >> won't be sorted, tiny segments with a few values might happen to be
>> >> sorted (depending on chance/luck), and tiny tiny ones with e.g. a single
>> >> document are always sorted. Now we have a mix of monotonic and
>> >> non-monotonic encodings over the same field.
>> >>
>> >> On the other hand, the optimization is very fragile and rarely applies:
>> >> even for these log users actually sorting on that field at index time, it
>> >> will just apply to one field out of the typical dozens/hundreds that they
>> >> like to have. But it may destroy performance of all the other fields, and
>> >> overall it causes more harm than good.
>> >>
>> >> On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xugan...@icloud.com.invalid> 
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > In Lucene80DocValuesConsumer#writeValues(FieldInfo field, 
>> >> > DocValuesProducer valuesProducer), all NumericDocValues are visited 
>> >> > to calculate the GCD. In the same pass, we could check whether all 
>> >> > values are sorted. If so, maybe we could use DirectMonotonicWriter to 
>> >> > store them. DirectMonotonicWriter can achieve impressive compression.
>> >> >
>> >> > In addition, when I use Elasticsearch to store numeric field types, at 
>> >> > the Lucene level the data is always stored at least as 
>> >> > NumericDocValues/SortedNumericDocValues. So when indexing sorted 
>> >> > values like ID or TIMESTAMP, maybe the above optimization is applicable.
>> >> >
>> >> > Could I have some suggestions?
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >> >
>> >>
>> >
>> >
>> > --
>> > Adrien
>>
>
>
> --
> Adrien
