SegmentWriteState has a reference to SegmentInfo, which itself has the index sort, so I believe that it would be possible.
I wonder how useful it would be in practice. E.g. in the Elasticsearch case, even though we store lots of time-based data and have been looking into index sorting for storage/query efficiency reasons, the index sorts that we are interested in in practice look more like `host.name ASC, @timestamp DESC` than just `@timestamp DESC`. The reason for sorting by `host` first is that it helps a lot with storage/query efficiency of metadata that is tied to the host (e.g. IP addresses, operating system, etc.), and then because `host.name` is usually a low-cardinality field, queries by descending timestamp remain super efficient thanks to LUCENE-9280 <https://issues.apache.org/jira/browse/LUCENE-9280>. So we'd be more interested in an optimization that would support piecewise monotonic fields.

On Tue, Jun 15, 2021 at 3:33 PM Robert Muir <rcm...@gmail.com> wrote:

> +1 to that idea. Maybe a shorter-term possibility would be to only do
> this compression on a field when the user has explicitly configured
> index sorting on the field (can we hackishly peek at it and tell?)
>
> On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > I believe that this sort of optimization would be more effective and
> > robust if we made doc values look more like postings, with relatively small
> > blocks of values that would get compressed independently and decompressed
> > in bulk. This way, we wouldn't require data to be sorted across entire
> > segments for this optimization to kick in, and we would be less likely to
> > slow down the normal case.
> >
> > On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcm...@gmail.com> wrote:
> >>
> >> We did this monotonic detection/compression before in older times, but
> >> had to remove it because it caused too many slowdowns.
> >>
> >> I think it easily causes too much type pollution. For example, for a
> >> typical large index with an unsorted docvalues field, big segments
> >> won't be sorted, tiny segments with a few values might happen to be
> >> sorted (depending on chance/luck), and tiny tiny ones with e.g. a single
> >> document are sorted. Now we have a mix of monotonic and non-monotonic
> >> over the same field.
> >>
> >> On the other hand, the optimization is very fragile and rare: even for
> >> these log users actually sorting on that field at index-time, it will
> >> just apply to one field out of the somehow typical dozens/hundreds
> >> that they like to have. But it may destroy performance of all the other
> >> fields and overall cause more harm than good.
> >>
> >> On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xugan...@icloud.com.invalid> wrote:
> >> >
> >> > Hi,
> >> >
> >> > In class Lucene80DocValuesConsumer#writeValues(FieldInfo field,
> >> > DocValuesProducer valuesProducer), all numericDocValues will be visited to
> >> > calculate gcd; in the meantime, we can check if all values were sorted. If
> >> > so, maybe we could use DirectMonotonicWriter to store them.
> >> > DirectMonotonicWriter can get impressive compression.
> >> >
> >> > In addition, when I use Elasticsearch to store numeric field types,
> >> > at the Lucene level the data is always stored at least as
> >> > NumericDocValues/SortedNumericDocValues. So when indexing some sorted
> >> > values like ID or TIMESTAMP, maybe the above optimization is applicable.
> >> >
> >> > Could I have some suggestions?
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> > --
> > Adrien

--
Adrien
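
For illustration, here is a minimal sketch of the single-pass check proposed at the start of this thread (compute the gcd and detect sortedness in the same loop), extended to count descending runs so that the "piecewise monotonic" case is visible. This is not Lucene code: `MonotonicStats`, `of` and `descendingRuns` are hypothetical names, the values are a plain `long[]` rather than a doc-values iterator, and the sketch assumes that deltas between consecutive values do not overflow a `long`.

```java
/**
 * Hypothetical helper, not part of Lucene: gathers in one pass the statistics
 * discussed in this thread for a column of numeric values in doc-ID order.
 */
final class MonotonicStats {
    final long gcd;           // gcd of all consecutive deltas (0 if there are no non-zero deltas)
    final boolean ascending;  // true if values never decrease across the whole column
    final int descendingRuns; // number of maximal non-increasing runs

    private MonotonicStats(long gcd, boolean ascending, int descendingRuns) {
        this.gcd = gcd;
        this.ascending = ascending;
        this.descendingRuns = descendingRuns;
    }

    static MonotonicStats of(long[] values) {
        long g = 0;
        boolean ascending = true;
        int runs = values.length == 0 ? 0 : 1;
        for (int i = 1; i < values.length; i++) {
            // Assumption: the subtraction does not overflow.
            long delta = values[i] - values[i - 1];
            g = gcd(g, delta);
            if (delta < 0) {
                ascending = false; // one decrease is enough to rule out a global ascending order
            }
            if (delta > 0) {
                runs++; // an increase ends the current descending run
            }
        }
        return new MonotonicStats(g, ascending, runs);
    }

    // Euclid's algorithm on absolute values; gcd(0, x) == |x|.
    private static long gcd(long a, long b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    public static void main(String[] args) {
        // Timestamps as they would appear under a (host ASC, timestamp DESC) sort:
        // host a -> 30, 20, 10; host b -> 25, 15; host c -> 40.
        long[] column = {30, 20, 10, 25, 15, 40};
        MonotonicStats s = MonotonicStats.of(column);
        System.out.println("gcd=" + s.gcd + " ascending=" + s.ascending
            + " descendingRuns=" + s.descendingRuns);
    }
}
```

In the example column, a whole-segment monotonic check fails even though each per-host run is monotonic, which is why an encoding that only triggers on fully sorted segments would not help with a `host.name ASC, @timestamp DESC` sort.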