+1 to that idea. Maybe a shorter-term possibility would be to only do this compression on a field when the user has explicitly configured index sorting on the field (can we hackishly peek at it and tell?)
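
A rough sketch of what such a gate could look like, purely illustrative and not actual codec code. It assumes the consumer can see the segment's index sort (in Lucene this would presumably come from something like `SegmentInfo.getIndexSort()`), which is modeled here as a plain list of field names. `MonotonicGate`, `scan` and `useMonotonic` are hypothetical names. The sortedness check piggybacks on the same pass that computes the GCD, as proposed in the original mail below.

```java
import java.util.List;

/** Illustrative stand-in for the decision discussed in this thread; not Lucene code. */
final class MonotonicGate {

  /** What a single pass over a field's values would collect. */
  static final class Stats {
    final long gcd;       // GCD of the differences from the first value
    final boolean sorted; // true if values are non-decreasing in doc ID order

    Stats(long gcd, boolean sorted) {
      this.gcd = gcd;
      this.sorted = sorted;
    }
  }

  /** Euclid's algorithm on absolute values (overflow handling omitted in this sketch). */
  static long gcd(long a, long b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** One pass: compute the GCD and check sortedness at the same time. */
  static Stats scan(long[] values) {
    long g = 0;
    boolean sorted = true;
    for (int i = 1; i < values.length; i++) {
      if (values[i] < values[i - 1]) {
        sorted = false;
      }
      g = gcd(g, values[i] - values[0]);
    }
    return new Stats(g, sorted);
  }

  /**
   * Only consider monotonic encoding when the field is the primary index sort key.
   * Assumption: secondary sort keys are ignored, since they are only ordered within ties.
   */
  static boolean useMonotonic(String field, List<String> indexSortFields, long[] values) {
    if (indexSortFields.isEmpty() || !indexSortFields.get(0).equals(field)) {
      return false; // no explicit index sort on this field: keep the normal path
    }
    return scan(values).sorted;
  }
}
```

With a gate like this, a tiny segment that happens to be sorted by luck would not switch encodings, because the decision depends on the configured sort first and on the data second.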

On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> I believe that this sort of optimization would be more effective and robust
> if we made doc values look more like postings, with relatively small blocks
> of values that would get compressed independently and decompressed in bulk.
> This way, we wouldn't require data to be sorted across entire segments for
> this optimization to kick in, and we would be less likely to slow down the
> normal case.
>
> On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>> We did this monotonic detection/compression before in older times, but
>> had to remove it because it caused too many slowdowns.
>>
>> I think it easily causes too much type pollution. For example, for a
>> typical large index with an unsorted docvalues field, big segments
>> won't be sorted, tiny segments with a few values might happen to be
>> sorted (depending on chance/luck), and tiny tiny ones with e.g. a single
>> document are sorted. Now we have a mix of monotonic and non-monotonic
>> over the same field.
>>
>> On the other hand, the optimization is very fragile and rare: even for
>> these log users actually sorting on that field at index-time, it will
>> just apply to one field out of the somehow typical dozens/hundreds
>> that they like to have. But it may destroy performance of all the other
>> fields and overall cause more harm than good.
>>
>> On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xugan...@icloud.com.invalid> wrote:
>> >
>> > Hi,
>> >
>> > In class Lucene80DocValuesConsumer#writeValues(FieldInfo field,
>> > DocValuesProducer valuesProducer), all numericDocValues will be visited to
>> > calculate the gcd. In the meantime, we can check whether all values are sorted.
>> > If so, maybe we could use DirectMonotonicWriter to store them.
>> > DirectMonotonicWriter can get impressive compression.
>> >
>> > In addition, when I use Elasticsearch to store numeric field types, at the
>> > Lucene level the data is always stored at least as
>> > NumericDocValues/SortedNumericDocValues. So when indexing some sorted
>> > values like ID or TIMESTAMP, maybe the above optimization is applicable.
>> >
>> > Could I have some suggestions?
>> >
>>
>
>
> --
> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org