Indeed, I think it makes sense for time-series databases. The time field should grow by regular increments, and numerical values of consecutive documents are likely to be close to each other. Both cases should compress efficiently with delta-of-delta encoding.
We haven't really started exploring how to leverage the iterator API of doc values for compression. I think this delta-of-delta approach would be interesting to explore. Maybe we could encode values in blocks, like postings, and decide how to encode each block based on the actual data: delta-of-delta would be one option, but RLE or FOR might sometimes suit the data better.

On Tue, Apr 25, 2017 at 04:43, Otis Gospodnetić <[email protected]> wrote:

> Hi,
>
> I was reading about Facebook Beringei when I spotted this:
>
>    - Extremely efficient streaming compression algorithm. Our streaming
>    compression algorithm is able to compress real world time series data by
>    over 90%. The delta of delta compression algorithm used by Beringei is
>    also fast - we see that a single machine is able to compress more than
>    1.5 million datapoints/second.
>
> That "*delta of delta*" caught my attention.... This delta of delta
> encoding is one of the Facebook Gorilla tricks that allows it to compress
> 16 bytes into 1.37 bytes on average -- see section 4.1 that describes it --
> http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
>
> This seems to be aimed at both time fields and numerical values.
>
> Would Lucene benefit from this?
>
> https://github.com/burmanm/gorilla-tsc seems to be a fresh Java
> implementation.
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
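For anyone curious what the transform looks like, here is a minimal sketch of integer delta-of-delta encoding in the spirit of section 4.1 of the Gorilla paper. The class and method names are hypothetical (not from Lucene or gorilla-tsc), and the real algorithm additionally bit-packs the deltas-of-deltas with variable-length codes; this only shows why regularly spaced timestamps collapse to runs of zeros, which a block encoder could then pick RLE or tight bit-packing for:

```java
import java.util.Arrays;

public class DeltaOfDelta {

    // Encode as [first value, first delta, then delta-of-delta for each
    // subsequent value]. Regular increments produce mostly zeros.
    public static long[] encode(long[] values) {
        long[] out = new long[values.length];
        if (values.length == 0) return out;
        out[0] = values[0];
        long prevDelta = 0;
        for (int i = 1; i < values.length; i++) {
            long delta = values[i] - values[i - 1];
            out[i] = delta - prevDelta; // delta of delta
            prevDelta = delta;
        }
        return out;
    }

    // Invert the transform: accumulate deltas, then accumulate values.
    public static long[] decode(long[] encoded) {
        long[] out = new long[encoded.length];
        if (encoded.length == 0) return out;
        out[0] = encoded[0];
        long delta = 0;
        for (int i = 1; i < encoded.length; i++) {
            delta += encoded[i];
            out[i] = out[i - 1] + delta;
        }
        return out;
    }

    public static void main(String[] args) {
        // Timestamps arriving roughly every 60 units, with one late sample.
        long[] ts = {1000, 1060, 1120, 1180, 1250};
        long[] enc = encode(ts);
        System.out.println(Arrays.toString(enc));       // [1000, 60, 0, 0, 10]
        System.out.println(Arrays.equals(decode(enc), ts)); // true
    }
}
```

Note how the three equal deltas become two zeros: the more regular the spacing, the more of the block is zeros, which is exactly the case a per-block choice between delta-of-delta, RLE, and FOR would exploit.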
