@Adrien choosing different types of compression/storage depending on the data. From BigQuery: https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format
BigQuery has background processes that constantly look at all the stored data and check if it can be optimized even further. Perhaps initially data was loaded in small chunks, and without seeing all the data, some decisions were not globally optimal. Or perhaps some parameters of the system have changed, and there are new opportunities for storage restructuring. Or perhaps the Capacitor models got more trained and tuned, and it is possible to enhance existing data. While every column is being encoded, Capacitor and BigQuery collect various statistics about the data. These statistics are persisted and later used during query execution: they feed both into the query planner, to help compile optimal plans, and into the dynamic runtime, to choose the most efficient algorithms for the required operators.

On Tue, Apr 25, 2017 at 7:43 PM, Otis Gospodnetić <[email protected]> wrote:

> Hi,
>
> On Tue, Apr 25, 2017 at 4:06 AM, Adrien Grand <[email protected]> wrote:
>
>> I think it makes sense indeed for time-series databases. The time field
>> should grow by regular increments, and the numerical values of consecutive
>> documents are likely to be close to each other. Both cases should compress
>> efficiently by doing delta-of-delta encoding.
>>
>> We haven't really started exploring leveraging the fact that doc values
>> have an iterator API for compression at all. I think this delta-of-delta
>> approach would be interesting to explore. Maybe we could encode values in
>> blocks, like postings, and decide how to encode each block based on the
>> actual data. Delta-of-delta would be one option, but sometimes we might
>> also go with RLE or FOR, depending on which one suits the actual data best.
>
> Sounds great! I created https://issues.apache.org/jira/browse/LUCENE-7806
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>> On Tue, Apr 25, 2017 at 04:43, Otis Gospodnetić <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I was reading about Facebook Beringei when I spotted this:
>>>
>>> - Extremely efficient streaming compression algorithm. Our streaming
>>> compression algorithm is able to compress real world time series data by
>>> over 90%. The delta-of-delta compression algorithm used by Beringei is
>>> also fast: we see that a single machine is able to compress more than
>>> 1.5 million datapoints/second.
>>>
>>> That "delta of delta" caught my attention. This delta-of-delta encoding
>>> is one of the Facebook Gorilla tricks that allows it to compress 16 bytes
>>> into 1.37 bytes on average -- see section 4.1, which describes it:
>>> http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
>>>
>>> This seems to be aimed at both time fields and numerical values.
>>>
>>> Would Lucene benefit from this?
>>>
>>> https://github.com/burmanm/gorilla-tsc seems to be a fresh Java
>>> implementation.
>>>
>>> Otis
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
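To make the delta-of-delta idea from section 4.1 of the Gorilla paper concrete, here is a minimal Java sketch. It is not the gorilla-tsc library or any Lucene code: a real encoder bit-packs each delta-of-delta into a variable-width field (which is where the 16-bytes-to-1.37-bytes average comes from), while this sketch keeps them as plain longs so the round trip is easy to follow.

```java
import java.util.Arrays;

// Sketch of delta-of-delta encoding for monotonically increasing timestamps.
// First value is stored as-is, the second as a delta, and every later value
// as the difference between consecutive deltas, which is near zero for
// regularly spaced time series.
public class DeltaOfDelta {

    static long[] encode(long[] values) {
        long[] out = new long[values.length];
        long prev = 0, prevDelta = 0;
        for (int i = 0; i < values.length; i++) {
            if (i == 0) {
                out[i] = values[i];                 // header: raw first value
            } else {
                long delta = values[i] - prev;
                out[i] = delta - prevDelta;         // delta of delta
                prevDelta = delta;
            }
            prev = values[i];
        }
        return out;
    }

    static long[] decode(long[] encoded) {
        long[] out = new long[encoded.length];
        long prev = 0, prevDelta = 0;
        for (int i = 0; i < encoded.length; i++) {
            if (i == 0) {
                out[i] = encoded[i];
            } else {
                long delta = prevDelta + encoded[i]; // undo delta of delta
                out[i] = prev + delta;
                prevDelta = delta;
            }
            prev = out[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Timestamps at a regular 60s interval with one second of jitter.
        long[] ts = {1492100000L, 1492100060L, 1492100120L, 1492100181L, 1492100241L};
        long[] enc = encode(ts);
        // After the raw first value and the first delta (60), the remaining
        // entries are 0, 1, -1: tiny values that a bit-packing scheme like
        // Gorilla's stores in just a few bits each.
        System.out.println(Arrays.toString(enc));
    }
}
```

For regularly spaced timestamps almost every encoded value is zero, which is exactly the kind of block where per-block selection between delta-of-delta, RLE, and FOR, as Adrien suggests above, would pay off.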
