@Adrien, on choosing different types of compression/storage depending on the
data:
From BigQuery:
https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format

BigQuery has background processes that constantly look at all the stored
data and check if it can be optimized even further. Perhaps initially data
was loaded in small chunks, and without seeing all the data, some decisions
were not globally optimal. Or perhaps some parameters of the system have
changed, and there are new opportunities for storage restructuring. Or
perhaps, Capacitor models got more trained and tuned, and it is possible to
enhance existing data.

While every column is being encoded, Capacitor and BigQuery collect various
statistics about the data. These statistics are persisted and later used
during query execution: they feed both into the query planner to help
compile optimal plans, and into the dynamic runtime to choose the most
efficient algorithms for the required operators.
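The kind of per-column statistics collection described above could be sketched roughly as follows. The class, fields, and method names are my own illustration of the idea, not Capacitor's actual code, and a real system would use a cardinality sketch (e.g. HyperLogLog) rather than an exact set:

```java
import java.util.HashSet;
import java.util.Set;

public class ColumnStats {
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE, count = 0;
    // Exact distinct set is fine for a sketch; real systems approximate this.
    final Set<Long> distinct = new HashSet<>();

    void add(long v) {
        min = Math.min(min, v);
        max = Math.max(max, v);
        count++;
        distinct.add(v);
    }

    // A planner could use this ratio to prefer RLE or dictionary
    // encoding when few distinct values dominate the column.
    double cardinalityRatio() {
        return count == 0 ? 0 : (double) distinct.size() / count;
    }

    public static void main(String[] args) {
        ColumnStats s = new ColumnStats();
        for (long v : new long[] {3, 3, 3, 7, 7}) s.add(v);
        System.out.println(s.min + " " + s.max + " " + s.cardinalityRatio());
    }
}
```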




On Tue, Apr 25, 2017 at 7:43 PM, Otis Gospodnetić <
[email protected]> wrote:

> Hi,
>
> On Tue, Apr 25, 2017 at 4:06 AM, Adrien Grand <[email protected]> wrote:
>
>> I think it makes sense indeed for time-series databases. The time field
>> should grow by regular increments, and numerical values of consecutive
>> documents are likely to be close to each other. Both cases should compress
>> efficiently by doing delta of delta encoding.
>>
>> We haven't really started exploring leveraging the fact that doc values
>> have an iterator API for compression at all. I think this delta-of-delta
>> approach would be interesting to explore. Maybe we could encode values in
>> blocks like postings and decide how to encode each block based on the
>> actual data. Delta-of-delta would be one option, but sometimes we might
>> also go with RLE or FOR depending on which one suits the actual data best.
>>
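The per-block choice Adrien describes could look roughly like the sketch below: inspect each block's values and pick RLE for a single run, delta-of-delta for near-regular increments, and FOR (frame of reference) otherwise. The enum, method names, and thresholds are my own illustration, not Lucene's actual codec API:

```java
public class BlockEncodingChooser {
    enum Encoding { RLE, DELTA_OF_DELTA, FOR }

    static Encoding choose(long[] block) {
        if (block.length < 2) return Encoding.FOR;
        boolean allEqual = true;
        long prevDelta = block[1] - block[0];
        long maxDod = 0;
        for (int i = 1; i < block.length; i++) {
            if (block[i] != block[0]) allEqual = false;
            long delta = block[i] - block[i - 1];
            maxDod = Math.max(maxDod, Math.abs(delta - prevDelta));
            prevDelta = delta;
        }
        if (allEqual) return Encoding.RLE;               // one run: store value + count
        if (maxDod <= 1) return Encoding.DELTA_OF_DELTA; // near-regular increments, e.g. timestamps
        return Encoding.FOR;                             // bit-pack offsets from the block minimum
    }

    public static void main(String[] args) {
        System.out.println(choose(new long[] {5, 5, 5, 5}));
        System.out.println(choose(new long[] {10, 20, 30, 40}));
        System.out.println(choose(new long[] {3, 99, 12, 40}));
    }
}
```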
>
> Sounds great!  I created https://issues.apache.org/jira/browse/LUCENE-7806
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> Le mar. 25 avr. 2017 à 04:43, Otis Gospodnetić <
>> [email protected]> a écrit :
>>
>>> Hi,
>>>
>>> I was reading about Facebook Beringei when I spotted this:
>>>
>>>
>>>    - Extremely efficient streaming compression algorithm. Our streaming
>>>    compression algorithm is able to compress real world time series data by
>>>    over 90%. The delta of delta compression algorithm used by Beringei is
>>>    also fast - we see that a single machine is able to compress more than
>>>    1.5 million datapoints/second.
>>>
>>>
>>> That "*delta of delta*" caught my attention.... This delta of delta
>>> encoding is one of the Facebook Gorilla tricks that allows it to compress
>>> 16 bytes into 1.37 bytes on average -- see section 4.1 that describes it --
>>> http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
>>>
>>> This seems to be aimed at both time fields and numerical values.
>>>
>>> Would Lucene benefit from this?
>>>
>>> https://github.com/burmanm/gorilla-tsc seems to be a fresh Java
>>> implementation.
>>>
>>> Otis
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>
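To make the delta-of-delta idea from Gorilla's section 4.1 concrete, here is a minimal round trip over timestamps. It only computes the second differences; the paper's 16-bytes-to-1.37-bytes gains come from variable-length bit packing of those (mostly zero) values, which this sketch omits. Class and method names are illustrative, not Beringei's or gorilla-tsc's API:

```java
import java.util.Arrays;

public class DeltaOfDeltaSketch {
    // Store the first value, then the delta of consecutive deltas.
    // For a regular interval the delta-of-delta is 0 almost everywhere.
    static long[] encode(long[] timestamps) {
        long[] out = new long[timestamps.length];
        if (timestamps.length == 0) return out;
        out[0] = timestamps[0];
        long prevDelta = 0;
        for (int i = 1; i < timestamps.length; i++) {
            long delta = timestamps[i] - timestamps[i - 1];
            out[i] = delta - prevDelta;
            prevDelta = delta;
        }
        return out;
    }

    static long[] decode(long[] encoded) {
        long[] out = new long[encoded.length];
        if (encoded.length == 0) return out;
        out[0] = encoded[0];
        long delta = 0;
        for (int i = 1; i < encoded.length; i++) {
            delta += encoded[i];          // rebuild the running delta
            out[i] = out[i - 1] + delta;  // then the running timestamp
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {1000, 1060, 1120, 1180, 1245};
        long[] enc = encode(ts);
        // Near-regular 60s interval: encoded form is dominated by zeros.
        System.out.println(Arrays.toString(enc));
        System.out.println(Arrays.equals(ts, decode(enc)));
    }
}
```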
