[GitHub] [arrow-datafusion] jychen7 opened a new issue #2004: feat: ApproxPercentileCont supports sketches as input

GitBox Sun, 13 Mar 2022 08:43:03 -0700


jychen7 opened a new issue #2004:
URL: https://github.com/apache/arrow-datafusion/issues/2004



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Currently, `approx_quantile(column, quantile)` accepts only raw data as input 
and builds sketches at query time.
   When a query has a low-latency SLO, a common approach is to pre-aggregate 
sketches at ingestion time (e.g. Spark/Flink -> DataStore) and then merge 
them at query time (e.g. DataStore -> DataFusion).
   
   
[Here](https://github.com/apache/druid/blob/master/extensions-contrib/tdigestsketch/src/test/resources/doubles_sketch_data.tsv)
 is an example from Druid, which also uses the TDigest algorithm. The data contains 
the columns `["timestamp", "product", "sketch"]`, and the sketch is encoded using [TDigest Verbose 
mode](https://github.com/tdunning/t-digest/blob/5db477108a6a56cb385776d9aa1ce2e0fbd60230/core/src/main/java/com/tdunning/math/stats/MergingDigest.java#L869-L880).
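
As a rough illustration of what reading such a sketch could involve, here is a minimal Python sketch that decodes a buffer laid out in the Java field order listed under "Additional context" (encoding_version, min, max, max_size, count, then `(weight, mean)` pairs). The byte widths and big-endian order are assumptions, not something this issue confirms, and `decode_verbose` is a hypothetical helper name; the linked `MergingDigest.java` lines are the authority on the real layout.

```python
import struct
from typing import Dict, List, Tuple

# ASSUMPTION: big-endian, 4-byte ints and 8-byte doubles, in the field
# order listed in this issue. Check the linked Java source before relying on it.
HEADER = struct.Struct(">idddi")  # encoding_version, min, max, max_size, count
CENTROID = struct.Struct(">dd")   # weight, mean


def decode_verbose(buf: bytes) -> Dict[str, object]:
    """Decode one sketch into a plain dict (hypothetical helper)."""
    version, lo, hi, max_size, count = HEADER.unpack_from(buf, 0)
    centroids: List[Tuple[float, float]] = [
        CENTROID.unpack_from(buf, HEADER.size + i * CENTROID.size)
        for i in range(count)
    ]
    return {
        "encoding_version": version,
        "min": lo,
        "max": hi,
        "max_size": max_size,
        "count": count,
        "centroids": centroids,  # list of (weight, mean) pairs
    }
```

How the `sketch` column in the Druid file wraps these bytes (for example, any text encoding applied on top) is not covered here.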
   
   **Describe the solution you'd like**
   Extend `approx_quantile(column, quantile)` to accept an optional third 
parameter, e.g. `approx_quantile(column, quantile, format)`, where `format` can be
   - raw (default)
   - tdigest-verbose
   - tdigest-small
   - etc. (for future sketch algorithms, e.g. DDSketch from Datadog)
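
To make the proposed parameter concrete, here is a minimal Python sketch of how the `format` argument could be validated, with `raw` as the default when the argument is omitted. `SketchFormat` and `parse_format` are hypothetical names used only for illustration; they are not part of DataFusion.

```python
from enum import Enum
from typing import Optional


class SketchFormat(Enum):
    """Formats proposed in this issue (hypothetical enum)."""
    RAW = "raw"
    TDIGEST_VERBOSE = "tdigest-verbose"
    TDIGEST_SMALL = "tdigest-small"


def parse_format(arg: Optional[str] = None) -> SketchFormat:
    """Map the optional third argument to a format; omitted means raw."""
    if arg is None:
        return SketchFormat.RAW
    try:
        return SketchFormat(arg.strip().lower())
    except ValueError:
        supported = ", ".join(f.value for f in SketchFormat)
        raise ValueError(
            f"unsupported format {arg!r}; expected one of: {supported}"
        ) from None
```

Keeping the set of formats closed means an unknown value fails early with a clear message, and a future algorithm such as DDSketch would be added as one more enum member.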
   
   **Describe alternatives you've considered**
   A clear and concise description of any alternative solutions or features 
you've considered.
   
   **Additional context**
   
   DataFusion's TDigest implementation uses its own internal 
encoding/serialization:
   
   
https://github.com/apache/arrow-datafusion/blob/ca952bd33402816dbb1550debb9b8cac3b13e8f2/datafusion-physical-expr/src/tdigest/mod.rs#L571-L582
   
   This encoding differs slightly from the Java one (by the algorithm's author):
   
   - DataFusion: max_size, sum, count, max, min, centroid (mean, weight)
   - Java: encoding_version, min, max, max_size, count, centroid (weight, mean)
   
   Question: should we modify DataFusion's internal state encoding to align 
with the Java one, or just map the Java encoding to the DataFusion one at 
query time?
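
The mapping option could look roughly like the Python sketch below, which reorders the Java fields into the DataFusion order listed above. The container types and `java_to_datafusion` are hypothetical. Because the Java layout carries no `sum`, the sketch derives `sum` and `count` from the centroids; that derivation is an assumption about what DataFusion expects, not something stated in this issue.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JavaSketch:
    """Fields in the Java order listed above (hypothetical container)."""
    encoding_version: int
    min: float
    max: float
    max_size: float
    count: int
    centroids: List[Tuple[float, float]]  # (weight, mean)


@dataclass
class DataFusionState:
    """Fields in the DataFusion order listed above (hypothetical container)."""
    max_size: int
    sum: float
    count: float
    max: float
    min: float
    centroids: List[Tuple[float, float]]  # (mean, weight)


def java_to_datafusion(s: JavaSketch) -> DataFusionState:
    # Swap each pair from (weight, mean) to (mean, weight).
    centroids = [(mean, weight) for weight, mean in s.centroids]
    # ASSUMPTION: count is the total weight and sum is the weighted sum of means.
    total_weight = sum(w for _, w in centroids)
    weighted_sum = sum(m * w for m, w in centroids)
    return DataFusionState(
        max_size=int(s.max_size),
        sum=weighted_sum,
        count=total_weight,
        max=s.max,
        min=s.min,
        centroids=centroids,
    )
```

A mapping like this leaves DataFusion's internal state untouched, at the cost of a conversion step for every sketch read at query time.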


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
