jychen7 opened a new issue #2004: URL: https://github.com/apache/arrow-datafusion/issues/2004
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently, `approx_quantile(column, quantile)` takes raw data as input and builds sketches at query time. When queries have a low-latency SLO, a common approach is to pre-aggregate sketches at ingestion time (e.g. Spark/Flink -> DataStore) and then merge the sketches at query time (e.g. DataStore -> Datafusion).

[Here](https://github.com/apache/druid/blob/master/extensions-contrib/tdigestsketch/src/test/resources/doubles_sketch_data.tsv) is an example from Druid, which also uses the TDigest algorithm. The data contains `["timestamp", "product", "sketch"]`, and the sketch is encoded using [TDigest Verbose mode](https://github.com/tdunning/t-digest/blob/5db477108a6a56cb385776d9aa1ce2e0fbd60230/core/src/main/java/com/tdunning/math/stats/MergingDigest.java#L869-L880).

**Describe the solution you'd like**

Improve `approx_quantile(column, quantile)` to accept an optional third parameter, e.g. `approx_quantile(column, quantile, format)`, where `format` can be:
- raw (default)
- tdigest-verbose
- tdigest-small
- etc. (for future sketch algorithms, e.g. DDSketch from Datadog)

**Describe alternatives you've considered**

A clear and concise description of any alternative solutions or features you've considered.

**Additional context**

In the TDigest implementation of Datafusion, there is an encoding/serialization used internally:
https://github.com/apache/arrow-datafusion/blob/ca952bd33402816dbb1550debb9b8cac3b13e8f2/datafusion-physical-expr/src/tdigest/mod.rs#L571-L582

This encoding differs slightly from the Java one (from the algorithm's author):

- Datafusion: max_size, sum, count, max, min, centroid (mean, weight)
- Java: encoding_version, min, max, max_size, count, centroid (weight, mean)

Question: do we want to modify Datafusion so that its internal-state encoding aligns with the Java one? Or should we just map the Java encoding to the Datafusion one at query time?
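
To make the second option (mapping at query time) concrete, here is a minimal sketch of what such a mapping might look like. It is not Datafusion's actual API: the struct names, field types, and the `java_to_datafusion` function are hypothetical, and only the field orders come from the two layouts listed above. It assumes the Java bytes have already been decoded and the `encoding_version` checked, that the Java `count` is simply the number of centroid pairs that follow, and that the `sum` and `count` fields missing from the Java layout can be derived from the centroids.

```rust
/// Hypothetical decoded form of the Java-style layout
/// (min, max, max_size, then centroids as (weight, mean) pairs).
/// Assumes `encoding_version` was validated and stripped during decoding,
/// and that the Java `count` equals `centroids.len()`.
struct JavaVerbose {
    min: f64,
    max: f64,
    max_size: f64,
    centroids: Vec<(f64, f64)>, // (weight, mean)
}

/// Hypothetical state in the Datafusion field order:
/// max_size, sum, count, max, min, centroids as (mean, weight) pairs.
#[derive(Debug, PartialEq)]
struct DatafusionState {
    max_size: usize,
    sum: f64,
    count: f64,
    max: f64,
    min: f64,
    centroids: Vec<(f64, f64)>, // (mean, weight)
}

/// Reorder the fields and derive the two values the Java layout lacks.
fn java_to_datafusion(j: &JavaVerbose) -> DatafusionState {
    // Swap each (weight, mean) pair into (mean, weight).
    let centroids: Vec<(f64, f64)> = j.centroids.iter().map(|&(w, m)| (m, w)).collect();
    // Assumption: count is the total weight across all centroids.
    let count: f64 = centroids.iter().map(|&(_, w)| w).sum();
    // Assumption: sum is approximated by the weighted sum of centroid means.
    let sum: f64 = centroids.iter().map(|&(m, w)| m * w).sum();
    DatafusionState {
        max_size: j.max_size as usize,
        sum,
        count,
        max: j.max,
        min: j.min,
        centroids,
    }
}
```

Under these assumptions the mapping is a cheap per-sketch pass over the centroids, so it could run at query time without changing Datafusion's internal state format.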
