[GitHub] [arrow-datafusion] domodwyer opened a new issue #1538: Quantile support

GitBox Mon, 10 Jan 2022 06:39:51 -0800


domodwyer opened a new issue #1538:
URL: https://github.com/apache/arrow-datafusion/issues/1538



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I would like to efficiently aggregate (approximate) quantile values from a 
column of data, for example: "show me the 99th percentile of the latency column 
in the requests table".
   
   **Describe the solution you'd like**
   Implement TDigest (or similar algorithm) to provide relatively cheap 
quantile values/estimations.
   
   **Describe alternatives you've considered**
   I've had a look at some other DBs:
   
   * duckdb - tdigest & reservoir sampling
   * timescaledb - tdigest & uddsketch
   * snowflake - several options, including tdigest for cheap approximations
   * presto - qdigest
   * influxdb - tdigest
   
   For approximate results, tdigest seems popular, though the uddsketch paper 
is relatively new and also interesting.
   
   **Additional context**
   TDigest provides quantile estimates. I imagine it would be exposed as an 
`approx_quantile(column, quantile)` aggregation, in keeping with the naming of 
the `approx_distinct()` aggregation.
   
   Example:
   
   ```sql
   SELECT approx_quantile(latency, 0.99) AS p99 FROM requests;
   ```
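
For reference, here is a minimal Rust sketch of the exact answer the query above would be approximating. It is not DataFusion code: `exact_quantile` is a hypothetical helper, and the linear interpolation between ranks is just one common quantile definition that I am assuming for illustration. It sorts a full copy of the column, which is the O(n log n) time and O(n) memory cost that a sketch such as TDigest is meant to avoid.

```rust
/// Exact quantile by sorting a copy of the input.
/// Hypothetical helper for illustration only, not a DataFusion API.
/// Returns None for empty input or when `q` is outside [0, 1] (or NaN).
fn exact_quantile(values: &[f64], q: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    // Sorting needs every value in memory at once.
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Assumed definition: linear interpolation between the two closest ranks.
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}

fn main() {
    // Stand-in for the `latency` column: the values 1.0 through 100.0.
    let latency: Vec<f64> = (1..=100).map(|v| v as f64).collect();
    let p99 = exact_quantile(&latency, 0.99).unwrap();
    println!("p99 = {p99}");
}
```

An approximate aggregation would replace the sorted copy with a small, mergeable summary, which is what makes it practical to compute per partition and then combine.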
   


