domodwyer opened a new pull request #1539: URL: https://github.com/apache/arrow-datafusion/pull/1539
# Which issue does this PR close? Closes #1538. # What changes are included in this PR? A new `approx_quantile()` aggregation function. This PR also includes support for aggreagte functions with multiple arguments - all existing aggregations had a signature such as `fn(column)`, whereas this fn requires `approx_quantile(column, quantile)`. This involved two main changes to existing code: * Support for `TypeSignature::OneOf` for aggregation functions * Changing the protobuf format to allow `AggregateExprNode` to carry multiple `LogicalExprNode` (see commit messages) I'd like to draw attention to the fact this TDigest impl is largely taken from (the Apache 2.0 licensed) work by [MnO2](https://github.com/MnO2/t-digest) and modified to fit datafusion efficiently - itself a port of [Facebook's C++ implementation](https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h). I've included this information in a comment in the file, but let me know if I should provide attribution in some other way too! # Are there any user-facing changes? A new `approx_quantile()` aggregation is available to the user, both via SQL and the dataframe API. I added the `approx_quantile()` aggregation to the prelude - let me know if that's not right. # Future I did not spend much time trying to optimise this, and I suspect it can be polished up a bit but wanted to land something working first. I may also investiage the `uddsketch` algorithm in the future and potentially swap out the tdigest approach used here if it proves to be superior. Should I include documenting this new fn anywhere as part of the PR? --- * feat: implement TDigest for approx quantile (c8ff305f) Adds a [TDigest] implementation providing approximate quantile estimations of large inputs using a small amount of (bounded) memory. A TDigest is most accurate near either "end" of the quantile range (that is, 0.1, 0.9, 0.95, etc) due to the use of a scalaing function that increases resolution at the tails. The paper claims single digit part per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found accuracy to be more than acceptable for an apprixmate function across the entire quantile range. The implementation is a modified copy of https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++ implementation]. Both Facebook's implementation, and Mn02's Rust port are Apache 2.0 licensed. [TDigest]: https://arxiv.org/abs/1902.04023 [Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h * feat: approx_quantile aggregation (b99137fd) Adds the ApproxQuantile physical expression, plumbing & test cases. The function signature is: approx_quantile(column, quantile) Where column can be any numeric type (that can be cast to a float64) and quantile is a float64 literal between 0 and 1. * feat: approx_quantile dataframe function (6d65308c) Adds the approx_quantile() dataframe function, and exports it in the prelude. * refactor: bastilla approx_quantile support (2b68d464) Adds bastilla wire encoding for approx_quantile. Adding support for this required modifying the AggregateExprNode proto message to support propigating multiple LogicalExprNode aggregate arguments - all the existing aggregations take a single argument, so this wasn't needed before. This commit adds "repeated" to the expr field, which I believe is backwards compatible as described here: https://developers.google.com/protocol-buffers/docs/proto3#updating Specifically, adding "repeated" to an existing message field: "For ... message fields, optional is compatible with repeated" No existing tests needed fixing, and a new roundtrip test is included that covers the change to allow multiple expr. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
