[GitHub] [arrow-datafusion] domodwyer opened a new pull request #1539: approx_quantile() aggregation function

GitBox Mon, 10 Jan 2022 06:53:36 -0800


domodwyer opened a new pull request #1539:
URL: https://github.com/apache/arrow-datafusion/pull/1539



   # Which issue does this PR close?
   
   Closes #1538.
   
   # What changes are included in this PR?
   A new `approx_quantile()` aggregation function.
   
   This PR also includes support for aggreagte functions with multiple 
arguments - all existing aggregations had a signature such as `fn(column)`, 
whereas this fn requires `approx_quantile(column, quantile)`. 
   
   This involved two main changes to existing code: 
   
   * Support for `TypeSignature::OneOf` for aggregation functions
   * Changing the protobuf format to allow `AggregateExprNode` to carry 
multiple `LogicalExprNode` (see commit messages)
   
   I'd like to draw attention to the fact this TDigest impl is largely taken 
from (the Apache 2.0 licensed) work by [MnO2](https://github.com/MnO2/t-digest) 
and modified to fit datafusion efficiently - itself a port of  [Facebook's C++ 
implementation](https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h).
 I've included this information in a comment in the file, but let me know if I 
should provide attribution in some other way too!
   
   # Are there any user-facing changes?
   A new `approx_quantile()` aggregation is available to the user, both via SQL 
and the dataframe API.
   
   I added the `approx_quantile()` aggregation to the prelude - let me know if 
that's not right.
   
   # Future
   
   I did not spend much time trying to optimise this, and I suspect it can be 
polished up a bit but wanted to land something working first. I may also 
investiage the `uddsketch` algorithm in the future and potentially swap out the 
tdigest approach used here if it proves to be superior.
   
   Should I include documenting this new fn anywhere as part of the PR?
   
   ---
   
   * feat: implement TDigest for approx quantile (c8ff305f)
   
         Adds a [TDigest] implementation providing approximate quantile 
estimations of
         large inputs using a small amount of (bounded) memory.
   
         A TDigest is most accurate near either "end" of the quantile range 
(that is,
         0.1, 0.9, 0.95, etc) due to the use of a scalaing function that 
increases
         resolution at the tails. The paper claims single digit part per 
million errors
         for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have 
found
         accuracy to be more than acceptable for an apprixmate function across 
the
         entire quantile range.
   
         The implementation is a modified copy of 
https://github.com/MnO2/t-digest,
         itself a Rust port of [Facebook's C++ implementation]. Both Facebook's
         implementation, and Mn02's Rust port are Apache 2.0 licensed.
   
         [TDigest]: https://arxiv.org/abs/1902.04023
         [Facebook's C++ implementation]:
         https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h
   
   * feat: approx_quantile aggregation (b99137fd)
   
         Adds the ApproxQuantile physical expression, plumbing & test cases.
   
         The function signature is:
   
                approx_quantile(column, quantile)
   
         Where column can be any numeric type (that can be cast to a float64) 
and 
         quantile is a float64 literal between 0 and 1.
   
   * feat: approx_quantile dataframe function (6d65308c)
   
         Adds the approx_quantile() dataframe function, and exports it in the 
prelude.
   
   * refactor: bastilla approx_quantile support (2b68d464)
   
         Adds bastilla wire encoding for approx_quantile.
   
         Adding support for this required modifying the AggregateExprNode proto 
message
         to support propigating multiple LogicalExprNode aggregate arguments - 
all the
         existing aggregations take a single argument, so this wasn't needed 
before.
   
         This commit adds "repeated" to the expr field, which I believe is 
backwards
         compatible as described here:
   
                
https://developers.google.com/protocol-buffers/docs/proto3#updating
   
         Specifically, adding "repeated" to an existing message field:
   
                "For ... message fields, optional is compatible with repeated"
   
         No existing tests needed fixing, and a new roundtrip test is included 
that
         covers the change to allow multiple expr.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] domodwyer opened a new pull request #1539: approx_quantile() aggregation function

Reply via email to