aryan-212 opened a new issue, #21528: URL: https://github.com/apache/datafusion/issues/21528
### Describe the bug DataFusion’s `approx_percentile_cont` and `approx_median` use a t-digest internally. However, when the number of input rows is small enough that no centroid compression occurs, each centroid represents exactly one data point (weight = 1). In this scenario, the t-digest interpolation step is still applied, even though it assumes centroids represent clusters of multiple points. This leads to incorrect results that: - Do not match exact continuous percentiles (percentile_cont) - Do not match expected discrete nearest-rank semantics (as in Databricks) - Drift away from actual observed values This behavior is particularly surprising for small datasets and unit tests, where users expect percentile outputs to correspond to real data points. ### To Reproduce ### Steps to reproduce the behavior: #### Create a small dataset (e.g., TPC-DS call_center table with ~14 rows) Run: ```sql select approx_percentile(cc_sq_ft, 0.85) from call_center; ``` Observe the output Another example: ```sql select approx_median(value) from (values (-85), (-72), (-56), (-48), (-43), (-25), (-12), (-5), (45), (83)) t(value); ``` ### Expected behavior For small datasets where no t-digest compression occurs: Results should follow nearest-rank semantics (e.g., ceil(q * n) - 1) Returned values should be actual observed data points Behavior should align with systems like Databricks (percentile_approx) Examples: - `approx_percentile(cc_sq_ft, 0.85)` → 84336 - `approx_median(...)` → should return a valid input value (not interpolated like -32) ### Additional context Currently, DataFusion applies t-digest interpolation even when: ``` number_of_centroids == number_of_input_rows ``` This indicates no compression has occurred, making interpolation both unnecessary and incorrect. The issue stems from using the interpolation path in estimate_quantile regardless of centroid structure. In the no-compression regime, the algorithm should instead switch to an exact nearest-rank computation. This inconsistency leads to unintuitive and incorrect results, especially in: Small datasets Window functions with small frame sizes Unit tests expecting deterministic outputs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
