aryan-212 opened a new pull request, #21388:
URL: https://github.com/apache/datafusion/pull/21388

   The interpolation step assumes centroids represent clusters of multiple 
points. But if the number of input rows is small (≤ the digest's `max_size` / 
compression threshold), **no compression ever happens**: every centroid has 
weight 1 and corresponds to exactly one input value.
   
   In that regime, interpolation is not just unnecessary — it is actively 
**wrong**. The t-digest interpolates between adjacent centroids based on where 
the rank falls *inside* the centroid's weight, using half-deltas to neighbors. 
When every centroid has weight 1, this produces values that drift away from any 
actual data point.
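
To make the drift concrete, here is a minimal sketch of a centroid-centered interpolation over unit-weight centroids. It is an illustration only, not DataFusion's actual code: the function name `interpolated_quantile` and the exact interpolation rule (half the distance between the two neighbors, offset by where the rank falls inside the centroid) are assumptions chosen to mirror the description above.

```python
def interpolated_quantile(values, q):
    """Illustrative t-digest-style estimate where every centroid has weight 1.

    Hypothetical helper, not DataFusion's implementation.
    """
    if not values:
        raise ValueError("values must be non-empty")
    means = sorted(float(v) for v in values)  # one centroid per input value
    n = len(means)
    if n == 1:
        return means[0]

    rank = q * n
    pos = min(int(rank), n - 1)  # centroid whose unit weight contains the rank

    lo = means[max(pos - 1, 0)]
    hi = means[min(pos + 1, n - 1)]
    if pos == 0 or pos == n - 1:
        delta = hi - lo          # edge centroid: full distance to its only neighbor
    else:
        delta = (hi - lo) / 2.0  # interior centroid: half-delta across both neighbors

    # Offset of the rank inside the centroid, measured from the centroid's center.
    value = means[pos] + ((rank - pos) - 0.5) * delta
    return min(max(value, lo), hi)  # clamp to the neighboring means


# Values of cc_sq_ft from the call_center example in this description.
cc_sq_ft = [6144, 6144, 19345, 21156, 21156, 22743, 34643,
            42935, 52514, 65772, 76815, 84336, 105138, 119886]
estimate = interpolated_quantile(cc_sq_ft, 0.85)
print(estimate, estimate in cc_sq_ft)
```

Under these assumptions the estimate for `q = 0.85` falls strictly between `84336` and `105138`, so it is not one of the input values even though no compression has taken place.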
   
   
   
   This is particularly surprising for users running small queries or unit 
tests — they expect percentile functions on a handful of values to return one 
of those values.
   
   ## Concrete Example
   
   Let's take a small example from the TPC-DS schema:
   
   ```sql
   select cc_sq_ft from call_center;
   ```
   
   row | cc_sq_ft
   -- | --
   1 | 6144
   2 | 6144
   3 | 19345
   4 | 21156
   5 | 21156
   6 | 22743
   7 | 34643
   8 | 42935
   9 | 52514
   10 | 65772
   11 | 76815
   12 | 84336
   13 | 105138
   14 | 119886
   
   Now take a small `APPROX_PERCENTILE` query such as:
   ```sql
   select approx_percentile(cc_sq_ft,0.85) from call_center limit 50
   ```
   Here `0.85 * 14 = 11.9`, which rounds up to rank 12, so the 
`APPROX_PERCENTILE` query above should return `84336`, the 12th value. That is 
what we get when we run the same query in Databricks:
   
   <img width="1012" height="754" alt="Screenshot 2026-04-06 at 12 11 21 AM" 
src="https://github.com/user-attachments/assets/00a158d5-ca96-4a0d-adc0-108bfae49214"
 />
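
The expected value can be checked with a short nearest-rank computation. This is a sketch under the assumption that the percentile is taken as the value at rank `ceil(q * n)` in the sorted input; `nearest_rank_percentile` is a hypothetical helper, not an API of DataFusion or Databricks.

```python
import math


def nearest_rank_percentile(values, q):
    """Return the input value at 1-based rank ceil(q * n) (nearest-rank method)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))  # 0.85 * 14 = 11.9 -> rank 12
    return ordered[rank - 1]


# Values of cc_sq_ft from the call_center table above.
cc_sq_ft = [6144, 6144, 19345, 21156, 21156, 22743, 34643,
            42935, 52514, 65772, 76815, 84336, 105138, 119886]
expected = nearest_rank_percentile(cc_sq_ft, 0.85)
print(expected)  # 84336
```

Because the result is always picked from the sorted input, it is guaranteed to be an actual data point, which is the behavior users expect on a handful of rows.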
   
   But in DataFusion this comes up as:
   
   <img width="1130" height="234" alt="Screenshot 2026-04-06 at 12 12 21 AM" 
src="https://github.com/user-attachments/assets/baaf9634-3e54-4b4a-86b3-3d230dd64397"
 />
   
   This PR aims to fix this.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

