andygrove opened a new issue, #3189:
URL: https://github.com/apache/datafusion-comet/issues/3189
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification
details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark `percentile_approx` function
(also aliased as `approx_percentile`), causing queries that use it to fall
back to Spark's JVM execution instead of running natively on DataFusion.
ApproximatePercentile is a Spark Catalyst aggregate expression that computes
approximate percentiles of numeric data using a variant of the
Greenwald-Khanna algorithm (Spark's `QuantileSummaries`). It provides a
memory-efficient way to estimate percentiles over large datasets without an
exact sort, trading precision for performance and memory usage.
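For intuition, the quantity being approximated is an ordinary rank-based percentile; the sketch below computes it exactly over a sorted buffer. Spark's `QuantileSummaries` avoids holding all values by keeping a pruned sample whose rank error shrinks as the accuracy parameter grows. This is a toy illustration of the target semantics, not Spark's implementation:

```scala
// Toy illustration only: an exact nearest-rank percentile over a fully
// sorted buffer. Spark's QuantileSummaries keeps a bounded sample of
// (value, rank-bound) tuples instead of the full buffer used here.
object PercentileToy {
  def exactPercentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "no non-null values: Spark returns null here")
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0.0, 1.0]")
    val sorted = values.sorted
    // Nearest-rank lookup; close to (but not byte-for-byte identical with)
    // Spark's interpolation-free rank selection.
    val idx = math.min(sorted.length - 1, (p * (sorted.length - 1)).round.toInt)
    sorted(idx)
  }

  def main(args: Array[String]): Unit = {
    val latencies = Seq(12.0, 7.0, 31.0, 25.0, 18.0, 44.0, 9.0)
    println(exactPercentile(latencies, 0.5))   // approximate median
    println(exactPercentile(latencies, 0.95))  // tail latency estimate
  }
}
```

An approximate implementation trades the O(n log n) sort and O(n) memory above for a summary whose size depends on the accuracy parameter rather than on the input size.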
Supporting this expression would allow more Spark workloads to benefit from
Comet's native acceleration.
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
percentile_approx(col, percentage [, accuracy])
percentile_approx(col, array_of_percentages [, accuracy])
```
```scala
// DataFrame API
import org.apache.spark.sql.functions._
df.agg(expr("percentile_approx(column, 0.5, 10000)"))
```
**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| child | Expression | The column or expression to compute percentiles for |
| percentageExpression | Expression | Single percentile (0.0-1.0) or array of percentiles to compute |
| accuracyExpression | Expression | Optional accuracy parameter (default: 10000). Higher values yield more accuracy |
**Return Type:** Returns the same data type as the input column. If an array
of percentiles is provided, returns an array of the input data type with
`containsNull = false`.
**Supported Data Types:**
- **Numeric types**: ByteType, ShortType, IntegerType, LongType, FloatType,
DoubleType, DecimalType
- **Temporal types**: DateType, TimestampType, TimestampNTZType
- **Interval types**: YearMonthIntervalType, DayTimeIntervalType
**Edge Cases:**
- **Null handling**: Ignores null input values during computation
- **Empty input**: Returns null when no non-null values are processed
- **Invalid percentages**: Validates that percentages are between 0.0 and
1.0 inclusive
- **Invalid accuracy**: Requires accuracy to be positive and ≤ Int.MaxValue
- **Non-foldable expressions**: Percentage and accuracy parameters must be
compile-time constants
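The edge cases above imply up-front argument validation. A minimal sketch of those checks (hypothetical helper names, not Comet or Spark code):

```scala
// Hypothetical validation mirroring the edge cases listed above.
// Names are illustrative; Spark performs equivalent checks inside
// ApproximatePercentile's checkInputDataTypes.
def validatePercentileArgs(
    percentages: Seq[Double],
    accuracy: Long): Either[String, Unit] = {
  if (percentages.exists(p => p < 0.0 || p > 1.0))
    Left("percentage must be between 0.0 and 1.0 inclusive")
  else if (accuracy <= 0 || accuracy > Int.MaxValue)
    Left("accuracy must be positive and <= Int.MaxValue")
  else
    Right(())
}
```

Foldability of the percentage and accuracy expressions (i.e. that they are compile-time constants) would be checked separately against the Catalyst expression tree before these value checks run.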
**Examples:**
```sql
-- Single percentile (median)
SELECT percentile_approx(salary, 0.5) as median_salary FROM employees;
-- Multiple percentiles with custom accuracy
SELECT percentile_approx(response_time, array(0.25, 0.5, 0.75, 0.95), 50000)
as quartiles
FROM web_requests;
-- Using with GROUP BY
SELECT department, percentile_approx(salary, 0.9) as p90_salary
FROM employees
GROUP BY department;
```
```scala
// DataFrame API examples
import org.apache.spark.sql.functions._
// Single percentile
df.agg(expr("percentile_approx(amount, 0.5)").as("median"))
// Multiple percentiles
df.agg(expr("percentile_approx(latency, array(0.5, 0.95, 0.99))").as("percentiles"))
// With custom accuracy
df.agg(expr("percentile_approx(value, 0.95, 100000)").as("p95"))
```
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add expression handler in
`spark/src/main/scala/org/apache/comet/serde/`
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if
needed
4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has
built-in support first)
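The steps above can be sketched as follows. Everything here (object name, method outline) is a hypothetical orientation sketch following the pattern in the contributor guide; exact traits and signatures must be taken from the current Comet source:

```scala
// Hypothetical sketch of a Comet serde handler for ApproximatePercentile.
// Not working code: the structure below only mirrors the general pattern
// of existing aggregate serdes described in the contributor guide.
object CometApproximatePercentile {
  // Step 1: support check. Verify the child data type is supported and
  // that the percentage and accuracy arguments are foldable literals;
  // otherwise signal fallback so the query runs on Spark's JVM path.

  // Step 2/3: serialization. Convert the child expression and the literal
  // percentage/accuracy values into the protobuf message added to
  // native/proto/src/proto/expr.proto, then register the handler in the
  // aggregate-expression map in QueryPlanSerde.scala.

  // Step 4: native side. In native/spark-expr, either map to DataFusion's
  // built-in approx_percentile_cont where its semantics match Spark's, or
  // implement a Spark-compatible QuantileSummaries-based aggregate.
}
```

A key design decision is whether DataFusion's built-in approximate percentile aggregate produces results close enough to Spark's Greenwald-Khanna-based implementation; if not, a Spark-compatible native implementation is needed for correctness parity.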
## Additional context
**Difficulty:** Medium
**Spark Expression Class:**
`org.apache.spark.sql.catalyst.expressions.ApproximatePercentile`
**Related:**
- **Percentile**: Exact percentile computation (more expensive but precise)
- **approxQuantile**: Similar approximate quantile functionality exposed via
`DataFrameStatFunctions.approxQuantile` in the DataFrame API
- Other aggregate functions: Count, Sum, Avg, Min, Max
---
*This issue was auto-generated from Spark reference documentation.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]