andygrove opened a new issue, #3190:
URL: https://github.com/apache/datafusion-comet/issues/3190
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark `percentile_cont` function,
causing queries using this function to fall back to Spark's JVM execution
instead of running natively on DataFusion.
`PercentileCont` computes the value at a given percentage point of a continuous
distribution over a numeric or ANSI interval column. It implements the SQL
`PERCENTILE_CONT` function, which uses linear interpolation between the two
nearest values when the exact percentile position falls between data points.
The expression is a runtime-replaceable aggregate that delegates to the
internal `Percentile` implementation.
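As a standalone illustration of that interpolation (not Spark or Comet source code; the function name is hypothetical), a minimal Scala sketch:

```scala
// Hypothetical standalone sketch of PERCENTILE_CONT semantics for a numeric column.
// With sorted values v(0)..v(n-1), the target position is p * (n - 1); when that
// position falls between two indices, the result is linearly interpolated.
def percentileCont(values: Seq[Double], p: Double): Option[Double] = {
  require(p >= 0.0 && p <= 1.0, "percentage must be between 0.0 and 1.0")
  if (values.isEmpty) None // empty input yields null in SQL
  else {
    val sorted = values.sorted
    val rank = p * (sorted.length - 1)
    val lo = math.floor(rank).toInt
    val hi = math.ceil(rank).toInt
    val frac = rank - lo
    Some(sorted(lo) + (sorted(hi) - sorted(lo)) * frac)
  }
}

// percentileCont(Seq(10.0, 20.0, 30.0, 40.0), 0.5) == Some(25.0)
```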
Supporting this expression would allow more Spark workloads to benefit from
Comet's native acceleration.
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
PERCENTILE_CONT(percentage) WITHIN GROUP (ORDER BY column [ASC|DESC])
```
```scala
// Used internally by Catalyst - not directly exposed in DataFrame API
// The DataFrame API uses percentile_approx for similar functionality
```
**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| left | Expression | The column expression to calculate the percentile over (the ORDER BY clause) |
| right | Expression | The percentile value, between 0.0 and 1.0 |
| reverse | Boolean | Whether to reverse the ordering (DESC vs. ASC); defaults to false |
**Return Type:** Returns the same data type as the input column being
aggregated (numeric types or ANSI interval types).
**Supported Data Types:**
Supports numeric data types and ANSI interval types as determined by the
underlying Percentile implementation's input type validation.
**Edge Cases:**
- Null handling is delegated to the underlying `Percentile` implementation
- Empty input datasets return null
- Percentile values outside the 0.0-1.0 range cause validation errors
- Exactly one ordering column is required; multiple orderings throw a QueryCompilationError
- The DISTINCT clause is not supported and will be ignored
- An UnresolvedWithinGroup state indicates an incomplete ordering specification
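To make the null and range edge cases concrete, a hedged spark-shell sketch (assuming an in-scope `SparkSession` named `spark`; exact error types may vary by Spark version):

```scala
// Empty input after filtering: the aggregate should return a single null row
spark.sql(
  """SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c) AS p50
    |FROM VALUES (1.0), (2.0) AS t(c) WHERE c > 2.0""".stripMargin).show()

// A percentage outside [0.0, 1.0] should fail validation at analysis time:
// spark.sql("SELECT PERCENTILE_CONT(1.5) WITHIN GROUP (ORDER BY c) FROM VALUES (1.0) AS t(c)")
```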
**Examples:**
```sql
-- Calculate median (50th percentile) of salary
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) as median_salary
FROM employees;
-- Calculate 95th percentile in descending order
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time DESC) as p95_response
FROM requests;
```
```scala
// This is an internal Catalyst expression
// DataFrame API equivalent would use approxQuantile or percentile_approx
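// Note: approxQuantile returns an actual element of the column (even with
// relativeError 0.0) rather than interpolating, so its result can differ
// from PERCENTILE_CONT's linearly interpolated value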
df.stat.approxQuantile("salary", Array(0.5), 0.0)
```
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add an expression handler in `spark/src/main/scala/org/apache/comet/serde/`
2. **Register**: Add it to the appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add a message type in `native/proto/src/proto/expr.proto` if needed
4. **Rust**: Implement in `native/spark-expr/src/` (check whether DataFusion has built-in support first); a test sketch follows this list
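As a starting point for verification, a hedged sketch of a correctness test. It assumes the suite extends Comet's `CometTestBase`, which provides `checkSparkAnswerAndOperator`; the suite name and table schema are hypothetical:

```scala
// Hypothetical test sketch; place alongside the existing expression suites
// under spark/src/test/scala/org/apache/comet/
class CometPercentileContSuite extends CometTestBase {
  test("percentile_cont matches Spark and runs natively") {
    withTable("employees") {
      sql("CREATE TABLE employees(salary DOUBLE) USING parquet")
      sql("INSERT INTO employees VALUES (10.0), (20.0), (30.0), (40.0)")
      // Checks the result against Spark and asserts the plan runs natively in Comet
      checkSparkAnswerAndOperator(
        "SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) FROM employees")
    }
  }
}
```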
## Additional context
**Difficulty:** Medium
**Spark Expression Class:**
`org.apache.spark.sql.catalyst.expressions.PercentileCont`
**Related:**
- Percentile - The underlying implementation class
- PercentileDisc - Discrete percentile calculation
- RuntimeReplaceableAggregate - Base trait for replaceable aggregates
- SupportsOrderingWithinGroup - Interface for WITHIN GROUP syntax
---
*This issue was auto-generated from Spark reference documentation.*