andygrove opened a new issue, #3189:
URL: https://github.com/apache/datafusion-comet/issues/3189
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification
details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark `percentile_approx` function
(also aliased as `approx_percentile`), causing queries that use it to fall
back to Spark's JVM execution instead of running natively on DataFusion.
ApproximatePercentile is a Spark Catalyst aggregate expression that computes
approximate percentiles of numeric data using a variant of the
Greenwald-Khanna algorithm (Spark's `QuantileSummaries`). It provides a
memory-efficient way to estimate percentiles over large datasets without an
exact sort, trading precision for performance and memory usage.
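For intuition, the quantity being approximated is an ordinary rank-based percentile; the sketch below computes it exactly over a sorted buffer. Spark's `QuantileSummaries` avoids holding all values by keeping a pruned sample whose rank error shrinks as the accuracy parameter grows. This is a toy illustration of the target semantics, not Spark's implementation:

```scala
// Toy illustration only: an exact nearest-rank percentile over a fully
// sorted buffer. Spark's QuantileSummaries keeps a bounded sample of
// (value, rank-bound) tuples instead of the full buffer used here.
object PercentileToy {
  def exactPercentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "no non-null values: Spark returns null here")
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0.0, 1.0]")
    val sorted = values.sorted
    // Nearest-rank lookup; close to (but not byte-for-byte identical with)
    // Spark's interpolation-free rank selection.
    val idx = math.min(sorted.length - 1, (p * (sorted.length - 1)).round.toInt)
    sorted(idx)
  }

  def main(args: Array[String]): Unit = {
    val latencies = Seq(12.0, 7.0, 31.0, 25.0, 18.0, 44.0, 9.0)
    println(exactPercentile(latencies, 0.5))   // approximate median
    println(exactPercentile(latencies, 0.95))  // tail latency estimate
  }
}
```

An approximate implementation trades the O(n log n) sort and O(n) memory above for a summary whose size depends on the accuracy parameter rather than on the input size.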
Supporting this expression would allow more Spark workloads to benefit from
Comet's native acceleration.
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
percentile_approx(col, percentage [, accuracy])
percentile_approx(col, array_of_percentages [, accuracy])
```
```scala
// DataFrame API
import org.apache.spark.sql.functions._
df.agg(expr("percentile_approx(column, 0.5, 10000)"))
```
**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| child | Expression | The column or expression to compute percentiles for |
| percentageExpression | Expression | Single percentile (0.0-1.0) or array of percentiles to compute |
| accuracyExpression | Expression | Optional accuracy parameter (default: 10000). Higher values yield more accuracy |
**Return Type:** Returns the same data type as the input column. If an array
of percentiles is provided, returns an array of the input data type with
`containsNull = false`.
**Supported Data Types:**
- **Numeric types**: ByteType, ShortType, IntegerType, LongType, FloatType,
DoubleType, DecimalType
- **Temporal types**: DateType, TimestampType, TimestampNTZType
- **Interval types**: YearMonthIntervalType, DayTimeIntervalType
**Edge Cases:**
- **Null handling**: Ignores null input values during computation
- **Empty input**: Returns null when no non-null values are processed
- **Invalid percentages**: Validates that percentages are between 0.0 and
1.0 inclusive
- **Invalid accuracy**: Requires accuracy to be positive and ≤ Int.MaxValue
- **Non-foldable expressions**: Percentage and accuracy parameters must be
compile-time constants
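The edge cases above imply up-front argument validation. A minimal sketch of those checks (hypothetical helper names, not Comet or Spark code):

```scala
// Hypothetical validation mirroring the edge cases listed above.
// Names are illustrative; Spark performs equivalent checks inside
// ApproximatePercentile's checkInputDataTypes.
def validatePercentileArgs(
    percentages: Seq[Double],
    accuracy: Long): Either[String, Unit] = {
  if (percentages.exists(p => p < 0.0 || p > 1.0))
    Left("percentage must be between 0.0 and 1.0 inclusive")
  else if (accuracy <= 0 || accuracy > Int.MaxValue)
    Left("accuracy must be positive and <= Int.MaxValue")
  else
    Right(())
}
```

Foldability of the percentage and accuracy expressions (i.e. that they are compile-time constants) would be checked separately against the Catalyst expression tree before these value checks run.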
**Examples:**
```sql
-- Single percentile (median)
SELECT percentile_approx(salary, 0.5) as median_salary FROM employees;
-- Multiple percentiles with custom accuracy
SELECT percentile_approx(response_time, array(0.25, 0.5, 0.75, 0.95), 50000)
as quartiles
FROM web_requests;
-- Using with GROUP BY
SELECT department, percentile_approx(salary, 0.9) as p90_salary
FROM employees
GROUP BY department;
```
```scala
// DataFrame API examples
import org.apache.spark.sql.functions._
// Single percentile
df.agg(expr("percentile_approx(amount, 0.5)").as("median"))
// Multiple percentiles
df.agg(expr("percentile_approx(latency, array(0.5, 0.95, 0.99))").as("percentiles"))
// With custom accuracy
df.agg(expr("percentile_approx(value, 0.95, 100000)").as("p95"))
```
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add expression handler in
`spark/src/main/scala/org/apache/comet/serde/`
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if
needed
4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has
built-in support first)
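The steps above can be sketched as follows. Everything here (object name, method outline) is a hypothetical orientation sketch following the pattern in the contributor guide; exact traits and signatures must be taken from the current Comet source:

```scala
// Hypothetical sketch of a Comet serde handler for ApproximatePercentile.
// Not working code: the structure below only mirrors the general pattern
// of existing aggregate serdes described in the contributor guide.
object CometApproximatePercentile {
  // Step 1: support check. Verify the child data type is supported and
  // that the percentage and accuracy arguments are foldable literals;
  // otherwise signal fallback so the query runs on Spark's JVM path.

  // Step 2/3: serialization. Convert the child expression and the literal
  // percentage/accuracy values into the protobuf message added to
  // native/proto/src/proto/expr.proto, then register the handler in the
  // aggregate-expression map in QueryPlanSerde.scala.

  // Step 4: native side. In native/spark-expr, either map to DataFusion's
  // built-in approx_percentile_cont where its semantics match Spark's, or
  // implement a Spark-compatible QuantileSummaries-based aggregate.
}
```

A key design decision is whether DataFusion's built-in approximate percentile aggregate produces results close enough to Spark's Greenwald-Khanna-based implementation; if not, a Spark-compatible native implementation is needed for correctness parity.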
## Additional context
**Difficulty:** Medium
**Spark Expression Class:**
`org.apache.spark.sql.catalyst.expressions.ApproximatePercentile`
**Related:**
- **Percentile**: Exact percentile computation (more expensive but precise)
- **approxQuantile**: Similar approximate quantile functionality exposed via
`DataFrameStatFunctions.approxQuantile` in the DataFrame API
- Other aggregate functions: Count, Sum, Avg, Min, Max
---
*This issue was auto-generated from Spark reference documentation.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]