andygrove opened a new issue, #3131:
URL: https://github.com/apache/datafusion-comet/issues/3131
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification
> details have been extracted from Spark documentation and may need verification.
Comet does not currently support Spark's `years` partition transform expression, so
queries that use it fall back to Spark's JVM execution instead of running natively
on DataFusion.
The `Years` expression is a v2 partition transform that extracts the year
component from date/timestamp values for partitioning purposes. It converts
temporal data into integer year values, enabling efficient time-based
partitioning strategies in Spark SQL tables.
Supporting this expression would allow more Spark workloads to benefit from
Comet's native acceleration.
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
YEARS(column_name)
```
```scala
// Catalyst expression (the child is an Expression, so unwrap the Column)
Years(col("date_column").expr)
```
**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| child | Expression | The input expression, typically a date or timestamp column |
**Return Type:** `IntegerType` - Returns the year as an integer value.
**Supported Data Types:**
- DateType
- TimestampType
- TimestampNTZType (timestamp without timezone)
**Edge Cases:**
- Null handling: Returns null when the input expression is null (see the sketch after this list)
- Invalid dates: Follows Spark's standard date parsing and validation rules
- Year boundaries: Correctly handles leap years and year transitions
- Timezone effects: For timestamp inputs, the year extraction respects the
session timezone setting
- Historical dates: Supports dates across the full range supported by
Spark's date types
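
As a concrete illustration of the null-handling case above, here is a minimal sketch using the DataFrameWriterV2 API. It assumes a configured v2 catalog named `local` (for example Iceberg); the table and column names are illustrative:

```scala
// Minimal sketch; `local.db.events_by_year` and the column names are illustrative.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, years}
import spark.implicits._

val df = Seq(
  (1L, Timestamp.valueOf("2023-06-15 12:00:00")),
  (2L, null.asInstanceOf[Timestamp]) // null input: the years value is null
).toDF("id", "event_time")

df.writeTo("local.db.events_by_year")
  .partitionedBy(years(col("event_time"))) // rows with a null event_time land in a null partition
  .createOrReplace()
```

How the null partition value is materialized is ultimately up to the table format, so it is worth verifying against the target catalog.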
**Examples:**
```sql
-- Creating a table partitioned by years
CREATE TABLE events (
  id BIGINT,
  event_time TIMESTAMP,
  data STRING
) USING iceberg
PARTITIONED BY (YEARS(event_time));

-- Query that benefits from partition pruning
SELECT * FROM events
WHERE event_time >= '2023-01-01' AND event_time < '2024-01-01';
```
```scala
// Catalyst expression used internally for the years partition transform
import org.apache.spark.sql.catalyst.expressions.Years
import org.apache.spark.sql.functions.{col, years}

val yearTransform = Years(col("event_time").expr)

// DataFrameWriterV2 write partitioned by the years transform
// (assumes a v2 catalog such as Iceberg is configured)
df.writeTo("catalog.db.events")
  .partitionedBy(years(col("event_time")))
  .createOrReplace()
```
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add an expression handler in
`spark/src/main/scala/org/apache/comet/serde/` (a rough sketch follows this list)
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if
needed
4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has
built-in support first)
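
As a very rough starting point, the Scala serde handler might look like the sketch below. The trait name `CometExpressionSerde`, the `convert` signature, the `exprToProtoInternal` helper, and the `Years` proto message are all assumptions based on the pattern in the contributor guide and existing handlers, and should be checked against the current codebase:

```scala
// Hypothetical sketch only; names and signatures are assumed, not verified.
import org.apache.spark.sql.catalyst.expressions.{Attribute, Years}
import org.apache.comet.serde.ExprOuterClass.Expr

object CometYears extends CometExpressionSerde[Years] {
  override def convert(
      expr: Years,
      inputs: Seq[Attribute],
      binding: Boolean): Option[Expr] = {
    // Serialize the child (the date/timestamp column) first.
    exprToProtoInternal(expr.child, inputs, binding).map { childProto =>
      // Wrap the child in a (hypothetical) Years transform message.
      Expr
        .newBuilder()
        .setYears(ExprOuterClass.Years.newBuilder().setChild(childProto))
        .build()
    }
  }
}
```

Registration would then be a one-line addition to the expression map in `QueryPlanSerde.scala`, and on the native side it is worth checking whether DataFusion's existing date/time kernels (for example `date_part`) already cover the required semantics before writing a new one.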
## Additional context
**Difficulty:** Medium
**Spark Expression Class:** `org.apache.spark.sql.catalyst.expressions.Years`
**Related:**
- `Months` - Monthly partition transform
- `Days` - Daily partition transform
- `Hours` - Hourly partition transform
- `Bucket` - Hash-based partition transform
- `PartitionTransformExpression` - Base class for partition transforms
---
*This issue was auto-generated from Spark reference documentation.*