drin opened a new pull request, #18648:
URL: https://github.com/apache/datafusion/pull/18648
This decomposes a custom optimizer rule into the datafusion expression
simplifier (work-in-progress).
## Which issue does this PR close?
Closes #18319.
## Rationale for this change
To transform binary expressions that compare `date_trunc` with a constant
value into a form that can be better utilized (improved performance).
For Bauplan, we can see the following (approximate average over a handful of
runs):
Q1:
```sql
SELECT PULocationID, trip_miles, tips
FROM taxi_fhvhv
WHERE date_trunc('month', pickup_datetime) <= '2025-01-08'::DATE
```
Q2:
```sql
SELECT PULocationID, trip_miles, tips
FROM taxi_fhvhv
WHERE pickup_datetime < date_trunc('month', '2025-02-08'::DATE)
```
| Query | Time (s) | Options |
| ------- | -------- | -------- |
| Q1 | ~3 | no cache, optimization enabled |
| Q1 | ~35 | no cache, optimization disabled |
| Q2 | ~3 | no cache, optimization enabled |
| Q2 | ~3 | no cache, optimization disabled |
## What changes are included in this PR?
A few additional support functions and additional match arms in the
simplifier match expression.
## Are these changes tested?
Our custom rule has tests of the expression transformations and for correct
evaluation results. These will be added to the PR after the implementation is
in approximately good shape.
## Are there any user-facing changes?
Better performance and occasionally confusing explain plan. In short, a
`date_trunc('month', col) = '2025-12-03'::DATE` will always be false (because
the truncation result can never be a non-truncated value), which may produce an
unexpected expression (`false`).
Explain plan details below (may be overkill but it was fun to figure out):
Initial query:
```sql
SELECT PULocationID
,pickup_datetime
FROM taxi_view_2025
WHERE date_trunc('month', pickup_datetime) = '2025-12-03'
```
After simplify_expressions:
```sql
logical_plan after simplify_expressions | Projection:
taxi_view_2025.PULocationID, taxi_view_2025.pickup_datetime
|
| | Filter:
date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS
Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)
|
| |
TableScan: taxi_view_2025
```
Before and after `date_trunc_optimizer` (our custom rule):
```sql
logical_plan after optimize_projections | Filter:
date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS
Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)
|
| | TableScan:
taxi_view_2025 projection=[PULocationID, pickup_datetime]
|
| logical_plan after date_trunc_optimizer | Filter:
Boolean(false)
|
| | TableScan:
taxi_view_2025 projection=[PULocationID, pickup_datetime]
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]