devanshu0987 opened a new pull request, #20059:
URL: https://github.com/apache/datafusion/pull/20059
This adds a `preimage` implementation for the `floor()` function that
transforms `floor(x) = N` into `x >= N AND x < N+1`. This enables
statistics-based predicate pushdown for queries using floor().
For example, a query like:
`SELECT * FROM t WHERE floor(price) = 100`
is rewritten to:
`SELECT * FROM t WHERE price >= 100 AND price < 101`
This allows the query engine to leverage min/max statistics from Parquet row
groups, significantly reducing the amount of data scanned.
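The rewrite above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in this PR: `floor_preimage` and `matches_rewritten` are hypothetical names, and the two guards (a non-integral literal, and precision loss when `N + 1` rounds back to `N`) are assumptions about which cases a sound rewrite would need to reject.

```rust
/// Hypothetical helper: returns the half-open interval [lo, hi) of all x
/// such that floor(x) == n, or None when no simple range rewrite is sound.
fn floor_preimage(n: f64) -> Option<(f64, f64)> {
    // floor() only returns integral values, so a NaN, infinite, or
    // non-integral literal cannot be expressed as `x >= n AND x < n + 1`.
    if !n.is_finite() || n.fract() != 0.0 {
        return None;
    }
    let hi = n + 1.0;
    // For very large magnitudes, n + 1.0 can round back to n in f64,
    // which would produce an empty interval.
    if hi <= n {
        return None;
    }
    Some((n, hi))
}

/// Evaluates `floor(x) = n` through the rewritten range when one exists,
/// and falls back to the original predicate otherwise.
fn matches_rewritten(x: f64, n: f64) -> bool {
    match floor_preimage(n) {
        Some((lo, hi)) => x >= lo && x < hi,
        None => x.floor() == n,
    }
}
```

With these definitions, `floor_preimage(100.0)` yields the bounds `100.0` and `101.0` used in the example query, while `floor_preimage(100.5)` yields `None`.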
Benchmarks on the ClickBench hits dataset show:
- 80% file pruning (89 out of 111 files skipped)
- 70x fewer rows scanned (1.4M vs 100M)
```
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'benchmarks/data/hits_partitioned/';

-- Test the floor preimage optimization
EXPLAIN ANALYZE
SELECT COUNT(*) FROM hits WHERE floor(CAST("CounterID" AS DOUBLE)) = 62;
```
Metric | Before (no preimage) | After (with preimage)
-- | -- | --
Files pruned | 111 → 111 (0 pruned) | 111 → 22 (89 pruned)
Row groups pruned | 325 → 325 (0 pruned) | 51 → 4 (47 pruned)
Rows scanned | 99,997,497 | 1,410,000
Output rows | 738,172 | 738,172
Pruning predicate | None | `CAST(CounterID_max) >= 62 AND CAST(CounterID_min) < 63`
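
The pruning predicate in the last row of the table has the shape `max >= lo AND min < hi`. The sketch below shows how a predicate of that shape decides which containers to skip. It is illustrative only: `MinMax`, `may_contain`, and `surviving` are hypothetical names, the statistics are made-up values, and DataFusion's actual pruning machinery is not reproduced here.

```rust
/// Hypothetical min/max statistics for one container
/// (for example, a single Parquet row group).
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: f64,
    max: f64,
}

/// Returns true when the container may hold a row with lo <= x < hi and
/// therefore has to be scanned. Same shape as `max >= lo AND min < hi`.
fn may_contain(stats: MinMax, lo: f64, hi: f64) -> bool {
    stats.max >= lo && stats.min < hi
}

/// Counts the containers that survive pruning for the range [lo, hi).
fn surviving(stats: &[MinMax], lo: f64, hi: f64) -> usize {
    stats.iter().filter(|s| may_contain(**s, lo, hi)).count()
}
```

A container whose values all lie below 62, or all lie at or above 63, fails the check and is skipped without being read. Without the rewrite, `floor(...)` wraps the column and no such min/max comparison is available, which matches the "None" entry in the table.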
## Which issue does this PR close?
- Closes #.
## Rationale for this change
https://github.com/apache/datafusion/issues/19946
The epic linked above introduced the pre-image API. This PR uses that API to
implement `preimage` for the `floor` function in the cases where it applies.
## What changes are included in this PR?
## Are these changes tested?
- Unit tests added.
- Existing SLT tests pass with this change.
## Are there any user-facing changes?
No