kosiew opened a new pull request, #19881:
URL: https://github.com/apache/datafusion/pull/19881
## Which issue does this PR close?
* Closes #18982.
---
## Rationale for this change
ClickBench encodes `EventDate` as a `UInt16` representing **days since
1970-01-01**. When DataFusion registers the ClickBench parquet file directly as
`hits`, `EventDate` ends up being compared as a string in some queries (notably
ClickBench queries 36–42), which causes the date range predicates to filter out
all rows.
To make ClickBench queries behave as authored (and align with how other
engines handle the dataset), we expose `hits` as a view that converts the raw
`UInt16` encoding into a proper SQL `DATE`.
---
## What changes are included in this PR?
* Register the underlying parquet table as **`hits_raw`** instead of `hits`.
* Add a constant **`HITS_VIEW_DDL`** that defines a `hits` view which:
* Removes the original `EventDate` column, and
* Re-introduces it as `DATE` using `CAST(CAST("EventDate" AS INTEGER) AS
DATE)`.
* Factor view creation into a helper method (`create_hits_view`) and add
error context for easier debugging.
* Update the ClickBench sqllogictest file to:
* Create `hits_raw` + `hits` view,
* Add explicit assertions validating the transformation (`15901` ↔
`2013-07-15`),
* Update expected result types where `EventDate` is now a `DATE`, and
* Drop the view before dropping the raw table.
---
## Are these changes tested?
Yes.
* Updated `datafusion/sqllogictest/test_files/clickbench.slt` to cover:
* Correct `EventDate` decoding in the `hits` view (returns `DATE`),
* Raw `hits_raw.EventDate` remains the original integer encoding, and
* Existing ClickBench queries that rely on date predicates (including the
previously failing range-filter queries) now execute with the correct types.
Script to test q36-q42.
`benchmarks/run_q36_q42.sh`
```bash
#!/usr/bin/env bash
# Script to run ClickBench queries 36-42 and display results
set -e
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null &&
pwd )
BENCHMARK=${1:-"clickbench_1"}
OUTPUT_FILE="${2:-results_q36_q42.txt}"
echo "=========================================="
echo "Running ClickBench Queries 36-42"
echo "=========================================="
echo "Benchmark: $BENCHMARK"
echo "Output file: $OUTPUT_FILE"
echo ""
# Create results file
> "$OUTPUT_FILE"
# Run queries 36-42
for q in {36..42}; do
echo "Running Query $q..."
# Run the query and extract relevant info
output=$($SCRIPT_DIR/bench.sh run $BENCHMARK $q 2>&1)
# Extract timing and row count from the first iteration
iteration_0=$(echo "$output" | grep "Query $q iteration 0" | head -1)
avg_time=$(echo "$output" | grep "Query $q avg time" | head -1)
echo "Q$q: $iteration_0" | tee -a "$OUTPUT_FILE"
echo " $avg_time" | tee -a "$OUTPUT_FILE"
echo "" | tee -a "$OUTPUT_FILE"
done
echo "=========================================="
echo "Summary saved to: $OUTPUT_FILE"
echo "=========================================="
cat "$OUTPUT_FILE"
```
Run results on this branch:
```
Q36: Query 36 iteration 0 took 138.1 ms and returned 10 rows
Query 36 avg time: 116.19 ms
Q37: Query 37 iteration 0 took 66.4 ms and returned 10 rows
Query 37 avg time: 50.57 ms
Q38: Query 38 iteration 0 took 98.9 ms and returned 10 rows
Query 38 avg time: 83.20 ms
Q39: Query 39 iteration 0 took 237.3 ms and returned 10 rows
Query 39 avg time: 223.62 ms
Q40: Query 40 iteration 0 took 40.6 ms and returned 10 rows
Query 40 avg time: 24.43 ms
Q41: Query 41 iteration 0 took 36.6 ms and returned 10 rows
Query 41 avg time: 22.90 ms
Q42: Query 42 iteration 0 took 34.0 ms and returned 10 rows
Query 42 avg time: 20.17 ms
```
On `main` branch, the queries return 0 rows.
---
## Are there any user-facing changes?
Yes (benchmark/test behavior):
* In ClickBench runs, the logical table name `hits` continues to exist, but
it is now a **view** that exposes `EventDate` as a proper `DATE` rather than
the raw `UInt16` encoding.
* This fixes ClickBench queries 36–42 so they return rows without requiring
manual casts in the SQL.
No public API changes.
---
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]