[PR] Fix ClickBench EventDate handling by casting UInt16 days-since-epoch to DATE via `hits` view [datafusion]

via GitHub Sun, 18 Jan 2026 21:59:56 -0800


kosiew opened a new pull request, #19881:
URL: https://github.com/apache/datafusion/pull/19881


   ## Which issue does this PR close?
   
   * Closes #18982.
   
   ---
   
   ## Rationale for this change
   
   ClickBench encodes `EventDate` as a `UInt16` representing **days since 
1970-01-01**. When DataFusion registers the ClickBench parquet file directly as 
`hits`, `EventDate` ends up being compared as a string in some queries (notably 
ClickBench queries 36–42), which causes the date range predicates to filter out 
all rows.
   
   To make ClickBench queries behave as authored (and align with how other 
engines handle the dataset), we expose `hits` as a view that converts the raw 
`UInt16` encoding into a proper SQL `DATE`.
   
   ---
   
   ## What changes are included in this PR?
   
   * Register the underlying parquet table as **`hits_raw`** instead of `hits`.
   * Add a constant **`HITS_VIEW_DDL`** that defines a `hits` view which:
   
     * Removes the original `EventDate` column, and
     * Re-introduces it as `DATE` using `CAST(CAST("EventDate" AS INTEGER) AS 
DATE)`.
   * Factor view creation into a helper method (`create_hits_view`) and add 
error context for easier debugging.
   * Update the ClickBench sqllogictest file to:
   
     * Create `hits_raw` + `hits` view,
     * Add explicit assertions validating the transformation (`15901` ↔ 
`2013-07-15`),
     * Update expected result types where `EventDate` is now a `DATE`, and
     * Drop the view before dropping the raw table.
   
   ---
   
   ## Are these changes tested?
   
   Yes.
   
   * Updated `datafusion/sqllogictest/test_files/clickbench.slt` to cover:
   
     * Correct `EventDate` decoding in the `hits` view (returns `DATE`),
     * Raw `hits_raw.EventDate` remains the original integer encoding, and
     * Existing ClickBench queries that rely on date predicates (including the 
previously failing range-filter queries) now execute with the correct types.
   
   Script to test q36-q42.
   
   `benchmarks/run_q36_q42.sh`
   ```bash
   #!/usr/bin/env bash
   # Script to run ClickBench queries 36-42 and display results
   
   set -e
   
   SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && 
pwd )
   BENCHMARK=${1:-"clickbench_1"}
   OUTPUT_FILE="${2:-results_q36_q42.txt}"
   
   echo "=========================================="
   echo "Running ClickBench Queries 36-42"
   echo "=========================================="
   echo "Benchmark: $BENCHMARK"
   echo "Output file: $OUTPUT_FILE"
   echo ""
   
   # Create results file
   > "$OUTPUT_FILE"
   
   # Run queries 36-42
   for q in {36..42}; do
       echo "Running Query $q..."
       
       # Run the query and extract relevant info
       output=$($SCRIPT_DIR/bench.sh run $BENCHMARK $q 2>&1)
       
       # Extract timing and row count from the first iteration
       iteration_0=$(echo "$output" | grep "Query $q iteration 0" | head -1)
       avg_time=$(echo "$output" | grep "Query $q avg time" | head -1)
       
       echo "Q$q: $iteration_0" | tee -a "$OUTPUT_FILE"
       echo "     $avg_time" | tee -a "$OUTPUT_FILE"
       echo "" | tee -a "$OUTPUT_FILE"
   done
   
   echo "=========================================="
   echo "Summary saved to: $OUTPUT_FILE"
   echo "=========================================="
   cat "$OUTPUT_FILE"
   ```
   
   Run results on this branch:
   ```
   Q36: Query 36 iteration 0 took 138.1 ms and returned 10 rows
        Query 36 avg time: 116.19 ms
   
   Q37: Query 37 iteration 0 took 66.4 ms and returned 10 rows
        Query 37 avg time: 50.57 ms
   
   Q38: Query 38 iteration 0 took 98.9 ms and returned 10 rows
        Query 38 avg time: 83.20 ms
   
   Q39: Query 39 iteration 0 took 237.3 ms and returned 10 rows
        Query 39 avg time: 223.62 ms
   
   Q40: Query 40 iteration 0 took 40.6 ms and returned 10 rows
        Query 40 avg time: 24.43 ms
   
   Q41: Query 41 iteration 0 took 36.6 ms and returned 10 rows
        Query 41 avg time: 22.90 ms
   
   Q42: Query 42 iteration 0 took 34.0 ms and returned 10 rows
        Query 42 avg time: 20.17 ms
   ```
   
   On `main` branch, the queries return 0 rows.
   
   ---
   
   ## Are there any user-facing changes?
   
   Yes (benchmark/test behavior):
   
   * In ClickBench runs, the logical table name `hits` continues to exist, but 
it is now a **view** that exposes `EventDate` as a proper `DATE` rather than 
the raw `UInt16` encoding.
   * This fixes ClickBench queries 36–42 so they return rows without requiring 
manual casts in the SQL.
   
   No public API changes.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Fix ClickBench EventDate handling by casting UInt16 days-since-epoch to DATE via `hits` view [datafusion]

Reply via email to