andygrove opened a new pull request, #4761:
URL: https://github.com/apache/datafusion-comet/pull/4761

   ## Which issue does this PR close?
   
   Closes #2649.
   
   ## Rationale for this change
   
   Native `date_trunc` (TruncTimestamp) hit two issues when run with the native 
path in non-UTC sessions:
   
   1. **Schema mismatch.** `TimestampTruncExpr::data_type()` returned 
`Timestamp(Microsecond, None)` regardless of the input, but the kernel emitted 
a timezone-annotated output. Any downstream operator that compares declared vs. 
actual schema (e.g. `RowConverter` during shuffle/sort) failed with 
`RowConverter column schema mismatch, expected Timestamp(Microsecond, 
Some(\"America/Denver\")) got Timestamp(Microsecond, Some(\"UTC\"))` — the 
symptom in the original bug report.
   
   2. **DST offset bug.** The truncation kernel reused the input's UTC offset 
for the truncated result. Truncating a timestamp across a DST boundary (e.g. 
November in `America/Denver` truncated to QUARTER → October 1) produced a 
result one hour off.
   
   The default routing for non-UTC `date_trunc` is the JVM codegen dispatcher 
(via `CodegenDispatchFallback`), which has always been correct and remains the 
default. This PR fixes the underlying native path so it is safe to opt into via 
`spark.comet.expression.TruncTimestamp.allowIncompatible=true`.
   
   ## What changes are included in this PR?
   
   **`native/spark-expr/src/datetime_funcs/timestamp_trunc.rs`** — 
`data_type()` now matches the kernel's output: NTZ inputs stay NTZ; otherwise 
the expression's session timezone is propagated.
   
   **`native/spark-expr/src/kernels/temporal.rs`**
   - `as_timestamp_tz_with_op` and the `timestamp_trunc_array_fmt_helper!` 
macro now annotate the output array with the session timezone, eliminating the 
schema mismatch.
   - DST handling in `as_micros_from_unix_epoch_utc` was already present on 
`main` (re-resolving the naive local time through the timezone after 
truncation); this PR keeps that and exercises it with a new unit test.
   - New unit test `test_timestamp_trunc_dst_boundary` covers the 
`America/Denver` November→October QUARTER case.
   
   
**`spark/src/test/scala/org/apache/comet/CometTemporalExpressionSuite.scala`** 
— the existing `non-UTC timezone takes native path when allowIncompatible is 
enabled` test was a stays-in-Comet smoke check (because the native path 
produced wrong answers). With the fixes it now compares against Spark for every 
supported format across `America/New_York`, `Europe/London`, and `Asia/Tokyo`. 
Base date is pinned to 2024 to stay within chrono-tz's precomputed DST horizon 
(≈year 2100).
   
   
**`docs/source/user-guide/latest/compatibility/expressions/_category_template/datetime.md`**
 — clarify the new state: native path is correct within the chrono-tz horizon 
and opt-in; codegen dispatcher remains the default for non-UTC sessions.
   
   ## How are these changes tested?
   
   - New Rust unit test `test_timestamp_trunc_dst_boundary` (`cargo test -p 
datafusion-comet-spark-expr kernels::temporal`).
   - `CometTemporalExpressionSuite` (34/34 pass) — exercises the fix end-to-end 
across three non-UTC timezones and all 15 supported formats.
   - `CometSqlFileTestSuite` (387/387) — runs the existing 
`trunc_timestamp_dst.sql` regression test with `allowIncompatible=true`.
   - `CometExecSuite` (131/131) and `CometCastSuite` (170/170) — no regressions.
   
   ## Notes on the previous attempt
   
   PR #3877 took a broader scope (flipping the default to native, gating only 
non-1-hour-DST timezones) and was closed because the JVM tests used a base date 
in year 3333, well past chrono-tz's DST horizon, and produced unreliable 
results. This PR is intentionally minimal: it fixes the two underlying bugs but 
does not flip the default, so the codegen dispatcher continues to handle 
non-UTC sessions safely. Migrating off chrono-tz (e.g. to jiff) would lift the 
year-2100 limit but is a much broader change worth pursuing separately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to