andygrove opened a new pull request, #4761: URL: https://github.com/apache/datafusion-comet/pull/4761
## Which issue does this PR close? Closes #2649. ## Rationale for this change Native `date_trunc` (TruncTimestamp) hit two issues when run with the native path in non-UTC sessions: 1. **Schema mismatch.** `TimestampTruncExpr::data_type()` returned `Timestamp(Microsecond, None)` regardless of the input, but the kernel emitted a timezone-annotated output. Any downstream operator that compares declared vs. actual schema (e.g. `RowConverter` during shuffle/sort) failed with `RowConverter column schema mismatch, expected Timestamp(Microsecond, Some(\"America/Denver\")) got Timestamp(Microsecond, Some(\"UTC\"))` — the symptom in the original bug report. 2. **DST offset bug.** The truncation kernel reused the input's UTC offset for the truncated result. Truncating a timestamp across a DST boundary (e.g. November in `America/Denver` truncated to QUARTER → October 1) produced a result one hour off. The default routing for non-UTC `date_trunc` is the JVM codegen dispatcher (via `CodegenDispatchFallback`), which has always been correct and remains the default. This PR fixes the underlying native path so it is safe to opt into via `spark.comet.expression.TruncTimestamp.allowIncompatible=true`. ## What changes are included in this PR? **`native/spark-expr/src/datetime_funcs/timestamp_trunc.rs`** — `data_type()` now matches the kernel's output: NTZ inputs stay NTZ; otherwise the expression's session timezone is propagated. **`native/spark-expr/src/kernels/temporal.rs`** - `as_timestamp_tz_with_op` and the `timestamp_trunc_array_fmt_helper!` macro now annotate the output array with the session timezone, eliminating the schema mismatch. - DST handling in `as_micros_from_unix_epoch_utc` was already present on `main` (re-resolving the naive local time through the timezone after truncation); this PR keeps that and exercises it with a new unit test. - New unit test `test_timestamp_trunc_dst_boundary` covers the `America/Denver` November→October QUARTER case. **`spark/src/test/scala/org/apache/comet/CometTemporalExpressionSuite.scala`** — the existing `non-UTC timezone takes native path when allowIncompatible is enabled` test was a stays-in-Comet smoke check (because the native path produced wrong answers). With the fixes it now compares against Spark for every supported format across `America/New_York`, `Europe/London`, and `Asia/Tokyo`. Base date is pinned to 2024 to stay within chrono-tz's precomputed DST horizon (≈year 2100). **`docs/source/user-guide/latest/compatibility/expressions/_category_template/datetime.md`** — clarify the new state: native path is correct within the chrono-tz horizon and opt-in; codegen dispatcher remains the default for non-UTC sessions. ## How are these changes tested? - New Rust unit test `test_timestamp_trunc_dst_boundary` (`cargo test -p datafusion-comet-spark-expr kernels::temporal`). - `CometTemporalExpressionSuite` (34/34 pass) — exercises the fix end-to-end across three non-UTC timezones and all 15 supported formats. - `CometSqlFileTestSuite` (387/387) — runs the existing `trunc_timestamp_dst.sql` regression test with `allowIncompatible=true`. - `CometExecSuite` (131/131) and `CometCastSuite` (170/170) — no regressions. ## Notes on the previous attempt PR #3877 took a broader scope (flipping the default to native, gating only non-1-hour-DST timezones) and was closed because the JVM tests used a base date in year 3333, well past chrono-tz's DST horizon, and produced unreliable results. This PR is intentionally minimal: it fixes the two underlying bugs but does not flip the default, so the codegen dispatcher continues to handle non-UTC sessions safely. Migrating off chrono-tz (e.g. to jiff) would lift the year-2100 limit but is a much broader change worth pursuing separately. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
