andygrove opened a new pull request, #3878:
URL: https://github.com/apache/datafusion-comet/pull/3878

   ## Which issue does this PR close?
   
   Closes #2720
   
   ## Rationale for this change
   
   `CometSparkToColumnarExec` and `CometLocalTableScanExec` used 
`conf.sessionLocalTimeZone` when creating Arrow schemas for timestamp columns. 
However, the native side always deserializes `Timestamp` as 
`Timestamp(Microsecond, Some("UTC"))` in `serde.rs`. When the session timezone 
is non-UTC, this causes a timezone mismatch between the Arrow data schema and 
the native plan schema.
   
   The native `ScanExec.make_record_batch` detects this mismatch and casts the 
column to match. Since Arrow timestamps with timezone are always stored as UTC 
microseconds internally, this cast is a no-op on the data — but it adds 
unnecessary overhead and is inconsistent with `NativeBatchReader` (the Parquet 
scan path), which already hardcodes UTC.
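
To illustrate why the cast leaves the data untouched, here is a minimal Python sketch using only the standard library. It is not Comet or Arrow code: `to_epoch_micros` is a hypothetical helper, and the fixed UTC-8 offset is an assumed stand-in for a non-UTC session timezone. It shows that one instant maps to the same microseconds-since-epoch value whichever timezone it is expressed in, so only the schema label differs.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(dt: datetime) -> int:
    """Return microseconds since the Unix epoch for a timezone-aware datetime."""
    # Subtracting aware datetimes compares instants, not wall-clock fields.
    return (dt - EPOCH) // timedelta(microseconds=1)


# Fixed offset used as a stand-in for a non-UTC session timezone (assumption).
session_tz = timezone(timedelta(hours=-8))

instant_utc = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
instant_session = instant_utc.astimezone(session_tz)  # same instant, other zone

micros_utc = to_epoch_micros(instant_utc)
micros_session = to_epoch_micros(instant_session)
```

The wall-clock fields of `instant_session` differ from those of `instant_utc`, but the two stored integers are equal. A cast between the two timezone labels therefore changes only metadata.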
   
   ## What changes are included in this PR?
   
   - Changed `CometSparkToColumnarExec` and `CometLocalTableScanExec` to use 
`"UTC"` instead of `conf.sessionLocalTimeZone` for Arrow schema timezone, 
matching `NativeBatchReader` and native `serde.rs`
   - Removed outdated "known issues with non-UTC timezones" warnings from 
config descriptions for `spark.comet.convert.parquet.enabled`, 
`spark.comet.convert.json.enabled`, `spark.comet.convert.csv.enabled`, and 
`spark.comet.sparkToColumnar.enabled`
   - Moved these configs from `testing` to `exec` category since the timezone 
concern is resolved
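
The effect of the first change can be modelled with a small Python sketch. All names here (`TimestampType`, `arrow_timestamp_type`, `needs_cast`) are hypothetical and only mirror the behaviour described above; the actual code lives in Scala and Rust and may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimestampType:
    unit: str
    tz: Optional[str]


# The type the native side expects for Timestamp, per the description above.
NATIVE_TIMESTAMP = TimestampType("microsecond", "UTC")


def arrow_timestamp_type(session_tz: str, hardcode_utc: bool) -> TimestampType:
    """Hypothetical model of the JVM-side schema choice before/after this PR."""
    tz = "UTC" if hardcode_utc else session_tz
    return TimestampType("microsecond", tz)


def needs_cast(batch_type: TimestampType) -> bool:
    """Model of the mismatch check: a cast is needed when the types differ."""
    return batch_type != NATIVE_TIMESTAMP


# Before the change the session timezone leaked into the Arrow schema.
before = arrow_timestamp_type("America/Los_Angeles", hardcode_utc=False)
# After the change the schema always carries "UTC".
after = arrow_timestamp_type("America/Los_Angeles", hardcode_utc=True)
```

In this model `needs_cast(before)` is true and `needs_cast(after)` is false, which corresponds to the native scan no longer having to cast the column.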
   
   ## How are these changes tested?
   
   Added two tests in `CometExecSuite`:
   - `LocalTableScanExec with timestamps in non-UTC timezone` — verifies 
correct results with `CometLocalTableScanExec` when session timezone is 
`America/Los_Angeles`
   - `sort on timestamps with non-UTC timezone via LocalTableScan` — verifies 
correct results through a sort + repartition path with non-UTC timezone
   
   All 90 tests in `CometExecSuite` pass.



