andygrove opened a new pull request, #4356: URL: https://github.com/apache/datafusion-comet/pull/4356
## Which issue does this PR close? Closes #4219 (partially: INT96 case tracked in the same issue, see below). ## Rationale for this change Pre-Spark-4 ([SPARK-36182](https://issues.apache.org/jira/browse/SPARK-36182)) rejects reading a Parquet TimestampLTZ column as TimestampNTZ because LTZ encodes UTC-adjusted instants that cannot be safely reinterpreted as timezone-free values. The `native_datafusion` scan did not enforce this, and silently returned the UTC instant as the NTZ value. That is a correctness divergence on Spark 3.x: queries Spark would have failed instead return values, and downstream filters / joins / aggregations on the column may produce different results than running the same query without Comet. Spark 4.0 ([SPARK-47447](https://issues.apache.org/jira/browse/SPARK-47447)) lifted the restriction. The `native_iceberg_compat` path was already correct because its JVM-side `TypeUtil.checkParquetType` rejects the read before reaching native code. ## What changes are included in this PR? - New per-Spark-version constant `COMET_ALLOW_TIMESTAMP_LTZ_AS_NTZ` in `ShimCometConf` (false on 3.x, true on 4.x). - New `allow_timestamp_ltz_to_ntz` field on the `NativeScanCommon` proto, set from the shim constant in `CometNativeScan`, plumbed into `SparkParquetOptions` via `init_datasource_exec` / `get_options`. - New rejection arm in `SparkPhysicalExprAdapter` (`schema_adapter.rs`) that emits `reject_on_non_empty_expr` when an Arrow `Timestamp(_, Some(_))` column is read as `Timestamp(_, None)` and the flag forbids it. Deferred to runtime so empty files (SPARK-26709) continue to pass. - `iceberg-compat` JNI path passes `true` because `TypeUtil.checkParquetType` has already validated. - Compatibility guide updated (`docs/source/user-guide/latest/compatibility/scans.md`) to narrow the documented gap to INT96 and describe the correctness implications. ### Known gap INT96 → TimestampNTZ on Spark 3.x is still not rejected by `native_datafusion`. DataFusion's `coerce_int96` produces `Timestamp(unit, None)` for INT96 columns, stripping the source timezone before Comet's schema adapter sees the column. At that point INT96 is indistinguishable from a true TIMESTAMP_NTZ source, so the new pattern does not fire. The annotated LTZ encodings (`TIMESTAMP_MICROS` / `TIMESTAMP_MILLIS` with `isAdjustedToUTC=true`) are covered. The INT96 + `native_datafusion` test case in the suite stays skipped with a link back to #4219. ## How are these changes tested? `ParquetTimestampLtzAsNtzSuite` already covered the pre-Spark-4 case for `native_iceberg_compat` and the Spark 4.0+ positive case. It has been extended to parametrize the pre-Spark-4 case across both scan implementations, so the three `native_datafusion` variants (INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS) now run instead of being globally skipped. On Spark 3.5, the two annotated-encoding variants pass; the INT96 variant skips with a link to #4219. On Spark 4.0 the existing positive tests continue to pass. Verified locally: - `make` (default Spark profile) - `./mvnw test -Pspark-3.5 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite` - `./mvnw test -Pspark-4.0 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite` - `cargo clippy --all-targets --workspace -- -D warnings` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
