andygrove opened a new pull request, #4356:
URL: https://github.com/apache/datafusion-comet/pull/4356

   ## Which issue does this PR close?
   
   Closes #4219 (partially: INT96 case tracked in the same issue, see below).
   
   ## Rationale for this change
   
   Pre-Spark-4 
([SPARK-36182](https://issues.apache.org/jira/browse/SPARK-36182)) rejects 
reading a Parquet TimestampLTZ column as TimestampNTZ because LTZ encodes 
UTC-adjusted instants that cannot be safely reinterpreted as timezone-free 
values. The `native_datafusion` scan did not enforce this, and silently 
returned the UTC instant as the NTZ value. That is a correctness divergence on 
Spark 3.x: queries Spark would have failed instead return values, and 
downstream filters / joins / aggregations on the column may produce different 
results than running the same query without Comet. Spark 4.0 
([SPARK-47447](https://issues.apache.org/jira/browse/SPARK-47447)) lifted the 
restriction.
   
   The `native_iceberg_compat` path was already correct because its JVM-side 
`TypeUtil.checkParquetType` rejects the read before reaching native code.
   
   ## What changes are included in this PR?
   
   - New per-Spark-version constant `COMET_ALLOW_TIMESTAMP_LTZ_AS_NTZ` in 
`ShimCometConf` (false on 3.x, true on 4.x).
   - New `allow_timestamp_ltz_to_ntz` field on the `NativeScanCommon` proto, 
set from the shim constant in `CometNativeScan`, plumbed into 
`SparkParquetOptions` via `init_datasource_exec` / `get_options`.
   - New rejection arm in `SparkPhysicalExprAdapter` (`schema_adapter.rs`) that 
emits `reject_on_non_empty_expr` when an Arrow `Timestamp(_, Some(_))` column 
is read as `Timestamp(_, None)` and the flag forbids it. Deferred to runtime so 
empty files (SPARK-26709) continue to pass.
   - `iceberg-compat` JNI path passes `true` because 
`TypeUtil.checkParquetType` has already validated.
   - Compatibility guide updated 
(`docs/source/user-guide/latest/compatibility/scans.md`) to narrow the 
documented gap to INT96 and describe the correctness implications.
   
   ### Known gap
   
   INT96 → TimestampNTZ on Spark 3.x is still not rejected by 
`native_datafusion`. DataFusion's `coerce_int96` produces `Timestamp(unit, 
None)` for INT96 columns, stripping the source timezone before Comet's schema 
adapter sees the column. At that point INT96 is indistinguishable from a true 
TIMESTAMP_NTZ source, so the new pattern does not fire. The annotated LTZ 
encodings (`TIMESTAMP_MICROS` / `TIMESTAMP_MILLIS` with `isAdjustedToUTC=true`) 
are covered. The INT96 + `native_datafusion` test case in the suite stays 
skipped with a link back to #4219.
   
   ## How are these changes tested?
   
   `ParquetTimestampLtzAsNtzSuite` already covered the pre-Spark-4 case for 
`native_iceberg_compat` and the Spark 4.0+ positive case. It has been extended 
to parametrize the pre-Spark-4 case across both scan implementations, so the 
three `native_datafusion` variants (INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS) 
now run instead of being globally skipped. On Spark 3.5, the two 
annotated-encoding variants pass; the INT96 variant skips with a link to #4219. 
On Spark 4.0 the existing positive tests continue to pass.
   
   Verified locally:
   
   - `make` (default Spark profile)
   - `./mvnw test -Pspark-3.5 
-Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite`
   - `./mvnw test -Pspark-4.0 
-Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite`
   - `cargo clippy --all-targets --workspace -- -D warnings`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to