andygrove opened a new pull request, #4357:
URL: https://github.com/apache/datafusion-comet/pull/4357

   ## Which issue does this PR close?
   
   Closes the INT96 portion of #4219. **Stacked on #4356** (annotated LTZ 
encodings).
   
   ## Rationale for this change
   
   #4356 added a schema-adapter rejection for \`(Timestamp(_, Some(_)), 
Timestamp(_, None))\` so the \`native_datafusion\` scan throws 
\`SchemaColumnConvertNotSupportedException\` (SPARK-36182) when a Parquet 
TimestampLTZ column is read as TimestampNTZ on Spark 3.x. That covered the 
annotated LTZ encodings (\`TIMESTAMP_MICROS\` / \`TIMESTAMP_MILLIS\` with 
\`isAdjustedToUTC=true\`) but not INT96: DataFusion's \`coerce_int96\` strips 
the timezone, so INT96 columns surface as \`Timestamp(us, None)\` and the 
existing pattern doesn't fire.
   
   apache/datafusion#22318 adds an optional \`coerce_int96_tz\` config option 
that lets us ask DataFusion to coerce INT96 to \`Timestamp(us, Some(\"UTC\"))\` 
instead. With that, INT96-derived columns carry the LTZ signal at the Arrow 
level and the existing schema-adapter rejection catches them with no additional 
Comet logic.
   
   ## What changes are included in this PR?
   
   Cumulative changes since main (includes #4356 plus this PR's INT96 
follow-on):
   
   - Plumbing for \`allow_timestamp_ltz_to_ntz\` from \`ShimCometConf\` through 
proto into \`SparkParquetOptions\`, plus the schema-adapter rejection (from 
#4356).
   - **New here:** set \`table_parquet_options.global.coerce_int96_tz = 
Some(\"UTC\".to_string())\` alongside the existing \`coerce_int96 = \"us\"\` in 
\`get_options\`.
   - **New here:** unskip the INT96 + \`native_datafusion\` variant in 
\`ParquetTimestampLtzAsNtzSuite\`.
   - **New here:** \`[patch.crates-io]\` pin to the andygrove/datafusion 
\`coerce-int96-timezone-on-53.1.0\` branch.
   
   This is intentionally a **draft** and **must not merge** until 
apache/datafusion#22318 ships in a DataFusion release and Comet bumps to it. 
The \`[patch.crates-io]\` block must be removed before the real merge.
   
   ## How are these changes tested?
   
   \`ParquetTimestampLtzAsNtzSuite\` on Spark 3.5 with the INT96 \`assume()\` 
skip removed:
   - INT96 + native_datafusion: now rejects correctly (previously returned data 
silently).
   - TIMESTAMP_MICROS / TIMESTAMP_MILLIS + native_datafusion: still rejects 
(unchanged from #4356).
   - Spark 4.0 positive cases: still pass.
   
   \`CometCastSuite\` (full): 155/155 passing. An earlier abandoned approach 
(disabling \`coerce_int96\` entirely so INT96 surfaces as nanosecond 
resolution) regressed extreme-date casts because int64 nanos overflows past 
year ~2262. Keeping \`coerce_int96 = \"us\"\` and only attaching a UTC timezone 
avoids that.
   
   Local verification:
   - \`make\`
   - \`./mvnw test -Pspark-3.5 
-Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite\`
   - \`./mvnw test -Pspark-3.5 
-Dsuites=org.apache.comet.parquet.ParquetReadV1Suite\` (no regressions)
   - \`./mvnw test -Pspark-3.5 -Dsuites=org.apache.comet.CometCastSuite\` (no 
regressions)
   - \`./mvnw test -Pspark-4.0 
-Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite\`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to