andygrove opened a new pull request, #4357: URL: https://github.com/apache/datafusion-comet/pull/4357
## Which issue does this PR close? Closes the INT96 portion of #4219. **Stacked on #4356** (annotated LTZ encodings). ## Rationale for this change #4356 added a schema-adapter rejection for \`(Timestamp(_, Some(_)), Timestamp(_, None))\` so the \`native_datafusion\` scan throws \`SchemaColumnConvertNotSupportedException\` (SPARK-36182) when a Parquet TimestampLTZ column is read as TimestampNTZ on Spark 3.x. That covered the annotated LTZ encodings (\`TIMESTAMP_MICROS\` / \`TIMESTAMP_MILLIS\` with \`isAdjustedToUTC=true\`) but not INT96: DataFusion's \`coerce_int96\` strips the timezone, so INT96 columns surface as \`Timestamp(us, None)\` and the existing pattern doesn't fire. apache/datafusion#22318 adds an optional \`coerce_int96_tz\` config option that lets us ask DataFusion to coerce INT96 to \`Timestamp(us, Some(\"UTC\"))\` instead. With that, INT96-derived columns carry the LTZ signal at the Arrow level and the existing schema-adapter rejection catches them with no additional Comet logic. ## What changes are included in this PR? Cumulative changes since main (includes #4356 plus this PR's INT96 follow-on): - Plumbing for \`allow_timestamp_ltz_to_ntz\` from \`ShimCometConf\` through proto into \`SparkParquetOptions\`, plus the schema-adapter rejection (from #4356). - **New here:** set \`table_parquet_options.global.coerce_int96_tz = Some(\"UTC\".to_string())\` alongside the existing \`coerce_int96 = \"us\"\` in \`get_options\`. - **New here:** unskip the INT96 + \`native_datafusion\` variant in \`ParquetTimestampLtzAsNtzSuite\`. - **New here:** \`[patch.crates-io]\` pin to the andygrove/datafusion \`coerce-int96-timezone-on-53.1.0\` branch. This is intentionally a **draft** and **must not merge** until apache/datafusion#22318 ships in a DataFusion release and Comet bumps to it. The \`[patch.crates-io]\` block must be removed before the real merge. ## How are these changes tested? \`ParquetTimestampLtzAsNtzSuite\` on Spark 3.5 with the INT96 \`assume()\` skip removed: - INT96 + native_datafusion: now rejects correctly (previously returned data silently). - TIMESTAMP_MICROS / TIMESTAMP_MILLIS + native_datafusion: still rejects (unchanged from #4356). - Spark 4.0 positive cases: still pass. \`CometCastSuite\` (full): 155/155 passing. An earlier abandoned approach (disabling \`coerce_int96\` entirely so INT96 surfaces as nanosecond resolution) regressed extreme-date casts because int64 nanos overflows past year ~2262. Keeping \`coerce_int96 = \"us\"\` and only attaching a UTC timezone avoids that. Local verification: - \`make\` - \`./mvnw test -Pspark-3.5 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite\` - \`./mvnw test -Pspark-3.5 -Dsuites=org.apache.comet.parquet.ParquetReadV1Suite\` (no regressions) - \`./mvnw test -Pspark-3.5 -Dsuites=org.apache.comet.CometCastSuite\` (no regressions) - \`./mvnw test -Pspark-4.0 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite\` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
