andygrove opened a new pull request, #22318: URL: https://github.com/apache/datafusion/pull/22318
## Which issue does this PR close?

- Closes #.

(Filed in support of apache/datafusion-comet#4219; happy to file a DataFusion issue if desired.)

## Rationale for this change

`coerce_int96_to_resolution` currently produces `Timestamp(unit, None)` for every INT96-derived column. Some downstream readers need the resulting Arrow type to carry a timezone, because the *absence* of a timezone is itself meaningful.

The motivating case is Apache DataFusion Comet (a Spark accelerator) trying to enforce SPARK-36182: pre-Spark-4 Spark rejects reading a Parquet TimestampLTZ column as TimestampNTZ. Comet's schema adapter pattern-matches `Timestamp(_, Some(_)) -> Timestamp(_, None)` to detect this case, but for INT96 columns the post-coerce type is `Timestamp(unit, None)`, indistinguishable from a true TimestampNTZ source. The LTZ signal is destroyed at the wrong layer.

Spark and other systems write INT96 as UTC-adjusted instants, so a caller can ask for the column to surface as `Timestamp(unit, Some("UTC"))`, preserving the LTZ semantic at the Arrow level.

## What changes are included in this PR?

- New `TableParquetOptions.global.coerce_int96_tz: Option<String>` config field (defaults to `None`).
- `coerce_int96_to_resolution` gains a `timezone: Option<Arc<str>>` parameter and threads it into the constructed `Timestamp` type.
- The new option is plumbed through `ParquetSource` -> `ParquetOpener` / `ParquetMorselizer` -> `DFParquetMetadata`.
- `with_coerce_int96_tz` builder method on `DFParquetMetadata`.
- Default behavior is unchanged when the option is unset.

## Are these changes tested?

The existing `coerce_int96_to_resolution_with_mixed_timestamps` and `coerce_int96_to_resolution_with_nested_types` tests are updated to pass `None` for the new parameter and continue to assert the historical default. I'll add a positive test for the `Some("UTC")` path on the next push; opening as draft to gather feedback on the API shape first.

End-to-end validation: I have a Comet branch (apache/datafusion-comet#TBD) that depends on this fork via `[patch.crates-io]`, sets `coerce_int96_tz = "UTC"`, and confirms that `ParquetTimestampLtzAsNtzSuite` on Spark 3.5 now correctly rejects INT96 -> TimestampNTZ via Comet's existing schema-adapter pattern. `CometCastSuite` passes (no precision regression for extreme dates).

## Are there any user-facing changes?

A new `coerce_int96_tz` config option. Adding rather than modifying, so no API break.
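
For reviewers who want the rationale in miniature, here is a rough, self-contained sketch of why threading a timezone through the coercion matters. It deliberately does not use the real Arrow or DataFusion types: `SketchType`, `TimeUnit`, `coerce_int96_sketch`, and `is_ltz_read_as_ntz` are hypothetical stand-ins invented for illustration, and the real `coerce_int96_to_resolution` operates on Parquet and Arrow schemas rather than on a single type.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins for Arrow's time unit and data type.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Millisecond,
    Microsecond,
    Nanosecond,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum SketchType {
    Int96,
    Int64,
    Timestamp(TimeUnit, Option<Arc<str>>),
}

// Sketch of the coercion: an INT96 column becomes a timestamp with the
// requested unit and the caller-supplied timezone; other types pass through.
fn coerce_int96_sketch(
    ty: &SketchType,
    unit: TimeUnit,
    timezone: Option<Arc<str>>,
) -> SketchType {
    match ty {
        SketchType::Int96 => SketchType::Timestamp(unit, timezone),
        other => other.clone(),
    }
}

// Sketch of a downstream check in the spirit of the schema-adapter pattern
// described above: a file column with a timezone requested as one without.
fn is_ltz_read_as_ntz(file: &SketchType, requested: &SketchType) -> bool {
    matches!(
        (file, requested),
        (SketchType::Timestamp(_, Some(_)), SketchType::Timestamp(_, None))
    )
}
```

With `timezone` set to `None`, the coerced INT96 type equals a plain `Timestamp(unit, None)`, so the check cannot fire. With `Some("UTC")`, the same check detects the LTZ-to-NTZ read.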
