andygrove opened a new pull request, #22318:
URL: https://github.com/apache/datafusion/pull/22318

   ## Which issue does this PR close?
   
   - Closes #.
   
   (Filed in support of apache/datafusion-comet#4219; happy to file a 
DataFusion issue if desired.)
   
   ## Rationale for this change
   
   `coerce_int96_to_resolution` currently produces `Timestamp(unit, None)` 
for every INT96-derived column. Some downstream readers need the resulting 
Arrow type to carry a timezone, because the *absence* of a timezone is itself 
meaningful.
   
   The motivating case is Apache DataFusion Comet (a Spark accelerator) trying 
to enforce SPARK-36182: Spark versions before 4 reject reading a Parquet 
TimestampLTZ column as TimestampNTZ. Comet's schema adapter pattern-matches 
`Timestamp(_, Some(_)) -> Timestamp(_, None)` to detect this case, but for 
INT96 columns the post-coerce type is `Timestamp(unit, None)`, which is 
indistinguishable from a true TimestampNTZ source. The LTZ signal is destroyed 
at the wrong layer, before the adapter can see it.
   
   Spark and other systems write INT96 as UTC-adjusted instants, so a caller 
can ask for the column to surface as `Timestamp(unit, Some("UTC"))`, 
preserving the LTZ semantic at the Arrow level.
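
The ambiguity described above can be sketched in a few lines. This is a minimal illustration, not Comet's actual adapter: `TimeUnit` and `DataType` are simplified stand-ins for the Arrow types (so the snippet compiles without the `arrow` crate), and `is_ltz_read_as_ntz` is a hypothetical name for the pattern match.

```rust
use std::sync::Arc;

/// Simplified stand-in for Arrow's `TimeUnit` (assumption: not the real arrow type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Simplified stand-in for the subset of Arrow's `DataType` relevant here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataType {
    Int64,
    Timestamp(TimeUnit, Option<Arc<str>>),
}

/// Hypothetical check in the spirit of the schema-adapter pattern:
/// the file column carries a timezone (LTZ) but the requested type does not (NTZ).
pub fn is_ltz_read_as_ntz(file_type: &DataType, requested: &DataType) -> bool {
    matches!(
        (file_type, requested),
        (DataType::Timestamp(_, Some(_)), DataType::Timestamp(_, None))
    )
}
```

With these stand-ins, an INT96 column coerced to `Timestamp(unit, None)` never triggers the check, while the same column surfaced as `Timestamp(unit, Some("UTC"))` does.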
   
   ## What changes are included in this PR?
   
   - New `TableParquetOptions.global.coerce_int96_tz: Option<String>` config 
field (defaults to `None`).
   - `coerce_int96_to_resolution` gains a `timezone: Option<Arc<str>>` 
parameter and threads it into the constructed `Timestamp` type.
   - The new option is plumbed through `ParquetSource` -> `ParquetOpener` / 
`ParquetMorselizer` -> `DFParquetMetadata`.
   - `with_coerce_int96_tz` builder method on `DFParquetMetadata`.
   - Default behavior is unchanged when the option is unset.
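
As a rough sketch of the shape of the change, assuming simplified stand-in types rather than the real DataFusion and Arrow ones: `PhysicalType`, `DataType`, `ReaderSettings`, and `coerce_int96_sketch` are hypothetical names used only for illustration, and the real function operates on Parquet and Arrow schemas rather than a flat list.

```rust
use std::sync::Arc;

/// Simplified stand-in for Arrow's `TimeUnit`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeUnit {
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical stand-in for a Parquet physical column type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Int96,
    Int64,
}

/// Simplified stand-in for the subset of Arrow's `DataType` used here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataType {
    Int64,
    Timestamp(TimeUnit, Option<Arc<str>>),
}

/// Illustrative only: threads an optional timezone into every INT96-derived type.
/// Passing `None` reproduces the historical `Timestamp(unit, None)` result.
pub fn coerce_int96_sketch(
    columns: &[PhysicalType],
    unit: TimeUnit,
    timezone: Option<Arc<str>>,
) -> Vec<DataType> {
    columns
        .iter()
        .map(|column| match column {
            PhysicalType::Int96 => DataType::Timestamp(unit, timezone.clone()),
            PhysicalType::Int64 => DataType::Int64,
        })
        .collect()
}

/// Hypothetical settings holder mirroring the `Option<String>` config field.
#[derive(Debug, Default)]
pub struct ReaderSettings {
    coerce_int96_tz: Option<String>,
}

impl ReaderSettings {
    /// Builder-style setter, named after the method described in the list above.
    pub fn with_coerce_int96_tz(mut self, tz: Option<String>) -> Self {
        self.coerce_int96_tz = tz;
        self
    }

    /// Converts the configured string into the `Option<Arc<str>>` form.
    pub fn timezone(&self) -> Option<Arc<str>> {
        self.coerce_int96_tz
            .as_deref()
            .map(|tz| Arc::<str>::from(tz))
    }
}
```

In this sketch, leaving the setting at its default yields `Timestamp(unit, None)` for INT96 columns, and setting it to `"UTC"` yields `Timestamp(unit, Some("UTC"))`; non-INT96 columns are unaffected either way.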
   
   ## Are these changes tested?
   
   The existing `coerce_int96_to_resolution_with_mixed_timestamps` and 
`coerce_int96_to_resolution_with_nested_types` tests are updated to pass 
`None` for the new parameter and continue to assert the historical default. 
I'll add a positive test for the `Some("UTC")` path on the next push; 
opening as draft to gather feedback on the API shape first.
   
   End-to-end validation: I have a Comet branch (apache/datafusion-comet#TBD) 
that depends on this fork via `[patch.crates-io]`, sets `coerce_int96_tz = 
"UTC"`, and confirms that `ParquetTimestampLtzAsNtzSuite` on Spark 3.5 now 
correctly rejects INT96 -> TimestampNTZ via Comet's existing schema-adapter 
pattern. `CometCastSuite` passes (no precision regression for extreme dates).
   
   ## Are there any user-facing changes?
   
   A new `coerce_int96_tz` config option. This is an addition rather than a 
modification, so there is no API break.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

