MaxGekk opened a new pull request #30132:
URL: https://github.com/apache/spark/pull/30132
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` with
`org.apache.spark.legacyINT96`.
2. Change the condition under which the new key is saved to parquet metadata:
it is saved when the SQL config
`spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change how the metadata key is handled on read:
- If the key is absent from the parquet metadata, take the rebase mode from
the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
- If the parquet files were written by Spark < 3.1.0, use the `LEGACY` rebasing
mode for the INT96 type.
- For files written by Spark >= 3.1.0, perform rebasing if
`org.apache.spark.legacyINT96` is present in the metadata; otherwise don't.
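
The write and read rules above can be sketched as a pair of small functions. This is only an illustrative Python model of the decision logic, not the code in this PR. It makes several assumptions: the writer's version is stored under `org.apache.spark.version`, the first read rule is taken to mean that this version entry is missing from the file, the marker key is stored with an empty value, and the non-rebasing mode is named `CORRECTED`. The function names are hypothetical.

```python
import re

LEGACY_INT96_KEY = "org.apache.spark.legacyINT96"
# Assumption: metadata key under which the writer's Spark version is stored.
SPARK_VERSION_KEY = "org.apache.spark.version"


def write_metadata(int96_rebase_mode_in_write: str) -> dict:
    """Extra key/value metadata to store for a given write-side rebase mode."""
    # The marker is added only in LEGACY mode, so files written in other
    # modes carry no extra metadata. The empty value is an assumption.
    if int96_rebase_mode_in_write.upper() == "LEGACY":
        return {LEGACY_INT96_KEY: ""}
    return {}


def _version_tuple(version: str) -> tuple:
    # "3.1.0-SNAPSHOT" -> (3, 1, 0); anything after the third number is ignored.
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def int96_read_mode(file_meta: dict, mode_by_config: str) -> str:
    """Pick the INT96 rebase mode for a file from its metadata.

    mode_by_config stands for the value of
    spark.sql.legacy.parquet.int96RebaseModeInRead.
    """
    version = file_meta.get(SPARK_VERSION_KEY)
    if version is None:
        # No writer information in the file: fall back to the SQL config.
        return mode_by_config
    if _version_tuple(version) < (3, 1, 0) or LEGACY_INT96_KEY in file_meta:
        # Written by Spark < 3.1.0, or explicitly marked as legacy.
        return "LEGACY"
    # Written by Spark >= 3.1.0 without the marker: no rebasing.
    return "CORRECTED"
```

For example, under these assumptions a file whose metadata is `{"org.apache.spark.version": "3.1.0"}` is read without rebasing regardless of the read config, while a file with no version entry follows the config.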
### Why are the changes needed?
- To avoid increasing parquet file size by default when
`spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after
https://github.com/apache/spark/pull/30121.
- To keep the implementation similar to that of
`org.apache.spark.legacyDateTime`.
- To minimise the impact on other subsystems that depend on file sizes, such as
statistics gathering.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified test in `ParquetIOSuite`