MaxGekk opened a new pull request #35042: URL: https://github.com/apache/spark/pull/35042
### What changes were proposed in this pull request?
In the PR, I propose to add a new metadata key `org.apache.spark.timeZone` which Spark writes to Parquet/Avro metadata while rebasing datetimes in the `LEGACY` mode, see the SQL configs:
- `spark.sql.parquet.datetimeRebaseModeInWrite`,
- `spark.sql.parquet.int96RebaseModeInWrite`,
- `spark.sql.avro.datetimeRebaseModeInWrite`.

The writers use the current session time zone (see the SQL config `spark.sql.session.timeZone`) when rebasing Parquet/Avro timestamp columns. On the reader side, Spark tries to get info about the writer's time zone from the new metadata property:
```
$ java -jar ~/parquet-tools-1.12.0.jar meta ./part-00000-b0d90bf0-ce60-4b4f-b453-b33f61ab2b2a-c000.snappy.parquet
...
extra:  org.apache.spark.timeZone = America/Los_Angeles
extra:  org.apache.spark.legacyDateTime =
```
and uses it to rebase timestamps to the Proleptic Gregorian calendar. When the reader cannot retrieve the original time zone from the Parquet/Avro metadata, it falls back to the default JVM time zone for backward compatibility (see the sketches below).

### Why are the changes needed?
Before the changes, Spark assumes that the writer used the default JVM time zone while rebasing dates/timestamps. If the reader and the writer have different JVM time zone settings, the reader cannot load such columns correctly in the `LEGACY` mode. After the changes, the reader has full info about the writer's settings.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, Parquet/Avro writers use the session time zone for timestamp rebasing in the `LEGACY` mode instead of the default JVM time zone. Note that the session time zone is set to the JVM time zone by default.
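To make the new behavior concrete, here is a minimal `spark-shell` sketch (the output path and sample value are made up) that writes a pre-Gregorian timestamp in the `LEGACY` rebase mode with an explicit session time zone:
```scala
import spark.implicits._

// Rebase timestamps using the session time zone instead of the JVM default.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")

// A timestamp before the Gregorian cutover (1582-10-15) triggers rebasing.
Seq(java.sql.Timestamp.valueOf("1000-01-01 01:02:03"))
  .toDF("ts")
  .write
  .parquet("/tmp/legacy_ts")

// After this change, the written file carries
// `extra: org.apache.spark.timeZone = America/Los_Angeles` in its metadata,
// so a reader running in a different JVM time zone can rebase correctly.
```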
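And to illustrate the reader-side fallback described above, a hypothetical sketch of the decision logic (the `writerZoneId` helper and `fileMetadata` parameter are illustrative, not Spark's internal API):
```scala
import java.time.ZoneId

// Hypothetical helper: `fileMetadata` stands for the key/value pairs read
// from the Parquet footer or the Avro file header.
def writerZoneId(fileMetadata: Map[String, String]): ZoneId =
  fileMetadata.get("org.apache.spark.timeZone")
    .map(ZoneId.of)                     // time zone saved by a post-change writer
    .getOrElse(ZoneId.systemDefault())  // backward compatibility with old files
```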
### How was this patch tested?
1. By running new tests:
```
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV2Suite"
$ build/sbt "test:testOnly *AvroV1Suite"
$ build/sbt "test:testOnly *AvroV2Suite"
```
2. And related existing test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "test:testOnly *RebaseDateTimeSuite"
$ build/sbt "test:testOnly *TimestampFormatterSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroCatalystDataConversionSuite"
$ build/sbt "test:testOnly *AvroRowReaderSuite"
$ build/sbt "test:testOnly *AvroSerdeSuite"
$ build/sbt "test:testOnly *ParquetVectorizedSuite"
```
3. Also modified the test `SPARK-31159: rebasing timestamps in write` to check loading timestamps in the `LEGACY` mode when the session time zone and the JVM time zone are different.
4. Generated parquet files with Spark 3.2.0 (the commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31806: generate test files for checking compatibility with Spark 2.4"`. The parquet files don't contain info about the original time zone:
```
$ java -jar ~/Downloads/parquet-tools-1.12.0.jar meta sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet
file:        file:/Users/maximgekk/proj/parquet-rebase-save-tz/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet
creator:     parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d)
extra:       org.apache.spark.version = 3.2.0
extra:       org.apache.spark.legacyINT96 =
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"dict","type":"timestamp","nullable":true,"metadata":{}},{"name":"plain","type":"timestamp","nullable":true,"metadata":{}}]}
extra:       org.apache.spark.legacyDateTime =

file schema: spark_schema
--------------------------------------------------------------------------------
dict:        OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1
plain:       OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1
```
By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, check loading of mixed parquet files generated by Spark 2.4.5/2.4.6 and 3.2.0/master.
5. Generated avro files with Spark 3.2.0 (the commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31855: generate test files for checking compatibility with Spark 2.4"`. The avro files don't contain info about the original time zone:
```
$ java -jar ~/Downloads/avro-tools-1.9.2.jar getmeta external/avro/src/test/resources/before_1582_timestamp_micros_v3_2_0.avro
avro.schema                      {"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]}]}
org.apache.spark.version         3.2.0
avro.codec                       snappy
org.apache.spark.legacyDateTime
```
By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, check loading of mixed avro files generated by Spark 2.4.5/2.4.6 and 3.2.0/master.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ef3a47038606ea426c15844b0400f5141acd5108)
Signed-off-by: Max Gekk <[email protected]>
