MaxGekk opened a new pull request #35042:
URL: https://github.com/apache/spark/pull/35042


   ### What changes were proposed in this pull request?
   In the PR, I propose to add a new metadata key `org.apache.spark.timeZone` which Spark writes to Parquet/Avro metadata while performing datetime rebasing in the `LEGACY` mode (see the SQL configs:
   - `spark.sql.parquet.datetimeRebaseModeInWrite`,
   - `spark.sql.parquet.int96RebaseModeInWrite` and
   - `spark.sql.avro.datetimeRebaseModeInWrite`).
   
   The writers use the current session time zone (see the SQL config `spark.sql.session.timeZone`) when rebasing Parquet/Avro timestamp columns.
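
   For illustration, a writer session that produces the new metadata could look like the sketch below (a minimal sketch, assuming a running `spark` session; the path and data are hypothetical):
   ```scala
   import java.sql.Timestamp
   import spark.implicits._

   // Rebase timestamps from the Proleptic Gregorian calendar to the legacy
   // hybrid (Julian + Gregorian) calendar on write, using an explicit
   // session time zone instead of the JVM default.
   spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
   spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")

   // The timestamp is before 1582, so it is actually rebased on write.
   Seq(Timestamp.valueOf("1000-01-01 01:02:03"))
     .toDF("ts")
     .write
     .parquet("/tmp/legacy_ts") // hypothetical output path
   ```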
   
   On the reader side, Spark tries to get the writer's time zone from the new metadata property:
   ```
   $ java -jar ~/parquet-tools-1.12.0.jar meta ./part-00000-b0d90bf0-ce60-4b4f-b453-b33f61ab2b2a-c000.snappy.parquet
   ...
   extra:       org.apache.spark.timeZone = America/Los_Angeles
   extra:       org.apache.spark.legacyDateTime =
   ```
   and uses it to rebase timestamps to the Proleptic Gregorian calendar. If the reader cannot retrieve the original time zone from the Parquet/Avro metadata, it falls back to the default JVM time zone for backward compatibility.
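
   A minimal reader-side sketch of that fallback (the path is hypothetical):
   ```scala
   // If the file carries org.apache.spark.timeZone, that zone drives the
   // rebase; otherwise Spark falls back to the default JVM time zone.
   spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "LEGACY")
   spark.read.parquet("/tmp/legacy_ts").show(false)
   ```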
   
   ### Why are the changes needed?
   Before the changes, Spark assumed that a writer used the default JVM time zone while rebasing dates/timestamps. If the reader and the writer had different JVM time zone settings, the reader could not load such columns correctly in the `LEGACY` mode. After the changes, the reader has full info about the writer's settings.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. After the changes, Parquet/Avro writers use the session time zone for timestamp rebasing in the `LEGACY` mode instead of the default JVM time zone. Note that the session time zone is set to the JVM time zone by default.
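
   For reference, a quick check of that default (a sketch, assuming a running `spark` session):
   ```scala
   import java.util.TimeZone

   // Unless overridden, spark.sql.session.timeZone equals the JVM default,
   // so jobs that never set it keep the previous behavior.
   assert(spark.conf.get("spark.sql.session.timeZone") == TimeZone.getDefault.getID)
   ```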
   
   ### How was this patch tested?
   1. By running new tests:
   ```
   $ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
   $ build/sbt "test:testOnly *ParquetRebaseDatetimeV2Suite"
   $ build/sbt "test:testOnly *AvroV1Suite"
   $ build/sbt "test:testOnly *AvroV2Suite"
   ```
   2. By running related existing test suites:
   ```
   $ build/sbt "test:testOnly *DateTimeUtilsSuite"
   $ build/sbt "test:testOnly *RebaseDateTimeSuite"
   $ build/sbt "test:testOnly *TimestampFormatterSuite"
   $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroCatalystDataConversionSuite"
   $ build/sbt "test:testOnly *AvroRowReaderSuite"
   $ build/sbt "test:testOnly *AvroSerdeSuite"
   $ build/sbt "test:testOnly *ParquetVectorizedSuite"
   ```
   
   3. Also modified the test `SPARK-31159: rebasing timestamps in write` to check loading timestamps in the `LEGACY` mode when the session time zone and the JVM time zone are different; see the sketch below.
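
   The check follows roughly this pattern (a sketch, assuming Spark's test helpers `withDefaultTimeZone`, `withSQLConf` and `checkAnswer`; the time zones, path and value are illustrative):
   ```scala
   import java.time.ZoneId

   // The JVM time zone deliberately differs from the session time zone.
   withDefaultTimeZone(ZoneId.of("UTC")) {
     withSQLConf(
         "spark.sql.session.timeZone" -> "America/Los_Angeles",
         "spark.sql.parquet.datetimeRebaseModeInRead" -> "LEGACY") {
       checkAnswer(
         spark.read.parquet("/tmp/legacy_ts"),
         Row(java.sql.Timestamp.valueOf("1000-01-01 01:02:03")))
     }
   }
   ```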
   
   4. Generated parquet files with Spark 3.2.0 (the commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31806: generate test files for checking compatibility with Spark 2.4"`. The parquet files don't contain info about the original time zone:
   ```
   $ java -jar ~/Downloads/parquet-tools-1.12.0.jar meta sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet
   file:        file:/Users/maximgekk/proj/parquet-rebase-save-tz/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet
   creator:     parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d)
   extra:       org.apache.spark.version = 3.2.0
   extra:       org.apache.spark.legacyINT96 =
   extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"dict","type":"timestamp","nullable":true,"metadata":{}},{"name":"plain","type":"timestamp","nullable":true,"metadata":{}}]}
   extra:       org.apache.spark.legacyDateTime =
   
   file schema: spark_schema
   
--------------------------------------------------------------------------------
   dict:        OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1
   plain:       OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1
   ```
   By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, checked loading of mixed parquet files generated by Spark 2.4.5/2.4.6 and 3.2.0/master.
   
   5. Generated avro files with Spark 3.2.0 (the commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31855: generate test files for checking compatibility with Spark 2.4"`. The avro files don't contain info about the original time zone:
   ```
   $ java -jar ~/Downloads/avro-tools-1.9.2.jar getmeta external/avro/src/test/resources/before_1582_timestamp_micros_v3_2_0.avro
   avro.schema  {"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]}]}
   org.apache.spark.version     3.2.0
   avro.codec   snappy
   org.apache.spark.legacyDateTime
   ```
   By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, checked loading of mixed avro files generated by Spark 2.4.5/2.4.6 and 3.2.0/master.
   
   Authored-by: Max Gekk <[email protected]>
   Signed-off-by: Wenchen Fan <[email protected]>
   (cherry picked from commit ef3a47038606ea426c15844b0400f5141acd5108)
   Signed-off-by: Max Gekk <[email protected]>

