[GitHub] [spark] MaxGekk opened a new pull request #28479: [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns

GitBox Fri, 08 May 2020 05:18:08 -0700


MaxGekk opened a new pull request #28479:
URL: https://github.com/apache/spark/pull/28479



   ### What changes were proposed in this pull request?
   Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to 
handle `DateType` specially when the passed parameter `rebaseDateTime` is 
true. In that case, the decoded days are rebased from the hybrid calendar 
(Julian + Gregorian) to the Proleptic Gregorian calendar using 
`RebaseDateTime.rebaseJulianToGregorianDays()`.
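
   For illustration only, the rebase step can be sketched with plain JDK classes. 
This is not the code from the patch: the object `RebaseSketch`, the helper 
`hybridDays()`, and the simplified `decodeDateIds()` are hypothetical names, and 
the dictionary is modelled as a plain array of day values.

   ```scala
   import java.time.LocalDate
   import java.util.{Calendar, GregorianCalendar, TimeZone}

   // Hypothetical sketch, not Spark's implementation.
   object RebaseSketch {
     private val MillisPerDay = 24L * 60 * 60 * 1000
     private val Utc = TimeZone.getTimeZone("UTC")

     // Days since 1970-01-01 for a date whose fields are interpreted in the
     // hybrid calendar of java.util.GregorianCalendar (Julian before 1582-10-15).
     def hybridDays(year: Int, month: Int, day: Int): Int = {
       val cal = new GregorianCalendar(Utc)
       cal.clear()
       cal.set(year, month - 1, day)
       Math.toIntExact(Math.floorDiv(cal.getTimeInMillis, MillisPerDay))
     }

     // Read the local date fields in the hybrid calendar, then build the same
     // local date in the Proleptic Gregorian calendar (java.time.LocalDate).
     def rebaseJulianToGregorianDays(days: Int): Int = {
       val cal = new GregorianCalendar(Utc)
       cal.clear()
       cal.setTimeInMillis(days.toLong * MillisPerDay)
       val yearOfEra = cal.get(Calendar.YEAR)
       val year =
         if (cal.get(Calendar.ERA) == GregorianCalendar.BC) 1 - yearOfEra else yearOfEra
       // Start at day 1 and add days, so that Julian-only dates such as
       // 1000-02-29 roll over instead of throwing.
       val date = LocalDate.of(year, cal.get(Calendar.MONTH) + 1, 1)
         .plusDays(cal.get(Calendar.DAY_OF_MONTH) - 1L)
       Math.toIntExact(date.toEpochDay)
     }

     // Simplified dictionary decoding: every id indexes into the dictionary of
     // day values; rebasing is applied only when rebaseDateTime is true.
     def decodeDateIds(
         ids: Array[Int],
         dictionary: Array[Int],
         rebaseDateTime: Boolean): Array[Int] = {
       ids.map { id =>
         val days = dictionary(id)
         if (rebaseDateTime) rebaseJulianToGregorianDays(days) else days
       }
     }
   }
   ```

   In this sketch, the day count for the hybrid-calendar date 1001-01-01 reads as 
1001-01-07 when interpreted directly as Proleptic Gregorian days, and 
rebasing maps it back to 1001-01-01.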
   
   ### Why are the changes needed?
   This fixes a bug in loading dates before the cutover day (1582-10-15) from 
dictionary-encoded columns in Parquet files. The code below writes such a 
column, forcing dictionary encoding:
   ```scala
   spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
   Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
     .select($"dateS".cast("date").as("date")).repartition(1)
     .write
     .option("parquet.enable.dictionary", true)
     .parquet(path)
   ```
   Load the dates back:
   ```scala
   spark.read.parquet(path).show(false)
   +----------+
   |date      |
   +----------+
   |1001-01-07|
   ...
   |1001-01-07|
   +----------+
   ```
   The expected value is **1001-01-01**, not 1001-01-07.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. After the changes:
   ```scala
   spark.read.parquet(path).show(false)
   +----------+
   |date      |
   +----------+
   |1001-01-01|
   ...
   |1001-01-01|
   +----------+
   ```
   
   ### How was this patch tested?
   Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` 
to check reading of dictionary-encoded dates.


