MaxGekk opened a new pull request #28479:
URL: https://github.com/apache/spark/pull/28479
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to
handle `DateType` specially when the passed parameter `rebaseDateTime` is
true. In that case, the decoded days are rebased from the hybrid
(Julian + Gregorian) calendar to the Proleptic Gregorian calendar using
`RebaseDateTime.rebaseJulianToGregorianDays()`.
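The rebasing step can be illustrated with a minimal, standalone sketch. It is not Spark's implementation: the object `RebaseSketch` and the method `rebaseJulianToGregorianDaysSketch` are hypothetical names, and the sketch assumes AD dates only. It reads a day count in the hybrid calendar via `java.util.GregorianCalendar`, then re-encodes the same year/month/day in the Proleptic Gregorian calendar via `java.time.LocalDate`.

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hypothetical illustration only; not the code from RebaseDateTime.
object RebaseSketch {
  private val MillisPerDay = 86400000L

  // `days` is a count of days since 1970-01-01 in the hybrid calendar
  // (Julian before 1582-10-15, Gregorian afterwards).
  def rebaseJulianToGregorianDaysSketch(days: Int): Int = {
    // java.util.GregorianCalendar is a hybrid calendar by default.
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.setTimeInMillis(days * MillisPerDay)
    require(cal.get(Calendar.ERA) == GregorianCalendar.AD,
      "this sketch handles AD dates only")
    // Start from day 1 and add the remaining days, so that a Julian-only
    // date such as 1000-02-29 does not make LocalDate.of throw.
    val localDate = LocalDate
      .of(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1, 1)
      .plusDays(cal.get(Calendar.DAY_OF_MONTH) - 1L)
    Math.toIntExact(localDate.toEpochDay)
  }
}
```

Under these assumptions, the day number that Proleptic Gregorian reads as 1001-01-07 is 1001-01-01 in the hybrid calendar, so the sketch maps it to the day number of Proleptic Gregorian 1001-01-01. Dates on or after the cutover are returned unchanged.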
### Why are the changes needed?
This fixes a bug in loading dates before the cutover day (1582-10-15) from
dictionary-encoded columns in Parquet files. The code below forces dictionary
encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```
The expected value is **1001-01-01**, not 1001-01-07.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```
### How was this patch tested?
Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite`
to check reading of dictionary-encoded dates.