MaxGekk opened a new pull request #33387:
URL: https://github.com/apache/spark/pull/33387


   ### What changes were proposed in this pull request?
   In the PR, I propose to propagate the SQL config 
`spark.sql.parquet.datetimeRebaseModeInRead` and/or the Parquet option 
`datetimeRebaseMode` to `ParquetFilters`. The `ParquetFilters` class uses these 
settings when converting date/timestamp instances from datasource filters 
to values pushed via `FilterApi` to the `parquet-column` lib.
   
   Before the changes, date/timestamp values expressed as 
days/microseconds/milliseconds are interpreted as offsets in the Proleptic 
Gregorian calendar and pushed to the parquet library as is. That works fine if 
the date/timestamp values in parquet files were saved in the `CORRECTED` mode, but 
in the `LEGACY` mode, the filter values may not match the actual stored values.
   
   After the changes, date/timestamp filter values pushed down to the parquet 
libs, such as `FilterApi.eq(col1, -719162)`, are rebased according to the rebase 
settings. For example, if the rebase mode is `CORRECTED`, **-719162** is 
pushed down as is, but if the current rebase mode is `LEGACY`, the number of 
days is rebased to **-719164**. For more context, the PR description of 
https://github.com/apache/spark/pull/28067 shows the diffs between the two 
calendars.
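
   The two day numbers above can be reproduced with a small calendar sketch. 
This is not Spark's actual `RebaseDateTime` implementation (which handles the 
full hybrid Julian/Gregorian switch-over); it is a minimal model, assuming that 
for dates before 1582 the legacy hybrid calendar behaves as a pure Julian 
calendar:
   ```python
   from datetime import date

   # Days since 1970-01-01 in the proleptic Gregorian calendar
   EPOCH_ORDINAL = date(1970, 1, 1).toordinal()

   def gregorian_epoch_days(y, m, d):
       return date(y, m, d).toordinal() - EPOCH_ORDINAL

   # Cumulative days before each month in a non-leap year
   CUM_DAYS = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]

   def julian_fixed(y, m, d):
       # Rata-Die-style day number in the Julian calendar (0001-01-01 -> 1);
       # the Julian calendar has a leap year every 4 years, unconditionally
       leap_adj = 1 if (y % 4 == 0 and m > 2) else 0
       return 365 * (y - 1) + (y - 1) // 4 + CUM_DAYS[m - 1] + leap_adj + d

   # Gregorian 1970-01-01 falls on 1969-12-19 in the Julian calendar
   JULIAN_EPOCH_FIXED = julian_fixed(1969, 12, 19)

   def legacy_epoch_days(y, m, d):
       # Days since the epoch when y-m-d is read as a Julian-calendar date
       return julian_fixed(y, m, d) - JULIAN_EPOCH_FIXED

   print(gregorian_epoch_days(1, 1, 1))  # -719162
   print(legacy_epoch_days(1, 1, 1))     # -719164
   ```
   The two-day gap between the calendars for `0001-01-01` is exactly the 
difference between the pushed-down values in the two rebase modes.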
   
   ### Why are the changes needed?
   The changes fix the bug portrayed by the following example from SPARK-36034:
   ```python
   >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
   >>> spark.sql("SELECT DATE '0001-01-01' AS 
date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
   >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
'0001-01-01'").show()
   +----+
   |date|
   +----+
   +----+
   ```
   The result must have the date value `0001-01-01`.
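
   A toy model (not the actual `ParquetFilters` code; the constants are the day 
numbers quoted in the description above) shows why the row disappears and what 
the fix changes. The hypothetical `pushed_filter_days` stands in for the 
conversion of a filter value before it is handed to `FilterApi`:
   ```python
   # Day numbers for DATE '0001-01-01' from the description above:
   PROLEPTIC_GREGORIAN_DAYS = -719162  # value the filter `date = '0001-01-01'` produces
   LEGACY_WRITTEN_DAYS = -719164       # value a LEGACY-mode writer stored in the file

   def pushed_filter_days(filter_days, rebase_mode):
       # Hypothetical model of the fix: in LEGACY mode the filter value is
       # rebased to the hybrid calendar before being pushed down.
       # (-2 is the Gregorian-to-Julian day shift for this particular date.)
       return filter_days - 2 if rebase_mode == "LEGACY" else filter_days

   # Before the fix, the filter value was always pushed as is, so it never
   # matched the LEGACY-written value and the row was pruned:
   print(LEGACY_WRITTEN_DAYS == PROLEPTIC_GREGORIAN_DAYS)  # False
   # After the fix, the rebased filter value matches the stored value:
   print(LEGACY_WRITTEN_DAYS == pushed_filter_days(PROLEPTIC_GREGORIAN_DAYS, "LEGACY"))  # True
   ```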
   
   ### Does this PR introduce _any_ user-facing change?
   In some sense, yes. Query results can differ in some cases. For the 
example above, after the changes:
   ```scala
   scala> spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", 
"LEGACY")
   scala> spark.sql("SELECT DATE '0001-01-01' AS 
date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
   scala> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
'0001-01-01'").show(false)
   +----------+
   |date      |
   +----------+
   |0001-01-01|
   +----------+
   ```
   
   ### How was this patch tested?
   By running the modified test suite `ParquetFilterSuite`:
   ```
   $ build/sbt "test:testOnly *ParquetV1FilterSuite"
   $ build/sbt "test:testOnly *ParquetV2FilterSuite"
   ```
   
   Authored-by: Max Gekk <max.gekkgmail.com>
   Signed-off-by: Hyukjin Kwon <[email protected]>
   (cherry picked from commit b09b7f7cc024d3054debd7bdb51caec3b53764d7)
   (cherry picked from commit ba7117224c797e3ade64a281b1a165cf5040c541)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


