[ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36034:
------------------------------------

    Assignee:     (was: Apache Spark)

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-36034
>                 URL: https://issues.apache.org/jira/browse/SPARK-36034
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Willi Raschkowski
>            Priority: Blocker
>              Labels: correctness
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = '0001-01-01'").show()
> +----------+
> |      date|
> +----------+
> |0001-01-01|
> +----------+
> {code}
> This is how we get incorrect results in _legacy_ mode; here the 
> filter drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
> +----+
> |date|
> +----+
> +----+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>    +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct<date:date>
> {code}
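> As a side note on the literal in the physical plan (this sketch is an editorial
> illustration, not part of the original report): -719162 is the day offset of
> 0001-01-01 from the Unix epoch in the proleptic Gregorian calendar, which
> Python's stdlib also uses:
> {code:title=Computing the day offset}
> from datetime import date
> 
> # Python's date type uses the proleptic Gregorian calendar, the same
> # calendar Spark 3 uses for its internal day-since-epoch representation.
> offset = (date(1, 1, 1) - date(1970, 1, 1)).days
> print(offset)  # -719162
> {code}
> This is consistent with the pushed filter EqualTo(date,0001-01-01) being
> compared against the file's raw (legacy, hybrid-calendar) day values without
> rebasing, which would explain the dropped rows. If that is the cause, then as
> an untested mitigation, disabling spark.sql.parquet.filterPushdown might avoid
> the incorrect pushed-down comparison at the cost of scanning more data.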



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
