[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Gekk updated SPARK-36034:
-----------------------------
    Fix Version/s: 3.3.0
                   3.2.0

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-36034
>                 URL: https://issues.apache.org/jira/browse/SPARK-36034
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Willi Raschkowski
>            Assignee: Max Gekk
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 3.2.0, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2, or by Spark 3 with legacy rebase mode.
>
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = '0001-01-01'").show()
> +----------+
> |      date|
> +----------+
> |0001-01-01|
> +----------+
> {code}
>
> This is how we get incorrect results in _legacy_ mode; in this case the filter is dropping rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", "date = '0001-01-01'").show()
> +----------+-------------------+
> |      date|(date = 0001-01-01)|
> +----------+-------------------+
> |0001-01-01|               true|
> +----------+-------------------+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
> +----+
> |date|
> +----+
> +----+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>    +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct<date:date>
> {code}
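A minimal sketch of one way to narrow this down on an affected version (this check is an assumption, not part of the original report): the explain output shows the predicate in PushedFilters, so disabling spark.sql.parquet.filterPushdown should force the comparison to run only after the stored values have been read and rebased. If the pushed-down filter is indeed the culprit, the 0001-01-01 row should then come back. The path date_written_by_spark3_legacy is reused from the reproduction above.

{code:title=Possible check (assumes the pushed-down filter is at fault)}
>>> # Re-read the legacy-written file with Parquet filter pushdown disabled,
>>> # so the date predicate is evaluated in Spark after rebasing rather than
>>> # pushed into the Parquet scan.
>>> spark.conf.set("spark.sql.parquet.filterPushdown", "false")
>>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
>>> # Expected, if pushdown is the problem: the 0001-01-01 row is returned.
>>> spark.conf.set("spark.sql.parquet.filterPushdown", "true")  # restore the default
{code}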