Maxim Gekk created SPARK-31662:
----------------------------------

             Summary: Reading wrong dates from dictionary encoded columns in 
Parquet files
                 Key: SPARK-31662
                 URL: https://issues.apache.org/jira/browse/SPARK-31662
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0, 3.1.0
            Reporter: Maxim Gekk


Write dates to Parquet files with dictionary encoding enabled:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)

scala> :paste
// Entering paste mode (ctrl-D to finish)

          Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
            .select($"dateS".cast("date").as("date"))
            .repartition(1)
            .write
            .option("parquet.enable.dictionary", true)
            .mode("overwrite")
            .parquet("/Users/maximgekk/tmp/parquet-date-dict")

// Exiting paste mode, now interpreting.
{code}

Read them back:
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
+----------+
{code}

*Expected values must be 1001-01-01, the dates that were originally written.*
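
For reference, a sketch of the correct result the same read should produce (not actual output from this build):
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
|1001-01-01|
|1001-01-01|
|1001-01-01|
|1001-01-01|
|1001-01-01|
|1001-01-01|
|1001-01-01|
+----------+
{code}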


