Maxim Gekk created SPARK-31662:
----------------------------------
Summary: Reading wrong dates from dictionary encoded columns in Parquet files
Key: SPARK-31662
URL: https://issues.apache.org/jira/browse/SPARK-31662
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Maxim Gekk
Write dates to Parquet files with dictionary encoding enabled:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
scala> :paste
// Entering paste mode (ctrl-D to finish)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/Users/maximgekk/tmp/parquet-date-dict")
// Exiting paste mode, now interpreting.
{code}
Read them back:
{code:scala}
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
+----------+
{code}
*The expected value in every row is 1001-01-01, the date that was written.*
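
The 6-day gap between the written date (1001-01-01) and the value read back (1001-01-07) is consistent with the offset between the hybrid Julian/Gregorian calendar and the proleptic Gregorian calendar in that year. The snippet below is a minimal sketch of that arithmetic, not Spark's code. It assumes the file stores day counts in the hybrid calendar (as the rebase-on-write setting suggests) and that the reader interprets them as proleptic Gregorian days without rebasing. The helper name `readWithoutRebase` is hypothetical.

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hypothetical helper: take a local date in the hybrid Julian/Gregorian
// calendar, turn it into days since 1970-01-01, then reinterpret that day
// count in the proleptic Gregorian calendar with no rebasing.
def readWithoutRebase(year: Int, month: Int, day: Int): LocalDate = {
  // java.util.GregorianCalendar is a hybrid calendar: Julian before the
  // default cutover of 1582-10-15, Gregorian afterwards.
  val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
  hybrid.clear()
  hybrid.set(year, month - 1, day) // Calendar months are 0-based

  val millisPerDay = 24L * 60 * 60 * 1000
  // floorDiv keeps the division correct for instants before the epoch.
  val epochDay = Math.floorDiv(hybrid.getTimeInMillis, millisPerDay)

  // java.time.LocalDate uses the proleptic Gregorian (ISO) calendar.
  LocalDate.ofEpochDay(epochDay)
}

println(readWithoutRebase(1001, 1, 1)) // prints 1001-01-07
println(readWithoutRebase(2020, 1, 1)) // prints 2020-01-01 (no shift after the cutover)
```

If this assumption holds, the output above suggests that the dictionary-decoding read path returns the stored day counts without the rebase that other read paths apply.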