Maxim Gekk created SPARK-31159:
----------------------------------
Summary: Incompatible Parquet dates/timestamps with Spark 2.4
Key: SPARK-31159
URL: https://issues.apache.org/jira/browse/SPARK-31159
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk
Write dates/timestamps to a Parquet file in Spark 2.4:
{code}
$ export TZ="UTC"
$ ~/spark-2.4/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
+----------+--------------------------+
|d |ts |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
{code}
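Spark 2.4 saves dates/timestamps in the hybrid Julian + Gregorian calendar, so
dates before the 1582 Gregorian cutover are stored as Julian dates. The
parquet-mr tool prints the stored values as *1001-01-07* and
*1001-01-07T01:02:03.123456+0000*, six days later than the values written: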
{code}
$ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar \
    dump -m ./2_4_5_micros/part-00000-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
INT32 d
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:1001-01-07
INT64 ts
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+0000
{code}
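The six-day shift can be reproduced with plain JDK classes, without Spark or
Parquet. The following is only an illustrative sketch, not the code path Spark
uses: it assumes that java.util.GregorianCalendar stands in for the hybrid
calendar of Spark 2.4 and that java.time.LocalDate stands in for the Proleptic
Gregorian calendar. The names hybridDays, gregorianDays and reinterpreted are
made up for this example.
{code}
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hybrid Julian + Gregorian calendar: dates before the 1582-10-15 cutover
// are interpreted as Julian dates.
val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
hybrid.clear()
hybrid.set(1001, Calendar.JANUARY, 1)

// Days since 1970-01-01 for the label "1001-01-01" in the hybrid calendar.
val hybridDays = java.lang.Math.floorDiv(hybrid.getTimeInMillis, 86400000L)

// Days since 1970-01-01 for the same label in the Proleptic Gregorian calendar.
val gregorianDays = LocalDate.of(1001, 1, 1).toEpochDay

// The hybrid day count read back as a Proleptic Gregorian date.
val reinterpreted = LocalDate.ofEpochDay(hybridDays)

println(s"difference in days: ${hybridDays - gregorianDays}")
println(s"reinterpreted date: $reinterpreted")
{code}
In this sketch the two day counts differ by 6, and the hybrid day count read
back in the Proleptic Gregorian calendar gives 1001-01-07, the same date that
parquet-mr prints above.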
Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) prints the same values as parquet-mr,
which differ from the values Spark 2.4 shows for the same files:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
+----------+--------------------------+
|d |ts |
+----------+--------------------------+
|1001-01-07|1001-01-07 01:02:03.123456|
+----------+--------------------------+
{code}