[
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082228#comment-17082228
]
Wenchen Fan edited comment on SPARK-31423 at 4/13/20, 10:54 AM:
----------------------------------------------------------------
FYI this is the behavior of Spark 2.4:
{code}
scala> val df = sql("select cast('1582-10-14' as DATE) dt")
df: org.apache.spark.sql.DataFrame = [dt: date]
scala> df.show
+----------+
| dt|
+----------+
|1582-10-24|
+----------+
scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
scala> spark.read.orc("/tmp/funny_orc_date").show
+----------+
| dt|
+----------+
|1582-10-24|
+----------+
{code}
The result is wrong from the very beginning: the value is already shifted to 1582-10-24 at cast time, before ORC is involved.
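The shift at cast time matches the JVM's legacy hybrid Julian/Gregorian calendar, which Spark 2.4's date handling is built on. A minimal Java-only sketch (no Spark needed): in its default lenient mode, {{java.util.GregorianCalendar}} silently normalizes the non-existent 1582-10-14 forward by 10 days.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;

public class LenientCutover {
    public static void main(String[] args) {
        // GregorianCalendar is a hybrid Julian/Gregorian calendar whose cutover
        // is 1582-10-15: the days 1582-10-05 through 1582-10-14 do not exist.
        // In lenient mode (the default) the non-existent 1582-10-14 quietly
        // normalizes forward by 10 days, matching the Spark 2.4 cast above.
        GregorianCalendar hybrid = new GregorianCalendar(1582, Calendar.OCTOBER, 14);
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(fmt.format(hybrid.getTime())); // prints 1582-10-24
    }
}
```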
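For the round-trip results quoted below, the root of the disagreement is that the JVM's two calendars render the same day count differently before the 1582-10-15 cutover. This is not Spark's actual ORC code path (which direction the 10-day shift goes depends on where rebasing happens), just a minimal illustration: Spark 3.x models a DATE as days since 1970-01-01 in the proleptic Gregorian calendar ({{java.time}}), while the legacy classes ({{java.util.Date}}) use the hybrid calendar.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.util.Date;
import java.util.TimeZone;

public class CalendarMismatch {
    public static void main(String[] args) {
        // The proleptic Gregorian calendar accepts 1582-10-14 as a real date
        // and represents it as a day count since 1970-01-01.
        LocalDate proleptic = LocalDate.of(1582, 10, 14);
        long epochDay = proleptic.toEpochDay();

        // Rendering the very same day count with the legacy hybrid calendar
        // (in UTC, to avoid time-zone noise) yields a date 10 days apart,
        // because the two calendars disagree before the 1582-10-15 cutover.
        SimpleDateFormat hybridFmt = new SimpleDateFormat("yyyy-MM-dd");
        hybridFmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        String hybrid = hybridFmt.format(new Date(epochDay * 86400000L));

        System.out.println(proleptic + " vs " + hybrid); // 1582-10-14 vs 1582-10-04
    }
}
```

A writer and reader that agree on the calendar (as Parquet and Avro do here) round-trip the day count unchanged; a mismatch on either side of the file boundary surfaces as exactly this kind of 10-day shift.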
> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> ------------------------------------------------------------------------------
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Bruce Robbins
> Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +----------+
> | dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +----------+
> | dt|
> +----------+
> |1582-10-24|
> +----------+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +-------------------+
> | ts|
> +-------------------+
> |1582-10-14 00:00:00|
> +-------------------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off by 10 days
> +-------------------+
> |ts |
> +-------------------+
> |1582-10-24 00:00:00|
> +-------------------+
> scala>
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects original value
> +----------+
> | dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // reflects original value
> +----------+
> | dt|
> +----------+
> |1582-10-14|
> +----------+
> scala>
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x
> works with DATEs and TIMESTAMPs in general when
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4,
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done in Spark 2.4
> +----------+
> | dt|
> +----------+
> |1582-10-24|
> +----------+
> scala>
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and
> reality)[Note 2] that this drift had already reached, the date was advanced
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and
> probably based on spark.sql.legacy.timeParserPolicy (or some other config)
> rather than file format.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)