[
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982001#comment-14982001
]
Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 9:44 AM:
-----------------------------------------------------------------------------
I've been thinking about this for a while, and I think the underlying issue is
that the conversion before storing as an Int leads to a lot of strange
behaviors. If we are going to have the Date type represent days from epoch we
should most likely throw out all information outside of the granularity.
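One possible reading of "throw out all information outside of the granularity" is a conversion that works purely in UTC and never consults the local time zone. The following is only a sketch under that assumption, not the fix adopted by Spark; the names `UtcDaysSketch`, `millisToDaysUtc`, and `daysToMillisUtc` are hypothetical, and `Math.floorDiv` assumes Java 8 or later. Whether this is the right semantics for a java.sql.Date created at local midnight is a separate question.

```scala
object UtcDaysSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Hypothetical helper: whole days since epoch, computed purely in UTC.
  // floorDiv rounds toward negative infinity, so pre-1970 instants work too.
  def millisToDaysUtc(millisUtc: Long): Int =
    Math.floorDiv(millisUtc, MillisPerDay).toInt

  // Hypothetical inverse: UTC midnight of the given day, no zone offset applied.
  def daysToMillisUtc(days: Int): Long = days.toLong * MillisPerDay

  def main(args: Array[String]): Unit = {
    println(millisToDaysUtc(0L))  // 0
    println(millisToDaysUtc(-1L)) // -1
    println(daysToMillisUtc(millisToDaysUtc(MillisPerDay))) // 86400000
  }
}
```

With this shape the result depends only on the input value, not on the JVM's default time zone.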
Adding a test of
{code} checkFromToJavaDate(new Date(0)){code}
shows the trouble with trying to take the more granular information into account.
The date is shifted to some hours before the epoch by the time zone adjustment (if
you live in America) and is then rounded down to -1. This means it fails the check
because
{code}[info] "19[69-12-3]1" did not equal "19[70-01-0]1"
(DateTimeUtilsSuite.scala:68){code}
This is my basic problem with integration: transforming a Date to and from a
Catalyst Date is only a lossless round trip if the value was created in the
local time zone. For values created in any other time zone, applying the local
time zone offset can cause the "floor" call to essentially move the entire date
back a day.
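The effect can be reproduced outside Spark with a minimal sketch. It copies the arithmetic of millisToDays from the issue description but takes the time zone as an explicit parameter instead of reading a thread-local; the object name `DateShiftSketch` and that extra parameter are assumptions made for illustration only.

```scala
import java.util.TimeZone

object DateShiftSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Same arithmetic as millisToDays in the issue, with the zone passed explicitly.
  def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
    val millisLocal = millisUtc + tz.getOffset(millisUtc)
    Math.floor(millisLocal.toDouble / MillisPerDay).toInt
  }

  def main(args: Array[String]): Unit = {
    val utc = TimeZone.getTimeZone("UTC")
    val la = TimeZone.getTimeZone("America/Los_Angeles")
    println(millisToDays(0L, utc)) // 0
    println(millisToDays(0L, la))  // -1: the 8-hour offset pushes the value below zero
  }
}
```

The same input instant maps to day 0 in UTC and day -1 in Los Angeles, which matches the failed check above.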
> Catalyst DateType Shifts Input Data by Local Timezone
> -----------------------------------------------------
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0, 1.5.1
> Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't
> get a consistent result for java.sql.Date. I investigated and noticed the
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
> /**
>  * Returns the number of days since epoch from java.sql.Date.
>  */
> def fromJavaDate(date: Date): SQLDate = {
>   millisToDays(date.getTime)
> }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying
> timestamp to the local timezone before calculating the days from epoch. This
> causes the invocation to move the actual date around.
> {code}
> // we should use the exact day as Int, for example, (year, month, day) -> day
> def millisToDays(millisUtc: Long): SQLDate = {
>   // SPARK-6785: use Math.floor so negative number of days (dates before 1970)
>   // will correctly work as input for function toJavaDate(Int)
>   val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> }
> {code}
> The inverse function also incorrectly shifts by the time zone offset:
> {code}
> // reverse of millisToDays
> def daysToMillis(days: SQLDate): Long = {
>   val millisUtc = days.toLong * MILLIS_PER_DAY
>   millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
> }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one errors and could cause significant shifts in data if
> the underlying data is worked on in time zones other than UTC.
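To make the shift concrete, here is a small sketch that runs both quoted functions under two different time zones. It mirrors their arithmetic but passes the zone explicitly rather than through the thread-local; the object name `CrossZoneSketch` and the extra `tz` parameter are assumptions for illustration.

```scala
import java.util.TimeZone

object CrossZoneSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Mirrors millisToDays from the issue, zone passed explicitly.
  def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
    val millisLocal = millisUtc + tz.getOffset(millisUtc)
    Math.floor(millisLocal.toDouble / MillisPerDay).toInt
  }

  // Mirrors daysToMillis from the issue, zone passed explicitly.
  def daysToMillis(days: Int, tz: TimeZone): Long = {
    val millisUtc = days.toLong * MillisPerDay
    millisUtc - tz.getOffset(millisUtc)
  }

  def main(args: Array[String]): Unit = {
    val tokyo = TimeZone.getTimeZone("Asia/Tokyo")
    val la = TimeZone.getTimeZone("America/Los_Angeles")
    // Round-trip the epoch instant (1970-01-01T00:00Z) in each zone.
    println(daysToMillis(millisToDays(0L, tokyo), tokyo)) // -32400000 (9 hours earlier)
    println(daysToMillis(millisToDays(0L, la), la))       // -57600000 (16 hours earlier)
  }
}
```

The input is identical in both runs, yet the day number and the instant that comes back depend on which zone the JVM is in.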