[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982240#comment-14982240 ]

Russell Alexander Spitzer commented on SPARK-11415:
---------------------------------------------------

I think the underlying conflict here is that {{java.sql.Date}} has no timezone 
information, so it is up to the end user to properly align their {{Date}} 
object with UTC. This makes sense to me, because otherwise we end up in a very 
difficult situation where only {{Date}} objects that match the Locale of the 
executor are aligned correctly. For example, if I try to create a 
{{Catalyst.DateType}} from a {{java.sql.Date}} that is in UTC, but I'm running 
in PDT, the value will be aligned incorrectly (and will also be returned as an 
incorrect {{java.sql.Date}}, since the time since the epoch will be wrong). 
This forces an outside source (or the user) to reformat their {{Date}}s 
depending on the location of the Spark cluster (or the configuration of its 
locale) if they don't want their dates to be corrupted.
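The misalignment above is easy to reproduce. Below is a sketch that mirrors the {{millisToDays}} logic from {{DateTimeUtils}}, with the timezone made an explicit parameter (Spark uses the thread-local default zone; the {{tz}} parameter is my addition, purely for demonstration):

{code}
import java.util.TimeZone

val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

// Same arithmetic as Spark's millisToDays, but with an explicit zone
// instead of the thread-local default, so the shift is visible.
def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
  val millisLocal = millisUtc + tz.getOffset(millisUtc)
  Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
}

// 2015-10-31T00:00:00Z is exactly 16739 days after the epoch.
val utcMidnight = 16739L * MILLIS_PER_DAY

val daysUtc = millisToDays(utcMidnight, TimeZone.getTimeZone("UTC"))
val daysPdt = millisToDays(utcMidnight, TimeZone.getTimeZone("America/Los_Angeles"))

println(daysUtc) // 16739 -- the intended day
println(daysPdt) // 16738 -- shifted one day earlier under PDT (UTC-7)
{code}

A {{Date}} holding UTC midnight is pushed back to 17:00 the previous day before the floor division, so the stored day is off by one.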

In C*, the Driver has a {{LocalDate}} class to avoid this problem, so it is 
extremely clear when a specific {{(year, month, day)}} tuple should be aligned 
against UTC.
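On JVMs with Java 8+, {{java.time.LocalDate}} plays the same role as the C* driver's class: a pure (year, month, day) with no timezone attached, and a zone-independent mapping to days since the epoch. A minimal sketch:

{code}
import java.time.LocalDate

// No timezone anywhere: the same (year, month, day) always maps to the
// same epoch-day count, whatever the JVM's default zone is.
val d = LocalDate.of(2015, 10, 31)
println(d.toEpochDay) // 16739, regardless of the JVM's default timezone
{code}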

It may also be helpful to allow a direct {{Int}} -> {{Catalyst.DateType}} -> 
{{Int}} translation for sources that can provide days since the epoch 
directly.
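A sketch of why the direct {{Int}} path matters: routing an externally supplied days-since-epoch value through {{java.sql.Date}} millis and the current {{millisToDays}} logic is lossy whenever the JVM zone is not UTC. As before, {{millisToDays}} mirrors Spark's, with the zone made explicit for demonstration:

{code}
import java.util.TimeZone

val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
  val millisLocal = millisUtc + tz.getOffset(millisUtc)
  Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
}

val daysFromSource = 16739                           // e.g. from a C* LocalDate
val viaDateMillis  = daysFromSource * MILLIS_PER_DAY // what java.sql.Date would hold

val pdt = TimeZone.getTimeZone("America/Los_Angeles")
println(millisToDays(viaDateMillis, pdt)) // 16738 -- corrupted by the detour
println(daysFromSource)                   // 16739 -- a direct Int path keeps this
{code}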

> Catalyst DateType Shifts Input Data by Local Timezone
> -----------------------------------------------------
>
>                 Key: SPARK-11415
>                 URL: https://issues.apache.org/jira/browse/SPARK-11415
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>   /**
>    * Returns the number of days since the epoch from a java.sql.Date.
>    */
>   def fromJavaDate(date: Date): SQLDate = {
>     millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> day
>   def millisToDays(millisUtc: Long): SQLDate = {
>     // SPARK-6785: use Math.floor so negative number of days (dates before 1970)
>     // will correctly work as input for function toJavaDate(Int)
>     val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc)
>     Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
>     val millisUtc = days.toLong * MILLIS_PER_DAY
>     millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one errors and could cause significant shifts in data 
> if the underlying data is worked on in timezones other than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
