[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982240#comment-14982240 ]
Russell Alexander Spitzer commented on SPARK-11415:
---------------------------------------------------

I think the underlying conflict here is that {{java.sql.Date}} has no timezone information, so it is up to the end user to properly align their {{Date}} object with UTC. That makes sense to me, because otherwise we end up in a very difficult situation where only {{Date}} objects that match the locale of the executor are aligned correctly. For example, if I try to create a {{Catalyst.DateType}} from a {{java.sql.Date}} that is in UTC while I'm running in PDT, the value will be aligned incorrectly (and will also be returned as an incorrect {{java.sql.Date}}, since the time since epoch will be wrong). This forces an outside source (or the user) to reformat their {{Date}}s depending on the location of the Spark cluster (or the configuration of its locale) if they don't want their dates to be corrupted.

In C* the driver has a {{LocalDate}} class to avoid this problem, so it is always clear when a specific (year, month, day) tuple should be aligned against UTC. It may also be helpful to allow a direct translation {{Int}} -> {{Catalyst.DateType}} -> {{Int}} for those sources that can provide days since epoch.

> Catalyst DateType Shifts Input Data by Local Timezone
> -----------------------------------------------------
>
>                 Key: SPARK-11415
>                 URL: https://issues.apache.org/jira/browse/SPARK-11415
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't
> get a consistent result for java.sql.Date. I investigated and noticed the
> following code is used to create Catalyst.DateTypes:
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
> /**
>  * Returns the number of days since epoch from java.sql.Date.
>  */
> def fromJavaDate(date: Date): SQLDate = {
>   millisToDays(date.getTime)
> }
> {code}
> But millisToDays does not abide by this contract: it shifts the underlying
> timestamp to the local timezone before calculating the days from epoch, which
> moves the actual date around.
> {code}
> // we should use the exact day as Int, for example, (year, month, day) -> day
> def millisToDays(millisUtc: Long): SQLDate = {
>   // SPARK-6785: use Math.floor so negative number of days (dates before 1970)
>   // will correctly work as input for function toJavaDate(Int)
>   val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> }
> {code}
> The inverse function also incorrectly shifts the timezone:
> {code}
> // reverse of millisToDays
> def daysToMillis(days: SQLDate): Long = {
>   val millisUtc = days.toLong * MILLIS_PER_DAY
>   millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
> }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one errors and could cause significant shifts in data
> if the underlying data is worked on in timezones other than UTC.
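
For reference, here is a minimal, self-contained sketch of the shift described above. It is not Spark source: the quoted {{millisToDays}} logic is inlined with an explicit {{TimeZone}} parameter (instead of the thread-local zone) so the locale dependence is easy to reproduce, and the object and names are hypothetical.

{code}
import java.sql.Date
import java.util.TimeZone

// Reproduction sketch (not Spark code): millisToDays is inlined with an
// explicit TimeZone so the locale-dependent shift is directly observable.
object DateShiftRepro {
  val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

  def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
    // Same logic as the quoted Spark code: shift into the given zone,
    // then floor-divide into days.
    val millisLocal = millisUtc + tz.getOffset(millisUtc)
    Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
  }

  def main(args: Array[String]): Unit = {
    // Day 16436 since epoch is 2015-01-01; build a Date at UTC midnight.
    val utcMidnight = new Date(16436L * MILLIS_PER_DAY)

    val utc = TimeZone.getTimeZone("UTC")
    val pst = TimeZone.getTimeZone("America/Los_Angeles")

    println(millisToDays(utcMidnight.getTime, utc)) // 16436 -> 2015-01-01
    println(millisToDays(utcMidnight.getTime, pst)) // 16435 -> 2014-12-31, one day early
  }
}
{code}

Run in a PST/PDT locale (or with the explicit zone as above), the same {{java.sql.Date}} lands on the previous day, which is exactly the corruption described in the report.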