Wenchen Fan commented on SPARK-23715:

is there a standard definition of from_utc_timestamp? How shall we treat the 
input timestamp? a UTC timestamp or a local timezone timestamp?

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --------------------------------------------------------------------------
>                 Key: SPARK-23715
>                 URL: https://issues.apache.org/jira/browse/SPARK-23715
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Bruce Robbins
>            Priority: Major
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 07:18:23|
> +-------------------+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 00:18:23|
> +-------------------+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 00:18:23|
> +-------------------+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> magic under the covers to make things work (mostly) as the user expects.
> As an example, let's say a user in Los Angeles issues the following:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> {noformat}
> FromUTCTimestamp gets as input a Timestamp (long) value representing
> {noformat}
> 2018-03-13T06:18:23-07:00 (long value 1520947103000000)
> {noformat}
> What FromUTCTimestamp needs instead is
> {noformat}
> 2018-03-13T06:18:23+00:00 (long value 1520921903000000)
> {noformat}
> So, it applies the local timezone's offset to the input timestamp to get the 
> correct value (1520947103000000 minus 7 hours is 1520921903000000). Then it 
> can process the value and produce the expected output.
> When the user explicitly specifies a time zone, FromUTCTimestamp's 
> assumptions break down. The input is no longer in the local time zone. 
> Because of the way input data is implicitly casted, FromUTCTimestamp never 
> knows whether the input data had an explicit timezone.
> Here are some gory details:
> There is sometimes a mismatch in expectations between the (string => 
> timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp 
> expression never sees the actual input string (the cast "intercepts" the 
> input and converts it to a long timestamp before FromUTCTimestamp uses the 
> value), FromUTCTimestamp cannot reject any input value that would exercise 
> this mismatch in expectations.
> There is a similar mismatch in expectations in the (integer => timestamp) 
> cast and FromUTCTimestamp. As a result, Unix time input almost always 
> produces incorrect output.
> h3. When things work as expected for String input:
> When from_utc_timestamp is passed a string time value with no time zone, 
> DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
> datetime string as though it's in the user's local time zone. Because 
> DateTimeUtils.stringToTimestamp is a general function, this is reasonable.
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local 
> time zone's offset. FromUTCTimestamp assumes this (or more accurately, a 
> utility function called by FromUTCTimestamp assumes this), so the first thing 
> it does is reverse-shift to get it back the correct value. Now that the long 
> value has been shifted back to the correct timestamp value, it can now 
> process it (by shifting it again based on the specified time zone).
> h3. When things go wrong with String input:
> When from_utc_timestamp is passed a string datetime value with an explicit 
> time zone, stringToTimestamp honors that timezone and ignores the local time 
> zone. stringToTimestamp does not shift the timestamp by the local timezone's 
> offset, but by the timezone specified on the datetime string.
> Unfortunately, FromUTCTimestamp, which has no insight into the actual input 
> or the conversion, still assumes the timestamp is shifted by the local time 
> zone. So it reverse-shifts the long value by the local time zone's offset, 
> which produces a incorrect timestamp (except in the case where the input 
> datetime string just happened to have an explicit timezone that matches the 
> local timezone). FromUTCTimestamp then uses this incorrect value for further 
> processing.
> h3. When things go wrong for Unix time input:
> The cast in this case simply multiplies the integer by 1000000. The cast does 
> not shift the resulting timestamp by the local time zone's offset.
> Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the 
> result is wrong.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to