[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16452790#comment-16452790
 ] 

Bruce Robbins commented on SPARK-23715:
---------------------------------------

[~cloud_fan]

I'll give separate answers for String input and long input.

1) String Input (e.g., "2018-04-24 15:24:00")

Across Impala, Hive, and Spark, the input to from_utc_timestamp is a UTC 
datetime value. They each work as expected when the user passes a UTC datetime 
value (with one exception for Spark, described below).

Both Hive and Impala treat the string input as a UTC datetime value. That is, 
if you enter '2018-04-24 15:24:00', from_utc_timestamp treats it as 
'2018-04-24T15:00:00+00:00' (or unix time 1524582000).

Spark's from_utc_timezone function also treats it as a UTC datetime value. 
Internally, it treats it as a local time (for the above example, it initially 
treats it as "2018-04-24T15:00:00-07:00"), but then corrects itself later in 
the code path. But from the user's perspective, Spark's from_utc_timestamp 
treats the input string as a UTC datetime value as long as the user sticks with 
the formats "yyyy-mm-dd HH:mm" and "yyyy-mm-dd'T'HH:mm".

Once the user ventures away from those formats, bad things happen. If the user 
specifies a UTC datetime value with a timezone (e.g., 
"2018-04-24T15:24:00+00:00"), Spark's from_utc_timezone returns the wrong 
result. Also, the user can specify a datetime value with a non-UTC timezone, 
which makes no sense. These two issues are the subject of this Jira.

In Hive and Impala, the user can only enter a datetime value in the format 
"yyy-MM-dd HH:mm". With Hive and Impala, the user cannot enter a timezone 
specification.

Besides the Spark bug described above, from_utc_timezone works the same between 
Impala, Hive, and Spark.

Impala has a -use_local_tz_for_unix_timestamp_conversions setting which may (or 
may not) change the behavior if its from_utc_timestamp function. I have not 
tested it.

IBM's DB2 also has a from_utc_timestamp function. I have not tested it, but the 
documentation says the input datetime value should be 'An expression that 
specifies the timestamp that is in the Coordinated Universal Time time zone', 
which is consistent with HIve, Impala, and Spark.

2) Integer Input (e.g., 1524582000)

Here Impala diverges from Spark and Hive.

Impala treats the input as the number of seconds since the epoch 1970-01-01 
00:00:00 UTC.
{noformat}
> select from_utc_timestamp(cast(0 as timestamp), "UTC") as result;
+---------------------+
| result              |
+---------------------+
| 1970-01-01 00:00:00 |
+---------------------+
{noformat}
Both Hive and Spark treat the input as the number of seconds since 1970-01-01 
00:00:00+<local-time-zone-offset>. Since I am in timezone America/Los_Angeles, 
my epoch is 1970-01-01 00:00:00 -8 hours, or 1969-12-31 16:00:00:
{noformat}
hive> select from_utc_timestamp(0, 'UTC');
OK
1969-12-31 16:00:00

spark-sql> select from_utc_timestamp(cast(0 as timestamp), 'UTC');
1969-12-31 16:00:00


hive> select from_utc_timestamp(0, 'America/Los_Angeles');
OK
1969-12-31 08:00:00

spark-sql> select from_utc_timestamp(cast(0 as timestamp), 
'America/Los_Angeles');
1969-12-31 08:00:00
{noformat}
I also classified this behavior with integer input as a bug in this Jira. 
However, it is consistent with Hive, so I am not so sure. 
 

 

 

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23715
>                 URL: https://issues.apache.org/jira/browse/SPARK-23715
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 07:18:23|
> +-------------------+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 00:18:23|
> +-------------------+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 00:18:23|
> +-------------------+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> magic under the covers to make things work (mostly) as the user expects.
> As an example, let's say a user in Los Angeles issues the following:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> {noformat}
> FromUTCTimestamp gets as input a Timestamp (long) value representing
> {noformat}
> 2018-03-13T06:18:23-07:00 (long value 1520947103000000)
> {noformat}
> What FromUTCTimestamp needs instead is
> {noformat}
> 2018-03-13T06:18:23+00:00 (long value 1520921903000000)
> {noformat}
> So, it applies the local timezone's offset to the input timestamp to get the 
> correct value (1520947103000000 minus 7 hours is 1520921903000000). Then it 
> can process the value and produce the expected output.
> When the user explicitly specifies a time zone, FromUTCTimestamp's 
> assumptions break down. The input is no longer in the local time zone. 
> Because of the way input data is implicitly casted, FromUTCTimestamp never 
> knows whether the input data had an explicit timezone.
> Here are some gory details:
> There is sometimes a mismatch in expectations between the (string => 
> timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp 
> expression never sees the actual input string (the cast "intercepts" the 
> input and converts it to a long timestamp before FromUTCTimestamp uses the 
> value), FromUTCTimestamp cannot reject any input value that would exercise 
> this mismatch in expectations.
> There is a similar mismatch in expectations in the (integer => timestamp) 
> cast and FromUTCTimestamp. As a result, Unix time input almost always 
> produces incorrect output.
> h3. When things work as expected for String input:
> When from_utc_timestamp is passed a string time value with no time zone, 
> DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
> datetime string as though it's in the user's local time zone. Because 
> DateTimeUtils.stringToTimestamp is a general function, this is reasonable.
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local 
> time zone's offset. FromUTCTimestamp assumes this (or more accurately, a 
> utility function called by FromUTCTimestamp assumes this), so the first thing 
> it does is reverse-shift to get it back the correct value. Now that the long 
> value has been shifted back to the correct timestamp value, it can now 
> process it (by shifting it again based on the specified time zone).
> h3. When things go wrong with String input:
> When from_utc_timestamp is passed a string datetime value with an explicit 
> time zone, stringToTimestamp honors that timezone and ignores the local time 
> zone. stringToTimestamp does not shift the timestamp by the local timezone's 
> offset, but by the timezone specified on the datetime string.
> Unfortunately, FromUTCTimestamp, which has no insight into the actual input 
> or the conversion, still assumes the timestamp is shifted by the local time 
> zone. So it reverse-shifts the long value by the local time zone's offset, 
> which produces a incorrect timestamp (except in the case where the input 
> datetime string just happened to have an explicit timezone that matches the 
> local timezone). FromUTCTimestamp then uses this incorrect value for further 
> processing.
> h3. When things go wrong for Unix time input:
> The cast in this case simply multiplies the integer by 1000000. The cast does 
> not shift the resulting timestamp by the local time zone's offset.
> Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the 
> result is wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to