Sean Owen commented on SPARK-17971:

I'll say that I find the semantics of the Hive QL datetime + timezone functions 
odd, and Spark SQL is just mimicking them. For example, the behavior of 
from_utc_timestamp is already hard to understand because it operates on longs, 
essentially, and these can only really be thought of as absolute time since the 
epoch, not as a quantity carrying a time zone that can vary. That is, what is 
the "non-UTC" timestamp that comes out?

So from_utc_timestamp(x, "PST") will return a timestamp whose value is smaller 
by 8 * 3600 * 1000 because PST is GMT-8 (GMT vs. UTC issue noted). But what does 
that even mean? It's still a "UTC" timestamp, just an 8-hour-earlier one: the 
timestamp whose UTC-hour equals the PST-hour of timestamp x.
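That shift can be sketched in a few lines of Python (a sketch only: it treats "PST" as a fixed GMT-8 offset, whereas the real function applies the zone's full DST rules, and it works in seconds rather than Spark's millisecond longs):

```python
def from_utc_timestamp_sketch(epoch_seconds, offset_hours):
    # The input is an absolute instant (seconds since the epoch).
    # The "conversion" is just a shift: the result is another epoch
    # value, chosen so that its UTC wall clock matches the target
    # zone's wall clock at the original instant.
    return epoch_seconds + offset_hours * 3600

x = 1476354405                              # an epoch value, in seconds
shifted = from_utc_timestamp_sketch(x, -8)  # "PST" as a fixed GMT-8
assert x - shifted == 8 * 3600              # smaller by 8 hours
```

The point is that the output is not "a timestamp in PST" in any typed sense; it is just another point on the epoch timeline.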

hour() et al. will answer with respect to the current system timezone, yes. If 
your system is in PST, and you want to know the UTC-hour of a timestamp x, then 
you need a time whose PST-hour matches the UTC-hour of x. That's the reverse 
operation. I believe you want:

select hour(to_utc_timestamp(cast(1476354405 as timestamp), "PST"))

That works for me. Of course, you can programmatically insert 
TimeZone.getDefault.getID instead of "PST"; I believe that then works as 
desired everywhere. It has some logic to it, in that it reads as "the hour of a 
UTC timestamp ...", but it's not straightforward IMHO. Still, there are tools 
for this, and these are those tools.
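To see why the to_utc_timestamp form gives the desired answer, here is the same arithmetic in plain Python (again assuming a fixed GMT-8 offset for "PST" and ignoring DST; 1476354405 is the epoch value from the query above):

```python
from datetime import datetime, timedelta, timezone

x = 1476354405
pst = timezone(timedelta(hours=-8))  # "PST" as a fixed GMT-8 offset

# hour() answers in the system zone; if that zone is PST:
local_hour = datetime.fromtimestamp(x, tz=pst).hour

# The desired answer is the UTC hour of the same instant:
utc_hour = datetime.fromtimestamp(x, tz=timezone.utc).hour

# to_utc_timestamp shifts the epoch value forward by the zone offset,
# so the PST-hour of the shifted value equals the UTC-hour of x:
shifted_hour = datetime.fromtimestamp(x + 8 * 3600, tz=pst).hour
assert shifted_hour == utc_hour   # 10, not the local 2
```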

Hive has the same behavior, so I think this would be considered working as intended.

I looked at MySQL just now and it seems to have similar behaviors, with 
somewhat different methods, FWIW.

> Unix timestamp handling in Spark SQL not allowing calculations on UTC times
> ---------------------------------------------------------------------------
>                 Key: SPARK-17971
>                 URL: https://issues.apache.org/jira/browse/SPARK-17971
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.2
>         Environment: MacOS X JDK 7
>            Reporter: Gabriele Del Prete
> In our Spark data pipeline we store timed events using a bigint column called 
> 'timestamp', the values contained being Unix timestamp time points.
> Our datacenter servers' Java VMs are all set up to start with the timezone 
> set to UTC, while developers' computers are all in the US Eastern timezone. 
> Given how Spark SQL datetime functions work, it's impossible to do 
> calculations (eg. extract and compare hours, year-month-date triplets) using 
> UTC values:
> - from_unixtime takes a bigint unix timestamp and forces it to the computer's 
> local timezone;
> - casting the bigint column to timestamp does the same (it converts it to the 
> local timezone)
> - from_utc_timestamp works the same way, the only difference being that it 
> takes a string as input instead of a bigint.
> The result of all of this is that it's impossible to extract individual 
> fields of a UTC timestamp, since all timestamps always get converted to the 
> local timezone.
