[GitHub] spark issue #19250: [SPARK-12297] Table timezone correction for Timestamps

squito Mon, 06 Nov 2017 06:43:26 -0800

Github user squito commented on the issue:

    https://github.com/apache/spark/pull/19250
  
    @cloud-fan I think you misunderstand the purpose of this change.
    
    The primary purpose is actually to deal with parquet, where that option 
doesn't do anything.  We need this for parquet for two reasons:
    
    1) **Interoperability with Impala**. Impala first used an int96 to store a 
timestamp in parquet, and it always stored the time as UTC (to go with the SQL 
standard definition of _timezone_).  But spark (and hive) read it back in the 
current timezone.  Even when you don't change timezones, and the _timestamp 
with time zone_ vs. _timestamp without time zone_ distinction doesn't matter, 
you get different values before this change.  
    
    2) **SQL STANDARD TIMESTAMP**.  SQL defines _timestamp_ to be a synonym for 
_timestamp without time zone_.  The behavior of that type is defined so if you 
insert "08:30" with time zone "America/New_York", then load the data with time 
zone "America/Los_Angeles", you should still see "08:30".  Since parquet is 
stored as an instant-in-time, and spark internally applies a timezone, the 
change in timezone must be reversed, by using some consistent adjustment when 
saving and reloading.  This doesn't give you real _timestamp without time 
zone_, but gets you closer.
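
    As a rough illustration of both points, here is a minimal sketch in plain Python rather than Spark code. The helpers `save` and `load` and the `table_tz` parameter are hypothetical names made up for this example; they only model the idea of a stored instant plus a consistent adjustment, and are not the API added by this PR.

```python
from datetime import datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo

NY = "America/New_York"
LA = "America/Los_Angeles"

def to_instant(wall_clock: datetime, tz: str) -> datetime:
    """Interpret a naive wall-clock value in tz and return it as a UTC instant."""
    return wall_clock.replace(tzinfo=ZoneInfo(tz)).astimezone(timezone.utc)

def to_wall_clock(instant: datetime, tz: str) -> datetime:
    """Render an instant as a naive wall-clock value in tz."""
    return instant.astimezone(ZoneInfo(tz)).replace(tzinfo=None)

def save(wall_clock: datetime, session_tz: str, table_tz: Optional[str] = None) -> datetime:
    """Hypothetical writer: uses the table zone if one is set, else the session zone."""
    return to_instant(wall_clock, table_tz or session_tz)

def load(instant: datetime, session_tz: str, table_tz: Optional[str] = None) -> datetime:
    """Hypothetical reader: applies the same zone choice as the writer."""
    return to_wall_clock(instant, table_tz or session_tz)

wall = datetime(2017, 11, 6, 8, 30)

# Point 2, without any adjustment: the session zone leaks into the result.
unadjusted = load(save(wall, NY), LA)                              # 05:30

# Point 2, with a consistent adjustment on both sides: 08:30 survives.
adjusted = load(save(wall, NY, table_tz="UTC"), LA, table_tz="UTC")  # 08:30

# Point 1: a writer that stores the wall clock as if it were UTC,
# read back in the session zone versus with the table zone set to UTC.
written_as_utc = to_instant(wall, "UTC")
read_in_session_zone = load(written_as_utc, NY)                    # 03:30
read_with_table_zone = load(written_as_utc, NY, table_tz="UTC")    # 08:30

print(unadjusted.time(), adjusted.time())
print(read_in_session_zone.time(), read_with_table_zone.time())
```

    The only design point the sketch tries to show is that the reader and the writer must agree on one fixed zone, so that the session zone drops out of the round trip.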
    
    To be honest, I see limited value in this change for formats other than 
parquet -- I added it only because I thought Reynold wanted it (for symmetry 
across formats, I suppose?).  As the purpose of this is to *undo* timezones, 
you can already achieve something similar in text-based formats by specifying a 
format which leaves out the timezone.  But it doesn't hurt.
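
    A small sketch of that text-format idea, again in plain Python and not Spark code: the pattern and the helpers `write_text` and `read_text` are assumptions for illustration only (in Spark the analogous knob for CSV and JSON would be the `timestampFormat` option). Because the written string carries no zone, a reader in another zone sees the same wall-clock value, even though it now denotes a different instant.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# A pattern with no zone or offset field (assumed for this example).
NO_ZONE_PATTERN = "%Y-%m-%d %H:%M:%S"

def write_text(instant: datetime, session_tz: str) -> str:
    """Format an instant as wall-clock text in the writer's session zone."""
    return instant.astimezone(ZoneInfo(session_tz)).strftime(NO_ZONE_PATTERN)

def read_text(text: str, session_tz: str) -> datetime:
    """Parse wall-clock text and attach the reader's session zone."""
    parsed = datetime.strptime(text, NO_ZONE_PATTERN)
    return parsed.replace(tzinfo=ZoneInfo(session_tz))

original = datetime(2017, 11, 6, 8, 30, tzinfo=ZoneInfo("America/New_York"))
text = write_text(original, "America/New_York")
reloaded = read_text(text, "America/Los_Angeles")

print(text)                 # wall clock as written, no zone attached
print(reloaded.time())      # same wall clock for the reader
print(reloaded - original)  # the underlying instant has shifted
```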
    
    We could reuse the "timezone" option for parquet for this purpose, but that 
would be rather strange, as it's almost doing the opposite of what that property 
does for text-based formats: that property is for adding a timezone, and 
this is for "removing" it.  It's doing something special enough that it seems like it 
deserves a more specific name than just "timezone".
    
    (This is all discussed at greater length, including showing how this type 
behaves in other sql engines, and how spark's behavior is non-standard, and how 
it changed in 2.0.1, in the design docs.)


