Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18664
  
    I disagree with using `DateTimeUtils.defaultTimeZone()` for the timezone.
    If `DateTimeUtils.defaultTimeZone()` differs from the system timezone in Python, the return values differ between with-Arrow and without-Arrow.
    The current without-Arrow path respects the system timezone in Python, so we should respect it as well and keep the timezone on the Arrow type as it should be.
    
    Let me show an example (similar to https://github.com/apache/spark/pull/18664#issuecomment-318791052):
    
    I'm in Japan, so `DateTimeUtils.defaultTimeZone()` would be `"Asia/Tokyo"`; I set `os.environ['TZ']` to `"America/Los_Angeles"` and the session timezone to `"America/New_York"`.
    
    ```
    >>> import datetime
    >>> import os
    >>> import time
    >>> from pyspark.sql.types import *
    >>>
    >>> os.environ['TZ'] = "America/Los_Angeles"
    >>> time.tzset()
    >>>
    >>> dt = datetime.datetime(1970, 1, 1, 0, 0, 1)
    >>> dt
    datetime.datetime(1970, 1, 1, 0, 0, 1)
    >>>
    >>> spark.conf.set("spark.sql.session.timeZone", "America/New_York")
    >>> df = spark.createDataFrame([(dt,)], 
schema=StructType([StructField("ts", TimestampType(), True)]))
    >>> df.show()
    +-------------------+
    |                 ts|
    +-------------------+
    |1970-01-01 03:00:01|
    +-------------------+
    
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    >>> df.toPandas()
                       ts
    0 1970-01-01 00:00:01
    ```
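    
    For reference, every value in this comment is the same instant rendered in a different timezone. Here is a minimal standalone sketch (not part of the patch; it assumes `pytz` is available) that reproduces each rendering:
    
    ```
    >>> import datetime
    >>> import pytz
    >>>
    >>> la = pytz.timezone("America/Los_Angeles")
    >>> # The naive datetime is interpreted in the system timezone (LA),
    >>> # so Spark stores 1970-01-01 08:00:01 UTC internally.
    >>> utc_dt = la.localize(datetime.datetime(1970, 1, 1, 0, 0, 1)).astimezone(pytz.utc)
    >>> utc_dt.astimezone(pytz.timezone("America/New_York")).strftime("%H:%M:%S")
    '03:00:01'
    >>> utc_dt.astimezone(pytz.timezone("Asia/Tokyo")).strftime("%H:%M:%S")
    '17:00:01'
    >>> utc_dt.astimezone(la).strftime("%H:%M:%S")
    '00:00:01'
    ```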
    
    - Current HEAD (c4fd5ae1e61d118877151d713e8ea0e53a7afaf4) with the compilation error fixed:
    
    ```
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    >>> df.toPandas()
                       ts
    0 1970-01-01 03:00:01
    ```
    
    - Using `DateTimeUtils.defaultTimeZone()` as the timezone:
    
    ```
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    >>> df.toPandas()
                       ts
    0 1970-01-01 17:00:01
    ```
    
    - Adjusting with the system timezone in Python (this matches the without-Arrow result above):
    
    ```
    >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    >>> df.toPandas()
                       ts
    0 1970-01-01 00:00:01
    ```
    
    The patch I used here is:
    
    ```diff
     def _localize_series_timestamps(s):
         """ Convert a tz-aware timestamp to local tz-naive
         """
    -    return s.dt.tz_localize(None)
    +    return s.dt.tz_convert('tzlocal()').dt.tz_localize(None)
    ```
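    
    For illustration, here is a self-contained pandas session (not from the PR; the literal UTC timestamp is taken from the example above) showing what this change does to the tz-aware series Arrow hands back. Note that pandas resolves the `'tzlocal()'` string to `dateutil.tz.tzlocal()`, which follows the system timezone:
    
    ```
    >>> import os
    >>> import time
    >>> import pandas as pd
    >>>
    >>> os.environ['TZ'] = "America/Los_Angeles"
    >>> time.tzset()
    >>>
    >>> # Simulate what Arrow returns: tz-aware values in the session timezone.
    >>> s = pd.Series(pd.to_datetime(["1970-01-01 08:00:01"]))
    >>> s = s.dt.tz_localize("UTC").dt.tz_convert("America/New_York")
    >>>
    >>> s.dt.tz_localize(None)  # old behavior: New York wall clock
    0   1970-01-01 03:00:01
    dtype: datetime64[ns]
    >>> s.dt.tz_convert('tzlocal()').dt.tz_localize(None)  # patched: system local time
    0   1970-01-01 00:00:01
    dtype: datetime64[ns]
    ```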


