Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r144248880
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1619,11 +1619,47 @@ def to_arrow_type(dt):
             arrow_type = pa.decimal(dt.precision, dt.scale)
         elif type(dt) == StringType:
             arrow_type = pa.string()
    +    elif type(dt) == DateType:
    +        arrow_type = pa.date32()
    +    elif type(dt) == TimestampType:
    +        arrow_type = pa.timestamp('us', tz='UTC')
         else:
             raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
         return arrow_type
     
     
    +def _check_localize_series_timestamps(s):
    +    from pandas.types.common import is_datetime64_dtype
    +    # TODO: handle nested timestamps?
    +    if is_datetime64_dtype(s.dtype):
    +        # TODO: pyarrow.Column.to_pandas keeps data in UTC but removes timezone
    --- End diff ---
    
    I'd suspect it's a bug, because we expect both to behave the same way, and neither should drop the timezone info.
    Anyway, how about creating a pandas DataFrame first and then splitting it into Series?
    
    I mean as follows:
    
    - python/pyspark/serializers.py
    ```diff
             reader = pa.open_stream(stream)
             for batch in reader:
    -            table = pa.Table.from_batches([batch])
    -            yield [_check_localize_series_timestamps(c.to_pandas()) for c in table.itercolumns()]
    +            pdf = batch.to_pandas()
    +            yield [_check_localize_series_timestamps(c) for _, c in pdf.iteritems()]
    ```
    
    - python/pyspark/sql/types.py
    ```diff
     def _check_localize_series_timestamps(s):
    -    from pandas.types.common import is_datetime64_dtype
    +    from pandas.types.common import is_datetime64tz_dtype
         # TODO: handle nested timestamps?
    -    if is_datetime64_dtype(s.dtype):
    -        # TODO: pyarrow.Column.to_pandas keeps data in UTC but removes timezone
    -        return s.dt.tz_localize('UTC').dt.tz_convert('tzlocal()').dt.tz_localize(None)
    +    if is_datetime64tz_dtype(s.dtype):
    +        return s.dt.tz_convert('tzlocal()').dt.tz_localize(None)
         else:
             return s
    ```
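
    As a self-contained illustration of the dtype difference, here's a minimal sketch (the import path assumes pandas >= 0.20, which moved the check to `pandas.api.types`; older pandas has it under `pandas.types.common` as in the diff):

    ```python
    import pandas as pd
    from pandas.api.types import is_datetime64tz_dtype  # pandas >= 0.20

    def check_localize(s):
        # tz-aware series: convert to local time, then drop the tz info,
        # mirroring the patched _check_localize_series_timestamps above
        if is_datetime64tz_dtype(s.dtype):
            return s.dt.tz_convert('tzlocal()').dt.tz_localize(None)
        return s

    # RecordBatch.to_pandas keeps the tz from pa.timestamp('us', tz='UTC'),
    # so the series dtype is datetime64[ns, UTC] and tz_convert alone suffices
    s = pd.Series(pd.to_datetime(['2017-10-16 00:00:00'])).dt.tz_localize('UTC')
    print(s.dtype)                   # datetime64[ns, UTC]
    print(check_localize(s).dtype)   # datetime64[ns]
    ```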
    
    We need to check whether this causes a performance regression or not.
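
    For a rough local comparison, something along these lines could be timed (the data shape here is arbitrary and made up for the sketch; the real batches come from the Arrow stream, so treat the numbers with care):

    ```python
    import timeit

    import numpy as np
    import pandas as pd
    import pyarrow as pa

    pdf = pd.DataFrame({
        'x': np.arange(1000000),
        'ts': pd.date_range('2017-01-01', periods=1000000, freq='s'),
    })
    batch = pa.RecordBatch.from_pandas(pdf)

    def per_column():
        # current path: build a Table and convert each column separately
        table = pa.Table.from_batches([batch])
        return [c.to_pandas() for c in table.itercolumns()]

    def whole_batch():
        # suggested path: convert the whole batch, then split into Series
        df = batch.to_pandas()
        return [c for _, c in df.iteritems()]  # .items() in newer pandas

    print('per column :', timeit.timeit(per_column, number=10))
    print('whole batch:', timeit.timeit(whole_batch, number=10))
    ```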
    
    Btw, when using this patch, we can use `date(1969, 1, 1)` and `datetime(1969, 1, 1, 1, 1, 1)` for the test you marked as TODO in `test_vectorized_udf_timestamps`. I'm not exactly sure of the reason, though.
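
    If it helps to poke at that, here's a quick ad-hoc round trip of a pre-epoch value through the same localize/convert steps (my own sketch, not part of the patch):

    ```python
    from datetime import datetime

    import pandas as pd

    s = pd.Series([datetime(1969, 1, 1, 1, 1, 1)])

    # serialize direction: local naive -> UTC naive (matching tz='UTC' in Arrow)
    utc = s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC').dt.tz_localize(None)
    # deserialize direction: UTC naive -> local naive, as in the patched helper
    back = utc.dt.tz_localize('UTC').dt.tz_convert('tzlocal()').dt.tz_localize(None)

    assert (s == back).all()
    ```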

