GitHub user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r144167503
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1619,11 +1619,47 @@ def to_arrow_type(dt):
             arrow_type = pa.decimal(dt.precision, dt.scale)
         elif type(dt) == StringType:
             arrow_type = pa.string()
    +    elif type(dt) == DateType:
    +        arrow_type = pa.date32()
    +    elif type(dt) == TimestampType:
    +        arrow_type = pa.timestamp('us', tz='UTC')
         else:
             raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
         return arrow_type
     
     
    +def _check_localize_series_timestamps(s):
    +    from pandas.types.common import is_datetime64_dtype
    +    # TODO: handle nested timestamps?
    +    if is_datetime64_dtype(s.dtype):
    +        # TODO: pyarrow.Column.to_pandas keeps data in UTC but removes timezone
    --- End diff --
    
    `pyarrow.Column.to_pandas()` appears to produce a different timestamp
    series than calling `pyarrow.Table.to_pandas()` and then selecting the
    timestamp column from the resulting DataFrame. The former keeps the
    timestamps in UTC but drops the timezone, so a different conversion was
    required to get the data right. I'm not sure yet whether this is an Arrow
    bug; I need to look into it more.
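
To illustrate the kind of conversion described above, here is a minimal sketch using only the standard library rather than pandas or pyarrow. It assumes the input values are tz-naive but actually represent UTC, and that the goal is tz-naive wall-clock time in some local zone. The helper name `utc_naive_to_local_naive` and the fixed UTC-08:00 offset are hypothetical and are not taken from the pull request. In pandas, the analogous steps would likely go through `Series.dt.tz_localize` and `Series.dt.tz_convert`.

```python
from datetime import datetime, timedelta, timezone


def utc_naive_to_local_naive(values, local_tz):
    """Treat tz-naive datetimes as UTC and return tz-naive local wall-clock times.

    Hypothetical helper for illustration only; not the code from the pull request.
    """
    return [
        # 1. attach UTC, 2. convert to the local zone, 3. drop the tzinfo again
        v.replace(tzinfo=timezone.utc).astimezone(local_tz).replace(tzinfo=None)
        for v in values
    ]


# Assumed fixed offset (UTC-08:00) so the example does not depend on the host timezone.
local_zone = timezone(timedelta(hours=-8))

utc_naive = [datetime(2017, 10, 12, 8, 0, 0), datetime(2017, 10, 12, 23, 30, 0)]
local_naive = utc_naive_to_local_naive(utc_naive, local_zone)
print(local_naive)
```

The point of the sketch is the order of operations: a value that has lost its timezone must first be marked as UTC before it is shifted, otherwise it would be interpreted as already being in local time.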


---
