Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r144167503

    --- Diff: python/pyspark/sql/types.py ---
    @@ -1619,11 +1619,47 @@ def to_arrow_type(dt):
             arrow_type = pa.decimal(dt.precision, dt.scale)
         elif type(dt) == StringType:
             arrow_type = pa.string()
    +    elif type(dt) == DateType:
    +        arrow_type = pa.date32()
    +    elif type(dt) == TimestampType:
    +        arrow_type = pa.timestamp('us', tz='UTC')
         else:
             raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
         return arrow_type
    +
    +
    +def _check_localize_series_timestamps(s):
    +    from pandas.types.common import is_datetime64_dtype
    +    # TODO: handle nested timestamps?
    +    if is_datetime64_dtype(s.dtype):
    +        # TODO: pyarrow.Column.to_pandas keeps data in UTC but removes timezone
    --- End diff --

It seems like `pyarrow.Column.to_pandas()` produces a different timestamp series than calling `pyarrow.Table.to_pandas()` and then accessing the timestamp column of the resulting DataFrame. The former keeps the timestamps in UTC but removes the timezone, so a different conversion was required to get the data right. Not sure if this is an Arrow bug or not; I need to look into it more.
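To illustrate the two conversion paths being compared, here is a minimal sketch. Note that `pyarrow.Column` was part of the pyarrow API at the time of this review and has since been removed; in current pyarrow the column object is a `ChunkedArray`, and both paths preserve the timezone, so this only shows where the divergence was observed, not a guaranteed reproduction:

```python
import pandas as pd
import pyarrow as pa

# A UTC-tagged microsecond timestamp array, matching the Arrow type
# produced by to_arrow_type() for TimestampType above.
arr = pa.array([pd.Timestamp('2017-10-10 12:00:00')],
               type=pa.timestamp('us', tz='UTC'))
table = pa.Table.from_arrays([arr], names=['ts'])

# Path 1: convert the whole table, then take the column.
df_col = table.to_pandas()['ts']

# Path 2: convert the column object directly (pyarrow.Column at the
# time of this review; a ChunkedArray in current pyarrow). This is the
# path that reportedly kept UTC values but dropped the timezone.
direct_col = table.column('ts').to_pandas()

print(df_col.dtype, direct_col.dtype)
```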
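For context, a minimal sketch of the kind of conversion the comment alludes to, assuming the tz-naive values coming back from Arrow are actually in UTC. The helper name and the use of `dateutil` local time are illustrative, not the PR's final code; also note that `pandas.types.common` in the diff is the old pandas import path, later replaced by `pandas.api.types`:

```python
import pandas as pd
from dateutil import tz

def localize_naive_utc_series(s):
    # Hypothetical helper: interpret tz-naive datetime64 values as UTC,
    # convert to local wall-clock time, then drop the timezone again so
    # the series stays tz-naive.
    if pd.api.types.is_datetime64_dtype(s.dtype):
        return (s.dt.tz_localize('UTC')
                 .dt.tz_convert(tz.tzlocal())
                 .dt.tz_localize(None))
    return s
```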