Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r148601935
--- Diff: python/pyspark/serializers.py ---
@@ -274,12 +278,13 @@ def load_stream(self, stream):
"""
Deserialize ArrowRecordBatches to an Arrow table and return as a
list of pandas.Series.
"""
- from pyspark.sql.types import _check_dataframe_localize_timestamps
+ from pyspark.sql.types import _check_dataframe_localize_timestamps, from_arrow_schema
import pyarrow as pa
reader = pa.open_stream(stream)
+ schema = from_arrow_schema(reader.schema)
for batch in reader:
# NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
- pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
+ pdf = _check_dataframe_localize_timestamps(batch.to_pandas(), schema, self._timezone)
--- End diff --
Oh, maybe I misunderstood the purpose of the conf
"spark.sql.execution.pandas.respectSessionTimeZone". If it is set to true, then
what is the behavior of Spark?
1) convert timestamps in Pandas to remove the timezone and localize to SESSION_LOCAL_TIMEZONE
2) show Pandas timestamps with SESSION_LOCAL_TIMEZONE set as the timezone
It seems this change is doing (1), but what's wrong with doing (2)? I
think that would be a lot cleaner.
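
To make the difference concrete, here is a rough pandas sketch of the two
options. The series and the zone string are stand-ins (the zone plays the role
of SESSION_LOCAL_TIMEZONE), not the actual Spark code path:

import pandas as pd

# Hypothetical UTC timestamps standing in for what Arrow hands back.
tz = "America/Los_Angeles"  # stand-in for SESSION_LOCAL_TIMEZONE
s = pd.Series(pd.to_datetime(["2017-11-01 12:00:00"])).dt.tz_localize("UTC")

# Option (1): convert to the session time zone, then strip the tz info,
# leaving tz-naive values localized to SESSION_LOCAL_TIMEZONE.
opt1 = s.dt.tz_convert(tz).dt.tz_localize(None)
print(opt1.dtype)  # datetime64[ns]

# Option (2): convert to the session time zone but keep it attached,
# so the values stay tz-aware and the zone is visible on the dtype.
opt2 = s.dt.tz_convert(tz)
print(opt2.dtype)  # datetime64[ns, America/Los_Angeles]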
---