Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19607#discussion_r153168918
--- Diff: python/pyspark/sql/session.py ---
@@ -444,11 +445,30 @@ def _get_numpy_record_dtype(self, rec):
             record_type_list.append((str(col_names[i]), curr_type))
         return np.dtype(record_type_list) if has_rec_fix else None

-    def _convert_from_pandas(self, pdf):
+    def _convert_from_pandas(self, pdf, schema, timezone):
         """
         Convert a pandas.DataFrame to list of records that can be used to
         make a DataFrame
         :return list of records
         """
+        if timezone is not None:
+            from pyspark.sql.types import _check_series_convert_timestamps_tz_local
+            copied = False
+            if isinstance(schema, StructType):
+                for field in schema:
+                    # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
+                    if isinstance(field.dataType, TimestampType):
+                        s = _check_series_convert_timestamps_tz_local(pdf[field.name], timezone)
+                        if not copied and s is not pdf[field.name]:
+                            pdf = pdf.copy()
+                            copied = True
--- End diff ---
Would you mind if I ask why we should copy here? A comment explaining it
would probably be helpful. To be clear, is it to prevent the original
pandas DataFrame from being updated?
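
For context, here is a minimal standalone sketch of the copy-on-write
pattern the diff appears to use. This is not the PR's code:
`convert_ts_columns` is a hypothetical helper, and the pandas
`dt.tz_localize`/`dt.tz_convert` calls stand in for
`_check_series_convert_timestamps_tz_local`. The idea is that the frame
is copied exactly once, before the first in-place column assignment, so
the caller's pandas DataFrame is never mutated.

    import pandas as pd

    def convert_ts_columns(pdf, ts_columns, tz):
        # Hypothetical helper illustrating the pattern; the real code uses
        # _check_series_convert_timestamps_tz_local from pyspark.sql.types.
        copied = False
        for name in ts_columns:
            col = pdf[name]
            # tz-naive columns are localized, tz-aware ones converted;
            # both calls return a new Series and leave `col` untouched.
            s = (col.dt.tz_localize(tz) if col.dt.tz is None
                 else col.dt.tz_convert(tz))
            if s is not col:
                if not copied:
                    # Copy once before the first write so the assignment
                    # below does not mutate the caller's DataFrame.
                    pdf = pdf.copy()
                    copied = True
                pdf[name] = s
        return pdf

    original = pd.DataFrame({"ts": pd.to_datetime(["2017-11-01 10:00:00"])})
    converted = convert_ts_columns(original, ["ts"], "America/Los_Angeles")
    assert original["ts"].dt.tz is None       # caller's frame unchanged
    assert converted["ts"].dt.tz is not None  # returned frame converted

The `copied` flag avoids paying for a full-frame copy when no timestamp
column actually needs conversion, which seems to be the motivation for
the identity check `s is not pdf[field.name]` in the diff.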