grundprinzip commented on code in PR #39469:
URL: https://github.com/apache/spark/pull/39469#discussion_r1066570799
##########
python/pyspark/sql/connect/session.py:
##########
@@ -215,7 +215,38 @@ def createDataFrame(
_inferred_schema: Optional[StructType] = None
if isinstance(data, pd.DataFrame):
- _table = pa.Table.from_pandas(data)
+ from pandas.api.types import ( # type: ignore[attr-defined]
+ is_datetime64_dtype,
+ is_datetime64tz_dtype,
+ )
+ from pyspark.sql.pandas.types import (
+ _check_series_convert_timestamps_internal,
+ _get_local_timezone,
+ )
+
+ # Copying the frame to avoid modifying it.
+ data_copy = data.copy()
Review Comment:
I can change this to copy only when necessary. The main issue is that I
need a copy so that I can change the pandas dtypes before converting the frame
to an Arrow table. There is likely a better way to do this that does not
require touching the pandas DataFrame at all.
The general question is: is it OK to modify the input DataFrame, or should it
be treated as immutable?
However, even in PySpark we modify the input DataFrame when we localize the
timezones.
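
To make the "copy only when necessary" idea concrete, here is a minimal sketch. It is not the code in this PR: `normalize_timestamps` is a hypothetical helper, it uses only public pandas APIs rather than the private `_check_series_convert_timestamps_internal`, it assumes string column labels (required by `DataFrame.assign`), and it localizes tz-naive values to a caller-supplied zone instead of the session-local one.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def normalize_timestamps(data: pd.DataFrame, tz: str = "UTC") -> pd.DataFrame:
    """Return a frame whose datetime columns are expressed in `tz`.

    Hypothetical helper: the caller's frame is never mutated, and it is
    returned unchanged (no copy) when it has no datetime columns.
    """
    datetime_cols = [c for c in data.columns if is_datetime64_any_dtype(data[c])]
    if not datetime_cols:
        # Nothing to convert, so no copy is needed.
        return data

    converted = {}
    for col in datetime_cols:
        series = data[col]
        if series.dt.tz is None:
            # tz-naive values: attach the target zone.
            converted[col] = series.dt.tz_localize(tz)
        else:
            # tz-aware values: convert to the target zone.
            converted[col] = series.dt.tz_convert(tz)

    # assign() returns a new DataFrame; `data` itself is left as it was.
    return data.assign(**converted)
```

With this shape the input stays immutable from the caller's point of view, and frames without timestamp columns pass through without any copy. How much data `assign` copies internally depends on the pandas version.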
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]