Yicong-Huang commented on code in PR #53992:
URL: https://github.com/apache/spark/pull/53992#discussion_r2751868045
##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -409,21 +414,21 @@ def arrow_to_pandas(
ndarray_as_list=ndarray_as_list,
)
- def _create_array(self, series, arrow_type, spark_type=None,
arrow_cast=False):
+ def _create_array(self, series, spark_type, *, arrow_cast=False,
prefers_large_types=False):
"""
- Create an Arrow Array from the given pandas.Series and optional type.
+ Create an Arrow Array from the given pandas.Series and Spark type.
Parameters
----------
series : pandas.Series
A single series
- arrow_type : pyarrow.DataType, optional
- If None, pyarrow's inferred type will be used
spark_type : DataType, optional
- If None, spark type converted from arrow_type will be used
- arrow_cast: bool, optional
+ The Spark type to use. If None, pyarrow's inferred type will be
used.
Review Comment:
`createDataFrame`
([conversion.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.py#L868-L872),
[connect/session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/connect/session.py#L615-L627)):
`spark_type` can be `None` for non-timestamp columns when the user doesn't provide
a schema. This is the existing behavior on master:
```
spark_types = [
TimestampType() if is_datetime64_dtype(t) ...
else None # Non-timestamp columns get None
for t in data.dtypes
]
```
Later, [when the type is
None](https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/serializers.py#L474-L476)
(in this case both the Spark type and the Arrow type are `None` on master), pyarrow
will [try to
infer](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.from_pandas)
the type from the pandas dtype:
```
return pa.Array.from_pandas(
series, mask=mask, type=None, safe=self._safecheck
)
```
```
> type : pyarrow.DataType, optional
> If not provided, the Arrow type is inferred from the pandas dtype.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]