Yicong-Huang commented on code in PR #53992:
URL: https://github.com/apache/spark/pull/53992#discussion_r2752551692
##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -409,21 +414,21 @@ def arrow_to_pandas(
             ndarray_as_list=ndarray_as_list,
         )
-    def _create_array(self, series, arrow_type, spark_type=None, arrow_cast=False):
+    def _create_array(self, series, spark_type, *, arrow_cast=False, prefers_large_types=False):
         """
-        Create an Arrow Array from the given pandas.Series and optional type.
+        Create an Arrow Array from the given pandas.Series and Spark type.
         Parameters
         ----------
         series : pandas.Series
             A single series
-        arrow_type : pyarrow.DataType, optional
-            If None, pyarrow's inferred type will be used
         spark_type : DataType, optional
-            If None, spark type converted from arrow_type will be used
-        arrow_cast: bool, optional
+            The Spark type to use. If None, pyarrow's inferred type will be used.
Review Comment:
Review Comment:
That makes sense, thanks for the suggestion.
I want to make sure I understand what you mean by factoring out the `createDataFrame` usage.
This PR does not change the `createDataFrame` behavior. Spark still allows users to pass an optional schema (see the [doc here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.createDataFrame.html)), and when no schema is provided we can reach this stage without a Spark type. That is already the current behavior, so even if we refactor and isolate the `createDataFrame`-related logic, there would still be cases where the Spark type is None before the Arrow conversion.
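To make the two paths concrete, here is a minimal sketch (the session and the `pdf` data are just an illustration, not from this PR):
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"id": [1, 2, 3]})

# Schema provided: the Spark type of each column is known up front.
df_explicit = spark.createDataFrame(
    pdf, schema=StructType([StructField("id", LongType())])
)

# No schema: the types have to be inferred, so at this stage in the
# serializer the Spark type for a column can still be None.
df_inferred = spark.createDataFrame(pdf)
```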
Are you suggesting that instead we should make `createDataFrame` always let
Arrow infer the type first, and then convert that inferred Arrow type back into
a Spark type, so that downstream we can assume the Spark type is always defined?
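If that is the direction you have in mind, I imagine something roughly like the sketch below (just an illustration; `infer_spark_type` is a made-up helper, and `from_arrow_type` is the existing converter in `pyspark.sql.pandas.types`):
```python
import pyarrow as pa
from pyspark.sql.pandas.types import from_arrow_type


def infer_spark_type(series):
    # Let pyarrow infer an Arrow type from the pandas data first, then map
    # that inferred Arrow type back to a Spark type, so downstream code can
    # assume spark_type is always defined.
    inferred = pa.Array.from_pandas(series)
    return from_arrow_type(inferred.type)
```
That would keep all type resolution ahead of `_create_array`, so it could require a non-None `spark_type`.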