Yicong-Huang commented on code in PR #53992:
URL: https://github.com/apache/spark/pull/53992#discussion_r2752551692
##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -409,21 +414,21 @@ def arrow_to_pandas(
             ndarray_as_list=ndarray_as_list,
         )
-    def _create_array(self, series, arrow_type, spark_type=None, arrow_cast=False):
+    def _create_array(self, series, spark_type, *, arrow_cast=False, prefers_large_types=False):
         """
-        Create an Arrow Array from the given pandas.Series and optional type.
+        Create an Arrow Array from the given pandas.Series and Spark type.
         Parameters
         ----------
         series : pandas.Series
             A single series
-        arrow_type : pyarrow.DataType, optional
-            If None, pyarrow's inferred type will be used
         spark_type : DataType, optional
-            If None, spark type converted from arrow_type will be used
-        arrow_cast: bool, optional
+            The Spark type to use. If None, pyarrow's inferred type will be used.
Review Comment:
Review Comment:
That makes sense, thanks for the suggestion.
I want to make sure I understand what you mean by factoring out the `createDataFrame` usage.
This PR does not change the `createDataFrame` behavior. Spark still allows users to pass an optional schema (see the [doc here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.createDataFrame.html)), and when no schema is provided we can reach this stage without a Spark type. That is already the current behavior, so even if we refactor and isolate the `createDataFrame`-related logic, there would still be cases where the Spark type is None before the Arrow conversion.
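To make the two paths concrete, here is a minimal sketch (the session and the `pdf` data are just an illustration, not from this PR):
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"id": [1, 2, 3]})

# Schema provided: the Spark type of each column is known up front.
df_explicit = spark.createDataFrame(
    pdf, schema=StructType([StructField("id", LongType())])
)

# No schema: the types have to be inferred, so at this stage in the
# serializer the Spark type for a column can still be None.
df_inferred = spark.createDataFrame(pdf)
```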
Are you suggesting that instead we should make `createDataFrame` always let
Arrow infer the type first, and then convert that inferred Arrow type back into
a Spark type, so that downstream we can assume the Spark type is always defined?
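If that is the direction you have in mind, I imagine something roughly like the sketch below (just an illustration; `infer_spark_type` is a made-up helper, and `from_arrow_type` is the existing converter in `pyspark.sql.pandas.types`):
```python
import pyarrow as pa
from pyspark.sql.pandas.types import from_arrow_type


def infer_spark_type(series):
    # Let pyarrow infer an Arrow type from the pandas data first, then map
    # that inferred Arrow type back to a Spark type, so downstream code can
    # assume spark_type is always defined.
    inferred = pa.Array.from_pandas(series)
    return from_arrow_type(inferred.type)
```
That would keep all type resolution ahead of `_create_array`, so it could require a non-None `spark_type`.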