Re: [PR] [SPARK-54962][PYTHON] Fix nullable integers handling in Pandas UDF [spark]

via GitHub Thu, 08 Jan 2026 03:43:44 -0800


zhengruifeng commented on code in PR #53730:
URL: https://github.com/apache/spark/pull/53730#discussion_r2672029145



##########
python/pyspark/sql/pandas/types.py:
##########
@@ -891,16 +921,17 @@ def _create_converter_to_pandas(
     pandas_type = _to_corrected_pandas_type(data_type)
 
     if pandas_type is not None:
-        # SPARK-21766: if an integer field is nullable and has null values, it can be
-        # inferred by pandas as a float column. If we convert the column with NaN back
-        # to integer type e.g., np.int16, we will hit an exception. So we use the
-        # pandas-inferred float type, rather than the corrected type from the schema
-        # in this case.
         if isinstance(data_type, IntegralType) and nullable:
+            if integer_object_nulls:
+                # pandas_type like np.int64 doesn't support nullable data
+                # use Pandas extension type instead
+                nullable_type = _to_corrected_pandas_ext_type(data_type)
+            else:
+                nullable_type = np.float64
 
             def correct_dtype(pser: pd.Series) -> pd.Series:
                 if pser.isnull().any():
-                    return pser.astype(np.float64, copy=False)
+                    return pser.astype(nullable_type, copy=False)

Review Comment:
   One solution is to convert with `integer_object_nulls=True` and then cast to the pandas extension type:
   ```
   In [58]: arr = pa.array([1,2,3,None])
   
   In [59]: ser = arr.to_pandas(integer_object_nulls=True)
   
   In [60]: ser = ser.astype(pd.Int64Dtype(), copy=False)
   
   In [61]: ser
   Out[61]:
   0       1
   1       2
   2       3
   3    <NA>
   dtype: Int64
   ```
   
   Actually, a better solution is to build the Series directly with the extension dtype:
   ```
   In [62]: arr = pa.array([1,2,3,None])
   
   In [63]: ser = pd.Series(arr, dtype=pd.Int64Dtype())
   ```
   With this approach, the intermediate object-typed `ser` can be optimized out. However, it doesn't fit the existing framework (`pa.Array.to_pandas` + `_create_converter_to_pandas`), so we will revisit and refactor the `converter` in the future.
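
For reference, a minimal sketch of what the two `nullable_type` choices in the diff do to a column that contains nulls. It uses only pandas and NumPy, and the object-typed `pser` is a hand-built stand-in (an assumption, not the real converter input) for what `pa.Array.to_pandas(integer_object_nulls=True)` would return. The value `2**53 + 1` is chosen only because `float64` cannot represent it exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical input: an object-dtype Series of Python ints and None,
# standing in for the result of to_pandas(integer_object_nulls=True).
pser = pd.Series([1, 2, 2**53 + 1, None], dtype=object)

# Fallback branch in the diff: cast to float64, nulls become NaN.
as_float = pser.astype(np.float64)

# integer_object_nulls branch: cast to the pandas nullable extension type,
# nulls become <NA> and the values stay integers.
as_ext = pser.astype(pd.Int64Dtype())

print(as_float.dtype, as_ext.dtype)
print(int(as_float.iloc[2]) == 2**53 + 1)  # float64 rounds the large value
print(int(as_ext.iloc[2]) == 2**53 + 1)    # Int64 keeps it exact
```

The float path is the pre-existing behavior and silently changes both the dtype and, for large values, the data. The extension-type path keeps integer semantics at the cost of going through an object-typed intermediate.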



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

