(spark) branch master updated: [SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types

ruifengz Wed, 28 Jan 2026 17:35:14 -0800

This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 776e2b93cb15 [SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types
776e2b93cb15 is described below

commit 776e2b93cb15ece363db162646c183b98a651f48
Author: Tian Gao <[email protected]>
AuthorDate: Thu Jan 29 09:34:51 2026 +0800

    [SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types
    
    ### What changes were proposed in this pull request?
    
    When we create a string dtype with `StringDtype` under pandas 3 or later, always use `np.nan` as the default missing value (`StringDtype(na_value=np.nan)`).
    
    ### Why are the changes needed?
    
    pandas 3 always uses `NaN` as the missing value sentinel for its string type, so we follow that decision:
    
    https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#the-missing-value-sentinel-is-now-always-nan
    
    ### Does this PR introduce _any_ user-facing change?
    
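    Yes. With pandas 3, `spark_type_to_pandas_dtype` returns `StringDtype(na_value=np.nan)` for `StringType`; with earlier pandas versions it still returns `StringDtype()`.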
    
    ### How was this patch tested?
    
    Tested locally with pandas 3: some tests pass after this change.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #54015 from gaogaotiantian/change-default-string-type.
    
    Authored-by: Tian Gao <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 python/pyspark/pandas/typedef/typehints.py | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/pandas/typedef/typehints.py b/python/pyspark/pandas/typedef/typehints.py
index 7573d677b48d..96220b448fa4 100644
--- a/python/pyspark/pandas/typedef/typehints.py
+++ b/python/pyspark/pandas/typedef/typehints.py
@@ -31,6 +31,8 @@ import pandas as pd
 from pandas.api.types import CategoricalDtype, pandas_dtype
 from pandas.api.extensions import ExtensionDtype
 
+from pyspark.loose_version import LooseVersion
+
 
 extension_dtypes: Tuple[type, ...]
 try:
@@ -148,8 +150,6 @@ def as_spark_type(
     - dictionaries of field_name -> type
     - Python3's typing system
     """
-    from pyspark.loose_version import LooseVersion
-
     # For NumPy typing, NumPy version should be 1.21+
     if LooseVersion(np.__version__) >= LooseVersion("1.21"):
         if (
@@ -274,7 +274,10 @@ def spark_type_to_pandas_dtype(
                 return BooleanDtype()
             # StringType
             elif isinstance(spark_type, types.StringType):
-                return StringDtype()
+                if LooseVersion(pd.__version__) < "3.0.0":
+                    return StringDtype()
+                else:
+                    return StringDtype(na_value=np.nan)
 
         # FractionalType
         if extension_float_dtypes_available:
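
The version gate in the last hunk can be tried outside of Spark. The following is a minimal sketch, not the code from the commit: `uses_nan_sentinel` is a hypothetical helper that parses the major version by hand instead of using `pyspark.loose_version.LooseVersion`, and the pandas part only runs if pandas and NumPy happen to be installed.

```python
def uses_nan_sentinel(pandas_version: str) -> bool:
    """Hypothetical helper: True if this pandas version is 3 or later.

    Only the major component is inspected, so pre-release strings such
    as "3.0.0rc1" are treated like "3.0.0".
    """
    major = int(pandas_version.split(".", 1)[0])
    return major >= 3


try:
    import numpy as np
    import pandas as pd
except ImportError:
    # pandas or NumPy is not installed; only the helper above is usable.
    pd = None

if pd is not None:
    # Mirror the branch in the diff: plain StringDtype() before pandas 3,
    # StringDtype(na_value=np.nan) from pandas 3 onwards.
    if uses_nan_sentinel(pd.__version__):
        dtype = pd.StringDtype(na_value=np.nan)
    else:
        dtype = pd.StringDtype()
    print(pd.__version__, dtype, repr(dtype.na_value))
```

Printing `dtype.na_value` shows which missing value sentinel the installed pandas version ends up with.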


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
