This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 776e2b93cb15 [SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types
776e2b93cb15 is described below
commit 776e2b93cb15ece363db162646c183b98a651f48
Author: Tian Gao <[email protected]>
AuthorDate: Thu Jan 29 09:34:51 2026 +0800
[SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types
### What changes were proposed in this pull request?
When we create a string type with `StringDtype`, always use `np.nan` as the
default missing value if we are using pandas 3.
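
A minimal standalone sketch of the same version gate, for illustration only. The helper name `default_string_dtype` is hypothetical, and the major-version parse stands in for the `LooseVersion` comparison used in the patch so that the snippet does not depend on pyspark:

```python
import numpy as np
import pandas as pd


def default_string_dtype() -> pd.StringDtype:
    """Hypothetical helper mirroring the version check in this patch."""
    # Assumption: the leading component of pd.__version__ is the major version.
    major = int(pd.__version__.split(".")[0])
    if major < 3:
        # Older pandas: keep the existing default StringDtype.
        return pd.StringDtype()
    # pandas 3: explicitly request NaN as the missing-value sentinel.
    return pd.StringDtype(na_value=np.nan)


dtype = default_string_dtype()
# None is stored as whichever missing value the chosen dtype uses.
ser = pd.Series(["a", None], dtype=dtype)
print(dtype, ser.isna().tolist())
```

Either branch yields a `StringDtype`; only the sentinel stored for missing entries differs.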
### Why are the changes needed?
pandas 3 always uses NaN as the missing-value sentinel for its string type:
https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#the-missing-value-sentinel-is-now-always-nan
### Does this PR introduce _any_ user-facing change?
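Yes. With pandas 3, `spark_type_to_pandas_dtype` now returns
`StringDtype(na_value=np.nan)` for `StringType`. Behavior with earlier pandas
versions is unchanged.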
### How was this patch tested?
Tested locally: some tests pass with pandas 3 after this change.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #54015 from gaogaotiantian/change-default-string-type.
Authored-by: Tian Gao <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
python/pyspark/pandas/typedef/typehints.py | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/python/pyspark/pandas/typedef/typehints.py b/python/pyspark/pandas/typedef/typehints.py
index 7573d677b48d..96220b448fa4 100644
--- a/python/pyspark/pandas/typedef/typehints.py
+++ b/python/pyspark/pandas/typedef/typehints.py
@@ -31,6 +31,8 @@ import pandas as pd
from pandas.api.types import CategoricalDtype, pandas_dtype
from pandas.api.extensions import ExtensionDtype
+from pyspark.loose_version import LooseVersion
+
extension_dtypes: Tuple[type, ...]
try:
@@ -148,8 +150,6 @@ def as_spark_type(
- dictionaries of field_name -> type
- Python3's typing system
"""
- from pyspark.loose_version import LooseVersion
-
# For NumPy typing, NumPy version should be 1.21+
if LooseVersion(np.__version__) >= LooseVersion("1.21"):
if (
@@ -274,7 +274,10 @@ def spark_type_to_pandas_dtype(
return BooleanDtype()
# StringType
elif isinstance(spark_type, types.StringType):
- return StringDtype()
+ if LooseVersion(pd.__version__) < "3.0.0":
+ return StringDtype()
+ else:
+ return StringDtype(na_value=np.nan)
# FractionalType
if extension_float_dtypes_available: