[ https://issues.apache.org/jira/browse/SPARK-41855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruifeng Zheng updated SPARK-41855:
----------------------------------
Description:
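{code:python}
from pyspark.sql import Row

# Runs inside a test method: self.connect is the Spark Connect session,
# self.spark is the regular PySpark session.
data = [
    Row(id=1, value=float("NaN")),
    Row(id=2, value=42.0),
    Row(id=3, value=None),
]
# Expected contents:
# +---+-----+
# | id|value|
# +---+-----+
# |  1|  NaN|
# |  2| 42.0|
# |  3| null|
# +---+-----+
cdf = self.connect.createDataFrame(data)
sdf = self.spark.createDataFrame(data)
print()
print()
# Spark Connect result
print(cdf._show_string(100, 100, False))
print()
print(cdf.schema)
print()
# Regular PySpark result
print(sdf._jdf.showString(100, 100, False))
print()
print(sdf.schema)
# Compare the rendered output of both DataFrames
self.compare_by_show(cdf, sdf)
{code}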
{code:java}
+---+-----+
| id|value|
+---+-----+
|  1| null|
|  2| 42.0|
|  3| null|
+---+-----+
StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])
+---+-----+
| id|value|
+---+-----+
|  1|  NaN|
|  2| 42.0|
|  3| null|
+---+-----+
StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])
{code}
This issue occurs because `createDataFrame` can't handle None/NaN properly:
1. In the conversion from local data to pd.DataFrame, None is automatically converted to NaN.
2. Then, in the conversion from pd.DataFrame to pa.Table, NaN is always converted to null.
was:
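{code:python}
from pyspark.sql import Row

# Runs inside a test method: self.connect is the Spark Connect session,
# self.spark is the regular PySpark session.
data = [
    Row(id=1, value=float("NaN")),
    Row(id=2, value=42.0),
    Row(id=3, value=None),
]
# Expected contents:
# +---+-----+
# | id|value|
# +---+-----+
# |  1|  NaN|
# |  2| 42.0|
# |  3| null|
# +---+-----+
cdf = self.connect.createDataFrame(data)
sdf = self.spark.createDataFrame(data)
print()
print()
# Spark Connect result
print(cdf._show_string(100, 100, False))
print()
print(cdf.schema)
print()
# Regular PySpark result
print(sdf._jdf.showString(100, 100, False))
print()
print(sdf.schema)
# Compare the rendered output of both DataFrames
self.compare_by_show(cdf, sdf)
{code}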
{code:java}
+---+-----+
| id|value|
+---+-----+
|  1| null|
|  2| 42.0|
|  3| null|
+---+-----+
StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])
+---+-----+
| id|value|
+---+-----+
|  1|  NaN|
|  2| 42.0|
|  3| null|
+---+-----+
StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])
{code}
This issue occurs because `createDataFrame` can't handle None properly:
1. In the conversion from local data to pd.DataFrame, None is automatically converted to NaN.
2. Then, in the conversion from pd.DataFrame to pa.Table, NaN is always converted to null.
> `createDataFrame` doesn't handle None/NaN properly
> --------------------------------------------------
>
> Key: SPARK-41855
> URL: https://issues.apache.org/jira/browse/SPARK-41855
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
>