zhengruifeng opened a new pull request, #39360:
URL: https://github.com/apache/spark/pull/39360
### What changes were proposed in this pull request?
Make `createDataFrame` handle None/NaN properly
The existing implementation always converts local data to a Pandas DataFrame and
then to a PyArrow Table. This approach cannot handle None/NaN properly:
1. local data -> Pandas DataFrame: `pd.DataFrame` always converts None
in numeric columns into NaN, and there is no parameter to control this behavior;
2. Pandas DataFrame -> PyArrow Table: `pa.Table.from_pandas` always
converts NaN into null, and there is no parameter to control this behavior.
```
In [72]: data = [Row(id=1, value=float("NaN"), s="x"), Row(id=2, value=42.0, s="y"), Row(id=3, value=None, s=None)]
In [73]: pdf = pd.DataFrame(data, columns=["id", "value", "s"])
In [74]: pdf
Out[74]:
   id  value     s
0   1    NaN     x
1   2   42.0     y
2   3    NaN  None
In [75]: pa.Table.from_pandas(pdf)
Out[75]:
pyarrow.Table
id: int64
value: double
s: string
----
id: [[1,2,3]]
value: [[null,42,null]]
s: [["x","y",null]]
```
To handle None/NaN correctly, I found that `pa.Table.from_pylist` works:
```
In [76]: pa.Table.from_pylist([d.asDict() for d in data])
Out[76]:
pyarrow.Table
id: int64
value: double
s: string
----
id: [[1,2,3]]
value: [[nan,42,null]]
s: [["x","y",null]]
```
### Why are the changes needed?
So that `createDataFrame` supports both None and NaN and keeps them distinct, instead of turning both into null.
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
Added unit tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]