[GitHub] [spark] zhengruifeng opened a new pull request, #39360: [SPARK-41855][SPARK-41814][SPARK-41851][SPARK-41852][CONNECT][PYTHON] Make `createDataFrame` handle None/NaN properly

GitBox Tue, 03 Jan 2023 02:07:46 -0800


zhengruifeng opened a new pull request, #39360:
URL: https://github.com/apache/spark/pull/39360


   ### What changes were proposed in this pull request?
   Make `createDataFrame` handle None/NaN properly
   
   The existing implementation always converts local data to a Pandas DataFrame 
and then to a PyArrow Table. This approach cannot handle None/NaN properly:
   
   1. local data -> Pandas DataFrame: `pd.DataFrame` always converts None 
in numeric columns into NaN, and there is no parameter to control this behavior;
   2. Pandas DataFrame -> PyArrow Table: `pa.Table.from_pandas` always 
converts NaN into null, and there is no parameter to control this behavior.
   
   ```
   In [72]: data = [Row(id=1, value=float("NaN"), s="x"), Row(id=2, value=42.0, s="y"), Row(id=3, value=None, s=None)]
   
   In [73]: pdf = pd.DataFrame(data, columns=["id", "value", "s"])
   
   In [74]: pdf
   Out[74]: 
      id  value     s
   0   1    NaN     x
   1   2   42.0     y
   2   3    NaN  None
   
   In [75]: pa.Table.from_pandas(pdf)
   Out[75]: 
   pyarrow.Table
   id: int64
   value: double
   s: string
   ----
   id: [[1,2,3]]
   value: [[null,42,null]]
   s: [["x","y",null]]
   ```
   
   To handle None/NaN correctly, I found that `pa.Table.from_pylist` works: it keeps NaN as NaN and converts only None to null.
   
   ```
   In [76]: pa.Table.from_pylist([d.asDict() for d in data])
   Out[76]: 
   pyarrow.Table
   id: int64
   value: double
   s: string
   ----
   id: [[1,2,3]]
   value: [[nan,42,null]]
   s: [["x","y",null]]
   ```
   
   ### Why are the changes needed?
   To support both None and NaN in `createDataFrame`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, `createDataFrame` now handles None/NaN properly.
   
   
   ### How was this patch tested?
   Added unit tests.
   


