xinrong-databricks commented on PR #36181:
URL: https://github.com/apache/spark/pull/36181#issuecomment-1100270625

   Thanks @zhengruifeng! 
   
   What you said makes sense, and I agree that `Double.NaN` is better than `null` 
in this case.
   
   I wanted to add that even with `Double.NaN`, adjustments (casting, NaN-aware 
statistical functions, etc.) are still needed to reach parity with pandas. For 
example, pandas skips NaN values in aggregations by default:
   
   ```py
   >>> pdf = pd.DataFrame({'a': [1.0, 2.0, 3.0, None], 'b': [1.0, 2.0, 3.0, np.nan]})
   >>> pdf
        a    b
   0  1.0  1.0
   1  2.0  2.0
   2  3.0  3.0
   3  NaN  NaN
   
   >>> pdf.sum()
   a    6.0
   b    6.0
   dtype: float64
   ```
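   For comparison, a plain aggregation that lets `Double.NaN` propagate behaves 
like `np.sum`, while pandas' default (`skipna=True`) matches `np.nansum`. A small 
NumPy sketch of the gap (not using Spark itself, just illustrating the two 
semantics):
   
   ```py
   import numpy as np
   
   values = np.array([1.0, 2.0, 3.0, np.nan])
   
   # NaN propagates through the sum, as it would in a plain
   # aggregation over a column containing Double.NaN.
   print(np.sum(values))     # nan
   
   # NaN is skipped, matching pandas' default skipna=True behavior.
   print(np.nansum(values))  # 6.0
   ```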
   
   I haven't estimated the exact effort needed, but I do think this is an 
interesting topic that deserves further discussion.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

