xinrong-databricks commented on PR #36181:
URL: https://github.com/apache/spark/pull/36181#issuecomment-1100270625

   Thanks @zhengruifeng! 
   
   What you said makes sense, and I agree that `Double.NaN` is better than `null` 
in this case.
   
   I wanted to add that even with `Double.NaN`, adjustments (casting, NaN-aware 
statistical functions, etc.) are still needed to reach parity with pandas. For 
example, pandas skips NaN values in aggregations by default:
   
   ```py
   >>> pdf = pd.DataFrame({'a': [1.0, 2.0, 3.0, None], 'b': [1.0, 2.0, 3.0, np.nan]})
   >>> pdf
        a    b
   0  1.0  1.0
   1  2.0  2.0
   2  3.0  3.0
   3  NaN  NaN
   
   >>> pdf.sum()
   a    6.0
   b    6.0
   dtype: float64
   ```
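   For comparison, a plain aggregation that lets `Double.NaN` propagate behaves 
like `np.sum`, while pandas' default (`skipna=True`) matches `np.nansum`. A small 
NumPy sketch of the gap (not using Spark itself, just illustrating the two 
semantics):
   
   ```py
   import numpy as np
   
   values = np.array([1.0, 2.0, 3.0, np.nan])
   
   # NaN propagates through the sum, as it would in a plain
   # aggregation over a column containing Double.NaN.
   print(np.sum(values))     # nan
   
   # NaN is skipped, matching pandas' default skipna=True behavior.
   print(np.nansum(values))  # 6.0
   ```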
   
   I haven't estimated the exact effort needed, but I do think this is an 
interesting topic that deserves further discussion.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

