amoeba commented on issue #44089:
URL: https://github.com/apache/arrow/issues/44089#issuecomment-2347131229

   Hi @psychelzh, a change here might require some discussion. My understanding 
is that different types of systems treat NaNs differently, and what you're 
seeing is a result of arrow having different semantics than R. R and the pandas 
Python package treat NaNs as missing values (NA/None), whereas database systems 
do not. I think arrow is emulating the database system semantics here.
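
To make the two conventions concrete, here is a rough plain-Python sketch (not Arrow's or pandas' actual implementation; both function names are made up for illustration). One mean skips only `None`, standing in for NULL, and lets NaN propagate; the other also treats NaN as missing:

```python
import math


def mean_database_style(values):
    """Hypothetical: skip None (NULL) only; NaN is a normal float and propagates."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def mean_nan_as_missing(values):
    """Hypothetical: skip both None and NaN, like pandas does by default."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return sum(present) / len(present) if present else None


data = [1.0, 2.0, math.nan]
print(mean_database_style(data))  # nan
print(mean_nan_as_missing(data))  # 1.5
```

Same input, two different answers, depending only on whether NaN counts as "missing".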
   
   For comparison, Python is in a similar situation to R: pandas treats NaN as 
NA/null (and actually skips them by default, so this is _slightly_ different 
from R):
   
   ```python
   In [1]: pd.Series([1, 2, np.nan]).mean()
   Out[1]: 1.5
   ```
   
   While PyArrow does not:
   
   ```python
   In [2]: pc.mean(pa.array([1, 2, np.nan]))
   Out[2]: <pyarrow.DoubleScalar: nan>
   ```
   
   However, because compatibility is important, PyArrow's `pa.array` supports a 
`from_pandas` flag which converts NaNs to nulls, matching pandas semantics:
   
   ```python
   In [3]: pc.mean(pa.array([1, 2, np.nan], from_pandas=True))
   Out[3]: <pyarrow.DoubleScalar: 1.5>
   ```
   
   There are a few ways this could be addressed and I think this type of issue 
has come up before in places like https://github.com/apache/arrow/issues/29060. 
The R package could add something similar to PyArrow's `from_pandas` flag or an 
option could be added to individual kernels (mean in this case). Thoughts 
@ianmcook @nealrichardson @jonkeane @thisisnic?

