amoeba commented on issue #44089: URL: https://github.com/apache/arrow/issues/44089#issuecomment-2347131229
Hi @psychelzh, a change here might require some discussion. My understanding is that different types of systems treat NaNs differently and what you're seeing is a result of arrow having different semantics than R. R and the pandas Python package treat NaNs as NA/None whereas database systems do not and I think arrow is emulating the database system semantics here.

For comparison, Python has a similar situation as R: pandas treats NaN as NA/null (and actually ignores them by default so this is _slightly_ different from R):

```python
In [1]: pd.Series([1, 2, np.nan]).mean()
Out[1]: 1.5
```

While PyArrow does not:

```python
In [2]: pc.mean(pa.array([1, 2, np.nan]))
Out[2]: <pyarrow.DoubleScalar: nan>
```

However, because compatibility is important, PyArrow supports a `from_pandas` flag which makes PyArrow match pandas semantics:

```python
In [3]: pc.mean(pa.array([1, 2, np.nan], from_pandas=True))
Out[3]: <pyarrow.DoubleScalar: 1.5>
```

There are a few ways this could be addressed and I think this type of issue has come up before in places like https://github.com/apache/arrow/issues/29060. The R package could add something similar to PyArrow's `from_pandas` flag or an option could be added to individual kernels (mean in this case).

Thoughts @ianmcook @nealrichardson @jonkeane @thisisnic?
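
To make the two semantics concrete, here is a minimal pure-Python sketch. It is not how Arrow or pandas implement `mean`; the function name `mean_sketch` and its `nan_is_null` parameter are hypothetical and exist only for illustration. Nulls (`None`) are always skipped, and NaN is either kept as an ordinary floating-point value (the database-style behaviour) or treated as missing (the pandas-style behaviour).

```python
import math
from typing import Optional, Sequence


def mean_sketch(values: Sequence[Optional[float]],
                nan_is_null: bool = False) -> Optional[float]:
    """Mean that skips nulls (None) and optionally treats NaN as null.

    Hypothetical helper for illustration only; not an Arrow or pandas API.
    """
    kept = [
        v for v in values
        if v is not None and not (nan_is_null and math.isnan(v))
    ]
    if not kept:
        return None  # nothing left to average, so the result is null
    return sum(kept) / len(kept)


data = [1.0, 2.0, float("nan")]

# NaN is a regular value here, so it propagates through the sum.
database_style = mean_sketch(data)

# NaN is treated as missing here, so only 1.0 and 2.0 are averaged.
pandas_style = mean_sketch(data, nan_is_null=True)

print(database_style, pandas_style)  # nan 1.5
```

Under this framing, a `from_pandas`-style flag amounts to deciding once, at conversion time, that NaN means missing, whereas a per-kernel option would make that decision at each call.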
