Hello!
Apologies if this has been brought up before. I'd like to get devs' thoughts
on this potential inconsistency of "what are the python objects for null
values" between pandas and pyarrow.
Demonstrated with the following example:
(1) pandas seems to use "np.NaN" to represent a missing value (with pandas
1.2.4):
In [32]: df
Out[32]:
           value
key
1    some_strign

In [33]: df2
Out[33]:
                value2
key
2    some_other_string

In [34]: df.join(df2)
Out[34]:
           value value2
key
1    some_strign    NaN
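The construction of df and df2 isn't shown above, so here is a minimal sketch of how frames like these could be built. The exact constructor calls are my assumption; only the column names, index name, and values come from the output above.

```python
import pandas as pd

# Assumed construction: two single-row frames indexed by "key",
# with keys that do not overlap (1 vs. 2).
df = pd.DataFrame({"key": [1], "value": ["some_strign"]}).set_index("key")
df2 = pd.DataFrame({"key": [2], "value2": ["some_other_string"]}).set_index("key")

# DataFrame.join defaults to a left join on the index, so key 1 has no
# match in df2 and its "value2" cell is filled with a missing value.
joined = df.join(df2)
print(joined)
```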
(2) pyarrow seems to use "None" to represent a missing value (with pyarrow 4.0.1):
>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> s = pd.Series(["some_string", np.NaN])
>>> s
0    some_string
1            NaN
dtype: object
>>> pa.Array.from_pandas(s).to_pandas()
0    some_string
1           None
dtype: object
I have looked through the pyarrow docs and didn't find an option to make
to_pandas use np.NaN for null values, so it's a bit hard to get round-trip
consistency.
I appreciate any thoughts on this as to how to achieve consistency here.
Thanks!
Li