Representation of "null" values for non-numeric types in Arrow/Pandas interop

Li Jin Tue, 08 Jun 2021 12:59:47 -0700

Hello!

Apologies if this has been brought up before. I'd like to get devs' thoughts
on a potential inconsistency between pandas and pyarrow in which Python
object each uses to represent a null value.


Demonstrated with the following example:

(1)  pandas seems to use "np.NaN" to represent a missing value (with pandas
1.2.4):

In [32]: df
Out[32]:
           value
key
1    some_strign

In [33]: df2
Out[33]:
                value2
key
2    some_other_string

In [34]: df.join(df2)
Out[34]:
           value value2
key
1    some_strign    NaN



(2) pyarrow seems to use "None" to represent a missing value (with pyarrow
4.0.1):

>>> s = pd.Series(["some_string", np.NaN])
>>> s
0    some_string
1            NaN
dtype: object

>>> pa.Array.from_pandas(s).to_pandas()
0    some_string
1           None
dtype: object


I have looked through the pyarrow docs and didn't find an option to make
to_pandas use np.NaN for null values, so it's a bit hard to get round-trip
consistency.
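
For illustration, one possible workaround is to post-process the converted
Series on the pandas side. The sketch below is only an assumption about how
that could look: none_to_nan is a hypothetical helper name, not a pyarrow or
pandas API, and the Series built by hand stands in for the output of
pa.Array.from_pandas(s).to_pandas().

```python
import numpy as np
import pandas as pd


def none_to_nan(series: pd.Series) -> pd.Series:
    """Return a copy of an object-dtype Series with None replaced by np.nan.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    if series.dtype != object:
        # Only object columns can hold None; leave other dtypes unchanged.
        return series.copy()
    values = [np.nan if v is None else v for v in series]
    return pd.Series(values, index=series.index, name=series.name, dtype=object)


# Stand-in for the result of pa.Array.from_pandas(s).to_pandas()
converted = pd.Series(["some_string", None], dtype=object)
restored = none_to_nan(converted)
```

This walks the values in Python, so it may be slow on large columns, and it
would need to be applied to each object column of a DataFrame separately.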


I appreciate any thoughts on this as to how to achieve consistency here.


Thanks!

Li
