[jira] [Created] (ARROW-14621) [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series

Weston Pace (Jira) Fri, 05 Nov 2021 19:21:24 -0700

Weston Pace created ARROW-14621:
-----------------------------------

             Summary: [Python] read_feather's "columns" argument claims to 
support any iterable but does not accept pandas series
                 Key: ARROW-14621
                 URL: https://issues.apache.org/jira/browse/ARROW-14621
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Weston Pace



>From https://github.com/apache/arrow/issues/11500:

{quote}
According to the read_feather documentation it accepts any sequence type ( 
https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html
 ) so not strictly lists, but also tuples etc... As far as they contain int or 
str.

While a pandas.Index adheres to the sequence protocol it provides custom 
implementation of many methods you would expect in a container. I personally 
think that supporting all possible implementations of classes that adheres to 
sequence protocol even when they are heavily custom is probably out of scope, 
so I would personally change the docstring to mention list|tuple and avoid 
confusion.

Implementing support for pandas.Index is fairly quick but that would probably 
end up casting the index to a tuple or list if we want to keep the code 
generic. And that would mean introducing an extra cost that the user doesn't 
know it's paying when invoking that method. I guess it's more reasonable to 
make that cost explicit and have user cast to list or tuple
{quote}
[~amol-]

{quote}
Technically the sequence protocol does not define equality. The problem seems 
to originate from the line sorted(set(columns)) == columns. We are relying on 
list == sequence => bool which is not valid when the sequence is a numpy array 
(list == np.ndarray => np.ndarray).

The correct method for comparing sequences seems to be converting both sides to 
list or using imap (although given we are already doing sorted(set(columns)) I 
think imap would be overkill).

I'm in favor of @amol- 's general point though. Changing the docs to mention 
list|tuple is probably a good solution.
{quote}
[~westonpace]

{quote}
On the other hand, supporting numpy arrays for those kind of arguments that 
expect a list is also relatively easy. We have quite some other methods that 
document a certain argument as list, while still accepting a numpy array 
(because in most cases we just rely on 1) length check and 2) iteration through 
the values).

I think the most important would be to have consistency. But if we have some 
APIs that are currently strict on requiring a list, and others loose on 
requiring any iterable, then there are two options: make everything strict, or 
fix those cases that are now strict to have them work with any iterable.
But to me it would also feel strange to start adding strict isinstance(.., 
list) checks in several places that currently are perfectly fine with accepting 
a numpy array or any iterable ...
{quote}
[~jorisvandenbossche]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-14621) [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series

Reply via email to