Weston Pace created ARROW-14621:
-----------------------------------
Summary: [Python] read_feather's "columns" argument claims to
support any iterable but does not accept pandas series
Key: ARROW-14621
URL: https://issues.apache.org/jira/browse/ARROW-14621
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Weston Pace
>From https://github.com/apache/arrow/issues/11500:
{quote}
According to the read_feather documentation it accepts any sequence type (
https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html
) so not strictly lists, but also tuples etc... As far as they contain int or
str.
While a pandas.Index adheres to the sequence protocol it provides custom
implementation of many methods you would expect in a container. I personally
think that supporting all possible implementations of classes that adheres to
sequence protocol even when they are heavily custom is probably out of scope,
so I would personally change the docstring to mention list|tuple and avoid
confusion.
Implementing support for pandas.Index is fairly quick but that would probably
end up casting the index to a tuple or list if we want to keep the code
generic. And that would mean introducing an extra cost that the user doesn't
know it's paying when invoking that method. I guess it's more reasonable to
make that cost explicit and have user cast to list or tuple
{quote}
[~amol-]
{quote}
Technically the sequence protocol does not define equality. The problem seems
to originate from the line sorted(set(columns)) == columns. We are relying on
list == sequence => bool which is not valid when the sequence is a numpy array
(list == np.ndarray => np.ndarray).
The correct method for comparing sequences seems to be converting both sides to
list or using imap (although given we are already doing sorted(set(columns)) I
think imap would be overkill).
I'm in favor of @amol- 's general point though. Changing the docs to mention
list|tuple is probably a good solution.
{quote}
[~westonpace]
{quote}
On the other hand, supporting numpy arrays for those kind of arguments that
expect a list is also relatively easy. We have quite some other methods that
document a certain argument as list, while still accepting a numpy array
(because in most cases we just rely on 1) length check and 2) iteration through
the values).
I think the most important would be to have consistency. But if we have some
APIs that are currently strict on requiring a list, and others loose on
requiring any iterable, then there are two options: make everything strict, or
fix those cases that are now strict to have them work with any iterable.
But to me it would also feel strange to start adding strict isinstance(..,
list) checks in several places that currently are perfectly fine with accepting
a numpy array or any iterable ...
{quote}
[~jorisvandenbossche]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)