[
https://issues.apache.org/jira/browse/ARROW-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103770#comment-17103770
]
German I. Ramirez-Espinoza commented on ARROW-8641:
---------------------------------------------------
[~jorisvandenbossche]: thanks for your comments. Although the bug is solved, I
finished the implementation of my idea just as a learning exercise and to
explore arrow's internals a bit more.
After writing tests for the corrected implementation of my idea I noticed two
drawbacks about it:
# {{feather.read_parquet}} on pyarrow still couldn't handle duplicated columns
# it breaks the test {{test_table_from_batches_and_schema}} which looks like
an important test of basic functionality. The reason is that this code:
{code:python}
incompatible_schema = pa.schema([pa.field('a', pa.int64())])
with pytest.raises(pa.ArrowInvalid):
    pa.Table.from_batches([batch], incompatible_schema)
{code}
no longer raises an exception.
Naturally, I no longer think my idea is a good one. After reading your
resolution, I also realized that I had not taken into account that Arrow
supports multiple Feather versions.
Anyway, I think it was a fun experience for me (albeit a bit embarrassing at
the beginning because of the shabby implementation of my idea).
Cheers
> [Python] Regression in feather: no longer supports permutation in column
> selection
> ----------------------------------------------------------------------------------
>
> Key: ARROW-8641
> URL: https://issues.apache.org/jira/browse/ARROW-8641
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0, 0.17.1
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> A quite annoying regression (original report from
> https://github.com/pandas-dev/pandas/issues/33878), is that when specifying
> {{columns}} to read, this now fails if the order of the columns is not
> exactly the same as in the file:
> {code:python}
> In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b',
> 'c'])
> In [29]: from pyarrow import feather
> In [30]: feather.write_feather(table, "test.feather")
> # this works fine
> In [32]: feather.read_table("test.feather", columns=['a', 'b'])
>
>
> Out[32]:
> pyarrow.Table
> a: int64
> b: int64
> In [33]: feather.read_table("test.feather", columns=['b', 'a'])
>
>
> ---------------------------------------------------------------------------
> ArrowInvalid Traceback (most recent call last)
> <ipython-input-33-e01caeabb389> in <module>
> ----> 1 feather.read_table("test.feather", columns=['b', 'a'])
> ~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns,
> memory_map)
> 237 return reader.read_indices(columns)
> 238 elif all(map(lambda t: t == str, column_types)):
> --> 239 return reader.read_names(columns)
> 240
> 241 column_type_names = [t.__name__ for t in column_types]
> ~/scipy/repos/arrow/python/pyarrow/feather.pxi in
> pyarrow.lib.FeatherReader.read_names()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Schema at index 0 was different:
> b: int64
> a: int64
> vs
> a: int64
> b: int64
> {code}