[jira] [Commented] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection

Joris Van den Bossche (Jira) Wed, 06 May 2020 01:55:36 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100599#comment-17100599
 ]


Joris Van den Bossche commented on ARROW-8641:
----------------------------------------------

[~wesm] I took a look at this. In RecordBatchReader, the 
{{included_indices}} are converted into an {{inclusion_mask}}, and such a mask 
of course cannot preserve the ordering of the included indices. 
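
To illustrate why the mask loses the ordering, here is a minimal pure-Python sketch. It is not the Arrow C++ code: the helper names {{to_inclusion_mask}} and {{select_with_mask}} are hypothetical, and plain lists stand in for the schema fields.

```python
def to_inclusion_mask(included_indices, num_fields):
    """Convert a list of field indices into a per-field boolean mask (hypothetical helper)."""
    mask = [False] * num_fields
    for i in included_indices:
        mask[i] = True
    return mask


def select_with_mask(fields, mask):
    """Keep the fields whose mask entry is True; the result is always in file order."""
    return [field for field, keep in zip(fields, mask) if keep]


fields = ["a", "b", "c"]

# Both requests produce the same mask, so the requested order cannot be recovered.
mask_ab = to_inclusion_mask([0, 1], len(fields))
mask_ba = to_inclusion_mask([1, 0], len(fields))

selected = select_with_mask(fields, mask_ba)
```

In this sketch a request for indices {{[1, 0]}} yields the same mask as {{[0, 1]}}, so the selected fields come back in file order regardless of what was asked for.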

Do you have a preference for the level at which to solve this? Do we want 
{{RecordBatchFileReader.ReadRecordBatch}} to actually honor the order of the 
{{included_indices}} (and so, e.g., reorder the fields of the batch before 
returning it, to match the {{out_schema}})? 
Or do we want to ignore the order at that level of the IPC reader (and then the 
feather code could still decide to reorder if it wants to respect the order of 
the columns specified by the user)?
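
The second option (reordering in the higher-level code) could look roughly like the following pure-Python sketch. The functions {{read_in_file_order}} and {{reorder_to_request}} are hypothetical stand-ins, not pyarrow or Arrow C++ APIs, and columns are modeled as {{(name, values)}} pairs.

```python
def read_in_file_order(file_columns, requested):
    """Stand-in for the low-level reader: returns the requested columns in file order."""
    wanted = set(requested)
    return [(name, values) for name, values in file_columns if name in wanted]


def reorder_to_request(columns, requested):
    """Stand-in for the higher-level step: reorder columns to match the user's request."""
    by_name = dict(columns)
    return [(name, by_name[name]) for name in requested]


file_columns = [("a", [1, 2, 3]), ("b", [4, 5, 6]), ("c", [7, 8, 9])]
requested = ["b", "a"]

in_file_order = read_in_file_order(file_columns, requested)
reordered = reorder_to_request(in_file_order, requested)
```

Here the low-level read ignores the requested order, and only the final step restores it. This assumes column names are unique within the file.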

> [Python] Regression in feather: no longer supports permutation in column 
> selection
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-8641
>                 URL: https://issues.apache.org/jira/browse/ARROW-8641
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 1.0.0
>
>
> A quite annoying regression (original report from 
> https://github.com/pandas-dev/pandas/issues/33878) is that, when specifying 
> {{columns}} to read, the read now fails if the order of the columns is not 
> exactly the same as in the file:
> {code:python}
> In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', 'c'])
> In [29]: from pyarrow import feather 
> In [30]: feather.write_feather(table, "test.feather")   
> # this works fine
> In [32]: feather.read_table("test.feather", columns=['a', 'b'])
> Out[32]: 
> pyarrow.Table
> a: int64
> b: int64
> In [33]: feather.read_table("test.feather", columns=['b', 'a'])
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-33-e01caeabb389> in <module>
> ----> 1 feather.read_table("test.feather", columns=['b', 'a'])
> ~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, memory_map)
>     237         return reader.read_indices(columns)
>     238     elif all(map(lambda t: t == str, column_types)):
> --> 239         return reader.read_names(columns)
>     240 
>     241     column_type_names = [t.__name__ for t in column_types]
> ~/scipy/repos/arrow/python/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Schema at index 0 was different: 
> b: int64
> a: int64
> vs
> a: int64
> b: int64
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
