[jira] [Resolved] (ARROW-10122) [Python] Selecting one column of multi-index results in a duplicated value column.

Joris Van den Bossche (Jira) Thu, 19 Nov 2020 00:42:37 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche resolved ARROW-10122.
-------------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8469
[https://github.com/apache/arrow/pull/8469]

> [Python] Selecting one column of multi-index results in a duplicated value 
> column.
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-10122
>                 URL: https://issues.apache.org/jira/browse/ARROW-10122
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: arrow 1.0.1
> parquet 1.5.1
> pandas 1.1.0
> pyarrow 1.0.1
>            Reporter: Troy Zimmerman
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When I read one column of a multi-index, that column is duplicated as a value 
> column in the resulting Pandas data frame.
> {code:python}
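> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> >>> import pyarrow.parquet as pq
> >>> table = pa.table({"first": list(range(5)), "second": list(range(5)),
> ...                   "value": np.arange(5)})
> >>> df = table.to_pandas().set_index(["first", "second"])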
> >>> print(df)
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
> >>> data = ds.dataset("/tmp/test.parquet")
> {code}
> This works as expected, as does selecting all or no columns.
> {code:python}
> >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> {code}
> This does not work as expected, as the {{first}} column is both an index and 
> a value.
> {code:python}
> >>> print(data.to_table(columns=["first", "value"]).to_pandas())
>        first  value
> first
> 0          0      0
> 1          1      1
> 2          2      2
> 3          3      3
> 4          4      4
> {code}
> This is easy to work around by specifying the full multi-index in 
> {{to_table}}, but does this behavior make sense?
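
For affected versions, the workaround mentioned above (requesting every level of the multi-index) can be sketched as follows. This is only a minimal illustration, not part of the fix in pull request 8469: the helper name with_full_index is hypothetical, and the list of index levels is assumed to be known by the caller rather than read from the file's metadata.

```python
def with_full_index(columns, index_columns):
    """Return `columns` plus any index levels it omits, preserving order.

    Hypothetical helper: `index_columns` is supplied by the caller and is
    assumed to list every level of the multi-index stored in the file.
    """
    selected = list(columns)
    missing = [name for name in index_columns if name not in selected]
    return selected + missing


# Selection from the report, completed with the omitted "second" level.
requested = with_full_index(["first", "value"], ["first", "second"])
print(requested)  # ['first', 'value', 'second']

# Assumed usage against the dataset from the report (requires pyarrow and pandas):
#   df = data.to_table(columns=requested).to_pandas()
#   df = df.reset_index("second", drop=True)  # optionally drop the extra level
```

Dropping the unwanted level on the pandas side keeps {{first}} as an index only, which is the result the reporter expected from the partial selection.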



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
