[ https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche resolved ARROW-10122.
-------------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8469
[https://github.com/apache/arrow/pull/8469]

> [Python] Selecting one column of multi-index results in a duplicated value column.
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-10122
>                 URL: https://issues.apache.org/jira/browse/ARROW-10122
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: arrow 1.0.1
> parquet 1.5.1
> pandas 1.1.0
> pyarrow 1.0.1
>            Reporter: Troy Zimmerman
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When I read one column of a multi-index, that column is duplicated as a value
> column in the resulting Pandas data frame.
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> >>> import pyarrow.parquet as pq
> >>> table = pa.table({"first": list(range(5)), "second": list(range(5)),
> ...                   "value": np.arange(5)})
> >>> df = table.to_pandas().set_index(["first", "second"])
> >>> print(df)
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
> >>> data = ds.dataset("/tmp/test.parquet")
> {code}
> This works as expected, as does selecting all or no columns.
> {code:python}
> >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> {code}
> This does not work as expected, as the {{first}} column is both an index and
> a value.
> {code:python}
> >>> print(data.to_table(columns=["first", "value"]).to_pandas())
>        first  value
> first
> 0          0      0
> 1          1      1
> 2          2      2
> 3          3      3
> 4          4      4{code}
> This is easy to work around by specifying the full multi-index in
> {{to_table}}, but does this behavior make sense?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)