Troy Zimmerman created ARROW-9302:
-------------------------------------

             Summary: Specifying columns in a dataset drops the index (pandas) metadata.
                 Key: ARROW-9302
                 URL: https://issues.apache.org/jira/browse/ARROW-9302
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Troy Zimmerman
I'm not sure whether this is a missing feature, undocumented behavior, or something I shouldn't expect to work at all. Let's start with a multi-index dataframe.

{code}
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>>
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783
{code}

This works as expected, so let's write the table to disk and read it back with a {{dataset}}.

{code}
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
{code}

The filter also works as expected, and the dataframe is reconstructed properly. Let's do it again, but this time with a column selection.

{code}
>>> data.to_table(filter=ds.field("letter") == "c", columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
{code}

Hmm, not quite what I was expecting. Excluding the index columns from the selection was probably my mistake, so let's try again, this time including all columns to be safe.
{code}
>>> tbl = data.to_table(filter=ds.field("letter") == "c", columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
{code}

It seems that when I specify any (or all) columns, the schema's pandas metadata is lost along the way, so {{to_pandas}} can't reconstruct the dataframe to match the original.

Here are the relevant versions:
- arrow-cpp: 0.17.1
- pyarrow: 0.17.1
- parquet-cpp: 1.5.1
- python: 3.7.6
- thrift-cpp: 0.13.0
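In the meantime, a workaround (assuming the index columns were included in the column selection) is to rebuild the MultiIndex by hand after {{to_pandas}}. A minimal sketch, with toy data standing in for the flat frame the dataset returns:

```python
import pandas as pd

# Toy stand-in for the flat frame returned by
# data.to_table(..., columns=[...]).to_pandas() after the
# pandas metadata has been dropped.
flat = pd.DataFrame({
    "letter": ["c", "c", "c"],
    "number": [4, 5, 6],
    "data": [2.1, 2.2, 2.3],
    "id": ["c4", "c5", "c6"],
})

# Reapply the original index columns manually, since to_pandas()
# no longer sees the "index_columns" entry in the schema metadata.
restored = flat.set_index(["letter", "number"])
```

This only recovers the index structure, of course; it doesn't restore any other pandas metadata (e.g. the original timestamp resolution).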