tazimmerman opened a new issue #7624: URL: https://github.com/apache/arrow/issues/7624
I'm not sure if this is a missing feature, just undocumented, or perhaps not even something I should expect to work. Let's start with a multi-index dataframe.

```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> df = ...
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783
```

This of course works as expected, so let's write the table to disk and read it back with a `dataset`.

```
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
```

The filter also works as expected, and the dataframe is reconstructed properly. Let's do it again, but this time with a column selection.

```
>>> data.to_table(filter=ds.field("letter") == "c", columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
```

Hmm, not quite what I was expecting, but excluding the index columns from the selection seems like a dumb move on my part, so let's try again and this time include all columns to be safe.

```
>>> tbl = data.to_table(filter=ds.field("letter") == "c", columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
```

It seems that when I specify any (or all) columns, the pandas schema metadata is lost along the way, so `to_pandas` can't reconstruct the dataframe to match the original. If this functionality makes sense, and the work isn't already under way, I'd be happy to dig into it myself and hopefully contribute something to this awesome project.

In case it helps, here are the relevant versions:

- **arrow-cpp**: 0.17.1
- **pyarrow**: 0.17.1
- **parquet-cpp**: 1.5.1
- **python**: 3.7.6
- **thrift-cpp**: 0.13.0
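
For the moment I'm working around it roughly like this (a sketch, not a proposed fix; it assumes the same `/tmp/df.parquet` file from above and that every original column is selected, so the file's pandas metadata still applies):

```
import pyarrow.dataset as ds
import pyarrow.parquet as pq

data = ds.dataset("/tmp/df.parquet")
tbl = data.to_table(
    filter=ds.field("letter") == "c",
    columns=["letter", "number", "data", "id", "when"],
)

# Workaround 1: rebuild the MultiIndex by hand after conversion.
df = tbl.to_pandas().set_index(["letter", "number"])

# Workaround 2: copy the b'pandas' metadata from the file's own schema back
# onto the filtered table, so to_pandas() can reconstruct the index itself.
# This only works because all of the original columns were selected above;
# with a partial selection the copied metadata would no longer match.
file_schema = pq.read_schema("/tmp/df.parquet")
tbl = tbl.replace_schema_metadata(file_schema.metadata)
df = tbl.to_pandas()
```

Both give back a frame with the original index here, but ideally `to_table(columns=...)` would carry the metadata for the selected columns along by itself.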