tazimmerman opened a new issue #7624:
URL: https://github.com/apache/arrow/issues/7624
I'm not sure if this is a missing feature, or just undocumented, or perhaps
not even something I should expect to work.
Let's start with a multi-index dataframe.
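For reference, a dataframe like the one printed below can be built roughly like this (a minimal sketch reconstructed from the output; the exact dtypes and timezone handling are assumptions):
```
import pandas as pd

# Values copied from the printout below; dtype/timezone details are assumptions.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 2), ("b", 3), ("c", 4), ("c", 5), ("c", 6)],
    names=["letter", "number"],
)
df = pd.DataFrame(
    {
        "data": [0.0, 1.1, 1.2, 2.1, 2.2, 2.3],
        "id": ["a1", "b2", "b3", "c4", "c5", "c6"],
        "when": [pd.Timestamp("2020-05-05 08:30:01+00:00")] * 6,
    },
    index=index,
)
```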
```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> df = ...
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' +
783
```
This of course works as expected, so let's write the table to disk, and read
it with a `dataset`.
```
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
```
The filter also works as expected, and the dataframe is reconstructed
properly. Let's do it again, but this time with a column selection.
```
>>> data.to_table(filter=ds.field("letter") == "c",
...               columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
```
Hmm, not quite what I was expecting, but leaving the index columns out of the
selection was a dumb move on my part, so let's try again and include all of
the columns to be safe.
```
>>> tbl = data.to_table(filter=ds.field("letter") == "c",
...                     columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
```
It seems that whenever I specify `columns` explicitly (whether a subset or the
full set), the pandas schema metadata is lost along the way, so `to_pandas`
no longer reconstructs the dataframe to match the original: the MultiIndex
isn't restored.
If the functionality makes sense, and the effort isn't already under way, I'd
be happy to dig into it myself and hopefully contribute something to this
awesome project.
In case it helps, here are the relevant versions:
- **arrow-cpp**: 0.17.1
- **pyarrow**: 0.17.1
- **parquet-cpp**: 1.5.1
- **python**: 3.7.6
- **thrift-cpp**: 0.13.0