Troy Zimmerman created ARROW-9302:
-------------------------------------

             Summary: Specifying columns in a dataset drops the index (pandas) metadata
                 Key: ARROW-9302
                 URL: https://issues.apache.org/jira/browse/ARROW-9302
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Troy Zimmerman


I'm not sure whether this is a missing feature, an undocumented limitation, or simply something I shouldn't expect to work.

Let's start with a multi-index dataframe.

{code}
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>>
>>> df = pd.DataFrame(
...     {"data": [0.0, 1.1, 1.2, 2.1, 2.2, 2.3],
...      "id": ["a1", "b2", "b3", "c4", "c5", "c6"],
...      "when": pd.Timestamp("2020-05-05 08:30:01+00:00")},
...     index=pd.MultiIndex.from_tuples(
...         [("a", 1), ("b", 2), ("b", 3), ("c", 4), ("c", 5), ("c", 6)],
...         names=["letter", "number"]))
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00

>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783
{code}

This works as expected, so let's write the table to disk and read it back with a 
{{dataset}}.

{code}
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
{code}

The filter also works as expected, and the dataframe is reconstructed properly. 
Let's do it again, but this time with a column selection.

{code}
>>> data.to_table(filter=ds.field("letter") == "c",
...               columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
{code}
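For comparison: the plain parquet reader has a {{use_pandas_metadata}} flag that adds the index columns to a column selection automatically. A minimal sketch against the same file (the flag name is from {{pyarrow.parquet.read_table}}; the dataset API appears to have no equivalent):

{code}
>>> # use_pandas_metadata=True pulls the index columns ("letter", "number")
>>> # into the selection even though only "data" and "id" were requested.
>>> pq.read_table("/tmp/df.parquet", columns=["data", "id"],
...               use_pandas_metadata=True).to_pandas().index.names
FrozenList(['letter', 'number'])
{code}

So the metadata round-trips through the single-file reader, which suggests the drop happens on the dataset path specifically.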

Hmm, not quite what I expected, but leaving the index columns out of the 
selection was my mistake, so let's try again, this time requesting every column 
to be safe.

{code}
>>> tbl = data.to_table(filter=ds.field("letter") == "c",
...                     columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
{code}

It seems that whenever I specify columns, even all of them, the pandas schema 
metadata is dropped along the way, so {{to_pandas}} can't reconstruct a 
dataframe matching the original: the multi-index is gone (note that 
{{tbl.schema}} no longer shows the {{pandas}} metadata entry).

Here are the relevant versions:

- arrow-cpp: 0.17.1
- pyarrow: 0.17.1
- parquet-cpp: 1.5.1
- python: 3.7.6
- thrift-cpp: 0.13.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
