tazimmerman opened a new issue #7624:
URL: https://github.com/apache/arrow/issues/7624
I'm not sure if this is a missing feature, or just undocumented, or perhaps
not even something I should expect to work.
Let's start with a multi-index dataframe.
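For reference, a dataframe like the one printed below can be built roughly like this (a minimal sketch reconstructed from the output; the exact dtypes and timezone handling are assumptions):
```
import pandas as pd

# Values copied from the printout below; dtype/timezone details are assumptions.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 2), ("b", 3), ("c", 4), ("c", 5), ("c", 6)],
    names=["letter", "number"],
)
df = pd.DataFrame(
    {
        "data": [0.0, 1.1, 1.2, 2.1, 2.2, 2.3],
        "id": ["a1", "b2", "b3", "c4", "c5", "c6"],
        "when": [pd.Timestamp("2020-05-05 08:30:01+00:00")] * 6,
    },
    index=index,
)
```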
```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> df = ...
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' +
783
```
This of course works as expected, so let's write the table to disk, and read
it with a `dataset`.
```
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
```
The filter also works as expected, and the dataframe is reconstructed
properly. Let's do it again, but this time with a column selection.
```
>>> data.to_table(filter=ds.field("letter") == "c",
...               columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
```
Hmm, not quite what I was expecting, but leaving the index columns out of the
selection was a dumb move on my part, so let's try again and include all of
the columns to be safe.
```
>>> tbl = data.to_table(filter=ds.field("letter") == "c",
...                     columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
```
It seems that whenever I specify `columns` explicitly (whether a subset or the
full set), the pandas schema metadata is lost along the way, so `to_pandas`
no longer reconstructs the dataframe to match the original: the MultiIndex
isn't restored.
If the functionality makes sense, and the effort isn't already under way, I'd
be happy to dig into it myself and hopefully contribute something to this
awesome project.
In case it helps, here are the relevant versions:
- **arrow-cpp**: 0.17.1
- **pyarrow**: 0.17.1
- **parquet-cpp**: 1.5.1
- **python**: 3.7.6
- **thrift-cpp**: 0.13.0