Gregory Werbin created ARROW-17200:
--------------------------------------

             Summary: [Python, Parquet] support partitioning by Pandas DataFrame index
                 Key: ARROW-17200
                 URL: https://issues.apache.org/jira/browse/ARROW-17200
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Parquet, Python
            Reporter: Gregory Werbin


In a Pandas {{DataFrame}} with a multi-index whose "outer" index level varies 
slowly, one might want to partition by that index level when saving the data 
frame to Parquet format. This is currently not possible: you need to manually 
reset the index before writing and restore it after reading. It would be very 
useful if {{partition_cols}} accepted the name of an index level instead of 
(or, ideally, in addition to) a data column name.

I originally posted this on the Pandas issue tracker 
([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke looked 
at the code and figured out that the partitioning functionality was implemented 
entirely in PyArrow, and that the change would need to happen within PyArrow 
itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
