[
https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kouhei Sutou updated ARROW-17200:
---------------------------------
Summary: [Python][Parquet] support partitioning by Pandas DataFrame index
(was: [Python, Parquet] support partitioning by Pandas DataFrame index)
> [Python][Parquet] support partitioning by Pandas DataFrame index
> ----------------------------------------------------------------
>
> Key: ARROW-17200
> URL: https://issues.apache.org/jira/browse/ARROW-17200
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Parquet, Python
> Reporter: Gregory Werbin
> Priority: Minor
>
> Given a pandas {{DataFrame}} with a {{MultiIndex}} whose "outer" level
> varies slowly, one might want to partition by that index level when saving
> the data frame to Parquet format. This is currently not possible directly;
> you need to manually reset the index before writing, and restore it after
> reading. It would be very useful if you could pass the name of an index level
> to {{partition_cols}} instead of (or ideally in addition to) a data column name.
> I originally posted this on the Pandas issue tracker
> ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke
> looked at the code and figured out that the partitioning functionality was
> implemented entirely in PyArrow, and that the change would need to happen
> within PyArrow itself.