[
https://issues.apache.org/jira/browse/ARROW-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874884#comment-15874884
]
Uwe L. Korn commented on ARROW-571:
-----------------------------------
Writing RowGroup-wise is already supported by {{parquet_arrow}}, just not
exposed in Python. One thing we're missing in C++ is incrementally building up
the schema.
For example, we have a use case where we already know 20 columns that shall be
written into the Parquet file. We can serialise these columns and free the
memory associated with them. But several other columns (of the same length, of
course) will be generated later in a pipeline, and the first part of the
pipeline does not know how many there will be or of which types. Currently we
build up a Pandas DataFrame until we have reached the end of the pipeline.
Subsequent jobs also only read a subset of the columns (but different
combinations thereof). Directly writing out these columns as they are computed
would help us save a lot of RAM. Related issue for that:
https://issues.apache.org/jira/browse/PARQUET-749
> [Python] Add APIs to build Parquet files incrementally from Arrow tables
> ------------------------------------------------------------------------
>
> Key: ARROW-571
> URL: https://issues.apache.org/jira/browse/ARROW-571
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)