[ https://issues.apache.org/jira/browse/ARROW-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874884#comment-15874884 ]

Uwe L. Korn commented on ARROW-571:
-----------------------------------

Writing RowGroup-wise is already supported by {{parquet_arrow}}, it is just not 
exposed in Python. One thing we are still missing in C++ is incrementally 
building up the schema.

For example, we have a use case where we already know 20 columns that shall be 
written into the Parquet file. We can serialise these columns and free the 
memory associated with them. But several other columns (of the same length, of 
course) will be generated later in a pipeline, and the first part of the 
pipeline does not know how many there will be or of which type. Currently we 
build up a Pandas DataFrame until we have reached the end of the pipeline. 
Subsequent jobs also only read a subset of the columns (but different 
combinations thereof). Directly writing out these columns as they are computed 
would help us save a lot of RAM. Related issue for that: 
https://issues.apache.org/jira/browse/PARQUET-749

> [Python] Add APIs to build Parquet files incrementally from Arrow tables
> ------------------------------------------------------------------------
>
>                 Key: ARROW-571
>                 URL: https://issues.apache.org/jira/browse/ARROW-571
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)