[
https://issues.apache.org/jira/browse/PARQUET-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901509#comment-16901509
]
David Lee commented on PARQUET-1626:
------------------------------------
I'm appending RowGroups using pyarrow today.
Open a new parquet file
For each row group in File 1:
    read row group, write row group to new file
For each row group in File 2:
    read row group, write row group to new file
Close new parquet file
No need to mess with metadata, since all those stats are saved at the row group
level.
I usually generate parquet files of 30 to 40 MB each and merge them
afterwards to match the HDFS block size.
> [C++] Ability to concat parquet files
> --------------------------------------
>
> Key: PARQUET-1626
> URL: https://issues.apache.org/jira/browse/PARQUET-1626
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Affects Versions: cpp-1.3.1
> Reporter: nileema shingte
> Priority: Major
> Labels: features
>
> Ability to concat the parquet files is something we've wanted for some time
> too. When we generate parquet files partitioned by an expression, we often
> end up with tiny files and would like to add a post-processing step to concat
> these files together.
> Is there a plan to add this ability to the library any time soon?
> If not, it would be great if someone can provide a somewhat detailed
> pseudocode (expanding on what [~xhochy] mentioned in the comment in
> PARQUET-1022) as a guideline for conditions/scenarios that need to be handled
> with extra care, so we can contribute this as a PR.