[
https://issues.apache.org/jira/browse/PARQUET-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887447#comment-16887447
]
Deepak Majeti commented on PARQUET-1626:
----------------------------------------
The simplest approach is to read the two files and write their contents back
out as a single file. You can follow the existing reader-writer.cc example to
do this.
If you want to optimize by avoiding the decompression/recompression of the Data
Pages, then you have to carefully update the metadata (counts, stats, offsets,
lengths, etc.) at the File and ColumnChunk levels and append the individual
RowGroups. If you further want to merge two RowGroups into one, you have to
update the RowGroup metadata as well.
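
As a rough illustration of the bookkeeping described above, here is a minimal
sketch that uses simplified, hypothetical structs (FileMeta, RowGroupMeta,
ColumnChunkMeta) and a hypothetical helper, AppendRowGroups. None of these are
parquet-cpp classes. The sketch assumes the column chunk bytes of the second
file are copied verbatim and land `shift` bytes later in the output than they
sat in the source file. Real footers carry more fields (dictionary page
offsets, statistics, and so on) that would need the same care.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for Parquet footer metadata.
// They only model the counts and offsets that change when appending.
struct ColumnChunkMeta {
  int64_t data_page_offset;       // absolute byte position in the file
  int64_t total_compressed_size;  // bytes occupied by the chunk
  int64_t num_values;
};

struct RowGroupMeta {
  int64_t num_rows;
  int64_t total_byte_size;
  std::vector<ColumnChunkMeta> columns;
};

struct FileMeta {
  int64_t num_rows = 0;
  std::vector<RowGroupMeta> row_groups;
};

// Append every row group of `src` to `dst` without touching page contents.
// `shift` is (position of the copied bytes in the output file) minus
// (their position in the source file); it is an assumption of this sketch
// that one shift applies to all chunks, i.e. the bytes are copied as a block.
void AppendRowGroups(FileMeta* dst, const FileMeta& src, int64_t shift) {
  assert(dst != nullptr);
  for (const RowGroupMeta& rg : src.row_groups) {
    RowGroupMeta moved = rg;  // per-RowGroup counts and sizes stay the same
    for (ColumnChunkMeta& col : moved.columns) {
      col.data_page_offset += shift;  // offsets are absolute, so relocate
    }
    dst->num_rows += moved.num_rows;  // File-level row count grows
    dst->row_groups.push_back(moved);
  }
}
```

For example, if the first file's data ends at byte 404 and the second file's
data starts at byte 4, the second file's row groups would be appended with a
shift of 400.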
On top of this, you have to ensure the two files being merged are compatible.
In your case, this won't be a problem since you have the same writer generating
the Parquet files with the same schema. But in general, if the two files are
generated by different writers, you cannot easily merge them.
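
A compatibility check along these lines could be sketched as follows. The
LeafColumn struct and SchemasCompatible function are hypothetical and not part
of parquet-cpp. The sketch assumes that "compatible" means the flattened leaf
columns match in order, path, physical type and repetition. A real check would
probably also need to compare logical types, nesting structure, and
writer-specific details.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified description of one leaf column of a schema.
enum class PhysicalType { Boolean, Int32, Int64, Int96, Float, Double,
                          ByteArray, FixedLenByteArray };
enum class Repetition { Required, Optional, Repeated };

struct LeafColumn {
  std::string path;  // dotted column path, e.g. "a.b.c"
  PhysicalType type;
  Repetition repetition;
};

// Returns true only if both schemas list the same leaf columns in the
// same order with identical physical type and repetition.
bool SchemasCompatible(const std::vector<LeafColumn>& a,
                       const std::vector<LeafColumn>& b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i].path != b[i].path || a[i].type != b[i].type ||
        a[i].repetition != b[i].repetition) {
      return false;
    }
  }
  return true;
}
```

Running such a check before any metadata is rewritten lets the merge fail early
instead of producing a file whose footer does not match its data.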
> [C++] Ability to concat parquet files
> --------------------------------------
>
> Key: PARQUET-1626
> URL: https://issues.apache.org/jira/browse/PARQUET-1626
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp
> Affects Versions: cpp-1.3.1
> Reporter: nileema shingte
> Priority: Major
> Labels: features
>
> Ability to concat the parquet files is something we've wanted for some time
> too. When we generate parquet files partitioned by an expression, we often
> end up with tiny files and would like to add a post-processing step to concat
> these files together.
> Is there a plan to add this ability to the library any time soon?
> If not, it would be great if someone can provide a somewhat detailed
> pseudocode (expanding on what [~xhochy] mentioned in the comment in
> PARQUET-1022) as a guideline for conditions/scenarios that need to be handled
> with extra care, so we can contribute this as a PR.