[ 
https://issues.apache.org/jira/browse/PARQUET-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052116#comment-16052116
 ] 

Uwe L. Korn commented on PARQUET-1022:
--------------------------------------

What do you mean by "the original last rowgroup may not be complete"? Each 
RowGroup has a number of rows specified in its metadata, and exactly this number 
of rows must be in that RowGroup once the file is written to disk; otherwise we 
end up with invalid files.

Concerning the general approach, writing a Parquet file initially produces a 
file of the pattern {{MRRF}} (taking 2 RowGroups as an example), where M is the 
4 magic bytes {{PAR1}}, each R is a RowGroup, and F is the footer including the 
metadata.

To implement an append mode, you would read F from the existing file, write a 
new RowGroup at the position where F currently is, and then write out the new 
footer F' with its modified metadata.
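
To make the byte layout concrete, here is a minimal sketch of that 
read-F / overwrite-F / write-F' sequence on an in-memory buffer. It is not 
parquet-cpp code: the function names ({{build_file}}, {{footer_start}}, 
{{append_row_group}}) are hypothetical, the RowGroup and metadata payloads are 
placeholder bytes, and Python is used only for brevity. It assumes the standard 
Parquet footer layout, where the serialized metadata is followed by its 4-byte 
little-endian length and the trailing {{PAR1}} magic. A real implementation 
would have to deserialize the Thrift metadata and add the new RowGroup's 
offsets and row counts to it.

```python
import struct

MAGIC = b"PAR1"  # the 4 magic bytes M; also repeated at the very end of the file


def make_footer(meta):
    """Footer F: metadata bytes, their 4-byte little-endian length, then magic."""
    return meta + struct.pack("<I", len(meta)) + MAGIC


def build_file(row_groups, meta):
    """Assemble a toy file of the pattern M R ... R F."""
    return MAGIC + b"".join(row_groups) + make_footer(meta)


def footer_start(data):
    """Return the offset at which the footer metadata begins."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    return len(data) - 8 - meta_len


def append_row_group(data, new_row_group, new_meta):
    """Drop the old footer F, write the new RowGroup in its place, then F'."""
    body = data[:footer_start(data)]  # M plus the existing RowGroups, untouched
    return body + new_row_group + make_footer(new_meta)
```

For example, appending {{b"RG3"}} to {{build_file([b"RG1", b"RG2"], b"meta:2")}} 
leaves the first two RowGroups byte-for-byte unchanged and yields the same 
bytes as writing all three RowGroups in one go with the updated metadata.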

Still, a small note of caution: appends to Parquet datasets are normally done 
simply by creating additional Parquet files. Most tools that ingest Parquet are 
built to treat multiple files and a single Parquet file nearly identically. 
Code-wise, this should be much easier for you to handle. But if this append mode 
is really helpful for you, it would also help us to understand why you need it 
instead of just writing a new file!

> Append mode in parquet-cpp
> --------------------------
>
>                 Key: PARQUET-1022
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1022
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp
>    Affects Versions: cpp-1.1.0
>            Reporter: yugu
>
> As said, currently trying to work out an append feature for parquet files in 
> c++.
> (been searching through the repo etc., can't find an example though..)
> Current solution is to (assume no schema changes that is):
> Read in metadata
> Change metadata based on appended rows+ original rows
> Append a new row group (or multiple row group writer)
> Write the new rows.
> ---
> The problem is that, if approached this way, the original last row group may 
> not be completely filled. Was wondering if there is a fix or I'm using the API 
> wrong...
> Thanks ! : D



