[
https://issues.apache.org/jira/browse/PARQUET-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059262#comment-16059262
]
Toby Shaw commented on PARQUET-1037:
------------------------------------
So, I have a data source whose length I do not know in advance, but which I
want to write to Parquet files. In the case of multiple columns, the sources
for each column can be iterated through separately, but I know they will all
have the same length.
I can keep appending row groups as I fill them up, column by column, but at
some point the source will end, and the final row group will probably not be
completely filled. I would like to close the ColumnWriter (via NextColumn),
modifying the row group's metadata to reflect the true number of rows written
in place of the expected quantity. To clarify, every column chunk will be
filled with the same number of items, but currently the metadata would be
wrong (still reflecting the originally expected number).
Currently the only accurate solution is to buffer my values and write a row
group whenever the buffer fills or the source ends, which means either very
high memory usage or very small row groups.
When I remove the check, I can generate Parquet files, but the metadata does
not reflect the true number of rows.
> Allow final RowGroup to be unfilled
> -----------------------------------
>
> Key: PARQUET-1037
> URL: https://issues.apache.org/jira/browse/PARQUET-1037
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Toby Shaw
>
> When a RowGroup is added with AppendRowGroup, it must be filled with exactly
> the number of rows specified, or an exception will be thrown.
> It should be possible to go back and modify the RowGroupSize metadata after a
> column has completed prematurely. This would be useful in scenarios where the
> total number of rows to be written is not known in advance.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)