[
https://issues.apache.org/jira/browse/PARQUET-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059262#comment-16059262
]
Toby Shaw commented on PARQUET-1037:
------------------------------------
So, I have a data source whose length I do not know in advance, but which I
want to write to Parquet files. In the case of multiple columns, the sources
for each column can be iterated through separately, but I know they will all
have the same length.
I can keep appending row groups as I fill them up, column by column, but at
some point the source will end, and the final row group will probably not be
completely filled. I would like to close the ColumnWriter (via NextColumn),
modifying the row group's metadata to reflect the true number of rows written
in place of the expected quantity. To clarify, every column chunk will be
filled with the same number of items, but currently the metadata would be
wrong (still reflecting the originally expected number).
Currently the only accurate solution is to buffer my values and write a row
group whenever the buffer fills or the source ends, which means either very
high memory usage or very small row groups.
When I remove the check, I can generate Parquet files, but the metadata does
not reflect the true number of rows.
> Allow final RowGroup to be unfilled
> -----------------------------------
>
> Key: PARQUET-1037
> URL: https://issues.apache.org/jira/browse/PARQUET-1037
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Toby Shaw
>
> When a RowGroup is added with AppendRowGroup, it must be filled with exactly
> the number of rows specified, or an exception will be thrown.
> It should be possible to go back and modify the RowGroupSize metadata after a
> column has completed prematurely. This would be useful in scenarios where the
> total number of rows to be written is not known in advance.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)