[
https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172169#comment-17172169
]
Gabor Szadovszky commented on PARQUET-1559:
-------------------------------------------
The file writing logic is mainly implemented in ParquetFileWriter. It does not
invoke a flush on the output stream, so it is up to the underlying
OutputStream implementation (in the case of Hadoop, an FSDataOutputStream) to
decide when the data is actually written to disk. However, this is independent
of Parquet's memory footprint: after a row group has been written to the
output stream, its data should be available for GC. Please note that the
related statistics (column/offset indexes, other footer values, and possibly
bloom filters from the next release) remain in memory because they are only
written when the file gets closed.
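To illustrate the first point, here is a minimal sketch of the Hadoop side of
things. Whether and when buffered bytes reach the disk is decided by
FSDataOutputStream, which exposes the Syncable calls hflush()/hsync();
parquet-mr itself does not call these. The path and payload below are made up
for illustration only.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HadoopFlushSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path; the stream decides when bytes actually hit the disk.
    try (FSDataOutputStream out = fs.create(new Path("/tmp/example.bin"))) {
      out.write(new byte[] {1, 2, 3});
      out.hflush(); // make the buffered bytes visible to new readers
      out.hsync();  // additionally ask for the bytes to be persisted to disk
    }
  }
}
{code}

Even if such a flush happened underneath parquet-mr, the flushed bytes would
not form a readable Parquet file, because the footer (with the column/offset
indexes and other metadata) is only written on close.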
> Add way to manually commit already written data to disk
> -------------------------------------------------------
>
> Key: PARQUET-1559
> URL: https://issues.apache.org/jira/browse/PARQUET-1559
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: Victor
> Priority: Major
>
> I'm not exactly sure this is compatible with the way Parquet works, but I
> have the following need:
> * I'm using parquet-avro to write to a Parquet file during a long-running
> process
> * I would like to be able to access the already written data from time to
> time
> So I was expecting to be able to manually flush the file to ensure the data
> is on disk, and then copy the file for preliminary analysis.
> If this contradicts the way Parquet works (for example, because the metadata
> is stored in the footer of the file), what would the alternative be?
> Closing the file and opening a new one to continue writing?
> Could this maybe be supported directly by parquet-mr? It would then write
> multiple files in that case.
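The closing-and-reopening approach raised in the issue (writing multiple
files, closing each one so its footer is written and the data becomes
readable) could look roughly like the sketch below. The class name, output
paths and the roll() method are made up for illustration; only the
AvroParquetWriter/ParquetWriter calls are the actual parquet-avro API.

{code:java}
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class RollingParquetWriter {
  private final Schema schema;
  private ParquetWriter<GenericRecord> writer;
  private int fileIndex = 0;

  public RollingParquetWriter(Schema schema) {
    this.schema = schema;
  }

  public void write(GenericRecord record) throws IOException {
    if (writer == null) {
      // Start a new file; its footer will only exist once it is closed.
      writer = AvroParquetWriter.<GenericRecord>builder(
              new Path("output/part-" + fileIndex++ + ".parquet"))
          .withSchema(schema)
          .build();
    }
    writer.write(record);
  }

  // Close the current file so its footer (and therefore a readable Parquet
  // file) is on disk, and let the next write() continue in a fresh file.
  public void roll() throws IOException {
    if (writer != null) {
      writer.close();
      writer = null;
    }
  }
}
{code}

Calling roll() from time to time yields a set of complete, readable Parquet
files while the long-running process keeps writing.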