[ 
https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172137#comment-17172137
 ] 

wxmimperio edited comment on PARQUET-1559 at 8/6/20, 8:41 AM:
--------------------------------------------------------------

[~gszadovszky]

Thank you for your answer.

I want to know: if I set up relatively small row groups, flush the column 
store to the page store frequently, and then flush to the outputStream, will 
the data be written to disk? (I know the data is unreadable at this point, but 
the column store and page store memory can be released by GC.)
 pageStore.flushToFileWriter(parquetFileWriter);
 This method only flushes the page store to the outputStream, so the data 
should still be in memory at this point, until outputStream.close() is run.
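
For reference, a minimal sketch of the row-group flush path in parquet-mr 
(loosely based on InternalParquetRecordWriter; the local variable names here 
are illustrative, not the exact source):
{code:java}
// Sketch of how a row group is flushed in parquet-mr (simplified).
parquetFileWriter.startBlock(recordCount);      // open a new row group in the file
columnStore.flush();                            // drain column writers into the page store
pageStore.flushToFileWriter(parquetFileWriter); // serialize the pages to the output stream
parquetFileWriter.endBlock();                   // finish the row group
// This only hands bytes to the output stream; whether they reach the
// underlying filesystem depends on the stream implementation.
{code}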

When I reduced rowGroupSize to 8 MB, I do see the debug log 
{{LOG.debug("Flushing mem columnStore to file. allocated memory: {}", 
columnStore.getAllocatedSize())}}, but the file on HDFS still has no content 
and zero size. I guess the outputStream did not flush the data out.
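
For what it's worth, on HDFS a plain flush() only moves bytes into the 
client-side buffer; data usually becomes visible to readers only after 
hflush()/hsync() on the underlying FSDataOutputStream. A hedged sketch of the 
distinction (it assumes you created the stream yourself; ParquetWriter manages 
its stream internally and does not expose these calls, and the path is 
illustrative):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushDemo {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out = fs.create(new Path("/tmp/hflush-demo.bin"))) {
      out.write(new byte[]{1, 2, 3});
      out.hflush(); // push buffered bytes to the datanodes; readers can now see them
      out.hsync();  // additionally ask the datanodes to persist to disk
    } // close() writes any remainder and finalizes the file length
  }
}
{code}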


> Add way to manually commit already written data to disk
> -------------------------------------------------------
>
>                 Key: PARQUET-1559
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1559
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Victor
>            Priority: Major
>
> I'm not exactly sure this is compliant with the way parquet works, but I have 
> the following need:
>  * I'm using parquet-avro to write to a parquet file during a long-running 
> process
>  * I would like to be able, from time to time, to access the already written 
> data
> So I was expecting to be able to flush the file manually to ensure the data 
> is on disk, and then copy the file for preliminary analysis.
> If this contradicts the way parquet works (for example, because the metadata 
> lives in the footer of the file), what would the alternative be?
> Closing the file and opening a new one to continue writing?
> Could this be supported directly by parquet-mr? It would then write 
> multiple files in that case.
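
As a minimal sketch of the "close the file and open a new one" alternative 
mentioned above (the class, the part naming, the roll threshold, and the 8 MB 
row group size are all illustrative; the builder calls are from parquet-avro):
{code:java}
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Hypothetical "rolling" writer: close the current file periodically so the
// already written data (including the footer) becomes readable, then start
// a new file for subsequent records.
public class RollingParquetWriter implements java.io.Closeable {
  private final Schema schema;
  private final String baseDir;
  private ParquetWriter<GenericRecord> writer;
  private long recordsInCurrentFile;
  private int part;

  public RollingParquetWriter(Schema schema, String baseDir) {
    this.schema = schema;
    this.baseDir = baseDir;
  }

  public void write(GenericRecord record) throws IOException {
    if (writer == null) {
      writer = AvroParquetWriter.<GenericRecord>builder(
              new Path(baseDir + "/part-" + (part++) + ".parquet"))
          .withSchema(schema)
          .withRowGroupSize(8 * 1024 * 1024) // small row groups, as in the comment above
          .build();
      recordsInCurrentFile = 0;
    }
    writer.write(record);
    if (++recordsInCurrentFile >= 100_000) { // arbitrary roll threshold
      writer.close(); // the footer is written here; the file is now readable
      writer = null;
    }
  }

  @Override
  public void close() throws IOException {
    if (writer != null) {
      writer.close();
    }
  }
}
{code}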



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
