gszadovszky commented on PR #3351:
URL: https://github.com/apache/parquet-java/pull/3351#issuecomment-3473168771

   > > If we write e.g. multiple row groups and no I/O issue happens during the 
flushing of the first one, we would still have garbage on the target if the 
writing of the second row group fails, right? (We practically cannot prevent 
that because we don't want to keep the whole file in memory.)
   > 
   > @gszadovszky Digging into this further, I think that statement is not 
correct. The function you mentioned, 
`InternalParquetRecordWriter::flushRowGroupToStore`, is called when the 
buffered in-memory records are about to exceed the row-group size limit. It 
writes the data into the output stream, but neither flush nor close is 
called, so the data is not yet visible to other readers. Only the final 
`close()` call in `ParquetFileWriter` closes the underlying 
`FSOutputFileStream`, which finalizes the file. Therefore, our fix covers 
all cases of potentially incomplete Parquet files.
   
   Thanks for the clarification, @Jiayi-Wang-db. That makes sense. I'll 
approve this but will wait until next week for other potential feedback. 
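   The visibility argument above can be illustrated with a minimal, 
self-contained sketch. Assumption: a plain `BufferedOutputStream` over a 
local file stands in for the writer's real output path (the actual writer 
goes through `FSOutputFileStream`); the class name `FlushVisibility` is 
invented for the example. Bytes written into the buffered stream are not 
observable in the target file until the stream is flushed or closed:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushVisibility {
    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("rowgroup", ".bin");
        // Large buffer so the "row group" bytes stay in memory until close.
        try (OutputStream out = new BufferedOutputStream(
                Files.newOutputStream(target), 1 << 20)) {
            out.write(new byte[1024]); // analogous to flushRowGroupToStore
            // Nothing has reached the file yet: readers see an empty file.
            System.out.println("before close: " + Files.size(target));
        } // try-with-resources flushes and closes here, finalizing the file
        System.out.println("after close: " + Files.size(target));
    }
}
```

Running this prints `before close: 0` and `after close: 1024`, mirroring 
the claim that only the final close makes the written data visible.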


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
