Jiayi-Wang-db commented on PR #3351: URL: https://github.com/apache/parquet-java/pull/3351#issuecomment-3473138098
> If we write e.g. multiple row groups and no I/O issue happens during the flushing of the first one, we would still have garbage on the target if the writing of the second row group fails, right? (We practically cannot prevent that because we don't want to keep the whole file in memory.)

Digging into this further, I think that statement is not correct. The function you mentioned, `InternalParquetRecordWriter::flushRowGroupToStore`, is called when the in-memory records are about to exceed the limit. It writes the data into the output stream, but it does not flush or close the stream, so the data is not yet visible to other readers. Only the final `close()` call in `ParquetFileWriter` closes the underlying `FSOutputFileStream`, which finalizes the file. Therefore, our fix addresses all of the potential incomplete Parquet files.
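As an aside, the visibility behavior described above can be illustrated in isolation. This is not the Parquet code path itself, just a minimal sketch using `java.io.BufferedOutputStream` as a stand-in for the buffered output stream: bytes handed to the stream stay in its buffer and do not appear in the target file until `flush()` (or `close()`) is called. The class name `BufferDemo` and the 1024-byte payload are illustrative choices, not anything from the PR.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferDemo {

    // Returns {file length before flush, file length after flush}.
    static long[] bufferDemo() throws IOException {
        Path target = Files.createTempFile("rowgroup", ".bin");
        long before, after;
        // 64 KiB buffer, larger than the payload, so nothing spills early.
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream(target.toFile()), 1 << 16)) {
            out.write(new byte[1024]);        // "row group" bytes stay in the buffer
            before = Files.size(target);      // nothing has reached the file yet
            out.flush();                      // now the bytes hit the file
            after = Files.size(target);
        }
        Files.deleteIfExists(target);
        return new long[] {before, after};
    }

    public static void main(String[] args) throws IOException {
        long[] r = bufferDemo();
        System.out.println("before flush: " + r[0] + ", after flush: " + r[1]);
    }
}
```

Note that for a real `FSOutputFileStream` on a distributed filesystem, visibility additionally depends on the filesystem's own flush/sync semantics, so this local-file demo only models the in-process buffering part of the argument.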
