ArnavBalyan commented on code in PR #3269:
URL: https://github.com/apache/parquet-java/pull/3269#discussion_r2298310434
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -1804,14 +1804,27 @@ private static void copy(SeekableInputStream from,
PositionOutputStream to, long
* @throws IOException if there is an error while writing
*/
public void end(Map<String, String> extraMetaData) throws IOException {
+ final long footerStart = out.getPos();
+
+ // Build the footer metadata) in memory using the helper stream
+ InMemoryPositionOutputStream buffer = new
InMemoryPositionOutputStream(footerStart);
Review Comment:
Thanks for the reply! Maybe there is a gap in my understanding. The goal of
this change was to reduce the likelihood of partial data written to disk during
the footer write.
Currently column index, offset, bloom filter and footer is serialized and
bytes are sent to stream in stages (serialize + **write**, serialize +
**write** .... etc). any network failure in the serialization process can
corrupt data (since previous stage may have sent the bytes to disk if the
internal stream buffer got full, the exposure window is large).
With the new change the serialized outputs are built in memory against a
virtual position, then emitting once (serialize + serialize + serialize +
**write**). This minimizes the risk of partial data writen, since data is held
back until all serialization is complete. The underlying concrete stream may
have varying sizes of in memory buffer and may decide to flush to disk at
anytime, but avoided in the latter code, since the write is done once at the
end.
Completely agree this will not make it atomic, and hard to eliminate this
kind of problem in the way current code is structured, but this should reduce
the likelyhood of corrupt data during the footer write. This change is only
meant to optmize the footer.
If I am missing something about the internal stream that negates the gap
removal, happy to revise and iterate 😀, Appreciate the review and guidance 🙏
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]