ArnavBalyan commented on code in PR #3269:
URL: https://github.com/apache/parquet-java/pull/3269#discussion_r2297310825


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -1804,14 +1804,27 @@ private static void copy(SeekableInputStream from, PositionOutputStream to, long
    * @throws IOException if there is an error while writing
    */
   public void end(Map<String, String> extraMetaData) throws IOException {
+    final long footerStart = out.getPos();
+
+    // Build the footer metadata in memory using the helper stream
+    InMemoryPositionOutputStream buffer = new InMemoryPositionOutputStream(footerStart);

Review Comment:
   Yes, definitely: the buffered stream helps consolidate the footer write and push it to disk in a single attempt. Previously the footer writes were distributed across serializeColumnIndexes, serializeOffsetIndexes, and serializeBloomFilters. The buffer aggregates all of those and writes to disk once at the end, after the heavy computation is done. Corruption from a network failure can still happen, but the window shrinks significantly: it can only occur while the final buffered stream is being committed to disk. Thanks!
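   As a rough illustration of the idea (a minimal sketch, not the PR's actual `InMemoryPositionOutputStream`): the buffer tracks positions offset by the footer start so the offsets recorded in the metadata match the final file layout, and everything is committed to the real stream in one write once the footer is fully built. The class name `BufferedFooterStream` and the `commitTo` method are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical in-memory footer buffer; not the PR's actual class.
class BufferedFooterStream extends OutputStream {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final long startPos;

  BufferedFooterStream(long startPos) {
    // File offset where the footer begins, so getPos() reports
    // file positions rather than buffer positions.
    this.startPos = startPos;
  }

  long getPos() {
    return startPos + buffer.size();
  }

  @Override
  public void write(int b) {
    buffer.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) {
    buffer.write(b, off, len);
  }

  // Push the whole footer to the underlying stream in a single write.
  void commitTo(OutputStream out) throws IOException {
    buffer.writeTo(out);
    out.flush();
  }
}
```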




