ArnavBalyan commented on code in PR #3269:
URL: https://github.com/apache/parquet-java/pull/3269#discussion_r2302743901


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -1804,14 +1804,27 @@ private static void copy(SeekableInputStream from, PositionOutputStream to, long
    * @throws IOException if there is an error while writing
    */
   public void end(Map<String, String> extraMetaData) throws IOException {
+    final long footerStart = out.getPos();
+
+    // Build the footer metadata in memory using the helper stream
+    InMemoryPositionOutputStream buffer = new InMemoryPositionOutputStream(footerStart);
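
(For context: the body of `InMemoryPositionOutputStream` is not visible in this hunk, so the following is only a rough sketch of what such a helper might look like, under the assumption that it wraps an in-memory buffer and reports positions relative to the footer's start offset.)

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.parquet.io.PositionOutputStream;

// Hypothetical sketch: buffers all writes in memory while reporting
// positions as if writing to the real file starting at startPos.
class InMemoryPositionOutputStream extends PositionOutputStream {
  private final long startPos;
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

  InMemoryPositionOutputStream(long startPos) {
    this.startPos = startPos;
  }

  @Override
  public long getPos() {
    // Offsets recorded in the footer must match the final file layout,
    // so positions are reported relative to where the footer will land.
    return startPos + buffer.size();
  }

  @Override
  public void write(int b) throws IOException {
    buffer.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) {
    buffer.write(b, off, len);
  }

  // Drained once serialization succeeds, for a single write to the real stream.
  byte[] toByteArray() {
    return buffer.toByteArray();
  }
}
```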

Review Comment:
   Agreed, I think there is some confusion about what this PR is fixing; I 
added more details to the PR description. To over-simplify: it moves the 
interleaved writes into a single write at the end.
   
   Today we interleave serialization and writes, so if an exception occurs 
between those phases, prior writes may already have been flushed by the FS 
client, leaving a truncated file. By serializing the footer fully in memory 
first, we eliminate all writes during the serialization phase. This does not 
make the write atomic; that may need future effort. A sketch of the resulting 
pattern follows.
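   
   To make the pattern concrete, here is a minimal sketch of the 
buffer-then-write-once flow (`FooterSerializer` is a hypothetical stand-in for 
the real footer serialization logic, and `InMemoryPositionOutputStream` is the 
helper sketched above, not necessarily the exact code in this PR):
   
   ```java
   import java.io.IOException;
   import org.apache.parquet.io.PositionOutputStream;

   class FooterWriteSketch {
     // Hypothetical stand-in for ParquetFileWriter's footer serialization.
     interface FooterSerializer {
       void serialize(PositionOutputStream to) throws IOException;
     }

     static void writeFooterOnce(PositionOutputStream out, FooterSerializer footer)
         throws IOException {
       // Phase 1: serialize entirely in memory. If this throws, nothing has
       // reached the filesystem, so a truncated footer cannot be left behind.
       InMemoryPositionOutputStream buffer =
           new InMemoryPositionOutputStream(out.getPos());
       footer.serialize(buffer);

       // Phase 2: one write at the end. Still not atomic, but the
       // interleaved serialize-then-flush failure window is gone.
       out.write(buffer.toByteArray());
     }
   }
   ```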
   
   Maybe we are discussing different concerns/issues; I just wanted to know 
what you think @wgtmac?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

