ArnavBalyan commented on code in PR #3269:
URL: https://github.com/apache/parquet-java/pull/3269#discussion_r2302743901
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -1804,14 +1804,27 @@ private static void copy(SeekableInputStream from, PositionOutputStream to, long
* @throws IOException if there is an error while writing
*/
public void end(Map<String, String> extraMetaData) throws IOException {
+ final long footerStart = out.getPos();
+
+ // Build the footer metadata in memory using the helper stream
+ InMemoryPositionOutputStream buffer = new InMemoryPositionOutputStream(footerStart);
Review Comment:
Agreed, I think there is some confusion about what this PR is fixing, so I
added more details in the PR description. To over-simplify: it moves the
interleaved writes into a single write at the end.
Today we interleave serialization and writes, so if an exception occurs
partway through, earlier writes may already have been flushed by the FS
client, leaving a truncated footer (the concrete output stream's internal
buffer can flush depending on how much has been serialized so far and on its
configuration).
By serializing fully in memory first, we eliminate all writes during
serialization. This does not make the write atomic; that may need a future
effort.
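To illustrate the pattern (this is only a rough sketch, not the PR's actual
code; the buffer class and method names here are made up, and the buffer is
approximated with a `ByteArrayOutputStream` while the actual serialization
step is elided):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: buffer the footer in memory, then issue one write to the real stream.
class FooterBufferSketch {

  // Reports positions relative to the footer start so offsets recorded during
  // serialization match the final file layout.
  static final class InMemoryBuffer extends OutputStream {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final long startPos;

    InMemoryBuffer(long startPos) {
      this.startPos = startPos;
    }

    long getPos() {
      return startPos + bytes.size();
    }

    @Override
    public void write(int b) {
      bytes.write(b);
    }

    byte[] toByteArray() {
      return bytes.toByteArray();
    }
  }

  static void endWithBufferedFooter(OutputStream out, long footerStart) throws IOException {
    InMemoryBuffer buffer = new InMemoryBuffer(footerStart);
    // ... serialize the footer metadata into `buffer` here; if this throws,
    // nothing has been handed to the FS client yet, so no truncated footer ...
    out.write(buffer.toByteArray()); // single write of the fully serialized footer
  }
}
```

With this shape, an exception during serialization leaves the underlying
stream exactly where it was before `end()` started, which is the truncation
scenario this PR targets.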
Maybe we are discussing different concerns/issues; I just wanted to know what
you think @wgtmac?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]