NeQuissimus opened a new pull request #3293: URL: https://github.com/apache/iceberg/pull/3293
Noticed that the inner writer is not needed during construction or in `add()`. Allocating it can, depending on the selected implementation, cause many other objects to be initialized. This is particularly true with the Hadoop GCS connector, which immediately allocates a 64 MiB `byte[]`. As long as code is tight-looping over `add()` to append data rows to the Parquet file, there is no reason for `ParquetWriter` to initialize these instances.

All usages of `this.writer` have been changed to use `getWriter()` instead. `getWriter()` allocates a single writer object the first time it is needed.

Also improved `startRowGroup()`, which previously needed access to the writer object but no longer does. Its only use of the writer was a call to `nextRowGroupSize`, which returns the result from an `AlignmentStrategy`. However, Iceberg's code initializes `ParquetFileWriter` with `maxPaddingSize = 0`. No matter the particular alignment strategy, this causes `rowGroupSize` to always be returned as the value of `nextRowGroupSize`. Hence, we can remove the indirection through the writer object entirely.
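
For readers unfamiliar with the pattern, the lazy allocation described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `LazyWriterSketch`, `InnerWriter`, and `writerAllocations()` are hypothetical names, rows are plain strings, and the sketch assumes single-threaded use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a writer that defers creating its inner writer. */
final class LazyWriterSketch {

  /** Hypothetical inner writer; imagine its constructor allocating a large buffer. */
  static final class InnerWriter {
    private final List<String> written = new ArrayList<>();

    void write(List<String> rows) {
      written.addAll(rows);
    }

    int rowCount() {
      return written.size();
    }
  }

  private final List<String> buffer = new ArrayList<>();
  private InnerWriter writer;      // stays null until first needed
  private int writerAllocations;   // counts how often the inner writer was created

  /** Creates the inner writer on first use and reuses it afterwards (not thread-safe). */
  private InnerWriter getWriter() {
    if (writer == null) {
      writer = new InnerWriter();
      writerAllocations++;
    }
    return writer;
  }

  /** Buffers a row; deliberately does not touch the inner writer. */
  void add(String row) {
    buffer.add(row);
  }

  /** Flushes buffered rows through the inner writer and returns the total rows written. */
  int close() {
    InnerWriter w = getWriter();
    w.write(buffer);
    buffer.clear();
    return w.rowCount();
  }

  int writerAllocations() {
    return writerAllocations;
  }

  public static void main(String[] args) {
    LazyWriterSketch sketch = new LazyWriterSketch();
    for (int i = 0; i < 1000; i++) {
      sketch.add("row-" + i);
    }
    System.out.println("allocations after add(): " + sketch.writerAllocations());
    int rows = sketch.close();
    System.out.println("allocations after close(): " + sketch.writerAllocations());
    System.out.println("rows written: " + rows);
  }
}
```

In this sketch, a caller that only ever loops over `add()` never pays for the inner writer; the cost is incurred once, at the first call that actually needs it.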
