[GitHub] [iceberg] NeQuissimus opened a new pull request #3293: Lazily initialize the writer inside ParquetWriter

GitBox Thu, 14 Oct 2021 06:14:23 -0700


NeQuissimus opened a new pull request #3293:
URL: https://github.com/apache/iceberg/pull/3293



   The inner writer is not needed during construction or in `add()`.
   Allocating it can, depending on the selected implementation, cause many
   other objects to be initialized.
   This is particularly true with the Hadoop GCS connector, which immediately
   allocates a 64 MiB `byte[]`.
   While code is tight-looping over `add()` to append data rows to the
   Parquet file, there is no reason for `ParquetWriter` to initialize these
   instances.
   
   All usages of `this.writer` have been changed to use `getWriter()` instead.
   `getWriter()` will allocate a single writer object when needed.
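
The lazy-initialization pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `ExpensiveWriter`, `LazyParquetWriterSketch`, the `Supplier`-based factory, and `isWriterInitialized()` are hypothetical names introduced only for the example, and the sketch assumes single-threaded use.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the costly inner writer (not an Iceberg or Parquet class). */
final class ExpensiveWriter {
  private final byte[] buffer; // simulates a buffer allocated eagerly on construction
  private long rowsWritten = 0;

  ExpensiveWriter(int bufferSize) {
    this.buffer = new byte[bufferSize];
  }

  void write(int rows) {
    rowsWritten += rows;
  }

  long rowsWritten() {
    return rowsWritten;
  }

  int bufferSize() {
    return buffer.length;
  }
}

/** Sketch of a writer that buffers rows and creates its inner writer only on first use. */
final class LazyParquetWriterSketch {
  private final Supplier<ExpensiveWriter> writerFactory;
  private ExpensiveWriter writer; // stays null until getWriter() is first called
  private int bufferedRows = 0;

  LazyParquetWriterSketch(Supplier<ExpensiveWriter> writerFactory) {
    this.writerFactory = writerFactory;
  }

  /** Buffers a row; deliberately does not touch the inner writer. */
  void add(Object row) {
    bufferedRows++;
  }

  /** Hands buffered rows to the inner writer, creating it if necessary. */
  void flush() {
    if (bufferedRows > 0) {
      getWriter().write(bufferedRows);
      bufferedRows = 0;
    }
  }

  /** Single access point for the inner writer; allocates it at most once. Not thread-safe. */
  private ExpensiveWriter getWriter() {
    if (writer == null) {
      writer = writerFactory.get();
    }
    return writer;
  }

  boolean isWriterInitialized() {
    return writer != null;
  }
}
```

With this shape, a caller that only loops over `add()` never pays for the inner writer's allocations; the cost is deferred to the first `flush()`.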
   
   Also simplified `startRowGroup()`, which previously needed access to the
   writer object but no longer does.
   Its only call on the writer was `nextRowGroupSize`, which returns the result
   from an `AlignmentStrategy`.
   However, Iceberg's code initializes the `ParquetFileWriter` with
   `maxPaddingSize = 0`. Regardless of the alignment strategy, this causes
   `nextRowGroupSize` to always return `rowGroupSize`.
   Hence, we can remove the indirection through the writer object entirely.
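
The reasoning about `maxPaddingSize = 0` can be illustrated with a simplified model. This is not Parquet's actual implementation: the interface and class names below are hypothetical, the position is passed as a plain `long` rather than read from an output stream, and the padding logic is an assumption about how a block-aligning strategy might behave.

```java
/** Hypothetical model of an alignment strategy; not Parquet's real API. */
interface AlignmentStrategySketch {
  long nextRowGroupSize(long currentPosition);
}

/** Strategy that never aligns: always returns the configured row group size. */
final class NoAlignmentSketch implements AlignmentStrategySketch {
  private final long rowGroupSize;

  NoAlignmentSketch(long rowGroupSize) {
    this.rowGroupSize = rowGroupSize;
  }

  @Override
  public long nextRowGroupSize(long currentPosition) {
    return rowGroupSize;
  }
}

/** Strategy that may shrink a row group to end on a block boundary (assumed behavior). */
final class PaddingAlignmentSketch implements AlignmentStrategySketch {
  private final long blockSize;
  private final long rowGroupSize;
  private final int maxPaddingSize;

  PaddingAlignmentSketch(long blockSize, long rowGroupSize, int maxPaddingSize) {
    this.blockSize = blockSize;
    this.rowGroupSize = rowGroupSize;
    this.maxPaddingSize = maxPaddingSize;
  }

  @Override
  public long nextRowGroupSize(long currentPosition) {
    if (maxPaddingSize == 0) {
      return rowGroupSize; // padding disabled, so alignment never changes the size
    }
    long remaining = blockSize - (currentPosition % blockSize);
    if (remaining <= maxPaddingSize) {
      return rowGroupSize; // the gap would be padded, next group starts a fresh block
    }
    return Math.min(remaining, rowGroupSize);
  }
}
```

In this model, both strategies return `rowGroupSize` at every position once `maxPaddingSize` is 0, which is why a caller could use `rowGroupSize` directly without consulting the writer.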


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


