[GitHub] [iceberg] samarthjain commented on pull request #3572: Arrow: Don't close vectors in VectorizedArrowReader


samarthjain commented on pull request #3572:
URL: https://github.com/apache/iceberg/pull/3572#issuecomment-973214656

Closing vectors is important because otherwise we leak memory. The leak is
more apparent when `reuse` is disabled, because every new batch read
allocates a new vector for the batch.

As for the original design: a vector is tied to a batch of records. When
reuse is enabled, we try to reuse the same vector (except in certain special
[cases](https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L127)).
In effect, we keep using the same vector until the `FileIterator` is
[exhausted](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java#L163).
A call to `model.close()` calls `ColumnarBatchReader.close()`, which in turn
calls `close()` on every `VectorizedArrowReader`. If we don't close the
vector here, it will cause a memory leak.
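
A minimal sketch of that close chain when reuse is enabled. Every class here is a hypothetical stand-in (`Allocator`, `SketchVector`, `ColumnReader`, `BatchReader`), not the actual Iceberg or Arrow types; the only point is that one vector per column lives until the model is closed, and leaks if it is not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only; these are not the Iceberg or Arrow classes.
class CloseChainSketch {
  /** Tracks how many vectors were allocated and not yet closed. */
  static final class Allocator {
    private int open = 0;
    SketchVector allocate() { open++; return new SketchVector(this); }
    int openVectors() { return open; }
  }

  static final class SketchVector implements AutoCloseable {
    private final Allocator allocator;
    private boolean closed = false;
    SketchVector(Allocator allocator) { this.allocator = allocator; }
    @Override public void close() {
      if (!closed) { closed = true; allocator.open--; }
    }
  }

  /** Plays the role of a per-column reader that reuses one vector. */
  static final class ColumnReader implements AutoCloseable {
    private final Allocator allocator;
    private SketchVector vector;
    ColumnReader(Allocator allocator) { this.allocator = allocator; }
    void read() {
      if (vector == null) { vector = allocator.allocate(); } // reuse afterwards
    }
    @Override public void close() {
      if (vector != null) { vector.close(); vector = null; }
    }
  }

  /** Plays the role of the batch reader ("model") that owns the column readers. */
  static final class BatchReader implements AutoCloseable {
    private final List<ColumnReader> columns = new ArrayList<>();
    BatchReader(Allocator allocator, int columnCount) {
      for (int i = 0; i < columnCount; i++) { columns.add(new ColumnReader(allocator)); }
    }
    void readBatch() { for (ColumnReader c : columns) { c.read(); } }
    @Override public void close() { for (ColumnReader c : columns) { c.close(); } }
  }

  /** Reads some batches, optionally closes the model, and reports vectors still open. */
  static int openAfterScan(int columnCount, int batches, boolean closeModel) {
    Allocator allocator = new Allocator();
    BatchReader model = new BatchReader(allocator, columnCount);
    for (int i = 0; i < batches; i++) { model.readBatch(); }
    if (closeModel) { model.close(); }
    return allocator.openVectors();
  }

  public static void main(String[] args) {
    System.out.println(openAfterScan(3, 5, true));  // model closed
    System.out.println(openAfterScan(3, 5, false)); // model never closed
  }
}
```

In this sketch, skipping the final `close()` leaves one vector per column open no matter how many batches were read.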

When reuse is disabled, we allocate a new vector for every batch. Before
allocating a new vector, though, it is important to close the previous
vector.
[This code](https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java#L47)
does that.
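
A rough illustration of why that matters when reuse is disabled, again with hypothetical stand-in classes (`Allocator`, `SketchVector`) rather than the real Arrow or Iceberg types. It counts the vectors still open after reading several batches, with and without closing the previous vector first.

```java
// Hypothetical stand-ins only; these are not the Iceberg or Arrow classes.
class NoReuseSketch {
  /** Tracks how many vectors were allocated and not yet closed. */
  static final class Allocator {
    private int open = 0;
    SketchVector allocate() { open++; return new SketchVector(this); }
    int openVectors() { return open; }
  }

  static final class SketchVector implements AutoCloseable {
    private final Allocator allocator;
    private boolean closed = false;
    SketchVector(Allocator allocator) { this.allocator = allocator; }
    @Override public void close() {
      if (!closed) { closed = true; allocator.open--; }
    }
  }

  /** Allocates a fresh vector per batch and reports how many remain open. */
  static int openAfterBatches(int batches, boolean closePrevious) {
    Allocator allocator = new Allocator();
    SketchVector current = null;
    for (int i = 0; i < batches; i++) {
      if (closePrevious && current != null) {
        current.close(); // release the previous batch's vector first
      }
      current = allocator.allocate();
    }
    return allocator.openVectors();
  }

  public static void main(String[] args) {
    System.out.println(openAfterBatches(4, true));  // only the last vector is open
    System.out.println(openAfterBatches(4, false)); // every batch's vector is open
  }
}
```

With `closePrevious` set, only the most recent vector stays open, and the reader's own `close()` would release it; without it, the count grows with every batch.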

Note also that we create a `ColumnarBatchReader` and its associated
`VectorizedArrowReader`s only once per file. The lifecycle of both readers
is tied to the `FileIterator`. See the init section
[here](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java#L74).
When a `ReadConf` is constructed, we set the `vectorizedModel`
[here](https://github.com/apache/iceberg/blob/523a31bd4db5d457b8eebc37be630aaec018fce2/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L123).
This `vectorizedModel` is a `ColumnarBatchReader`.

I hope this helps.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
