samarthjain commented on pull request #3572:
URL: https://github.com/apache/iceberg/pull/3572#issuecomment-973214656


   Closing vectors is important: a vector that is never closed leaks the 
memory it holds. The leak is more apparent when `reuse` is disabled, because 
every batch read allocates a new vector. 
   
   As for the original design: a vector is tied to a batch of records. When 
reuse is enabled, we try to reuse the same vector (except in certain special 
[cases](https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L127)).
 In effect, we keep using the same vector until the `FileIterator` is 
[exhausted](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java#L163).
 A call to `model.close()` calls `ColumnarBatchReader.close()`, which in turn 
calls `close()` on every `VectorizedArrowReader`. If we don't close 
the vector here, it will leak memory. 
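
   To make the reuse path concrete, here is a minimal sketch in plain Java. It is not the Iceberg or Arrow code: `ReusedVector`, `ColumnReader`, `BatchModel` and `drain` are hypothetical stand-ins for the vector, the `VectorizedArrowReader`, the `ColumnarBatchReader` and the iterator loop. It only illustrates the shape described above: the same vector instance serves every batch, and a single `close()` on the model fans out to the column readers once reading is done.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only: none of these are Iceberg or Arrow classes.
final class ReuseCloseChainSketch {

    /** Stand-in for a vector that is refilled for each batch. */
    static final class ReusedVector implements AutoCloseable {
        int batchesServed = 0;
        boolean closed = false;

        void refill() {
            if (closed) {
                throw new IllegalStateException("vector already closed");
            }
            batchesServed++;
        }

        @Override
        public void close() {
            closed = true; // a real vector would release its buffers here
        }
    }

    /** Stand-in for a per-column reader that owns exactly one vector. */
    static final class ColumnReader implements AutoCloseable {
        private final ReusedVector vector = new ReusedVector();

        ReusedVector read() {
            vector.refill(); // reuse enabled: same instance for every batch
            return vector;
        }

        @Override
        public void close() {
            vector.close();
        }
    }

    /** Stand-in for the batch-level model; close() fans out to the column readers. */
    static final class BatchModel implements AutoCloseable {
        private final List<ColumnReader> columns = new ArrayList<>();

        BatchModel(int columnCount) {
            for (int i = 0; i < columnCount; i++) {
                columns.add(new ColumnReader());
            }
        }

        List<ReusedVector> readBatch() {
            List<ReusedVector> batch = new ArrayList<>();
            for (ColumnReader column : columns) {
                batch.add(column.read());
            }
            return batch;
        }

        @Override
        public void close() {
            for (ColumnReader column : columns) {
                column.close();
            }
        }
    }

    /** Reads every batch, then closes the model once (try-with-resources plays the iterator's close). */
    static List<ReusedVector> drain(int columnCount, int batchCount) {
        List<ReusedVector> first = new ArrayList<>();
        try (BatchModel model = new BatchModel(columnCount)) {
            for (int b = 0; b < batchCount; b++) {
                List<ReusedVector> batch = model.readBatch();
                if (b == 0) {
                    first = batch;
                }
                for (int i = 0; i < batch.size(); i++) {
                    if (batch.get(i) != first.get(i)) {
                        throw new IllegalStateException("expected the same vector instance");
                    }
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        for (ReusedVector vector : drain(2, 3)) {
            System.out.println("batches=" + vector.batchesServed + " closed=" + vector.closed);
        }
    }
}
```

   If the model's `close()` were skipped in this sketch, every `ReusedVector` would stay open after the last batch, which is the leak in question.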
   
   When reuse is disabled, we allocate a new vector for every batch. Before 
allocating the new vector, though, it is important to close the previous 
one. That is what 
[this code](https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java#L47)
 does.
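
   The effect of that close-before-allocate step can be sketched with a toy byte counter. Everything here is hypothetical (`TrackingAllocator`, `TrackedVector`, `BatchReader`, the 1024-byte batch size); it is not how Arrow accounts for memory, only an illustration of why skipping the close leaves earlier batches outstanding.

```java
// Hypothetical stand-ins only: none of these are Iceberg or Arrow classes.
final class CloseBeforeAllocateSketch {

    /** Counts bytes that have been allocated but not yet released. */
    static final class TrackingAllocator {
        private long outstanding = 0;

        TrackedVector allocate(long bytes) {
            outstanding += bytes;
            return new TrackedVector(this, bytes);
        }

        void release(long bytes) {
            outstanding -= bytes;
        }

        long outstanding() {
            return outstanding;
        }
    }

    /** A vector that returns its bytes to the allocator exactly once, on close(). */
    static final class TrackedVector implements AutoCloseable {
        private final TrackingAllocator allocator;
        private final long bytes;
        private boolean closed = false;

        TrackedVector(TrackingAllocator allocator, long bytes) {
            this.allocator = allocator;
            this.bytes = bytes;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                allocator.release(bytes);
            }
        }
    }

    /** Reuse disabled: every read allocates a fresh vector. */
    static final class BatchReader implements AutoCloseable {
        private final TrackingAllocator allocator;
        private final boolean closePrevious;
        private TrackedVector current;

        BatchReader(TrackingAllocator allocator, boolean closePrevious) {
            this.allocator = allocator;
            this.closePrevious = closePrevious;
        }

        TrackedVector read(long bytes) {
            if (closePrevious && current != null) {
                current.close(); // release the previous batch before allocating the next
            }
            current = allocator.allocate(bytes);
            return current;
        }

        @Override
        public void close() {
            if (current != null) {
                current.close(); // only the most recent vector is still reachable here
                current = null;
            }
        }
    }

    /** Reads the given number of 1024-byte batches and reports bytes still outstanding. */
    static long outstandingAfter(int batches, boolean closePrevious) {
        TrackingAllocator allocator = new TrackingAllocator();
        try (BatchReader reader = new BatchReader(allocator, closePrevious)) {
            for (int i = 0; i < batches; i++) {
                reader.read(1024);
            }
        }
        return allocator.outstanding();
    }

    public static void main(String[] args) {
        System.out.println("closing previous vector: " + outstandingAfter(4, true));
        System.out.println("not closing previous vector: " + outstandingAfter(4, false));
    }
}
```

   With four batches, the first call leaves nothing outstanding, while the second leaves three batches' worth of bytes (3072) that the final `close()` can no longer reach.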
   
   It is also important to note that we create a `ColumnarBatchReader` and its 
associated `VectorizedArrowReader` instances only once per file. The lifecycle 
of these readers is tied to the `FileIterator`. See the init section 
[here](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java#L74).
 When a `ReadConf` is constructed, we set the `vectorizedModel` 
[here](https://github.com/apache/iceberg/blob/523a31bd4db5d457b8eebc37be630aaec018fce2/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L123).
 This `vectorizedModel` is a `ColumnarBatchReader`. 
   
   I hope this helps. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


