samarthjain commented on pull request #3572: URL: https://github.com/apache/iceberg/pull/3572#issuecomment-973214656
Closing vectors is important because leaving them open results in memory leaks. The leak is more apparent when `reuse` is disabled, since every new batch read allocates a new vector for the batch.

As for the original design: a vector is tied to a batch of records. When reuse is enabled, we try to reuse the same vector (except in certain special [cases](https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L127)). Effectively, we keep using the same vector until the `FileIterator` is [exhausted](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java#L163). A call to `model.close()` calls `ColumnarBatchReader.close()`, which in turn calls `close()` on every `VectorizedArrowReader`. If we don't close the vector here, it will cause a memory leak.

When reuse is disabled, we allocate a new vector for every batch. Before allocating a new vector, though, it is important to close the previous one. This is what [this](https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java#L47) code is doing.

It is also important to note that we create a `ColumnarBatchReader` and its associated `VectorizedArrowReader`s only once per file. The lifecycle of these readers is tied to the `FileIterator`. See the init section [here](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java#L74). When a `ReadConf` is constructed, we set the vectorized model [here](https://github.com/apache/iceberg/blob/523a31bd4db5d457b8eebc37be630aaec018fce2/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L123). This `vectorizedModel` is a `ColumnarBatchReader`.

I hope this helps.
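To make the lifecycle described above concrete, here is a minimal, self-contained sketch. It does not use the Iceberg or Arrow classes: `TrackedVector`, `SketchReader`, and `openAfterReading` are hypothetical stand-ins invented for illustration, and the open-vector counter is only a crude substitute for an allocator's leak accounting. It shows the two rules from the comment: close the previous vector before allocating a new one when reuse is off, and close the last vector when the per-file reader is closed.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class VectorLifecycleSketch {

  /** Hypothetical stand-in for a vector; counts how many instances are still open. */
  static final class TrackedVector implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();
    private boolean closed = false;

    TrackedVector() {
      OPEN.incrementAndGet();
    }

    @Override
    public void close() {
      // Idempotent close so a double close does not skew the counter.
      if (!closed) {
        closed = true;
        OPEN.decrementAndGet();
      }
    }
  }

  /** Hypothetical reader, created once per file, that owns at most one live vector. */
  static final class SketchReader implements AutoCloseable {
    private final boolean reuse;
    private TrackedVector current;

    SketchReader(boolean reuse) {
      this.reuse = reuse;
    }

    /** Returns the vector backing the next batch. */
    TrackedVector read() {
      if (reuse && current != null) {
        return current; // reuse enabled: keep handing out the same vector
      }
      if (current != null) {
        current.close(); // reuse disabled: release the previous batch's vector first
      }
      current = new TrackedVector();
      return current;
    }

    @Override
    public void close() {
      // Called once the file is exhausted; releases the last vector.
      if (current != null) {
        current.close();
        current = null;
      }
    }
  }

  /** Reads the given number of batches, closes the reader, and reports open vectors. */
  static int openAfterReading(boolean reuse, int batches) {
    try (SketchReader reader = new SketchReader(reuse)) {
      for (int i = 0; i < batches; i++) {
        reader.read();
      }
    }
    return TrackedVector.OPEN.get();
  }

  public static void main(String[] args) {
    System.out.println("open vectors, reuse=true:  " + openAfterReading(true, 5));
    System.out.println("open vectors, reuse=false: " + openAfterReading(false, 5));
  }
}
```

In this sketch both calls report zero open vectors. Removing either `close()` call in `SketchReader` leaves the counter above zero, which is the leak the comment warns about.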
