[
https://issues.apache.org/jira/browse/GOBBLIN-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ratandeep Ratti updated GOBBLIN-1715:
-------------------------------------
Summary: Support vectorized row batch pooling (was: Support vectorized row
batch recycling)
> Support vectorized row batch pooling
> ------------------------------------
>
> Key: GOBBLIN-1715
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1715
> Project: Apache Gobblin
> Issue Type: Bug
> Components: gobblin-core
> Reporter: Ratandeep Ratti
> Assignee: Abhishek Tiwari
> Priority: Major
>
> The pre-allocation method allocates vastly more memory for ORC ColumnVectors
> of arrays and maps than is needed, and the amount is unpredictable because it
> depends on the current column vector's length, which itself changes as we
> allocate more memory to it. From a heap dump taken on a Kafka topic we saw
> that on the second resize call for an array ColumnVector, where the requested
> size was ~1k elements, it attempted to allocate around 444M elements. This
> over-allocated far past the heap size, and was the primary reason we see OOM
> failures during ingestion of deeply nested records.
> Update: Below is an example of how a very large amount of memory can be
> allocated by the resizing procedure. The formula for allocating memory is
> child_vector resize =
> child_vector_request_size +
> (child_vector_request_size / rowsAdded + 1) * current_vector_size
> If we now have deeply nested arrays of arrays, each of 525 elements in a row,
> the memory will be allocated as follows:
> 1st resize = 525 + (525/1 + 1) * 256 = 135181 ; current vector size by
> default is batch size = 256
> 2nd resize = 525 + (525/1 + 1) * 135181 = *71105731*
> ; current vector size = 135181
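The growth above can be reproduced with a small standalone sketch (not Gobblin
or ORC code; method and class names here are illustrative) that applies the
resize formula repeatedly, showing why each resize multiplies the vector size
by roughly (request/rowsAdded + 1):

```java
// Sketch of the resize growth described in the issue. Assumes the formula
//   resize = request + (request / rowsAdded + 1) * currentSize
// as quoted above; not the actual ORC implementation.
public class ResizeGrowthDemo {

    // One application of the quoted resize formula.
    public static long resize(long requestSize, long rowsAdded, long currentSize) {
        return requestSize + (requestSize / rowsAdded + 1) * currentSize;
    }

    public static void main(String[] args) {
        long currentSize = 256; // default batch size
        long requestSize = 525; // elements per nested array in the example
        long rowsAdded = 1;

        for (int i = 1; i <= 2; i++) {
            currentSize = resize(requestSize, rowsAdded, currentSize);
            System.out.println("resize " + i + " -> " + currentSize);
        }
        // resize 1 -> 135181, resize 2 -> 71105731: each resize scales the
        // vector by ~526x, so a few nesting levels exhaust the heap.
    }
}
```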
--
This message was sent by Atlassian Jira
(v8.20.10#820010)