Ratandeep Ratti created GOBBLIN-1715:
----------------------------------------
Summary: Support vectorized row batch recycling
Key: GOBBLIN-1715
URL: https://issues.apache.org/jira/browse/GOBBLIN-1715
Project: Apache Gobblin
Issue Type: Bug
Components: gobblin-core
Reporter: Ratandeep Ratti
Assignee: Abhishek Tiwari
The pre-allocation method allocates vastly more memory for ORC ColumnVectors of
arrays and maps than needed, and the amount is unpredictable because it depends
on the current column vector's length, which itself grows as we allocate more
memory. A heap dump taken on a Kafka topic showed that on the second resize
call for an array ColumnVector, where the requested size was ~1k elements, it
attempted to allocate around 444M elements. This over-allocated way past the
heap size, and is the primary reason we see OOM failures during ingestion of
deeply nested records.
Update: Below is an example of how a very large amount of memory can be
allocated by the resizing procedure. The formula for allocating memory is
child_vector_resize =
child_vector_request_size +
(child_vector_request_size / rowsAdded + 1) * current_vector_size
If a row contains deeply nested arrays of arrays, each of 525 elements, the
memory will be allocated as follows.
1st resize = 525 + (525/1 + 1) * 256 = 135181 ; current vector size by default
is batch size = 256
2nd resize = 525 + (525/1 + 1) * 135181 = *71105731* ;
current vector size = 135181
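The growth above can be reproduced with a minimal sketch of the resize formula
as described in this issue (the class and method names below are illustrative,
not Gobblin's actual code):

```java
// Sketch of the resize formula from this issue:
// child_vector_resize = request + (request / rowsAdded + 1) * currentSize
// Shows how the estimate explodes for nested arrays, since each resize
// multiplies the current vector size rather than adding to it.
public class ResizeGrowthDemo {

    static long resize(long requestSize, long rowsAdded, long currentSize) {
        return requestSize + (requestSize / rowsAdded + 1) * currentSize;
    }

    public static void main(String[] args) {
        long current = 256;   // default batch size
        long request = 525;   // elements requested per nested array level
        long rowsAdded = 1;
        for (int level = 1; level <= 2; level++) {
            current = resize(request, rowsAdded, current);
            System.out.println("resize " + level + " -> " + current);
        }
        // prints: resize 1 -> 135181, resize 2 -> 71105731
    }
}
```

Each level of nesting multiplies the current size by roughly
(request / rowsAdded + 1), so a few levels of nested arrays are enough to push
the requested allocation past any realistic heap.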
--
This message was sent by Atlassian Jira
(v8.20.10#820010)