Ratandeep Ratti created GOBBLIN-1715:
----------------------------------------
Summary: Support vectorized row batch recycling
Key: GOBBLIN-1715
URL: https://issues.apache.org/jira/browse/GOBBLIN-1715
Project: Apache Gobblin
Issue Type: Bug
Components: gobblin-core
Reporter: Ratandeep Ratti
Assignee: Abhishek Tiwari
The pre-allocation method allocates vastly more memory for ORC ColumnVectors of
arrays and maps than needed, and the amount is unpredictable because it depends
on the current column vector's length, which itself grows as we allocate more
memory. A heap dump taken on a Kafka topic showed that on the second resize
call for an array ColumnVector, where the requested size was ~1k elements, it
attempted to allocate around 444M elements. This over-allocated way past the
heap size, and is the primary reason we see OOM failures during ingestion of
deeply nested records.
Update: Below is an example of how a very large amount of memory can be
allocated by the resizing procedure. The formula for allocating memory is
child_vector_resize =
child_vector_request_size +
(child_vector_request_size / rowsAdded + 1) * current_vector_size
If a row contains deeply nested arrays of arrays, each of 525 elements, the
memory will be allocated as follows.
1st resize = 525 + (525/1 + 1) * 256 = 135181 ; current vector size by default
is batch size = 256
2nd resize = 525 + (525/1 + 1) * 135181 = *71105731* ;
current vector size = 135181
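The growth above can be reproduced with a minimal sketch of the resize formula
as described in this issue (the class and method names below are illustrative,
not Gobblin's actual code):

```java
// Sketch of the resize formula from this issue:
// child_vector_resize = request + (request / rowsAdded + 1) * currentSize
// Shows how the estimate explodes for nested arrays, since each resize
// multiplies the current vector size rather than adding to it.
public class ResizeGrowthDemo {

    static long resize(long requestSize, long rowsAdded, long currentSize) {
        return requestSize + (requestSize / rowsAdded + 1) * currentSize;
    }

    public static void main(String[] args) {
        long current = 256;   // default batch size
        long request = 525;   // elements requested per nested array level
        long rowsAdded = 1;
        for (int level = 1; level <= 2; level++) {
            current = resize(request, rowsAdded, current);
            System.out.println("resize " + level + " -> " + current);
        }
        // prints: resize 1 -> 135181, resize 2 -> 71105731
    }
}
```

Each level of nesting multiplies the current size by roughly
(request / rowsAdded + 1), so a few levels of nested arrays are enough to push
the requested allocation past any realistic heap.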
--
This message was sent by Atlassian Jira
(v8.20.10#820010)