[ 
https://issues.apache.org/jira/browse/GOBBLIN-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ratandeep Ratti updated GOBBLIN-1715:
-------------------------------------
    Description: 
The pre-allocation method allocates vastly more memory for ORC ColumnVectors of 
arrays and maps than is needed, and the amount is unpredictable because it 
depends on the current column vector’s length, which itself grows as more 
memory is allocated to it. In a heap dump taken for a Kafka topic we saw that 
on the second resize call for an array ColumnVector, where the requested size 
was ~1k elements, it tried to allocate around 444M elements. This over-allocated 
far beyond the heap size and was the primary reason we saw OOM failures during 
ingestion of deeply nested records.

Update: Below is an example of how the smart-resizing procedure can allocate a 
very large amount of memory. The formula for the allocation size is:
{noformat}
child_vector resize = 
   child_vector_request_size  + 
  (child_vector_request_size / rowsAdded + 1) * current_vector_size
{noformat}

If a row now contains deeply nested arrays of arrays, each with 525 elements, 
memory will be allocated as follows:

{noformat}
1st resize = 525 + (525/1 + 1) * 256    = 135181   ; current vector size defaults to the batch size, 256
2nd resize = 525 + (525/1 + 1) * 135181 = 71105731 ; current vector size = 135181
{noformat}
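
For illustration, here is a minimal sketch (hypothetical class and method 
names, not the actual ORC/Gobblin resize code) that applies the formula above 
repeatedly and reproduces the 135181 and 71105731 figures:

{code:java}
// Hypothetical sketch of the resize-size formula described above; it is not
// the real ORC ColumnVector code, only an illustration of how the child
// vector's requested allocation grows for nested arrays of 525 elements.
public class ResizeGrowthSketch {

  // newSize = requestSize + (requestSize / rowsAdded + 1) * currentSize
  static long resizeTarget(long requestSize, long rowsAdded, long currentSize) {
    return requestSize + (requestSize / rowsAdded + 1) * currentSize;
  }

  public static void main(String[] args) {
    long currentSize = 256;  // default batch size, as in the example above
    long requestSize = 525;  // elements in each nested array
    long rowsAdded = 1;

    for (int i = 1; i <= 2; i++) {
      currentSize = resizeTarget(requestSize, rowsAdded, currentSize);
      System.out.println("resize " + i + " -> " + currentSize + " elements");
    }
    // Prints 135181 after the 1st resize and 71105731 after the 2nd.
  }
}
{code}

Each resize multiplies the current size by roughly (525 + 1), so another 
resize at this rate would already request tens of billions of elements, far 
past any realistic heap.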



> Support vectorized row batch pooling
> ------------------------------------
>
>                 Key: GOBBLIN-1715
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1715
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-core
>            Reporter: Ratandeep Ratti
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>


