[ 
https://issues.apache.org/jira/browse/GOBBLIN-1715?focusedWorklogId=814805&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-814805
 ]

ASF GitHub Bot logged work on GOBBLIN-1715:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Oct/22 20:14
            Start Date: 07/Oct/22 20:14
    Worklog Time Spent: 10m 
      Work Description: rdsr commented on code in PR #3574:
URL: https://github.com/apache/gobblin/pull/3574#discussion_r990469644


##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########
@@ -153,14 +157,18 @@ public GobblinBaseOrcWriter(FsDataWriterBuilder<S, D> builder, State properties)
 
     // Create value-writer which is essentially a record-by-record-converter with buffering in batch.
     this.inputSchema = builder.getSchema();
-    TypeDescription typeDescription = getOrcSchema();
+    this.typeDescription = getOrcSchema();
     this.valueWriter = getOrcValueWriter(typeDescription, this.inputSchema, properties);
     this.batchSize = properties.getPropAsInt(ORC_WRITER_BATCH_SIZE, DEFAULT_ORC_WRITER_BATCH_SIZE);
-    this.rowBatch = typeDescription.createRowBatch(this.batchSize);
+    this.rowBatchPool = RowBatchPool.instance(properties);
+    this.enableRowBatchPool = properties.getPropAsBoolean(RowBatchPool.ENABLE_ROW_BATCH_POOL, true);
+    this.rowBatch = enableRowBatchPool ? rowBatchPool.getRowBatch(typeDescription, batchSize) : typeDescription.createRowBatch(batchSize);
     this.deepCleanBatch = properties.getPropAsBoolean(ORC_WRITER_DEEP_CLEAN_EVERY_BATCH, false);
 
     log.info("Created ORC writer, batch size: {}, {}: {}",
-            batchSize, OrcConf.ROWS_BETWEEN_CHECKS.name(), properties.getProp(OrcConf.ROWS_BETWEEN_CHECKS.name(),
+            batchSize, OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(),
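The pooled-versus-fresh choice in the hunk above can be sketched independently of ORC. Below is a hypothetical, simplified pool (all names here are ours, not Gobblin's; the real RowBatchPool lives in gobblin-orc) that keys batches by schema and batch size, reuses a returned batch when one is available, and creates a new one otherwise — the gist of `rowBatchPool.getRowBatch(typeDescription, batchSize)`:

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a row batch; the real code hands out
// org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch instances.
class FakeRowBatch {
    final String schema;
    final int maxSize;
    FakeRowBatch(String schema, int maxSize) { this.schema = schema; this.maxSize = maxSize; }
}

// Minimal sketch of the pooling idea: batches keyed by schema + batch size.
class SimpleRowBatchPool {
    private final Map<String, Queue<FakeRowBatch>> pool = new ConcurrentHashMap<>();

    private static String key(String schema, int batchSize) {
        return schema + "#" + batchSize;
    }

    // Reuse a pooled batch if one exists for this schema/size, else create one.
    FakeRowBatch getRowBatch(String schema, int batchSize) {
        Queue<FakeRowBatch> q = pool.computeIfAbsent(key(schema, batchSize), k -> new ArrayDeque<>());
        FakeRowBatch b;
        synchronized (q) { b = q.poll(); }
        return b != null ? b : new FakeRowBatch(schema, batchSize);
    }

    // Return a batch to the pool so a later writer with the same schema can reuse it.
    void recycle(FakeRowBatch batch) {
        Queue<FakeRowBatch> q = pool.computeIfAbsent(key(batch.schema, batch.maxSize), k -> new ArrayDeque<>());
        synchronized (q) { q.offer(batch); }
    }
}
```

Pooling helps here because a recycled batch keeps its already-grown child vectors, so the pathological re-allocation described in the issue happens at most once per schema rather than per writer.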

Review Comment:
   OrcConf.ROWS_BETWEEN_CHECKS.getAttribute() is the right key to use here, not OrcConf.ROWS_BETWEEN_CHECKS.name().
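For context, OrcConf constants follow the common enum-with-attribute pattern: name() yields only the Java enum constant ("ROWS_BETWEEN_CHECKS"), while getAttribute() yields the actual configuration key to look up in the properties. A minimal stand-in enum (ours, not ORC's; the attribute string is illustrative — consult org.apache.orc.OrcConf for the real key) makes the difference concrete:

```java
// Minimal stand-in for the OrcConf pattern. The attribute string below is
// illustrative; the real key is defined in org.apache.orc.OrcConf.
public enum MiniOrcConf {
    ROWS_BETWEEN_CHECKS("orc.rows.between.memory.checks");

    private final String attribute;

    MiniOrcConf(String attribute) { this.attribute = attribute; }

    // The config key to look up in properties -- this is what the writer should use.
    public String getAttribute() { return attribute; }
}
```

Looking up `name()` in the properties would silently miss any value the user set under the real "orc.…" key, which is why the review asks for `getAttribute()`.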





Issue Time Tracking
-------------------

    Worklog Id:     (was: 814805)
    Time Spent: 2.5h  (was: 2h 20m)

> Support vectorized row batch pooling
> ------------------------------------
>
>                 Key: GOBBLIN-1715
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1715
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-core
>            Reporter: Ratandeep Ratti
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The pre-allocation method allocates vastly more memory for ORC ColumnVectors 
> of arrays and maps than needed, and the amount is unpredictable because it 
> depends on the current column vector's length, which can change as we allocate 
> more memory to it. From a heap dump taken on a Kafka topic we saw that on the 
> second resize call for an array ColumnVector, where the requested size was ~1k 
> elements, it attempted to allocate around 444M elements, over-allocating well 
> past the heap size. This was the primary reason we saw OOM failures during 
> ingestion of deeply nested records.
> Update: Below is an example of how a very large amount of memory can be 
> allocated by the smart-resizing procedure. The formula for allocating memory is:
> {noformat}
> child_vector resize = 
>    child_vector_request_size  + 
>   (child_vector_request_size / rowsAdded + 1) * current_vector_size
> {noformat}
> If we now have deeply nested arrays of arrays, each of 525 elements, in a row, 
> the memory will be allocated as follows:
> {noformat}
> 1st resize = 525 + (525/1 + 1) * 256 = 135181       ; current vector size 
> defaults to batch size = 256
> 2nd resize = 525 + (525/1 + 1) * 135181 = 71105731  ; current vector size = 135181
> {noformat}
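The two resize steps above can be reproduced with plain integer arithmetic. This small sketch (method and variable names are ours, not ORC's) shows why the second resize explodes: the "current size" term has already grown to the result of the first resize, so the growth compounds multiplicatively at each level of nesting:

```java
// Sketch of the resize formula quoted above, using integer division:
//   newSize = requestSize + (requestSize / rowsAdded + 1) * currentSize
public class ResizeMath {
    static long resize(long requestSize, long rowsAdded, long currentSize) {
        return requestSize + (requestSize / rowsAdded + 1) * currentSize;
    }

    public static void main(String[] args) {
        long first = resize(525, 1, 256);     // current size defaults to batch size 256
        long second = resize(525, 1, first);  // current size has grown to the first result
        System.out.println(first + " -> " + second); // 135181 -> 71105731
    }
}
```

Each step multiplies the current size by roughly (requestSize/rowsAdded + 1), so two resizes of a 525-element request already demand ~71M elements — consistent with the ~444M-element allocation observed in the heap dump after further growth.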



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
