[ https://issues.apache.org/jira/browse/GOBBLIN-1715?focusedWorklogId=814803&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-814803 ]
ASF GitHub Bot logged work on GOBBLIN-1715:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Oct/22 20:09
            Start Date: 07/Oct/22 20:09
    Worklog Time Spent: 10m
      Work Description: homatthew commented on code in PR #3574:
URL: https://github.com/apache/gobblin/pull/3574#discussion_r990466519


##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java:
##########

@@ -153,14 +157,18 @@ public GobblinBaseOrcWriter(FsDataWriterBuilder<S, D> builder, State properties)
     // Create value-writer which is essentially a record-by-record-converter with buffering in batch.
     this.inputSchema = builder.getSchema();
-    TypeDescription typeDescription = getOrcSchema();
+    this.typeDescription = getOrcSchema();
     this.valueWriter = getOrcValueWriter(typeDescription, this.inputSchema, properties);
     this.batchSize = properties.getPropAsInt(ORC_WRITER_BATCH_SIZE, DEFAULT_ORC_WRITER_BATCH_SIZE);
-    this.rowBatch = typeDescription.createRowBatch(this.batchSize);
+    this.rowBatchPool = RowBatchPool.instance(properties);

Review Comment:
   Considering the frequency of recycles, what's your opinion on using https://www.baeldung.com/java-singleton-double-checked-locking for the singleton? It reduces the need for as many lock acquisitions.


Issue Time Tracking
-------------------

    Worklog Id:     (was: 814803)
    Time Spent: 2h 20m  (was: 2h 10m)

> Support vectorized row batch pooling
> ------------------------------------
>
>                 Key: GOBBLIN-1715
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1715
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-core
>            Reporter: Ratandeep Ratti
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The pre-allocation method allocates vastly more memory for ORC ColumnVectors of arrays and maps than is needed, and the amount is unpredictable because it depends on the current column vector's length, which itself grows as we allocate more memory to it. In a heap dump taken for a Kafka topic, we saw that on the second resize call for an array ColumnVector, where the requested size was ~1k elements, it attempted to allocate around 444M elements. This over-allocation went far past the heap size and was the primary reason we see OOM failures during ingestion of deeply nested records.
>
> Update: Below is an example of how the smart resizing procedure can allocate a very large amount of memory. The formula for allocating memory is
> {noformat}
> child_vector resize =
>   child_vector_request_size +
>   (child_vector_request_size / rowsAdded + 1) * current_vector_size
> {noformat}
> If we now have a row containing deeply nested arrays of arrays, each of 525 elements, the memory will be allocated as follows:
> {noformat}
> 1st resize = 525 + (525/1 + 1) * 256    = 135181   ; current vector size by default is batch size = 256
> 2nd resize = 525 + (525/1 + 1) * 135181 = 71105731 ; current vector size = 135181
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
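[Editor's sketch] The review comment above points at double-checked locking as a way to avoid synchronizing on every call to the pool singleton. Below is a minimal sketch of that pattern for a class shaped like the RowBatchPool referenced in the diff; only the class name and the instance(properties) entry point come from the diff, while the field, constructor body, and use of org.apache.gobblin.configuration.State are assumptions for illustration.

{noformat}
import org.apache.gobblin.configuration.State;

// Hypothetical double-checked-locking variant of the pool singleton.
// Shape is illustrative; it is not the actual Gobblin implementation.
public class RowBatchPool {

  // volatile ensures a fully constructed instance is visible to threads
  // that pass the first, unsynchronized null check.
  private static volatile RowBatchPool instance;

  private RowBatchPool(State properties) {
    // initialize pool internals from properties (omitted in this sketch)
  }

  public static RowBatchPool instance(State properties) {
    RowBatchPool local = instance;
    if (local == null) {                      // first check, no lock taken
      synchronized (RowBatchPool.class) {
        local = instance;
        if (local == null) {                  // second check, under the lock
          local = new RowBatchPool(properties);
          instance = local;
        }
      }
    }
    return local;
  }
}
{noformat}

After the first initialization, callers return on the unsynchronized fast path, which is the reduction in lock acquisitions the comment is asking about.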
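[Editor's sketch] To make the resize arithmetic in the issue description concrete, the snippet below evaluates the quoted formula for the 525-element nested-array example. The class and method names are hypothetical; only the formula and the numbers come from the description.

{noformat}
// Reproduces: child_vector_request_size +
//             (child_vector_request_size / rowsAdded + 1) * current_vector_size
public class ResizeEstimate {

  static long resize(long requestSize, long rowsAdded, long currentVectorSize) {
    return requestSize + (requestSize / rowsAdded + 1) * currentVectorSize;
  }

  public static void main(String[] args) {
    long first = resize(525, 1, 256);      // 135181; current size starts at the batch size (256)
    long second = resize(525, 1, first);   // 71105731; current size is now 135181
    System.out.println(first + " -> " + second);
  }
}
{noformat}

Because each resize multiplies by the current vector size, two resizes of a 525-element child vector already balloon to roughly 71M elements, which matches the over-allocation described above.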