[
https://issues.apache.org/jira/browse/GOBBLIN-1891?focusedWorklogId=879203&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-879203
]
ASF GitHub Bot logged work on GOBBLIN-1891:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 30/Aug/23 20:10
Start Date: 30/Aug/23 20:10
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on code in PR #3751:
URL: https://github.com/apache/gobblin/pull/3751#discussion_r1310713892
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/OrcConverterMemoryManager.java:
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.writer;
+
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.UnionColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+/**
+ * A helper class to calculate the size of array buffers in a {@link VectorizedRowBatch}.
+ * This estimate is mainly based on the maximum size of each variable-length column, which can be resized.
+ * Since the resizing algorithm for each column can balloon, this can affect the likelihood of OOM.
+ */
+public class OrcConverterMemoryManager {
+
+ private VectorizedRowBatch rowBatch;
+
+ // TODO: Consider moving the resize algorithm from the converter to this class
+ OrcConverterMemoryManager(VectorizedRowBatch rowBatch) {
+ this.rowBatch = rowBatch;
+ }
+
+ /**
+ * Recursively calculates the maximum number of elements held by lists and maps in a column
+ * @param col the column vector to measure
+ * @return the total element count across the column and its nested children
+ */
+ public long calculateSizeOfColHelper(ColumnVector col) {
+ long converterBufferColSize = 0;
+ if (col instanceof ListColumnVector) {
+ ListColumnVector listColumnVector = (ListColumnVector) col;
+ converterBufferColSize += listColumnVector.child.isNull.length;
Review Comment:
I still have some questions here; I may be misunderstanding something:
1. Does listColumnVector.child.isNull.length mean the length of the current
list? I'm confused by the "child" here.
2. On line 52, why is it + and not *?
3. It seems like you're getting the length here, but aren't we interested in
the memory size?
Issue Time Tracking
-------------------
Worklog Id: (was: 879203)
Time Spent: 1h 10m (was: 1h)
> Create self-tuning ORC Writer
> -----------------------------
>
> Key: GOBBLIN-1891
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1891
> Project: Apache Gobblin
> Issue Type: New Feature
> Components: gobblin-core
> Reporter: William Lo
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> In Gobblin streaming, the Avro-to-ORC converter and writer constantly face
> OOM issues when record sizes are large due to large arrays or maps.
> Since streaming pipelines run indefinitely, static configurations are
> usually insufficient to handle varying sizes of data, the converter buffers,
> increases in partitions, etc. This causes pipelines to often stall and make
> no progress if the incoming data size grows beyond the memory limits
> of the container.
> We want to implement a bufferedORCWriter, which utilizes many of the same
> components as the current ORC writer, except that the batchSize adapts
> to larger record sizes and takes into account the memory available to
> the JVM, the memory the converter uses, and the number of partitioned
> writers, in order to avoid OOM issues. This should be enabled only by a
> configuration, and have knobs available so that one can tune the
> sensitivity and the performance of this writer.
> Future improvements include making the converter waste less memory on
> every resize, and more accurate estimation of memory usage in
> the ORC writer.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)