[
https://issues.apache.org/jira/browse/GOBBLIN-1891?focusedWorklogId=879203&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-879203
]
ASF GitHub Bot logged work on GOBBLIN-1891:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 30/Aug/23 20:10
Start Date: 30/Aug/23 20:10
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on code in PR #3751:
URL: https://github.com/apache/gobblin/pull/3751#discussion_r1310713892
##########
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/OrcConverterMemoryManager.java:
##########
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.writer;
+
+import org.apache.orc.storage.ql.exec.vector.ColumnVector;
+import org.apache.orc.storage.ql.exec.vector.ListColumnVector;
+import org.apache.orc.storage.ql.exec.vector.MapColumnVector;
+import org.apache.orc.storage.ql.exec.vector.StructColumnVector;
+import org.apache.orc.storage.ql.exec.vector.UnionColumnVector;
+import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
+
+
+/**
+ * A helper class to calculate the size of array buffers in a {@link VectorizedRowBatch}.
+ * This estimate is mainly based on the maximum size of each variable-length column, which can be resized.
+ * Since the resizing algorithm for each column can balloon, this can affect the likelihood of OOM.
+ */
+public class OrcConverterMemoryManager {
+
+ private VectorizedRowBatch rowBatch;
+
+ // TODO: Consider moving the resize algorithm from the converter to this class
+ OrcConverterMemoryManager(VectorizedRowBatch rowBatch) {
+ this.rowBatch = rowBatch;
+ }
+
+ /**
+ * Recursively calculates the maximum number of elements held by lists and maps in a column
+ * @param col the column vector to measure
+ * @return the total element count across the column and its nested children
+ */
+ public long calculateSizeOfColHelper(ColumnVector col) {
+ long converterBufferColSize = 0;
+ if (col instanceof ListColumnVector) {
+ ListColumnVector listColumnVector = (ListColumnVector) col;
+ converterBufferColSize += listColumnVector.child.isNull.length;
Review Comment:
I still have some questions here; I may be misunderstanding something:
1. Does listColumnVector.child.isNull.length mean the length of the current
list? I'm confused by the "child" here.
2. On line 52, why is it + and not *?
3. It seems like you're getting the length here, but aren't we interested in
the memory size?
Issue Time Tracking
-------------------
Worklog Id: (was: 879203)
Time Spent: 1h 10m (was: 1h)
> Create self-tuning ORC Writer
> -----------------------------
>
> Key: GOBBLIN-1891
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1891
> Project: Apache Gobblin
> Issue Type: New Feature
> Components: gobblin-core
> Reporter: William Lo
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> In Gobblin streaming, the Avro-to-ORC converter and writer constantly face
> OOM issues when record sizes are large due to large arrays or maps.
> Since streaming pipelines run indefinitely, static configurations are
> usually insufficient to handle varying sizes of data, the converter buffers,
> increases in partitions, etc. This causes pipelines to often stall and make
> no progress if the incoming data size grows beyond the memory limits
> of the container.
> We want to implement a bufferedORCWriter, which utilizes many of the same
> components as the current ORC writer, except that the batchSize adapts
> to larger record sizes and takes into account the memory available to
> the JVM, the memory the converter uses, and the number of partitioned
> writers, in order to avoid OOM issues. This should be enabled only by a
> configuration, and have knobs available so that one can tune the
> sensitivity and the performance of this writer.
> Future improvements include making the converter waste less memory on
> every resize, and more accurate estimation of memory usage in
> the ORC writer.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)