[jira] [Updated] (GOBBLIN-1891) Create self-tuning ORC Writer

William Lo (Jira) Tue, 29 Aug 2023 11:31:13 -0700


     [ https://issues.apache.org/jira/browse/GOBBLIN-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


William Lo updated GOBBLIN-1891:
--------------------------------
    Description: 
In Gobblin streaming, the Avro-to-ORC converter and writer constantly run into 
OOM issues when records are large because of large arrays or maps.
Because streaming pipelines run indefinitely, static configurations are 
usually insufficient to handle varying data sizes, converter buffers, 
increases in partitions, etc. Pipelines therefore often stall and make no 
progress once the incoming data size exceeds the memory limits of the 
container.

We want to implement a bufferedORCWriter that reuses many of the same 
components as the current ORC Writer, except that its batchSize adapts to 
larger record sizes. To avoid OOM issues, it takes into account the memory 
available to the JVM, the memory the converter uses, and the number of 
partitioned writers. It should be enabled only through configuration and 
expose knobs for tuning the sensitivity and the performance of this writer.
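
As a rough illustration of the idea above, the batch size could be derived 
from a memory budget along the following lines. This is only a sketch under 
stated assumptions: the class, method, and parameter names are hypothetical 
and are not part of Gobblin or ORC, and the budgeting formula (remaining 
memory, minus converter usage, scaled by a tunable fraction and split evenly 
across partitioned writers) is one plausible choice, not the implemented one.

```java
/**
 * Hypothetical sketch: none of these names come from Gobblin or ORC.
 * It only illustrates sizing a batch from a memory budget.
 */
final class AdaptiveBatchSizeSketch {
    private AdaptiveBatchSizeSketch() {}

    /**
     * @param availableBytes     memory the JVM can still allocate
     * @param converterBytes     memory currently held by the converter (assumed known)
     * @param partitionedWriters number of writers sharing the budget
     * @param avgRecordBytes     running estimate of bytes per record
     * @param memoryFraction     knob: share of the remaining memory batches may use
     * @param minBatch           lower bound so the writer always makes progress
     * @param maxBatch           upper bound, e.g. the statically configured batch size
     */
    static int computeBatchSize(long availableBytes, long converterBytes,
                                int partitionedWriters, long avgRecordBytes,
                                double memoryFraction, int minBatch, int maxBatch) {
        // Memory left after the converter's buffers, never negative.
        long usable = Math.max(0L, availableBytes - converterBytes);
        // Apply the tuning knob, then split evenly across partitioned writers.
        long budgetPerWriter = (long) (usable * memoryFraction) / Math.max(1, partitionedWriters);
        // Rows that fit into one writer's budget at the estimated record size.
        long rows = budgetPerWriter / Math.max(1L, avgRecordBytes);
        // Clamp to the configured bounds.
        return (int) Math.max(minBatch, Math.min(maxBatch, rows));
    }

    /** Heap still available to the JVM: max heap minus what is currently in use. */
    static long availableJvmBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
    }
}
```

With such a scheme the batch size shrinks automatically as the estimated 
record size grows or more partitioned writers are opened, and it never drops 
below minBatch, so the writer keeps making progress.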

Future improvements include making the converter leave less memory unused 
after each resize and estimating memory usage in the ORC writer more 
accurately.

  was:
In Gobblin streaming, the Avro-to-ORC converter and writer constantly run into 
OOM issues when records are large because of large arrays or maps.
Because streaming pipelines run indefinitely, static configurations are 
usually insufficient to handle varying data sizes, converter buffers, 
increases in partitions, etc. Pipelines therefore often stall and make no 
progress once the incoming data size exceeds the memory limits of the 
container.

We want to implement a bufferedORCWriter that reuses many of the same 
components as the current ORC Writer, except that its batchSize adapts to 
larger record sizes and takes into account the memory available to the JVM 
to avoid OOM issues. It should be enabled only through configuration and 
expose knobs for tuning the sensitivity and the performance of this writer.


> Create self-tuning ORC Writer
> -----------------------------
>
>                 Key: GOBBLIN-1891
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1891
>             Project: Apache Gobblin
>          Issue Type: New Feature
>          Components: gobblin-core
>            Reporter: William Lo
>            Assignee: Abhishek Tiwari
>            Priority: Major
>
> In Gobblin streaming, the Avro-to-ORC converter and writer constantly run 
> into OOM issues when records are large because of large arrays or maps.
> Because streaming pipelines run indefinitely, static configurations are 
> usually insufficient to handle varying data sizes, converter buffers, 
> increases in partitions, etc. Pipelines therefore often stall and make no 
> progress once the incoming data size exceeds the memory limits of the 
> container.
> We want to implement a bufferedORCWriter that reuses many of the same 
> components as the current ORC Writer, except that its batchSize adapts to 
> larger record sizes. To avoid OOM issues, it takes into account the memory 
> available to the JVM, the memory the converter uses, and the number of 
> partitioned writers. It should be enabled only through configuration and 
> expose knobs for tuning the sensitivity and the performance of this writer.
> Future improvements include making the converter leave less memory unused 
> after each resize and estimating memory usage in the ORC writer more 
> accurately.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
