William Lo created GOBBLIN-1891:
-----------------------------------
Summary: Create self-tuning ORC Writer
Key: GOBBLIN-1891
URL: https://issues.apache.org/jira/browse/GOBBLIN-1891
Project: Apache Gobblin
Issue Type: New Feature
Components: gobblin-core
Reporter: William Lo
Assignee: Abhishek Tiwari
In Gobblin streaming, the Avro-to-ORC converter and writer frequently hit OOM
errors when records are large because they contain large arrays or maps.
Because streaming pipelines run indefinitely, static configurations are
usually insufficient to handle varying data sizes, the converter's buffers,
increases in the number of partitions, etc. As a result, pipelines often stall
and make no progress once the incoming data size grows beyond the memory limits
of the container.
We want to implement a bufferedORCWriter that reuses many of the same
components as the current ORC Writer, except that its batchSize adapts to
larger record sizes and takes into account the memory available to the
JVM to avoid OOM issues. It should be enabled only through configuration and
expose knobs for tuning the sensitivity and the performance of this writer.
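A rough sketch of the sizing logic, to make the proposal concrete. This is not
existing Gobblin code: the class name AdaptiveBatchSizer, its methods, and the
memoryFraction / minBatchSize / maxBatchSize knobs are hypothetical and only
illustrate one way the batchSize could be derived from the observed record
size and the heap still available to the JVM.

```java
// Hypothetical helper, not part of Gobblin: picks a row-batch size that fits
// within a configurable fraction of the heap the JVM can still allocate.
final class AdaptiveBatchSizer {
    private final double memoryFraction; // assumed knob: share of free heap a batch may use
    private final int minBatchSize;      // assumed knob: lower bound so the writer keeps progressing
    private final int maxBatchSize;      // assumed knob: upper bound for small records

    AdaptiveBatchSizer(double memoryFraction, int minBatchSize, int maxBatchSize) {
        if (memoryFraction <= 0.0 || memoryFraction > 1.0) {
            throw new IllegalArgumentException("memoryFraction must be in (0, 1]");
        }
        if (minBatchSize < 1 || maxBatchSize < minBatchSize) {
            throw new IllegalArgumentException("need 1 <= minBatchSize <= maxBatchSize");
        }
        this.memoryFraction = memoryFraction;
        this.minBatchSize = minBatchSize;
        this.maxBatchSize = maxBatchSize;
    }

    // Heap the JVM can still allocate: max heap minus currently used heap.
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    // Number of rows to buffer next, given the average observed record size
    // and the heap currently available, clamped to [minBatchSize, maxBatchSize].
    int nextBatchSize(long avgRecordBytes, long availableBytes) {
        long budget = (long) (availableBytes * memoryFraction);
        long rows = budget / Math.max(1L, avgRecordBytes);
        return (int) Math.max(minBatchSize, Math.min(maxBatchSize, rows));
    }
}
```

A writer could call nextBatchSize(avgRecordBytes, AdaptiveBatchSizer.availableHeapBytes())
after each flush, so that a run of large records shrinks the next batch instead
of exhausting the container's memory. How the average record size is estimated
is left open here.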