davseitsev commented on issue #3179: URL: https://github.com/apache/iceberg/issues/3179#issuecomment-2220507710
We have the same issue. It caused a lot of OOM problems when we had a problem with data compaction. In the smallest heap dump we took, `ParallelIterator#queue` takes 21 GB because it contains `313 573` items. Some dumps contain more than `1M` `org.apache.iceberg.BaseFileScanTask` items (across multiple queries).

A single call to `ParallelIterator.checkTasks()` submits 96 processing tasks, which is enough to cause back pressure and memory starvation. A few ideas:

- Implement a kind of slow-start algorithm: `checkTasks()` submits only 1 new task; if the consumer is fast enough, `ParallelIterator#queue` becomes empty and the next `checkTasks()` call submits an additional task, and so on (see the first sketch below).
- Add a separate config for the size of `ParallelIterator.taskFutures` to limit the maximum number of producers for a single query without reducing the worker pool size.
- Limit the queue size and use `ForkJoinPool#managedBlock` so that blocked producers don't completely stall other parallel flows (see the second sketch below).
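A minimal sketch of the slow-start idea, assuming a hypothetical additive-increase/halving window (the class and field names mirror `ParallelIterator` for readability; this is not the actual Iceberg implementation):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch, not Iceberg's actual ParallelIterator. Instead of
// submitting one producer task per input on every checkTasks() call, grow
// the number of in-flight producers gradually, and only while the consumer
// keeps the queue drained.
class SlowStartParallelIterator<T> {
  private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
  private final Iterator<Runnable> pendingTasks; // each task enqueues results into `queue`
  private final ExecutorService workerPool;
  private final List<Future<?>> taskFutures = new ArrayList<>();
  private int window = 1; // current producer budget; starts at a single task

  SlowStartParallelIterator(Iterator<Runnable> pendingTasks, ExecutorService workerPool) {
    this.pendingTasks = pendingTasks;
    this.workerPool = workerPool;
  }

  // Called from the consumer thread, e.g. on each hasNext()/next().
  void checkTasks() {
    taskFutures.removeIf(Future::isDone);

    if (queue.isEmpty()) {
      // Consumer kept up: allow one more concurrent producer next round.
      window++;
    } else {
      // Queue is backing up: shrink back toward a single producer.
      window = Math.max(1, window / 2);
    }

    // Top up to the current window instead of submitting everything at once.
    while (taskFutures.size() < window && pendingTasks.hasNext()) {
      taskFutures.add(workerPool.submit(pendingTasks.next()));
    }
  }
}
```

The halving on a non-empty queue is just one possible back-off policy; capping `window` at a configurable maximum would also combine this with the `taskFutures` size limit suggested above.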

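And a sketch of the bounded-queue idea: producers offer into a bounded queue, and when it is full they wait inside `ForkJoinPool.managedBlock`, which lets the pool create a compensating thread so other parallel flows sharing the pool are not starved. The queue type and timeout here are assumptions for illustration:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a bounded queue (unlike the unbounded
// ConcurrentLinkedQueue used today) plus ForkJoinPool.managedBlock so a
// full queue applies back pressure without silently reducing the pool's
// effective parallelism.
class BoundedQueueProducer<T> {
  private final LinkedBlockingQueue<T> queue;

  BoundedQueueProducer(int maxQueueSize) {
    this.queue = new LinkedBlockingQueue<>(maxQueueSize);
  }

  void produce(T item) throws InterruptedException {
    ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
      private boolean offered;

      @Override
      public boolean isReleasable() {
        // Try a non-blocking offer first; no blocking needed if it succeeds.
        return offered || (offered = queue.offer(item));
      }

      @Override
      public boolean block() throws InterruptedException {
        // Wait for free space; the timeout lets managedBlock re-check
        // isReleasable() periodically.
        offered = queue.offer(item, 100, TimeUnit.MILLISECONDS);
        return offered;
      }
    });
  }
}
```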