[GitHub] [beam] damccorm opened a new issue, #20574: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns

GitBox Sat, 04 Jun 2022 11:14:40 -0700


damccorm opened a new issue, #20574:
URL: https://github.com/apache/beam/issues/20574

Beam jobs with slow, memory-hungry, or otherwise resource-intensive DoFn
implementations perform quite poorly (or even OOM) due to the fact that an
`[UnboundedThreadPoolExecutor|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/thread_pool_executor.py#L89]`
is used to spawn workers.

The Python SDK no longer seems to have any methods by which to control
concurrent execution of user code. Resource-intensive DoFns can control their
own execution by maintaining their own semaphores, but that causes input
elements to effectively spool in-memory, with one thread created for every new
message. If the input rate of data to a worker exceeds the worker's ability to
process those messages, an unbounded number of threads will be spawned to
handle incoming work.

Versions of Beam before 2.18 allowed specifying the \--worker_threads
experimental flag to control concurrency more effectively, but that was
[removed in November of 2019](https://github.com/apache/beam/pull/10123) by
[[email protected]] (see: BEAM-8151).

One possible solution would be to re-introduce a limit on the size of the
`_SharedUnboundedThreadPoolExecutor` to ensure that we don't create too many
threads, but I'm unsure of what kind of backpressure this would create and what
effect it may have on the rest of the harness.

Imported from Jira
[BEAM-11051](https://issues.apache.org/jira/browse/BEAM-11051). Original Jira
may contain additional context.
Reported by: psobotspotify.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] damccorm opened a new issue, #20574: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns

Reply via email to