Peter Sobot created BEAM-11051:
----------------------------------

             Summary: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns
                 Key: BEAM-11051
                 URL: https://issues.apache.org/jira/browse/BEAM-11051
             Project: Beam
          Issue Type: Bug
          Components: sdk-py-harness
    Affects Versions: 2.24.0, 2.23.0, 2.22.0, 2.21.0, 2.20.0, 2.19.0, 2.18.0
            Reporter: Peter Sobot


Beam jobs with slow, memory-hungry, or otherwise resource-intensive DoFn 
implementations perform quite poorly (or even OOM) because an 
{{[UnboundedThreadPoolExecutor|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/thread_pool_executor.py#L89]}} 
is used to spawn workers.
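
For illustration, a pipeline along these lines (the DoFn, its buffer size, and 
its sleep time are hypothetical stand-ins for real resource-intensive work) is 
enough to trigger the behavior on a streaming input:

{code:python}
import time

import apache_beam as beam


class SlowDoFn(beam.DoFn):
  """Hypothetical stand-in for a slow, memory-hungry DoFn."""

  def process(self, element):
    # Each in-flight element holds a large buffer while its thread
    # sleeps, simulating resource-intensive work.
    buf = bytearray(100 * 1024 * 1024)  # ~100 MB per element
    time.sleep(10)
    yield len(buf)
{code}

On a streaming source whose arrival rate exceeds one element per ten seconds 
per thread, the harness keeps creating threads, and each new thread pins 
another buffer until the worker runs out of memory.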

The Python SDK no longer seems to offer any way to control concurrent 
execution of user code. Resource-intensive DoFns can throttle themselves by 
maintaining their own semaphores (see the sketch below), but that only blocks 
the processing threads: one thread is still created for every new message, so 
input elements effectively spool in memory. If the rate of data arriving at a 
worker exceeds the worker's ability to process it, an unbounded number of 
threads will be spawned to handle the incoming work.
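
Concretely, the workaround looks something like the following (the names and 
the concurrency limit are illustrative). Note that threads blocked on the 
semaphore still exist and still hold their input elements:

{code:python}
import threading
import time

import apache_beam as beam

# Shared across all instances of the DoFn in this worker process.
_MAX_CONCURRENCY = 4
_semaphore = threading.BoundedSemaphore(_MAX_CONCURRENCY)


class ThrottledDoFn(beam.DoFn):
  def process(self, element):
    # At most _MAX_CONCURRENCY elements are processed at once, but the
    # harness still creates one thread per message; the excess threads
    # just block here, spooling their elements in memory.
    with _semaphore:
      time.sleep(10)  # stand-in for the real resource-intensive work
      yield element
{code}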

Versions of Beam before 2.18 allowed specifying the {{--worker_threads}} 
experimental flag to control concurrency more effectively, but that flag was 
[removed in November 2019|https://github.com/apache/beam/pull/10123] by 
[[email protected]] (see: BEAM-8151).

One possible solution would be to re-introduce a limit on the size of the 
{{_SharedUnboundedThreadPoolExecutor}} to ensure that we don't create too many 
threads, but I'm unsure what kind of backpressure this would create and what 
effect it might have on the rest of the harness.
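
As a sketch of the idea (not Beam's actual implementation, and the class and 
parameter names here are made up), a size-limited pool whose {{submit()}} 
blocks when saturated would push exactly that backpressure onto whatever 
feeds the executor:

{code:python}
import concurrent.futures
import threading


class BoundedThreadPoolExecutor(concurrent.futures.ThreadPoolExecutor):
  """Sketch of a size-limited replacement for the shared executor."""

  def __init__(self, max_workers, queue_size=0):
    super().__init__(max_workers=max_workers)
    # One slot per running or queued work item.
    self._slots = threading.BoundedSemaphore(max_workers + queue_size)

  def submit(self, fn, *args, **kwargs):
    # Block the caller once max_workers + queue_size items are in
    # flight; this blocking is the backpressure in question.
    self._slots.acquire()
    try:
      future = super().submit(fn, *args, **kwargs)
    except Exception:
      self._slots.release()
      raise
    future.add_done_callback(lambda _: self._slots.release())
    return future
{code}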


