[
https://issues.apache.org/jira/browse/BEAM-5724?focusedWorklogId=159075&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-159075
]
ASF GitHub Bot logged work on BEAM-5724:
----------------------------------------
Author: ASF GitHub Bot
Created on: 26/Oct/18 09:53
Start Date: 26/Oct/18 09:53
Worklog Time Spent: 10m
Work Description: mxm commented on a change in pull request #6835:
[BEAM-5724] Generalize flink executable context to allow more than 1 worker
process per task manager
URL: https://github.com/apache/beam/pull/6835#discussion_r228463062
##########
File path:
runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/FlinkDefaultExecutableStageContext.java
##########
```diff
@@ -64,19 +71,60 @@ public void close() throws Exception {
     jobBundleFactory.close();
   }

-  enum ReferenceCountingFactory implements Factory {
-    REFERENCE_COUNTING;
+  private static class JobFactoryState {
+    private final AtomicInteger counter = new AtomicInteger(0);
+    private final List<ReferenceCountingFlinkExecutableStageContextFactory> factories =
+        new ArrayList<>();
+    private final int maxFactories;

-    private static final ReferenceCountingFlinkExecutableStageContextFactory actualFactory =
-        ReferenceCountingFlinkExecutableStageContextFactory.create(
-            FlinkDefaultExecutableStageContext::create);
+    private JobFactoryState(int maxFactories) {
+      if (maxFactories == 0) {
+        // Default to num_cores - 1 so that we leave some resources available for the java process
+        this.maxFactories = Math.max(Runtime.getRuntime().availableProcessors() - 1, 1);
+      } else {
+        this.maxFactories = maxFactories;
+      }
+    }
+
+    private synchronized FlinkExecutableStageContext.Factory getFactory() {
+      int count = counter.getAndIncrement();
+
+      if (count < maxFactories) {
+        factories.add(
+            ReferenceCountingFlinkExecutableStageContextFactory.create(
+                FlinkDefaultExecutableStageContext::create));
+      }
+
+      return factories.get(count % maxFactories);
+    }
+  }
+
+  enum MultiInstanceFactory implements Factory {
+    MULTI_INSTANCE;
+
+    // This map should only ever have a single element, as each job will have its own
+    // classloader and therefore its own instance of MultiInstanceFactory.INSTANCE. This
+    // code supports multiple JobInfos in order to provide a sensible implementation of
+    // Factory.get(JobInfo), which in theory could be called with different JobInfos.
+    private static final ConcurrentMap<String, JobFactoryState> jobFactories =
+        new ConcurrentHashMap<>();

     @Override
     public FlinkExecutableStageContext get(JobInfo jobInfo) {
-      return actualFactory.get(jobInfo);
+      JobFactoryState state =
+          jobFactories.computeIfAbsent(
+              jobInfo.jobId(),
+              k -> {
+                PortablePipelineOptions portableOptions =
+                    PipelineOptionsTranslation.fromProto(jobInfo.pipelineOptions())
+                        .as(PortablePipelineOptions.class);
+
+                return new JobFactoryState(
+                    MoreObjects.firstNonNull(portableOptions.getSdkWorkerParallelism(), 1L)
```
Review comment:
If we make the above change from `Long` to `long` with default `-1`, this
would become:
```suggestion
Math.max(portableOptions.getSdkWorkerParallelism(), 1L)
```
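As a standalone illustration of the suggested change: a boxed `Long` that may be null gets a `firstNonNull`-style default, while a primitive `long` with a `-1` sentinel gets a `Math.max` clamp. The sketch below uses hypothetical helper names (they are not part of the Beam codebase); note the two strategies differ on non-null, non-positive values such as `0`.

```java
// Hypothetical helpers contrasting the two defaulting strategies from the
// review thread. Names are illustrative only.
class ParallelismDefaults {
  // Boxed Long where null means "unset": equivalent of
  // MoreObjects.firstNonNull(optionValue, 1L).
  static long fromBoxed(Long optionValue) {
    return optionValue != null ? optionValue : 1L;
  }

  // Primitive long where -1 means "unset": the suggested clamp.
  // Unlike fromBoxed, this also maps 0 to 1.
  static long fromPrimitive(long optionValue) {
    return Math.max(optionValue, 1L);
  }
}
```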
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 159075)
Time Spent: 50m (was: 40m)
> Beam creates too many sdk_worker processes with --sdk-worker-parallelism=stage
> ------------------------------------------------------------------------------
>
> Key: BEAM-5724
> URL: https://issues.apache.org/jira/browse/BEAM-5724
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Micah Wylde
> Assignee: Micah Wylde
> Priority: Major
> Labels: portability-flink
> Time Spent: 50m
> Remaining Estimate: 0h
>
> In the Flink portable runner, we currently support two options for SDK worker
> parallelism (how many Python worker processes we run). The default is one per
> TaskManager, and with --sdk-worker-parallelism=stage you get one per stage.
> However, for complex pipelines with many Beam operators that get fused into a
> single Flink task, this can produce hundreds of worker processes per TM.
>
> Flink uses the notion of task slots to limit resource utilization on a box; I
> think that Beam should try to respect those limits as well. Ideally we'd
> produce a single Python worker per task slot/Flink operator chain.
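The patch under review caps worker processes by lazily creating at most N stage-context factories per job and then cycling through them round-robin. A simplified, hypothetical sketch of that scheme (names and the generic `Supplier` are mine, standing in for `ReferenceCountingFlinkExecutableStageContextFactory.create(...)`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Simplified sketch of the JobFactoryState idea: create instances lazily up to
// a cap, then serve further requests by cycling through the existing ones.
class RoundRobinPool<T> {
  private final AtomicInteger counter = new AtomicInteger(0);
  private final List<T> instances = new ArrayList<>();
  private final int maxInstances;
  private final Supplier<T> creator;

  RoundRobinPool(int maxInstances, Supplier<T> creator) {
    if (maxInstances <= 0) {
      // Mirror the patch's default: num_cores - 1, but never less than 1,
      // leaving some headroom for the Java process itself.
      this.maxInstances = Math.max(Runtime.getRuntime().availableProcessors() - 1, 1);
    } else {
      this.maxInstances = maxInstances;
    }
    this.creator = creator;
  }

  synchronized T get() {
    int count = counter.getAndIncrement();
    if (count < maxInstances) {
      instances.add(creator.get()); // lazily create until the cap is reached
    }
    return instances.get(count % maxInstances); // then reuse round-robin
  }
}
```

With a cap of 2, successive `get()` calls return instance 0, instance 1, instance 0, instance 1, and so on, so a TaskManager running many fused stages for one job shares a bounded set of workers instead of spawning one per stage.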
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)