Re: [PR] MSQ controller: Support in-memory shuffles; towards JVM reuse. (druid)

via GitHub Mon, 22 Apr 2024 21:46:07 -0700


gianm commented on code in PR #16168:
URL: https://github.com/apache/druid/pull/16168#discussion_r1575633139



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/kernel/controller/ControllerUtils.java:
##########
@@ -0,0 +1,334 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.msq.kernel.controller;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Iterables;
+import it.unimi.dsi.fastutil.ints.IntSet;
+import org.apache.druid.msq.exec.OutputChannelMode;
+import org.apache.druid.msq.indexing.destination.MSQDestination;
+import org.apache.druid.msq.indexing.destination.MSQSelectDestination;
+import org.apache.druid.msq.input.InputSpec;
+import org.apache.druid.msq.input.InputSpecs;
+import org.apache.druid.msq.kernel.QueryDefinition;
+import org.apache.druid.msq.kernel.StageDefinition;
+import org.apache.druid.msq.kernel.StageId;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.TreeMap;
+import java.util.TreeSet;
+
+/**
+ * Utilities for {@link ControllerQueryKernel}.
+ */
+public class ControllerUtils
+{
+  /**
+   * Put stages from {@link QueryDefinition} into groups that must each be launched simultaneously.
+   *
+   * This method's goal is to maximize the usage of {@link OutputChannelMode#MEMORY} channels, subject to constraints
+   * provided by {@link ControllerQueryKernelConfig#isPipeline()},
+   * {@link ControllerQueryKernelConfig#getMaxConcurrentStages()}, and
+   * {@link ControllerQueryKernelConfig#isFaultTolerant()}.
+   */
+  public static List<StageGroup> computeStageGroups(
+      final QueryDefinition queryDef,
+      final ControllerQueryKernelConfig config
+  )
+  {
+    final MSQDestination destination = config.getDestination();
+    final List<StageGroup> stageGroups = new ArrayList<>();
+    final boolean useDurableStorage = config.isDurableStorage();
+    final Map<StageId, Set<StageId>> inflow = computeStageInflowMap(queryDef);
+    final Map<StageId, Set<StageId>> outflow = computeStageOutflowMap(queryDef);
+    final Set<StageId> stagesRun = new HashSet<>();
+
+    while (stagesRun.size() < queryDef.getStageDefinitions().size()) {
+      // 1) Run all stages that cannot stream their output, as solo groups.
+      boolean didRun;
+      do {
+        didRun = false;
+
+        for (final StageId stageId : ImmutableList.copyOf(inflow.keySet())) {
+          if (!stagesRun.contains(stageId)
+              && inflow.get(stageId).isEmpty()
+              && !canStreamOutput(queryDef, stageId.getStageNumber(), config, outflow)) {
+            stagesRun.add(stageId);
+            stageGroups.add(
+                new StageGroup(
+                    Collections.singletonList(stageId),
+                    getOutputChannelMode(
+                        queryDef,
+                        stageId.getStageNumber(),
+                        destination.toSelectDestination(),
+                        useDurableStorage,
+                        false
+                    )
+                )
+            );
+
+            removeStageFlow(stageId, inflow, outflow);
+            didRun = true;
+          }
+        }
+      } while (didRun);
+
+      // 2) Pick some stage that can stream its output, and run that as well as all ready-to-run dependents.
+      StageId currentStageId = null;
+      for (final StageId stageId : ImmutableList.copyOf(inflow.keySet())) {
+        if (!stagesRun.contains(stageId)
+            && inflow.get(stageId).isEmpty()
+            && canStreamOutput(queryDef, stageId.getStageNumber(), config, outflow)) {
+          currentStageId = stageId;
+          break;
+        }
+      }
+
+      if (currentStageId != null) {
+        final List<StageId> currentStageGroup = new ArrayList<>();
+        final int maxStageGroupSize;
+        if (stageGroups.isEmpty()) {
+          maxStageGroupSize = config.getMaxConcurrentStages();
+        } else {
+          final StageGroup priorGroup = stageGroups.get(stageGroups.size() - 1);
+          if (priorGroup.lastStageOutputChannelMode() == OutputChannelMode.MEMORY) {
+            // Prior group must run concurrently with this group.
+            maxStageGroupSize = config.getMaxConcurrentStages() - priorGroup.size();
+          } else {
+            // Prior group can exit before this group starts.
+            maxStageGroupSize = config.getMaxConcurrentStages();
+          }
+        }
+
+        OutputChannelMode currentOutputChannelMode = null;
+        while (currentStageId != null) {
+          final boolean canStream = canStreamOutput(queryDef, currentStageId.getStageNumber(), config, outflow);
+          final Set<StageId> currentOutflow = outflow.get(currentStageId);
+
+          final int maxStageGroupSizeAllowingForDownstreamConsumer;
+          if (queryDef.getStageDefinition(currentStageId).doesSortDuringShuffle()) {
+            // When the current group sorts, there's a pipeline break, so we can "leapfrog": close the prior group
+            // before starting the downstream group.
+            maxStageGroupSizeAllowingForDownstreamConsumer = config.getMaxConcurrentStages() - 1;

Review Comment:
   It's prevented by the fact that when a stage group contains a pipeline break (i.e. some stage that sorts), the upstream stage group is closed before the downstream stage group is started. So the `priorGroup.size()` stages and the `1` downstream consumer never run simultaneously.
   
   I extended the comment to say:
   
   ```
             // When the current group sorts, there's a pipeline break, so we can leapfrog: close the prior group before
             // starting the downstream group. In this case, we only need to reserve a single concurrent-stage slot for
             // a downstream consumer.
   ```
   
   The new class-level javadoc should help shine light on this too.
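
To make the slot accounting concrete, here is a rough sketch of the two phases described above. It is only an illustrative model, not code from the PR: the class `LeapfrogSlots`, the helper `peakConcurrentStages`, and the numbers used are all hypothetical, and it assumes the whole prior group closes at the sort and that a single downstream consumer then runs alongside the current group.

```java
// Hypothetical model of concurrent-stage slot accounting; not Druid code.
final class LeapfrogSlots
{
  /**
   * Peak number of stages running at once when the current group ends in a sort.
   * Phase 1: the prior group streams into the current group.
   * Phase 2: the prior group has closed; one downstream consumer reads the sorted output.
   */
  static int peakConcurrentStages(final int priorGroupSize, final int currentGroupSize)
  {
    final int beforeBreak = priorGroupSize + currentGroupSize;
    final int afterBreak = currentGroupSize + 1; // single slot for a downstream consumer
    return Math.max(beforeBreak, afterBreak);
  }

  public static void main(String[] args)
  {
    final int maxConcurrentStages = 4; // assumed value, for illustration only
    final int priorGroupSize = 2;      // assumed; prior group outputs via MEMORY

    // Current group sized as in the diff: getMaxConcurrentStages() - priorGroup.size().
    final int currentGroupSize = maxConcurrentStages - priorGroupSize;
    System.out.println(peakConcurrentStages(priorGroupSize, currentGroupSize)); // prints 4

    // Reserving slots for the prior group and the consumer together would shrink the group to 1.
    final int doubleReserved = maxConcurrentStages - priorGroupSize - 1;
    System.out.println(peakConcurrentStages(priorGroupSize, doubleReserved)); // prints 3
  }
}
```

Under these assumptions the peak stays within the limit without subtracting both `priorGroup.size()` and `1`, since the two reservations apply to different phases; subtracting both would leave a slot idle.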


