kfaraz commented on code in PR #18510:
URL: https://github.com/apache/druid/pull/18510#discussion_r2377719138


##########
embedded-tests/src/test/java/org/apache/druid/testing/embedded/indexing/IndexTaskTest.java:
##########
@@ -99,6 +100,9 @@ public void test_runIndexTask_forInlineDatasource()
     }
 
     cluster.callApi().waitForAllSegmentsToBeAvailable(dataSource, coordinator, broker);
+    broker.latchableEmitter().waitForEvent(
+        event -> event.hasDimension(DruidMetrics.DATASOURCE, dataSource)
+    );

Review Comment:
   Please remove this.



##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/hrtr/HttpRemoteTaskRunner.java:
##########
@@ -1543,6 +1550,9 @@ public void taskAddedOrUpdated(final TaskAnnouncement announcement, final Worker
                   HttpRemoteTaskRunnerWorkItem.State.RUNNING
               );
               tasks.put(taskId, taskItem);
+              final ServiceMetricEvent.Builder metricBuilder = new ServiceMetricEvent.Builder();
+              metricBuilder.setDimension(DruidMetrics.TASK_ID, taskId);
+              emitter.emit(metricBuilder.setMetric(TASK_UNKNOWN_COUNT, (long) 1));

Review Comment:
   Nit:
   ```suggestion
                 emitter.emit(metricBuilder.setMetric(TASK_UNKNOWN_COUNT, 1L));
   ```



##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/hrtr/HttpRemoteTaskRunner.java:
##########
@@ -1543,6 +1550,9 @@ public void taskAddedOrUpdated(final TaskAnnouncement announcement, final Worker
                   HttpRemoteTaskRunnerWorkItem.State.RUNNING
               );
               tasks.put(taskId, taskItem);
+              final ServiceMetricEvent.Builder metricBuilder = new ServiceMetricEvent.Builder();
+              metricBuilder.setDimension(DruidMetrics.TASK_ID, taskId);
+              emitter.emit(metricBuilder.setMetric(TASK_UNKNOWN_COUNT, (long) 1));

Review Comment:
   IIUC, this metric is emitted whenever the Overlord is notified of a running task for which it doesn't have a work item in the runner. If that is the case, `task/unknown/count` may not be the best name for it, since the task is not really unknown (the `TaskQueue` and `TaskStorage` know about it).
   
   Also, the emission of this metric is susceptible to a race condition: after the Overlord restarts, the `TaskQueue` may already have called `taskRunner.run()`, making the `taskItem` non-null and thus skipping the emission of this metric.
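   
   To make the ordering concern concrete, here is a minimal sketch using a hypothetical `RunnerRaceSketch` class, not the actual `HttpRemoteTaskRunner` code. It assumes the metric is emitted only when no work item exists at announcement time and that `run()` registers a work item if none is present; all names and states in it are illustrative.

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical, simplified stand-in for the runner; NOT the actual Druid implementation.
   final class RunnerRaceSketch
   {
     // taskId -> state of the work item held by the runner
     private final Map<String, String> tasks = new ConcurrentHashMap<>();
     // Stands in for emissions of the metric under review
     private final AtomicInteger emittedCount = new AtomicInteger();

     // Models the TaskQueue calling taskRunner.run(): registers a work item if none exists.
     void run(String taskId)
     {
       tasks.putIfAbsent(taskId, "PENDING");
     }

     // Models a worker announcing a running task.
     // The metric is emitted only if the runner had no work item for the task.
     void taskAddedOrUpdated(String taskId)
     {
       if (tasks.putIfAbsent(taskId, "RUNNING") == null) {
         emittedCount.incrementAndGet();
       } else {
         tasks.put(taskId, "RUNNING");
       }
     }

     int emittedCount()
     {
       return emittedCount.get();
     }

     // Ordering 1: the worker announcement arrives before the TaskQueue calls run().
     static int emissionsWhenWorkerAnnouncesFirst()
     {
       RunnerRaceSketch runner = new RunnerRaceSketch();
       runner.taskAddedOrUpdated("task1");
       runner.run("task1");
       return runner.emittedCount();
     }

     // Ordering 2: the TaskQueue calls run() before the worker announcement arrives.
     static int emissionsWhenTaskQueueRunsFirst()
     {
       RunnerRaceSketch runner = new RunnerRaceSketch();
       runner.run("task1");
       runner.taskAddedOrUpdated("task1");
       return runner.emittedCount();
     }

     public static void main(String[] args)
     {
       System.out.println(emissionsWhenWorkerAnnouncesFirst()); // prints 1
       System.out.println(emissionsWhenTaskQueueRunsFirst());   // prints 0
     }
   }
   ```

   Under these assumptions, whether the metric fires depends only on which call reaches the map first, so a test that waits for the metric could hang in the second ordering.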



##########
indexing-service/src/test/java/org/apache/druid/indexing/common/task/batch/parallel/TaskMonitorTest.java:
##########
@@ -296,10 +305,23 @@ public ListenableFuture<Void> runTask(String taskId, Object taskObject)
       if (task.throwUnknownTypeIdError) {
         throw new RuntimeException(new ISE("Could not resolve type id 'test_task_id'"));
       }
-      taskRunner.submit(() -> tasks.put(task.getId(), task.run(null).getStatusCode()));
+      TaskToolbox taskToolbox = makeToolbox();
+      taskRunner.submit(() -> tasks.put(task.getId(), task.run(taskToolbox).getStatusCode()));

Review Comment:
   Just curious, why was this change needed?



##########
embedded-tests/src/test/java/org/apache/druid/testing/embedded/server/HttpRemoteTaskRunnerWorkerFailTest.java:
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.testing.embedded.server;
+
+import org.apache.druid.client.indexing.TaskStatusResponse;
+import org.apache.druid.common.utils.IdUtils;
+import org.apache.druid.indexer.TaskState;
+import org.apache.druid.indexing.common.task.NoopTask;
+import org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner;
+import org.apache.druid.query.DruidMetrics;
+import org.apache.druid.segment.TestDataSource;
+import org.apache.druid.testing.embedded.EmbeddedBroker;
+import org.apache.druid.testing.embedded.EmbeddedCoordinator;
+import org.apache.druid.testing.embedded.EmbeddedDruidCluster;
+import org.apache.druid.testing.embedded.EmbeddedIndexer;
+import org.apache.druid.testing.embedded.EmbeddedOverlord;
+import org.apache.druid.testing.embedded.junit5.EmbeddedClusterTestBase;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+public class HttpRemoteTaskRunnerWorkerFailTest extends EmbeddedClusterTestBase
+{
+  private final EmbeddedOverlord overlord = new EmbeddedOverlord();
+  private final EmbeddedIndexer indexer = new EmbeddedIndexer();
+
+  @Override
+  public EmbeddedDruidCluster createCluster()
+  {
+    return EmbeddedDruidCluster.withEmbeddedDerbyAndZookeeper()
+        .useLatchableEmitter()
+        .addServer(new EmbeddedCoordinator())
+        .addServer(new EmbeddedBroker())
+        .addServer(overlord)
+        .addServer(indexer);
+  }
+
+  @Test
+  public void test_overlord_marksTaskAsFailed_ifIndexerCrashes() throws Exception
+  {
+    final String taskId = IdUtils.newTaskId("sim_test_noop", TestDataSource.WIKI, null);
+    cluster.callApi().onLeaderOverlord(
+        o -> o.runTask(taskId, new NoopTask(taskId, null, null, 8000L, 0L, null))
+    );
+    // wait for the overlord to dispatch the task and worker start it
+    indexer.latchableEmitter().waitForEvent(
+        event -> event.hasMetricName(NoopTask.EVENT_STARTED)
+    );
+    overlord.stop();
+    overlord.start();
+    // give some time for the overlord to load the task from the worker
+    overlord.latchableEmitter().waitForEvent(
+        event -> event.hasMetricName(HttpRemoteTaskRunner.TASK_UNKNOWN_COUNT)

Review Comment:
   I think the emission of this metric is susceptible to a race condition. If the `TaskQueue` calls `taskRunner.run()` before the Overlord discovers the task, the metric may never be emitted.
   
   What happens if we don't wait for the Overlord to re-discover the task from the worker and just proceed with killing it off?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]