Re: [PR] Handle uninitialized cache in Node role watchers (druid)

via GitHub Fri, 23 Feb 2024 12:09:40 -0800


gianm commented on code in PR #15726:
URL: https://github.com/apache/druid/pull/15726#discussion_r1501116090



##########
server/src/main/java/org/apache/druid/discovery/BaseNodeRoleWatcher.java:
##########
@@ -209,13 +217,52 @@ private void removeNode(DiscoveryDruidNode druidNode)
     }
   }
 
+  public void cacheInitializedTimedOut()
+  {
+    synchronized (lock) {
+      // No need to wait on CountDownLatch, because we are holding the lock 
under which it could only be
+      // counted down.
+      if (cacheInitialized.getCount() == 0) {
+        LOGGER.warn("cache is already initialized. ignoring timeout.");

Review Comment:
   Please update this log message (and all others that don't include it) to 
include the role, like: `Cache for role[%s] is already initialized. Ignoring 
timeout.`



##########
server/src/main/java/org/apache/druid/rpc/DiscoveryServiceLocator.java:
##########
@@ -157,5 +157,18 @@ public void nodeViewInitialized()
         }
       }
     }
+
+    @Override
+    public void nodeViewInitializedTimedOut()
+    {
+      synchronized (DiscoveryServiceLocator.this) {

Review Comment:
   Since it's the same as `nodeViewInitialized()`, this should just call 
`nodeViewInitialized();`



##########
server/src/main/java/org/apache/druid/discovery/BaseNodeRoleWatcher.java:
##########
@@ -209,13 +217,52 @@ private void removeNode(DiscoveryDruidNode druidNode)
     }
   }
 
+  public void cacheInitializedTimedOut()

Review Comment:
   This patch has a similar issue to the one described here: 
https://github.com/apache/druid/pull/15726#pullrequestreview-1875684108.
   
   It is relying on `getAllNodes()` to be called in order to detect that 
initialization has timed out. But `getAllNodes()` is not necessarily ever 
called. There needs to be some backstop that causes initialization to time out 
even if `getAllNodes()` is never called. Please add a test for this case as 
well.
   
   You will probably need a `ScheduledExecutorService` to make this happen. It 
could be owned by the `CuratorDruidNodeDiscoveryProvider` and shared across all 
watchers. I think it would work to change the `ExecutorService 
listenerExecutor` to a `ScheduledExecutorService` and use that.



##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/hrtr/HttpRemoteTaskRunner.java:
##########
@@ -542,6 +542,13 @@ public void nodeViewInitialized()
         //CountDownLatch.countDown() does nothing when count has already 
reached 0.
         workerViewInitialized.countDown();
       }
+
+      @Override
+      public void nodeViewInitializedTimedOut()
+      {
+        //CountDownLatch.countDown() does nothing when count has already 
reached 0.
+        workerViewInitialized.countDown();

Review Comment:
   If we want a legit initialization and timeout to have the same behavior for 
a given impl, best to simply call `nodeViewInitialized();` here. It's clearer.



##########
server/src/main/java/org/apache/druid/discovery/BaseNodeRoleWatcher.java:
##########
@@ -209,13 +217,52 @@ private void removeNode(DiscoveryDruidNode druidNode)
     }
   }
 
+  public void cacheInitializedTimedOut()
+  {
+    synchronized (lock) {
+      // No need to wait on CountDownLatch, because we are holding the lock 
under which it could only be
+      // counted down.
+      if (cacheInitialized.getCount() == 0) {
+        LOGGER.warn("cache is already initialized. ignoring timeout.");
+        return;
+      }
+
+      // It is important to take a snapshot here as list of nodes might change 
by the time listeners process
+      // the changes.
+      List<DiscoveryDruidNode> currNodes = Lists.newArrayList(nodes.values());
+      LOGGER.info(
+          "Node watcher of role [%s] is now initialized with %d nodes.",
+          nodeRole.getJsonName(),
+          currNodes.size());
+
+      for (DruidNodeDiscovery.Listener listener : nodeListeners) {
+        safeSchedule(
+            () -> {
+              listener.nodesAdded(currNodes);
+              listener.nodeViewInitializedTimedOut();
+            },
+            "Exception occurred in nodesAdded([%s]) in listener [%s].",
+            currNodes,
+            listener
+        );
+      }
+
+      cacheInitialized.countDown();
+      cacheInitializationTimedOut = true;

Review Comment:
   We should set `cacheInitializationTimedOut` prior to counting down 
`cacheInitialized`, because otherwise, an `await` that completes cannot 
reliably tell if the countDown was due to timeout or not.



##########
server/src/main/java/org/apache/druid/discovery/DruidNodeDiscoveryProvider.java:
##########
@@ -235,6 +235,28 @@ public void nodeViewInitialized()
           }
         }
       }
+
+      @Override
+      public void nodeViewInitializedTimedOut()
+      {
+        synchronized (lock) {
+          if (uninitializedNodeRoles == 0) {
+            log.error("Unexpected call of nodeViewInitialized()");
+            return;
+          }
+          uninitializedNodeRoles--;
+          if (uninitializedNodeRoles == 0) {
+            for (Listener listener : listeners) {
+              try {
+                listener.nodeViewInitialized();
+              }
+              catch (Exception ex) {
+                log.error(ex, "Listener[%s].nodeViewInitialized() threw 
exception. Ignored.", listener);

Review Comment:
   Should mention `nodeViewInitializedTimedOut()`, once the call is updated.



##########
server/src/main/java/org/apache/druid/discovery/DruidNodeDiscoveryProvider.java:
##########
@@ -235,6 +235,28 @@ public void nodeViewInitialized()
           }
         }
       }
+
+      @Override
+      public void nodeViewInitializedTimedOut()
+      {
+        synchronized (lock) {
+          if (uninitializedNodeRoles == 0) {
+            log.error("Unexpected call of nodeViewInitialized()");
+            return;
+          }
+          uninitializedNodeRoles--;
+          if (uninitializedNodeRoles == 0) {
+            for (Listener listener : listeners) {
+              try {
+                listener.nodeViewInitialized();

Review Comment:
   Should call `nodeViewInitializedTimedOut()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Handle uninitialized cache in Node role watchers (druid)

Reply via email to