gianm commented on code in PR #15726:
URL: https://github.com/apache/druid/pull/15726#discussion_r1501116090
##########
server/src/main/java/org/apache/druid/discovery/BaseNodeRoleWatcher.java:
##########
@@ -209,13 +217,52 @@ private void removeNode(DiscoveryDruidNode druidNode)
}
}
+ public void cacheInitializedTimedOut()
+ {
+ synchronized (lock) {
+ // No need to wait on CountDownLatch, because we are holding the lock
under which it could only be
+ // counted down.
+ if (cacheInitialized.getCount() == 0) {
+ LOGGER.warn("cache is already initialized. ignoring timeout.");
Review Comment:
Please update this log message (and all others that don't include it) to
include the role, like: `Cache for role[%s] is already initialized. Ignoring
timeout.`
##########
server/src/main/java/org/apache/druid/rpc/DiscoveryServiceLocator.java:
##########
@@ -157,5 +157,18 @@ public void nodeViewInitialized()
}
}
}
+
+ @Override
+ public void nodeViewInitializedTimedOut()
+ {
+ synchronized (DiscoveryServiceLocator.this) {
Review Comment:
Since it's the same as `nodeViewInitialized()`, this should just call
`nodeViewInitialized();`
##########
server/src/main/java/org/apache/druid/discovery/BaseNodeRoleWatcher.java:
##########
@@ -209,13 +217,52 @@ private void removeNode(DiscoveryDruidNode druidNode)
}
}
+ public void cacheInitializedTimedOut()
Review Comment:
This patch has a similar issue to the one described here:
https://github.com/apache/druid/pull/15726#pullrequestreview-1875684108.
It is relying on `getAllNodes()` to be called in order to detect that
initialization has timed out. But `getAllNodes()` is not necessarily ever
called. There needs to be some backstop that causes initialization to time out
even if `getAllNodes()` is never called. Please add a test for this case as
well.
You will probably need a `ScheduledExecutorService` to make this happen. It
could be owned by the `CuratorDruidNodeDiscoveryProvider` and shared across all
watchers. I think it would work to change the `ExecutorService
listenerExecutor` to a `ScheduledExecutorService` and use that.
##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/hrtr/HttpRemoteTaskRunner.java:
##########
@@ -542,6 +542,13 @@ public void nodeViewInitialized()
//CountDownLatch.countDown() does nothing when count has already
reached 0.
workerViewInitialized.countDown();
}
+
+ @Override
+ public void nodeViewInitializedTimedOut()
+ {
+ //CountDownLatch.countDown() does nothing when count has already
reached 0.
+ workerViewInitialized.countDown();
Review Comment:
If we want a legit initialization and timeout to have the same behavior for
a given impl, best to simply call `nodeViewInitialized();` here. It's clearer.
##########
server/src/main/java/org/apache/druid/discovery/BaseNodeRoleWatcher.java:
##########
@@ -209,13 +217,52 @@ private void removeNode(DiscoveryDruidNode druidNode)
}
}
+ public void cacheInitializedTimedOut()
+ {
+ synchronized (lock) {
+ // No need to wait on CountDownLatch, because we are holding the lock
under which it could only be
+ // counted down.
+ if (cacheInitialized.getCount() == 0) {
+ LOGGER.warn("cache is already initialized. ignoring timeout.");
+ return;
+ }
+
+ // It is important to take a snapshot here as list of nodes might change
by the time listeners process
+ // the changes.
+ List<DiscoveryDruidNode> currNodes = Lists.newArrayList(nodes.values());
+ LOGGER.info(
+ "Node watcher of role [%s] is now initialized with %d nodes.",
+ nodeRole.getJsonName(),
+ currNodes.size());
+
+ for (DruidNodeDiscovery.Listener listener : nodeListeners) {
+ safeSchedule(
+ () -> {
+ listener.nodesAdded(currNodes);
+ listener.nodeViewInitializedTimedOut();
+ },
+ "Exception occurred in nodesAdded([%s]) in listener [%s].",
+ currNodes,
+ listener
+ );
+ }
+
+ cacheInitialized.countDown();
+ cacheInitializationTimedOut = true;
Review Comment:
We should set `cacheInitializationTimedOut` prior to counting down
`cacheInitialized`, because otherwise, an `await` that completes cannot
reliably tell if the countDown was due to timeout or not.
##########
server/src/main/java/org/apache/druid/discovery/DruidNodeDiscoveryProvider.java:
##########
@@ -235,6 +235,28 @@ public void nodeViewInitialized()
}
}
}
+
+ @Override
+ public void nodeViewInitializedTimedOut()
+ {
+ synchronized (lock) {
+ if (uninitializedNodeRoles == 0) {
+ log.error("Unexpected call of nodeViewInitialized()");
+ return;
+ }
+ uninitializedNodeRoles--;
+ if (uninitializedNodeRoles == 0) {
+ for (Listener listener : listeners) {
+ try {
+ listener.nodeViewInitialized();
+ }
+ catch (Exception ex) {
+ log.error(ex, "Listener[%s].nodeViewInitialized() threw
exception. Ignored.", listener);
Review Comment:
Should mention `nodeViewInitializedTimedOut()`, once the call is updated.
##########
server/src/main/java/org/apache/druid/discovery/DruidNodeDiscoveryProvider.java:
##########
@@ -235,6 +235,28 @@ public void nodeViewInitialized()
}
}
}
+
+ @Override
+ public void nodeViewInitializedTimedOut()
+ {
+ synchronized (lock) {
+ if (uninitializedNodeRoles == 0) {
+ log.error("Unexpected call of nodeViewInitialized()");
+ return;
+ }
+ uninitializedNodeRoles--;
+ if (uninitializedNodeRoles == 0) {
+ for (Listener listener : listeners) {
+ try {
+ listener.nodeViewInitialized();
Review Comment:
Should call `nodeViewInitializedTimedOut()`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]