xvrl opened a new issue, #14387:
URL: https://github.com/apache/druid/issues/14387
It is possible for middle managers (and potentially indexers) to be
discoverable but somehow unreachable by the overlord.
In that case, when an overlord restarts, it will fail to regain leadership
while waiting for all the workers to sync.
This problem arises when running with `druid.indexer.runner.type=httpRemote`,
where the overlord startup sequence relies on first syncing state with all the
worker nodes. If a worker node is unresponsive or there are
network connectivity issues, it might make sense for the overlord to
ignore those nodes and proceed with startup, rather than fail to become leader.
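
The proposed behavior could look roughly like the sketch below. It is only an illustration under stated assumptions: `LenientWorkerSync`, `SyncResult`, `syncAll`, and the `waitForSync` callback are hypothetical names, not Druid APIs, and the second worker host is made up for the example. The idea is that a worker that fails or times out during the initial sync is recorded as skipped instead of aborting the whole startup.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical sketch; none of these names are Druid APIs. */
final class LenientWorkerSync {

    /** Outcome of the startup sync: which workers are usable and which were skipped. */
    static final class SyncResult {
        final List<String> synced = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Tries to sync with every discovered worker. The waitForSync callback is an
     * assumed stand-in for a per-worker "block until initial sync or timeout" call;
     * it returns false (or throws) when the worker could not be synced.
     */
    static SyncResult syncAll(List<String> workers, Duration timeout,
                              BiPredicate<String, Duration> waitForSync) {
        SyncResult result = new SyncResult();
        for (String worker : workers) {
            boolean ok;
            try {
                ok = waitForSync.test(worker, timeout);
            } catch (RuntimeException e) {
                // Treat an error like a timeout instead of failing the whole startup.
                ok = false;
            }
            if (ok) {
                result.synced.add(worker);
            } else {
                result.skipped.add(worker);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // middlemanager-1:8091 is an invented second worker for illustration.
        List<String> workers = List.of("middlemanager-0:8091", "middlemanager-1:8091");
        SyncResult r = syncAll(workers, Duration.ofSeconds(5),
            (worker, timeout) -> !worker.equals("middlemanager-0:8091"));
        System.out.println("synced=" + r.synced + " skipped=" + r.skipped);
    }
}
```

In a real fix, the skipped workers would presumably still need to be retried or marked unusable later; the sketch leaves that policy open.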
### Affected Version
24.0.2
### Description
Attached is a sample output log illustrating how startup might fail.
```
listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=java.lang.reflect.InvocationTargetException}
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:161)
    at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:98)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698)
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.reflect.InvocationTargetException
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446)
    at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341)
    at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:158)
    ... 7 more
Caused by: java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed to sync with worker[middlemanager-0:8091].
    at org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner.start(HttpRemoteTaskRunner.java:285)
    ... 14 more
Caused by: org.apache.druid.java.util.common.RE: Failed to sync with worker[middlemanager-0:8091].
    at org.apache.druid.indexing.overlord.hrtr.WorkerHolder.waitForInitialization(WorkerHolder.java:344)
    at org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner.startWorkersHandling(HttpRemoteTaskRunner.java:560)
    at org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner.start(HttpRemoteTaskRunner.java:265)
    ... 14 more
```
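
The trace above nests the actual failure ("Failed to sync with worker[...]") three levels deep, because the start method is invoked reflectively and each layer wraps the exception. The following is a small self-contained sketch of that wrapping pattern and of walking the cause chain to the root. `RootCauseDemo`, `failingStart`, `invokeAndCapture`, and `rootCause` are hypothetical names for illustration, and `IllegalStateException` stands in for Druid's `RE` class; this is not Druid code.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

/** Hypothetical demo of the exception wrapping seen in the trace; not Druid code. */
final class RootCauseDemo {

    /** Stand-in for a lifecycle start method that fails while syncing a worker. */
    public static void failingStart() {
        throw new RuntimeException(
            new IllegalStateException("Failed to sync with worker[middlemanager-0:8091]."));
    }

    /** Invokes failingStart reflectively and returns the wrapped failure. */
    static Throwable invokeAndCapture() {
        try {
            Method start = RootCauseDemo.class.getDeclaredMethod("failingStart");
            start.invoke(null);
            return null; // not reached in this demo, since failingStart always throws
        } catch (InvocationTargetException e) {
            // Method.invoke wraps whatever the target threw; wrap once more,
            // mirroring the outermost RuntimeException in the trace.
            return new RuntimeException(e);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    /** Follows getCause() until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        Throwable wrapped = invokeAndCapture();
        System.out.println("outer: " + wrapped);
        System.out.println("root:  " + rootCause(wrapped));
    }
}
```

Reading the real trace the same way, the last `Caused by` entry (thrown from `WorkerHolder.waitForInitialization`) is the one that identifies the unreachable worker.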