xvrl opened a new issue, #14387:
URL: https://github.com/apache/druid/issues/14387

   It is possible for middle managers (and potentially indexers) to be 
discoverable but unreachable by the overlord.
   In that case, when an overlord restarts, it fails to regain leadership 
while waiting for all workers to sync.
   
   This problem arises when running with `druid.indexer.runner.type=httpRemote`, 
where the overlord startup sequence relies on first syncing state with all 
worker nodes. If a worker node is unresponsive or there are network 
connectivity issues, it might make sense for the overlord to ignore those 
nodes and proceed with startup rather than fail to become leader.
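
A minimal sketch of the proposed behavior, assuming each worker's initial sync can be reduced to a success/failure check. The class `TolerantWorkerSync`, the method `syncReachableWorkers`, and the second worker `middlemanager-1:8091` are hypothetical names for illustration only; this is not Druid's `HttpRemoteTaskRunner` or `WorkerHolder` code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only: not part of Druid.
public class TolerantWorkerSync {

    /**
     * Attempts the initial sync for every worker and returns the hosts that
     * succeeded. A worker whose sync reports failure or throws is logged and
     * skipped, so one unreachable worker no longer aborts startup.
     */
    public static List<String> syncReachableWorkers(Map<String, BooleanSupplier> workers) {
        List<String> synced = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> entry : workers.entrySet()) {
            boolean ok;
            try {
                // Assumption: the supplier wraps a sync attempt bounded by a timeout.
                ok = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;
            }
            if (ok) {
                synced.add(entry.getKey());
            } else {
                System.err.println("Skipping worker[" + entry.getKey() + "]: initial sync failed");
            }
        }
        return synced;
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> workers = new LinkedHashMap<>();
        workers.put("middlemanager-0:8091", () -> false); // unreachable worker
        workers.put("middlemanager-1:8091", () -> true);  // healthy worker (hypothetical)
        System.out.println(syncReachableWorkers(workers));
    }
}
```

The design choice here is to collect failures rather than rethrow the first one. A real change would also need to decide how skipped workers are retried or marked once the overlord has become leader.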
   
   ### Affected Version
   
   24.0.2
   
   ### Description
   
   Attached is a sample output log illustrating how startup might fail.
   
   ```
   listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=java.lang.reflect.InvocationTargetException}
   
   java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:161)
        at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:98)
        at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702)
        at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
   Caused by: java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446)
        at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341)
        at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:158)
        ... 7 more
   Caused by: java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed to sync with worker[middlemanager-0:8091].
        at org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner.start(HttpRemoteTaskRunner.java:285)
        ... 14 more
   Caused by: org.apache.druid.java.util.common.RE: Failed to sync with worker[middlemanager-0:8091].
        at org.apache.druid.indexing.overlord.hrtr.WorkerHolder.waitForInitialization(WorkerHolder.java:344)
        at org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner.startWorkersHandling(HttpRemoteTaskRunner.java:560)
        at org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner.start(HttpRemoteTaskRunner.java:265)
        ... 14 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

