cryptoe opened a new pull request, #14038:
URL: https://github.com/apache/druid/pull/14038

   The Overlord fails to become the leader if it encounters an exception while 
getting the task locks from the metadata store. 
   
   I saw this problem on a cluster that was running an MSQ job from before 
this patch: https://github.com/apache/druid/pull/13282, which changes the task 
priority. 
   During the upgrade, the Overlord failed to start due to:
   
   
   
   ```
   2023-03-28T15:59:53,571 ERROR [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=java.lang.reflect.InvocationTargetException}
   java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:179)
        at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:98)
        at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
   Caused by: java.lang.reflect.InvocationTargetException
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446)
        at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341)
        at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:176)
   Caused by: java.lang.IllegalArgumentException: lock priority[0] is different from task priority[50]
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148) ~[guava-16.0.1.jar:?]
        at org.apache.druid.indexing.overlord.TaskLockbox.verifyAndCreateOrFindLockPosse(TaskLockbox.java:260)
        at org.apache.druid.indexing.overlord.TaskLockbox.syncFromStorage(TaskLockbox.java:169)
        at org.apache.druid.indexing.overlord.TaskQueue.start(TaskQueue.java:179)
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446)
        at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341)
        at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:176)
        ... 5 more
   ```
   
   This PR fixes the DB sync so that any task groups whose task locks cannot 
be synced with the DB are identified first. 
   The Overlord then kills all the tasks in those task groups, piggybacking 
on the logic introduced in this patch: 
https://github.com/apache/druid/pull/13172
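
The two-step approach described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `LockSyncSketch`, `Task`, `StoredLock`, `findFailedGroups` and `tasksToKill` are hypothetical names, and a priority mismatch (as in the stack trace) is assumed to be the only reason a lock fails to sync.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types are Druid classes.
final class LockSyncSketch {
    static final class Task {
        final String id;
        final String groupId;
        final int priority;

        Task(String id, String groupId, int priority) {
            this.id = id;
            this.groupId = groupId;
            this.priority = priority;
        }
    }

    static final class StoredLock {
        final String taskId;
        final int priority;

        StoredLock(String taskId, int priority) {
            this.taskId = taskId;
            this.priority = priority;
        }
    }

    // Step 1: collect the groups that own at least one lock that cannot be
    // reconciled with its task, instead of throwing on the first mismatch.
    static Set<String> findFailedGroups(List<StoredLock> locks, Map<String, Task> tasksById) {
        Set<String> failedGroups = new LinkedHashSet<>();
        for (StoredLock lock : locks) {
            Task task = tasksById.get(lock.taskId);
            if (task == null) {
                continue; // locks without a known task are ignored in this sketch
            }
            if (lock.priority != task.priority) {
                failedGroups.add(task.groupId);
            }
        }
        return failedGroups;
    }

    // Step 2: every task in a failed group is marked to be killed,
    // including tasks whose own locks synced fine.
    static List<String> tasksToKill(Set<String> failedGroups, Map<String, Task> tasksById) {
        List<String> ids = new ArrayList<>();
        for (Task task : tasksById.values()) {
            if (failedGroups.contains(task.groupId)) {
                ids.add(task.id);
            }
        }
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Task> tasks = new LinkedHashMap<>();
        tasks.put("a1", new Task("a1", "groupA", 50));
        tasks.put("a2", new Task("a2", "groupA", 50));
        tasks.put("b1", new Task("b1", "groupB", 0));

        // The lock for a1 was stored with priority 0, but the task now has priority 50.
        List<StoredLock> locks = Arrays.asList(new StoredLock("a1", 0), new StoredLock("b1", 0));

        Set<String> failed = findFailedGroups(locks, tasks);
        System.out.println("failed groups: " + failed);
        System.out.println("tasks to kill: " + tasksToKill(failed, tasks));
    }
}
```

The point of splitting the work into two passes is that a single bad lock no longer aborts the whole sync, so leadership can still be acquired while the affected task groups are cleaned up.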
   
   
   ##### Key changed/added classes in this PR
    * `TaskLockbox`
   <hr>
   
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

