cryptoe opened a new pull request, #14038: URL: https://github.com/apache/druid/pull/14038
The Overlord fails to become the leader if it encounters an exception while getting the task locks from the metadata store. I saw this problem when one of the clusters was running an MSQ job from before this patch, which changes task priority: https://github.com/apache/druid/pull/13282

During the upgrade, the Overlord failed to start due to:

```
2023-03-28T15:59:53,571 ERROR [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - listener becomeLeader() failed. Unable to become leader: {class=org.apache.druid.curator.discovery.CuratorDruidLeaderSelector, exceptionType=class java.lang.RuntimeException, exceptionMessage=java.lang.reflect.InvocationTargetException}
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:179)
	at org.apache.druid.curator.discovery.CuratorDruidLeaderSelector$1.isLeader(CuratorDruidLeaderSelector.java:98)
	at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.lang.reflect.InvocationTargetException
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446)
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341)
	at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:176)
Caused by: java.lang.IllegalArgumentException: lock priority[0] is different from task priority[50]
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148) ~[guava-16.0.1.jar:?]
	at org.apache.druid.indexing.overlord.TaskLockbox.verifyAndCreateOrFindLockPosse(TaskLockbox.java:260)
	at org.apache.druid.indexing.overlord.TaskLockbox.syncFromStorage(TaskLockbox.java:169)
	at org.apache.druid.indexing.overlord.TaskQueue.start(TaskQueue.java:179)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446)
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341)
	at org.apache.druid.indexing.overlord.TaskMaster$1.becomeLeader(TaskMaster.java:176)
	... 5 more
```

I fixed the DB sync so that any task groups whose task locks cannot be synced with the DB are identified first. The Overlord then kills all the tasks in those task groups, piggybacking on the logic introduced in this patch: https://github.com/apache/druid/pull/13172

##### Key changed/added classes in this PR
 * `TaskLockbox`

<hr>

This PR has:
- [x] been self-reviewed.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
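
The recovery approach described above (identify the task groups whose locks fail to sync, then fail those groups instead of aborting leadership) can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `LockSyncSketch`, `StoredLock`, `SyncResult`, and `verify` are hypothetical stand-ins, and the priority check only mimics the `IllegalArgumentException` from the stack trace. Dropping already-accepted locks of a failed group is also an assumption about how the group would be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-ins; none of these are actual Druid classes. */
final class LockSyncSketch {

    /** A persisted lock as it might be read back from the metadata store. */
    static final class StoredLock {
        final String taskId;
        final String groupId;
        final int lockPriority;
        final int taskPriority;

        StoredLock(String taskId, String groupId, int lockPriority, int taskPriority) {
            this.taskId = taskId;
            this.groupId = groupId;
            this.lockPriority = lockPriority;
            this.taskPriority = taskPriority;
        }
    }

    /** Outcome of a sync: locks that were accepted, and task groups that should be failed. */
    static final class SyncResult {
        final List<StoredLock> synced = new ArrayList<>();
        final Set<String> failedGroupIds = new LinkedHashSet<>();
    }

    /** Mimics the kind of validation that threw during startup in the trace above. */
    static void verify(StoredLock lock) {
        if (lock.lockPriority != lock.taskPriority) {
            throw new IllegalArgumentException(String.format(
                "lock priority[%d] is different from task priority[%d]",
                lock.lockPriority, lock.taskPriority));
        }
    }

    /** Syncs every lock, recording failures per task group instead of propagating them. */
    static SyncResult syncFromStorage(List<StoredLock> stored) {
        SyncResult result = new SyncResult();
        for (StoredLock lock : stored) {
            try {
                verify(lock);
                result.synced.add(lock);
            } catch (IllegalArgumentException e) {
                // One bad lock marks its whole task group for failure.
                result.failedGroupIds.add(lock.groupId);
            }
        }
        // Assumption: locks accepted earlier for a group that later failed are dropped too.
        result.synced.removeIf(l -> result.failedGroupIds.contains(l.groupId));
        return result;
    }

    public static void main(String[] args) {
        SyncResult result = syncFromStorage(Arrays.asList(
            new StoredLock("task-a", "group-1", 0, 50),
            new StoredLock("task-b", "group-2", 50, 50)));
        System.out.println("Groups to fail: " + result.failedGroupIds);
        System.out.println("Locks synced: " + result.synced.size());
    }
}
```

The point of the sketch is that a single inconsistent lock no longer throws out of the sync step, so leader election can complete and the affected task groups can be cleaned up afterwards.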
