[ https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar reopened SOLR-8069:
-----------------------------------------
There's a reproducible failure in the test added by SOLR-8075, caused by an
AssertionError from the asserts added in this issue.
{code}
1 tests failed.
FAILED:
org.apache.solr.cloud.LeaderInitiatedRecoveryOnShardRestartTest.testRestartWithAllInLIR
Error Message:
Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]
Stack Trace:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]
Caused by: java.lang.AssertionError
at __randomizedtesting.SeedInfo.seed([7F78F76DDF75FAD1]:0)
at org.apache.solr.cloud.ZkController.updateLeaderInitiatedRecoveryState(ZkController.java:2133)
at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:434)
at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:197)
at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:157)
at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:346)
at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1113)
at org.apache.solr.cloud.ZkController.register(ZkController.java:926)
at org.apache.solr.cloud.ZkController.register(ZkController.java:881)
at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:183)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}
The assertion leaderCd != null fails because
ShardLeaderElectionContext.runLeaderProcess calls
ZkController.updateLeaderInitiatedRecoveryState with a null core descriptor.
That is by design: marking a replica as 'active' does not necessarily require
the caller to be the leader.
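To make the intended precondition concrete, here is a minimal Java sketch, assuming the leader check should apply only when a replica is being put into LIR (DOWN or RECOVERING), not when it is being marked 'active'. The class LirGuardSketch, the ReplicaState enum and mayUpdateLirState are hypothetical illustrations, not Solr's ZkController API; the sketch also assumes the caller has already read the core node name of the leader currently registered in ZooKeeper.

```java
import java.util.Objects;

public class LirGuardSketch {

    // Hypothetical subset of replica states; not Solr's actual enum.
    enum ReplicaState { ACTIVE, DOWN, RECOVERING }

    /**
     * Hypothetical check: may the caller write newState for a replica's LIR entry?
     * callerCoreNodeName is null when no leader core descriptor was passed,
     * as in the runLeaderProcess call described in this issue.
     */
    static boolean mayUpdateLirState(ReplicaState newState,
                                     String callerCoreNodeName,
                                     String registeredLeaderCoreNodeName) {
        if (newState == ReplicaState.ACTIVE) {
            // Marking a replica active clears LIR, so no leader is required.
            return true;
        }
        // Putting a replica into LIR requires the caller to be the leader
        // currently registered in ZooKeeper (assumed to be looked up already).
        return callerCoreNodeName != null
            && Objects.equals(callerCoreNodeName, registeredLeaderCoreNodeName);
    }

    public static void main(String[] args) {
        System.out.println(mayUpdateLirState(ReplicaState.ACTIVE, null, "core_node1"));       // true
        System.out.println(mayUpdateLirState(ReplicaState.DOWN, null, "core_node1"));         // false
        System.out.println(mayUpdateLirState(ReplicaState.DOWN, "core_node2", "core_node1")); // false
        System.out.println(mayUpdateLirState(ReplicaState.DOWN, "core_node1", "core_node1")); // true
    }
}
```

Returning a boolean instead of asserting keeps the 'active' path (null descriptor) legal while still rejecting a non-leader that tries to push a replica into recovery.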
> Ensure that only the valid ZooKeeper registered leader can put a replica into
> Leader Initiated Recovery.
> --------------------------------------------------------------------------------------------------------
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
> Issue Type: Bug
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.4, Trunk
>
> Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation
> where the rightful leader was put into LIR, or put itself into LIR. Even on
> restart, this rightful leader won't take leadership, and you have to manually
> clear the LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR
> should just be cleared.
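The rule proposed in the description can be sketched in Java as follows, assuming the set of replicas that joined the election at startup is known. LirStartupClearSketch, LirState and resolveLirOnStartup are hypothetical names, not Solr classes; "clearing" is modeled as returning an empty map of LIR states, standing in for deleting the LIR nodes in ZooKeeper.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class LirStartupClearSketch {

    // Hypothetical LIR states; not Solr's actual enum.
    enum LirState { DOWN, RECOVERING, ACTIVE }

    /**
     * If every replica of the shard participated in the startup election,
     * drop all LIR entries (hypothetically: delete the LIR nodes);
     * otherwise keep them unchanged.
     */
    static Map<String, LirState> resolveLirOnStartup(Set<String> shardReplicas,
                                                     Set<String> electionParticipants,
                                                     Map<String, LirState> lirStates) {
        Map<String, LirState> result = new HashMap<>(lirStates);
        if (!shardReplicas.isEmpty() && electionParticipants.containsAll(shardReplicas)) {
            result.clear();
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> replicas = Set.of("core_node1", "core_node2");
        Map<String, LirState> lir = Map.of("core_node1", LirState.DOWN, "core_node2", LirState.DOWN);
        // All replicas joined: LIR is cleared, so the rightful leader can take over.
        System.out.println(resolveLirOnStartup(replicas, Set.of("core_node1", "core_node2"), lir).isEmpty()); // true
        // Only one replica joined: LIR entries are kept.
        System.out.println(resolveLirOnStartup(replicas, Set.of("core_node1"), lir).size()); // 2
    }
}
```

Requiring the full replica set guards against clearing LIR when a replica that might hold newer data has not come back yet.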
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)