[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reopened SOLR-8069:
-----------------------------------------

There's a reproducible failure in the test added by SOLR-8075, caused by an 
AssertionError thrown by the asserts added in this issue.

{code}
1 tests failed.
FAILED:  org.apache.solr.cloud.LeaderInitiatedRecoveryOnShardRestartTest.testRestartWithAllInLIR

Error Message:
Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]

Stack Trace:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]
Caused by: java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([7F78F76DDF75FAD1]:0)
        at org.apache.solr.cloud.ZkController.updateLeaderInitiatedRecoveryState(ZkController.java:2133)
        at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:434)
        at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:197)
        at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:157)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:346)
        at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1113)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:926)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:881)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:183)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}

The assertion leaderCd != null fails because 
ShardLeaderElectionContext.runLeaderProcess calls 
ZkController.updateLeaderInitiatedRecoveryState with a null core descriptor. 
Passing null there is by design: if you are marking a replica as 'active', you 
don't necessarily need to be the leader.
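
To illustrate the mismatch described above, here is a minimal, self-contained 
sketch. It is not the actual ZkController code: the class LirGuardSketch, the 
simplified State enum, the stand-in CoreDescriptor, and both check methods are 
hypothetical names made up for this example. The state-aware variant is only 
one possible way to relax the check, not necessarily the fix that will be 
committed.

```java
import java.util.Objects;

/** Hypothetical sketch of the assertion problem; not the real ZkController. */
public class LirGuardSketch {

    /** Simplified stand-in for the replica states relevant to LIR. */
    enum State { DOWN, ACTIVE }

    /** Stand-in for Solr's CoreDescriptor; only a name is modelled. */
    static final class CoreDescriptor {
        final String name;
        CoreDescriptor(String name) { this.name = Objects.requireNonNull(name); }
    }

    /**
     * Unconditional check, mirroring how the failing assert behaves:
     * every call without a leader core descriptor is rejected.
     */
    static void checkUnconditional(State state, CoreDescriptor leaderCd) {
        if (leaderCd == null) {
            throw new AssertionError("leaderCd must not be null");
        }
    }

    /**
     * Assumed relaxation: require the leader's descriptor only when a
     * replica is being put into LIR (modelled here as State.DOWN).
     */
    static void checkStateAware(State state, CoreDescriptor leaderCd) {
        if (state == State.DOWN && leaderCd == null) {
            throw new AssertionError("leaderCd is required to put a replica into LIR");
        }
    }

    public static void main(String[] args) {
        // Marking a replica ACTIVE with no leader descriptor, as the
        // election code path in the stack trace does.
        checkStateAware(State.ACTIVE, null); // accepted

        try {
            checkUnconditional(State.ACTIVE, null); // same shape as the reported failure
        } catch (AssertionError e) {
            System.out.println("unconditional check rejected the call: " + e.getMessage());
        }
    }
}
```

With a state-aware check like this, only the caller that puts a replica into 
LIR has to prove it is the leader, while clearing LIR during election can keep 
passing null.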

> Ensure that only the valid ZooKeeper registered leader can put a replica into 
> Leader Initiated Recovery.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8069
>                 URL: https://issues.apache.org/jira/browse/SOLR-8069
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.4, Trunk
>
>         Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
