[ 
https://issues.apache.org/jira/browse/SOLR-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607805#comment-17607805
 ] 

Houston Putman commented on SOLR-16416:
---------------------------------------

OK, after more digging, the explanation in the issue description does not seem 
to be the cause. What actually happens is that at the end of 
OverseerNodePrioritizer.prioritizeOverseerNodes(), the prioritizer sends a 
command to the prioritized leader to take the second spot in the leader 
election, then sends a second command to the node currently in the second 
spot to rejoin at the end.
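
To make the two-command sequence concrete, here is a rough model of the 
election queue as a plain list. This is only a sketch, not Solr code: the 
class, the method, and the firstCommandDelivered flag are hypothetical, and 
"second spot" is assumed to mean index 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the election queue; none of these names exist in Solr.
public class ElectionQueueSketch {

    /**
     * Returns the queue order after the prioritizer's two commands.
     * firstCommandDelivered = false models the first command being lost.
     */
    static List<String> prioritize(List<String> queue, String designated,
                                   boolean firstCommandDelivered) {
        List<String> result = new ArrayList<>(queue);
        int pos = result.indexOf(designated);
        // Nothing to do if the designated node is absent or already in spot 0 or 1.
        if (pos <= 1) {
            return result;
        }
        String currentSecond = result.get(1);

        // Command 1: the designated node takes the second spot.
        if (firstCommandDelivered) {
            result.remove(designated);
            result.add(1, designated);
        }

        // Command 2: the node that was in the second spot rejoins at the end.
        result.remove(currentSecond);
        result.add(currentSecond);
        return result;
    }

    public static void main(String[] args) {
        List<String> queue = List.of("n1", "n2", "n3", "n4");
        System.out.println(prioritize(queue, "n4", true));  // both commands applied
        System.out.println(prioritize(queue, "n4", false)); // first command lost
    }
}
```

In this model, losing the first command leaves the designated node out of the 
second spot while the old second-spot node has still moved to the end.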

The logging in the failed tests shows that the second command is received, but 
the first is never logged. After going through the HttpShardHandler, it seems 
that any error coming back is simply swallowed and never logged. As a first 
step, I'll add logging for errors that come back from either command. Then we 
can actually start debugging these failures.
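
The kind of logging I have in mind looks roughly like the sketch below. It is 
self-contained and does not use the real HttpShardHandler or ShardResponse 
API; CommandResult and logFailures are hypothetical names, and 
java.util.logging stands in for whatever logger the real code uses.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: log every failed command instead of dropping the error.
public class CommandErrorLogging {
    private static final Logger log =
        Logger.getLogger(CommandErrorLogging.class.getName());

    /** Hypothetical stand-in for a response to one command sent to one node. */
    public static final class CommandResult {
        final String node;
        final String op;
        final Exception error; // null when the command succeeded

        public CommandResult(String node, String op, Exception error) {
            this.node = node;
            this.op = op;
            this.error = error;
        }
    }

    /** Logs each failed command and returns how many failures were seen. */
    public static int logFailures(List<CommandResult> results) {
        int failures = 0;
        for (CommandResult r : results) {
            if (r.error != null) {
                log.log(Level.WARNING,
                    "Overseer op '" + r.op + "' failed on node " + r.node, r.error);
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<CommandResult> results = List.of(
            new CommandResult("n4", "rejoinAtHead", new RuntimeException("connection lost")),
            new CommandResult("n2", "rejoin", null));
        System.out.println("failures: " + logFailures(results));
    }
}
```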

> Leader Election not respecting joinAtHead during ZK Connection issues
> ---------------------------------------------------------------------
>
>                 Key: SOLR-16416
>                 URL: https://issues.apache.org/jira/browse/SOLR-16416
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Houston Putman
>            Priority: Major
>
> OverseerRolesTest.testDesignatedOverseerRestarts has been failing 
> consistently (around 2.5% of the time). I think this is because 
> LeaderElection.joinElection does not respect the joinAtHead flag if 
> connection issues occur while setting the leader election nodes.
> LeaderElection does not use the automatic retryOnConnLoss flags when doing zk 
> operations. Instead, it waits for an error to come back and handles the 
> retry itself. This is fine for the normal case, because it checks whether the 
> node is represented in the leaderElection child nodes, and if so it ignores 
> the connection loss. However, when doing joinAtHead, if the childNode exists 
> but isn't at the place it should be, then the manual retry should be exercised.
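
The retry rule proposed in the description above could be sketched as 
follows. This is an illustration only, not the LeaderElection implementation: 
shouldRetry and expectedIndex are hypothetical, and the position a joinAtHead 
node "should" occupy is passed in by the caller rather than assumed.

```java
import java.util.List;

// Hypothetical sketch of the retry decision after a connection loss.
public class JoinAtHeadRetrySketch {

    /**
     * @param sortedChildren election child nodes in queue order
     * @param myNode         the node that was being created when the connection dropped
     * @param joinAtHead     whether the node was supposed to join at the head
     * @param expectedIndex  where a joinAtHead node should sit (caller's assumption)
     */
    static boolean shouldRetry(List<String> sortedChildren, String myNode,
                               boolean joinAtHead, int expectedIndex) {
        int actual = sortedChildren.indexOf(myNode);
        if (actual < 0) {
            return true;  // the node was never created, so retry
        }
        if (!joinAtHead) {
            return false; // normal case: being present is enough
        }
        return actual != expectedIndex; // present but misplaced: retry
    }

    public static void main(String[] args) {
        List<String> children = List.of("a", "b", "c");
        System.out.println(shouldRetry(children, "c", false, 1));
        System.out.println(shouldRetry(children, "c", true, 1));
    }
}
```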



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

