[jira] [Commented] (SOLR-16013) Overseer gives up election node before closing - inflight commands can be processed twice

Joel Bernstein (Jira) Wed, 02 Mar 2022 06:31:28 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-16013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500197#comment-17500197
 ]


Joel Bernstein commented on SOLR-16013:
---------------------------------------

I can provide a little context on this issue.

We have worked around the issue in our collections operator by assigning the 
coreName explicitly in the ADDREPLICA command. When two overseers execute the 
same ADDREPLICA command, whichever gets there first succeeds and the second 
fails because of the duplicate coreName. This was not an easy fix, because the 
collections operator has to follow the same rules for coreName creation that 
Solr does in order for SolrCloud to work properly. The docs don't even list 
coreName as a parameter, but it does work in the 8x branch. 
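
To illustrate why a fixed coreName turns the duplicate execution into a harmless 
failure, here is a minimal sketch. It is not Solr code: the class, the in-memory 
registry, and the "generated_core_N" naming are all hypothetical stand-ins, and 
the only real API used is ConcurrentMap.putIfAbsent.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of "create a core on a node"; not part of Solr.
class AddReplicaDedupSketch {
    // coreName -> which overseer created it
    private final ConcurrentMap<String, String> cores = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger(1);

    /**
     * Simulates one execution of an ADDREPLICA command.
     * If requestedCoreName is null, a fresh name is generated (assumed naming scheme).
     * Returns true if a core was created, false if the name already existed.
     */
    boolean addReplica(String requestedCoreName, String executedBy) {
        String coreName = (requestedCoreName != null)
                ? requestedCoreName
                : "generated_core_" + nextId.getAndIncrement();
        // putIfAbsent is atomic: only the first caller for a given name wins.
        return cores.putIfAbsent(coreName, executedBy) == null;
    }

    int coreCount() {
        return cores.size();
    }

    public static void main(String[] args) {
        // Same command processed by two overseers, no explicit coreName.
        AddReplicaDedupSketch unnamed = new AddReplicaDedupSketch();
        unnamed.addReplica(null, "overseerX");
        unnamed.addReplica(null, "overseerY");
        System.out.println("without coreName: " + unnamed.coreCount() + " cores");

        // Same command processed twice, but with a caller-chosen coreName.
        AddReplicaDedupSketch named = new AddReplicaDedupSketch();
        boolean first = named.addReplica("explicit_core_name", "overseerX");
        boolean second = named.addReplica("explicit_core_name", "overseerY");
        System.out.println("with coreName: first=" + first + ", second=" + second
                + ", cores=" + named.coreCount());
    }
}
```

In this model the unnamed command leaves two cores behind, while the named one 
leaves exactly one and the second attempt is rejected.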
 
The reason this error came up frequently for us is that we have a test 
framework for the collections operator that does many things in parallel on a 
Solr cluster: it spins up collections, tears them down, scales them, and so on. 
Once you do that, this bug shows up very quickly. If you're not doing parallel 
operations, you won't hit this bug unless you have the misfortune of having the 
overseer leader die while performing an ADDREPLICA.

Lastly, the code in Solr that seems to be at issue is the following 
ZkController logic:

{code:java}
 customThreadPool.submit(() -> IOUtils.closeQuietly(overseerElector.getContext()));
 customThreadPool.submit(() -> IOUtils.closeQuietly(overseer));
{code}

The code was not always written this way. Originally it looked like this:


{code:java}
  IOUtils.closeQuietly(overseerElector.getContext());
  IOUtils.closeQuietly(overseer);
{code}

The threadPool was added as part of a larger ticket to make the tests run 
faster. I believe there is a decent chance that reverting to the serial closing 
of the overseer will resolve the problem, but I haven't confirmed this.
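
The difference in ordering between the two forms can be sketched in isolation. 
The example below does not use any Solr classes: "first" and "second" are 
hypothetical Closeables standing in for the election context and the overseer, 
and the local closeQuietly is a stand-in helper. It only shows that two tasks 
submitted to a pool may overlap, whereas serial calls cannot. It says nothing 
about whether serial closing actually fixes the bug.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class CloseOrderingSketch {

    /** Returns true if the second close began while the first was still running. */
    static boolean secondStartedDuringFirst(boolean usePool) {
        CountDownLatch secondStarted = new CountDownLatch(1);
        AtomicBoolean overlapped = new AtomicBoolean(false);

        // The first close waits up to one second to see whether the second one starts.
        Closeable first = () -> {
            try {
                overlapped.set(secondStarted.await(1000, TimeUnit.MILLISECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        // The second close just signals that it has started.
        Closeable second = secondStarted::countDown;

        if (usePool) {
            // Pooled form: both closes are submitted and may run concurrently.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            try {
                Future<?> f1 = pool.submit(() -> closeQuietly(first));
                Future<?> f2 = pool.submit(() -> closeQuietly(second));
                f1.get();
                f2.get();
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                pool.shutdown();
            }
        } else {
            // Serial form: the second close cannot start until the first returns.
            closeQuietly(first);
            closeQuietly(second);
        }
        return overlapped.get();
    }

    // Stand-in for a close-and-swallow helper; not the Solr utility itself.
    static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException e) {
            // intentionally ignored
        }
    }

    public static void main(String[] args) {
        System.out.println("pooled overlap: " + secondStartedDuringFirst(true));
        System.out.println("serial overlap: " + secondStartedDuringFirst(false));
    }
}
```

With the pool, the second close is observed starting before the first has 
finished; in the serial form the first close times out without ever seeing it.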


> Overseer gives up election node before closing - inflight commands can be 
> processed twice
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-16013
>                 URL: https://issues.apache.org/jira/browse/SOLR-16013
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> {{ZkController}} shutdown currently has these two lines (in this order)...
> {code:java}
>     customThreadPool.submit(() -> IOUtils.closeQuietly(overseerElector.getContext()));
>     customThreadPool.submit(() -> IOUtils.closeQuietly(overseer));
> {code}
> AFAICT this means that the overseer nodeX will give up its 
> election node (via overseerElector), allowing some other nodeY to be elected a 
> new overseer, **BEFORE** Overseer nodeX shuts down its {{Overseer}} object, 
> which waits for the {{OverseerThread}} to finish processing any tasks in 
> process.
> In practice, this seems to make it possible for a single command in the 
> overseer queue to get processed twice.



