[jira] [Commented] (SOLR-3620) Almost every test relating to cloud hangs since July 12, 2012

Mark Miller (JIRA) Fri, 13 Jul 2012 11:25:36 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413936#comment-13413936
 ]


Mark Miller commented on SOLR-3620:
-----------------------------------

I've committed an attempted fix.

From what I can tell, this is a shutdown deadlock issue that involves recovery 
threads.

When a SolrCore hits a ref count of 0 (it's closed by everyone using it) it will 
try to cancel any recovery thread.

This is bad if it happens while the lock on 'CoreContainer#cores' is held: the 
closing thread and the recovery thread can end up waiting on each other. As far 
as I know, this only happens in CoreContainer#shutdown, which holds the cores 
lock and calls close on all the cores, and that could cause a recovery to be 
canceled. We defend against this by canceling all recoveries *before* getting 
the cores lock in CoreContainer#shutdown. We do that for just this case. 
Otherwise, the 'cores' lock will be held when calling SolrCore#close, which 
could trigger a recovery cancel (because the ref count hits 0), which waits for 
the recovery thread to finish ('#join'). But the recovery thread could be in 
the middle of trying to recover - where it sometimes gets a core from the 
CoreContainer, which uses the 'cores' lock. The recovery thread is waiting for 
#shutdown to give up that lock while #shutdown waits for the recovery thread to 
finish its loop and die.
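
The ordering described above can be sketched roughly as follows. This is only 
an illustration under simplifying assumptions, not Solr's actual code: 
SketchCore and SketchContainer are hypothetical stand-ins, each core has a 
single recovery thread, and the "recovery" loop does nothing except briefly 
take the cores lock. The point is that shutdown() joins the recovery threads 
*before* it takes the cores lock, so the later close() calls under the lock 
have nothing left to wait for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a SolrCore with one recovery thread (not Solr's real API).
class SketchCore {
    private volatile boolean cancelled = false;
    private final Thread recovery;

    SketchCore(Object coresLock) {
        recovery = new Thread(() -> {
            while (!cancelled) {
                // Recovery sometimes looks up a core, which needs the cores lock.
                synchronized (coresLock) { }
                Thread.yield();
            }
        });
    }

    void startRecovery() { recovery.start(); }

    // Signals the recovery loop to stop and waits ('join') for the thread to die.
    void cancelRecovery() {
        cancelled = true;
        try {
            recovery.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stands in for the ref count hitting 0: closing cancels recovery.
    void close() { cancelRecovery(); }

    boolean recoveryAlive() { return recovery.isAlive(); }
}

// Hypothetical stand-in for CoreContainer.
class SketchContainer {
    private final Object coresLock = new Object();
    private final List<SketchCore> cores = new ArrayList<>();

    SketchCore addCore() {
        SketchCore core = new SketchCore(coresLock);
        synchronized (coresLock) { cores.add(core); }
        return core;
    }

    int coreCount() {
        synchronized (coresLock) { return cores.size(); }
    }

    void shutdown() {
        // Step 1: snapshot the cores under the lock, then cancel recoveries
        // WITHOUT holding it, so recovery threads can still take the lock and exit.
        List<SketchCore> snapshot;
        synchronized (coresLock) { snapshot = new ArrayList<>(cores); }
        for (SketchCore core : snapshot) core.cancelRecovery();

        // Step 2: only now hold the lock while closing; the joins return at once.
        synchronized (coresLock) {
            for (SketchCore core : cores) core.close();
            cores.clear();
        }
    }
}
```

If step 1 were skipped, close() would call join() while holding the cores 
lock, and a recovery thread blocked on that same lock could never finish.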

So how is it happening here? And why did a collections API commit to improve 
tests and add a RELOAD command expose this?

First, the only way I can think of that this could be happening, even with our 
little defensive cancelRecovery loop, is that somehow a recovery is getting 
kicked off again before shutdown completes.

So the fix I have tried is to add a bit of code to make sure recoveries do not 
start after the CoreContainer#shutdown method starts. Hopefully that plugs this.
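
As a rough sketch of that kind of guard (an assumption about its general 
shape, not the committed change): the hypothetical ShutdownGate class below 
makes "check the shutdown flag and start a recovery" a single atomic step. A 
recovery is either registered before shutdown begins, and so is visible to the 
cancel loop, or it is refused.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr: refuses new recoveries once shutdown starts.
class ShutdownGate {
    private final List<Thread> recoveries = new ArrayList<>();
    private boolean shuttingDown = false;

    // Starts the recovery task unless shutdown has begun; returns whether it started.
    synchronized boolean tryStartRecovery(Runnable task) {
        if (shuttingDown) {
            return false; // too late: shutdown is already in progress
        }
        Thread thread = new Thread(task, "recovery-sketch");
        recoveries.add(thread);
        thread.start();
        return true;
    }

    // Marks shutdown as started and returns the recoveries that must be cancelled.
    synchronized List<Thread> beginShutdown() {
        shuttingDown = true;
        return new ArrayList<>(recoveries);
    }
}
```

The gate uses its own monitor rather than the cores lock, so checking the flag 
never has to wait on a thread that is holding the cores lock.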

If that is indeed the issue, this problem already existed, and the new 
beefed-up collections API test exposed it because it's a test that uses more 
SolrCores in a single instance than almost any other test. With more cores 
trying to recover during shutdown, I think it's easier to hit this deadlock 
situation.

That's my initial guess and fix attempt.
                
> Almost every test relating to cloud hangs since July 12, 2012
> -------------------------------------------------------------
>
>                 Key: SOLR-3620
>                 URL: https://issues.apache.org/jira/browse/SOLR-3620
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Uwe Schindler
>            Assignee: Mark Miller
>            Priority: Blocker
>
> I have no idea, but please review the posts on the de@lao mailing list today!

