[ 
https://issues.apache.org/jira/browse/SOLR-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-7836:
---------------------------------
    Attachment: SOLR-7836.patch

This might do it. I'm running 1,000 iterations (or until morning, whichever 
comes first) but it's gone through 150 or so already which was usually more 
than enough to trigger the new deadlock I found so I'm hopeful. Actually, I 
think this really the old deadlock, the first fix wasn't very good.

There are two things I'd appreciate any opinions on:
1> this patch moves the UpdateLog.add() command out of the ref counted 
IndexWriter in DirectUpdateHandler2.addDoc0() in multiple places (refactored to 
methods). But UpdateLog.add gets a new IndexWriter itself which is where the 
deadlock appears to be as it's fighting with CoreContainer.reload() which also 
calls DefaultSolrCoreState.newIndexWriter.

2> There are two nocommits, but they are to make it easy for someone to see the 
other bit that other eyes would be most welcome on, I moved 
if (ulog != null) ulog.add(cmd, true); 
out of a synchronized block. This fits the pattern in other places in that code 
too, so I'm not too worried, but wanted to draw attention to it if anyone wants 
to look. Actually, there are a lot of operations on ulog.something that are 
synchronized on solrCoreState.getUpdateLock(), and a bunch that aren't. What's 
up there? The change marked by //nocommit matches the (non-synchronized) usages 
in addDoc0() though.

Anyway, the problem was in DirectUpdateHandler2.addDoc0(). The ulog.add command 
line being inside a ref-counted IndexWriter when adding documents (addAndDelete 
case). ulog.add eventually tries to get a new Searcher which can be deadlocked 
with another thread called from SolrCore.reload since the reload wants a new 
IndexWriter.

I haven't run precommit or the full test suite yet, I want to make sure the 
iterations I'm doing tonight work.

This looks more intrusive than it is. I refactored some methods out of 
DirectUpdateHandler2.addDoc0() in order to make this pattern more visible. The 
original code concealed the fact that the ref counted index writer surrounded 
all the code, including the ulog.add.

this is the pattern for all the refactored methods:
RefCounted<IndexWriter> iw = solrCoreState.getIndexWriter(core);
try {
  
IndexWriter writer = iw.get();
  do the right thing
} finally {
    iw.decref();
} 

if (ulog != null) ulog.add(cmd, true);

I think this is a better approach than the original fix which passed the index 
writer down to the addAndDelete method. we'd have had to pass the IndexWriter 
on to the ulog.add command or something which would have been a pain (even if 
it was possible).

New test case is in this patch. Note that it's a race condition to get this to 
fail, so it needs to be run a lot. And using tests.iters apparently triggers 
the timeouts so....

> Possible deadlock when closing refcounted index writers.
> --------------------------------------------------------
>
>                 Key: SOLR-7836
>                 URL: https://issues.apache.org/jira/browse/SOLR-7836
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: SOLR-7836.patch, SOLR-7836.patch
>
>
> Preliminary patch for what looks like a possible race condition between 
> writerFree and pauseWriter in DefaultSorlCoreState.
> Looking for comments and/or why I'm completely missing the boat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to