[ https://issues.apache.org/jira/browse/SOLR-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712222#comment-14712222 ]

Yonik Seeley commented on SOLR-7836:
------------------------------------

I've been running ChaosMonkeySafeLeaderTest for about 3 days with my test 
script, which also scans for corrupt indexes and assertion failures even when 
the test itself passes.
Current trunk (as of last week): 9 corrupt indexes
Patched trunk: 14 corrupt indexes and 2 test failures (inconsistent shards)

The corrupt indexes *may* not be a problem; I don't really know.  We kill off 
servers, possibly mid-replication, and that seems like it could produce corrupt 
indexes, but I don't know if that's actually the scenario.  An increased 
incidence of corrupt indexes doesn't necessarily point to a problem either.  But 
inconsistent shards vs. consistent ones does seem like a problem, if it holds.

I've reviewed the locking code again, and it looks solid, so I'm not sure 
what's going on.
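
For anyone following along, here's a rough sketch of the coordination pattern in 
question (illustrative only and simplified; class and method names are mine, not 
the actual DefaultSolrCoreState source). Threads that want the writer must wait 
while pauseWriter is set, and whoever swaps the writer must wait until writerFree; 
the deadlock risk in the issue title comes from these two wait loops missing each 
other's notifications if the flags aren't managed under one consistent monitor:
{code}
public class WriterPauseSketch {
  private final Object writerPauseLock = new Object();
  private boolean pauseWriter = false; // set while the writer is being swapped out
  private boolean writerFree = true;   // true when no thread holds the writer
  private int refCount = 0;            // how many threads currently hold it

  /** Callers that want the writer block while a swap is in progress. */
  public void acquireWriter() throws InterruptedException {
    synchronized (writerPauseLock) {
      while (pauseWriter) {
        writerPauseLock.wait(100);
      }
      refCount++;
      writerFree = false;
    }
  }

  /** Releasing the last reference lets a pending swap proceed. */
  public void releaseWriter() {
    synchronized (writerPauseLock) {
      if (--refCount == 0) {
        writerFree = true;
        writerPauseLock.notifyAll();
      }
    }
  }

  /** Swapping the writer pauses new acquirers, then waits for holders to drain. */
  public void replaceWriter() throws InterruptedException {
    synchronized (writerPauseLock) {
      pauseWriter = true;
      while (!writerFree) {
        writerPauseLock.wait(100); // wait() releases the monitor while blocked
      }
      // ... close the old IndexWriter and open a new one here ...
      pauseWriter = false;
      writerPauseLock.notifyAll();
    }
  }
}
{code}
In this single-monitor form the handshake is safe; the race being chased here 
would require one of the flags to be read or written outside that monitor.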

Here's a typical corrupt index trace:
{code}
  2> 21946 WARN  (RecoveryThread-collection1) [n:127.0.0.1:51815_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.h.IndexFetcher Could not retrieve checksum from file.
  2> org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1698720114 vs expected footer=-1071082520 (resource=MMapIndexInput(path="/opt/code/lusolr_clean2/solr/build/solr-core/test/J0/temp/solr.cloud.ChaosMonkeySafeLeaderTest_B7DC9C42462BF20D-001/shard-2-001/cores/collection1/data/index/_0.fdt"))
  2>    at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:416)
  2>    at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:401)
  2>    at org.apache.solr.handler.IndexFetcher.compareFile(IndexFetcher.java:876)
  2>    at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:839)
  2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:437)
  2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:265)
  2>    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:382)
  2>    at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
  2>    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
  2>    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
{code}
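
To make the trace concrete: the failing check is ordinary Lucene footer/checksum 
validation. Here's a minimal hypothetical standalone checker (the class name and 
the write.lock exclusion are my own, not Solr code) that applies the same 
CodecUtil.retrieveChecksum() call IndexFetcher uses to every file in an index 
directory:
{code}
import java.nio.file.Paths;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class FooterCheck {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
      for (String file : dir.listAll()) {
        if (file.equals("write.lock")) continue; // lock file has no codec footer
        try (IndexInput in = dir.openInput(file, IOContext.READONCE)) {
          // Reads the codec footer at the end of the file and returns the stored
          // checksum; throws CorruptIndexException on a footer mismatch, e.g. if
          // the file was truncated mid-copy.
          CodecUtil.retrieveChecksum(in);
          System.out.println("OK      " + file);
        } catch (CorruptIndexException e) {
          System.out.println("CORRUPT " + file + ": " + e.getMessage());
        }
      }
    }
  }
}
{code}
A segment file whose transfer was cut off when the monkey killed the node would 
fail this check exactly as in the trace above, since its trailing footer would 
be missing or partial, which is consistent with the kill-during-replication 
theory without proving it.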


> Possible deadlock when closing refcounted index writers.
> --------------------------------------------------------
>
>                 Key: SOLR-7836
>                 URL: https://issues.apache.org/jira/browse/SOLR-7836
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7836-reorg.patch, SOLR-7836-synch.patch, 
> SOLR-7836.patch, SOLR-7836.patch, SOLR-7836.patch, SOLR-7836.patch, 
> deadlock_3.res.zip, deadlock_5_pass_iw.res.zip, deadlock_test
>
>
> Preliminary patch for what looks like a possible race condition between 
> writerFree and pauseWriter in DefaultSolrCoreState.
> Looking for comments and/or why I'm completely missing the boat.


