[ https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yonik Seeley updated SOLR-8129:
-------------------------------
    Attachment: fail.151005_064958

OK, here's another failure I've been analyzing. It comes down to this:

1) the leader is shut down (CoreContainer.shutdown is called)
2) a single doc is sent from the leader to one replica successfully, but unsuccessfully to a different replica (the task is rejected by the shut-down executor that the client is using to send)
3) tons of other updates are still being accepted and sent by the leader
4) much later, a peersync sees everything as OK since the recent versions match up.

One mystery is why ConcurrentUpdateSolrClient is trying to create a new Runner when another runner is obviously already running (since it still accepts and sends new updates after that point).

A general way to fix this is to make sure that shutdown happens much more quickly... we should stop reading and processing updates.

{code}
// good-doc comes into leader 53975
2> 43204 DEBUG (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-476} {wt=javabin&version=2}

// bad-doc comes into leader 53975
2> 43204 DEBUG (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-477} {wt=javabin&version=2}

// good-doc added to shard 57414
2> 43273 DEBUG (qtp702407469-354) [n:127.0.0.1:57414__%2Fzl c:collection1 s:shard2 r:core_node5 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-476} {update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
2> 43280 INFO (qtp702407469-354) [n:127.0.0.1:57414__%2Fzl c:collection1 s:shard2 r:core_node5 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=/_/zl path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2} {add=[ft1-467 (1514187847514456064), ft1-470 (1514187847746191360), ft1-473 (1514187847754579968), ft1-476 (1514187847755628544)]} 0 42

// the leader is going to be stopped in the future
2> 43345 INFO (Thread-272) [    ] o.a.s.c.ChaosMonkey monkey: stop shard! 53975
2> 43345 INFO (Thread-272) [    ] o.a.s.c.CoreContainer Shutting down CoreContainer instance=171681388

// overseer gets state:down for leader 53975
2> 43378 INFO (OverseerStateUpdate-94636738141945860-127.0.0.1:49439__%2Fzl-n_0000000000) [n:127.0.0.1:49439__%2Fzl    ] o.a.s.c.Overseer processMessage: queueSize: 1, message = {

// BUT... 53975 appears to keep processing updates... there are ~736 more updates like the following,
// continuing another couple of seconds through to time 45555
fail.151005_064958: 2> 43381 DEBUG (qtp1536362844-204) [n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-1165} {wt=javabin&version=2}

// good-doc is added to replica 44323
2> 43449 DEBUG (qtp1776514246-272) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-476} {update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
2> 43456 INFO (qtp1776514246-272) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=/_/zl path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2} {add=[ft1-473 (1514187847754579968), ft1-476 (1514187847755628544)]} 0 108

// more signs of node being stopped
2> 43453 WARN (qtp1536362844-205) [n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] o.e.j.s.ServletHandler /_/zl/collection1/update
2> org.apache.solr.common.SolrException: Error processing the request. CoreContainer is either not initialized or shutting down.

// bad-doc is added to replica 44323
2> 43471 DEBUG (qtp1776514246-273) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-477} {update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
2> 43501 INFO (qtp1776514246-273) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=/_/zl path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2} {add=[ft1-477 (1514187847778697216)]} 0 30

// This is the same update thread that has our bad-doc on the leader... it can't send because the update executor has been shut down
2> 43472 ERROR (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.u.SolrCmdDistributor java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1@183e1649 rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@7552e28d[Shutting down, pool size = 5, active threads = 5, queued tasks = 0, completed tasks = 62]

// Takes a while for other executors to be shut down
2> 45288 ERROR (qtp1536362844-204) [n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.u.SolrCmdDistributor java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1@4dbd9a35 rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@7552e28d[Shutting down, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 65]

// Peer sync... the versions show that 44323 is ahead by 14 updates, but everything else matches up after that point.
// Our missing update (1514187848796864512) is older than the oldest update of either list.
2> 49313 DEBUG (zkCallback-21-thread-2-processing-n:127.0.0.1:44323__%2Fzl) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:44323/_/zl sorted versions from http://127.0.0.1:57414/_/zl/collection1/ = [1514187849851731968, ...
2> 49313 INFO (zkCallback-21-thread-2-processing-n:127.0.0.1:44323__%2Fzl) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:44323/_/zl Our versions are newer. ourLowThreshold=1514187848979316736 otherHigh=1514187849717514240
2> 49567 DEBUG (qtp702407469-351) [n:127.0.0.1:57414__%2Fzl c:collection1 s:shard2 r:core_node5 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:57414/_/zl sorted versions from http://127.0.0.1:44323/_/zl/collection1/ = [1514187849932472320, ...
2> 50327 ERROR (qtp1776514246-274) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.c.SolrCore org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: collection1 slice: shard2

// the leader is still not all the way shut down...
2> 62700 ERROR (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.c.s.i.ConcurrentUpdateSolrClient interrupted

2> ######shard2 is not consistent. Got 2456 from http://127.0.0.1:44323/_/zl/collection1 (previous client) and got 2455 from http://127.0.0.1:57414/_/zl/collection1
2> 110640 INFO (qtp1776514246-276) [n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.c.S.Request [collection1] webapp=/_/zl path=/select params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2} hits=2456 status=0 QTime=91
2> 110762 INFO (qtp702407469-354) [n:127.0.0.1:57414__%2Fzl c:collection1 s:shard2 r:core_node5 x:collection1] o.a.s.c.S.Request [collection1] webapp=/_/zl path=/select params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2} hits=2455 status=0 QTime=9
2> ######http://127.0.0.1:44323/_/zl/collection1: SolrDocumentList[sz=2456]=[SolrDocument{id=0-100, _version_=1514187858469978112}, SolrDocument{id=0-1000, _version_=1514187879861977089}, SolrDocument{id=0-1005, _version_=1514187879928037376}, SolrDocument{id=0-1009, _version_=1514187879977320448}, SolrDocument{id=0-1011, _version_=1514187880001437696}] , [...] , [SolrDocument{id=ft1-92, _version_=1514187844608851968}, SolrDocument{id=ft1-95, _version_=1514187844616192000}, SolrDocument{id=ft1-96, _version_=1514187844623532032}, SolrDocument{id=ft1-985, _version_=1514187847262797824}, SolrDocument{id=ft1-99, _version_=1514187844641357824}]
2> ######http://127.0.0.1:57414/_/zl/collection1: SolrDocumentList[sz=2455]=[SolrDocument{id=0-100, _version_=1514187858469978112}, SolrDocument{id=0-1000, _version_=1514187879861977089}, SolrDocument{id=0-1005, _version_=1514187879928037376}, SolrDocument{id=0-1009, _version_=1514187879977320448}, SolrDocument{id=0-1011, _version_=1514187880001437696}] , [...] , [SolrDocument{id=ft1-92, _version_=1514187844608851968}, SolrDocument{id=ft1-95, _version_=1514187844616192000}, SolrDocument{id=ft1-96, _version_=1514187844623532032}, SolrDocument{id=ft1-985, _version_=1514187847262797824}, SolrDocument{id=ft1-99, _version_=1514187844641357824}]
2> ###### sizes=2456,2455
2> ###### Only in http://127.0.0.1:44323/_/zl/collection1: [{_version_=1514187847778697216, id=ft1-477}]
{code}

> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
>                 Key: SOLR-8129
>                 URL: https://issues.apache.org/jira/browse/SOLR-8129
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Yonik Seeley
>         Attachments: fail.151005_064958, fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures,
> including shard inconsistency.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
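The RejectedExecutionException entries in the log are the standard behavior of java.util.concurrent thread pools: once ExecutorService.shutdown() has been called, tasks that are already running keep draining, but any further submit() is rejected. A minimal sketch of that failure mode, using a plain fixed pool rather than Solr's MDCAwareThreadPoolExecutor:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class ShutdownReject {
    // Returns true if a task submitted after shutdown() is rejected.
    public static boolean submitAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Analogous to CoreContainer.shutdown() shutting down the update executor.
        pool.shutdown();
        try {
            // Analogous to SolrCmdDistributor forwarding one more update to a replica.
            pool.submit(() -> {});
            return false;
        } catch (RejectedExecutionException e) {
            // On this path the update is simply dropped for that replica,
            // while sends to other replicas may already have succeeded.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(submitAfterShutdown()); // prints "true"
    }
}
```

This matches the observed split-brain step 2): the same document can reach one replica through a still-running sender while the forward to another replica hits the rejected-task path.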
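The peersync outcome can be illustrated with a toy comparison: if each side only considers its most recent versions (as the sorted-versions windows and ourLowThreshold/otherHigh in the log suggest), a document lost before the oldest version in the window can never surface as a difference. This is a simplified, hypothetical model, not Solr's actual PeerSync logic, and the version values are made up:

```java
import java.util.Arrays;
import java.util.List;

public class RecentWindowSync {
    // Compare only the `window` most recent versions of two replicas.
    // Lists are sorted descending (newest first); returns true if the
    // windows agree, i.e. the replicas "look" in sync.
    static boolean looksInSync(List<Long> ours, List<Long> theirs, int window) {
        List<Long> a = ours.subList(0, Math.min(window, ours.size()));
        List<Long> b = theirs.subList(0, Math.min(window, theirs.size()));
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Hypothetical version lists: replica B never received old version 100,
        // but both replicas agree on everything newer.
        List<Long> a = Arrays.asList(500L, 400L, 300L, 200L, 100L);
        List<Long> b = Arrays.asList(500L, 400L, 300L, 200L);

        // The recent windows match, so the gap goes unnoticed...
        System.out.println(looksInSync(a, b, 4)); // prints "true"
        // ...even though the full lists differ.
        System.out.println(a.equals(b));          // prints "false"
    }
}
```

That is exactly the shape of this failure: the missing update (1514187848796864512) is older than the oldest version in either list, so the recent versions match up and peersync declares the replicas consistent.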