[ https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-8129:
-------------------------------
    Attachment: fail.151005_064958

OK, here's another failure I've been analyzing.
It comes down to this:
1) the leader is shut down (CoreContainer.shutdown is called)
2) a single doc is sent from the leader to one replica successfully, but unsuccessfully to a different replica (the task is rejected by the shut-down executor that the client is using to send; see the sketch below)
3) tons of other updates are still being accepted and sent by the leader
4) much later, a peersync sees everything as OK since the recent versions match up.
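
To illustrate what's happening in steps 2 and 3, here's a tiny standalone example (plain JDK, not Solr code; the class and task names are made up) showing how an executor that has been shut down still finishes its in-flight tasks while rejecting any new submissions... the same RejectedExecutionException that SolrCmdDistributor logs below:

{code}
import java.util.concurrent.*;

// Standalone illustration (not Solr code): a shut-down executor still finishes
// tasks that are already running, but rejects any new submissions.
public class ShutdownRejectionDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(2);

        // A long-running task, analogous to a Runner still draining its queue.
        exec.submit(() -> {
            try { Thread.sleep(2000); } catch (InterruptedException ignored) {}
            System.out.println("already-running task completed normally");
        });

        exec.shutdown();  // analogous to shutdown tearing down the update executor

        try {
            // A new submission after shutdown, analogous to trying to start
            // another Runner to forward the bad-doc.
            exec.submit(() -> System.out.println("never runs"));
        } catch (RejectedExecutionException e) {
            System.out.println("new task rejected: " + e);
        }

        exec.awaitTermination(5, TimeUnit.SECONDS);
    }
}
{code}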

One mystery is why ConcurrentUpdateSolrClient is trying to create a new Runner when another runner is obviously already running (since the client still accepts and sends new updates after that point).

A general way to fix this is to make sure that shutdown happens much more quickly... once shutdown starts, we should stop reading and processing updates.
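
For example (just a sketch of the direction, not a patch; the guard method below is hypothetical, though CoreContainer.isShutDown() and SolrException are real), the update path could fail fast as soon as shutdown begins, instead of accepting the update and then hitting the rejected executor halfway through distribution:

{code}
import org.apache.solr.common.SolrException;
import org.apache.solr.core.CoreContainer;

// Sketch only: a hypothetical guard, not actual Solr code.
class ShutdownGuard {
    /** Throws 503 if the container has started shutting down. */
    static void rejectIfShuttingDown(CoreContainer cores) {
        if (cores.isShutDown()) {
            // Fail fast so the client can retry against the new leader instead
            // of the update landing on some replicas but not others.
            throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
                "CoreContainer is shutting down; rejecting update");
        }
    }
}
{code}

That way the client gets a clean 503 and can retry against whoever becomes the new leader, rather than the update being half-distributed.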

{code}

// good-doc comes into leader 53975
  2> 43204 DEBUG (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 
s:shard2 r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=ft1-476} {wt=javabin&version=2}

// bad-doc comes into leader 53975
  2> 43204 DEBUG (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 
s:shard2 r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=ft1-477} {wt=javabin&version=2}

// good-doc added to replica 57414
  2> 43273 DEBUG (qtp702407469-354) [n:127.0.0.1:57414__%2Fzl c:collection1 
s:shard2 r:core_node5 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=ft1-476} 
{update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
  2> 43280 INFO  (qtp702407469-354) [n:127.0.0.1:57414__%2Fzl c:collection1 
s:shard2 r:core_node5 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] 
webapp=/_/zl path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
 {add=[ft1-467 (1514187847514456064), ft1-470 (1514187847746191360), ft1-473 
(1514187847754579968), ft1-476 (1514187847755628544)]} 0 42


// the leader is going to be stopped in the future
  2> 43345 INFO  (Thread-272) [    ] o.a.s.c.ChaosMonkey monkey: stop shard! 
53975
  2> 43345 INFO  (Thread-272) [    ] o.a.s.c.CoreContainer Shutting down 
CoreContainer instance=171681388

// overseer gets state:down for leader 53975 
  2> 43378 INFO  
(OverseerStateUpdate-94636738141945860-127.0.0.1:49439__%2Fzl-n_0000000000) 
[n:127.0.0.1:49439__%2Fzl    ] o.a.s.c.Overseer processMessage: queueSize: 1, 
message = {

// BUT... 53975 appears to keep processing updates... there are ~736 more 
updates like the following, continuing another couple of seconds through to 
time 45555
fail.151005_064958:  2> 43381 DEBUG (qtp1536362844-204) 
[n:127.0.0.1:53975__%2Fzl c:collection1 s:shard2 r:core_node1 x:collection1] 
o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-1165} {wt=javabin&version=2}



// good-doc is added to replica 44323
  2> 43449 DEBUG (qtp1776514246-272) [n:127.0.0.1:44323__%2Fzl c:collection1 
s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=ft1-476} 
{update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
  2> 43456 INFO  (qtp1776514246-272) [n:127.0.0.1:44323__%2Fzl c:collection1 
s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] 
webapp=/_/zl path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
 {add=[ft1-473 (1514187847754579968), ft1-476 (1514187847755628544)]} 0 108

// more signs of node being stopped
  2> 43453 WARN  (qtp1536362844-205) [n:127.0.0.1:53975__%2Fzl c:collection1 
s:shard2 r:core_node1 x:collection1] o.e.j.s.ServletHandler 
/_/zl/collection1/update
  2> org.apache.solr.common.SolrException: Error processing the request. 
CoreContainer is either not initialized or shutting down.


// bad-doc is added to replica 44323
  2> 43471 DEBUG (qtp1776514246-273) [n:127.0.0.1:44323__%2Fzl c:collection1 
s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=ft1-477} 
{update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
  2> 43501 INFO  (qtp1776514246-273) [n:127.0.0.1:44323__%2Fzl c:collection1 
s:shard2 r:core_node3 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] 
webapp=/_/zl path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:53975/_/zl/collection1/&wt=javabin&version=2}
 {add=[ft1-477 (1514187847778697216)]} 0 30

// This is the same update thread that has our bad-doc on the leader... it 
can't send because the update executor has been shut down
  2> 43472 ERROR (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 
s:shard2 r:core_node1 x:collection1] o.a.s.u.SolrCmdDistributor 
java.util.concurrent.RejectedExecutionException: Task 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1@183e1649 
rejected from 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@7552e28d[Shutting
 down, pool size = 5, active threads = 5, queued tasks = 0, completed tasks = 
62]

// Takes a while for other executors to be shut down
  2> 45288 ERROR (qtp1536362844-204) [n:127.0.0.1:53975__%2Fzl c:collection1 
s:shard2 r:core_node1 x:collection1] o.a.s.u.SolrCmdDistributor 
java.util.concurrent.RejectedExecutionException: Task 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1@4dbd9a35 
rejected from 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@7552e28d[Shutting
 down, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 
65]


// Peer sync... the versions show that 44323 is ahead by 14 updates, but everything else matches up after that point (a toy version-window sketch follows this log excerpt).
// Our missing update (1514187848796864512) is older than the oldest update of either list.
  2> 49313 DEBUG (zkCallback-21-thread-2-processing-n:127.0.0.1:44323__%2Fzl) 
[n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] 
o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:44323/_/zl  
sorted versions from http://127.0.0.1:57414/_/zl/collection1/ = 
[1514187849851731968, ...
  2> 49313 INFO  (zkCallback-21-thread-2-processing-n:127.0.0.1:44323__%2Fzl) 
[n:127.0.0.1:44323__%2Fzl c:collection1 s:shard2 r:core_node3 x:collection1] 
o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:44323/_/zl  
Our versions are newer. ourLowThreshold=1514187848979316736 
otherHigh=1514187849717514240


  2> 49567 DEBUG (qtp702407469-351) [n:127.0.0.1:57414__%2Fzl c:collection1 
s:shard2 r:core_node5 x:collection1] o.a.s.u.PeerSync PeerSync: 
core=collection1 url=http://127.0.0.1:57414/_/zl  sorted versions from 
http://127.0.0.1:44323/_/zl/collection1/ = [1514187849932472320, ...


  2> 50327 ERROR (qtp1776514246-274) [n:127.0.0.1:44323__%2Fzl c:collection1 
s:shard2 r:core_node3 x:collection1] o.a.s.c.SolrCore 
org.apache.solr.common.SolrException: No registered leader was found after 
waiting for 4000ms , collection: collection1 slice: shard2

// the leader is still not all the way shut down... 
  2> 62700 ERROR (qtp1536362844-206) [n:127.0.0.1:53975__%2Fzl c:collection1 
s:shard2 r:core_node1 x:collection1] o.a.s.c.s.i.ConcurrentUpdateSolrClient 
interrupted





  2> ######shard2 is not consistent.  Got 2456 from 
http://127.0.0.1:44323/_/zl/collection1 (previous client) and got 2455 from 
http://127.0.0.1:57414/_/zl/collection1
  2> 110640 INFO  (qtp1776514246-276) [n:127.0.0.1:44323__%2Fzl c:collection1 
s:shard2 r:core_node3 x:collection1] o.a.s.c.S.Request [collection1] 
webapp=/_/zl path=/select 
params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2}
 hits=2456 status=0 QTime=91 
  2> 110762 INFO  (qtp702407469-354) [n:127.0.0.1:57414__%2Fzl c:collection1 
s:shard2 r:core_node5 x:collection1] o.a.s.c.S.Request [collection1] 
webapp=/_/zl path=/select 
params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2}
 hits=2455 status=0 QTime=9 
  2> ######http://127.0.0.1:44323/_/zl/collection1: 
SolrDocumentList[sz=2456]=[SolrDocument{id=0-100, 
_version_=1514187858469978112}, SolrDocument{id=0-1000, 
_version_=1514187879861977089}, SolrDocument{id=0-1005, 
_version_=1514187879928037376}, SolrDocument{id=0-1009, 
_version_=1514187879977320448}, SolrDocument{id=0-1011, 
_version_=1514187880001437696}] , [...] , [SolrDocument{id=ft1-92, 
_version_=1514187844608851968}, SolrDocument{id=ft1-95, 
_version_=1514187844616192000}, SolrDocument{id=ft1-96, 
_version_=1514187844623532032}, SolrDocument{id=ft1-985, 
_version_=1514187847262797824}, SolrDocument{id=ft1-99, 
_version_=1514187844641357824}]
  2> ######http://127.0.0.1:57414/_/zl/collection1: 
SolrDocumentList[sz=2455]=[SolrDocument{id=0-100, 
_version_=1514187858469978112}, SolrDocument{id=0-1000, 
_version_=1514187879861977089}, SolrDocument{id=0-1005, 
_version_=1514187879928037376}, SolrDocument{id=0-1009, 
_version_=1514187879977320448}, SolrDocument{id=0-1011, 
_version_=1514187880001437696}] , [...] , [SolrDocument{id=ft1-92, 
_version_=1514187844608851968}, SolrDocument{id=ft1-95, 
_version_=1514187844616192000}, SolrDocument{id=ft1-96, 
_version_=1514187844623532032}, SolrDocument{id=ft1-985, 
_version_=1514187847262797824}, SolrDocument{id=ft1-99, 
_version_=1514187844641357824}]
  2> ###### sizes=2456,2455
  2> ###### Only in http://127.0.0.1:44323/_/zl/collection1: 
[{_version_=1514187847778697216, id=ft1-477}]
{code}
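
To make the PeerSync observation above concrete (the missing update is older than the oldest version in either list), here's a toy example... not the real PeerSync algorithm, just the shape of the problem: each replica only exchanges its most recent N versions, so when the missing update falls below both windows the recent versions compare as equal and the gap is invisible:

{code}
import java.util.*;

// Toy illustration (not the real PeerSync algorithm): each replica only reports
// its most recent N versions, so a gap older than both windows goes unnoticed.
public class RecentVersionsDemo {
    public static void main(String[] args) {
        int n = 5; // the real PeerSync window is on the order of 100 versions

        // Versions in descending order; replica A has one extra old update (42)
        // that replica B never received.
        List<Long> replicaA = Arrays.asList(900L, 800L, 700L, 600L, 500L, 400L, 42L);
        List<Long> replicaB = Arrays.asList(900L, 800L, 700L, 600L, 500L, 400L);

        List<Long> recentA = replicaA.subList(0, n);
        List<Long> recentB = replicaB.subList(0, n);

        // Both recent windows are identical, so a sync based only on them
        // concludes the replicas are in sync...
        System.out.println("recent windows equal? " + recentA.equals(recentB));

        // ...even though the full sets differ by the old version 42.
        Set<Long> onlyInA = new HashSet<>(replicaA);
        onlyInA.removeAll(replicaB);
        System.out.println("versions only on A: " + onlyInA);
    }
}
{code}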

> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
>                 Key: SOLR-8129
>                 URL: https://issues.apache.org/jira/browse/SOLR-8129
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Yonik Seeley
>         Attachments: fail.151005_064958, fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures, 
> including shard inconsistency.


