[ https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543979#comment-13543979 ]

Yonik Seeley edited comment on SOLR-3180 at 1/4/13 4:07 PM:
------------------------------------------------------------

Here's an analyzed log that I traced all the way to the end.
The issues involved are all timeout related (socket timeouts).
Timing out an update request in general is bad, since the request itself 
normally continues on and can finish at some point in the future.
We should strive to only time out requests that are truly / hopelessly hung.

{code}

  2> 54461 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 35930
  2> 54462 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 35930 !
  2> 54473 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!

# NOTE: for some reason, this didn't seem to slow 35930 down... it's still 
# accepting and processing updates at this point!

  2> 59545 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 35930
  2> 59545 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 35930 !
  2> 59546 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
 

  2> 65367 T48 C2 P47815 PRE_UPDATE FINISH  {wt=javabin&version=2}
  2> 65367 T48 C2 P47815 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} {wt=javabin&version=2}
# C2 is forwarding to the leader, C6 (P59996)
# NOTE: this is the same thread that finally times out

  2> 65372 T75 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 65374 T75 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189915618770944

  2> 65384 T179 C8 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65384 T229 C11 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65384 T124 C3 P35930 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65385 T179 C8 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189915618770944
  2> 65385 T229 C11 P44046 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65387 T229 C11 P44046 /update 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10070 (1423189915618770944)]} 0 3
  2> 65445 T179 C8 P40531 /update 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10070 (1423189915618770944)]} 0 62
# C8 and C11 finish, but C3 (P35930) does not


  2> 80583 T48 C2 P47815 oasc.Diagnostics.logThreadDumps SEVERE REQUESTING 
THREAD DUMP DUE TO TIMEOUT: Timeout occured while waiting response from server 
at: http://127.0.0.1:59996/to_/y/collection1

  2>    "qtp330001436-124" Id=124 TIMED_WAITING
  2>            at java.lang.Thread.sleep(Native Method)
  2>            at 
org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:944)
  2>            at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:342)
  2>            at 
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessor.java:76)
  2>            at 
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
  2>            at 
org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
  2>            at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  2>            at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 
# C3 is waiting for the ZK connection

  2> 80926 T75 C6 P59996 oasc.SolrException.log SEVERE shard update error 
StdNode: 
http://127.0.0.1:35930/to_/y/collection1/:org.apache.solr.client.solrj.SolrServerException:
 Timeout occured while waiting response from server at: 
http://127.0.0.1:35930/to_/y/collection1
  2> 80632 T73 C6 P59996 oasc.SolrException.log SEVERE shard update error 
StdNode: 
http://127.0.0.1:35930/to_/y/collection1/:org.apache.solr.client.solrj.SolrServerException:
 Timeout occured while waiting response from server at: 
http://127.0.0.1:35930/to_/y/collection1

  2> 80603 T48 C2 P47815 oasc.SolrException.log SEVERE forwarding update to 
http://127.0.0.1:59996/to_/y/collection1/ failed - retrying ... 
  2> 80955 T75 C6 P59996 oasup.DistributedUpdateProcessor.doFinish try and ask 
http://127.0.0.1:35930/to_/y to recover
  2> 80956 T73 C6 P59996 oasup.DistributedUpdateProcessor.doFinish try and ask 
http://127.0.0.1:35930/to_/y to recover
 
# there are actually 2 updates failing, hence the 2 requests to recover
# It looks like C2 retries forwarding to the leader at about the same time that 
# the leader times out the requests to the replicas and asks them to recover

  2> 80965 T120 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been 
requested that we recover
  2> 80971 T121 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been 
requested that we recover


  2> 80972 T75 C6 P59996 /update 
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
 {add=[10070 (1423189915618770944)]} 0 15600
######### This is a success because we added the doc to multiple shards and 
######### asked the shard that failed to recover?

  2> 81461 T76 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
######### Now this is the retry from C2 sending to the leader, but we've 
######### already counted the previous update as a success on C6 (but C2 had 
######### already timed it out)

  2> 81463 T76 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189932489310208
  2> 81465 T76 C6 P59996 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}

  2> 81471 T122 C3 P35930 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 81486 T324 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 81489 T324 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189932489310208
  2> 81497 T324 C15 P44046 /update 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10070 (1423189932489310208)]} 0 11


############# The following update doesn't have "forward to leader" on it!  
This must be a retry from
############# the cloud client?
  2> 95400 T74 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} {wt=javabin&version=2}
  2> 95402 T74 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189947105411072
  2> 95404 T74 C6 P59996 PRE_UPDATE FINISH  {wt=javabin&version=2}

  2> 95410 T327 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95411 T327 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189947105411072
  2> 95414 T327 C15 P44046 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95415 T327 C15 P44046 /update 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10070 (1423189947105411072)]} 0 6
 
  2>  C16_STATE=coll:collection1 core:collection1 props:{shard=shard3, 
roles=null, state=active, core=collection1, collection=collection1, 
node_name=127.0.0.1:40531_to_%2Fy, base_url=http://127.0.0.1:40531/to_/y}
  2> 95420 T358 C16 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95424 T358 C16 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189947105411072
  2> 95432 T358 C16 P40531 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95433 T358 C16 P40531 /update 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10070 (1423189947105411072)]} 0 13

  2> 95436 T74 C6 P59996 /update {wt=javabin&version=2} {add=[10070 
(1423189947105411072)]} 0 36
############## The update is a success because apparently only C15 and C16 are 
############## active replicas of the shard?


  2> 95465 T124 C3 P35930 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}

############# Here comes the delete for the document, since the second add 
finished! (even though the first is still hung up)
############# This must mean that the cloud client timed out the request?

  2> 96296 T25 C12 P52041 oasup.LogUpdateProcessor.processDelete PRE_UPDATE 
DELETE delete{flags=0,_version_=0,id=10070,commitWithin=-1} 
{wt=javabin&version=2}
  2> 96297 T25 C12 P52041 /update {wt=javabin&version=2} {delete=[10070 
(-1423189948044935168)]} 0 1
  2> 96300 T75 C6 P59996 oasup.LogUpdateProcessor.processDelete PRE_UPDATE 
DELETE delete{flags=0,_version_=0,id=10070,commitWithin=-1} 
{wt=javabin&version=2}


  2> 96757 T48 C2 P47815 oasc.SolrException.log SEVERE forwarding update to 
http://127.0.0.1:59996/to_/y/collection1/ failed - retrying ... 
#### try to forward to the leader again, and this is the update that finally 
#### sticks

  2> 97306 T77 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 97308 T77 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189949105045504

  2> 97315 T326 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97314 T359 C16 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10070} 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97321 T359 C16 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189949105045504
  2> 97323 T359 C16 P40531 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97323 T359 C16 P40531 /update 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10070 (1423189949105045504)]} 0 10
  2> 97321 T326 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10070 version=1423189949105045504
  2> 97325 T326 C15 P44046 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97325 T326 C15 P44046 /update 
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10070 (1423189949105045504)]} 0 10
 
  2> 97326 T77 C6 P59996 /update 
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
 {add=[10070 (1423189949105045504)]} 0 20
##### And now this update succeeds (which was a retry between non-leader and 
##### leader of the original update)


  2> 97328 T48 C2 P47815 /update {wt=javabin&version=2} {add=[10070]} 0 31962
#### This is the first (and only) time that the original update succeeded!

  2> ###### Only in cloudDocList: [{id=60073}, {id=10070}]
  2> 209316 T76 C6 P59996 REQ /select 
{fl=id,_version_&shard.url=127.0.0.1:59996/to_/y/collection1/|127.0.0.1:35930/to_/y/collection1/&NOW=1357259813791&tests=checkShardConsistency(vsControl)/getVers&q=id:(+60073+10070)&ids=60073,10070&distrib=false&isShard=true&wt=javabin&rows=100000&version=2}
 status=0 QTime=1 
  2> 209318 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE 
controlClient :{numFound=0,start=0,docs=[]}
  2>            cloudClient :{numFound=2,start=0,docs=[SolrDocument{id=10070, 
_version_=1423189949105045504}, SolrDocument{id=60073, 
_version_=1423189949103996928}]}
{code}
There was a lot of timeout / retry activity that could cause problems for other 
tests / scenarios, but this test is simpler because it waits for a response to 
the add before moving on to possibly delete that add.  For this scenario, the 
retry that caused the issue was from the cloud client.  It timed out its 
original update and retried the update.  The retry completed.  Then the test 
deleted that document.  Then the *original* update succeeded and added the doc 
back.
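A hypothetical reconstruction of that client-side sequence (SolrJ 4.x style; 
the zkHost is an assumption, and the comments describe what the log shows 
happening server-side):

{code}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudClientRetrySequence {
  public static void main(String[] args) throws Exception {
    CloudSolrServer cloud = new CloudSolrServer("127.0.0.1:2181"); // assumed zkHost
    cloud.setDefaultCollection("collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "10070");
    cloud.add(doc);            // the original request times out client-side and is
                               // retried; the retry completes, but the original
                               // request keeps running on the servers
    cloud.deleteById("10070"); // the test then deletes the doc...
    // ...after which the *original* add finally lands server-side and re-adds
    // id=10070, so the cloud collection no longer matches the control collection.
  }
}
{code}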

Having the same timeouts on forwards to leaders as on forwards from leaders has 
turned out to be not-so-good.  Because the former starts *before* the latter, 
if a replica update hangs, the to_leader update will time out and retry 
*slightly* before the from_leader request to the replica times out (and maybe 
succeeds by asking that replica to recover!).
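To make the race concrete, here's a minimal sketch (the timeout and delay 
values are made up for illustration, not Solr's actual settings):

{code}
// Both hops use the same socket timeout, but the to_leader request is opened
// first, so its timer always expires slightly before the leader's from_leader
// request to the hung replica does.
public class TimeoutRaceSketch {
  static final long SOCKET_TIMEOUT_MS = 15000;    // same value on both hops (the problem)
  static final long LEADER_FORWARD_DELAY_MS = 5;  // leader forwards to replicas shortly after receiving the update

  public static void main(String[] args) {
    long toLeaderExpiresAt   = SOCKET_TIMEOUT_MS;                           // non-leader -> leader
    long fromLeaderExpiresAt = LEADER_FORWARD_DELAY_MS + SOCKET_TIMEOUT_MS; // leader -> hung replica

    // The outer request gives up (and retries) first, while the leader is still
    // waiting on the replica and hasn't yet asked it to recover.
    System.out.println("to_leader hop times out at   t=" + toLeaderExpiresAt + "ms");
    System.out.println("from_leader hop times out at t=" + fromLeaderExpiresAt + "ms");
  }
}
{code}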

Q) A replica receiving a forward *from* a leader - do we really need to have a 
ZK connection to accept that update?
Maybe so for defensive check reasons?

Here's how I think we need to fix this (a rough sketch of the resulting 
layering follows the list):
A) We need to figure out how long an update to a replica forwarded by the 
leader can reasonably take, then make the socket timeout greater than that.
B) We need to figure out how long an update to a leader can take (taking into 
account (A)), and make the socket timeout to the leader greater than that.
C) We need to figure out how long an update to a non-leader (which is then 
forwarded to a leader) can take, and make the socket timeout from SolrJ servers 
greater than that.
D) Make sure that the generic Jetty socket timeouts are greater than all of the 
above?
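Roughly, the socket timeouts would need to be strictly ordered (A) < (B) < (C) 
< (D).  A sketch of that layering with invented names and numbers (these are 
not existing Solr or Jetty settings, just an illustration of the ordering):

{code}
// Illustrative constants only; the point is the strict ordering, not the values.
public class LayeredTimeoutsSketch {
  static final int REPLICA_SOCKET_TIMEOUT_MS = 30000; // (A) leader -> replica forward
  static final int LEADER_SOCKET_TIMEOUT_MS  = 45000; // (B) non-leader -> leader forward, must exceed (A)
  static final int CLIENT_SOCKET_TIMEOUT_MS  = 60000; // (C) SolrJ client -> cluster, must exceed (B)
  static final int JETTY_SOCKET_TIMEOUT_MS   = 90000; // (D) container level, greater than all of the above

  public static void main(String[] args) {
    boolean ok = REPLICA_SOCKET_TIMEOUT_MS < LEADER_SOCKET_TIMEOUT_MS
              && LEADER_SOCKET_TIMEOUT_MS  < CLIENT_SOCKET_TIMEOUT_MS
              && CLIENT_SOCKET_TIMEOUT_MS  < JETTY_SOCKET_TIMEOUT_MS;
    System.out.println("timeout layering holds: " + ok);
  }
}
{code}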

If it's too hard to separate all these different socket timeouts now, then the 
best approximation
would be to try and minimize the time any update can take, and raise all of the 
timeouts up high enough
such that we should never see them.

We should probably also take care to only retry in certain scenarios, for 
instance when we try to forward to a leader but can't reach the leader at all.  
We should retry on connect timeout, but never on socket timeout.
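As a sketch of that retry policy (not existing Solr code; the helper name is 
made up), the decision could look something like this:

{code}
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import org.apache.http.conn.ConnectTimeoutException;

// Hypothetical helper: retry only when we never reached the remote node
// (connect failure / connect timeout).  If the connection was established and
// the read timed out, the update may still complete on the other end, so
// retrying risks applying it twice.
public class ForwardRetryPolicySketch {
  static boolean shouldRetryForwardToLeader(Throwable t) {
    for (Throwable cause = t; cause != null; cause = cause.getCause()) {
      if (cause instanceof ConnectException || cause instanceof ConnectTimeoutException) {
        return true;  // never connected: safe to retry
      }
      if (cause instanceof SocketTimeoutException) {
        return false; // connected but no response in time: do not retry
      }
    }
    return false;
  }
}
{code}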
                
> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: CMSL_fail1.log, CMSL_hang_2.txt, CMSL_hang.txt, 
> fail.130101_034142.txt, fail.130102_020942.txt, fail.130103_105104.txt, 
> fail.130103_193722.txt, fail.inconsistent.txt, test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.
