[ 
https://issues.apache.org/jira/browse/SOLR-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904585#comment-14904585
 ] 

Yonik Seeley commented on SOLR-8085:
------------------------------------

OK, here's an analysis of fail.150922_130608

It looks like LIR happens during normal recovery after startup, and we finally 
end up doing recovery with recoverAfterStartup=false, which uses recent 
versions in peersync (including docs that were buffered while we were 
recovering) rather than the true startup versions.  This causes peersync to 
pass when it should not have.
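To make the failure mode concrete, here is a minimal sketch of the version-list choice described above. The names (startupVersions, recentVersions, versionsForPeerSync) are hypothetical illustrations and do not mirror Solr's actual RecoveryStrategy/PeerSync code; the point is only that buffered updates inflate the "recent" list, so comparing against it instead of the startup list makes the replica look more caught up than it is.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a simplified model of which version list feeds peersync.
public class VersionChoiceSketch {
    // Versions present in the tlog when the core started (the true startup state).
    static List<Long> startupVersions = Arrays.asList(100L, 101L, 102L);
    // Versions after updates buffered from the leader were applied locally.
    static List<Long> recentVersions = Arrays.asList(100L, 101L, 102L, 200L, 201L);

    static List<Long> versionsForPeerSync(boolean recoveringAfterStartup) {
        // When the flag is lost (recoveringAfterStartup == false), the freshly
        // buffered versions make the replica appear up to date to peersync.
        return recoveringAfterStartup ? startupVersions : recentVersions;
    }

    public static void main(String[] args) {
        System.out.println(versionsForPeerSync(true));   // [100, 101, 102]
        System.out.println(versionsForPeerSync(false));  // [100, 101, 102, 200, 201]
    }
}
```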

{code}

(add first appears, node 47239 appears to be coming up at the time)
  2> 74317 INFO  (qtp324612161-378) [n:127.0.0.1:40940_ c:collection1 s:shard3 
r:core_node6 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2}
 {add=[0-333 (1513033816377131008)]} 0 19
  2> 74317 INFO  (qtp324612161-378) [n:127.0.0.1:40940_ c:collection1 s:shard3 
r:core_node6 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2}
 {add=[0-333 (1513033816377131008)]} 0 19
  2> 74317 INFO  (qtp661063741-234) [n:127.0.0.1:38911_ c:collection1 s:shard3 
r:core_node2 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update params={wt=javabin&version=2} {add=[0-333 (1513033816377131008)]} 
0 31
  
(node coming up)
  2> 75064 INFO  (coreLoadExecutor-196-thread-1-processing-n:127.0.0.1:47239_) 
[n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] 
o.a.s.u.VersionInfo Refreshing highest value of _version_ for 256 version 
buckets from index
  2> 75065 INFO  (coreLoadExecutor-196-thread-1-processing-n:127.0.0.1:47239_) 
[n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] 
o.a.s.u.UpdateLog Took 31.0ms to seed version buckets with highest version 
1513033812159758336
  2> 75120 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.ZkController 
Replaying tlog for http://127.0.0.1:47239/collection1/ during startup... NOTE: 
This can take a while.
  2> 75162 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.UpdateLog add 
add{flags=a,_version_=1513033807303802880,id=1-3}
  2> 75162 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] 
o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{flags=a,_version_=1513033807303802880,id=1-3} 
LocalSolrQueryRequest{update.distrib=FROMLEADER&log_replay=true}

(replay finished)
  2> 75280 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] 
o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{flags=a,_version_=1513033812159758336,id=1-132} 
LocalSolrQueryRequest{update.distrib=FROMLEADER&log_replay=true}
 
(meanwhile, the leader is asking us to recover?)
  2> 75458 WARN  (updateExecutor-14-thread-5-processing-x:collection1 
r:core_node2 http:////127.0.0.1:47239//collection1// n:127.0.0.1:38911_ 
s:shard3 c:collection1) [n:127.0.0.1:38911_ c:collection1 s:shard3 r:core_node2 
x:collection1] o.a.s.c.LeaderInitiatedRecoveryThread Asking core=collection1 
coreNodeName=core_node10 on http://127.0.0.1:47239 to recover; unsuccessful 
after 2 of 120 attempts so far ...
  
(and we see the request to recover)
  2> 75475 INFO  (qtp2087242119-1282) [n:127.0.0.1:47239_    ] 
o.a.s.h.a.CoreAdminHandler It has been requested that we recover: 
core=collection1

(so we cancel the existing recovery)
  2> 75478 INFO  (Thread-1246) [n:127.0.0.1:47239_ c:collection1 s:shard3 
r:core_node10 x:collection1] o.a.s.u.DefaultSolrCoreState Running recovery - 
first canceling any ongoing recovery

  2> 75552 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Starting 
recovery process. recoveringAfterStartup=true

  2> 75610 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy ###### 
startupVersions=[1513033812159758336, [...]
  2> 75611 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Publishing state 
of core collection1 as recovering, leader is 
http://127.0.0.1:38911/collection1/ and I am http://127.0.0.1:47239/collection1/
  2> 75611 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.ZkController publishing 
state=recovering
 
  2> 75642 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Sending prep 
recovery command to http://127.0.0.1:38911; WaitForState: 
action=PREPRECOVERY&core=collection1&nodeName=127.0.0.1%3A47239_&coreNodeName=core_node10&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true

(In the meantime, we still receive updates.  1-414 is one of the updates that 
we are missing!!!  I assume we are buffering these?)
  2> 75782 DEBUG (qtp2087242119-1281) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=1-414} 
{update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2}
  2> 75828 INFO  (qtp2087242119-1281) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor 
[collection1] webapp= path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2}
 {add=[1-414 (1513033817943703552)]} 0 45

(what's this about?)
  2> 75923 INFO  (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.s.SolrIndexSearcher 
Opening Searcher@2ba72586[collection1] main

(wait... replay? is this from when we started up and we never stopped it?  
should we have?)
  2> 76135 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] 
o.a.s.u.p.LogUpdateProcessor PRE_UPDATE FINISH 
LocalSolrQueryRequest{update.distrib=FROMLEADER&log_replay=true}


(log replay finishes... the dbq *:* shows that this is the tlog from the 
beginning of this test, i.e. we never got to do a commit at first)
  2> 76181 INFO  (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] 
o.a.s.u.p.LogUpdateProcessor [collection1] {deleteByQuery=*:* 
(-1513033805984694272),add=[1-3 (1513033807303802880), 0-5 
(1513033807434874880), 1-6 (1513033807628861440), 1-7 (1513033807699116032), 
1-9 (1513033807797682176), 0-14 (1513033807941337088), 0-15 
(1513033807976988672), 1-14 (1513033807991668736), 1-15 (1513033808035708928), 
0-17 (1513033808075554816), ... (61 adds)]} 0 1060
  2> 76182 WARN  (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.UpdateLog Log 
replay finished. recoveryInfo=RecoveryInfo{adds=61 deletes=0 deleteByQuery=1 
errors=0 positionOfStart=0}

(we're canceling recovery again?)
  2> 76182 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] 
o.a.s.u.DefaultSolrCoreState Running recovery - first canceling any ongoing 
recovery
  2> 76186 WARN  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy 
Stopping recovery for core=collection1 coreNodeName=core_node10
  
(did this fail because we canceled the recovery?)
  2> 76194 ERROR (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Error while 
trying to recover.:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: http://127.0.0.1:38911
  2>    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  2>    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
  2>    at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:598)
  2>    at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:361)
  2>    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
  2> Caused by: org.apache.solr.client.solrj.SolrServerException: IOException 
occured when talking to server at: http://127.0.0.1:38911
  2>    at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:590)
  2>    at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:285)
  2>    at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:281)
  2>    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  2>    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
  2>    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  2>    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  2>    at java.lang.Thread.run(Thread.java:745)
  2> Caused by: java.net.SocketException: Socket closed
  2> 76202 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Finished 
recovery process.

(note: this is the last we hear of this thread... not clear if that's OK or not)
  2> 76202 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.ActionThrottle The 
last recovery attempt started 711ms ago.
  2> 76202 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ 
x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ 
c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.ActionThrottle 
Throttling recovery attempts - waiting for 9288ms
  
 (still receiving updates, while we are waiting to do the next recovery 
attempt...  are these still buffering? how do we tell?)
  2> 76252 DEBUG (qtp2087242119-1283) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=0-432} 
{update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2}


(OK, looks like we are finally trying the recovery process again... but note 
that "recoveringAfterStartup" is now false!!!!)
  2> 85491 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Starting 
recovery process. recoveringAfterStartup=false
  2> 85531 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Publishing state 
of core collection1 as recovering, leader is 
http://127.0.0.1:38911/collection1/ and I am http://127.0.0.1:47239/collection1/

(Peersync.  Because recoveringAfterStartup is false, it will use most recent 
versions to compare, rather than startup versions)
(As an example, version 1513033825996767232 is very recent, at timestamp 83465, 
while we were still not recovered)
  2> 92554 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Attempting to 
PeerSync from http://127.0.0.1:38911/collection1/ - recoveringAfterStartup=false
  2> 92554 DEBUG (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: 
core=collection1 url=http://127.0.0.1:47239 startingVersions=100 
[1513033825996767232, [...]
  2> 92646 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: 
core=collection1 url=http://127.0.0.1:47239  Received 100 versions from 
http://127.0.0.1:38911/collection1/
  2> 92647 DEBUG (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: 
core=collection1 url=http://127.0.0.1:47239  sorted versions from 
http://127.0.0.1:38911/collection1/ = [1513033825996767232, ...
  2> 92647 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: 
core=collection1 url=http://127.0.0.1:47239  Our versions are newer. 
ourLowThreshold=1513033822028955648 otherHigh=1513033825172586496
  2> 92647 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: 
core=collection1 url=http://127.0.0.1:47239 DONE. sync succeeded

(And we pass peersync when we should not)
  2> 92796 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 
s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy PeerSync 
Recovery was successful - registering as Active.

  2> ######shard3 is not consistent.  Got 375 from 
http://127.0.0.1:40940/collection1lastClient and got 230 from 
http://127.0.0.1:47239/collection1
  2> ###### sizes=375,230
  2> ###### Only in http://127.0.0.1:40940/collection1: 
[{_version_=1513033815669342208, id=0-295}, [...], 
{_version_=1513033816377131008, id=0-333}]
 
{code}
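The "Our versions are newer" decision in the log above can be modeled roughly as a window-overlap check between our recent versions and the leader's. This is a hypothetical simplification for illustration, not PeerSync's actual implementation; the ourLowThreshold and otherHigh values are taken from the failing run's log.

```java
// Simplified, hypothetical model of the overlap test suggested by the
// "Our versions are newer" log line; Solr's real PeerSync logic may differ.
public class PeerSyncOverlapSketch {
    static boolean looksInSync(long ourLowThreshold, long otherHigh) {
        // If the leader's high version is already above the low end of our
        // recent window, the replica concludes it is not missing anything.
        return otherHigh >= ourLowThreshold;
    }

    public static void main(String[] args) {
        // Values from the failing run: buffered updates pushed our window high
        // enough that the check passes even though docs are actually missing.
        System.out.println(looksInSync(1513033822028955648L, 1513033825172586496L)); // true
    }
}
```

This is exactly why comparing against recent (buffered) versions instead of startup versions is dangerous: the buffered updates raise our window so the overlap check succeeds spuriously.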

> ChaosMonkey Safe Leader Test fail with shard inconsistency.
> -----------------------------------------------------------
>
>                 Key: SOLR-8085
>                 URL: https://issues.apache.org/jira/browse/SOLR-8085
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>         Attachments: fail.150922_125320, fail.150922_130608
>
>
> I've been discussing this fail I found with Yonik.
> The problem seems to be that a replica tries to recover and publishes 
> recovering - the attempt then fails, but docs are now coming in from the 
> leader. The replica tries to recover again and has gotten enough docs to pass 
> peer sync.
> I'm trying a possible solution now where we won't allow peer sync after a 
> recovery that is not successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
