[jira] [Commented] (SOLR-8129) HdfsChaosMonkeyNothingIsSafeTest failures

Yonik Seeley (JIRA) Thu, 08 Oct 2015 21:21:38 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949872#comment-14949872
 ]


Yonik Seeley commented on SOLR-8129:
------------------------------------

This seems to be a case of extreme reordering of updates due to thread 
scheduling or whatever.
The leader is selected by the chaos monkey to be killed, and there are a bunch 
of updates that are sent to one shard but not another (i.e. they may have been 
waiting to be sent, but didn't make it before the leader was killed).
A bunch more updates flood in, and when a leader election is held and peersync 
is done, everyone thinks they are up-to-date because the most recent updates do 
match.

{code}
43897 has fewer docs than 36605: missing: {_version_=1514192462261780480, 
id=ft1-1715} ... {_version_=1514192462057308160, id=ft1-1700} ... 
{_version_=1514192462056259585, id=ft1-1698}

// Port 43897 is recovering and begins buffering updates
  2> 31048 INFO  (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1 
s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy Starting recovery 
process. recoveringAfterStartup=true
  2> 31049 INFO  (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1 
s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy ###### 
startupVersions=[]

// going "active"
  2> 38552 INFO  (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1 
s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy Registering as 
Active after recovery.

// example of a normal add
  2> 55692 INFO  (qtp1287419856-309) [n:127.0.0.1:43897_ c:collection1 s:shard2 
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
 {add=[ft1-48 (1514192447743197184)]} 0 188
  
// The leader is going to be killed
  2> 68281 INFO  (Thread-378) [    ] o.a.s.c.ChaosMonkey monkey: stop shard! 
49240
 
// This is presumably where 43897 see the leader marked as "down" in the 
clusterstate
  2> 68347 INFO  (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_) 
[n:127.0.0.1:43897_    ] o.a.s.c.c.ZkStateReader A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/collection1/state.json for collection collection1 has 
occurred - updating... (live nodes size: 9)

// updates are still being processed by the leader
  2> 68743 DEBUG (qtp139680011-202) [n:127.0.0.1:49240_ c:collection1 s:shard2 
r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=ft1-1700} {wt=javabin&version=2}

// Replica 36605 receives and processes the update, but 43897 never does (this 
may be normal as the leader is being killed)
  2> 69010 DEBUG (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2 
r:core_node7 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE 
add{,id=ft1-1700} 
{update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
  2> 69015 INFO  (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2 
r:core_node7 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
 {add=[ft1-1656 (1514192461610614784), ft1-1670 (1514192462015365120), ft1-1679 
(1514192462052065280), ft1-1682 (1514192462053113856), ft1-1688 
(1514192462055211008), ft1-1695 (1514192462056259584), ft1-1698 
(1514192462056259585), ft1-1700 (1514192462057308160), ft1-1702 
(1514192462058356736)]} 0 224

// a ton of more update flood in
  2> 69710 INFO  (qtp1287419856-654) [n:127.0.0.1:43897_ c:collection1 s:shard2 
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
 {add=[ft1-1516 (1514192460689965056), ft1-1518 (1514192460829425664), ft1-1519 
(1514192460867174400), ft1-1539 (1514192460868222976), ft1-1540 
(1514192460868222977), ft1-1542 (1514192460868222978), ft1-1546 
(1514192460869271552), ft1-1552 (1514192460869271553), ft1-1563 
(1514192460871368704), ft1-1564 (1514192460871368705), ... (41 adds)]} 0 992
  2> 70571 INFO  (qtp1287419856-655) [n:127.0.0.1:43897_ c:collection1 s:shard2 
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
 {add=[ft1-1790 (1514192462017462272), ft1-1791 (1514192462035288064), ft1-1803 
(1514192462036336640), ft1-1810 (1514192462037385216), ft1-1813 
(1514192462042628096), ft1-1814 (1514192462043676672), ft1-1815 
(1514192462043676673), ft1-1818 (1514192462044725248), ft1-1819 
(1514192462044725249), ft1-1820 (1514192462303723520), ... (45 adds)]} 0 856
 2> 70911 INFO  (qtp1287419856-309) [n:127.0.0.1:43897_ c:collection1 s:shard2 
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
 {add=[ft1-1998 (1514192463049261056), ft1-1999 (1514192463084912640), ft1-2001 
(1514192463085961216), ft1-2003 (1514192463085961217), ft1-2011 
(1514192463087009792), ft1-2012 (1514192463088058368), ft1-2016 
(1514192463099592704), ft1-2020 (1514192463101689856), ft1-2022 
(1514192463102738432), ft1-2023 (1514192463392145408), ... (35 adds)]} 0 325
  2> 71386 INFO  (qtp1287419856-653) [n:127.0.0.1:43897_ c:collection1 s:shard2 
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= 
path=/update 
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
 {add=[ft1-2203 (1514192463788507136), ft1-2206 (1514192464060088320), ft1-2217 
(1514192464074768384), ft1-2219 (1514192464074768385), ft1-2224 
(1514192464075816960), ft1-2229 (1514192464075816961), ft1-2231 
(1514192464075816962), ft1-2239 (1514192464076865536), ft1-2249 
(1514192464077914112), ft1-2268 (1514192464077914113), ... (27 adds)]} 0 471
  

// the most 100 recent versions match for both replicas
  2> 76191 DEBUG (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_) 
[n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] 
o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:43897  sorted 
versions from http://127.0.0.1:36605/collection1/ = [1514192465046798337, 
1514192465046798336, ... , 1514192463538946050, 1514192463538946049]
  2> 76241 DEBUG (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2 
r:core_node7 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 
url=http://127.0.0.1:36605  sorted versions from 
http://127.0.0.1:43897/collection1/ = [1514192465046798337, 
1514192465046798336, ... , 1514192463538946050, 1514192463538946049]

// 43897 becomes the new leader
  2> 76242 INFO  (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_) 
[n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] 
o.a.s.c.ShardLeaderElectionContext I am the new leader: 
http://127.0.0.1:43897/collection1/ shard2


// end of test
  2> ######shard2 is not consistent.  Got 2076 from 
http://127.0.0.1:43897/collection1 (previous client) and got 2103 from 
http://127.0.0.1:36605/collection1
  2> 129545 INFO  (qtp1287419856-653) [n:127.0.0.1:43897_ c:collection1 
s:shard2 r:core_node4 x:collection1] o.a.s.c.S.Request [collection1] webapp= 
path=/select 
params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2}
 hits=2076 status=0 QTime=114 
  2> 129656 INFO  (qtp33344819-422) [n:127.0.0.1:36605_ c:collection1 s:shard2 
r:core_node7 x:collection1] o.a.s.c.S.Request [collection1] webapp= 
path=/select 
params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2}
 hits=2103 status=0 QTime=4 
  2> ######http://127.0.0.1:43897/collection1: 
SolrDocumentList[sz=2076]=[SolrDocument{id=0-0, _version_=1514192448210862080}, 
SolrDocument{id=0-100, _version_=1514192469996077056}, SolrDocument{id=0-1000, 
_version_=1514192493627834368}, SolrDocument{id=0-1006, 
_version_=1514192493742129152}, SolrDocument{id=0-1008, 
_version_=1514192493795606528}] , [...] , [SolrDocument{id=ft1-984, 
_version_=1514192458787848193}, SolrDocument{id=ft1-991, 
_version_=1514192459077255168}, SolrDocument{id=ft1-992, 
_version_=1514192459148558336}, SolrDocument{id=ft1-994, 
_version_=1514192459149606912}, SolrDocument{id=ft1-997, 
_version_=1514192459150655488}]
  2> ######http://127.0.0.1:36605/collection1: 
SolrDocumentList[sz=2103]=[SolrDocument{id=0-0, _version_=1514192448210862080}, 
SolrDocument{id=0-100, _version_=1514192469996077056}, SolrDocument{id=0-1000, 
_version_=1514192493627834368}, SolrDocument{id=0-1006, 
_version_=1514192493742129152}, SolrDocument{id=0-1008, 
_version_=1514192493795606528}] , [...] , [SolrDocument{id=ft1-984, 
_version_=1514192458787848193}, SolrDocument{id=ft1-991, 
_version_=1514192459077255168}, SolrDocument{id=ft1-992, 
_version_=1514192459148558336}, SolrDocument{id=ft1-994, 
_version_=1514192459149606912}, SolrDocument{id=ft1-997, 
_version_=1514192459150655488}]
  2> ###### sizes=2076,2103
  2> ###### Only in http://127.0.0.1:36605/collection1: 
[{_version_=1514192462261780480, id=ft1-1715}, {_version_=1514192462260731904, 
id=ft1-1711}, {_version_=1514192462263877632, id=ft1-1719}, 
{_version_=1514192462262829056, id=ft1-1718}, {_version_=1514192462264926208, 
id=ft1-1720}, {_version_=1514192462015365120, id=ft1-1670}, 
{_version_=1514192462276460544, id=ft1-1722}, {_version_=1514192462415921152, 
id=ft1-1844}, {_version_=1514192462413824000, id=ft1-1840}, 
{_version_=1514192462230323200, id=ft1-1709}, {_version_=1514192462414872576, 
id=ft1-1841}, {_version_=1514192462412775424, id=ft1-1838}, 
{_version_=1514192462412775425, id=ft1-1839}, {_version_=1514192462411726848, 
id=ft1-1837}, {_version_=1514192462230323201, id=ft1-1710}, 
{_version_=1514192462057308160, id=ft1-1700}, {_version_=1514192462058356737, 
id=ft1-1703}, {_version_=1514192462058356736, id=ft1-1702}, 
{_version_=1514192462375026688, id=ft1-1796}, {_version_=1514192462400192512, 
id=ft1-1798}, {_version_=1514192462399143936, id=ft1-1797}, 
{_version_=1514192462193623040, id=ft1-1707}, {_version_=1514192462052065280, 
id=ft1-1679}, {_version_=1514192462055211008, id=ft1-1688}, 
{_version_=1514192462053113856, id=ft1-1682}, {_version_=1514192462056259584, 
id=ft1-1695}, {_version_=1514192462056259585, id=ft1-1698}]
 {code}

> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
>                 Key: SOLR-8129
>                 URL: https://issues.apache.org/jira/browse/SOLR-8129
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Yonik Seeley
>         Attachments: fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures, 
> including shard inconsistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-8129) HdfsChaosMonkeyNothingIsSafeTest failures

Reply via email to