[
https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949872#comment-14949872
]
Yonik Seeley commented on SOLR-8129:
------------------------------------
This seems to be a case of extreme reordering of updates due to thread
scheduling or whatever.
The leader is selected by the chaos monkey to be killed, and there are a bunch
of updates that are sent to one shard but not another (i.e. they may have been
waiting to be sent, but didn't make it before the leader was killed).
A bunch more updates flood in, and when a leader election is held and peersync
is done, everyone thinks they are up-to-date because the most recent updates do
match.
{code}
43897 has fewer docs than 36605: missing: {_version_=1514192462261780480,
id=ft1-1715} ... {_version_=1514192462057308160, id=ft1-1700} ...
{_version_=1514192462056259585, id=ft1-1698}
// Port 43897 is recovering and begins buffering updates
2> 31048 INFO (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1
s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy Starting recovery
process. recoveringAfterStartup=true
2> 31049 INFO (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1
s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy ######
startupVersions=[]
// going "active"
2> 38552 INFO (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1
s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy Registering as
Active after recovery.
// example of a normal add
2> 55692 INFO (qtp1287419856-309) [n:127.0.0.1:43897_ c:collection1 s:shard2
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=
path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
{add=[ft1-48 (1514192447743197184)]} 0 188
// The leader is going to be killed
2> 68281 INFO (Thread-378) [ ] o.a.s.c.ChaosMonkey monkey: stop shard!
49240
// This is presumably where 43897 see the leader marked as "down" in the
clusterstate
2> 68347 INFO (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_)
[n:127.0.0.1:43897_ ] o.a.s.c.c.ZkStateReader A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/collection1/state.json for collection collection1 has
occurred - updating... (live nodes size: 9)
// updates are still being processed by the leader
2> 68743 DEBUG (qtp139680011-202) [n:127.0.0.1:49240_ c:collection1 s:shard2
r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE
add{,id=ft1-1700} {wt=javabin&version=2}
// Replica 36605 receives and processes the update, but 43897 never does (this
may be normal as the leader is being killed)
2> 69010 DEBUG (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2
r:core_node7 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE
add{,id=ft1-1700}
{update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
2> 69015 INFO (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2
r:core_node7 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=
path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
{add=[ft1-1656 (1514192461610614784), ft1-1670 (1514192462015365120), ft1-1679
(1514192462052065280), ft1-1682 (1514192462053113856), ft1-1688
(1514192462055211008), ft1-1695 (1514192462056259584), ft1-1698
(1514192462056259585), ft1-1700 (1514192462057308160), ft1-1702
(1514192462058356736)]} 0 224
// a ton of more update flood in
2> 69710 INFO (qtp1287419856-654) [n:127.0.0.1:43897_ c:collection1 s:shard2
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=
path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
{add=[ft1-1516 (1514192460689965056), ft1-1518 (1514192460829425664), ft1-1519
(1514192460867174400), ft1-1539 (1514192460868222976), ft1-1540
(1514192460868222977), ft1-1542 (1514192460868222978), ft1-1546
(1514192460869271552), ft1-1552 (1514192460869271553), ft1-1563
(1514192460871368704), ft1-1564 (1514192460871368705), ... (41 adds)]} 0 992
2> 70571 INFO (qtp1287419856-655) [n:127.0.0.1:43897_ c:collection1 s:shard2
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=
path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
{add=[ft1-1790 (1514192462017462272), ft1-1791 (1514192462035288064), ft1-1803
(1514192462036336640), ft1-1810 (1514192462037385216), ft1-1813
(1514192462042628096), ft1-1814 (1514192462043676672), ft1-1815
(1514192462043676673), ft1-1818 (1514192462044725248), ft1-1819
(1514192462044725249), ft1-1820 (1514192462303723520), ... (45 adds)]} 0 856
2> 70911 INFO (qtp1287419856-309) [n:127.0.0.1:43897_ c:collection1 s:shard2
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=
path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
{add=[ft1-1998 (1514192463049261056), ft1-1999 (1514192463084912640), ft1-2001
(1514192463085961216), ft1-2003 (1514192463085961217), ft1-2011
(1514192463087009792), ft1-2012 (1514192463088058368), ft1-2016
(1514192463099592704), ft1-2020 (1514192463101689856), ft1-2022
(1514192463102738432), ft1-2023 (1514192463392145408), ... (35 adds)]} 0 325
2> 71386 INFO (qtp1287419856-653) [n:127.0.0.1:43897_ c:collection1 s:shard2
r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp=
path=/update
params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
{add=[ft1-2203 (1514192463788507136), ft1-2206 (1514192464060088320), ft1-2217
(1514192464074768384), ft1-2219 (1514192464074768385), ft1-2224
(1514192464075816960), ft1-2229 (1514192464075816961), ft1-2231
(1514192464075816962), ft1-2239 (1514192464076865536), ft1-2249
(1514192464077914112), ft1-2268 (1514192464077914113), ... (27 adds)]} 0 471
// the most 100 recent versions match for both replicas
2> 76191 DEBUG (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_)
[n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1]
o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:43897 sorted
versions from http://127.0.0.1:36605/collection1/ = [1514192465046798337,
1514192465046798336, ... , 1514192463538946050, 1514192463538946049]
2> 76241 DEBUG (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2
r:core_node7 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1
url=http://127.0.0.1:36605 sorted versions from
http://127.0.0.1:43897/collection1/ = [1514192465046798337,
1514192465046798336, ... , 1514192463538946050, 1514192463538946049]
// 43897 becomes the new leader
2> 76242 INFO (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_)
[n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1]
o.a.s.c.ShardLeaderElectionContext I am the new leader:
http://127.0.0.1:43897/collection1/ shard2
// end of test
2> ######shard2 is not consistent. Got 2076 from
http://127.0.0.1:43897/collection1 (previous client) and got 2103 from
http://127.0.0.1:36605/collection1
2> 129545 INFO (qtp1287419856-653) [n:127.0.0.1:43897_ c:collection1
s:shard2 r:core_node4 x:collection1] o.a.s.c.S.Request [collection1] webapp=
path=/select
params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2}
hits=2076 status=0 QTime=114
2> 129656 INFO (qtp33344819-422) [n:127.0.0.1:36605_ c:collection1 s:shard2
r:core_node7 x:collection1] o.a.s.c.S.Request [collection1] webapp=
path=/select
params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2}
hits=2103 status=0 QTime=4
2> ######http://127.0.0.1:43897/collection1:
SolrDocumentList[sz=2076]=[SolrDocument{id=0-0, _version_=1514192448210862080},
SolrDocument{id=0-100, _version_=1514192469996077056}, SolrDocument{id=0-1000,
_version_=1514192493627834368}, SolrDocument{id=0-1006,
_version_=1514192493742129152}, SolrDocument{id=0-1008,
_version_=1514192493795606528}] , [...] , [SolrDocument{id=ft1-984,
_version_=1514192458787848193}, SolrDocument{id=ft1-991,
_version_=1514192459077255168}, SolrDocument{id=ft1-992,
_version_=1514192459148558336}, SolrDocument{id=ft1-994,
_version_=1514192459149606912}, SolrDocument{id=ft1-997,
_version_=1514192459150655488}]
2> ######http://127.0.0.1:36605/collection1:
SolrDocumentList[sz=2103]=[SolrDocument{id=0-0, _version_=1514192448210862080},
SolrDocument{id=0-100, _version_=1514192469996077056}, SolrDocument{id=0-1000,
_version_=1514192493627834368}, SolrDocument{id=0-1006,
_version_=1514192493742129152}, SolrDocument{id=0-1008,
_version_=1514192493795606528}] , [...] , [SolrDocument{id=ft1-984,
_version_=1514192458787848193}, SolrDocument{id=ft1-991,
_version_=1514192459077255168}, SolrDocument{id=ft1-992,
_version_=1514192459148558336}, SolrDocument{id=ft1-994,
_version_=1514192459149606912}, SolrDocument{id=ft1-997,
_version_=1514192459150655488}]
2> ###### sizes=2076,2103
2> ###### Only in http://127.0.0.1:36605/collection1:
[{_version_=1514192462261780480, id=ft1-1715}, {_version_=1514192462260731904,
id=ft1-1711}, {_version_=1514192462263877632, id=ft1-1719},
{_version_=1514192462262829056, id=ft1-1718}, {_version_=1514192462264926208,
id=ft1-1720}, {_version_=1514192462015365120, id=ft1-1670},
{_version_=1514192462276460544, id=ft1-1722}, {_version_=1514192462415921152,
id=ft1-1844}, {_version_=1514192462413824000, id=ft1-1840},
{_version_=1514192462230323200, id=ft1-1709}, {_version_=1514192462414872576,
id=ft1-1841}, {_version_=1514192462412775424, id=ft1-1838},
{_version_=1514192462412775425, id=ft1-1839}, {_version_=1514192462411726848,
id=ft1-1837}, {_version_=1514192462230323201, id=ft1-1710},
{_version_=1514192462057308160, id=ft1-1700}, {_version_=1514192462058356737,
id=ft1-1703}, {_version_=1514192462058356736, id=ft1-1702},
{_version_=1514192462375026688, id=ft1-1796}, {_version_=1514192462400192512,
id=ft1-1798}, {_version_=1514192462399143936, id=ft1-1797},
{_version_=1514192462193623040, id=ft1-1707}, {_version_=1514192462052065280,
id=ft1-1679}, {_version_=1514192462055211008, id=ft1-1688},
{_version_=1514192462053113856, id=ft1-1682}, {_version_=1514192462056259584,
id=ft1-1695}, {_version_=1514192462056259585, id=ft1-1698}]
{code}
> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
> Key: SOLR-8129
> URL: https://issues.apache.org/jira/browse/SOLR-8129
> Project: Solr
> Issue Type: Bug
> Reporter: Yonik Seeley
> Attachments: fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures,
> including shard inconsistency.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]