[
https://issues.apache.org/jira/browse/SOLR-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-11469:
----------------------------
Attachment: SOLR-11469.patch
Here's my initial attempt at a fix, mainly focusing on...
* adding comments
* adding logging
* renaming variables to be more explicit about what we're expecting
* tightening up the call to {{findLeaderReplicaWithDuplicatedName}} so we
explicitly look for the leader of shard1, since that's what we assert against
later
* adding extra asserts that shard2 doesn't have an election either (the existing
asserts only checked the second collection)
With this fix, the test *seems* to pass a little more often -- but it's still
easy to get a different type of failure, one that I already suspected would be
very plausible given the existing code...
The entire premise of {{findLeaderReplicaWithDuplicatedName}} is that we can
find "a leader" from collection1 with the same {{Replica.getName()}} as a
Replica from collection2 -- but IIUC there's no guarantee that will be true.
Here's an example failure with the patch applied...
{noformat}
[junit4] 2> 8485 INFO
(TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874]) [ ]
o.a.s.SolrTestCaseJ4 ###Starting test
[junit4] 2> 8486 INFO
(TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874]) [ ]
o.a.s.c.LeaderElectionContextKeyTest All Col1 Replicas:
[core_node2:{"core":"testCollection1_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"},
core_node4:{"core":"testCollection1_shard2_replica_n3","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
[junit4] 2> 8486 INFO
(TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874]) [ ]
o.a.s.c.LeaderElectionContextKeyTest All Col2 Replicas:
[core_node3:{"core":"testCollection2_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"},
core_node4:{"core":"testCollection2_shard2_replica_n2","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
[junit4] 2> 8488 INFO
(TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874]) [ ]
o.a.s.SolrTestCaseJ4 ###Ending test
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=LeaderElectionContextKeyTest -Dtests.method=test
-Dtests.seed=B0F9446FF638874 -Dtests.slow=true -Dtests.locale=ga
-Dtests.timezone=Asia/Chongqing -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
[junit4] FAILURE 0.02s | LeaderElectionContextKeyTest.test <<<
[junit4] > Throwable #1: java.lang.AssertionError: Unable to find
col1+shard1 leader w/same name as replica in col2:
[core_node2:{"core":"testCollection1_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
<=?=>
[core_node3:{"core":"testCollection2_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"},
core_node4:{"core":"testCollection2_shard2_replica_n2","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
[junit4] > at
__randomizedtesting.SeedInfo.seed([B0F9446FF638874:835BAB9C519FE58C]:0)
[junit4] > at
org.apache.solr.cloud.LeaderElectionContextKeyTest.test(LeaderElectionContextKeyTest.java:95)
[junit4] > at java.lang.Thread.run(Thread.java:748)
{noformat}
Note:
* that seed won't reproduce reliably, because the leader node _might_ randomly
have the same name as one of the replicas from the other collection
* in the particular log above, if we did our testing/assertions against
col1+shard2 instead of col1+shard1 then we'd get lucky and find the
coreNodeName overlap with col2 that the test expects -- but unless I'm missing
something that's still just a fluke and not something we can depend upon
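To make the name-overlap problem concrete, here is a minimal standalone sketch of the lookup, using only the replica names, shards, and leader flags from the log above. It does not use Solr's classes: {{ReplicaInfo}}, {{findLeaderWithDuplicatedName}}, {{col1()}} and {{col2()}} are hypothetical stand-ins invented for illustration, and the real {{findLeaderReplicaWithDuplicatedName}} in the test may well be written differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

public class DuplicatedNameSketch {

    // Hypothetical stand-in for the few replica properties that matter here;
    // this is NOT Solr's Replica class.
    static final class ReplicaInfo {
        final String name;    // coreNodeName, e.g. "core_node2"
        final String shard;   // e.g. "shard1"
        final boolean leader;

        ReplicaInfo(String name, String shard, boolean leader) {
            this.name = name;
            this.shard = shard;
            this.leader = leader;
        }
    }

    // Look for a leader of the given col1 shard whose name also appears
    // anywhere in col2. Nothing guarantees such a leader exists.
    static Optional<ReplicaInfo> findLeaderWithDuplicatedName(
            List<ReplicaInfo> col1, List<ReplicaInfo> col2, String shard) {
        Set<String> col2Names = col2.stream()
                .map(r -> r.name)
                .collect(Collectors.toSet());
        return col1.stream()
                .filter(r -> r.leader && r.shard.equals(shard))
                .filter(r -> col2Names.contains(r.name))
                .findFirst();
    }

    // Replica names copied from the "All Col1 Replicas" log line.
    static List<ReplicaInfo> col1() {
        return Arrays.asList(
                new ReplicaInfo("core_node2", "shard1", true),
                new ReplicaInfo("core_node4", "shard2", true));
    }

    // Replica names copied from the "All Col2 Replicas" log line.
    static List<ReplicaInfo> col2() {
        return Arrays.asList(
                new ReplicaInfo("core_node3", "shard1", true),
                new ReplicaInfo("core_node4", "shard2", true));
    }

    public static void main(String[] args) {
        // prints "shard1 match: false" (core_node2 is not a name in col2)
        System.out.println("shard1 match: "
                + findLeaderWithDuplicatedName(col1(), col2(), "shard1").isPresent());
        // prints "shard2 match: true" (core_node4 happens to exist in both)
        System.out.println("shard2 match: "
                + findLeaderWithDuplicatedName(col1(), col2(), "shard2").isPresent());
    }
}
```

With the names from this particular run, the shard1 lookup comes back empty and the shard2 lookup only succeeds because {{core_node4}} happened to be assigned in both collections.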
I'm not really sure how to make this test work reliably? ... unless maybe we
manually add replicas with explicitly specified {{coreNodeName}} and force them
to be the leader????
> LeaderElectionContextKeyTest has flawed logic: 50% of the time it checks the
> wrong shard's elections
> ----------------------------------------------------------------------------------------------------
>
> Key: SOLR-11469
> URL: https://issues.apache.org/jira/browse/SOLR-11469
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Attachments: SOLR-11469.patch
>
>
> LeaderElectionContextKeyTest is very flaky -- and on Miller's beastit reports
> it shows a failure rate suspiciously close to "50%".
> Digging into the test i realized that it creates a 2 shard index, then picks
> "a leader" to kill (arbitrarily) and then asserts that the leader election
> nodes for *shard1* are affected ... so ~50% of the time it kills the shard2
> leader and then fails because it doesn't see an election in shard1.