[jira] [Commented] (SOLR-13176) Testing of TLOG Replicas needs to be re-instated, may be hiding bugs

Hoss Man (JIRA) Wed, 30 Jan 2019 11:24:14 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756468#comment-16756468
 ]


Hoss Man commented on SOLR-13176:
---------------------------------

{quote}if the logic in waitForInSyncWithLeader is commented out and just 
returns immediately I expect lots of tests to fail, something like: ... won't 
work. All those other tests were not modified to handle TLOG replicas, they 
assume the same behavior of NRT.
{quote}
Ok ... so talking it through just to make sure I'm not missing anything: Unlike 
an NRT replica the contract of a TLOG (or PULL) replica doesn't guarantee that 
docs are searchable when a commit(w/ waitSearcher) happens – there is an 
inherent delay waiting for the replica to run IndexFetcher to pull the index from 
the leader ... which is why the logic in 
{{TestInjection.waitForInSyncWithLeader}} existed.

But that means that even if {{TestInjection.waitForInSyncWithLeader}} was 
implemented perfectly (and it clearly wasn't, see SOLR-12313 w/mark miller's 
comments, and my observations about it reliably failing in 
{{ForceLeaderTest.testReplicasInLIRNoLeader}} when TLOG replicas were in use) 
then these tests that were randomizing TLOG replicas were still guaranteed to fail 
unless assertions were enabled – because 
{{TestInjection.waitForInSyncWithLeader}} is only ever executed as part of a 
java {{assert}}.
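
To illustrate the assertion-only problem in isolation, here is a minimal standalone sketch (not Solr code). The class and method names are hypothetical stand-ins: {{waitForSyncHook}} plays the role of a test hook that is only reachable through an {{assert}}. Run without {{-ea}} the hook is never called; run with {{-ea}} it is called once.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of Solr.
final class AssertOnlyHookDemo {
  // Counts how many times the stand-in sync hook actually ran.
  static final AtomicInteger hookCalls = new AtomicInteger();

  // Stand-in for a test hook that is only invoked from inside an assert.
  static boolean waitForSyncHook() {
    hookCalls.incrementAndGet();
    return true;
  }

  // Mimics a code path whose only route to the hook is an assert statement.
  static void processCommit() {
    assert waitForSyncHook();
  }

  // Common idiom: the assignment only happens when assertions are enabled
  // for this class.
  static boolean assertionsEnabled() {
    boolean enabled = false;
    assert enabled = true;
    return enabled;
  }

  public static void main(String[] args) {
    processCommit();
    System.out.println("assertions enabled: " + assertionsEnabled());
    System.out.println("hook calls: " + hookCalls.get());
  }
}
```

Whether the hook runs depends entirely on a JVM flag, so any test that relies on it behaves differently depending on how the JVM was launched.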

It sounds like we either need new variants of these tests that take into 
consideration the contract of TLOG replicas, or – if we want to reinstate 
randomization of TLOG replicas in all these tests, then the _spirit_ of what 
{{TestInjection.waitForInSyncWithLeader}} was trying to do needs to be re-added 
in a way that isn't dependent on assertions being enabled, doesn't cause the 
replica updates to time out forever, and "fails cleanly" with a useful error 
message when there is a problem:
 * Perhaps a _single_ "fetch index immediately" call can run in the replica's 
DUH code path, w/o retries – but only happens if the test sets a static 
boolean (so we don't re-cause SOLR-13168 in non-test code) and then trust the tests 
to check that the *right* content gets replicated?
 * Or make the test invoke some {{waitForAllTlogReplicasInSyncWithLeaders}} 
type logic only once it's done a sequence of updates and now wants to do a 
sequence of queries?
 ** since the code in the test can know for a fact that all nodes are running 
in the same JVM, this wouldn't have to rely on polling & network connections 
like {{TestInjection.waitForInSyncWithLeader}} did ... it could whitebox reach 
into each TLOG replica SolrCore via MiniSolrCloudCluster to check the status of 
the IndexFetcher in use
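
A rough sketch of what the second option might look like, written as standalone Java with no Solr dependencies. Everything here is hypothetical: {{ReplicaView}}, {{waitForAllInSync}}, and the idea of comparing a monotonically increasing index version are assumptions for illustration, not existing Solr APIs. It still polls, but only in-memory values within the same JVM, and it gives up after a bounded timeout with a message naming the lagging replicas instead of stalling forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Solr.
final class TlogSyncWaitSketch {

  /** Hypothetical view of one replica: a name plus a way to read its index version. */
  static final class ReplicaView {
    final String name;
    final LongSupplier indexVersion;

    ReplicaView(String name, LongSupplier indexVersion) {
      this.name = name;
      this.indexVersion = indexVersion;
    }
  }

  /**
   * Waits until every replica reports a version at least as new as the leader's,
   * or throws an AssertionError listing the lagging replicas once the timeout expires.
   * Assumes versions only ever increase.
   */
  static void waitForAllInSync(LongSupplier leaderVersion,
                               List<ReplicaView> replicas,
                               long timeoutMs) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      long want = leaderVersion.getAsLong();
      List<String> lagging = new ArrayList<>();
      for (ReplicaView r : replicas) {
        long have = r.indexVersion.getAsLong();
        if (have < want) {
          lagging.add(r.name + " (have=" + have + ", want=" + want + ")");
        }
      }
      if (lagging.isEmpty()) {
        return; // everyone has caught up
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new AssertionError(
            "TLOG replicas not in sync after " + timeoutMs + "ms: " + lagging);
      }
      try {
        Thread.sleep(10); // short in-process pause, no network involved
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new AssertionError("interrupted while waiting for replicas: " + lagging, e);
      }
    }
  }
}
```

A test would call something like this once, after finishing its updates and before starting its queries, so the wait does not depend on assertions being enabled and cannot block an individual update request.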

> Testing of TLOG Replicas needs to be re-instated, may be hiding bugs
> --------------------------------------------------------------------
>
>                 Key: SOLR-13176
>                 URL: https://issues.apache.org/jira/browse/SOLR-13176
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>
> As part of mark miller's push to clean up tests, one change he made as part of 
> his _big_ SOLR-12801 commit (circa Nov2018) was to disable the randomized 
> use of TLOG replicas in a lot of tests.
> His comments at the time were that he suspected a lot of the problems he was 
> seeing were due to a poor implementation of 
> {{TestInjection.waitForInSyncWithLeader()}} (which only comes into play for 
> TLOG replicas) ultimately leading to him creating SOLR-12313.
> But based on some limited experimentation I did w/trying to re-enable TLOG 
> replica randomization in some tests after (essentially) removing 
> {{TestInjection.waitForInSyncWithLeader()}} in SOLR-13168 I'm still seeing a 
> lot of sporadic test failures when TLOG replicas get used... the only change 
> is that instead of "failing slow" because of the stalls introduced by 
> {{TestInjection.waitForInSyncWithLeader()}} they started failing quickly.
> *It's not clear if these failures are because the tests have bugs; or if the 
> tests don't account for the expected behavior of the TLOG replica types in 
> certain situations; or if the code paths being tested have bugs when dealing 
> with TLOG replicas.*
> ----
> Bottom line: As things stand today, TLOG replicas aren't being very 
> thoroughly tested, particularly in edge cases (http partitions, LIR, leader 
> election, mixed use of replica types, etc...)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

