[
https://issues.apache.org/jira/browse/SOLR-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756468#comment-16756468
]
Hoss Man commented on SOLR-13176:
---------------------------------
{quote}if the logic in waitForInSyncWithLeader is commented out and just
returns immediately I expect lots of tests to fail, something like: ... won't
work. All those other tests were not modified to handle TLOG replicas, they
assume the same behavior of NRT.
{quote}
Ok ... so talking it through just to make sure i'm not missing anything: Unlike
an NRT replica the contract of a TLOG (or PULL) replica doesn't guarantee that
docs are searchable when a commit(w/ waitSearcher) happens – there is an
inherent delay waiting for the replica to run IndexFetcher to pull the index from
the leader ... which is why the logic in
{{TestInjection.waitForInSyncWithLeader}} existed.
But that means that even if {{TestInjection.waitForInSyncWithLeader}} was
implemented perfectly (and it clearly wasn't, see SOLR-12313 w/mark miller's
comments, and my observations about it reliably failing in
{{ForceLeaderTest.testReplicasInLIRNoLeader}} when TLOG replicas were in use)
then these tests that were randomizing TLOG replicas were still guaranteed to fail
unless assertions were enabled – because
{{TestInjection.waitForInSyncWithLeader}} is only ever executed as part of a
java {{assert}}.
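To make the {{assert}} point concrete, here is a minimal standalone sketch (not Solr code; the class and method names are hypothetical stand-ins for the real helper) showing that a method invoked only from inside a java {{assert}} is never called unless the JVM runs with {{-ea}}:

```java
// Hypothetical demo class, not part of Solr.
final class AssertSideEffectDemo {
  static int syncCalls = 0;

  // Stand-in for a helper such as TestInjection.waitForInSyncWithLeader.
  static boolean waitForSync() {
    syncCalls++;
    return true;
  }

  // Returns true only when this class was loaded with assertions enabled (-ea).
  static boolean assertionsEnabled() {
    boolean enabled = false;
    assert enabled = true; // intentional side effect, only runs under -ea
    return enabled;
  }

  // Stand-in for an update code path that "waits" only via an assert.
  static int applyUpdate() {
    assert waitForSync(); // the whole expression is skipped without -ea
    return syncCalls;
  }

  public static void main(String[] args) {
    System.out.println("assertions enabled: " + assertionsEnabled());
    System.out.println("waitForSync calls:  " + applyUpdate());
  }
}
```

Running this once with {{java -ea}} and once without should show the call count following the assertion flag, which is the same dependency the randomized tests had.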
It sounds like we either need new variants of these tests that take into
consideration the contract of TLOG replicas, or – if we want to reinstate
randomization of TLOG replicas in all these tests, then the _spirit_ of what
{{TestInjection.waitForInSyncWithLeader}} was trying to do needs to be re-added
in a way that isn't dependent on assertions being enabled, doesn't cause the
replica updates to time out forever, and "fails cleanly" with a useful error
message when there is a problem:
* Perhaps a _single_ "fetch index immediately" call can run in the replica's
DUH code path, w/o retries – but only happens if the test has set a static
boolean (so we don't re-cause SOLR-13168 in non tests) and then trust the tests
to check that the *right* content gets replicated?
* Or make the test invoke some {{waitForAllTlogReplicasInSyncWithLeaders}}
type logic only once it's done a sequence of updates and now wants to do a
sequence of queries?
** since the code in the test can know for a fact that all nodes are running
in the same JVM, this wouldn't have to rely on polling & network connections
like {{TestInjection.waitForInSyncWithLeader}} did ... it could whitebox reach
into each TLOG replica SolrCore via MiniSolrCloudCluster to check the status of
the IndexFetcher in use
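A rough sketch of what that second option might look like, under heavy assumptions: {{ReplicaSyncWaiter}}, its method, and the "version" suppliers are all hypothetical, and how a test would actually read the leader's and each TLOG replica's index version out of the SolrCore is left open. The sketch still loops, but only over in-JVM state with a bounded timeout, it does not depend on assertions being enabled, and it fails with a message naming the lagging replicas:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical test helper, not an existing Solr API.
final class ReplicaSyncWaiter {

  // Blocks until every replica reports the same version as the leader,
  // or throws AssertionError (independent of -ea) once timeoutMs elapses.
  static void waitForAllInSync(LongSupplier leaderVersion,
                               Map<String, LongSupplier> replicaVersions,
                               long timeoutMs) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      final long expected = leaderVersion.getAsLong();
      // Collect the replicas whose version differs from the leader's.
      final Map<String, Long> lagging = new LinkedHashMap<>();
      for (Map.Entry<String, LongSupplier> e : replicaVersions.entrySet()) {
        final long actual = e.getValue().getAsLong();
        if (actual != expected) {
          lagging.put(e.getKey(), actual);
        }
      }
      if (lagging.isEmpty()) {
        return; // everything caught up
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new AssertionError("Replicas not in sync with leader (version "
            + expected + ") after " + timeoutMs + "ms: " + lagging);
      }
      try {
        Thread.sleep(10); // short in-process pause, no network round trip
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new AssertionError("Interrupted while waiting for replicas: " + lagging);
      }
    }
  }
}
```

A test would call this once after its batch of updates and before its batch of queries, passing one supplier per TLOG replica.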
> Testing of TLOG Replicas needs to be re-instated, may be hiding bugs
> --------------------------------------------------------------------
>
> Key: SOLR-13176
> URL: https://issues.apache.org/jira/browse/SOLR-13176
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
>
> As part of mark miller's push to cleanup tests, one change he made as part of
> his _big_ SOLR-12801 commit (circa Nov2018) was to disable the randomized
> use of TLOG replicas in a lot of tests.
> His comments at the time were that he suspected a lot of the problems he was
> seeing were due to a poor implementation of
> {{TestInjection.waitForInSyncWithLeader()}} (which only comes into play for
> TLOG replicas) ultimately leading to him creating SOLR-12313.
> But based on some limited experimentation I did w/trying to re-enable TLOG
> replica randomization in some tests after (essentially) removing
> {{TestInjection.waitForInSyncWithLeader()}} in SOLR-13168 i'm still seeing a
> lot of sporadic test failures when TLOG replicas get used... the only change
> is that instead of "failing slow" because of the stalls introduced by
> {{TestInjection.waitForInSyncWithLeader()}} they started failing quickly.
> *It's not clear if these failures are because the tests have bugs; or if the
> tests don't account for the expected behavior of the TLOG replica types in
> certain situations; or if the code paths being tested have bugs when dealing
> with TLOG replicas.*
> ----
> Bottom line: As things stand today, TLOG replicas aren't being very
> thoroughly tested, particularly in edge cases (http partitions, LIR, leader
> election, mixed use of replica types, etc...)