Hoss Man created SOLR-13486:
-------------------------------

             Summary: race condition between leader's "replay on startup" and 
non-leader's "recover from leader" can leave replicas out of sync 
(TestCloudConsistency)
                 Key: SOLR-13486
                 URL: https://issues.apache.org/jira/browse/SOLR-13486
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Hoss Man


I've been investigating some jenkins failures from TestCloudConsistency, which 
at first glance suggest a problem w/replica(s) recovering after a network 
partition from the leader - but in digging into the logs the root cause 
actually seems to be a thread race condition when a replica (the leader) is 
first registered...
 * The {{ZkContainer.registerInZk(...)}} method (which is called by 
{{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically 
run in a background thread (via the {{ZkContainer.coreZkRegister}} 
ExecutorService)
 * {{ZkContainer.registerInZk(...)}} delegates to 
{{ZKController.register(...)}} which is ultimately responsible for checking if 
there are any "old" tlogs on disk, and if so handling the "Replaying tlog for 
<URL> during startup" logic
 * Because this happens in a background thread, other logic/requests can be 
handled by this core/replica in the meantime - before it starts (or while it 
is in the middle of) replaying the tlogs
 ** Notably: *leaders that have not yet replayed tlogs on startup will 
erroneously respond to RTG / Fingerprint / PeerSync requests from other 
replicas w/incomplete data* (see the sketch below)
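
To make the window concrete, here is a minimal standalone Java sketch - *not* Solr code; the 
class, list, and method names are all made up for illustration - of how running the 
register/replay work on a background executor lets reads observe the pre-replay state:

{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model of the window described above: the "register + replay tlog" work runs on a
// background executor (standing in for ZkContainer's coreZkRegister executor), so requests
// that arrive before it finishes see the pre-replay view of the data.
public class RegisterRaceSketch {

    // documents currently visible to RTG / fingerprint style reads (hypothetical stand-in)
    static final List<String> visibleDocs = new CopyOnWriteArrayList<>();
    // updates that only exist in an "old" tlog on disk and still need replaying
    static final List<String> unreplayedTlog = List.of("tlogDoc1", "tlogDoc2");

    public static void main(String[] args) throws Exception {
        ExecutorService coreZkRegister = Executors.newSingleThreadExecutor();

        // background registration: eventually performs the "Replaying tlog ... during startup" step
        coreZkRegister.submit(() -> {
            sleepQuietly(500);                    // zk session setup, leader election, etc.
            visibleDocs.addAll(unreplayedTlog);   // tlog replay finally makes these docs visible
        });

        // meanwhile another replica's recovery request is answered from the pre-replay state
        System.out.println("view during recovery window: " + visibleDocs); // likely [] -- missing tlog docs

        coreZkRegister.shutdown();
        coreZkRegister.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("view after replay finishes:  " + visibleDocs); // [tlogDoc1, tlogDoc2]
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}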

...In general, it seems scary / fishy to me that a replica can (apparently) 
become *ACTIVE* before it has finished its {{registerInZk}} + "Replaying tlog 
... during startup" logic ... particularly since this can happen even for 
replicas that are/become leaders. It seems like this could potentially cause a 
whole host of problems, only one of which manifests in this particular test 
failure:
 * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog ... 
during startup" check:
 ** replicaX can recognize (via zk terms) that it should be the leader(X)
 ** this leaderX can then instruct some other replicaY to recover from it
 ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX 
(either of its own volition, or because it was instructed to by leaderX) in an 
attempt to recover
 *** the responses to these recovery requests will not include updates in the 
tlog files that existed on leaderX prior to startup that have not yet been 
replayed
 * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... 
during startup" can finish
 ** replicaY now thinks it is in sync with leaderX, but leaderX has (replayed) 
updates the other replicas know nothing about (see the interleaving sketch 
below)
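
Here is the same BEFORE/AFTER timeline forced into a deterministic interleaving with a small 
self-contained Java sketch (again, hypothetical names rather than Solr code; the latch is only 
there to pin down the ordering described above):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Toy timeline of the BEFORE/AFTER steps above: replicaY copies leaderX's state while
// leaderX's startup tlog replay is still pending, so once the replay lands the two
// replicas hold different document sets even though replicaY believes it recovered.
public class OutOfSyncTimeline {
    public static void main(String[] args) throws Exception {
        List<String> leaderX    = new ArrayList<>(List.of("docA")); // visible before replay
        List<String> tlogOnDisk = List.of("docB");                  // only in leaderX's old tlog
        List<String> replicaY   = new ArrayList<>();

        CountDownLatch recoveryFinished = new CountDownLatch(1);

        // leaderX's "Replaying tlog ... during startup", deliberately held back until
        // replicaY's recovery has completed (the losing side of the race in this report)
        Thread startupReplay = new Thread(() -> {
            try {
                recoveryFinished.await();
                leaderX.addAll(tlogOnDisk);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        startupReplay.start();

        // replicaY "recovers" from leaderX's pre-replay state and believes it is in sync
        replicaY.addAll(leaderX);
        recoveryFinished.countDown();
        startupReplay.join();

        System.out.println("leaderX : " + leaderX);  // [docA, docB]
        System.out.println("replicaY: " + replicaY); // [docA]  -- out of sync despite "recovering"
    }
}
{code}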


