mcfatealan commented on issue #1077: ZOOKEEPER-3531: Synchronization on 
ACLCache cause cluster to hang whe…
URL: https://github.com/apache/zookeeper/pull/1077#issuecomment-528928031
 
 
   Hi @lvfangmin thanks for pointing out the potential risks here!
   I totally agree that we should always be careful about potential correctness 
issue when dealing with changes to synchronization. Meanwhile, in this case 
(plz correct me if I'm wrong here) I think the proposed fix is rather safe 
that: 1) it does not break existing fixes in ZOOKEEPER-3306 2) it does not 
introduce new problems.
   
   ### 1)
   The race condition mentioned in the ZOOKEEPER-3306 exists before the fix 
mentioned in this thread. We carefully inspect the issue you mentioned in 
ZOOKEEPER-3306. And the fix introduced in ZOOKEEPER-3306 is not to prevent this 
race condition to avoid the fuzzy snapshot, but to add the missing ACL during 
replay. New transactions are always possible between `serializeACL` and 
`serializeNodes`, and our fix would not create a new race condition here.
   
   After the fix we can still pass the test 
`testACLCreatedDuringFuzzySnapshotSync`.
   
   ### 2)
   Essentially the proposed fix is finer-grained synchronization. The original 
codes look like:
   **v1**
   ```java
   synchronized public void serialize(OutputArchive oa) throws IOException {
       writeACL(longKeyMap);
   }
   ```
   
   which is equilvalent to: 
   **v1.b=v1**
   
   ```java
   public void serialize(OutputArchive oa) throws IOException {
       synchronized (this) {
           clonedLongKeyMap = copyACL(longKeyMap);
           writeACL(clonedLongKeyMap);
       }
   }
   ```
   
   This is how codes look like after our fix:
   **v2**
   
   ```java
   public void serialize(OutputArchive oa) throws IOException {
       synchronized (this) {
           clonedLongKeyMap = copyACL(longKeyMap);
       }
       writeACL(clonedLongKeyMap);
   }
   ```
   Because `clonedLongKeyMap` is a local, immutable data structure, it would 
not create new correctness issues even we move it out of synchronization 
blocks. Suppose right now we have a thread `t2` coming in when `t1` is doing 
`writeACL`, it is still not gonna changing the in-memory state or on-disk 
records. The only difference this change made is when `writeACL` gets stuck, it 
would get stuck outside the synchronization block so it would not block other 
access to ACLCache.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to