mcfatealan commented on issue #1077: ZOOKEEPER-3531: Synchronization on ACLCache cause cluster to hang whe… URL: https://github.com/apache/zookeeper/pull/1077#issuecomment-528928031 Hi @lvfangmin thanks for pointing out the potential risks here! I totally agree that we should always be careful about potential correctness issue when dealing with changes to synchronization. Meanwhile, in this case (plz correct me if I'm wrong here) I think the proposed fix is rather safe that: 1) it does not break existing fixes in ZOOKEEPER-3306 2) it does not introduce new problems. ### 1) The race condition mentioned in the ZOOKEEPER-3306 exists before the fix mentioned in this thread. We carefully inspect the issue you mentioned in ZOOKEEPER-3306. And the fix introduced in ZOOKEEPER-3306 is not to prevent this race condition to avoid the fuzzy snapshot, but to add the missing ACL during replay. New transactions are always possible between `serializeACL` and `serializeNodes`, and our fix would not create a new race condition here. After the fix we can still pass the test `testACLCreatedDuringFuzzySnapshotSync`. ### 2) Essentially the proposed fix is finer-grained synchronization. The original codes look like: **v1** ```java synchronized public void serialize(OutputArchive oa) throws IOException { writeACL(longKeyMap); } ``` which is equilvalent to: **v1.b=v1** ```java public void serialize(OutputArchive oa) throws IOException { synchronized (this) { clonedLongKeyMap = copyACL(longKeyMap); writeACL(clonedLongKeyMap); } } ``` This is how codes look like after our fix: **v2** ```java public void serialize(OutputArchive oa) throws IOException { synchronized (this) { clonedLongKeyMap = copyACL(longKeyMap); } writeACL(clonedLongKeyMap); } ``` Because `clonedLongKeyMap` is a local, immutable data structure, it would not create new correctness issues even we move it out of synchronization blocks. Suppose right now we have a thread `t2` coming in when `t1` is doing `writeACL`, it is still not gonna changing the in-memory state or on-disk records. The only difference this change made is when `writeACL` gets stuck, it would get stuck outside the synchronization block so it would not block other access to ACLCache.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services