[
https://issues.apache.org/jira/browse/ZOOKEEPER-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andor Molnar reassigned ZOOKEEPER-3531:
---------------------------------------
Assignee: Chang Lou
> Synchronization on ACLCache causes cluster to hang when network/disk issues
> happen during datatree serialization
> ---------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3531
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3531
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.5.2, 3.5.3, 3.5.4, 3.5.5
> Reporter: Chang Lou
> Assignee: Chang Lou
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.6.0
>
> Attachments: fix.patch, generator.py
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> During our ZooKeeper fault injection testing, we observed that sometimes the
> ZK cluster could hang (requests time out, node status shows OK). After
> inspecting the issue, we believe this is caused by I/O (serializing the
> ACLCache) inside a critical section. The bug is essentially similar to the
> one described in ZOOKEEPER-2201.
> org.apache.zookeeper.server.DataTree#serialize calls aclCache.serialize when
> serializing the datatree; however,
> org.apache.zookeeper.server.ReferenceCountedACLCache#serialize could get
> stuck at OutputArchive.writeInt due to potential network/disk issues. This
> can cause the system to experience hanging issues similar to ZOOKEEPER-2201
> (any attempt to create/delete/modify a DataNode will cause the leader to
> hang at the beginning of the request processor chain). The root cause is
> lock contention between:
> * org.apache.zookeeper.server.DataTree#serialize ->
> org.apache.zookeeper.server.ReferenceCountedACLCache#serialize
> * PrepRequestProcessor#getRecordForPath ->
> org.apache.zookeeper.server.DataTree#getACL(org.apache.zookeeper.server.DataNode)
> -> org.apache.zookeeper.server.ReferenceCountedACLCache#convertLong
> When the snapshot gets stuck in ACL serialization, it prevents all other
> operations on the ReferenceCountedACLCache. Since getRecordForPath calls
> ReferenceCountedACLCache#convertLong, any op triggering getRecordForPath will
> cause the leader to hang at the beginning of the request processor chain:
> {code:java}
> org.apache.zookeeper.server.ReferenceCountedACLCache.convertLong(ReferenceCountedACLCache.java:87)
> org.apache.zookeeper.server.DataTree.getACL(DataTree.java:734)
> - locked org.apache.zookeeper.server.DataNode@4a062b7d
> org.apache.zookeeper.server.ZKDatabase.aclForNode(ZKDatabase.java:371)
> org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:170)
> - locked java.util.ArrayDeque@3f7394f7
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:417)
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:757)
> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:145)
> {code}
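> For clarity, a simplified sketch of the contention is below. The method and
> field names follow ReferenceCountedACLCache, but this is an abbreviated
> illustration of the locking shape, not the actual source:
> {code:java}
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> import org.apache.jute.OutputArchive;
> import org.apache.zookeeper.data.ACL;
>
> public class ReferenceCountedACLCache {
>     final Map<Long, List<ACL>> longKeyMap = new HashMap<>();
>
>     // Called from DataTree#serialize while taking a snapshot. The monitor is
>     // held for the entire duration of the OutputArchive writes, so a slow or
>     // stuck disk/network keeps the lock held indefinitely.
>     public synchronized void serialize(OutputArchive oa) throws IOException {
>         oa.writeInt(longKeyMap.size(), "map");
>         // ... then writes every cached ACL list through the same archive ...
>     }
>
>     // Called from PrepRequestProcessor#getRecordForPath via DataTree#getACL.
>     // It blocks behind serialize() above, stalling the request processor chain.
>     public synchronized List<ACL> convertLong(Long longVal) {
>         return longKeyMap.get(longVal);
>     }
> }
> {code}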
> Similar to ZOOKEEPER-2201, the leader can still send out heartbeats, so the
> cluster will not recover until the network/disk issue resolves.
> Steps to reproduce this bug:
> # Start a cluster with 1 leader and n followers.
> # Manually create some ACLs to enlarge the window of dumping ACLs, so the
> snapshot is more likely to hang at serializing the ACLCache when a delay
> happens (we wrote a script to generate such workloads, see attachments; a
> rough client-side sketch is shown after this list).
> # Inject long network/disk write delays and run some benchmarks to trigger
> snapshots.
> # Once stuck, you should observe that new requests to the cluster fail.
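> A rough client-side sketch of the workload in step 2 follows. It is only an
> approximation of what the attached generator.py might do; the connection
> string, node count, and ACL identities are placeholders. Each znode gets a
> distinct ACL, so every create adds an entry to the leader's
> ReferenceCountedACLCache and enlarges the serialization window:
> {code:java}
> import java.util.Collections;
>
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.ACL;
> import org.apache.zookeeper.data.Id;
>
> public class AclWorkloadGenerator {
>     public static void main(String[] args) throws Exception {
>         // Placeholder connection string and session timeout.
>         ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });
>         for (int i = 0; i < 10000; i++) {
>             // A distinct ip-scheme identity per znode yields a new ACL cache entry.
>             ACL acl = new ACL(ZooDefs.Perms.ALL,
>                     new Id("ip", "10.0." + (i / 256) + "." + (i % 256)));
>             zk.create("/acl-node-" + i, new byte[0],
>                     Collections.singletonList(acl), CreateMode.PERSISTENT);
>         }
>         zk.close();
>     }
> }
> {code}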
> Essentially, the core problem is that the OutputArchive write should not be
> kept inside the synchronized block. So a straightforward solution is to move
> the writes out of the sync block: make a copy inside the sync block and
> perform the vulnerable network/disk writes afterwards (see the sketch below).
> The patch for this solution is attached and verified. Another, more
> systematic fix could be to replace the synchronized methods in
> ReferenceCountedACLCache with a ConcurrentHashMap-based implementation.
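> A minimal sketch of this copy-then-write idea is below. It reuses the
> existing longKeyMap field of ReferenceCountedACLCache and illustrates the
> approach rather than the exact attached patch:
> {code:java}
> // Inside ReferenceCountedACLCache; longKeyMap is the existing
> // Map<Long, List<ACL>> field. Note the method is no longer synchronized.
> public void serialize(OutputArchive oa) throws IOException {
>     Map<Long, List<ACL>> clonedLongKeyMap;
>     // Only the cheap in-memory copy happens under the lock, so convertLong
>     // and the other synchronized methods are not blocked by slow I/O.
>     synchronized (this) {
>         clonedLongKeyMap = new HashMap<>(longKeyMap);
>     }
>     // The potentially slow network/disk writes now run outside the monitor.
>     oa.writeInt(clonedLongKeyMap.size(), "map");
>     for (Map.Entry<Long, List<ACL>> entry : clonedLongKeyMap.entrySet()) {
>         oa.writeLong(entry.getKey(), "long");
>         List<ACL> aclList = entry.getValue();
>         oa.startVector(aclList, "acls");
>         for (ACL acl : aclList) {
>             acl.serialize(oa, "acl");
>         }
>         oa.endVector(aclList, "acls");
>     }
> }
> {code}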
> We double-checked that the issue remains in the latest version of the master
> branch (68c21988d55c57e483370d3ee223c22da2d1bbcf).
> Attachments are 1) a patch with the fix and a regression test and 2) a script
> that generates workloads to fill the ACL cache.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)