[ https://issues.apache.org/jira/browse/ZOOKEEPER-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ZOOKEEPER-4846:
--------------------------------------
    Labels: pull-request-available  (was: )

> Failure to reload database due to missing ACL
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-4846
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4846
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Damien Diederen
>            Assignee: Damien Diederen
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> ZooKeeper snapshots are _fuzzy_, as the server does not stop processing
> requests while ACLs and nodes are being streamed to disk.
>
> ACLs, notably, are streamed _first_, as a mapping between the full
> serialized ACL and an "ACL ID" referenced by the node.
>
> Consequently, a snapshot can very well contain ACL IDs which do not exist in
> the mapping. Prior to ZOOKEEPER-4799, such situations would produce harmless
> (if annoying) "Ignoring acl XYZ as it does not exist in the cache" INFO
> entries in the server logs.
>
> With ZOOKEEPER-4799, we started "eagerly" fetching the referenced ACLs in
> {{DataTree}} operations such as {{createNode}}, {{deleteNode}}, etc., as
> opposed to just fetching them from request processors.
>
> This can result in fatal errors during the {{fastForwardFromEdits}} phase of
> restoring a database, when transactions are processed on top of an
> inconsistent data tree, preventing the server from starting.
>
> The errors are thrown in this code path:
> {code:java}
> // ReferenceCountedACLCache.java:90
> List<ACL> acls = longKeyMap.get(longVal);
> if (acls == null) {
>     LOG.error("ERROR: ACL not available for long {}", longVal);
>     throw new RuntimeException("Failed to fetch acls for " + longVal);
> }
> {code}
>
> Here is a scenario leading to such a failure:
>  * An existing node {{/foo}}, sporting a unique ACL, is deleted. This is
>    recorded in transaction log {{$SNAP-1}}; said ACL is also deallocated;
>  * Snapshot {{$SNAP}} is started;
>  * The ACL map is serialized to {{$SNAP}};
>  * A new node {{/foo}} sporting the same unique ACL is created in a portion
>    of the data tree which still has to be serialized;
>  * Node {{/foo}} is serialized to {{$SNAP}}, but its ACL isn't;
>  * The server is restarted;
>  * The {{DataTree}} is initialized from {{$SNAP}}, including node
>    {{/foo}} with a dangling ACL reference;
>  * Transaction log {{$SNAP-1}} is replayed, leading to a
>    {{deleteNode("/foo")}};
>  * {{getACL(node)}} panics, preventing a successful restart.
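
The scenario above can be reproduced in miniature with a toy model. This is only a sketch under stated assumptions: {{DanglingAclSketch}}, {{AclCache}}, {{Tree}}, the ACL ID {{7}} and the ACL string are all hypothetical and are not ZooKeeper classes or values. Only the strict lookup mirrors the excerpt quoted in the report.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the failure scenario; none of these types are ZooKeeper's. */
final class DanglingAclSketch {

    /** Hypothetical stand-in for the ACL cache: ACL ID -> serialized ACL. */
    static final class AclCache {
        final Map<Long, String> longKeyMap = new HashMap<>();

        /** Strict lookup, modelled on the excerpt quoted in the report. */
        String convertLong(long id) {
            String acl = longKeyMap.get(id);
            if (acl == null) {
                throw new RuntimeException("Failed to fetch acls for " + id);
            }
            return acl;
        }
    }

    /** Hypothetical stand-in for the data tree: path -> ACL ID. */
    static final class Tree {
        final AclCache acls = new AclCache();
        final Map<String, Long> nodes = new HashMap<>();

        void createNode(String path, long aclId, String acl) {
            acls.longKeyMap.put(aclId, acl);
            nodes.put(path, aclId);
        }

        void deleteNode(String path) {
            Long aclId = nodes.get(path);
            if (aclId == null) {
                return; // nothing to delete
            }
            acls.convertLong(aclId);       // eager fetch: throws on a dangling ID
            nodes.remove(path);
            acls.longKeyMap.remove(aclId); // the ACL was unique, so deallocate it
        }
    }

    /** Runs the scenario and returns "ok" or the replay error message. */
    static String runScenario() {
        Tree live = new Tree();
        live.createNode("/foo", 7L, "digest:alice:rw");
        live.deleteNode("/foo"); // logged before the snapshot; ACL 7 deallocated

        Tree restored = new Tree();
        // Fuzzy snapshot, part 1: the ACL map is streamed first (no ACL 7).
        restored.acls.longKeyMap.putAll(live.acls.longKeyMap);
        // Concurrent request: /foo is re-created with the same unique ACL.
        live.createNode("/foo", 7L, "digest:alice:rw");
        // Fuzzy snapshot, part 2: nodes are streamed later, /foo -> 7.
        restored.nodes.putAll(live.nodes);

        try {
            restored.deleteNode("/foo"); // replay of the earlier delete
            return "ok";
        } catch (RuntimeException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

In this model the restored tree holds {{/foo}} with ACL ID 7 while its ACL map does not, so the replayed delete fails in the strict lookup. A lenient lookup that logs and continues, as in the pre-ZOOKEEPER-4799 behavior described above, would let the replay proceed.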