[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-4846:
--------------------------------------
    Labels: pull-request-available  (was: )

> Failure to reload database due to missing ACL
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-4846
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4846
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Damien Diederen
>            Assignee: Damien Diederen
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> ZooKeeper snapshots are _fuzzy_, as the server does not stop processing 
> requests while ACLs and nodes are being streamed to disk.
> ACLs, notably, are streamed _first_, as a mapping between the full 
> serialized ACL and an "ACL ID" referenced by the node.
> Consequently, a snapshot can very well contain ACL IDs which do not exist in 
> the mapping. Prior to ZOOKEEPER-4799, such situations would produce harmless 
> (if annoying) "Ignoring acl XYZ as it does not exist in the cache" INFO 
> entries in the server logs.
> With ZOOKEEPER-4799, we started "eagerly" fetching the referenced ACLs in 
> {{DataTree}} operations such as {{createNode}}, {{deleteNode}}, etc., as 
> opposed to fetching them only from request processors.
> This can result in fatal errors during the {{fastForwardFromEdits}} phase of 
> restoring a database, when transactions are processed on top of an 
> inconsistent data tree, which prevents the server from starting.
> The errors are thrown in this code path:
> {code:java}
> // ReferenceCountedACLCache.java:90
> List<ACL> acls = longKeyMap.get(longVal);
> if (acls == null) {
>     LOG.error("ERROR: ACL not available for long {}", longVal);
>     throw new RuntimeException("Failed to fetch acls for " + longVal);
> }
> {code}
> Here is a scenario leading to such a failure:
>  * An existing node {{/foo}}, sporting a unique ACL, is deleted. This is 
> recorded in transaction log {{$SNAP-1}}; said ACL is also deallocated;
>  * Snapshot {{$SNAP}} is started;
>  * The ACL map is serialized to {{$SNAP}};
>  * A new node {{/foo}} sporting the same unique ACL is created in a portion 
> of the data tree which still has to be serialized;
>  * Node {{/foo}} is serialized to {{$SNAP}}, but its ACL isn't;
>  * The server is restarted;
>  * The {{DataTree}} is initialized from {{$SNAP}}, including node 
> {{/foo}} with a dangling ACL reference;
>  * Transaction log {{$SNAP-1}} is replayed, leading to a 
> {{deleteNode("/foo")}};
>  * {{getACL(node)}} panics, preventing a successful restart.
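
The scenario above can be approximated with a small, self-contained Java simulation. This is only a sketch under simplifying assumptions: {{AclCache}}, {{DanglingAclDemo}} and {{replayFailsOnDanglingAcl}} are hypothetical names invented for illustration, ACLs are modeled as plain strings, and none of this is ZooKeeper's actual {{DataTree}} or {{ReferenceCountedACLCache}} code. It only shows how serializing the ACL map before the nodes can leave a node pointing at an ACL ID the restored map does not contain.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the scenario; NOT ZooKeeper's real classes.
public class DanglingAclDemo {

    // Toy stand-in for a reference-counted cache: ACL ID -> serialized ACL.
    static final class AclCache {
        final Map<Long, List<String>> longKeyMap = new HashMap<>();
        final Map<Long, Integer> refCounts = new HashMap<>();
        long nextId = 1;

        // Returns the ID of an equal ACL if one is cached, else allocates a new ID.
        long add(List<String> acl) {
            for (Map.Entry<Long, List<String>> e : longKeyMap.entrySet()) {
                if (e.getValue().equals(acl)) {
                    refCounts.merge(e.getKey(), 1, Integer::sum);
                    return e.getKey();
                }
            }
            long id = nextId++;
            longKeyMap.put(id, acl);
            refCounts.put(id, 1);
            return id;
        }

        // Drops one reference; the ACL is deallocated when none remain.
        void release(long id) {
            int remaining = refCounts.merge(id, -1, Integer::sum);
            if (remaining <= 0) {
                refCounts.remove(id);
                longKeyMap.remove(id);
            }
        }

        // Eager lookup, mirroring the failing check quoted in the report.
        List<String> convertLong(long id) {
            List<String> acls = longKeyMap.get(id);
            if (acls == null) {
                throw new RuntimeException("Failed to fetch acls for " + id);
            }
            return acls;
        }
    }

    // Walks through the steps of the scenario; returns true if the replay fails.
    static boolean replayFailsOnDanglingAcl() {
        AclCache live = new AclCache();
        Map<String, Long> liveNodes = new HashMap<>();
        List<String> uniqueAcl = List.of("digest:alice:cdrwa"); // made-up ACL

        // /foo exists with a unique ACL, then is deleted (logged in $SNAP-1).
        liveNodes.put("/foo", live.add(uniqueAcl));
        live.release(liveNodes.remove("/foo"));

        // Snapshot starts: the ACL map is serialized first.
        Map<Long, List<String>> snapAcls = new HashMap<>(live.longKeyMap);

        // Meanwhile, /foo is re-created with the same ACL.
        liveNodes.put("/foo", live.add(uniqueAcl));

        // Nodes are serialized afterwards, so /foo lands in the snapshot.
        Map<String, Long> snapNodes = new HashMap<>(liveNodes);

        // Restart: state is rebuilt from the snapshot only.
        AclCache restored = new AclCache();
        restored.longKeyMap.putAll(snapAcls);
        Map<String, Long> restoredNodes = new HashMap<>(snapNodes);

        // Replaying the logged delete eagerly resolves the node's ACL.
        try {
            long aclId = restoredNodes.remove("/foo");
            restored.convertLong(aclId);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("replay failed: " + replayFailsOnDanglingAcl());
    }
}
```

In this toy model the lookup fails because the snapshot's ACL map was captured while the ACL was deallocated, whereas the node was captured after it had been re-created. A lenient lookup (as before ZOOKEEPER-4799) would merely skip the missing entry instead of throwing.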



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
