ztzg opened a new pull request, #2183:
URL: https://github.com/apache/zookeeper/pull/2183

   ZooKeeper snapshots are *fuzzy*, as the server does not stop processing 
requests while ACLs and nodes are being streamed to disk.
   
   ACLs, notably, are streamed *first*, as a mapping between the full 
serialized ACL and an "ACL ID" referenced by the node.
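   
   As a rough illustration of such a mapping, here is a minimal sketch of a 
reference-counted ACL table. `ToyAclCache` and its methods are hypothetical 
names, ACLs are simplified to lists of strings, and this is not the actual 
`ReferenceCountedACLCache` implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an ACL <-> ID mapping with reference counts (hypothetical). */
final class ToyAclCache {
    private final Map<Long, List<String>> idToAcl = new HashMap<>();
    private final Map<List<String>, Long> aclToId = new HashMap<>();
    private final Map<Long, Integer> refCounts = new HashMap<>();
    private long nextId = 0;

    /** Interns the ACL, returning its ID and adding one reference. */
    long acquire(List<String> acl) {
        Long id = aclToId.get(acl);
        if (id == null) {
            id = ++nextId; // assumption: IDs are never reused
            List<String> copy = new ArrayList<>(acl);
            idToAcl.put(id, copy);
            aclToId.put(copy, id);
        }
        refCounts.merge(id, 1, Integer::sum);
        return id;
    }

    /** Drops one reference; the entry is deallocated when none remain. */
    void release(long id) {
        Integer count = refCounts.get(id);
        if (count == null) {
            return; // unknown ID: nothing to release
        }
        if (count <= 1) {
            refCounts.remove(id);
            aclToId.remove(idToAcl.remove(id));
        } else {
            refCounts.put(id, count - 1);
        }
    }

    /** Copy of the mapping, standing in for what a snapshot writes first. */
    Map<Long, List<String>> snapshotMapping() {
        return new HashMap<>(idToAcl);
    }
}
```

   In this sketch, a node stores only the `long` returned by `acquire`, so 
anything written after `snapshotMapping()` was taken can refer to an ID the 
copy does not contain.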
   
   Consequently, a snapshot can contain ACL IDs that do not exist in 
the mapping. Prior to ZOOKEEPER-4799, such situations would produce harmless 
(if annoying) "Ignoring acl XYZ as it does not exist in the cache" INFO entries 
in the server logs.
   
   With ZOOKEEPER-4799, we started "eagerly" fetching the referenced ACLs in 
`DataTree` operations such as `createNode`, `deleteNode`, etc.—as opposed to 
just fetching them from request processors.
   
   This can result in fatal errors during the `fastForwardFromEdits` phase of 
restoring a database, when transactions are processed on top of an inconsistent 
data tree—preventing the server from starting.
   
   The errors are thrown in this code path:
   
   ``` java
   // ReferenceCountedACLCache.java:90
   List<ACL> acls = longKeyMap.get(longVal);
   if (acls == null) {
       LOG.error("ERROR: ACL not available for long {}", longVal);
       throw new RuntimeException("Failed to fetch acls for " + longVal);
   }
   ```
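   
   To contrast the two behaviors described above, here is a small sketch 
under simplifying assumptions. `AclLookupContrast`, `releaseLenient`, and 
`fetchEager` are hypothetical names; only the field name `longKeyMap` and the 
two messages are taken from the text above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy contrast of lenient vs. eager handling of an unknown ACL ID. */
final class AclLookupContrast {
    private final Map<Long, List<String>> longKeyMap = new HashMap<>();

    void put(long id, List<String> acl) {
        longKeyMap.put(id, acl);
    }

    /** Lenient: an unknown ID is reported and skipped. Returns whether it was known. */
    boolean releaseLenient(long id) {
        if (!longKeyMap.containsKey(id)) {
            System.out.println("Ignoring acl " + id + " as it does not exist in the cache");
            return false;
        }
        longKeyMap.remove(id);
        return true;
    }

    /** Eager: an unknown ID aborts the caller, mirroring the quoted code path. */
    List<String> fetchEager(long id) {
        List<String> acls = longKeyMap.get(id);
        if (acls == null) {
            throw new RuntimeException("Failed to fetch acls for " + id);
        }
        return acls;
    }
}
```

   With a dangling ID, `releaseLenient` only prints a message, whereas 
`fetchEager` propagates an exception to whatever operation called it.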
   
   Here is a scenario leading to such a failure:
   
   - An existing node `/foo`, sporting a unique ACL, is deleted. This is 
recorded in transaction log `$SNAP-1`; said ACL is also deallocated;
   - Snapshot `$SNAP` is started;
   - The ACL map is serialized to `$SNAP`;
   - A new node `/foo` sporting the same unique ACL is created in a portion of 
the data tree which still has to be serialized;
   - Node `/foo` is serialized to `$SNAP`—but its ACL isn't;
   - The server is restarted;
   - The `DataTree` is initialized from `$SNAP`, including node `/foo` with a 
dangling ACL reference;
   - Transaction log `$SNAP-1` is replayed, leading to a 
`deleteNode("/foo")`;
   - `getACL(node)` throws the `RuntimeException` shown above, and the server 
fails to start.
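   
   The scenario above can be replayed with a toy model. This is a sketch, not 
ZooKeeper code: `DanglingAclScenario`, its maps, the placeholder ACL string, 
and the concrete IDs `1` and `2` are all assumptions made for illustration, 
including the assumption that the recreated ACL gets a fresh ID:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy replay of the fuzzy-snapshot scenario (hypothetical names throughout). */
final class DanglingAclScenario {

    /** Runs the scenario; returns the failure message hit during replay, or null. */
    static String run() {
        Map<Long, String> liveAcls = new HashMap<>();   // live ACL ID -> ACL
        Map<String, Long> liveNodes = new HashMap<>();  // live path -> ACL ID

        // Initial state: /foo holds the only reference to ACL 1.
        liveAcls.put(1L, "digest:alice:rw");
        liveNodes.put("/foo", 1L);

        // /foo is deleted (recorded in the transaction log); its ACL is deallocated.
        liveNodes.remove("/foo");
        liveAcls.remove(1L);

        // The snapshot starts and writes the ACL mapping first.
        Map<Long, String> snapAcls = new HashMap<>(liveAcls);

        // /foo is recreated with the same ACL, assumed to get a fresh ID.
        liveAcls.put(2L, "digest:alice:rw");
        liveNodes.put("/foo", 2L);

        // Nodes are written afterwards, so the new /foo lands in the snapshot.
        Map<String, Long> snapNodes = new HashMap<>(liveNodes);

        // Restart: the tree is restored from the snapshot, then the logged
        // delete is replayed on top of it and eagerly resolves the node's ACL.
        try {
            deleteNode(snapNodes, snapAcls, "/foo");
            return null;
        } catch (RuntimeException e) {
            return e.getMessage();
        }
    }

    /** Deletes a node after resolving its ACL, failing on a dangling reference. */
    static void deleteNode(Map<String, Long> nodes, Map<Long, String> acls, String path) {
        Long aclId = nodes.get(path);
        String acl = acls.get(aclId);
        if (acl == null) {
            throw new RuntimeException("Failed to fetch acls for " + aclId);
        }
        nodes.remove(path);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "Failed to fetch acls for 2"
    }
}
```

   The restored node map contains `/foo`, but the restored ACL map is empty, 
so the replayed delete fails before it can remove the node.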
   

