MalcolmSanders created FLINK-11809:
--------------------------------------

             Summary: Implement etcd based StateHandleStore
                 Key: FLINK-11809
                 URL: https://issues.apache.org/jira/browse/FLINK-11809
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
            Reporter: MalcolmSanders


Implement StateHandleStore using jetcd.

Previously ZooKeeperStateHandleStore stores data in a dfs file while records 
its metadata to a zookeeper node in order to keep the data size of a zookeeper 
node small. EtcdStateHandleStore should work in the same way while there is a 
corner case that should be carefully dealt with using etcd.

As described in FLINK-6612, the ZooKeeperStateHandleStore does not guard 
against concurrent delete operations which could happen in case of a lost 
leadership and a new leadership grant. The problem is that checkpoint nodes can 
get deleted even after they have been recovered by another 
ZooKeeperCompletedCheckpointStore. This corrupts the recovered checkpoint and 
thwarts future recoveries. In order to guard against deletions of ZooKeeper 
nodes which are still being used by a different ZooKeeperStateHandleStore, a 
locking mechanism has been introduced to make sure that zookeeper nodes are 
allowed to be deleted only after all ZooKeeperStateHandleStores have released 
their lock. The locking mechanism is implemented via ephemeral child nodes of 
the respective ZooKeeper node. Whenever a ZooKeeperStateHandleStore wants to 
lock a ZNode, thus, protecting it from being deleted, it creates an ephemeral 
child node. The node's name is unique to the ZooKeeperStateHandleStore 
instance. The delete operations will then only delete the node if it does not 
have any children associated. In order to guard against orphaned lock nodes, 
they are created as ephemeral nodes. This means that they will be deleted by 
ZooKeeper once the connection of the ZooKeeper client which created the node 
timed out.

The solution leverages that a zookeeper directory cannot be deleted if it still 
has child nodes and a ephemeral node will be deleted by zookeeper server once 
the corresponding client is disconnected. Since etcd doesn’t have hierarchical 
structure like zookeeper, there is no actual relations between a path and its 
so-called parent directory. Suppose we create a persistent key-value to store 
metadata and then create an ephemeral key-value using previous key as the 
prefix. Once the client disconnects with etcd server, the ephemeral key-value 
pair will be deleted from etcd server while there’ll be no effect on its parent 
key-value pair.

My proposal to tackle this case is illustrated in Part 6 Implementation of etcd 
based StateHandleStore in [design 
doc|https://docs.google.com/document/d/12-gIZDuT4IOWG7gmwSqNFsOHuGlkdRHge0ahJ7M311Y/edit#heading=h.sqkj9zjvgicu].
 Any comments or suggestions will be appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to