[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13561885#comment-13561885
 ] 

Thawan Kooburat commented on ZOOKEEPER-1608:
--------------------------------------------

Thank you for your interest. As a status update, I got an initial version 
working in our internal branch. However, it will take a while to get this 
production ready. One of the major obstacles is passing all unit tests. If I 
have time, I will try to create a patch against the upstream branch to get 
early feedback.

  
                
> Add support for key-value store as optional storage engine
> ----------------------------------------------------------
>
>                 Key: ZOOKEEPER-1608
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1608
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.4.3
>            Reporter: Thawan Kooburat
>
> Problem:
> 1. ZooKeeper needs to load the entire dataset into memory, so the total 
> data size and number of znodes are limited by the amount of available memory.
> 2. We want to minimize ZooKeeper downtime, but found that it is bound by 
> snapshot loading and writing time. The bigger the database, the longer it 
> takes for the system to recover. In the worst case, if the data size grows 
> too large and initLimit wasn't updated accordingly, the quorum won't form 
> after a failure.
> Implementation: (still work in progress)
> 1. Create a new type of DataTree that supports a key-value store as its 
> backing store. Our current candidate backing store is Oracle's Berkeley DB 
> Java Edition.
> 2. There is no need to use the snapshot facility for this type of DataTree, 
> since doing a sync write of lastProcessedZxid into the backing store is the 
> same as taking a snapshot. However, the system still uses the txnlog as 
> before. The system can be considered as having only a single snapshot. It 
> has to rely on the backing store to detect and recover from data corruption.
> 3. There is no need to do any per-node locking. CommitProcessor 
> (ZOOKEEPER-1505) prevents concurrent reads and writes from reaching the 
> DataTree. The DataTree is also accessed by PrepRequestProcessor (to create 
> ChangeRecord), but I believe that a read and a write to the same znode 
> cannot happen concurrently.
> 4. There are 3 types of data that must be persisted in the backing store: 
> ACLs, znodes and sessions. However, we also store other data to reduce 
> DataTree initialization time or serialization cost, such as the list of a 
> node's children and the list of ephemeral nodes.
> 5. Each ZooKeeper txn may translate into multiple actions on the DataTree. 
> For example, creating a node may result in AddingZNODE, AddingChildren and 
> AddingEphemeralNode. However, as long as these operations are idempotent, 
> there is no need to group them into a transaction, so txns can be replayed 
> on the DataTree without corrupting the data. This also means that the system 
> doesn't need a key-value store that supports transaction semantics. 
> Currently, only operations related to quota break this assumption because 
> they use an increment operation.
> 6. The SNAP protocol is supported so the ensemble can be upgraded online. In 
> the future we may extend the SNAP protocol to send raw data files in order 
> to save CPU cost when sending a large database.
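
Points 2 and 5 of the quoted description can be illustrated with a small 
sketch. This is not the patch's code and not the Berkeley DB Java Edition API: 
a Python dict stands in for the key-value store, and the class name, method 
names and key prefixes ("znode:", "children:", "ephemerals:", 
"quota_count:", "meta:lastProcessedZxid") are all hypothetical. It assumes 
that lastProcessedZxid is written only after every action of a txn has been 
applied, so a crash in the middle of a txn causes that txn to be replayed in 
full from the txnlog.

```python
class KeyValueDataTree:
    """Toy DataTree; a dict stands in for the key-value backing store."""

    def __init__(self):
        self.store = {"meta:lastProcessedZxid": 0}

    # Idempotent actions: a plain put or a set insertion, so applying the
    # same action twice leaves the store in the same state.
    def add_znode(self, path, data):
        self.store["znode:" + path] = data

    def add_child(self, parent, name):
        key = "children:" + parent
        self.store[key] = self.store.get(key, frozenset()) | {name}

    def add_ephemeral(self, session_id, path):
        key = "ephemerals:%d" % session_id
        self.store[key] = self.store.get(key, frozenset()) | {path}

    # Non-idempotent action, standing in for the quota case: every
    # application changes the stored value.
    def increment_quota_count(self, path):
        key = "quota_count:" + path
        self.store[key] = self.store.get(key, 0) + 1

    def apply_create(self, zxid, path, data, session_id=None):
        """Apply one create txn as several independent store actions."""
        parent, _, name = path.rpartition("/")
        self.add_znode(path, data)
        self.add_child(parent or "/", name)
        if session_id is not None:
            self.add_ephemeral(session_id, path)
        # Written last (assumption): marks the txn as fully applied.
        self.store["meta:lastProcessedZxid"] = zxid


def replay(tree, txnlog):
    """Re-apply every logged txn newer than the stored lastProcessedZxid."""
    for zxid, path, data, session_id in txnlog:
        if zxid > tree.store["meta:lastProcessedZxid"]:
            tree.apply_create(zxid, path, data, session_id)


txnlog = [(1, "/app", b"cfg", None), (2, "/app/lock", b"", 42)]

# Reference run with no failure.
clean = KeyValueDataTree()
replay(clean, txnlog)

# Simulated crash: txn 2 is only partially applied and its zxid is never
# recorded, so recovery replays txn 2 from the beginning.
crashed = KeyValueDataTree()
replay(crashed, txnlog[:1])
crashed.add_znode("/app/lock", b"")
crashed.add_child("/app", "lock")
replay(crashed, txnlog)
recovered_matches_clean = crashed.store == clean.store

# Replaying an increment applies it a second time.
counter = KeyValueDataTree()
counter.increment_quota_count("/app")
counter.increment_quota_count("/app")  # same txn applied again on replay
```

In this sketch the recovered store ends up identical to the clean one even 
though the three create actions were never grouped into a transaction, while 
the replayed increment leaves the counter at 2 instead of 1. That is the 
reason the description singles out quota operations as the exception.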

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
