[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209924#comment-17209924
 ] 

Michael Han commented on ZOOKEEPER-3783:
----------------------------------------

Please refer to the sub-tasks for the progress of this ticket. Use this ticket 
as an umbrella to consolidate all on-disk storage engine related tickets and 
tasks.

> Build ZooKeeper based on RocksDB 
> ---------------------------------
>
>                 Key: ZOOKEEPER-3783
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3783
>             Project: ZooKeeper
>          Issue Type: New Feature
>          Components: server
>            Reporter: Fangmin Lv
>            Assignee: Michael Han
>            Priority: Major
>         Attachments: 0001-Use-RocksDB-for-taking-a-snapshot.patch
>
>
> Here is what we have done so far on the ZK on RocksDB project. It is still 
> in an initial phase and we don't have time to explore it further, but we are 
> putting it here in case the community is interested and can help move it 
> forward.
>  
> *[Ideas]*
>  
> When we were designing and working on this project, it was divided into 3 
> phases:
>  
> *1. Moving ZK snapshot onto RocksDB to reduce snapshot disk IO*
>  
> It still keeps the in-memory DataTree but uses RocksDB for its snapshot, to 
> leverage incremental snapshots in RocksDB and reduce write amplification 
> when the snapshot is large but only a small portion of hot znodes is being 
> changed.
>  
> Since ZK has its own txn file, the WAL in RocksDB could be disabled to avoid 
> double writing.
>  
> *2. Removing the in-memory DataTree to get rid of the memory bound*
>  
> Read data from RocksDB (maybe with some cache), but keep key structures like 
> children in memory, so ZK won't be bound by the memory limit and will still 
> have good throughput for APIs like getChildren.
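
A minimal sketch of the phase 2 split, under stated assumptions: a TreeMap stands in for RocksDB (this is not the RocksDB API), and the class and method names are hypothetical. Znode data goes to the store, while the parent-to-children index stays in memory so getChildren never touches the store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: znode data lives in a key-value store, while the
// parent -> children index is kept in memory.
class HybridTreeSketch {
    // Stand-in for the on-disk store; a real version would call RocksDB here.
    private final Map<String, byte[]> store = new TreeMap<>();
    // In-memory structure: node path -> sorted names of its children.
    private final Map<String, TreeSet<String>> children = new HashMap<>();

    HybridTreeSketch() {
        store.put("/", new byte[0]);
        children.put("/", new TreeSet<>());
    }

    void create(String path, byte[] data) {
        if (!path.startsWith("/") || path.length() < 2 || path.endsWith("/")) {
            throw new IllegalArgumentException("bad path: " + path);
        }
        int slash = path.lastIndexOf('/');
        String parent = slash == 0 ? "/" : path.substring(0, slash);
        TreeSet<String> siblings = children.get(parent);
        if (siblings == null) {
            throw new IllegalArgumentException("no parent: " + parent);
        }
        if (!siblings.add(path.substring(slash + 1))) {
            throw new IllegalArgumentException("node exists: " + path);
        }
        children.put(path, new TreeSet<>());
        store.put(path, data.clone());
    }

    // Data reads go to the store (possibly through a cache in a real system).
    byte[] getData(String path) {
        byte[] data = store.get(path);
        if (data == null) {
            throw new IllegalArgumentException("no node: " + path);
        }
        return data.clone();
    }

    // Child listing is answered from memory only.
    List<String> getChildren(String path) {
        TreeSet<String> names = children.get(path);
        if (names == null) {
            throw new IllegalArgumentException("no node: " + path);
        }
        return new ArrayList<>(names);
    }
}
```

With this layout the memory footprint grows with the number of znodes and their names, not with the size of their data.
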
>  
> *3. Moving the txn log into RocksDB with a different column family*
>  
> Consider moving the log into RocksDB to get rid of the ZK txn files, so 
> that we don't need to manage the txn files ourselves, and backup/restore 
> would probably be simpler.
>  
> *[Implementation for phase 1]*
>  
> The attachment is the initial version for the 1st phase. It may not apply 
> to the latest master code since it was based on a very old commit, but you 
> can get some idea from it.
>  
> The idea is simple: when changes are made in DataTree, they are recorded and 
> applied to RocksDB during processTxn in ZKDatabase.
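
The record-then-apply idea could look roughly like the sketch below. It is not taken from the attached patch: the class and method names are hypothetical, and a TreeMap stands in for RocksDB. Changes to the in-memory tree are queued, then applied to the store in one pass when the transaction is processed.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: changes to the in-memory tree are recorded and then
// applied to the store once per transaction.
class ChangeRecorderSketch {
    // Stand-in for the in-memory DataTree (path -> data).
    private final Map<String, byte[]> dataTree = new HashMap<>();
    // Stand-in for RocksDB; not the RocksDB API.
    private final TreeMap<String, byte[]> store = new TreeMap<>();
    // Recorded changes in arrival order; a null value marks a delete.
    private final Map<String, byte[]> pending = new LinkedHashMap<>();

    void setData(String path, byte[] data) {
        dataTree.put(path, data);
        pending.put(path, data);
    }

    void delete(String path) {
        dataTree.remove(path);
        pending.put(path, null);
    }

    // Applies every recorded change to the store and returns how many
    // distinct paths were touched.
    int processTxn() {
        int applied = pending.size();
        for (Map.Entry<String, byte[]> change : pending.entrySet()) {
            if (change.getValue() == null) {
                store.remove(change.getKey());
            } else {
                store.put(change.getKey(), change.getValue());
            }
        }
        pending.clear();
        return applied;
    }

    boolean isStored(String path) {
        return store.containsKey(path);
    }

    int storedCount() {
        return store.size();
    }
}
```

Only the paths touched by a transaction are written, which is where the reduced write amplification compared with a full snapshot would come from. With the real library, the apply step would presumably map onto a write batch with the WAL disabled, but that is an assumption about the patch.
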
>  
> When loading from disk, it will iterate through the RocksDB keys to restore 
> the in memory DataTree.
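
A sketch of the restore step, assuming (this is not confirmed by the description) that keys are full znode paths. A sorted map stands in for the RocksDB iterator, and the names are hypothetical. Because a parent path is a prefix of its children's paths, it sorts before them, so a single pass in key order can link each node to an already restored parent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical sketch: rebuild an in-memory tree by iterating store keys in
// sorted order. Assumes keys are full znode paths and "/" is present.
class RestoreSketch {
    static final class Node {
        final byte[] data;
        final List<String> children = new ArrayList<>();

        Node(byte[] data) {
            this.data = data;
        }
    }

    static Map<String, Node> restore(SortedMap<String, byte[]> store) {
        Map<String, Node> tree = new HashMap<>();
        for (Map.Entry<String, byte[]> entry : store.entrySet()) {
            String path = entry.getKey();
            tree.put(path, new Node(entry.getValue()));
            if (path.equals("/")) {
                continue; // the root has no parent to link to
            }
            int slash = path.lastIndexOf('/');
            String parentPath = slash == 0 ? "/" : path.substring(0, slash);
            // The parent sorts before the child, so it is already restored.
            Node parent = tree.get(parentPath);
            if (parent == null) {
                throw new IllegalStateException("missing parent for " + path);
            }
            parent.children.add(path.substring(slash + 1));
        }
        return tree;
    }
}
```
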
>  
> It also takes gradual rollout and rollback into account, so it can read from 
> a snapshot file and write to RocksDB, or read from RocksDB and write to a 
> file, at runtime based on a flag.
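
One way to picture the flag-driven rollout and rollback is the sketch below. The flag names and the class are hypothetical, since the description does not say how the flag is defined in the patch.

```java
// Hypothetical sketch: two independent switches select where the snapshot is
// read from and where it is written to.
class SnapshotModeSketch {
    enum Backend { FILE, ROCKSDB }

    final Backend readFrom;
    final Backend writeTo;

    SnapshotModeSketch(Backend readFrom, Backend writeTo) {
        this.readFrom = readFrom;
        this.writeTo = writeTo;
    }

    // Flag names are assumptions made for illustration only.
    static SnapshotModeSketch fromFlags(boolean readFromRocksDb, boolean writeToRocksDb) {
        return new SnapshotModeSketch(
            readFromRocksDb ? Backend.ROCKSDB : Backend.FILE,
            writeToRocksDb ? Backend.ROCKSDB : Backend.FILE);
    }

    String describe() {
        if (readFrom == writeTo) {
            return "steady state on " + readFrom;
        }
        return readFrom == Backend.FILE
            ? "rollout: read snapshot file, write RocksDB"
            : "rollback: read RocksDB, write snapshot file";
    }
}
```

Here `fromFlags(false, true)` corresponds to the rollout direction in the description, and `fromFlags(true, false)` to the rollback direction.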



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
