[jira] [Commented] (ZOOKEEPER-3783) Build ZooKeeper based on RocksDB

Michael Han (Jira) Thu, 02 Apr 2020 12:20:41 -0700


    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074025#comment-17074025
 ]


Michael Han commented on ZOOKEEPER-3783:
----------------------------------------

This is awesome, Fangmin. Thanks for contributing the existing work back to the 
community! 

I have a strong interest in driving this project forward. I believe it will 
drastically increase the scale of data ZooKeeper can handle and can enable 
many interesting use cases across workloads of different scales. That said, 
there are many tough trade-offs we need to make, and lots of details to figure 
out, given how fundamentally these changes will impact ZooKeeper. 

I expect a lot of discussion about the general design and the various 
trade-offs we want to make. I'll write an RFC to capture what I have in mind, 
and we can discuss it with the community and devs and iterate from there.

> Build ZooKeeper based on RocksDB 
> ---------------------------------
>
>                 Key: ZOOKEEPER-3783
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3783
>             Project: ZooKeeper
>          Issue Type: New Feature
>          Components: server
>            Reporter: Fangmin Lv
>            Priority: Major
>         Attachments: 0001-Use-RocksDB-for-taking-a-snapshot.patch
>
>
> Here is what we have done so far on the ZK-on-RocksDB project. It's in an 
> initial phase and we don't have time to explore further, but we're posting 
> it here in case the community is interested and can help move it forward.
>  
> *[Ideas]*
>  
> When we were designing and working on this project, we split it into 3 phases:
>  
> *1. Moving the ZK snapshot onto RocksDB to reduce snapshot disk IO*
>  
> It still keeps the in-memory DataTree but uses RocksDB for its snapshot, 
> leveraging RocksDB's incremental snapshots to reduce write amplification 
> when the snapshot is large but only a small fraction of hot znodes 
> changes.
>  
> Since ZK has its own txn file, the WAL in RocksDB could be disabled to avoid 
> double writing.
>  
> *2. Removing the in-memory DataTree to remove the memory bound*
>  
> Read data from RocksDB (maybe with some cache), but keep key structures like 
> children in memory, so ZK won't be bound by the memory limit and still 
> has good throughput for APIs like getChildren.
>  
> *3. Moving the txn log into RocksDB with a different column family*
>  
> Consider moving the log into RocksDB to get rid of the ZK txn files, so 
> that we don't need to manage the txn files ourselves, and backup/restore 
> would probably be simpler.
>  
> *[Implementation for phase 1]*
>  
> The attachment is the initial version of the 1st phase. It may not apply 
> cleanly to the latest master since it was based on a very old commit, 
> but you can get some idea from it.
>  
> The idea is simple: when changes are made in DataTree, they're recorded and 
> applied to RocksDB during processTxn in ZKDatabase.
>  
> When loading from disk, it iterates over the RocksDB keys to restore 
> the in-memory DataTree.
>  
> It also supports gradual rollout and rollback: based on a flag, it can 
> read from a snapshot file and write to RocksDB, or read from RocksDB and 
> write to a file, at runtime.
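
To make the phase 1 idea in the quoted description concrete, here is a minimal 
sketch of "record changes, apply them at the end of the txn, restore by 
iterating keys". It is not the code from the attached patch. A TreeMap stands 
in for RocksDB's sorted key space, and every class and method name below 
(IncrementalSnapshotSketch, flushPending, restore) is hypothetical. A real 
version would presumably batch the writes through RocksDB with its WAL 
disabled, since the ZK txn log already provides durability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: a TreeMap stands in for the RocksDB key space.
final class IncrementalSnapshotSketch {
    // One recorded change; null data means the znode was deleted.
    static final class Change {
        final String path;
        final byte[] data;
        Change(String path, byte[] data) {
            this.path = path;
            this.data = data;
        }
    }

    private final Map<String, byte[]> dataTree = new TreeMap<>(); // in-memory tree
    private final TreeMap<String, byte[]> store;                  // RocksDB stand-in
    private final List<Change> pending = new ArrayList<>();       // changes in this txn

    IncrementalSnapshotSketch(TreeMap<String, byte[]> store) {
        this.store = store;
    }

    // Mutations update the in-memory tree and record what changed.
    void setData(String path, byte[] data) {
        dataTree.put(path, data);
        pending.add(new Change(path, data));
    }

    void delete(String path) {
        dataTree.remove(path);
        pending.add(new Change(path, null));
    }

    // Assumed to be called once per processed txn: only the changed keys are
    // written, which is what keeps the snapshot incremental.
    int flushPending() {
        int applied = pending.size();
        for (Change c : pending) {
            if (c.data == null) {
                store.remove(c.path);
            } else {
                store.put(c.path, c.data);
            }
        }
        pending.clear();
        return applied;
    }

    // Loading from disk: iterate over the stored keys to rebuild the tree.
    static IncrementalSnapshotSketch restore(TreeMap<String, byte[]> store) {
        IncrementalSnapshotSketch zk = new IncrementalSnapshotSketch(store);
        for (Map.Entry<String, byte[]> e : store.entrySet()) {
            zk.dataTree.put(e.getKey(), e.getValue());
        }
        return zk;
    }

    byte[] getData(String path) {
        return dataTree.get(path);
    }

    int size() {
        return dataTree.size();
    }
}
```

The point of the sketch is that the cost of a flush is proportional to the 
number of changed znodes, not to the size of the whole tree, which is the 
write-amplification benefit the description refers to.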



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
