[
https://issues.apache.org/jira/browse/ZOOKEEPER-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209924#comment-17209924
]
Michael Han commented on ZOOKEEPER-3783:
----------------------------------------
Please refer to the sub-tasks for the progress of this ticket. Use this ticket
as an umbrella to consolidate all tickets / tasks related to the on-disk
storage engine.
> Build ZooKeeper based on RocksDB
> ---------------------------------
>
> Key: ZOOKEEPER-3783
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3783
> Project: ZooKeeper
> Issue Type: New Feature
> Components: server
> Reporter: Fangmin Lv
> Assignee: Michael Han
> Priority: Major
> Attachments: 0001-Use-RocksDB-for-taking-a-snapshot.patch
>
>
> Here is what we have done so far on the ZK on RocksDB project. It is in an
> initial phase and we don't have time to explore it further, but we are posting
> it here in case the community is interested and can help move it forward.
>
> *[Ideas]*
>
> When we were designing and working on this project, we split it into 3 phases:
>
> *1. Moving ZK snapshot onto RocksDB to reduce snapshot disk IO*
>
> It still keeps the in-memory DataTree but uses RocksDB for its snapshot,
> leveraging incremental snapshots in RocksDB to reduce write amplification
> when the snapshot is large but only a small set of hot znodes is being
> changed.
>
> Since ZK has its own txn file, the WAL in RocksDB could be disabled to avoid
> double writing.
>
> *2. Removing in-memory DataTree to get rid of the memory bound*
>
> Read data from RocksDB (maybe with some cache), but keep key structures like
> children in memory, so ZK won't be bound by the memory limit and still has
> good throughput for APIs like getChildren.
>
> *3. Moving txn log into RocksDB with a different column family*
>
> Consider moving the log into RocksDB to get rid of the ZK txn files, so that
> we don't need to manage the txn files ourselves, and backup/restore would
> probably be simpler.
>
> *[Implementation for phase 1]*
>
> The attachment is the initial version for the 1st phase. It may not apply
> cleanly to the latest master code since it was based on a very old commit,
> but you can get some idea from it.
>
> The idea is simple: when changes are made in DataTree, they're recorded and
> applied to RocksDB during processTxn in ZKDatabase.
>
> When loading from disk, it iterates through the RocksDB keys to restore the
> in-memory DataTree.
>
> It also takes slow rollout and rollback into account: based on a flag, it can
> read from a snapshot file and write to RocksDB, or read from RocksDB and
> write to a file, at runtime.
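
For readers who want a concrete picture of the phase 1 flow described in the
quoted issue (record DataTree changes, apply them to the store in processTxn,
rebuild the tree by iterating keys on load), here is a rough sketch. It is not
the code from the attached patch. A TreeMap stands in for RocksDB so the
example runs with the JDK alone, and every class and method name below
(KvStore, Tree, Change, restore, and this simplified processTxn) is
hypothetical, not the ZooKeeper or RocksDB API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IncrementalSnapshotSketch {

    /** Hypothetical stand-in for RocksDB: a sorted key-value map that counts writes. */
    static final class KvStore {
        final TreeMap<String, byte[]> data = new TreeMap<>();
        int writes = 0;

        void put(String key, byte[] value) { data.put(key, value); writes++; }
        void delete(String key) { data.remove(key); writes++; }
    }

    /** One recorded change; a null value means the znode was deleted. */
    static final class Change {
        final String path;
        final byte[] value;

        Change(String path, byte[] value) { this.path = path; this.value = value; }
    }

    /** Greatly simplified in-memory tree: path -> data, plus a list of pending changes. */
    static final class Tree {
        final Map<String, byte[]> nodes = new HashMap<>();
        final List<Change> pending = new ArrayList<>();

        void setData(String path, byte[] value) {
            nodes.put(path, value);
            pending.add(new Change(path, value));
        }

        void delete(String path) {
            nodes.remove(path);
            pending.add(new Change(path, null));
        }
    }

    /** Runs a txn against the tree, then applies only the recorded changes to the store. */
    static void processTxn(Tree tree, KvStore store, Runnable txn) {
        txn.run();
        for (Change c : tree.pending) {
            if (c.value == null) {
                store.delete(c.path);
            } else {
                store.put(c.path, c.value);
            }
        }
        tree.pending.clear();
    }

    /** Rebuilds the in-memory tree by iterating over every key in the store. */
    static Tree restore(KvStore store) {
        Tree tree = new Tree();
        for (Map.Entry<String, byte[]> e : store.data.entrySet()) {
            tree.nodes.put(e.getKey(), e.getValue());
        }
        return tree;
    }

    static byte[] bytes(String s) { return s.getBytes(StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        KvStore store = new KvStore();
        Tree tree = new Tree();

        processTxn(tree, store, () -> tree.setData("/a", bytes("1")));
        processTxn(tree, store, () -> tree.setData("/b", bytes("2")));
        processTxn(tree, store, () -> { tree.setData("/a", bytes("3")); tree.delete("/b"); });

        Tree restored = restore(store);
        System.out.println("store writes: " + store.writes);
        System.out.println("restored znodes: " + restored.nodes.keySet());
    }
}
```

The point of the sketch is that each txn writes only the znodes it touched,
rather than serializing the whole tree, which is where the reduction in
snapshot IO would come from. A real implementation would also have to persist
ACLs, stats, sessions and the last processed zxid, none of which are modeled
here.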
--
This message was sent by Atlassian Jira
(v8.3.4#803005)