http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown b/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown new file mode 100644 index 0000000..12009cc --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown @@ -0,0 +1,88 @@ +--- +title: Reducing Lock Contention in RocksDB +layout: post +author: sdong +category: blog +redirect_from: + - /blog/521/lock/ +--- + +In this post, we briefly introduce the recent improvements we did to RocksDB to improve the issue of lock contention costs. + +RocksDB has a simple thread synchronization mechanism (See [RocksDB Architecture Guide](https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide)  to understand terms used below, like SST tables or mem tables). SST tables are immutable after being written and mem tables are lock-free data structures supporting single writer and multiple readers. There is only one single major lock, the DB mutex (DBImpl.mutex_) protecting all the meta operations, including: + +<!--truncate--> + + * Increase or decrease reference counters of mem tables and SST tables + + + * Change and check meta data structures, before and after finishing compactions, flushes and new mem table creations + + + * Coordinating writers + + +This DB mutex used to be scalability bottleneck preventing us from scaling to more than 16 threads. To address the issue, we improved RocksDB in several ways. + +1. Consolidate reference counters and introduce "super version". For every read operation, mutex was acquired, and reference counters for each mem table and each SST table were increased. One such operation is not expensive but if you are building a high throughput server with lots of reads, the lock contention will become the bottleneck. This is especially true if you store all your data in RAM. + +To solve this problem, we created a meta-meta data structure called â[super version](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c)â, which holds reference counters to all those mem table and SST tables, so that readers only need to increase the reference counters for this single data structure. In RocksDB, list of live mem tables and SST tables only changes infrequently, which would happen when new mem tables are created or flush/compaction happens. Now, at those times, a new super version is created with their reference counters increased. A super version lists live mem tables and SST tables so a reader only needs acquire the lock in order to find the latest super version and increase its reference counter. From the super version, the reader can find all the mem and SST tables which are safety accessible as long as the reader holds the reference count for the super version. + +2. We replace some reference counters to stc::atomic objects, so that decreasing reference count of an object usually doesnât need to be inside the mutex any more. + +3. Make fetching super version and reference counting lock-free in read queries. After consolidating reference counting to one single super version and removing the locking for decreasing reference counts, in read case, we only acquire mutex for one thing: fetch the latest super version and increase the reference count for that (dereference the counter is done in an atomic decrease). We designed and implemented a (mostly) lock-free approach to do it. See [details](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). We will write a separate blog post for that. + +4. Avoid disk I/O inside the mutex. As we know, each disk I/O to hard drives takes several milliseconds. It can be even longer if file system journal is involved or I/Os are queued. Even occasional disk I/O within mutex can cause huge performance outliers. +We identified in two situations, we might do disk I/O inside mutex and we removed them: +(1) Opening and closing transactional log files. We moved those operations out of the mutex. +(2) Information logging. In multiple places we write to logs within mutex. There is a chance that file write will wait for disk I/O to finish before finishing, even if fsync() is not issued, especially in EXT systems. We occasionally see 100+ milliseconds write() latency on EXT. Instead of removing those logging, we came up with a solution of delay logging. When inside mutex, instead of directly writing to the log file, we write to a log buffer, with the timing information. As soon as mutex is released, we flush the log buffer to log files. + +5. Reduce object creation inside the mutex. +Object creation can be slow because it involves malloc (in our case). Malloc sometimes is slow because it needs to lock some shared data structures. Allocating can also be slow because we sometimes do expensive operations in some of our classes' constructors. For these reasons, we try to reduce object creations inside the mutex. Here are two examples: + +(1) std::vector uses malloc inside. We introduced â[autovector](https://reviews.facebook.net/rROCKSDBc01676e46d3be08c3c140361ef1f5884f47d3b3c)â data structure, in which memory for first a few elements are pre-allocated as members of the autovector class. When an autovector is used as a stack variable, no malloc will be needed unless the pre-allocated buffer is used up. This autovector is quite useful for manipulating those meta data structures. Those meta operations are often locked inside DB mutex. + +(2) When building an iterator, we used to creating iterator of every live men table and SST table within the mutex and a merging iterator on top of them. Besides malloc, some of those iterators can be quite expensive to create, like sorting. Now, instead of doing that, we simply increase the reference counters of them, and release the mutex before creating any iterator. + +6. Deal with mutexes in LRU caches. +When I said there was only one single major lock, I was lying. In RocksDB, all LRU caches had exclusive mutexes within to protect writes to the LRU lists, which are done in both of read and write operations. LRU caches are used in block cache and table cache. Both of them are accessed more frequently than DB data structures. Lock contention of these two locks are as intense as the DB mutex. Even if LRU cache is sharded into ShardedLRUCache, we can still see lock contentions, especially table caches. We further address this issue in two way: +(1) Bypassing table caches. A table cache maintains list of SST tableâs read handlers. Those handlers contain SST filesâ descriptors, table metadata, and possibly data indexes, as well as bloom filters. When the table handler needs to be evicted based on LRU, those information is cleared. When the SST table needs to be read and its table handler is not in LRU cache, the table is opened and those metadata is loaded. In some cases, users want to tune the system in a way that table handler evictions should never happen. It is common for high-throughput, low-latency servers. We introduce a mode where table cache is bypassed in read queries. In this mode, all table handlers are cached and accessed directly, so there is no need to query and adjust table caches for reading the database. It is the usersâ responsibility to reserve enough resource for it. This mode can be turned on by setting options.max_open_files=-1. + +(2) [New PlainTable format](//github.com/facebook/rocksdb/wiki/PlainTable-Format) (optimized for SST in ramfs/tmpfs) does not organize data by blocks. Data are located by memory addresses so no block cache is needed. + +With all of those improvements, lock contention is not a bottleneck anymore, which is shown in our [memory-only benchmark](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) . Furthermore, lock contentions are not causing some huge (50 milliseconds+) latency outliers they used to cause. + +### Comments + +**[Lee Hounshell]([email protected])** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with âdb/LOCK: Resource temporarily unavailableâ So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. + +**[Siying Dong]([email protected])** + +Sorry for the delay. We donât have feature support for this scenario yet. Here is an example you can work around this problem. You can build a snapshot of the DB by doing this: + +1. create a separate directory on the same host for a snapshot of the DB. +1. call `DB::DisableFileDeletions()` +1. call `DB::GetLiveFiles()` to get a full list of the files. +1. for all the files except manifest, add a hardlink file in your new directory pointing to the original file +1. copy the manifest file and truncate the size (you can read the comments of `DB::GetLiveFiles()` for more information) +1. call `DB::EnableFileDeletions()` +1. now you can open the snapshot directory in another process to access those files. Please remember to delete the directory after reading the data to allow those files to be recycled. + +By the way, the best way to ask those questions is in our [facebook group](https://www.facebook.com/groups/rocksdb.dev/). Let us know if you need any further help. + +**[Darshan]([email protected])** + +Will this consistency problem of RocksDB all occurs in case of single put/write? +What all ACID properties is supported by RocksDB, only durability irrespective of single or batch write? + +**[Siying Dong]([email protected])** + +We recently [introduced optimistic transaction](https://reviews.facebook.net/D33435) which can help you ensure all of ACID. + +This blog post is mainly about optimizations in implementation. The RocksDB consistency semantic is not changed.
http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown b/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown new file mode 100644 index 0000000..61c90dc --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown @@ -0,0 +1,24 @@ +--- +title: RocksDB 3.0 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/557/rocksdb-3-0-release/ +--- + +Check out new RocksDB release on [Github](https://github.com/facebook/rocksdb/releases/tag/3.0.fb)! + +New features in RocksDB 3.0: + + * [Column Family support](https://github.com/facebook/rocksdb/wiki/Column-Families) + + + * [Ability to chose different checksum function](https://github.com/facebook/rocksdb/commit/0afc8bc29a5800e3212388c327c750d32e31f3d6) + + + * Deprecated ReadOptions::prefix_seek and ReadOptions::prefix + +<!--truncate--> + +Check out the full [change log](https://github.com/facebook/rocksdb/blob/3.0.fb/HISTORY.md). http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown b/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown new file mode 100644 index 0000000..3015674 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown @@ -0,0 +1,20 @@ +--- +title: RocksDB 3.1 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/575/rocksdb-3-1-release/ +--- + +Check out the new release on [Github](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.1)! + +New features in RocksDB 3.1: + + * [Materialized hash index](https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7) + + + * [FIFO compaction style](https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style) + + +We released 3.1 so fast after 3.0 because one of our internal customers needed materialized hash index. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown b/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown new file mode 100644 index 0000000..6a641f2 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown @@ -0,0 +1,47 @@ +--- +title: PlainTable â A New File Format +layout: post +author: sdong +category: blog +redirect_from: + - /blog/599/plaintable-a-new-file-format/ +--- + +In this post, we are introducing "PlainTable" -- a file format we designed for RocksDB, initially to satisfy a production use case at Facebook. + +Design goals: + +1. All data stored in memory, in files stored in tmpfs/ramfs. Support DBs larger than 100GB (may be sharded across multiple RocksDB instance). +1. Optimize for [prefix hashing](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf) +1. Less than or around 1 micro-second average latency for single Get() or Seek(). +1. Minimize memory consumption. +1. Queries efficiently return empty results + +<!--truncate--> + +Notice that our priority was not to maximize query performance, but to strike a balance between query performance and memory consumption. PlainTable query performance is not as good as you would see with a nicely-designed hash table, but they are of the same order of magnitude, while keeping memory overhead to a minimum. + +Since we are targeting micro-second latency, it is on the level of the number of CPU cache misses (if they cannot be parallellized, which are usually the case for index look-ups). On our target hardware with Intel CPUs of multiple sockets with NUMA, we can only allow 4-5 CPU cache misses (including costs of data TLB). + +To meet our requirements, given that only hash prefix iterating is needed, we made two decisions: + +1. to use a hash index, which is +1. directly addressed to rows, with no block structure. + +Having addressed our latency goal, the next task was to design a very compact hash index to minimize memory consumption. Some tricks we used to meet this goal: + +1. We only use 32-bit integers for data and index offsets.The first bit serves as a flag, so we can avoid using 8-byte pointers. +1. We never copy keys or parts of keys to index search structures. We store only offsets from which keys can be retrieved, to make comparisons with search keys. +1. Since our file is immutable, we can accurately estimate the number of hash buckets needed. + +To make sure the format works efficiently with empty queries, we added a bloom filter check before the query. This adds only one cache miss for non-empty cases [1], but avoids multiple cache misses for most empty results queries. This is a good trade-off for use cases with a large percentage of empty results. + +These are the design goals and basic ideas of PlainTable file format. For detailed information, see [this wiki page](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). + +[1] Bloom filter checks typically require multiple memory access. However, because they are independent, they usually do not make the CPU pipeline stale. In any case, we improved the bloom filter to improve data locality - we may cover this further in a future blog post. + +### Comments + +**[Siying Dong]([email protected])** + +Does [http://rocksdb.org/feed/](http://rocksdb.org/feed/) work? http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown b/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown new file mode 100644 index 0000000..4411c7a --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown @@ -0,0 +1,89 @@ +--- +title: Avoid Expensive Locks in Get() +layout: post +author: leijin +category: blog +redirect_from: + - /blog/677/avoid-expensive-locks-in-get/ +--- + +As promised in the previous [blog post](blog/2014/05/14/lock.html)! + +RocksDB employs a multiversion concurrency control strategy. Before reading data, it needs to grab the current version, which is encapsulated in a data structure called [SuperVersion](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c). + +<!--truncate--> + +At the beginning of `GetImpl()`, it used to do this: + + + <span class="zw-portion">mutex_.Lock(); + </span>auto* s = super_version_->Ref(); + mutex_.Unlock(); + + +The lock is necessary because pointer super_version_ may be updated, the corresponding SuperVersion may be deleted while Ref() is in progress. + + +`Ref()` simply increases the reference counter and returns âthisâ pointer. However, this simple operation posed big challenges for in-memory workload and stopped RocksDB from scaling read throughput beyond 8 cores. Running 32 read threads on a 32-core CPU leads to [70% system CPU usage](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). This is outrageous! + + + + +Luckily, we found a way to circumvent this problem by using [thread local storage](http://en.wikipedia.org/wiki/Thread-local_storage). Version change is a rare event comparable to millions of read requests. On the very first Get() request, each thread pays the mutex cost to acquire a reference to the new super version. Instead of releasing the reference after use, the reference is cached in threadâs local storage. An atomic variable is used to track global super version number. Subsequent reads simply compare the local super version number against the global super version number. If they are the same, the cached super version reference may be used directly, at no cost. If a version change is detected, mutex must be acquired to update the reference. The cost of mutex lock is amortized among millions of reads and becomes negligible. + + + + +The code looks something like this: + + + + + + SuperVersion* s = thread_local_->Get(); + if (s->version_number != super_version_number_.load()) { + // slow path, cleanup of current super version is omitted + mutex_.Lock(); + s = super_version_->Ref(); + mutex_.Unlock(); + } + + + + +The result is quite amazing. RocksDB can nicely [scale to 32 cores](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) and most CPU time is spent in user land. + + + + +Daryl Grove gives a pretty good [comparison between mutex and atomic](https://blogs.oracle.com/d/entry/the_cost_of_mutexes). However, the real cost difference lies beyond what is shown in the assembly code. Mutex can keep threads spinning on CPU or even trigger thread context switches in which all readers compete to access the critical area. Our approach prevents mutual competition by directing threads to check against a global version which does not change at high frequency, and is therefore much more cache-friendly. + + + + +The new approach entails one issue: a thread can visit GetImpl() once but can never come back again. SuperVersion is referenced and cached in its thread local storage. All resources (e.g., memtables, files) which belong to that version are frozen. A âsupervisorâ is required to visit each threadâs local storage and free its resources without incurring a lock. We designed a lockless sweep using CAS (compare and switch instruction). Here is how it works: + + + + +(1) A reader thread uses CAS to acquire SuperVersion from its local storage and to put in a special flag (SuperVersion::kSVInUse). + + + + +(2) Upon completion of GetImpl(), the reader thread tries to return SuperVersion to local storage by CAS, expecting the special flag (SuperVersion::kSVInUse) in its local storage. If it does not see SuperVersion::kSVInUse, that means a âsweepâ was done and the reader thread is responsible for cleanup (this is expensive, but does not happen often on the hot path). + + + + +(3) After any flush/compaction, the background thread performs a sweep (CAS) across all threadsâ local storage and frees encountered SuperVersion. A reader thread must re-acquire a new SuperVersion reference on its next visit. + +### Comments + +**[David Barbour]([email protected])** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with âdb/LOCK: Resource temporarily unavailableâ So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown b/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown new file mode 100644 index 0000000..e4eba6a --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown @@ -0,0 +1,30 @@ +--- +title: RocksDB 3.2 release +layout: post +author: leijin +category: blog +redirect_from: + - /blog/647/rocksdb-3-2-release/ +--- + +Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.2)! + +New Features in RocksDB 3.2: + + * PlainTable now supports a new key encoding: for keys of the same prefix, the prefix is only written once. It can be enabled through encoding_type paramter of NewPlainTableFactory() + + + * Add AdaptiveTableFactory, which is used to convert from a DB of PlainTable to BlockBasedTabe, or vise versa. It can be created using NewAdaptiveTableFactory() + +<!--truncate--> + +Public API changes: + + + * We removed seek compaction as a concept from RocksDB + + + * Add two paramters to NewHashLinkListRepFactory() for logging on too many entries in a hash bucket when flushing + + + * Added new option BlockBasedTableOptions::hash_index_allow_collision. When enabled, prefix hash index for block-based table will not store prefix and allow hash collision, reducing memory consumption http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown b/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown new file mode 100644 index 0000000..d858e4f --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown @@ -0,0 +1,34 @@ +--- +title: RocksDB 3.3 Release +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/1301/rocksdb-3-3-release/ +--- + +Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.3)! + +New Features in RocksDB 3.3: + + * **JSON API prototype**. + + + * **Performance improvement on HashLinkList**: We addressed performance outlier of HashLinkList caused by skewed bucket by switching data in the bucket from linked list to skip list. Add parameter threshold_use_skiplist in NewHashLinkListRepFactory(). + +<!--truncate--> + + * **More effective on storage space reclaim**: RocksDB is now able to reclaim storage space more effectively during the compaction process. This is done by compensating the size of each deletion entry by the 2X average value size, which makes compaction to be triggerred by deletion entries more easily. + + + * **TimeOut API to write**: Now WriteOptions have a variable called timeout_hint_us. With timeout_hint_us set to non-zero, any write associated with this timeout_hint_us may be aborted when it runs longer than the specified timeout_hint_us, and it is guaranteed that any write completes earlier than the specified time-out will not be aborted due to the time-out condition. + + + * **rate_limiter option**: We added an option that controls total throughput of flush and compaction. The throughput is specified in bytes/sec. Flush always has precedence over compaction when available bandwidth is constrained. + + + +Public API changes: + + + * Removed NewTotalOrderPlainTableFactory because it is not used and implemented semantically incorrect. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown b/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown new file mode 100644 index 0000000..22178f7 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown @@ -0,0 +1,74 @@ +--- +title: Cuckoo Hashing Table Format +layout: post +author: radheshyam +category: blog +redirect_from: + - /blog/1427/new-bloom-filter-format/ +--- + +## Introduction + +We recently introduced a new [Cuckoo Hashing](http://en.wikipedia.org/wiki/Cuckoo_hashing) based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4Mqps) in read only mode but do not use operations like range scan, merge operator, etc. But, the existing RocksDB file formats were built to support range scan and other operations and the current best point lookup in RocksDB is 1.2 Mqps given by [PlainTable](https://github.com/facebook/rocksdb/wiki/PlainTable-Format)[ format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). This prompted a hashing based file format, which we present here. The new table format uses a cache friendly version of Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup. + +<!--truncate--> + +Goals: + + * Reduce memory accesses per lookup to 1 or 2 + + + * Get an end to end point lookup rate of at least 4 Mqps + + + * Minimize database size + + +Assumptions: + + * Key length and value length are fixed + + + * The database is operated in read only mode + + +Non-goals: + + + * While optimizing the performance of Get() operation was our primary goal, compaction and build times were secondary. We may work on improving them in future. + + +Details for setting up the table format can be found in [GitHub](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format). + + +## Cuckoo Hashing Algorithm + +In order to achieve high lookup speeds, we did multiple optimizations, including a cache friendly cuckoo hash algorithm. Cuckoo Hashing uses multiple hash functions, _h1, ..., __hn._ + +### Original Cuckoo Hashing + +To insert any new key _k_, we compute hashes of the key _h1(k), ..., __hn__(k)_. We insert the key in the first hash location that is free. If all the locations are blocked, we try to move one of the colliding keys to a different location by trying to re-insert it. + +Finding smallest set of keys to displace in order to accommodate the new key is naturally a shortest path problem in a directed graph where nodes are buckets of hash table and there is an edge from bucket _A_ to bucket _B_ if the element stored in bucket _A_ can be accommodated in bucket _B_ using one of the hash functions. The source nodes are the possible hash locations for the given key _k_ and destination is any one of the empty buckets. We use this algorithm to handle collision. + +To retrieve a key _k_, we compute hashes, _h1(k), ..., __hn__(k)_ and the key must be present in one of these locations. + +Our goal is to minimize average (and maximum) number of hash functions required and hence the number of memory accesses. In our experiments, with a hash utilization of 90%, we found that the average number of lookups is 1.8 and maximum is 3. Around 44% of keys are accommodated in first hash location and 33% in second location. + + +### Cache Friendly Cuckoo Hashing + +We noticed the following two sub-optimal properties in original Cuckoo implementation: + + + * If the key is not present in first hash location, we jump to second hash location which may not be in cache. This results in many cache misses. + + + * Because only 44% of keys are located in first cuckoo block, we couldn't have an optimal prefetching strategy - prefetching all hash locations for a key is wasteful. But prefetching only the first hash location helps only 44% of cases. + + + +The solution is to insert more keys near first location. In case of collision in the first hash location - _h1(k)_, we try to insert it in next few buckets, _h1(k)+1, _h1(k)+2, _..., h1(k)+t-1_. If all of these _t_ locations are occupied, we skip over to next hash function _h2_ and repeat the process. We call the set of _t_ buckets as a _Cuckoo Block_. We chose _t_ such that size of a block is not bigger than a cache line and we prefetch the first cuckoo block. + + +With the new algorithm, for 90% hash utilization, we found that 85% of keys are accommodated in first Cuckoo Block. Prefetching the first cuckoo block yields best results. For a database of 100 million keys with key length 8 and value length 4, the hash algorithm alone can achieve 9.6 Mqps and we are working on improving it further. End to end RocksDB performance results can be found [here](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format). http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown b/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown new file mode 100644 index 0000000..96fa50a --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown @@ -0,0 +1,52 @@ +--- +title: New Bloom Filter Format +layout: post +author: zagfox +category: blog +redirect_from: + - /blog/1367/cuckoo/ +--- + +## Introduction + +In this post, we are introducing "full filter block" --- a new bloom filter format for [block based table](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format). This could bring about 40% of improvement for key query under in-memory (all data stored in memory, files stored in tmpfs/ramfs, an [example](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) workload. The main idea behind is to generate a big filter that covers all the keys in SST file to avoid lots of unnecessary memory look ups. + + +<!--truncate--> + +## What is Bloom Filter + +In brief, [bloom filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter) is a bits array generated for a set of keys that could tell if an arbitrary key may exist in that set. + +In RocksDB, we generate such a bloom filter for each SST file. When we conduct a query for a key, we first goes to the bloom filter block of SST file. If key may exist in filter, we goes into data block in SST file to search for the key. If not, we would return directly. So it could help speed up point look up operation a lot. + +## Original Bloom Filter Format + +Original bloom filter creates filters for each individual data block in SST file. It has complex structure (ref [here](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block)) which results in a lot of non-adjacent memory look ups. + +Here's the work flow for checking original bloom filter in block based table: + +1. Given the target key, we goes to the index block to get the "data block ID" where this key may reside. +1. Using the "data block ID", we goes to the filter block and get the correct "offset of filter". +1. Using the "offset of filter", we goes to the actual filter and do the checking. + +## New Bloom Filter Format + +New bloom filter creates filter for all keys in SST file and we name it "full filter". The data structure of full filter is very simple, there is just one big filter: + +  [ full filter ] + +In this way, the work flow of bloom filter checking is much simplified. + +(1) Given the target key, we goes directly to the filter block and conduct the filter checking. + +To be specific, there would be no checking for index block and no address jumping inside of filter block. + +Though it is a big filter, the total filter size would be the same as the original filter. + +One little draw back is that the new bloom filter introduces more memory consumption when building SST file because we need to buffer keys (or their hashes) before generating filter. Original filter just creates a bunch of small filters so it just buffer a small amount of keys. For full filter, we buffer hashes of all keys, which would take more memory when SST file size increases. + + +## Usage & Customization + +You can refer to the document here for [usage](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#usage-of-new-bloom-filter) and [customization](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#customize-your-own-filterpolicy). http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown b/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown new file mode 100644 index 0000000..1878a5a --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown @@ -0,0 +1,38 @@ +--- +title: RocksDB 3.5 Release! +layout: post +author: leijin +category: blog +redirect_from: + - /blog/1547/rocksdb-3-5-release/ +--- + +New RocksDB release - 3.5! + + +**New Features** + + + 1. Add include/utilities/write_batch_with_index.h, providing a utility class to query data out of WriteBatch when building it. + + + 2. new ReadOptions.total_order_seek to force total order seek when block-based table is built with hash index. + +<!--truncate--> + +**Public API changes** + + + 1. The Prefix Extractor used with V2 compaction filters is now passed user key to SliceTransform::Transform instead of unparsed RocksDB key. + + + 2. Move BlockBasedTable related options to BlockBasedTableOptions from Options. Change corresponding JNI interface. Options affected include: no_block_cache, block_cache, block_cache_compressed, block_size, block_size_deviation, block_restart_interval, filter_policy, whole_key_filtering. filter_policy is changed to shared_ptr from a raw pointer. + + + 3. Remove deprecated options: disable_seek_compaction and db_stats_log_interval + + + 4. OptimizeForPointLookup() takes one parameter for block cache size. It now builds hash index, bloom filter, and block cache. + + +[https://github.com/facebook/rocksdb/releases/tag/v3.5](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.5) http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown b/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown new file mode 100644 index 0000000..f18de0b --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown @@ -0,0 +1,112 @@ +--- +title: Migrating from LevelDB to RocksDB +layout: post +author: lgalanis +category: blog +redirect_from: + - /blog/1811/migrating-from-leveldb-to-rocksdb-2/ +--- + +If you have an existing application that uses LevelDB and would like to migrate to using RocksDB, one problem you need to overcome is to map the options for LevelDB to proper options for RocksDB. As of release 3.9 this can be automatically done by using our option conversion utility found in rocksdb/utilities/leveldb_options.h. What is needed, is to first replace `leveldb::Options` with `rocksdb::LevelDBOptions`. Then, use `rocksdb::ConvertOptions( )` to convert the `LevelDBOptions` struct into appropriate RocksDB options. Here is an example: + +<!--truncate--> + +LevelDB code: + +```c++ +#include <string> +#include "leveldb/db.h" + +using namespace leveldb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + opt.block_size = 4096; + + Status s = DB::Open(opt, "/tmp/mydb", &db); + + delete db; +} +``` + +RocksDB code: + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/utilities/leveldb_options.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + LevelDBOptions opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + opt.block_size = 4096; + + Options rocksdb_options = ConvertOptions(opt); + // add rocksdb specific options here + + Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); + + delete db; +} +``` + +The difference is: + +```diff +-#include "leveldb/db.h" ++#include "rocksdb/db.h" ++#include "rocksdb/utilities/leveldb_options.h" + +-using namespace leveldb; ++using namespace rocksdb; + +- Options opt; ++ LevelDBOptions opt; + +- Status s = DB::Open(opt, "/tmp/mydb", &db); ++ Options rocksdb_options = ConvertOptions(opt); ++ // add rockdb specific options here ++ ++ Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); +``` + +Once you get up and running with RocksDB you can then focus on tuning RocksDB further by modifying the converted options struct. + +The reason why ConvertOptions is handy is because a lot of individual options in RocksDB have moved to other structures in different components. For example, block_size is not available in struct rocksdb::Options. It resides in struct rocksdb::BlockBasedTableOptions, which is used to create a TableFactory object that RocksDB uses internally to create the proper TableBuilder objects. If you were to write your application from scratch it would look like this: + +RocksDB code from scratch: + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/table.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + + BlockBasedTableOptions topt; + topt.block_size = 4096; + opt.table_factory.reset(NewBlockBasedTableFactory(topt)); + + Status s = DB::Open(opt, "/tmp/mydb_rocks", &db); + + delete db; +} +``` + +The LevelDBOptions utility can ease migration to RocksDB from LevelDB and allows us to break down the various options across classes as it is needed. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown b/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown new file mode 100644 index 0000000..cddc0dd --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown @@ -0,0 +1,41 @@ +--- +title: Reading RocksDB options from a file +layout: post +author: lgalanis +category: blog +redirect_from: + - /blog/1883/reading-rocksdb-options-from-a-file/ +--- + +RocksDB options can be provided using a file or any string to RocksDB. The format is straightforward: `write_buffer_size=1024;max_write_buffer_number=2`. Any whitespace around `=` and `;` is OK. Moreover, options can be nested as necessary. For example `BlockBasedTableOptions` can be nested as follows: `write_buffer_size=1024; max_write_buffer_number=2; block_based_table_factory={block_size=4k};`. Similarly any white space around `{` or `}` is ok. Here is what it looks like in code: + +<!--truncate--> + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/table.h" +#include "rocksdb/utilities/convenience.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + + std::string options_string = + "create_if_missing=true;max_open_files=1000;" + "block_based_table_factory={block_size=4096}"; + + Status s = GetDBOptionsFromString(opt, options_string, &opt); + + s = DB::Open(opt, "/tmp/mydb_rocks", &db); + + // use db + + delete db; +} +``` + +Using `GetDBOptionsFromString` is a convenient way of changing options for your RocksDB application without needing to resort to recompilation or tedious command line parsing. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown b/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown new file mode 100644 index 0000000..7f9f776 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown @@ -0,0 +1,20 @@ +--- +title: 'WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes' +layout: post +author: sdong +category: blog +redirect_from: + - /blog/1901/write-batch-with-index/ +--- + +RocksDB can be used as a storage engine of a higher level database. In fact, we are currently plugging RocksDB into MySQL and MongoDB as one of their storage engines. RocksDB can help with guaranteeing some of the ACID properties: durability is guaranteed by RocksDB by design; while consistency and isolation need to be enforced by concurrency controls on top of RocksDB; Atomicity can be implemented by committing a transaction's writes with one write batch to RocksDB in the end. + +<!--truncate--> + +However, if we enforce atomicity by only committing all writes in the end of the transaction in one batch, you cannot get the updated value from RocksDB previously written by the same transaction (read-your-own-write). To read the updated value, the databases on top of RocksDB need to maintain an internal buffer for all the written keys, and when a read happens they need to merge the result from RocksDB and from this buffer. This is a problem we faced when building the RocksDB storage engine in MongoDB. We solved it by creating a utility class, WriteBatchWithIndex (a write batch with a searchable index) and made it part of public API so that the community can also benefit from it. + +Before talking about the index part, let me introduce write batch first. The write batch class, `WriteBatch`, is a RocksDB data structure for atomic writes of multiple keys. Users can buffer their updates to a `WriteBatch` by calling `write_batch.Put("key1", "value1")` or `write_batch.Delete("key2")`, similar as calling RocksDB's functions of the same names. In the end, they call `db->Write(write_batch)` to atomically update all those batched operations to the DB. It is how a database can guarantee atomicity, as shown above. Adding a searchable index to `WriteBatch`, we now have `WriteBatchWithIndex`. Users can put updates to WriteBatchIndex in the same way as to `WriteBatch`. In the end, users can get a `WriteBatch` object from it and issue `db->Write()`. Additionally, users can create an iterator of a WriteBatchWithIndex, seek to any key location and iterate from there. + +To implement read-your-own-write using `WriteBatchWithIndex`, every time the user creates a transaction, we create a `WriteBatchWithIndex` attached to it. All the writes of the transaction go to the `WriteBatchWithIndex` first. When we commit the transaction, we atomically write the batch to RocksDB. When the user wants to call `Get()`, we first check if the value exists in the `WriteBatchWithIndex` and return the value if existing, by seeking and reading from an iterator of the write batch, before checking data in RocksDB. For example, here is the we implement it in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L245-L260). If a range query comes, we pass a DB's iterator to `WriteBatchWithIndex`, which creates a super iterator which combines the results from the DB iterator with the batch's iterator. Using this super iterator, we can iterate the DB with the t ransaction's own writes. Here is the iterator creation codes in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L266-L269). In this way, the database can solve the read-your-own-write problem by using RocksDB to handle a transaction's uncommitted writes. + +Using `WriteBatchWithIndex`, we successfully implemented read-your-own-writes in the RocksDB storage engine of MongoDB. If you also have a read-your-own-write problem, `WriteBatchWithIndex` can help you implement it quickly and correctly. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown b/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown new file mode 100644 index 0000000..1ffe2c5 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown @@ -0,0 +1,16 @@ +--- +title: Integrating RocksDB with MongoDB +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/1967/integrating-rocksdb-with-mongodb-2/ +--- + +Over the last couple of years, we have been busy integrating RocksDB with various services here at Facebook that needed to store key-value pairs locally. We have also seen other companies using RocksDB as local storage components of their distributed systems. + +<!--truncate--> + +The next big challenge for us is to bring RocksDB storage engine to general purpose databases. Today we have an exciting milestone to share with our community! We're running MongoDB with RocksDB in production and seeing great results! You can read more about it here: [http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) + +Keep tuned for benchmarks and more stability and performance improvements. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown b/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown new file mode 100644 index 0000000..f3a55fa --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown @@ -0,0 +1,10 @@ +--- +title: RocksDB in osquery +layout: post +author: icanadi +category: lgalanis +redirect_from: + - /blog/1997/rocksdb-in-osquery/ +--- + +Check out [this](https://code.facebook.com/posts/1411870269134471/how-rocksdb-is-used-in-osquery/) blog post by [Mike Arpaia](https://www.facebook.com/mike.arpaia) and [Ted Reed](https://www.facebook.com/treeded) about how osquery leverages RocksDB to build an embedded pub-sub system. This article is a great read and contains insights on how to properly use RocksDB. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown b/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown new file mode 100644 index 0000000..b3e2703 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown @@ -0,0 +1,92 @@ +--- +title: RocksDB 2015 H2 roadmap +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/2015/rocksdb-2015-h2-roadmap/ +--- + +Every 6 months, RocksDB team gets together to prioritize the work ahead of us. We just went through this exercise and we wanted to share the results with the community. Here's what RocksDB team will be focusing on for the next 6 months: + +<!--truncate--> + +**MyRocks** + +As you might know, we're working hard to integrate RocksDB as a storage engine for MySQL. This project is pretty important for us because we're heavy users of MySQL. We're already getting pretty good performance results, but there is more work to be done. We need to focus on both performance and stability. The most high priority items on are list are: + + + + + 1. Reduce CPU costs of RocksDB as a MySQL storage engine + + + 2. Implement pessimistic concurrency control to support repeatable read isolation level in MyRocks + + + 3. Reduce P99 read latency, which is high mostly because of lingering tombstones + + + 4. Port ZSTD compression + + +**MongoRocks** + +Another database that we're working on is MongoDB. The project of integrating MongoDB with RocksDB storage engine is called MongoRocks. It's already running in production at Parse [1] and we're seeing surprisingly few issues. Our plans for the next half: + + + + + 1. Keep improving performance and stability, possibly reuse work done on MyRocks (workloads are pretty similar). + + + 2. Increase internal and external adoption. + + + 3. Support new MongoDB 3.2. + + +**RocksDB on cheaper storage media** + +Up to now, our mission was to build the best key-value store âfor fast storageâ (flash and in-memory). However, there are some use-cases at Facebook that don't need expensive high-end storage. In the next six months, we plan to deploy RocksDB on cheaper storage media. We will optimize performance to RocksDB on either or both: + + + + + 1. Hard drive storage array. + + + 2. Tiered Storage. + + +**Quality of Service** + +When talking to our customers, there are couple of issues that keep reoccurring. We need to fix them to make our customers happy. We will improve RocksDB to provide better assurance of performance and resource usage. Non-exhaustive list includes: + + + + + 1. Iterate P99 can be high due to the presence of tombstones. + + + 2. Write stalls can happen during high write loads. + + + 3. Better control of memory and disk usage. + + + 4. Service quality and performance of backup engine. + + +**Operation's user experience** + +As we increase deployment of RocksDB, engineers are spending more time on debugging RocksDB issues. We plan to improve user experience when running RocksDB. The goal is to reduce TTD (time-to-debug). The work includes monitoring, visualizations and documentations. + +[1]( http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) + + +### Comments + +**[Mike]([email protected])** + +Whatâs the status of this roadmap? âRocksDB on cheaper storage mediaâ, has this been implemented? http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown b/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown new file mode 100644 index 0000000..fe7b7b2 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown @@ -0,0 +1,78 @@ +--- +title: Spatial indexing in RocksDB +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/2039/spatial-indexing-in-rocksdb/ +--- + +About a year ago, there was a need to develop a spatial database at Facebook. We needed to store and index Earth's map data. Before building our own, we looked at the existing spatial databases. They were all very good technology, but also general purpose. We could sacrifice a general-purpose API, so we thought we could build a more performant database, since it would be specifically designed for our use-case. Furthermore, we decided to build the spatial database on top of RocksDB, because we have a lot of operational experience with running and tuning RocksDB at a large scale. + +<!--truncate--> + +When we started looking at this project, the first thing that surprised us was that our planet is not that big. Earth's entire map data can fit in memory on a reasonably high-end machine. Thus, we also decided to build a spatial database optimized for memory-resident dataset. + +The first use-case of our spatial database was an experimental map renderer. As part of our project, we successfully loaded [Open Street Maps](https://www.openstreetmap.org/) dataset and hooked it up with [Mapnik](http://mapnik.org/), a map rendering engine. + +The usual Mapnik workflow is to load the map data into a SQL-based database and then define map layers with SQL statements. To render a tile, Mapnik needs to execute a couple of SQL queries. The benefit of this approach is that you don't need to reload your database when you change your map style. You can just change your SQL query and Mapnik picks it up. In our model, we decided to precompute the features we need for each tile. We need to know the map style before we create the database. However, when rendering the map tile, we only fetch the features that we need to render. + +We haven't open sourced the RocksDB Mapnik plugin or the database loading pipeline. However, the spatial indexing is available in RocksDB under a name [SpatialDB](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/spatial_db.h). The API is focused on map rendering use-case, but we hope that it can also be used for other spatial-based applications. + +Let's take a tour of the API. When you create a spatial database, you specify the spatial indexes that need to be built. Each spatial index is defined by a bounding box and granularity. For map rendering, we create a spatial index for each zoom levels. Higher zoom levels have more granularity. + + + + SpatialDB::Create( + SpatialDBOptions(), + "/data/map", { + SpatialIndexOptions("zoom10", BoundingBox(0, 0, 100, 100), 10), + SpatialIndexOptions("zoom16", BoundingBox(0, 0, 100, 100), 16) + } + ); + + + + +When you insert a feature (building, street, country border) into SpatialDB, you need to specify the list of spatial indexes that will index the feature. In the loading phase we process the map style to determine the list of zoom levels on which we'll render the feature. For example, we will not render the building on zoom level that shows an entire country. Building will only be indexed on higher zoom level's index. Country borders will be indexes on all zoom levels. + + + + FeatureSet feature; + feature.Set("type", "building"); + feature.Set("height", 6); + db->Insert(WriteOptions(), BoundingBox<double>(5, 5, 10, 10), + well_known_binary_blob, feature, {"zoom16"}); + + + + +The indexing part is pretty simple. For each feature, we first find a list of index tiles that it intersects. Then, we add a link from the tile's [quad key](https://msdn.microsoft.com/en-us/library/bb259689.aspx) to the feature's primary key. Using quad keys improves data locality, i.e. features closer together geographically will have similar quad keys. Even though we're optimizing for a memory-resident dataset, data locality is still very important due to different caching effects. + +After you're done inserting all the features, you can call an API Compact() that will compact the dataset and speed up read queries. + + + + db->Compact(); + + + + +SpatialDB's query specifies: 1) bounding box we're interested in, and 2) a zoom level. We find all tiles that intersect with the query's bounding box and return all features in those tiles. + + + + + Cursor* c = db_->Query(ReadOptions(), BoundingBox<double>(1, 1, 7, 7), "zoom16"); + for (c->Valid(); c->Next()) { + Render(c->blob(), c->feature_set()); + } + + + + +Note: `Render()` function is not part of RocksDB. You will need to use one of many open source map renderers, for example check out [Mapnik](http://mapnik.org/). + +TL;DR If you need an embedded spatial database, check out RocksDB's SpatialDB. [Let us know](https://www.facebook.com/groups/rocksdb.dev/) how we can make it better. + +If you're interested in learning more, check out this [talk](https://www.youtube.com/watch?v=T1jWsDMONM8). http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown b/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown new file mode 100644 index 0000000..b6bb47d --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown @@ -0,0 +1,30 @@ +--- +title: RocksDB is now available in Windows Platform +layout: post +author: dmitrism +category: blog +redirect_from: + - /blog/2033/rocksdb-is-now-available-in-windows-platform/ +--- + +Over the past 6 months we have seen a number of use cases where RocksDB is successfully used by the community and various companies to achieve high throughput and volume in a modern server environment. + +We at Microsoft Bing could not be left behind. As a result we are happy to [announce](http://bit.ly/1OmWBT9) the availability of the Windows Port created here at Microsoft which we intend to use as a storage option for one of our key/value data stores. + +<!--truncate--> + +We are happy to make this available for the community. Keep tuned for more announcements to come. + +### Comments + +**[Siying Dong]([email protected])** + +Appreciate your contributions to RocksDB project! I believe it will benefits many users! + +**[empresas sevilla]([email protected])** + +Magnifico artÃculo|, un placer leer el blog + +**[jak usunac]([email protected])** + +I believe it will benefits too http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown b/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown new file mode 100644 index 0000000..0ff3a05 --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown @@ -0,0 +1,29 @@ +--- +title: Dynamic Level Size for Level-Based Compaction +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2207/dynamic-level/ +--- + +In this article, we follow up on the first part of an answer to one of the questions in our [AMA](https://www.reddit.com/r/IAmA/comments/3de3cv/we_are_rocksdb_engineering_team_ask_us_anything/ct4a8tb), the dynamic level size in level-based compaction. + +<!--truncate--> + +Level-based compaction is the original LevelDB compaction style and one of the two major compaction styles in RocksDB (See [our wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#multi-threaded-compactions)). In RocksDB we introduced parallelism and more configurable options to it but the main algorithm stayed the same, until we recently introduced the dynamic level size mode. + + +In level-based compaction, we organize data to different sorted runs, called levels. Each level has a target size. Usually target size of levels increases by the same size multiplier. For example, you can set target size of level 1 to be 1GB, and size multiplier to be 10, and the target size of level 1, 2, 3, 4 will be 1GB, 10GB, 100GB and 1000GB. Before level 1, there will be some staging file flushed from mem tables, called Level 0 files, which will later be merged to level 1. Compactions will be triggered as soon as actual size of a level exceeds its target size. We will merge a subset of data of that level to next level, to reduce size of the level. More compactions will be triggered until sizes of all the levels are lower than their target sizes. In a steady state, the size of each level will be around the same size of the size of level targets. + + +Level-based compactionâs advantage is its good space efficiency. We usually use the metric space amplification to measure the space efficiency. In this article ignore the effects of data compression so space amplification= size_on_file_system / size_of_user_data. + + +How do we estimate space amplification of level-based compaction? We focus specifically on the databases in steady state, which means database size is stable or grows slowly over time. This means updates will add roughly the same or little more data than what is removed by deletes. Given that, if we compact all the data all to the last level, the size of level will be equal as the size of last level before the compaction. On the other hand, the size of user data will be approximately the size of DB if we compact all the levels down to the last level. So the size of the last level will be a good estimation of user data size. So total size of the DB divided by the size of the last level will be a good estimation of space amplification. + + +Applying the equation, if we have four non-zero levels, their sizes are 1GB, 10GB, 100GB, 1000GB, the size amplification will be approximately (1000GB + 100GB + 10GB + 1GB) / 1000GB = 1.111, which is a very good number. However, there is a catch here: how to make sure the last levelâs size is 1000GB, the same as the levelâs size target? A user has to fine tune level sizes to achieve this number and will need to re-tune if DB size changes. The theoretic number 1.11 is hard to achieve in practice. In a worse case, if you have the target size of last level to be 1000GB but the user data is only 200GB, then the actual space amplification will be (200GB + 100GB + 10GB + 1GB) / 200GB = 1.555, a much worse number. + + +To solve this problem, my colleague Igor Kabiljo came up with a solution of dynamic level size target mode. You can enable it by setting options.level_compaction_dynamic_level_bytes=true. In this mode, size target of levels are changed dynamically based on size of the last level. Suppose the level size multiplier to be 10, and the DB size is 200GB. The target size of the last level is automatically set to be the actual size of the level, which is 200GB, the second to last levelâs size target will be automatically set to be size_last_level / 10 = 20GB, the third last levelâs will be size_last_level/100 = 2GB, and next level to be size_last_level/1000 = 200MB. We stop here because 200MB is within the range of the first level. In this way, we can achieve the 1.111 space amplification, without fine tuning of the level size targets. More details can be found in [code comments of the option](https://github.com/facebook/rocksdb/blob/v3.11/include/rocksdb/options.h#L366-L423) in the h eader file. http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown b/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown new file mode 100644 index 0000000..332a29f --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown @@ -0,0 +1,193 @@ +--- +title: GetThreadList +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/2261/getthreadlist/ +--- + +We recently added a new API, called `GetThreadList()`, that exposes the RocksDB background thread activity. With this feature, developers will be able to obtain the real-time information about the currently running compactions and flushes such as the input / output size, elapsed time, the number of bytes it has written. Below is an example output of `GetThreadList`. To better illustrate the example, we have put a sample output of `GetThreadList` into a table where each column represents a thread status: + +<!--truncate--> + +<table width="637" > +<tbody > +<tr style="border:2px solid #000000" > + +<td style="padding:3px" >ThreadID +</td> + +<td >140716395198208 +</td> + +<td >140716416169728 +</td> +</tr> +<tr > + +<td style="padding:3px" >DB +</td> + +<td >db1 +</td> + +<td >db2 +</td> +</tr> +<tr > + +<td style="padding:3px" >CF +</td> + +<td >default +</td> + +<td >picachu +</td> +</tr> +<tr > + +<td style="padding:3px" >ThreadType +</td> + +<td >High Pri +</td> + +<td >Low Pri +</td> +</tr> +<tr > + +<td style="padding:3px" >Operation +</td> + +<td >Flush +</td> + +<td >Compaction +</td> +</tr> +<tr > + +<td style="padding:3px" >ElapsedTime +</td> + +<td >143.459 ms +</td> + +<td >607.538 ms +</td> +</tr> +<tr > + +<td style="padding:3px" >Stage +</td> + +<td >FlushJob::WriteLevel0Table +</td> + +<td >CompactionJob::Install +</td> +</tr> +<tr > + +<td style="vertical-align:top;padding:3px" >OperationProperties +</td> + +<td style="vertical-align:top;padding:3px" > +BytesMemtables 4092938 +BytesWritten 1050701 +</td> + +<td style="vertical-align:top" > +BaseInputLevel 1 +BytesRead 4876417 +BytesWritten 4140109 +IsDeletion 0 +IsManual 0 +IsTrivialMove 0 +JobID 146 +OutputLevel 2 +TotalInputBytes 4883044 +</td> +</tr> +</tbody> +</table> + +In the above output, we can see `GetThreadList()` reports the activity of two threads: one thread running flush job (middle column) and the other thread running a compaction job (right-most column). In each thread status, it shows basic information about the thread such as thread id, it's target db / column family, and the job it is currently doing and the current status of the job. For instance, we can see thread 140716416169728 is doing compaction on the `picachu` column family in database `db2`. In addition, we can see the compaction has been running for 600 ms, and it has read 4876417 bytes out of 4883044 bytes. This indicates the compaction is about to complete. The stage property indicates which code block the thread is currently executing. For instance, thread 140716416169728 is currently running `CompactionJob::Install`, which further indicates the compaction job is almost done. + +Below we briefly describe its API. + + +## How to Enable it? + + +To enable thread-tracking of a rocksdb instance, simply set `enable_thread_tracking` to true in its DBOptions: + +```c++ +// If true, then the status of the threads involved in this DB will +// be tracked and available via GetThreadList() API. +// +// Default: false +bool enable_thread_tracking; +``` + + + +## The API + + +The GetThreadList API is defined in [include/rocksdb/env.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/env.h#L317-L318), which is an Env +function: + +```c++ +virtual Status GetThreadList(std::vector* thread_list) +``` + +Since an Env can be shared across multiple rocksdb instances, the output of +`GetThreadList()` include the background activity of all the rocksdb instances +that using the same Env. + +The `GetThreadList()` API simply returns a vector of `ThreadStatus`, each describes +the current status of a thread. The `ThreadStatus` structure, defined in +[include/rocksdb/thread_status.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/thread_status.h), contains the following information: + +```c++ +// An unique ID for the thread. +const uint64_t thread_id; + +// The type of the thread, it could be HIGH_PRIORITY, +// LOW_PRIORITY, and USER +const ThreadType thread_type; + +// The name of the DB instance where the thread is currently +// involved with. It would be set to empty string if the thread +// does not involve in any DB operation. +const std::string db_name; + +// The name of the column family where the thread is currently +// It would be set to empty string if the thread does not involve +// in any column family. +const std::string cf_name; + +// The operation (high-level action) that the current thread is involved. +const OperationType operation_type; + +// The elapsed time in micros of the current thread operation. +const uint64_t op_elapsed_micros; + +// An integer showing the current stage where the thread is involved +// in the current operation. +const OperationStage operation_stage; + +// A list of properties that describe some details about the current +// operation. Same field in op_properties[] might have different +// meanings for different operations. +uint64_t op_properties[kNumOperationProperties]; + +// The state (lower-level action) that the current thread is involved. +const StateType state_type; +``` + +If you are interested in the background thread activity of your RocksDB application, please feel free to give `GetThreadList()` a try :) http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown b/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown new file mode 100644 index 0000000..6852b8f --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown @@ -0,0 +1,45 @@ +--- +title: Use Checkpoints for Efficient Snapshots +layout: post +author: rven2 +category: blog +redirect_from: + - /blog/2609/use-checkpoints-for-efficient-snapshots/ +--- + +**Checkpoint** is a feature in RocksDB which provides the ability to take a snapshot of a running RocksDB database in a separate directory. Checkpoints can be used as a point in time snapshot, which can be opened Read-only to query rows as of the point in time or as a Writeable snapshot by opening it Read-Write. Checkpoints can be used for both full and incremental backups. + +<!--truncate--> + + +The Checkpoint feature enables RocksDB to create a consistent snapshot of a given RocksDB database in the specified directory. If the snapshot is on the same filesystem as the original database, the SST files will be hard-linked, otherwise SST files will be copied. The manifest and CURRENT files will be copied. In addition, if there are multiple column families, log files will be copied for the period covering the start and end of the checkpoint, in order to provide a consistent snapshot across column families. + + + + +A Checkpoint object needs to be created for a database before checkpoints are created. The API is as follows: + + + + +`Status Create(DB* db, Checkpoint** checkpoint_ptr);` + + + + +Given a checkpoint object and a directory, the CreateCheckpoint function creates a consistent snapshot of the database in the given directory. + + + + +`Status CreateCheckpoint(const std::string& checkpoint_dir);` + + + + +The directory should not already exist and will be created by this API. The directory will be an absolute path. The checkpoint can be used as a âread-only copy of the DB or can be opened as a standalone DB. When opened read/write, the SST files continue to be hard links and these links are removed when the files are obsoleted. When the user is done with the snapshot, the user can delete the directory to remove the snapshot. + + + + +Checkpoints are used for online backup in âMyRocks. which is MySQL using RocksDB as the storage engine . ([MySQL on RocksDB](https://github.com/facebook/mysql-5.6)) â http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown ---------------------------------------------------------------------- diff --git a/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown b/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown new file mode 100644 index 0000000..b21b04f --- /dev/null +++ b/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown @@ -0,0 +1,244 @@ +--- +title: Analysis File Read Latency by Level +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2537/analysis-file-read-latency-by-level/ +--- + +In many use cases of RocksDB, people rely on OS page cache for caching compressed data. With this approach, verifying effective of the OS page caching is challenging, because file system is a black box to users. + +As an example, a user can tune the DB as following: use level-based compaction, with L1 - L4 sizes to be 1GB, 10GB, 100GB and 1TB. And they reserve about 20GB memory as OS page cache, expecting level 0, 1 and 2 are mostly cached in memory, leaving only reads from level 3 and 4 requiring disk I/Os. However, in practice, it's not easy to verify whether OS page cache does exactly what we expect. For example, if we end up with doing 4 instead of 2 I/Os per query, it's not easy for users to figure out whether the it's because of efficiency of OS page cache or reading multiple blocks for a level. Analysis like it is especially important if users run RocksDB on hard drive disks, for the gap of latency between hard drives and memory is much higher than flash-based SSDs. + +<!--truncate--> + +In order to make tuning easier, we added new instrumentation to help users analysis latency distribution of file reads in different levels. If users turn DB statistics on, we always keep track of distribution of file read latency for each level. Users can retrieve the information by querying DB property ârocksdb.statsâ ( [https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316](https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316) ). It will also printed out as a part of compaction summary in info logs periodically. + +The output looks like this: + + +``` +** Level 0 read latency histogram (micros): +Count: 696 Average: 489.8118 StdDev: 222.40 +Min: 3.0000 Median: 452.3077 Max: 1896.0000 +Percentiles: P50: 452.31 P75: 641.30 P99: 1068.00 P99.9: 1860.80 P99.99: 1896.00 +------------------------------------------------------ +[ 2, 3 ) 1 0.144% 0.144% +[ 18, 20 ) 1 0.144% 0.287% +[ 45, 50 ) 5 0.718% 1.006% +[ 50, 60 ) 26 3.736% 4.741% # +[ 60, 70 ) 6 0.862% 5.603% +[ 90, 100 ) 1 0.144% 5.747% +[ 120, 140 ) 2 0.287% 6.034% +[ 140, 160 ) 1 0.144% 6.178% +[ 160, 180 ) 1 0.144% 6.322% +[ 200, 250 ) 9 1.293% 7.615% +[ 250, 300 ) 45 6.466% 14.080% # +[ 300, 350 ) 88 12.644% 26.724% ### +[ 350, 400 ) 88 12.644% 39.368% ### +[ 400, 450 ) 71 10.201% 49.569% ## +[ 450, 500 ) 65 9.339% 58.908% ## +[ 500, 600 ) 74 10.632% 69.540% ## +[ 600, 700 ) 92 13.218% 82.759% ### +[ 700, 800 ) 64 9.195% 91.954% ## +[ 800, 900 ) 35 5.029% 96.983% # +[ 900, 1000 ) 12 1.724% 98.707% +[ 1000, 1200 ) 6 0.862% 99.569% +[ 1200, 1400 ) 2 0.287% 99.856% +[ 1800, 2000 ) 1 0.144% 100.000% + +** Level 1 read latency histogram (micros): +(......not pasted.....) + +** Level 2 read latency histogram (micros): +(......not pasted.....) + +** Level 3 read latency histogram (micros): +(......not pasted.....) + +** Level 4 read latency histogram (micros): +(......not pasted.....) + +** Level 5 read latency histogram (micros): +Count: 25583746 Average: 421.1326 StdDev: 385.11 +Min: 1.0000 Median: 376.0011 Max: 202444.0000 +Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 +------------------------------------------------------ +[ 0, 1 ) 2351 0.009% 0.009% +[ 1, 2 ) 6077 0.024% 0.033% +[ 2, 3 ) 8471 0.033% 0.066% +[ 3, 4 ) 788 0.003% 0.069% +[ 4, 5 ) 393 0.002% 0.071% +[ 5, 6 ) 786 0.003% 0.074% +[ 6, 7 ) 1709 0.007% 0.080% +[ 7, 8 ) 1769 0.007% 0.087% +[ 8, 9 ) 1573 0.006% 0.093% +[ 9, 10 ) 1495 0.006% 0.099% +[ 10, 12 ) 3043 0.012% 0.111% +[ 12, 14 ) 2259 0.009% 0.120% +[ 14, 16 ) 1233 0.005% 0.125% +[ 16, 18 ) 762 0.003% 0.128% +[ 18, 20 ) 451 0.002% 0.130% +[ 20, 25 ) 794 0.003% 0.133% +[ 25, 30 ) 1279 0.005% 0.138% +[ 30, 35 ) 1172 0.005% 0.142% +[ 35, 40 ) 1363 0.005% 0.148% +[ 40, 45 ) 409 0.002% 0.149% +[ 45, 50 ) 105 0.000% 0.150% +[ 50, 60 ) 80 0.000% 0.150% +[ 60, 70 ) 280 0.001% 0.151% +[ 70, 80 ) 1583 0.006% 0.157% +[ 80, 90 ) 4245 0.017% 0.174% +[ 90, 100 ) 6572 0.026% 0.200% +[ 100, 120 ) 9724 0.038% 0.238% +[ 120, 140 ) 3713 0.015% 0.252% +[ 140, 160 ) 2383 0.009% 0.261% +[ 160, 180 ) 18344 0.072% 0.333% +[ 180, 200 ) 51873 0.203% 0.536% +[ 200, 250 ) 631722 2.469% 3.005% +[ 250, 300 ) 2721970 10.639% 13.644% ## +[ 300, 350 ) 5909249 23.098% 36.742% ##### +[ 350, 400 ) 6522507 25.495% 62.237% ##### +[ 400, 450 ) 4296332 16.793% 79.030% ### +[ 450, 500 ) 2130323 8.327% 87.357% ## +[ 500, 600 ) 1553208 6.071% 93.428% # +[ 600, 700 ) 642129 2.510% 95.938% # +[ 700, 800 ) 372428 1.456% 97.394% +[ 800, 900 ) 187561 0.733% 98.127% +[ 900, 1000 ) 85858 0.336% 98.462% +[ 1000, 1200 ) 82730 0.323% 98.786% +[ 1200, 1400 ) 50691 0.198% 98.984% +[ 1400, 1600 ) 38026 0.149% 99.133% +[ 1600, 1800 ) 32991 0.129% 99.261% +[ 1800, 2000 ) 30200 0.118% 99.380% +[ 2000, 2500 ) 62195 0.243% 99.623% +[ 2500, 3000 ) 36684 0.143% 99.766% +[ 3000, 3500 ) 21317 0.083% 99.849% +[ 3500, 4000 ) 10216 0.040% 99.889% +[ 4000, 4500 ) 8351 0.033% 99.922% +[ 4500, 5000 ) 4152 0.016% 99.938% +[ 5000, 6000 ) 6328 0.025% 99.963% +[ 6000, 7000 ) 3253 0.013% 99.976% +[ 7000, 8000 ) 2082 0.008% 99.984% +[ 8000, 9000 ) 1546 0.006% 99.990% +[ 9000, 10000 ) 1055 0.004% 99.994% +[ 10000, 12000 ) 1566 0.006% 100.000% +[ 12000, 14000 ) 761 0.003% 100.003% +[ 14000, 16000 ) 462 0.002% 100.005% +[ 16000, 18000 ) 226 0.001% 100.006% +[ 18000, 20000 ) 126 0.000% 100.006% +[ 20000, 25000 ) 107 0.000% 100.007% +[ 25000, 30000 ) 43 0.000% 100.007% +[ 30000, 35000 ) 15 0.000% 100.007% +[ 35000, 40000 ) 14 0.000% 100.007% +[ 40000, 45000 ) 16 0.000% 100.007% +[ 45000, 50000 ) 1 0.000% 100.007% +[ 50000, 60000 ) 22 0.000% 100.007% +[ 60000, 70000 ) 10 0.000% 100.007% +[ 70000, 80000 ) 5 0.000% 100.007% +[ 80000, 90000 ) 14 0.000% 100.007% +[ 90000, 100000 ) 11 0.000% 100.007% +[ 100000, 120000 ) 33 0.000% 100.007% +[ 120000, 140000 ) 6 0.000% 100.007% +[ 140000, 160000 ) 3 0.000% 100.007% +[ 160000, 180000 ) 7 0.000% 100.007% +[ 200000, 250000 ) 2 0.000% 100.007% +``` + + +In this example, you can see we only issued 696 reads from level 0 while issued 25 million reads from level 5. The latency distribution is also clearly shown among those reads. This will be helpful for users to analysis OS page cache efficiency. + +Currently the read latency per level includes reads from data blocks, index blocks, as well as bloom filter blocks. We are also working on a feature to break down those three type of blocks. + +### Comments + +**[Tao Feng]([email protected])** + +Is this feature also included in RocksJava? + +**[Siying Dong]([email protected])** + +Should be. As long as you enable statistics, you should be able to get the value from `RocksDB.getProperty()` with property `rocksdb.dbstats`. Let me know if you canât find it. + +**[chiddu]([email protected])** + +> In this example, you can see we only issued 696 reads from level 0 while issued 256K reads from level 5. + +Isnât it 2.5 M of reads instead of 256K ? . + +Also could anyone please provide more description on the histogram ? especially + +> Count: 25583746 Average: 421.1326 StdDev: 385.11 +> Min: 1.0000 Median: 376.0011 Max: 202444.0000 +> Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 + +and + +> [ 0, 1 ) 2351 0.009% 0.009% +> [ 1, 2 ) 6077 0.024% 0.033% +> [ 2, 3 ) 8471 0.033% 0.066% +> [ 3, 4 ) 788 0.003% 0.069%â + +thanks in advance + +**[Siying Dong]([email protected])** + +Thank you for pointing out the mistake. I fixed it now. + +In this output, there are 2.5 million samples, average latency is 421 micro seconds, with standard deviation 385. Median is 376, max value is 202 milliseconds. 0.009% has value of 1, 0.024% has value of 1, 0.033% has value of 2. Accumulated value from 0 to 2 is 0.066%. + +Hope it helps. + +**[chiddu]([email protected])** + +Thank you Siying for the quick reply, I was running couple of benchmark testing to check the performance of rocksdb on SSD. One of the test is similar to what is mentioned in the wiki, TEST 4 : Random read , except the key_size is 10 and value_size is 20. I am inserting 1 billion hashes and reading 1 billion hashes with 32 threads. The histogram shows something like this + +``` +Level 5 read latency histogram (micros): +Count: 7133903059 Average: 480.4357 StdDev: 309.18 +Min: 0.0000 Median: 551.1491 Max: 224142.0000 +Percentiles: P50: 551.15 P75: 651.44 P99: 996.52 P99.9: 2073.07 P99.99: 3196.32 +ââââââââââââââââââ +[ 0, 1 ) 28587385 0.401% 0.401% +[ 1, 2 ) 686572516 9.624% 10.025% ## +[ 2, 3 ) 567317522 7.952% 17.977% ## +[ 3, 4 ) 44979472 0.631% 18.608% +[ 4, 5 ) 50379685 0.706% 19.314% +[ 5, 6 ) 64930061 0.910% 20.224% +[ 6, 7 ) 22613561 0.317% 20.541% +â¦â¦â¦â¦moreâ¦â¦â¦â¦. +``` + +If I understand your previous comment correctly, + +1. How is it that the count is around 7 billion when I have only inserted 1 billion hashes ? is the stat broken ? +1. What does the percentiles and the numbers signify ? +1. 0, 1 ) 28587385 0.401% 0.401% what does this â28587385â stand for in the histogram row ? + +**[Siying Dong]([email protected])** + +If I remember correctly, with db_bench, if you specify ânum=1000000000 âthreads=32, it is every thread reading one billion keys, total of 32 billions. Is it the case you ran into? + +28,587,385 means that number of data points take the value [0,1) +28,587,385 / 7,133,903,058 = 0.401% provides percentage. + +**[chiddu]([email protected])** + +I do have `num=1000000000` and `t=32`. The script says reading 1 billion hashes and not 32 billion hashes. + +this is the script on which I have used + +``` +echo âLoad 1B keys sequentially into databaseâ¦..â +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=1; sync=0; r=1000000000; t=1; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench âbenchmarks=fillseq âdisable_seek_compaction=1 âmmap_read=0 âstatistics=1 âhistogram=1 ânum=$r âthreads=$t âvalue_size=$vs âblock_size=$bs âcache_size=$cs âbloom_bits=10 âcache_numshardbits=6 âopen_files=$of âverify_checksum=1 âdb=/data/mysql/leveldb/test âsync=$sync âdisable_wal=1 âcompression_type=none âstats_interval=$si âcompression_ratio=0.5 âdisable_data_sync=$dds âwrite_buffer_size=$wbs âtarget_file_size_base=$mb âmax_write_buffer_number=$wbn âmax_background_compactions=$mbc âlevel0_file_num_compaction_trigger=$ctrig âlevel0_slowdown_writes_trigger=$delay âlevel0_stop_writes_trigger=$stop ânum_levels=$levels âdelete_obsolete_files_period_micros=$del âmin_level_to_compress=$mcz âmax_grandparent_overl ap_factor=$overlap âstats_per_interval=1 âmax_bytes_for_level_base=$bpl âuse_existing_db=0 âkey_size=10 + +echo âReading 1B keys in database in random orderâ¦.â +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=0; sync=0; r=1000000000; t=32; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench âbenchmarks=readrandom âdisable_seek_compaction=1 âmmap_read=0 âstatistics=1 âhistogram=1 ânum=$r âthreads=$t âvalue_size=$vs âblock_size=$bs âcache_size=$cs âbloom_bits=10 âcache_numshardbits=6 âopen_files=$of âverify_checksum=1 âdb=/some_data_base âsync=$sync âdisable_wal=1 âcompression_type=none âstats_interval=$si âcompression_ratio=0.5 âdisable_data_sync=$dds âwrite_buffer_size=$wbs âtarget_file_size_base=$mb âmax_write_buffer_number=$wbn âmax_background_compactions=$mbc âlevel0_file_num_compaction_trigger=$ctrig âlevel0_slowdown_writes_trigger=$delay âlevel0_stop_writes_trigger=$stop ânum_levels=$levels âdelete_obsolete_files_period_micros=$del âmin_level_to_compress=$mcz âmax_grandparent_overlap_fa ctor=$overlap âstats_per_interval=1 âmax_bytes_for_level_base=$bpl âuse_existing_db=1 âkey_size=10 +``` + +After running this script, there were no issues wrt to loading billion hashes , but when it came to reading part, its been almost 4 days and still I have only read 7 billion hashes and have read 200 million hashes in 2 and half days. Is there something which is missing in db_bench or something which I am missing ? + +**[Siying Dong]([email protected])** + +Itâs a printing error then. If you have `num=1000000000` and `t=32`, it will be 32 threads, and each reads 1 billion keys.
