[04/51] [partial] nifi-minifi-cpp git commit: MINIFI-372: Replace leveldb with RocksDB

jeremydyer Mon, 09 Oct 2017 09:25:38 -0700

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown
----------------------------------------------------------------------
diff --git a/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown
new file mode 100644
index 0000000..12009cc
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-05-14-lock.markdown
@@ -0,0 +1,88 @@
+---
+title: Reducing Lock Contention in RocksDB
+layout: post
+author: sdong
+category: blog
+redirect_from:
+  - /blog/521/lock/
+---
+
+In this post, we briefly introduce recent improvements we made to RocksDB to reduce the cost of lock contention.
+
+RocksDB has a simple thread synchronization mechanism (see the [RocksDB Architecture Guide](https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide) to understand terms used below, like SST tables or mem tables). SST tables are immutable after being written, and mem tables are lock-free data structures supporting a single writer and multiple readers. There is only one major lock, the DB mutex (DBImpl.mutex_), protecting all the meta operations, including:
+
+<!--truncate-->
+
+  * Increase or decrease reference counters of mem tables and SST tables
+
+
+  * Change and check meta data structures, before and after finishing 
compactions, flushes and new mem table creations
+
+
+  * Coordinating writers
+
+
+This DB mutex used to be a scalability bottleneck preventing us from scaling to more than 16 threads. To address the issue, we improved RocksDB in several ways.
+
+1. Consolidate reference counters and introduce "super version". For every read operation, the mutex was acquired and the reference counters of each mem table and each SST table were increased. One such operation is not expensive, but if you are building a high-throughput server with lots of reads, the lock contention becomes the bottleneck. This is especially true if you store all your data in RAM.
+
+To solve this problem, we created a meta-meta data structure called "[super version](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c)", which holds references to all of those mem tables and SST tables, so that readers only need to increase the reference counter of this single data structure. In RocksDB, the list of live mem tables and SST tables changes infrequently: only when new mem tables are created or a flush or compaction happens. At those times, a new super version is created and the reference counters of the tables it lists are increased. A super version lists the live mem tables and SST tables, so a reader only needs to acquire the lock in order to find the latest super version and increase its reference counter. From the super version, the reader can find all the mem tables and SST tables, which are safely accessible as long as the reader holds a reference to the super version.
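
The consolidation described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not RocksDB's implementation: `Table`, `SuperVersionSketch`, `DbSketch` and their methods are hypothetical names, and a plain `int` guarded by the mutex stands in for the real reference counter. The point it shows is that a reader takes one lock and bumps one counter, however many tables are live.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mem table or an SST table.
struct Table {
  std::string name;
};
using TableList = std::vector<std::shared_ptr<Table>>;

// One immutable list of live tables with a single reference counter.
struct SuperVersionSketch {
  SuperVersionSketch(TableList mems, TableList ssts)
      : mem_tables(std::move(mems)), sst_tables(std::move(ssts)) {}
  const TableList mem_tables;
  const TableList sst_tables;
  int refs = 1;  // guarded by DbSketch's mutex; the DB holds the first reference
};

class DbSketch {
 public:
  DbSketch() : current_(new SuperVersionSketch(TableList{}, TableList{})) {}
  ~DbSketch() { Release(current_); }

  // Reader path: one lock and one increment, independent of the table count.
  SuperVersionSketch* Acquire() {
    std::lock_guard<std::mutex> guard(mutex_);
    ++current_->refs;
    return current_;
  }

  // Drops one reference; the memory is freed outside the mutex.
  void Release(SuperVersionSketch* sv) {
    bool last = false;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      last = (--sv->refs == 0);
    }
    if (last) delete sv;
  }

  // Called on the rare events that change the table list
  // (new mem table, flush, compaction).
  void Install(TableList mems, TableList ssts) {
    auto* fresh = new SuperVersionSketch(std::move(mems), std::move(ssts));
    SuperVersionSketch* old = nullptr;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      old = current_;
      current_ = fresh;
    }
    Release(old);  // readers that still hold the old version keep it alive
  }

 private:
  std::mutex mutex_;
  SuperVersionSketch* current_;
};
```

A reader would call `Acquire()`, read from the tables listed in the returned object, and then call `Release()`. An old super version stays valid until its last reader releases it.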
+
+2. Replace some reference counters with std::atomic objects, so that decreasing the reference count of an object usually doesn't need to happen inside the mutex any more.
+
+3. Make fetching super version and reference counting lock-free in read queries. After consolidating reference counting into one single super version and removing the locking for decreasing reference counts, the read path acquires the mutex for only one thing: fetching the latest super version and increasing its reference count (releasing the reference is done with an atomic decrement). We designed and implemented a (mostly) lock-free approach to do it. See [details](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). We will write a separate blog post about that.
+
+4. Avoid disk I/O inside the mutex. As we know, each disk I/O to a hard drive takes several milliseconds. It can take even longer if the file system journal is involved or I/Os are queued. Even occasional disk I/O within the mutex can cause huge performance outliers.
+We identified two situations in which we might do disk I/O inside the mutex, and we removed them:
+(1) Opening and closing transactional log files. We moved those operations out of the mutex.
+(2) Information logging. In multiple places we write to logs within the mutex. There is a chance that a file write will wait for disk I/O to finish before returning, even if fsync() is not issued, especially on EXT file systems. We occasionally see 100+ millisecond write() latency on EXT. Instead of removing this logging, we came up with a solution of delayed logging. When inside the mutex, instead of writing directly to the log file, we write to a log buffer, along with the timing information. As soon as the mutex is released, we flush the log buffer to the log files.
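
The delayed logging idea can be sketched roughly as follows. The names `LogBufferSketch` and `InstallNewMemTable` and the callback-style `sink` are assumptions made for illustration and do not reflect the actual RocksDB classes. Messages are recorded with their timestamps while the mutex is held, and the potentially slow write happens only after the mutex is released.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Collects log lines in memory; nothing in this class touches the disk.
class LogBufferSketch {
 public:
  void Add(std::string message) {
    // Record the time of the event itself, not the time of the later flush.
    entries_.push_back({std::chrono::system_clock::now(), std::move(message)});
  }

  // Hands every buffered line to `sink` and empties the buffer.
  template <typename Sink>
  void FlushTo(Sink&& sink) {
    for (const Entry& e : entries_) sink(e.when, e.message);
    entries_.clear();
  }

 private:
  struct Entry {
    std::chrono::system_clock::time_point when;
    std::string message;
  };
  std::vector<Entry> entries_;
};

// Hypothetical meta operation that logs while holding the DB mutex.
template <typename Sink>
void InstallNewMemTable(std::mutex& db_mutex, int& mem_table_count, Sink&& sink) {
  LogBufferSketch log_buffer;  // local to this call, so it needs no locking
  {
    std::lock_guard<std::mutex> guard(db_mutex);
    ++mem_table_count;
    log_buffer.Add("new mem table #" + std::to_string(mem_table_count));
  }  // the mutex is released here
  log_buffer.FlushTo(sink);  // possibly slow I/O, now outside the mutex
}
```

In a real system the sink would write to the info log file; here it is a callback so that the sketch stays self-contained.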
+
+5. Reduce object creation inside the mutex.
+Object creation can be slow because it involves malloc (in our case). Malloc 
sometimes is slow because it needs to lock some shared data structures. 
Allocating can also be slow because we sometimes do expensive operations in 
some of our classes' constructors. For these reasons, we try to reduce object 
creations inside the mutex. Here are two examples:
+
+(1) std::vector uses malloc inside. We introduced the "[autovector](https://reviews.facebook.net/rROCKSDBc01676e46d3be08c3c140361ef1f5884f47d3b3c)" data structure, in which memory for the first few elements is pre-allocated as members of the autovector class. When an autovector is used as a stack variable, no malloc is needed unless the pre-allocated buffer is used up. This autovector is quite useful for manipulating those meta data structures, since those meta operations are often performed while holding the DB mutex.
+
+(2) When building an iterator, we used to create an iterator for every live mem table and SST table within the mutex, with a merging iterator on top of them. Besides malloc, some of those iterators can be quite expensive to create, for example because creating them involves sorting. Now, instead of doing that, we simply increase the reference counters of the tables, and release the mutex before creating any iterator.
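
To make the autovector idea from example (1) concrete, here is a deliberately simplified sketch. `SmallVectorSketch` is a hypothetical name, and the real autovector offers a much richer interface. The first `kInline` elements live in an array that is a member of the object, and only later elements fall back to a heap-backed `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified small-buffer vector; T must be default-constructible and copyable.
template <typename T, std::size_t kInline = 8>
class SmallVectorSketch {
 public:
  void push_back(const T& value) {
    if (size_ < kInline) {
      inline_[size_] = value;      // fits in the pre-allocated member array
    } else {
      overflow_.push_back(value);  // buffer used up: fall back to the heap
    }
    ++size_;
  }

  const T& operator[](std::size_t i) const {
    assert(i < size_);
    return i < kInline ? inline_[i] : overflow_[i - kInline];
  }

  std::size_t size() const { return size_; }

  // True once at least one element has spilled to the heap-backed vector.
  bool on_heap() const { return !overflow_.empty(); }

 private:
  T inline_[kInline] = {};   // lives inside the object (on the stack for locals)
  std::vector<T> overflow_;  // typically allocates nothing while it is empty
  std::size_t size_ = 0;
};
```

Used as a local variable inside a critical section, such a container avoids malloc entirely as long as it holds at most `kInline` elements.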
+
+6. Deal with mutexes in LRU caches.
+When I said there was only one single major lock, I was lying. In RocksDB, all LRU caches had exclusive mutexes to protect writes to the LRU lists, which happen in both read and write operations. LRU caches are used in the block cache and the table cache. Both of them are accessed more frequently than the DB data structures, so contention on these two locks is as intense as on the DB mutex. Even though the LRU cache is sharded into ShardedLRUCache, we can still see lock contention, especially in table caches. We further address this issue in two ways:
+(1) Bypassing table caches. A table cache maintains a list of SST tables' read handlers. Those handlers contain the SST files' descriptors, table metadata, and possibly data indexes, as well as bloom filters. When a table handler needs to be evicted based on LRU, that information is cleared. When an SST table needs to be read and its table handler is not in the LRU cache, the table is opened and its metadata is loaded. In some cases, users want to tune the system so that table handler evictions never happen. This is common for high-throughput, low-latency servers. We introduced a mode in which the table cache is bypassed in read queries. In this mode, all table handlers are cached and accessed directly, so there is no need to query and adjust table caches when reading the database. It is the users' responsibility to reserve enough resources for it. This mode can be turned on by setting options.max_open_files=-1.
+
+(2) [New PlainTable 
format](//github.com/facebook/rocksdb/wiki/PlainTable-Format) (optimized for 
SST in ramfs/tmpfs) does not organize data by blocks. Data are located by 
memory addresses so no block cache is needed.
+
+With all of those improvements, lock contention is not a bottleneck anymore, as shown in our [memory-only benchmark](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks). Furthermore, lock contention no longer causes the huge (50+ millisecond) latency outliers it used to cause.
+
+### Comments
+
+**[Lee Hounshell]([email protected])**
+
+Please post an example of reading the same rocksdb concurrently.
+
+We are using the latest 3.0 rocksdb; however, when two separate processes
+try and open the same rocksdb for reading, only one of the open requests
+succeed. The other open always fails with "db/LOCK: Resource temporarily unavailable" So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated.
+
+**[Siying Dong]([email protected])**
+
+Sorry for the delay. We don't have feature support for this scenario yet. Here is an example of how you can work around this problem. You can build a snapshot of the DB by doing this:
+
+1. create a separate directory on the same host for a snapshot of the DB.
+1. call `DB::DisableFileDeletions()`
+1. call `DB::GetLiveFiles()` to get a full list of the files.
+1. for all the files except manifest, add a hardlink file in your new 
directory pointing to the original file
+1. copy the manifest file and truncate the size (you can read the comments of 
`DB::GetLiveFiles()` for more information)
+1. call `DB::EnableFileDeletions()`
+1. now you can open the snapshot directory in another process to access those 
files. Please remember to delete the directory after reading the data to allow 
those files to be recycled.
+
+By the way, the best way to ask those questions is in our [facebook 
group](https://www.facebook.com/groups/rocksdb.dev/). Let us know if you need 
any further help.
+
+**[Darshan]([email protected])**
+
+Will this consistency problem of RocksDB all occurs in case of single 
put/write?
+What all ACID properties is supported by RocksDB, only durability irrespective 
of single or batch write?
+
+**[Siying Dong]([email protected])**
+
+We recently [introduced optimistic 
transaction](https://reviews.facebook.net/D33435) which can help you ensure all 
of ACID.
+
+This blog post is mainly about optimizations in implementation. The RocksDB 
consistency semantic is not changed.


http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown
new file mode 100644
index 0000000..61c90dc
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown
@@ -0,0 +1,24 @@
+---
+title: RocksDB 3.0 release
+layout: post
+author: icanadi
+category: blog
+redirect_from:
+  - /blog/557/rocksdb-3-0-release/
+---
+
+Check out the new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/3.0.fb)!
+
+New features in RocksDB 3.0:
+
+  * [Column Family 
support](https://github.com/facebook/rocksdb/wiki/Column-Families)
+
+
+  * [Ability to choose a different checksum function](https://github.com/facebook/rocksdb/commit/0afc8bc29a5800e3212388c327c750d32e31f3d6)
+
+
+  * Deprecated ReadOptions::prefix_seek and ReadOptions::prefix
+
+<!--truncate-->
+
+Check out the full [change 
log](https://github.com/facebook/rocksdb/blob/3.0.fb/HISTORY.md).

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown
new file mode 100644
index 0000000..3015674
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown
@@ -0,0 +1,20 @@
+---
+title: RocksDB 3.1 release
+layout: post
+author: icanadi
+category: blog
+redirect_from:
+  - /blog/575/rocksdb-3-1-release/
+---
+
+Check out the new release on 
[Github](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.1)!
+
+New features in RocksDB 3.1:
+
+  * [Materialized hash 
index](https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7)
+
+
+  * [FIFO compaction 
style](https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style)
+
+
+We released 3.1 so soon after 3.0 because one of our internal customers needed the materialized hash index.

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown
 
b/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown
new file mode 100644
index 0000000..6a641f2
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown
@@ -0,0 +1,47 @@
+---
+title: PlainTable - A New File Format
+layout: post
+author: sdong
+category: blog
+redirect_from:
+  - /blog/599/plaintable-a-new-file-format/
+---
+
+In this post, we are introducing "PlainTable" -- a file format we designed for 
RocksDB, initially to satisfy a production use case at Facebook.
+
+Design goals:
+
+1. All data stored in memory, in files stored in tmpfs/ramfs. Support DBs larger than 100GB (may be sharded across multiple RocksDB instances).
+1. Optimize for [prefix 
hashing](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf)
+1. Less than or around 1 microsecond average latency for a single Get() or Seek().
+1. Minimize memory consumption.
+1. Queries efficiently return empty results
+
+<!--truncate-->
+
+Notice that our priority was not to maximize query performance, but to strike 
a balance between query performance and memory consumption. PlainTable query 
performance is not as good as you would see with a nicely-designed hash table, 
but they are of the same order of magnitude, while keeping memory overhead to a 
minimum.
+
+Since we are targeting microsecond latency, our budget is on the order of a handful of CPU cache misses (if they cannot be parallelized, which is usually the case for index look-ups). On our target hardware, with multi-socket Intel CPUs with NUMA, we can only allow 4-5 CPU cache misses (including the cost of data TLB misses).
+
+To meet our requirements, given that only hash prefix iterating is needed, we 
made two decisions:
+
+1. to use a hash index, which is
+1. directly addressed to rows, with no block structure.
+
+Having addressed our latency goal, the next task was to design a very compact 
hash index to minimize memory consumption. Some tricks we used to meet this 
goal:
+
+1. We only use 32-bit integers for data and index offsets. The first bit serves as a flag, so we can avoid using 8-byte pointers.
+1. We never copy keys or parts of keys to index search structures. We store 
only offsets from which keys can be retrieved, to make comparisons with search 
keys.
+1. Since our file is immutable, we can accurately estimate the number of hash 
buckets needed.
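
The first trick can be illustrated with a small bit-packing sketch. The names below are hypothetical, and the post does not say what the flag signals, so the code treats it as an opaque boolean. It only shows how a 31-bit offset and a one-bit flag share a single 32-bit word, half the size of an 8-byte pointer.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kFlagMask = 0x80000000u;    // the first (top) bit
constexpr std::uint32_t kOffsetMask = 0x7FFFFFFFu;  // the remaining 31 bits

// One hash bucket entry: a 31-bit offset plus a 1-bit flag in 4 bytes.
struct BucketEntrySketch {
  std::uint32_t raw;
};
static_assert(sizeof(BucketEntrySketch) == sizeof(std::uint32_t),
              "an entry must stay as small as one 32-bit integer");

inline BucketEntrySketch MakeEntry(std::uint32_t offset, bool flag) {
  assert((offset & kFlagMask) == 0);  // the offset must fit in 31 bits
  return BucketEntrySketch{flag ? (offset | kFlagMask) : offset};
}

inline bool Flag(BucketEntrySketch e) { return (e.raw & kFlagMask) != 0; }

inline std::uint32_t Offset(BucketEntrySketch e) { return e.raw & kOffsetMask; }
```

The price of this layout is that a single offset can address at most 2 GiB, which fits a design in which large databases are sharded across files and instances.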
+
+To make sure the format works efficiently with empty queries, we added a bloom 
filter check before the query. This adds only one cache miss for non-empty 
cases [1], but avoids multiple cache misses for most empty results queries. 
This is a good trade-off for use cases with a large percentage of empty results.
+
+These are the design goals and basic ideas of PlainTable file format. For 
detailed information, see [this wiki 
page](https://github.com/facebook/rocksdb/wiki/PlainTable-Format).
+
+[1] Bloom filter checks typically require multiple memory accesses. However, because they are independent, they usually do not stall the CPU pipeline. In any case, we improved the bloom filter to improve data locality - we may cover this further in a future blog post.
+
+### Comments
+
+**[Siying Dong]([email protected])**
+
+Does [http://rocksdb.org/feed/](http://rocksdb.org/feed/) work?

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown
 
b/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown
new file mode 100644
index 0000000..4411c7a
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown
@@ -0,0 +1,89 @@
+---
+title: Avoid Expensive Locks in Get()
+layout: post
+author: leijin
+category: blog
+redirect_from:
+  - /blog/677/avoid-expensive-locks-in-get/
+---
+
+As promised in the previous [blog post](blog/2014/05/14/lock.html)!
+
+RocksDB employs a multiversion concurrency control strategy. Before reading 
data, it needs to grab the current version, which is encapsulated in a data 
structure called 
[SuperVersion](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c).
+
+<!--truncate-->
+
+At the beginning of `GetImpl()`, it used to do this:
+
+
+    mutex_.Lock();
+    auto* s = super_version_->Ref();
+    mutex_.Unlock();
+
+
+The lock is necessary because the pointer super_version_ may be updated, and the corresponding SuperVersion may be deleted, while Ref() is in progress.
+
+
+`Ref()` simply increases the reference counter and returns the "this" pointer. However, this simple operation posed big challenges for in-memory workloads and stopped RocksDB from scaling read throughput beyond 8 cores. Running 32 read threads on a 32-core CPU leads to [70% system CPU usage](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). This is outrageous!
+
+
+
+
+Luckily, we found a way to circumvent this problem by using [thread local storage](http://en.wikipedia.org/wiki/Thread-local_storage). A version change is a rare event compared to millions of read requests. On its very first Get() request, each thread pays the mutex cost to acquire a reference to the current super version. Instead of releasing the reference after use, the thread caches it in its thread-local storage. An atomic variable tracks the global super version number. Subsequent reads simply compare the local super version number against the global super version number. If they are the same, the cached super version reference may be used directly, at no cost. If a version change is detected, the mutex must be acquired to update the reference. The cost of the mutex lock is amortized over millions of reads and becomes negligible.
+
+
+
+
+The code looks something like this:
+
+
+
+
+
+    SuperVersion* s = thread_local_->Get();
+    if (s->version_number != super_version_number_.load()) {
+      // slow path, cleanup of current super version is omitted
+      mutex_.Lock();
+      s = super_version_->Ref();
+      mutex_.Unlock();
+    }
+
+
+
+
+The result is quite amazing. RocksDB can nicely [scale to 32 cores](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) and most CPU time is spent in user land.
+
+
+
+
+Daryl Grove gives a pretty good [comparison between mutex and 
atomic](https://blogs.oracle.com/d/entry/the_cost_of_mutexes). However, the 
real cost difference lies beyond what is shown in the assembly code. Mutex can 
keep threads spinning on CPU or even trigger thread context switches in which 
all readers compete to access the critical area. Our approach prevents mutual 
competition by directing threads to check against a global version which does 
not change at high frequency, and is therefore much more cache-friendly.
+
+
+
+
+The new approach entails one issue: a thread can visit GetImpl() once and never come back. Its SuperVersion stays referenced and cached in its thread-local storage, so all resources (e.g., memtables, files) that belong to that version are frozen. A "supervisor" is required to visit each thread's local storage and free its resources without incurring a lock. We designed a lockless sweep using CAS (the compare-and-swap instruction). Here is how it works:
+
+
+
+
+(1) A reader thread uses CAS to acquire SuperVersion from its local storage 
and to put in a special flag (SuperVersion::kSVInUse).
+
+
+
+
+(2) Upon completion of GetImpl(), the reader thread tries to return SuperVersion to local storage by CAS, expecting the special flag (SuperVersion::kSVInUse) in its local storage. If it does not see SuperVersion::kSVInUse, that means a "sweep" was done and the reader thread is responsible for cleanup (this is expensive, but does not happen often on the hot path).
+
+
+
+
+(3) After any flush/compaction, the background thread performs a sweep (CAS) across all threads' local storage and frees any SuperVersion it encounters. A reader thread must re-acquire a new SuperVersion reference on its next visit.
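
The three steps can be sketched for a single thread's slot as shown below. This is an illustration under assumptions rather than the RocksDB code: `ThreadSlotSketch` and its method names are hypothetical, an atomic exchange stands in for the CAS in steps (1) and (3), and a private marker address plays the role of `SuperVersion::kSVInUse`.

```cpp
#include <atomic>
#include <cassert>

struct SuperVersionSketch {
  int version_number;
};

// One thread's local storage slot, modelled as a single atomic pointer.
class ThreadSlotSketch {
 public:
  // Step (1): the reader takes the cached pointer and leaves the in-use
  // marker behind. Returns nullptr if nothing was cached (slow path needed).
  SuperVersionSketch* Take() {
    void* prev = slot_.exchange(InUseMarker());
    return static_cast<SuperVersionSketch*>(prev);
  }

  // Step (2): the reader tries to put its pointer back. This fails if a sweep
  // ran in the meantime; the reader must then clean up `sv` itself.
  bool Return(SuperVersionSketch* sv) {
    void* expected = InUseMarker();
    return slot_.compare_exchange_strong(expected, sv);
  }

  // Step (3): the background thread empties the slot. Returns the cached
  // pointer to free, or nullptr if the slot was empty or currently in use.
  SuperVersionSketch* Sweep() {
    void* prev = slot_.exchange(nullptr);
    if (prev == InUseMarker()) return nullptr;  // the reader will notice in Return()
    return static_cast<SuperVersionSketch*>(prev);
  }

 private:
  static void* InUseMarker() {
    static char marker;  // its address never collides with a real object
    return &marker;
  }
  std::atomic<void*> slot_{nullptr};
};
```

Neither side blocks: a reader whose slot was swept while it was reading finds out when `Return()` fails, and the sweeper never waits for a reader that is in the middle of a `Get()`.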
+
+### Comments
+
+**[David Barbour]([email protected])**
+
+Please post an example of reading the same rocksdb concurrently.
+
+We are using the latest 3.0 rocksdb; however, when two separate processes
+try and open the same rocksdb for reading, only one of the open requests
+succeed. The other open always fails with "db/LOCK: Resource temporarily unavailable" So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated.

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown
new file mode 100644
index 0000000..e4eba6a
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown
@@ -0,0 +1,30 @@
+---
+title: RocksDB 3.2 release
+layout: post
+author: leijin
+category: blog
+redirect_from:
+  - /blog/647/rocksdb-3-2-release/
+---
+
+Check out the new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.2)!
+
+New Features in RocksDB 3.2:
+
+  * PlainTable now supports a new key encoding: for keys of the same prefix, the prefix is only written once. It can be enabled through the encoding_type parameter of NewPlainTableFactory()
+
+
+  * Add AdaptiveTableFactory, which is used to convert a DB from PlainTable to BlockBasedTable, or vice versa. It can be created using NewAdaptiveTableFactory()
+
+<!--truncate-->
+
+Public API changes:
+
+
+  * We removed seek compaction as a concept from RocksDB
+
+
+  * Add two parameters to NewHashLinkListRepFactory() for logging on too many entries in a hash bucket when flushing
+
+
+  * Added new option BlockBasedTableOptions::hash_index_allow_collision. When 
enabled, prefix hash index for block-based table will not store prefix and 
allow hash collision, reducing memory consumption

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown
new file mode 100644
index 0000000..d858e4f
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown
@@ -0,0 +1,34 @@
+---
+title: RocksDB 3.3 Release
+layout: post
+author: yhciang
+category: blog
+redirect_from:
+  - /blog/1301/rocksdb-3-3-release/
+---
+
+Check out the new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.3)!
+
+New Features in RocksDB 3.3:
+
+  * **JSON API prototype**.
+
+
+  * **Performance improvement on HashLinkList**:  We addressed the performance outliers of HashLinkList caused by skewed buckets by switching the data in such a bucket from a linked list to a skip list. Added the parameter threshold_use_skiplist to NewHashLinkListRepFactory().
+
+<!--truncate-->
+
+  * **More effective storage space reclamation**:  RocksDB is now able to reclaim storage space more effectively during the compaction process.  This is done by compensating the size of each deletion entry by 2X the average value size, which makes it easier for deletion entries to trigger compaction.
+
+
+  * **TimeOut API to write**:  WriteOptions now has a variable called timeout_hint_us.  With timeout_hint_us set to non-zero, any write associated with this timeout_hint_us may be aborted when it runs longer than the specified timeout_hint_us, and it is guaranteed that any write that completes earlier than the specified time-out will not be aborted due to the time-out condition.
+
+
+  * **rate_limiter option**: We added an option that controls total throughput 
of flush and compaction. The throughput is specified in bytes/sec. Flush always 
has precedence over compaction when available bandwidth is constrained.
+
+
+
+Public API changes:
+
+
+  * Removed NewTotalOrderPlainTableFactory because it is not used and its implementation is semantically incorrect.

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown
----------------------------------------------------------------------
diff --git a/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown
new file mode 100644
index 0000000..22178f7
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown
@@ -0,0 +1,74 @@
+---
+title: Cuckoo Hashing Table Format
+layout: post
+author: radheshyam
+category: blog
+redirect_from:
+  - /blog/1427/new-bloom-filter-format/
+---
+
+## Introduction
+
+We recently introduced a new [Cuckoo Hashing](http://en.wikipedia.org/wiki/Cuckoo_hashing) based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4 Mqps) in read-only mode but do not use operations like range scan, merge operator, etc. The existing RocksDB file formats, however, were built to support range scan and other operations, and the current best point lookup rate in RocksDB is 1.2 Mqps, given by the [PlainTable format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). This prompted a hashing-based file format, which we present here. The new table format uses a cache-friendly version of the Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup.
+
+<!--truncate-->
+
+Goals:
+
+  * Reduce memory accesses per lookup to 1 or 2
+
+
+  * Get an end to end point lookup rate of at least 4 Mqps
+
+
+  * Minimize database size
+
+
+Assumptions:
+
+  * Key length and value length are fixed
+
+
+  * The database is operated in read only mode
+
+
+Non-goals:
+
+
+  * While optimizing the performance of Get() operation was our primary goal, 
compaction and build times were secondary. We may work on improving them in 
future.
+
+
+Details for setting up the table format can be found in 
[GitHub](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format).
+
+
+## Cuckoo Hashing Algorithm
+
+In order to achieve high lookup speeds, we did multiple optimizations, including a cache friendly cuckoo hash algorithm. Cuckoo Hashing uses multiple hash functions, _h1, ..., hn_.
+
+### Original Cuckoo Hashing
+
+To insert any new key _k_, we compute hashes of the key _h1(k), ..., hn(k)_. We insert the key in the first hash location that is free. If all the locations are blocked, we try to move one of the colliding keys to a different location by trying to re-insert it.
+
+Finding the smallest set of keys to displace in order to accommodate the new key is naturally a shortest path problem in a directed graph where nodes are buckets of the hash table and there is an edge from bucket _A_ to bucket _B_ if the element stored in bucket _A_ can be accommodated in bucket _B_ using one of the hash functions. The source nodes are the possible hash locations for the given key _k_ and the destination is any one of the empty buckets. We use this algorithm to handle collisions.
+
+To retrieve a key _k_, we compute the hashes _h1(k), ..., hn(k)_, and the key must be present in one of these locations.
+
+Our goal is to minimize the average (and maximum) number of hash functions required and hence the number of memory accesses. In our experiments, with a hash utilization of 90%, we found that the average number of lookups is 1.8 and the maximum is 3. Around 44% of keys are accommodated in the first hash location and 33% in the second location.
+
+
+### Cache Friendly Cuckoo Hashing
+
+We noticed the following two sub-optimal properties in original Cuckoo 
implementation:
+
+
+  * If the key is not present in the first hash location, we jump to the second hash location, which may not be in cache. This results in many cache misses.
+
+
+  * Because only 44% of keys are located in the first cuckoo block, we couldn't have an optimal prefetching strategy - prefetching all hash locations for a key is wasteful, but prefetching only the first hash location helps in only 44% of cases.
+
+
+
+The solution is to insert more keys near the first location. In case of a collision in the first hash location, _h1(k)_, we try to insert the key in the next few buckets, _h1(k)+1, h1(k)+2, ..., h1(k)+t-1_. If all of these _t_ locations are occupied, we skip over to the next hash function _h2_ and repeat the process. We call the set of _t_ buckets a _Cuckoo Block_. We chose _t_ such that the size of a block is not bigger than a cache line, and we prefetch the first cuckoo block.
+
+
+With the new algorithm, for 90% hash utilization, we found that 85% of keys are accommodated in the first Cuckoo Block. Prefetching the first cuckoo block yields the best results. For a database of 100 million keys with key length 8 and value length 4, the hash algorithm alone can achieve 9.6 Mqps and we are working on improving it further. End to end RocksDB performance results can be found [here](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format).
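
The block-probing scheme can be sketched as follows, assuming fixed-size 64-bit keys and 32-bit values. Everything here is hypothetical illustration: the class name, the two mixing constants used as hash functions, and the wrap-around at the end of the table are not taken from the CuckooTable code, and displacement of colliding keys is omitted, so `Insert` simply reports failure when both blocks are full.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class CuckooBlockTableSketch {
 public:
  CuckooBlockTableSketch(std::size_t buckets, std::size_t block_size)
      : slots_(buckets), block_size_(block_size) {
    assert(buckets > 0 && block_size > 0);
  }

  // Probes the t consecutive buckets of each hash function's block in turn.
  // Displacement is not implemented: returns false if every bucket is taken.
  bool Insert(std::uint64_t key, std::uint32_t value) {
    for (int fn = 0; fn < kNumHashFunctions; ++fn) {
      const std::size_t base = Hash(key, fn);
      for (std::size_t i = 0; i < block_size_; ++i) {
        Slot& slot = slots_[(base + i) % slots_.size()];
        if (!slot.used || slot.key == key) {
          slot.used = true;
          slot.key = key;
          slot.value = value;
          return true;
        }
      }
    }
    return false;
  }

  // Scans the same blocks in the same order; most hits land in the first one.
  bool Find(std::uint64_t key, std::uint32_t* value) const {
    for (int fn = 0; fn < kNumHashFunctions; ++fn) {
      const std::size_t base = Hash(key, fn);
      for (std::size_t i = 0; i < block_size_; ++i) {
        const Slot& slot = slots_[(base + i) % slots_.size()];
        if (slot.used && slot.key == key) {
          *value = slot.value;
          return true;
        }
      }
    }
    return false;
  }

 private:
  static constexpr int kNumHashFunctions = 2;

  struct Slot {
    bool used = false;
    std::uint64_t key = 0;
    std::uint32_t value = 0;
  };

  // Two cheap multiplicative hashes; the constants are arbitrary odd numbers.
  std::size_t Hash(std::uint64_t key, int fn) const {
    static const std::uint64_t kSeeds[kNumHashFunctions] = {
        0x9E3779B97F4A7C15ull, 0xC2B2AE3D27D4EB4Full};
    std::uint64_t h = (key + 1) * kSeeds[fn];
    h ^= h >> 29;
    return static_cast<std::size_t>(h % slots_.size());
  }

  std::vector<Slot> slots_;
  std::size_t block_size_;
};
```

Because the buckets of one block are adjacent in memory, a lookup that misses the first bucket usually stays within the same cache line, which is what makes prefetching only the first block worthwhile.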

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown
new file mode 100644
index 0000000..96fa50a
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown
@@ -0,0 +1,52 @@
+---
+title: New Bloom Filter Format
+layout: post
+author: zagfox
+category: blog
+redirect_from:
+  - /blog/1367/cuckoo/
+---
+
+## Introduction
+
+In this post, we are introducing "full filter block", a new bloom filter format for [block based table](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format). This could bring an improvement of about 40% for key queries under an in-memory workload (all data stored in memory, files stored in tmpfs/ramfs; see this [example](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks)). The main idea is to generate one big filter that covers all the keys in an SST file, to avoid lots of unnecessary memory look ups.
+
+
+<!--truncate-->
+
+## What is Bloom Filter
+
+In brief, a [bloom filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter) is a bit array generated for a set of keys that can tell whether an arbitrary key may exist in that set.
+
+InÂ RocksDB, we generate such a bloom filter for each SST file. When we 
conduct a query for a key, we first goes to the bloom filter block of SST file. 
If key may exist in filter, we goes into data block in SST file to search for 
the key. If not, we would return directly. So it could help speed up pointÂ 
look up operation a lot.
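+
To make the "may exist" behavior concrete, here is a minimal sketch of a toy bloom filter. It is not RocksDB's implementation: the class name `ToyBloomFilter`, the use of `std::hash`, and the double-hashing probe scheme are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy bloom filter (hypothetical, for illustration only): each key sets
// num_probes bit positions derived from two hash values (double hashing).
class ToyBloomFilter {
 public:
  ToyBloomFilter(std::size_t num_bits, int num_probes)
      : bits_(num_bits, false), num_probes_(num_probes) {
    assert(num_bits > 0 && num_probes > 0);
  }

  // Set every probe position for this key.
  void Add(const std::string& key) {
    for (int i = 0; i < num_probes_; ++i) bits_[Position(key, i)] = true;
  }

  // false: the key was definitely never added.
  // true:  the key may have been added (false positives are possible).
  bool MayContain(const std::string& key) const {
    for (int i = 0; i < num_probes_; ++i) {
      if (!bits_[Position(key, i)]) return false;
    }
    return true;
  }

 private:
  // i-th probe position: (h1 + i * h2) mod number of bits.
  std::size_t Position(const std::string& key, int i) const {
    const std::uint64_t h1 = std::hash<std::string>{}(key);
    const std::uint64_t h2 = (h1 >> 17) | (h1 << 47);  // rotated copy of h1
    return static_cast<std::size_t>(
        (h1 + static_cast<std::uint64_t>(i) * h2) % bits_.size());
  }

  std::vector<bool> bits_;
  int num_probes_;
};
```

A lookup that returns `false` lets the reader skip the data block entirely; a `true` result still requires the real search, because unrelated keys can set the same bits.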
+
+## Original Bloom Filter Format
+
+The original bloom filter creates a filter for each individual data block in 
an SST file. It has a complex structure (see 
[here](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block)),
 which results in a lot of non-adjacent memory lookups.
+
+Here's the workflow for checking the original bloom filter in a block based table:
+
+1. Given the target key, we go to the index block to get the "data block ID" 
where this key may reside.
+1. Using the "data block ID", we go to the filter block and get the correct 
"offset of filter".
+1. Using the "offset of filter", we go to the actual filter and do the 
checking.
+
+## New Bloom Filter Format
+
+The new bloom filter creates one filter for all keys in an SST file, and we 
name it the "full filter". The data structure of the full filter is very simple: 
there is just one big filter:
+
+    [ full filter ]
+
+In this way, the workflow of bloom filter checking is much simplified.
+
+(1) Given the target key, we go directly to the filter block and conduct the 
filter checking.
+
+To be specific, there is no checking of the index block and no address 
jumping inside the filter block.
+
+Though it is one big filter, the total filter size stays the same as with the 
original format.
+
+One small drawback is that the new bloom filter consumes more memory 
when building an SST file, because we need to buffer keys (or their 
hashes) before generating the filter. The original format just creates a series 
of small filters, so it only buffers a small number of keys. For the full 
filter, we buffer the hashes of all keys, which takes more memory as the SST 
file size increases.
+
+
+## Usage & Customization
+
+You can refer to the documentation for 
[usage](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#usage-of-new-bloom-filter)
 and 
[customization](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#customize-your-own-filterpolicy).

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown 
b/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown
new file mode 100644
index 0000000..1878a5a
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown
@@ -0,0 +1,38 @@
+---
+title: RocksDB 3.5 Release!
+layout: post
+author: leijin
+category: blog
+redirect_from:
+  - /blog/1547/rocksdb-3-5-release/
+---
+
+New RocksDB release - 3.5!
+
+
+**New Features**
+
+
+  1. Add include/utilities/write_batch_with_index.h, providing a utility class 
to query data out of WriteBatch when building it.
+
+
+  2. Add ReadOptions.total_order_seek to force a total order seek when the 
block-based table is built with a hash index.
+
+<!--truncate-->
+
+**Public API changes**
+
+
+  1. The Prefix Extractor used with V2 compaction filters is now passed user 
key to SliceTransform::Transform instead of unparsed RocksDB key.
+
+
+  2. Move BlockBasedTable related options to BlockBasedTableOptions from 
Options. Change corresponding JNI interface. Options affected include: 
no_block_cache, block_cache, block_cache_compressed, block_size, 
block_size_deviation, block_restart_interval, filter_policy, 
whole_key_filtering. filter_policy is changed to shared_ptr from a raw pointer.
+
+
+  3. Remove deprecated options: disable_seek_compaction and 
db_stats_log_interval
+
+
+  4. OptimizeForPointLookup() takes one parameter for block cache size. It now 
builds hash index, bloom filter, and block cache.
+
+
+[https://github.com/facebook/rocksdb/releases/tag/v3.5](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.5)

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown
 
b/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown
new file mode 100644
index 0000000..f18de0b
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown
@@ -0,0 +1,112 @@
+---
+title: Migrating from LevelDB to RocksDB
+layout: post
+author: lgalanis
+category: blog
+redirect_from:
+  - /blog/1811/migrating-from-leveldb-to-rocksdb-2/
+---
+
+If you have an existing application that uses LevelDB and would like to 
migrate to RocksDB, one problem you need to overcome is mapping the 
options for LevelDB to proper options for RocksDB. As of release 3.9, this can 
be done automatically with our option conversion utility found in 
rocksdb/utilities/leveldb_options.h. First, replace 
`leveldb::Options` with `rocksdb::LevelDBOptions`. Then, use 
`rocksdb::ConvertOptions()` to convert the `LevelDBOptions` struct into 
appropriate RocksDB options. Here is an example:
+
+<!--truncate-->
+
+LevelDB code:
+
+```c++
+#include <string>
+#include "leveldb/db.h"
+
+using namespace leveldb;
+
+int main(int argc, char** argv) {
+  DB *db;
+
+  Options opt;
+  opt.create_if_missing = true;
+  opt.max_open_files = 1000;
+  opt.block_size = 4096;
+
+  Status s = DB::Open(opt, "/tmp/mydb", &db);
+
+  delete db;
+}
+```
+
+RocksDB code:
+
+```c++
+#include <string>  
+#include "rocksdb/db.h"  
+#include "rocksdb/utilities/leveldb_options.h"  
+
+using namespace rocksdb;  
+
+int main(int argc, char** argv) {  
+  DB *db;  
+
+  LevelDBOptions opt;  
+  opt.create_if_missing = true;  
+  opt.max_open_files = 1000;  
+  opt.block_size = 4096;  
+
+  Options rocksdb_options = ConvertOptions(opt);  
+  // add rocksdb specific options here  
+
+  Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db);
+
+  delete db;  
+}  
+```
+
+The difference is:
+
+```diff
+-#include "leveldb/db.h"
++#include "rocksdb/db.h"
++#include "rocksdb/utilities/leveldb_options.h"
+
+-using namespace leveldb;
++using namespace rocksdb;
+
+-  Options opt;
++  LevelDBOptions opt;
+
+-  Status s = DB::Open(opt, "/tmp/mydb", &db);
++  Options rocksdb_options = ConvertOptions(opt);
++  // add rocksdb specific options here
++
++  Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db);
+```
+
+Once you get up and running with RocksDB you can then focus on tuning RocksDB 
further by modifying the converted options struct.
+
+ConvertOptions is handy because a lot of individual options 
in RocksDB have moved to other structures in different components. For example, 
block_size is not available in struct rocksdb::Options. It resides in struct 
rocksdb::BlockBasedTableOptions, which is used to create a TableFactory object 
that RocksDB uses internally to create the proper TableBuilder objects. If you 
were to write your application from scratch, it would look like this:
+
+RocksDB code from scratch:
+
+```c++
+#include <string>
+#include "rocksdb/db.h"
+#include "rocksdb/table.h"
+
+using namespace rocksdb;
+
+int main(int argc, char** argv) {
+  DB *db;
+
+  Options opt;
+  opt.create_if_missing = true;
+  opt.max_open_files = 1000;
+
+  BlockBasedTableOptions topt;
+  topt.block_size = 4096;
+  opt.table_factory.reset(NewBlockBasedTableFactory(topt));
+
+  Status s = DB::Open(opt, "/tmp/mydb_rocks", &db);
+
+  delete db;
+}
+```
+
+The LevelDBOptions utility can ease migration to RocksDB from LevelDB and 
allows us to break down the various options across classes as it is needed.

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown
 
b/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown
new file mode 100644
index 0000000..cddc0dd
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown
@@ -0,0 +1,41 @@
+---
+title: Reading RocksDB options from a file
+layout: post
+author: lgalanis
+category: blog
+redirect_from:
+  - /blog/1883/reading-rocksdb-options-from-a-file/
+---
+
+RocksDB options can be provided using a file or any string to RocksDB. The 
format is straightforward: `write_buffer_size=1024;max_write_buffer_number=2`. 
Any whitespace around `=` and `;` is OK. Moreover, options can be nested as 
necessary. For example `BlockBasedTableOptions` can be nested as follows: 
`write_buffer_size=1024; max_write_buffer_number=2; 
block_based_table_factory={block_size=4k};`. Similarly any white space around 
`{` or `}` is ok. Here is what it looks like in code:
+
+<!--truncate-->
+
+```c++
+#include <string>
+#include "rocksdb/db.h"
+#include "rocksdb/table.h"
+#include "rocksdb/utilities/convenience.h"
+
+using namespace rocksdb;
+
+int main(int argc, char** argv) {
+  DB *db;
+
+  Options opt;
+
+  std::string options_string =
+    "create_if_missing=true;max_open_files=1000;"
+    "block_based_table_factory={block_size=4096}";
+
+  Status s = GetDBOptionsFromString(opt, options_string, &opt);
+
+  s = DB::Open(opt, "/tmp/mydb_rocks", &db);
+
+  // use db
+
+  delete db;
+}
+```
+
+Using `GetDBOptionsFromString` is a convenient way of changing options for 
your RocksDB application without needing to resort to recompilation or tedious 
command line parsing.
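+
To illustrate the string format itself (names, `=`, `;` separators, and `{}` nesting), here is a minimal sketch of a toy parser. It is not how RocksDB parses options; `Trim` and `ParseOptionsString` are hypothetical helpers, and the sketch only splits the string into name/value pairs without interpreting the values.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Strip leading and trailing whitespace.
std::string Trim(const std::string& s) {
  const std::size_t first = s.find_first_not_of(" \t\n");
  if (first == std::string::npos) return "";
  const std::size_t last = s.find_last_not_of(" \t\n");
  return s.substr(first, last - first + 1);
}

// Split "a=1; b={c=2;d=3}; e=4" into {a: "1", b: "c=2;d=3", e: "4"}.
// A ';' inside braces belongs to the nested value, so it does not split.
std::map<std::string, std::string> ParseOptionsString(const std::string& text) {
  std::map<std::string, std::string> result;
  std::size_t start = 0;
  int depth = 0;  // current brace nesting depth
  for (std::size_t i = 0; i <= text.size(); ++i) {
    const bool at_end = (i == text.size());
    if (!at_end && text[i] == '{') ++depth;
    if (!at_end && text[i] == '}') --depth;
    if (at_end || (text[i] == ';' && depth == 0)) {
      const std::string item = Trim(text.substr(start, i - start));
      start = i + 1;
      if (item.empty()) continue;  // tolerate a trailing ';'
      const std::size_t eq = item.find('=');
      assert(eq != std::string::npos && "expected name=value");
      std::string value = Trim(item.substr(eq + 1));
      // Drop one pair of enclosing braces from a nested value.
      if (value.size() >= 2 && value.front() == '{' && value.back() == '}') {
        value = Trim(value.substr(1, value.size() - 2));
      }
      result[Trim(item.substr(0, eq))] = value;
    }
  }
  return result;
}
```

Applied to the nested example from the post, this yields three entries, with `block_based_table_factory` mapped to the inner string `block_size=4k`, which could then be parsed again with the same function.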

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown 
b/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown
new file mode 100644
index 0000000..7f9f776
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown
@@ -0,0 +1,20 @@
+---
+title: 'WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes'
+layout: post
+author: sdong
+category: blog
+redirect_from:
+  - /blog/1901/write-batch-with-index/
+---
+
+RocksDB can be used as a storage engine of a higher level database. In fact, 
we are currently plugging RocksDB into MySQL and MongoDB as one of their 
storage engines. RocksDB can help with guaranteeing some of the ACID 
properties: durability is guaranteed by RocksDB by design, while consistency 
and isolation need to be enforced by concurrency controls on top of RocksDB. 
Atomicity can be implemented by committing a transaction's writes with one 
write batch to RocksDB at the end.
+
+<!--truncate-->
+
+However, if we enforce atomicity by only committing all writes at the end of 
the transaction in one batch, you cannot get the updated value from RocksDB 
previously written by the same transaction (read-your-own-write). To read the 
updated value, the databases on top of RocksDB need to maintain an internal 
buffer for all the written keys, and when a read happens they need to merge the 
result from RocksDB and from this buffer. This is a problem we faced when 
building the RocksDB storage engine in MongoDB. We solved it by creating a 
utility class, WriteBatchWithIndex (a write batch with a searchable index), and 
made it part of the public API so that the community can also benefit from it.
+
+Before talking about the index part, let me introduce the write batch first. The 
write batch class, `WriteBatch`, is a RocksDB data structure for atomic writes 
of multiple keys. Users can buffer their updates to a `WriteBatch` by calling 
`write_batch.Put("key1", "value1")` or `write_batch.Delete("key2")`, similar to 
calling RocksDB's functions of the same names. In the end, they call 
`db->Write(write_batch)` to atomically apply all those batched operations to 
the DB. This is how a database can guarantee atomicity, as described above. Adding a 
searchable index to `WriteBatch`, we now have `WriteBatchWithIndex`. Users can 
put updates into a `WriteBatchWithIndex` in the same way as into a `WriteBatch`. 
In the end, users can get a `WriteBatch` object from it and issue `db->Write()`. 
Additionally, users can create an iterator of a `WriteBatchWithIndex`, seek to 
any key location and iterate from there.
+
+To implement read-your-own-write using `WriteBatchWithIndex`, every time the 
user creates a transaction, we create a `WriteBatchWithIndex` attached to it. 
All the writes of the transaction go to the `WriteBatchWithIndex` first. When 
we commit the transaction, we atomically write the batch to RocksDB. When the 
user wants to call `Get()`, we first check whether the value exists in the 
`WriteBatchWithIndex` and return it if so, by seeking and reading 
from an iterator of the write batch, before checking data in RocksDB. For 
example, here is how we implement it in MongoDB's RocksDB storage engine: 
[link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L245-L260).
 If a range query comes, we pass a DB's iterator to `WriteBatchWithIndex`, 
which creates a super iterator that combines the results from the DB iterator 
with the batch's iterator. Using this super iterator, we can iterate the DB 
together with the transaction's own writes. Here is the iterator creation code 
in MongoDB's RocksDB storage engine: 
[link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L266-L269).
 In this way, the database can solve the read-your-own-write problem by using 
RocksDB to handle a transaction's uncommitted writes.
+
+Using `WriteBatchWithIndex`, we successfully implemented read-your-own-writes 
in the RocksDB storage engine of MongoDB. If you also have a 
read-your-own-write problem, `WriteBatchWithIndex` can help you implement it 
quickly and correctly.
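+
The read path described above (check the transaction's own batch first, then fall back to the DB) can be sketched without RocksDB at all. The following is a toy model, not the `WriteBatchWithIndex` API: `ToyDB` and `ToyTransaction` are hypothetical names, and a `std::map` stands in for both the database and the indexed batch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the committed database state (assumption for illustration).
using ToyDB = std::map<std::string, std::string>;

// Hypothetical transaction that buffers writes in a searchable batch.
class ToyTransaction {
 public:
  explicit ToyTransaction(ToyDB* db) : db_(db) { assert(db != nullptr); }

  // Buffer the write; the DB is not modified until Commit().
  void Put(const std::string& key, const std::string& value) {
    batch_[key] = Entry{false, value};
  }
  void Delete(const std::string& key) { batch_[key] = Entry{true, ""}; }

  // Read-your-own-writes: the batch wins over the committed DB state.
  // Returns true and fills *value if the key is visible to this transaction.
  bool Get(const std::string& key, std::string* value) const {
    auto it = batch_.find(key);
    if (it != batch_.end()) {
      if (it->second.deleted) return false;  // deleted by this transaction
      *value = it->second.value;
      return true;
    }
    auto db_it = db_->find(key);
    if (db_it == db_->end()) return false;
    *value = db_it->second;
    return true;
  }

  // Apply all buffered operations to the DB, then clear the batch.
  void Commit() {
    for (const auto& kv : batch_) {
      if (kv.second.deleted) {
        db_->erase(kv.first);
      } else {
        (*db_)[kv.first] = kv.second.value;
      }
    }
    batch_.clear();
  }

 private:
  struct Entry {
    bool deleted;
    std::string value;
  };
  ToyDB* db_;
  std::map<std::string, Entry> batch_;
};
```

Because the batch is an ordered map, a merged range scan over the batch and the DB would follow the same idea, which is the role the post assigns to the super iterator. Real atomicity and isolation are out of scope for this sketch.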

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown
 
b/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown
new file mode 100644
index 0000000..1ffe2c5
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown
@@ -0,0 +1,16 @@
+---
+title: Integrating RocksDB with MongoDB
+layout: post
+author: icanadi
+category: blog
+redirect_from:
+  - /blog/1967/integrating-rocksdb-with-mongodb-2/
+---
+
+Over the last couple of years, we have been busy integrating RocksDB with 
various services here at Facebook that needed to store key-value pairs locally. 
We have also seen other companies using RocksDB as local storage components of 
their distributed systems.
+
+<!--truncate-->
+
+The next big challenge for us is to bring RocksDB storage engine to general 
purpose databases. Today we have an exciting milestone to share with our 
community! We're running MongoDB with RocksDB in production and seeing great 
results! You can read more about it here: 
[http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/)
+
+Stay tuned for benchmarks and more stability and performance improvements.

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown 
b/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown
new file mode 100644
index 0000000..f3a55fa
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown
@@ -0,0 +1,10 @@
+---
+title: RocksDB in osquery
+layout: post
+author: icanadi
+category: lgalanis
+redirect_from:
+  - /blog/1997/rocksdb-in-osquery/
+---
+
+Check out 
[this](https://code.facebook.com/posts/1411870269134471/how-rocksdb-is-used-in-osquery/)
 blog post by [Mike Arpaia](https://www.facebook.com/mike.arpaia) and [Ted 
Reed](https://www.facebook.com/treeded) about how osquery leverages RocksDB to 
build an embedded pub-sub system. This article is a great read and contains 
insights on how to properly use RocksDB.

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown 
b/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown
new file mode 100644
index 0000000..b3e2703
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown
@@ -0,0 +1,92 @@
+---
+title: RocksDB 2015 H2 roadmap
+layout: post
+author: icanadi
+category: blog
+redirect_from:
+  - /blog/2015/rocksdb-2015-h2-roadmap/
+---
+
+Every 6 months, the RocksDB team gets together to prioritize the work ahead of 
us. We just went through this exercise and we wanted to share the results with 
the community. Here's what the RocksDB team will be focusing on for the next 6 
months:
+
+<!--truncate-->
+
+**MyRocks**
+
+As you might know, we're working hard to integrate RocksDB as a storage engine 
for MySQL. This project is pretty important for us because we're heavy users of 
MySQL. We're already getting pretty good performance results, but there is more 
work to be done. We need to focus on both performance and stability. The 
highest-priority items on our list are:
+
+
+
+
+  1. Reduce CPU costs of RocksDB as a MySQL storage engine
+
+
+  2. Implement pessimistic concurrency control to support repeatable read 
isolation level in MyRocks
+
+
+  3. Reduce P99 read latency, which is high mostly because of lingering 
tombstones
+
+
+  4. Port ZSTD compression
+
+
+**MongoRocks**
+
+Another database that we're working on is MongoDB. The project of integrating 
MongoDB with RocksDB storage engine is called MongoRocks. It's already running 
in production at Parse [1] and we're seeing surprisingly few issues. Our plans 
for the next half:
+
+
+
+
+  1. Keep improving performance and stability, possibly reusing work done on 
MyRocks (workloads are pretty similar).
+
+
+  2. Increase internal and external adoption.
+
+
+  3. Support new MongoDB 3.2.
+
+
+**RocksDB on cheaper storage media**
+
+Up to now, our mission was to build the best key-value store "for fast 
storage" (flash and in-memory). However, there are some use-cases at Facebook 
that don't need expensive high-end storage. In the next six months, we plan to 
deploy RocksDB on cheaper storage media. We will optimize the performance of 
RocksDB on either or both:
+
+
+
+
+  1. Hard drive storage array.
+
+
+  2. Tiered Storage.
+
+
+**Quality of Service**
+
+When talking to our customers, there are a couple of issues that keep 
recurring. We need to fix them to make our customers happy. We will improve 
RocksDB to provide better assurance of performance and resource usage. 
A non-exhaustive list includes:
+
+
+
+
+  1. Iterate P99 can be high due to the presence of tombstones.
+
+
+  2. Write stalls can happen during high write loads.
+
+
+  3. Better control of memory and disk usage.
+
+
+  4. Service quality and performance of backup engine.
+
+
+**Operation's user experience**
+
+As we increase deployment of RocksDB, engineers are spending more time on 
debugging RocksDB issues. We plan to improve the user experience when running 
RocksDB. The goal is to reduce TTD (time-to-debug). The work includes 
monitoring, visualizations and documentation.
+
+[1] [http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/)
+
+
+### Comments
+
+**Mike**
+
+What's the status of this roadmap? "RocksDB on cheaper storage media", 
has this been implemented?

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown
 
b/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown
new file mode 100644
index 0000000..fe7b7b2
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown
@@ -0,0 +1,78 @@
+---
+title: Spatial indexing in RocksDB
+layout: post
+author: icanadi
+category: blog
+redirect_from:
+  - /blog/2039/spatial-indexing-in-rocksdb/
+---
+
+About a year ago, there was a need to develop a spatial database at Facebook. 
We needed to store and index Earth's map data. Before building our own, we 
looked at the existing spatial databases. They were all very good technology, 
but also general purpose. We could sacrifice a general-purpose API, so we 
thought we could build a more performant database, since it would be 
specifically designed for our use-case. Furthermore, we decided to build the 
spatial database on top of RocksDB, because we have a lot of operational 
experience with running and tuning RocksDB at a large scale.
+
+<!--truncate-->
+
+When we started looking at this project, the first thing that surprised us was 
that our planet is not that big. Earth's entire map data can fit in memory on a 
reasonably high-end machine. Thus, we also decided to build a spatial database 
optimized for a memory-resident dataset.
+
+The first use-case of our spatial database was an experimental map renderer. 
As part of our project, we successfully loaded [Open Street 
Maps](https://www.openstreetmap.org/) dataset and hooked it up with 
[Mapnik](http://mapnik.org/), a map rendering engine.
+
+The usual Mapnik workflow is to load the map data into a SQL-based database 
and then define map layers with SQL statements. To render a tile, Mapnik needs 
to execute a couple of SQL queries. The benefit of this approach is that you 
don't need to reload your database when you change your map style. You can just 
change your SQL query and Mapnik picks it up. In our model, we decided to 
precompute the features we need for each tile. We need to know the map style 
before we create the database. However, when rendering the map tile, we only 
fetch the features that we need to render.
+
+We haven't open sourced the RocksDB Mapnik plugin or the database loading 
pipeline. However, the spatial indexing is available in RocksDB under the name 
[SpatialDB](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/spatial_db.h).
 The API is focused on the map rendering use-case, but we hope that it can also 
be used for other spatial-based applications.
+
+Let's take a tour of the API. When you create a spatial database, you specify 
the spatial indexes that need to be built. Each spatial index is defined by a 
bounding box and granularity. For map rendering, we create a spatial index for 
each zoom level. Higher zoom levels have more granularity.
+
+
+
+    SpatialDB::Create(
+      SpatialDBOptions(),
+      "/data/map", {
+        SpatialIndexOptions("zoom10", BoundingBox<double>(0, 0, 100, 100), 10),
+        SpatialIndexOptions("zoom16", BoundingBox<double>(0, 0, 100, 100), 16)
+      }
+    );
+
+
+
+
+When you insert a feature (building, street, country border) into SpatialDB, 
you need to specify the list of spatial indexes that will index the feature. In 
the loading phase, we process the map style to determine the list of zoom levels 
on which we'll render the feature. For example, we will not render a building 
at a zoom level that shows an entire country. Buildings will only be indexed in 
the higher zoom levels' indexes. Country borders will be indexed on all zoom 
levels.
+
+
+
+    FeatureSet feature;
+    feature.Set("type", "building");
+    feature.Set("height", 6);
+    db->Insert(WriteOptions(), BoundingBox<double>(5, 5, 10, 10),
+               well_known_binary_blob, feature, {"zoom16"});
+
+
+
+
+The indexing part is pretty simple. For each feature, we first find a list of 
index tiles that it intersects. Then, we add a link from the tile's [quad 
key](https://msdn.microsoft.com/en-us/library/bb259689.aspx) to the feature's 
primary key. Using quad keys improves data locality, i.e. features closer 
together geographically will have similar quad keys. Even though we're 
optimizing for a memory-resident dataset, data locality is still very important 
due to different caching effects.
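+
To show why quad keys improve locality, here is a minimal sketch of computing one from tile coordinates. It follows the interleaved-bit scheme of the tile system documentation linked above; whether SpatialDB encodes its keys exactly this way is an assumption, and `TileToQuadKey` is a hypothetical helper name.

```cpp
#include <cassert>
#include <string>

// Convert tile coordinates at a given zoom level into a quad key: one base-4
// digit per level, most significant level first.
std::string TileToQuadKey(unsigned tile_x, unsigned tile_y, int level) {
  assert(level >= 1 && level <= 31);
  std::string quad_key;
  for (int i = level; i > 0; --i) {
    const unsigned mask = 1u << (i - 1);
    char digit = '0';
    if (tile_x & mask) digit += 1;  // east half of the parent tile
    if (tile_y & mask) digit += 2;  // south half of the parent tile
    quad_key.push_back(digit);
  }
  return quad_key;
}
```

For example, tile (3, 5) at level 3 maps to `"213"`, and its four child tiles at level 4 map to `"2130"` through `"2133"`. Tiles that are close together share a key prefix, so they sort next to each other in an ordered key-value store.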
+
+After you're done inserting all the features, you can call `Compact()`, 
which will compact the dataset and speed up read queries.
+
+
+
+    db->Compact();
+
+
+
+
+A SpatialDB query specifies 1) the bounding box we're interested in and 2) a 
zoom level. We find all tiles that intersect with the query's bounding box and 
return all features in those tiles.
+
+
+
+
+    Cursor* c = db->Query(ReadOptions(), BoundingBox<double>(1, 1, 7, 7),
+                          "zoom16");
+    for (; c->Valid(); c->Next()) {
+        Render(c->blob(), c->feature_set());
+    }
+
+
+
+
+Note: the `Render()` function is not part of RocksDB. You will need to use one 
of the many open source map renderers, for example 
[Mapnik](http://mapnik.org/).
+
+TL;DR If you need an embedded spatial database, check out RocksDB's SpatialDB. 
[Let us know](https://www.facebook.com/groups/rocksdb.dev/) how we can make it 
better.
+
+If you're interested in learning more, check out this 
[talk](https://www.youtube.com/watch?v=T1jWsDMONM8).

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown
 
b/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown
new file mode 100644
index 0000000..b6bb47d
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown
@@ -0,0 +1,30 @@
+---
+title: RocksDB is now available in Windows Platform
+layout: post
+author: dmitrism
+category: blog
+redirect_from:
+  - /blog/2033/rocksdb-is-now-available-in-windows-platform/
+---
+
+Over the past 6 months we have seen a number of use cases where RocksDB is 
successfully used by the community and various companies to achieve high 
throughput and volume in a modern server environment.
+
+We at Microsoft Bing could not be left behind. As a result we are happy to 
[announce](http://bit.ly/1OmWBT9) the availability of the Windows Port created 
here at Microsoft which we intend to use as a storage option for one of our 
key/value data stores.
+
+<!--truncate-->
+
+We are happy to make this available for the community. Stay tuned for more 
announcements to come.
+
+### Comments
+
+**Siying Dong**
+
+Appreciate your contributions to the RocksDB project! I believe it will benefit 
many users!
+
+**empresas sevilla**
+
+Magnificent article, a pleasure to read the blog
+
+**jak usunac**
+
+I believe it will benefit too

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown
----------------------------------------------------------------------
diff --git a/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown 
b/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown
new file mode 100644
index 0000000..0ff3a05
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown
@@ -0,0 +1,29 @@
+---
+title: Dynamic Level Size for Level-Based Compaction
+layout: post
+author: sdong
+category: blog
+redirect_from:
+  - /blog/2207/dynamic-level/
+---
+
+In this article, we follow up on the first part of an answer to one of the 
questions in our 
[AMA](https://www.reddit.com/r/IAmA/comments/3de3cv/we_are_rocksdb_engineering_team_ask_us_anything/ct4a8tb),
 the dynamic level size in level-based compaction.
+
+<!--truncate-->
+
+Level-based compaction is the original LevelDB compaction style and one of 
the two major compaction styles in RocksDB (See [our 
wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#multi-threaded-compactions)).
 In RocksDB we introduced parallelism and more configurable options to it, but 
the main algorithm stayed the same until we recently introduced the dynamic 
level size mode.
+
+
+In level-based compaction, we organize data into different sorted runs, called 
levels. Each level has a target size. Usually the target size of levels increases 
by the same size multiplier. For example, you can set the target size of level 1 
to 1GB and the size multiplier to 10, and the target sizes of levels 1, 2, 3 and 
4 will be 1GB, 10GB, 100GB and 1000GB. Before level 1, there are some staging 
files flushed from mem tables, called Level 0 files, which will later be merged 
into level 1. Compactions are triggered as soon as the actual size of a level 
exceeds its target size. We merge a subset of the data of that level into the 
next level to reduce the size of the level. More compactions are triggered until 
the sizes of all the levels are lower than their target sizes. In a steady state, 
the size of each level is close to its target size.
+
+
+Level-based compaction's advantage is its good space efficiency. We usually 
use the metric space amplification to measure space efficiency. In this 
article we ignore the effects of data compression, so space amplification = 
size_on_file_system / size_of_user_data.
+
+
+How do we estimate the space amplification of level-based compaction? We focus 
specifically on databases in a steady state, which means the database size is 
stable or grows slowly over time. This means updates add roughly the same amount 
of data as, or a little more than, what is removed by deletes. Given that, if we 
compact all the data to the last level, the size of that level after the 
compaction will be about equal to the size of the last level before the 
compaction. On the other hand, the size of user data is approximately the size 
of the DB if we compact all the levels down to the last level. So the size of 
the last level is a good estimate of the user data size, and the total size of 
the DB divided by the size of the last level is a good estimate of space 
amplification.
+
+
+Applying the equation, if we have four non-empty levels whose sizes are 1GB, 
10GB, 100GB and 1000GB, the space amplification will be approximately (1000GB + 
100GB + 10GB + 1GB) / 1000GB = 1.111, which is a very good number. However, 
there is a catch here: how do we make sure the last level's size is 1000GB, the 
same as the level's size target? A user has to fine-tune level sizes to 
achieve this number and will need to re-tune if the DB size changes. The 
theoretical number 1.111 is hard to achieve in practice. In a worse case, if 
the target size of the last level is 1000GB but the user data is only 200GB, 
then the actual space amplification will be (200GB + 100GB + 10GB + 1GB) / 
200GB = 1.555, a much worse number.
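+
+The two calculations above can be reproduced with a few lines of C++. This is 
only an illustrative sketch: `EstimateSpaceAmplification` is a hypothetical 
helper, not a RocksDB function, and it assumes the level sizes are listed from 
the first level to the last, all in the same unit.
+
```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of RocksDB): estimates space amplification as
// the total size of all levels divided by the size of the last level.
// level_sizes lists the levels from first to last, all in the same unit.
double EstimateSpaceAmplification(const std::vector<double>& level_sizes) {
  assert(!level_sizes.empty() && level_sizes.back() > 0.0);
  const double total =
      std::accumulate(level_sizes.begin(), level_sizes.end(), 0.0);
  return total / level_sizes.back();
}
```
+
+Calling it with `{1, 10, 100, 1000}` gives the well-tuned case, and with 
`{1, 10, 100, 200}` the case where the last level holds only 200GB.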
+
+
+To solve this problem, my colleague Igor Kabiljo came up with a solution: the 
dynamic level size target mode. You can enable it by setting 
options.level_compaction_dynamic_level_bytes=true. In this mode, the size 
targets of levels are changed dynamically based on the size of the last level. 
Suppose the level size multiplier is 10 and the DB size is 200GB. The target 
size of the last level is automatically set to the actual size of the level, 
which is 200GB. The second-to-last level's size target is automatically set to 
size_last_level / 10 = 20GB, the third-to-last level's to 
size_last_level / 100 = 2GB, and the next level's to size_last_level / 1000 = 
200MB. We stop here because 200MB is within the range of the first level. In 
this way, we can achieve the 1.111 space amplification without fine-tuning the 
level size targets. More details can be found in the [code comments of the 
option](https://github.com/facebook/rocksdb/blob/v3.11/include/rocksdb/options.h#L366-L423)
 in the header file.
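+
+The derivation of the targets can be sketched as follows. This is a 
simplified illustration rather than RocksDB's actual implementation: 
`DynamicLevelTargets` and its parameters are hypothetical names, and the 
stopping rule (stop once a target is no larger than the maximum size allowed 
for the first level, taken here as the 1GB from the earlier example) is an 
assumption made for this sketch.
+
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch (not RocksDB code): derives level size targets from the
// actual size of the last level. The result is ordered from the last level
// upward. Assumption: we stop dividing once a target fits within
// first_level_max_bytes.
std::vector<uint64_t> DynamicLevelTargets(uint64_t last_level_bytes,
                                          uint64_t multiplier,
                                          uint64_t first_level_max_bytes) {
  assert(multiplier > 1);
  std::vector<uint64_t> targets{last_level_bytes};
  while (targets.back() > first_level_max_bytes) {
    targets.push_back(targets.back() / multiplier);
  }
  return targets;
}
```
+
+With a 200GB last level, a multiplier of 10 and a 1GB limit for the first 
level (using decimal units), the targets are 200GB, 20GB, 2GB and 200MB, 
matching the numbers in the text.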

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown
----------------------------------------------------------------------
diff --git a/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown 
b/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown
new file mode 100644
index 0000000..332a29f
--- /dev/null
+++ b/thirdparty/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown
@@ -0,0 +1,193 @@
+---
+title: GetThreadList
+layout: post
+author: yhciang
+category: blog
+redirect_from:
+  - /blog/2261/getthreadlist/
+---
+
+We recently added a new API, called `GetThreadList()`, that exposes RocksDB 
background thread activity. With this feature, developers can obtain real-time 
information about the currently running compactions and flushes, such as the 
input / output size, the elapsed time and the number of bytes written so far. 
To better illustrate it, we have put a sample output of `GetThreadList` into a 
table where each column represents a thread status:
+
+<!--truncate-->
+
+<table width="637" >
+<tbody >
+<tr style="border:2px solid #000000" >
+
+<td style="padding:3px" >ThreadID
+</td>
+
+<td >140716395198208
+</td>
+
+<td >140716416169728
+</td>
+</tr>
+<tr >
+
+<td style="padding:3px" >DB
+</td>
+
+<td >db1
+</td>
+
+<td >db2
+</td>
+</tr>
+<tr >
+
+<td style="padding:3px" >CF
+</td>
+
+<td >default
+</td>
+
+<td >picachu
+</td>
+</tr>
+<tr >
+
+<td style="padding:3px" >ThreadType
+</td>
+
+<td >High Pri
+</td>
+
+<td >Low Pri
+</td>
+</tr>
+<tr >
+
+<td style="padding:3px" >Operation
+</td>
+
+<td >Flush
+</td>
+
+<td >Compaction
+</td>
+</tr>
+<tr >
+
+<td style="padding:3px" >ElapsedTime
+</td>
+
+<td >143.459 ms
+</td>
+
+<td >607.538 ms
+</td>
+</tr>
+<tr >
+
+<td style="padding:3px" >Stage
+</td>
+
+<td >FlushJob::WriteLevel0Table
+</td>
+
+<td >CompactionJob::Install
+</td>
+</tr>
+<tr >
+
+<td style="vertical-align:top;padding:3px" >OperationProperties
+</td>
+
+<td style="vertical-align:top;padding:3px" >
+BytesMemtables 4092938
+BytesWritten 1050701
+</td>
+
+<td style="vertical-align:top" >
+BaseInputLevel 1
+BytesRead 4876417
+BytesWritten 4140109
+IsDeletion 0
+IsManual 0
+IsTrivialMove 0
+JobID 146
+OutputLevel 2
+TotalInputBytes 4883044
+</td>
+</tr>
+</tbody>
+</table>
+
+In the above output, we can see `GetThreadList()` reports the activity of two 
threads: one thread running a flush job (middle column) and the other thread 
running a compaction job (right-most column). Each thread status shows basic 
information about the thread, such as its thread ID, its target DB / column 
family, the job it is currently doing and the current status of that job. 
For instance, we can see thread 140716416169728 is doing a compaction on the 
`picachu` column family in database `db2`. In addition, we can see the 
compaction has been running for about 600 ms and has read 4876417 bytes out of 
4883044 bytes, which indicates the compaction is about to complete. The stage 
property indicates which code block the thread is currently executing. For 
instance, thread 140716416169728 is currently running `CompactionJob::Install`, 
which further indicates the compaction job is almost done.
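+
+The "about to complete" reading comes from comparing `BytesRead` with 
`TotalInputBytes`. A minimal sketch of that arithmetic is shown here; 
`CompactionProgress` is a hypothetical helper, not part of the RocksDB API, and 
it simply divides the two reported values.
+
```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of RocksDB): fraction of a compaction's input
// that has been read so far, computed from the BytesRead and TotalInputBytes
// operation properties shown in the table.
double CompactionProgress(uint64_t bytes_read, uint64_t total_input_bytes) {
  assert(total_input_bytes > 0);
  return static_cast<double>(bytes_read) /
         static_cast<double>(total_input_bytes);
}
```
+
+For the compaction in the table, `CompactionProgress(4876417, 4883044)` is 
roughly 0.9986, so almost all of the input has been read.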
+
+Below we briefly describe its API.
+
+
+## How to Enable it?
+
+
+To enable thread-tracking of a rocksdb instance, simply set 
`enable_thread_tracking` to true in its DBOptions:
+
+```c++
+// If true, then the status of the threads involved in this DB will
+// be tracked and available via GetThreadList() API.
+//
+// Default: false
+bool enable_thread_tracking;
+```
+
+
+
+## The API
+
+
+The GetThreadList API is defined in 
[include/rocksdb/env.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/env.h#L317-L318),
 which is an Env
+function:
+
+```c++
+virtual Status GetThreadList(std::vector<ThreadStatus>* thread_list)
+```
+
+Since an Env can be shared across multiple rocksdb instances, the output of
+`GetThreadList()` includes the background activity of all the rocksdb instances
+that use the same Env.
+
+The `GetThreadList()` API simply returns a vector of `ThreadStatus`, each of 
which describes
+the current status of a thread. The `ThreadStatus` structure, defined in
+[include/rocksdb/thread_status.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/thread_status.h),
 contains the following information:
+
+```c++
+// A unique ID for the thread.
+const uint64_t thread_id;
+
+// The type of the thread, it could be HIGH_PRIORITY,
+// LOW_PRIORITY, or USER
+const ThreadType thread_type;
+
+// The name of the DB instance that the thread is currently
+// involved with. It would be set to empty string if the thread
+// is not involved in any DB operation.
+const std::string db_name;
+
+// The name of the column family that the thread is currently
+// involved with. It would be set to empty string if the thread
+// is not involved in any column family.
+const std::string cf_name;
+
+// The operation (high-level action) that the current thread is involved.
+const OperationType operation_type;
+
+// The elapsed time in micros of the current thread operation.
+const uint64_t op_elapsed_micros;
+
+// An integer showing the current stage where the thread is involved
+// in the current operation.
+const OperationStage operation_stage;
+
+// A list of properties that describe some details about the current
+// operation. Same field in op_properties[] might have different
+// meanings for different operations.
+uint64_t op_properties[kNumOperationProperties];
+
+// The state (lower-level action) that the current thread is involved.
+const StateType state_type;
+```
+
+If you are interested in the background thread activity of your RocksDB 
application, please feel free to give `GetThreadList()` a try :)

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown
 
b/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown
new file mode 100644
index 0000000..6852b8f
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown
@@ -0,0 +1,45 @@
+---
+title: Use Checkpoints for Efficient Snapshots
+layout: post
+author: rven2
+category: blog
+redirect_from:
+  - /blog/2609/use-checkpoints-for-efficient-snapshots/
+---
+
+**Checkpoint** is a feature in RocksDB which provides the ability to take a 
snapshot of a running RocksDB database in a separate directory. Checkpoints can 
be used as a point-in-time snapshot, which can be opened read-only to query 
rows as of that point in time, or as a writeable snapshot by opening it 
read-write. Checkpoints can be used for both full and incremental backups.
+
+<!--truncate-->
+
+
+The Checkpoint feature enables RocksDB to create a consistent snapshot of a 
given RocksDB database in the specified directory. If the snapshot is on the 
same filesystem as the original database, the SST files will be hard-linked, 
otherwise SST files will be copied. The manifest and CURRENT files will be 
copied. In addition, if there are multiple column families, log files will be 
copied for the period covering the start and end of the checkpoint, in order to 
provide a consistent snapshot across column families.
+
+
+
+
+A Checkpoint object needs to be created for a database before checkpoints are 
created. The API is as follows:
+
+
+
+
+`Status Create(DB* db, Checkpoint** checkpoint_ptr);`
+
+
+
+
+Given a checkpoint object and a directory, the CreateCheckpoint function 
creates a consistent snapshot of the database in the given directory.
+
+
+
+
+`Status CreateCheckpoint(const std::string& checkpoint_dir);`
+
+
+
+
+The directory should not already exist and will be created by this API. The 
directory will be an absolute path. The checkpoint can be used as a 
read-only copy of the DB or can be opened as a standalone DB. When opened 
read/write, the SST files continue to be hard links, and these links are 
removed when the files become obsolete. When the user is done with the 
snapshot, the user can delete the directory to remove the snapshot.
+
+
+
+
+Checkpoints are used for online backup in MyRocks, which is MySQL using 
RocksDB as the storage engine ([MySQL on 
RocksDB](https://github.com/facebook/mysql-5.6)).

http://git-wip-us.apache.org/repos/asf/nifi-minifi-cpp/blob/48867732/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown
----------------------------------------------------------------------
diff --git 
a/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown
 
b/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown
new file mode 100644
index 0000000..b21b04f
--- /dev/null
+++ 
b/thirdparty/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown
@@ -0,0 +1,244 @@
+---
+title: Analysis File Read Latency by Level
+layout: post
+author: sdong
+category: blog
+redirect_from:
+  - /blog/2537/analysis-file-read-latency-by-level/
+---
+
+In many use cases of RocksDB, people rely on the OS page cache for caching 
compressed data. With this approach, verifying the effectiveness of OS page 
caching is challenging, because the file system is a black box to users.
+
+As an example, a user can tune the DB as follows: use level-based 
compaction, with L1 - L4 sizes of 1GB, 10GB, 100GB and 1TB. They reserve 
about 20GB of memory as OS page cache, expecting levels 0, 1 and 2 to be mostly 
cached in memory, leaving only reads from levels 3 and 4 requiring disk I/Os. 
However, in practice, it's not easy to verify whether the OS page cache does 
exactly what we expect. For example, if we end up doing 4 instead of 2 
I/Os per query, it's not easy for users to figure out whether that is because 
of an inefficient OS page cache or because of reading multiple blocks for a 
level. Analysis like this is especially important if users run RocksDB on hard 
disk drives, because the latency gap between hard drives and memory is much 
larger than the gap between flash-based SSDs and memory.
+
+<!--truncate-->
+
+In order to make tuning easier, we added new instrumentation to help users 
analyze the latency distribution of file reads in different levels. If users 
turn DB statistics on, we always keep track of the distribution of file read 
latency for each level. Users can retrieve the information by querying the DB 
property "rocksdb.stats" ( 
[https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316](https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316)
 ). It will also be printed out as a part of the compaction summary in info 
logs periodically.
+
+The output looks like this:
+
+
+```
+** Level 0 read latency histogram (micros):
+Count: 696 Average: 489.8118 StdDev: 222.40
+Min: 3.0000 Median: 452.3077 Max: 1896.0000
+Percentiles: P50: 452.31 P75: 641.30 P99: 1068.00 P99.9: 1860.80 P99.99: 1896.00
+------------------------------------------------------
+[ 2, 3 ) 1 0.144% 0.144%
+[ 18, 20 ) 1 0.144% 0.287%
+[ 45, 50 ) 5 0.718% 1.006%
+[ 50, 60 ) 26 3.736% 4.741% #
+[ 60, 70 ) 6 0.862% 5.603%
+[ 90, 100 ) 1 0.144% 5.747%
+[ 120, 140 ) 2 0.287% 6.034%
+[ 140, 160 ) 1 0.144% 6.178%
+[ 160, 180 ) 1 0.144% 6.322%
+[ 200, 250 ) 9 1.293% 7.615%
+[ 250, 300 ) 45 6.466% 14.080% #
+[ 300, 350 ) 88 12.644% 26.724% ###
+[ 350, 400 ) 88 12.644% 39.368% ###
+[ 400, 450 ) 71 10.201% 49.569% ##
+[ 450, 500 ) 65 9.339% 58.908% ##
+[ 500, 600 ) 74 10.632% 69.540% ##
+[ 600, 700 ) 92 13.218% 82.759% ###
+[ 700, 800 ) 64 9.195% 91.954% ##
+[ 800, 900 ) 35 5.029% 96.983% #
+[ 900, 1000 ) 12 1.724% 98.707%
+[ 1000, 1200 ) 6 0.862% 99.569%
+[ 1200, 1400 ) 2 0.287% 99.856%
+[ 1800, 2000 ) 1 0.144% 100.000%
+
+** Level 1 read latency histogram (micros):
+(......not pasted.....)
+
+** Level 2 read latency histogram (micros):
+(......not pasted.....)
+
+** Level 3 read latency histogram (micros):
+(......not pasted.....)
+
+** Level 4 read latency histogram (micros):
+(......not pasted.....)
+
+** Level 5 read latency histogram (micros):
+Count: 25583746 Average: 421.1326 StdDev: 385.11
+Min: 1.0000 Median: 376.0011 Max: 202444.0000
+Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52
+------------------------------------------------------
+[ 0, 1 ) 2351 0.009% 0.009%
+[ 1, 2 ) 6077 0.024% 0.033%
+[ 2, 3 ) 8471 0.033% 0.066%
+[ 3, 4 ) 788 0.003% 0.069%
+[ 4, 5 ) 393 0.002% 0.071%
+[ 5, 6 ) 786 0.003% 0.074%
+[ 6, 7 ) 1709 0.007% 0.080%
+[ 7, 8 ) 1769 0.007% 0.087%
+[ 8, 9 ) 1573 0.006% 0.093%
+[ 9, 10 ) 1495 0.006% 0.099%
+[ 10, 12 ) 3043 0.012% 0.111%
+[ 12, 14 ) 2259 0.009% 0.120%
+[ 14, 16 ) 1233 0.005% 0.125%
+[ 16, 18 ) 762 0.003% 0.128%
+[ 18, 20 ) 451 0.002% 0.130%
+[ 20, 25 ) 794 0.003% 0.133%
+[ 25, 30 ) 1279 0.005% 0.138%
+[ 30, 35 ) 1172 0.005% 0.142%
+[ 35, 40 ) 1363 0.005% 0.148%
+[ 40, 45 ) 409 0.002% 0.149%
+[ 45, 50 ) 105 0.000% 0.150%
+[ 50, 60 ) 80 0.000% 0.150%
+[ 60, 70 ) 280 0.001% 0.151%
+[ 70, 80 ) 1583 0.006% 0.157%
+[ 80, 90 ) 4245 0.017% 0.174%
+[ 90, 100 ) 6572 0.026% 0.200%
+[ 100, 120 ) 9724 0.038% 0.238%
+[ 120, 140 ) 3713 0.015% 0.252%
+[ 140, 160 ) 2383 0.009% 0.261%
+[ 160, 180 ) 18344 0.072% 0.333%
+[ 180, 200 ) 51873 0.203% 0.536%
+[ 200, 250 ) 631722 2.469% 3.005%
+[ 250, 300 ) 2721970 10.639% 13.644% ##
+[ 300, 350 ) 5909249 23.098% 36.742% #####
+[ 350, 400 ) 6522507 25.495% 62.237% #####
+[ 400, 450 ) 4296332 16.793% 79.030% ###
+[ 450, 500 ) 2130323 8.327% 87.357% ##
+[ 500, 600 ) 1553208 6.071% 93.428% #
+[ 600, 700 ) 642129 2.510% 95.938% #
+[ 700, 800 ) 372428 1.456% 97.394%
+[ 800, 900 ) 187561 0.733% 98.127%
+[ 900, 1000 ) 85858 0.336% 98.462%
+[ 1000, 1200 ) 82730 0.323% 98.786%
+[ 1200, 1400 ) 50691 0.198% 98.984%
+[ 1400, 1600 ) 38026 0.149% 99.133%
+[ 1600, 1800 ) 32991 0.129% 99.261%
+[ 1800, 2000 ) 30200 0.118% 99.380%
+[ 2000, 2500 ) 62195 0.243% 99.623%
+[ 2500, 3000 ) 36684 0.143% 99.766%
+[ 3000, 3500 ) 21317 0.083% 99.849%
+[ 3500, 4000 ) 10216 0.040% 99.889%
+[ 4000, 4500 ) 8351 0.033% 99.922%
+[ 4500, 5000 ) 4152 0.016% 99.938%
+[ 5000, 6000 ) 6328 0.025% 99.963%
+[ 6000, 7000 ) 3253 0.013% 99.976%
+[ 7000, 8000 ) 2082 0.008% 99.984%
+[ 8000, 9000 ) 1546 0.006% 99.990%
+[ 9000, 10000 ) 1055 0.004% 99.994%
+[ 10000, 12000 ) 1566 0.006% 100.000%
+[ 12000, 14000 ) 761 0.003% 100.003%
+[ 14000, 16000 ) 462 0.002% 100.005%
+[ 16000, 18000 ) 226 0.001% 100.006%
+[ 18000, 20000 ) 126 0.000% 100.006%
+[ 20000, 25000 ) 107 0.000% 100.007%
+[ 25000, 30000 ) 43 0.000% 100.007%
+[ 30000, 35000 ) 15 0.000% 100.007%
+[ 35000, 40000 ) 14 0.000% 100.007%
+[ 40000, 45000 ) 16 0.000% 100.007%
+[ 45000, 50000 ) 1 0.000% 100.007%
+[ 50000, 60000 ) 22 0.000% 100.007%
+[ 60000, 70000 ) 10 0.000% 100.007%
+[ 70000, 80000 ) 5 0.000% 100.007%
+[ 80000, 90000 ) 14 0.000% 100.007%
+[ 90000, 100000 ) 11 0.000% 100.007%
+[ 100000, 120000 ) 33 0.000% 100.007%
+[ 120000, 140000 ) 6 0.000% 100.007%
+[ 140000, 160000 ) 3 0.000% 100.007%
+[ 160000, 180000 ) 7 0.000% 100.007%
+[ 200000, 250000 ) 2 0.000% 100.007%
+```
+
+
+In this example, you can see we issued only 696 reads from level 0 but 
about 25 million reads from level 5. The latency distribution among those reads 
is also clearly shown. This will help users analyze OS page cache efficiency.
+
+Currently the read latency per level includes reads from data blocks, index 
blocks, as well as bloom filter blocks. We are also working on a feature to 
break down those three types of blocks.
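+
+For readers who want to check how the two percentage columns of a histogram 
row relate to the bucket counts, here is a small sketch. `BucketShare` and 
`BucketShares` are hypothetical names used only for this illustration, not 
RocksDB code: the first percentage is the bucket count divided by the total 
count, and the second is the running sum of those percentages.
+
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only (not RocksDB code).
struct BucketShare {
  double percent;             // share of samples in this bucket
  double cumulative_percent;  // share of samples in this and earlier buckets
};

// Recomputes the two percentage columns of the histogram from bucket counts.
std::vector<BucketShare> BucketShares(
    const std::vector<uint64_t>& bucket_counts, uint64_t total_count) {
  assert(total_count > 0);
  std::vector<BucketShare> shares;
  uint64_t running = 0;
  for (uint64_t count : bucket_counts) {
    running += count;
    shares.push_back({100.0 * count / total_count,
                      100.0 * running / total_count});
  }
  return shares;
}
```
+
+Feeding it the first four level 0 buckets (1, 1, 5 and 26 samples out of 696) 
reproduces the 3.736% and 4.741% printed on the `[ 50, 60 )` row.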
+
+### Comments
+
+**[Tao Feng]([email protected])**
+
+Is this feature also included in RocksJava?
+
+**[Siying Dong]([email protected])**
+
+Should be. As long as you enable statistics, you should be able to get the 
value from `RocksDB.getProperty()` with property `rocksdb.dbstats`. Let me know 
if you can't find it.
+
+**[chiddu]([email protected])**
+
+> In this example, you can see we only issued 696 reads from level 0 while 
issued 256K reads from level 5.
+
+Isn't it 2.5 M reads instead of 256K?
+
+Also, could anyone please provide more description of the histogram? Especially
+
+> Count: 25583746 Average: 421.1326 StdDev: 385.11
+> Min: 1.0000 Median: 376.0011 Max: 202444.0000
+> Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 
9056.52
+
+and
+
+> [ 0, 1 ) 2351 0.009% 0.009%
+> [ 1, 2 ) 6077 0.024% 0.033%
+> [ 2, 3 ) 8471 0.033% 0.066%
+> [ 3, 4 ) 788 0.003% 0.069%
+
+thanks in advance
+
+**[Siying Dong]([email protected])**
+
+Thank you for pointing out the mistake. I fixed it now.
+
+In this output, there are about 25.6 million samples. The average latency is 
421 microseconds, with a standard deviation of 385. The median is 376 and the 
max value is 202 milliseconds. 0.009% of the samples fall in [0, 1), 0.024% in 
[1, 2) and 0.033% in [2, 3). The accumulated percentage from 0 to 3 is 0.066%.
+
+Hope it helps.
+
+**[chiddu]([email protected])**
+
+Thank you Siying for the quick reply. I was running a couple of benchmark 
tests to check the performance of rocksdb on SSD. One of the tests is similar 
to what is mentioned in the wiki, TEST 4: Random read, except the key_size is 
10 and value_size is 20. I am inserting 1 billion hashes and reading 1 billion 
hashes with 32 threads. The histogram shows something like this:
+
+```
+Level 5 read latency histogram (micros):
+Count: 7133903059 Average: 480.4357 StdDev: 309.18
+Min: 0.0000 Median: 551.1491 Max: 224142.0000
+Percentiles: P50: 551.15 P75: 651.44 P99: 996.52 P99.9: 2073.07 P99.99: 3196.32
+------------------------------------------------------
+[ 0, 1 ) 28587385 0.401% 0.401%
+[ 1, 2 ) 686572516 9.624% 10.025% ##
+[ 2, 3 ) 567317522 7.952% 17.977% ##
+[ 3, 4 ) 44979472 0.631% 18.608%
+[ 4, 5 ) 50379685 0.706% 19.314%
+[ 5, 6 ) 64930061 0.910% 20.224%
+[ 6, 7 ) 22613561 0.317% 20.541%
+…………more………….
+```
+
+If I understand your previous comment correctly,
+
+1. How is it that the count is around 7 billion when I have only inserted 1 
billion hashes? Is the stat broken?
+1. What do the percentiles and the numbers signify?
+1. [ 0, 1 ) 28587385 0.401% 0.401%: what does this "28587385" stand for in 
the histogram row?
+
+**[Siying Dong]([email protected])**
+
+If I remember correctly, with db_bench, if you specify --num=1000000000 
--threads=32, every thread reads one billion keys, for a total of 32 
billion. Is that the case you ran into?
+
+28,587,385 is the number of data points whose value falls in [0, 1).
+28,587,385 / 7,133,903,059 = 0.401% gives the percentage.
+
+**[chiddu]([email protected])**
+
+I do have `num=1000000000` and `t=32`. The script says it is reading 1 billion 
hashes and not 32 billion hashes.
+
+This is the script I used:
+
+```
+echo "Load 1B keys sequentially into database….."
+bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; 
stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=1; sync=0; r=1000000000; 
t=1; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench 
--benchmarks=fillseq --disable_seek_compaction=1 --mmap_read=0 
--statistics=1 --histogram=1 --num=$r --threads=$t --value_size=$vs 
--block_size=$bs --cache_size=$cs --bloom_bits=10 --cache_numshardbits=6 
--open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/test 
--sync=$sync --disable_wal=1 --compression_type=none --stats_interval=$si 
--compression_ratio=0.5 --disable_data_sync=$dds --write_buffer_size=$wbs 
--target_file_size_base=$mb --max_write_buffer_number=$wbn 
--max_background_compactions=$mbc --level0_file_num_compaction_trigger=$ctrig 
--level0_slowdown_writes_trigger=$delay --level0_stop_writes_trigger=$stop 
--num_levels=$levels --delete_obsolete_files_period_micros=$del 
--min_level_to_compress=$mcz --max_grandparent_overlap_factor=$overlap 
--stats_per_interval=1 --max_bytes_for_level_base=$bpl 
--use_existing_db=0 --key_size=10
+
+echo "Reading 1B keys in database in random order…."
+bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; 
stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=0; sync=0; r=1000000000; 
t=32; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench 
--benchmarks=readrandom --disable_seek_compaction=1 --mmap_read=0 
--statistics=1 --histogram=1 --num=$r --threads=$t --value_size=$vs 
--block_size=$bs --cache_size=$cs --bloom_bits=10 --cache_numshardbits=6 
--open_files=$of --verify_checksum=1 --db=/some_data_base --sync=$sync 
--disable_wal=1 --compression_type=none --stats_interval=$si 
--compression_ratio=0.5 --disable_data_sync=$dds --write_buffer_size=$wbs 
--target_file_size_base=$mb --max_write_buffer_number=$wbn 
--max_background_compactions=$mbc --level0_file_num_compaction_trigger=$ctrig 
--level0_slowdown_writes_trigger=$delay --level0_stop_writes_trigger=$stop 
--num_levels=$levels --delete_obsolete_files_period_micros=$del 
--min_level_to_compress=$mcz --max_grandparent_overlap_factor=$overlap 
--stats_per_interval=1 --max_bytes_for_level_base=$bpl 
--use_existing_db=1 --key_size=10
+```
+
+After running this script, there were no issues with loading a billion 
hashes, but when it came to the reading part, it's been almost 4 days and I 
have still only read 7 billion hashes, and have read 200 million hashes in 2 
and a half days. Is there something missing in db_bench, or something I am 
missing?
+
+**[Siying Dong]([email protected])**
+
+It's a printing error then. If you have `num=1000000000` and `t=32`, it will 
be 32 threads, and each reads 1 billion keys.
