Repository: incubator-distributedlog
Updated Branches:
  refs/heads/asf-site 5702ebd47 -> ccacde342


http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/ccacde34/content/feed.xml
----------------------------------------------------------------------
diff --git a/content/feed.xml b/content/feed.xml
index a67d392..6e8ceef 100644
--- a/content/feed.xml
+++ b/content/feed.xml
@@ -6,9 +6,236 @@
 </description>
     <link>http://distributedlog.incubator.apache.org/</link>
    <atom:link href="http://distributedlog.incubator.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Tue, 13 Sep 2016 02:14:40 -0700</pubDate>
-    <lastBuildDate>Tue, 13 Sep 2016 02:14:40 -0700</lastBuildDate>
+    <pubDate>Mon, 19 Sep 2016 22:48:09 +0800</pubDate>
+    <lastBuildDate>Mon, 19 Sep 2016 22:48:09 +0800</lastBuildDate>
     <generator>Jekyll v3.2.1</generator>
     
+      <item>
+        <title>A Technical Review of Kafka and DistributedLog</title>
+        <description>&lt;p&gt;We open sourced &lt;a href=&quot;http://DistributedLog.io&quot;&gt;DistributedLog&lt;/a&gt; in May 2016,
+and it generated a lot of interest in the community. One frequent question we are asked is how DistributedLog
+compares to &lt;a href=&quot;http://kafka.apache.org/&quot;&gt;Apache Kafka&lt;/a&gt;. Technically DistributedLog is not a full-fledged partitioned
+pub/sub system like Apache Kafka. DistributedLog is a replicated log stream store that uses &lt;a href=&quot;http://bookKeeper.apache.org/&quot;&gt;Apache BookKeeper&lt;/a&gt; as its log segment store.
+It focuses on offering &lt;em&gt;durability&lt;/em&gt;, &lt;em&gt;replication&lt;/em&gt; and &lt;em&gt;strong consistency&lt;/em&gt; as essentials for building reliable
+real-time systems. One can use DistributedLog to build and experiment with different messaging models
+(such as Queue, Pub/Sub).&lt;/p&gt;
+
+&lt;p&gt;Since the two systems share a very similar data model around a &lt;code class=&quot;highlighter-rouge&quot;&gt;log&lt;/code&gt;, this blog post discusses the differences
+between Apache Kafka and DistributedLog from a technical perspective. We will try our best to be objective.
+However, since we are not experts in Apache Kafka, we may have made incorrect assumptions about it.
+Please do correct us if that is the case.&lt;/p&gt;
+
+&lt;p&gt;First, let’s have a quick overview of Kafka and 
DistributedLog.&lt;/p&gt;
+
+&lt;h2 id=&quot;what-is-kafka&quot;&gt;What is Kafka?&lt;/h2&gt;
+
+&lt;p&gt;Kafka is a distributed messaging system originally built at LinkedIn and now part of the &lt;a href=&quot;http://apache.org/&quot;&gt;Apache Software Foundation&lt;/a&gt;.
+It is a partition-based pub/sub system. The key abstraction in Kafka is a topic.
+A topic is partitioned into multiple partitions. Each partition is stored and 
replicated in multiple brokers.
+Producers publish their records to partitions of a topic (round-robin or 
partitioned by keys), and consumers
+consume the published records of that topic. Data is published to a leader 
broker for a partition and
+replicated to follower brokers; all read requests are subsequently served by 
the leader broker.
+The follower brokers are used for redundancy and backup in case the leader can 
no longer serve requests.
+The left diagram in Figure 1 shows the data flow in Kafka.&lt;/p&gt;
+
+&lt;h2 id=&quot;what-is-distributedlog&quot;&gt;What is 
DistributedLog?&lt;/h2&gt;
+
+&lt;p&gt;Unlike Kafka, DistributedLog is not a partitioned pub/sub system. It 
is a replicated log stream store.
+The key abstraction in DistributedLog is a continuous replicated log stream. A 
log stream is segmented
+into multiple log segments. Each log segment is stored as a &lt;a 
href=&quot;http://BookKeeper.apache.org/docs/r4.1.0/BookKeeperOverview.html&quot;&gt;ledger&lt;/a&gt;
 in Apache BookKeeper,
+whose data is replicated and distributed evenly across multiple bookies (a 
bookie is a storage node in Apache BookKeeper).
+All the records of a log stream are sequenced by the owner of the log stream - a set of write proxies that
+manage the ownership of log streams &lt;sup id=&quot;fnref:corelibrary&quot;&gt;&lt;a href=&quot;#fn:corelibrary&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Each log record appended to a log stream
+is assigned a sequence number. Readers can start reading the log stream from any provided sequence number.
+Read requests are load balanced across the storage replicas of that stream.
+The right diagram in Figure 1 shows the data flow in DistributedLog.&lt;/p&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; 
src=&quot;/images/blog/kafka-distributedlog-dataflow.png&quot; alt=&quot;Kafka 
vs DistributedLog&quot; width=&quot;500&quot; /&gt;&lt;/p&gt;
+&lt;div class=&quot;text-center&quot;&gt;
+&lt;i&gt;Figure 1 - Apache Kafka vs Apache DistributedLog&lt;/i&gt;
+&lt;/div&gt;
+
+&lt;h2 id=&quot;how-do-kafka-and-distributedlog-differ&quot;&gt;How do Kafka 
and DistributedLog differ?&lt;/h2&gt;
+
+&lt;p&gt;For an apples-to-apples comparison, we will only compare Kafka 
partitions with DistributedLog streams in this blog post. The table below lists 
the most important differences between the two systems.&lt;/p&gt;
+
+&lt;table border=&quot;1px&quot;&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;&amp;nbsp;&lt;/th&gt;
+      &lt;th&gt;&lt;strong&gt;Kafka&lt;/strong&gt;&lt;/th&gt;
+      &lt;th&gt;&lt;strong&gt;DistributedLog&lt;/strong&gt;&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Data 
Segmentation/Distribution&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;The data of a Kafka partition lives in a set of brokers; The 
data is segmented into log segment files locally on each broker.&lt;/td&gt;
+      &lt;td&gt;The data of a DL stream is segmented into multiple log 
segments. The log segments are evenly distributed across the storage 
nodes.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Data Retention&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;Data expires after a configured retention time, or is compacted to keep the latest value for each key.&lt;/td&gt;
+      &lt;td&gt;Data can either expire after a configured retention time or be explicitly truncated to a given position.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Write&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;Multiple-Writers via Broker&lt;/td&gt;
+      &lt;td&gt;Multiple-Writers via Write Proxy; Single-Writer via Core 
Library&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Read&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;Read from the leader broker&lt;/td&gt;
+      &lt;td&gt;Read from any storage node that has the data&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Replication&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;ISR (In-Sync-Replica) Replication - Both leader and followers 
are brokers.&lt;/td&gt;
+      &lt;td&gt;Quorum vote replication - Leaders are write proxies; Followers 
are bookies.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Replication Repair&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;Done by adding a new replica that copies data from the leader broker.&lt;/td&gt;
+      &lt;td&gt;Repair is done by the bookie AutoRecovery mechanism, which restores the replication factor.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Cluster Expansion&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;Partitions need to be re-distributed (rebalanced) when adding new brokers. The size of each partition must be tracked carefully so that rebalancing does not saturate network and I/O.&lt;/td&gt;
+      &lt;td&gt;No data redistribution when adding write proxies or bookies; new log segments are automatically allocated on newly added bookies.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;File (set of files) per partition&lt;/td&gt;
+      &lt;td&gt;Interleaved Storage Format&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;Writes only to the filesystem page cache. Configurable to wait for acknowledgement from the leader or from all replicas in the ISR.&lt;/td&gt;
+      &lt;td&gt;All writes are persisted to disk via explicit fsync. Waits for acknowledgements from a configurable quorum of bookies.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;&lt;strong&gt;I/O Isolation&lt;/strong&gt;&lt;/td&gt;
+      &lt;td&gt;No physical I/O isolation. Rely on filesystem page 
cache.&lt;/td&gt;
+      &lt;td&gt;Physical I/O isolation. Separate journal disks for writes and ledger disks for reads.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h3 id=&quot;data-model&quot;&gt;Data Model&lt;/h3&gt;
+
+&lt;p&gt;A Kafka partition is a log stored as a (set of) file(s) on the broker’s disks.
+Each record is a key/value pair (the key can be omitted for round-robin publishes).
+The key is used for assigning the record to a Kafka partition and also for &lt;a href=&quot;https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction&quot;&gt;log compaction&lt;/a&gt;.
+All the data of a partition is stored only on a set of brokers, replicated from the leader broker to the follower brokers.&lt;/p&gt;
+
+&lt;p&gt;A DistributedLog stream is a &lt;code 
class=&quot;highlighter-rouge&quot;&gt;virtual&lt;/code&gt; stream stored as a 
list of log segments.
+Each log segment is stored as a BookKeeper ledger and replicated to multiple 
bookies.
+Only one active log segment accepts writes at any time.
+The active log segment is sealed and a new one is opened when a given time period elapses,
+when the segment reaches a configured size threshold (based on the configured log segment rolling policy),
+or when the owner of the log stream fails.&lt;/p&gt;
+
+&lt;p&gt;The differences in data segmentation and distribution of Kafka 
partition and DistributedLog stream result in differences in data retention 
policies and cluster operations (e.g. cluster expansion).&lt;/p&gt;
+
+&lt;p&gt;Figure 2 shows the differences between the DistributedLog and Kafka data models.&lt;/p&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; 
src=&quot;/images/blog/kafka-distributedlog-segmentation.png&quot; 
alt=&quot;Kafka Partition and DistributedLog Stream&quot; width=&quot;500&quot; 
/&gt;&lt;/p&gt;
+&lt;div class=&quot;text-center&quot;&gt;
+&lt;i&gt;Figure 2 - Kafka Partition and DistributedLog Stream&lt;/i&gt;
+&lt;/div&gt;
+
+&lt;h4 id=&quot;data-retention&quot;&gt;Data Retention&lt;/h4&gt;
+
+&lt;p&gt;All the data of a Kafka partition is stored on one broker (replicated 
to other brokers). Data is expired and deleted after a configured retention 
period. Additionally, a Kafka partition can be configured to do log compaction 
to keep only the latest values for keys.&lt;/p&gt;
+
+&lt;p&gt;Similar to Kafka, DistributedLog allows configuring retention periods for individual streams, deleting log segments after they expire. Beyond that, DistributedLog also provides an explicit-truncation mechanism: applications can explicitly truncate a log stream to a given position in the stream. This is important for building replicated state machines, as replicated state machines require persisting state before deleting log records.
+&lt;a href=&quot;https://blog.twitter.com/2016/strong-consistency-in-manhattan&quot;&gt;Manhattan&lt;/a&gt; is one example of a system that uses this functionality.&lt;/p&gt;
+
+&lt;h4 id=&quot;operations&quot;&gt;Operations&lt;/h4&gt;
+
+&lt;p&gt;The differences in data segmentation/distribution also result in different experiences in operating the cluster; one example is expanding the cluster.&lt;/p&gt;
+
+&lt;p&gt;When expanding a Kafka cluster, the partitions typically need to be 
rebalanced. The rebalancing moves Kafka partitions around to different replicas 
to try to achieve an even distribution. This involves copying the entire stream 
data from one replica to another. It has been mentioned in several talks that 
rebalancing needs to be executed carefully to avoid saturating disk and network 
resources.&lt;/p&gt;
+
+&lt;p&gt;Expanding a DistributedLog cluster works in a very different way. DistributedLog consists of two layers: a storage layer - Apache BookKeeper - and a serving layer - the write and read proxies. When expanding the storage layer, we typically just add more bookies. The new bookies are discovered by the writers of the log streams and immediately become available for writing new log segments. There is no rebalancing involved when expanding the data storage layer. Rebalancing only comes into play when adding capacity to the serving layer; however, this rebalancing only moves the ownership of log streams so that network bandwidth is used evenly across proxies. There is no data copying involved, only accounting of ownership. This separation of storage and serving layers not only makes it easy to integrate the system with any auto-scaling mechanism, but also allows scaling resources independently.&lt;/p&gt;
+
+&lt;h3 id=&quot;writer--producer&quot;&gt;Writer &amp;amp; Producer&lt;/h3&gt;
+
+&lt;p&gt;As shown in Figure 1, Kafka producers write batches of records to the leader broker of a Kafka partition. The follower brokers in the &lt;a href=&quot;https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication&quot;&gt;ISR (in-sync-replica) set&lt;/a&gt; will replicate the records from the leader broker. A record is considered committed only when the leader receives acknowledgments from all the replicas in the ISR. The producer can be configured to wait for the response from the leader broker or from all brokers in the ISR.&lt;/p&gt;
+
+&lt;p&gt;There are two ways to write log records to a DistributedLog stream: using a thin thrift client to write records through the write proxies (multi-writer semantics), or using the DistributedLog core library to talk directly to the storage nodes (single-writer semantics). The first approach is common for building messaging systems, while the second is common for building replicated state machines. You can check the &lt;a href=&quot;http://distributedlog.incubator.apache.org/docs/latest/user_guide/api/practice&quot;&gt;Best Practices&lt;/a&gt; section of the DistributedLog documentation for more details on which to use.&lt;/p&gt;
+
+&lt;p&gt;The owner of a log stream writes a batch of records as BookKeeper entries in parallel to the bookies and waits for responses from a quorum of bookies. The size of the quorum is known as the &lt;code class=&quot;highlighter-rouge&quot;&gt;ack_quorum_size&lt;/code&gt; of a BookKeeper ledger, and it can be configured per DistributedLog stream. This provides durability flexibility similar to that of Kafka producers. We will discuss the differences in replication algorithms further in the ‘Replication’ section.&lt;/p&gt;
+
+&lt;p&gt;Both Kafka and DistributedLog support end-to-end batching and compression. One slight difference is that all writes in DistributedLog are flushed (via fsync) to disk before being acknowledged (we are not aware of Kafka providing a similar guarantee).&lt;/p&gt;
+
+&lt;h3 id=&quot;reader--consumer&quot;&gt;Reader &amp;amp; Consumer&lt;/h3&gt;
+
+&lt;p&gt;Kafka consumers read the records from the leader broker. The 
assumption of this design is that the leader has all the latest data in 
filesystem page cache most of the time. This is a good choice to fully utilize 
the filesystem page cache and achieve high performance.&lt;/p&gt;
+
+&lt;p&gt;DistributedLog takes a different approach. As there is no distinguished &lt;code class=&quot;highlighter-rouge&quot;&gt;leadership&lt;/code&gt; on the storage nodes, DistributedLog reads records from any of the storage nodes that store the data. To achieve predictable low latency, DistributedLog deploys a speculative read mechanism that retries reading from a different replica after a configured speculative read timeout. This results in more read traffic to the storage nodes than Kafka’s approach. However, configuring the speculative read timeout to be around the 99th percentile of read latency can mask latency variability on a storage node at the cost of about 1% additional traffic.&lt;/p&gt;
+
+&lt;p&gt;The differences in read mechanisms are mostly due to the differences in the replication mechanism and the I/O system on the storage nodes, both discussed below.&lt;/p&gt;
+
+&lt;h3 id=&quot;replication&quot;&gt;Replication&lt;/h3&gt;
+
+&lt;p&gt;Kafka uses an ISR replication algorithm - a broker is elected as the leader. All writes are published to the leader broker, and all the followers in the ISR set read and replicate data from the leader. The leader maintains a high watermark (HW), which is the offset of the last committed record for a partition. The high watermark is continuously propagated to the followers and periodically checkpointed to disk on each broker for recovery. The HW is updated when all replicas in the ISR have successfully written the records to the filesystem (not necessarily to disk) and acknowledged back to the leader.&lt;/p&gt;
+
+&lt;p&gt;The ISR mechanism allows adding and dropping replicas to trade off between availability and performance. However, a side effect of allowing the replica set to grow and shrink is an increased probability of &lt;a href=&quot;https://aphyr.com/posts/293-jepsen-kafka&quot;&gt;data loss&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;DistributedLog uses a quorum-vote replication algorithm, as typically seen in consensus algorithms like Zab, Raft and Viewstamped Replication. The owner of the log stream writes records to all the storage nodes in parallel and waits until a configured quorum of storage nodes has acknowledged before the records are considered committed. The storage nodes acknowledge a write request only after the data has been persisted to disk by explicitly calling flush. The owner of the log stream also maintains the offset of the last committed record for the log stream, known as the LAC (LastAddConfirmed) in Apache BookKeeper. The LAC is piggybacked on entries (to save extra RPC calls) and continuously propagated to the storage nodes. The size of the replica set in DistributedLog is configured per stream and fixed per log segment. A change of replication settings affects only newly allocated log segments, not existing ones.&lt;/p&gt;
+
+&lt;h3 id=&quot;storage&quot;&gt;Storage&lt;/h3&gt;
+
+&lt;p&gt;Each Kafka partition is stored as a (set of) file(s) on the broker’s disks. It achieves good performance by leveraging the filesystem page cache and I/O scheduling. This is also how Kafka leverages Java’s sendfile API to get data in and out of brokers efficiently. However, it can be exposed to unpredictable performance when page cache evictions become frequent for various reasons (e.g. a consumer falls behind, or random writes/reads occur).&lt;/p&gt;
+
+&lt;p&gt;DistributedLog uses a different I/O model. Figure 3 illustrates the I/O system on the bookies (BookKeeper’s storage nodes). The three common I/O workloads - writes (blue lines), tailing-reads (red lines) and catchup-reads (purple lines) - are separated into three physically distinct I/O subsystems. All writes are sequentially appended to journal files on the journal disks and group-committed to disk. After the writes are durably persisted, they are put into a memtable and acknowledged back to the clients. The data in the memtable is asynchronously flushed into an interleaved indexed data structure: the entries are appended to entry log files and the offsets are indexed by entry ids in the ledger index files. The latest data is guaranteed to be in the memtable to serve tailing reads. Catchup reads read from the entry log files. With this physical isolation, a bookie node can fully utilize the network ingress bandwidth and sequential write bandwidth on the journal disk for writes, and the network egress bandwidth and IOPS on multiple ledger disks for reads, without the two interfering with each other.&lt;/p&gt;
+
+&lt;p&gt;&lt;img class=&quot;center-block&quot; 
src=&quot;/images/blog/kafka-distributedlog-bookkeeper-io.png&quot; 
alt=&quot;BookKeeper I/O Isolation&quot; width=&quot;500&quot; /&gt;&lt;/p&gt;
+&lt;div class=&quot;text-center&quot;&gt;
+&lt;i&gt;Figure 3 - BookKeeper I/O Isolation&lt;/i&gt;
+&lt;/div&gt;
+
+&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
+
+&lt;p&gt;Both Kafka and DistributedLog are designed for log-stream-related workloads. They share some similarities, but different design principles for storage and replication lead to different implementations. Hopefully this blog post clarifies the differences and addresses some of the questions. We are also working on follow-up blog posts to explain the performance characteristics of DistributedLog.&lt;/p&gt;
+
+&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
+
+&lt;div class=&quot;footnotes&quot;&gt;
+  &lt;ol&gt;
+    &lt;li id=&quot;fn:corelibrary&quot;&gt;
+      &lt;p&gt;Applications can also use the core library directly to append 
log records. This is very useful for use cases like replicated state machines 
that require ordering and exclusive write semantics. &lt;a 
href=&quot;#fnref:corelibrary&quot; 
class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
+    &lt;/li&gt;
+  &lt;/ol&gt;
+&lt;/div&gt;
+</description>
+        <pubDate>Sat, 19 Sep 2015 18:00:00 +0800</pubDate>
+        
<link>http://distributedlog.incubator.apache.org/technical-review/2015/09/19/kafka-vs-distributedlog.html</link>
+        <guid 
isPermaLink="true">http://distributedlog.incubator.apache.org/technical-review/2015/09/19/kafka-vs-distributedlog.html</guid>
+        
+        
+        <category>technical-review</category>
+        
+      </item>
+    
   </channel>
 </rss>

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/ccacde34/content/images/blog/kafka-distributedlog-bookkeeper-io.png
----------------------------------------------------------------------
diff --git a/content/images/blog/kafka-distributedlog-bookkeeper-io.png 
b/content/images/blog/kafka-distributedlog-bookkeeper-io.png
new file mode 100644
index 0000000..bd2d469
Binary files /dev/null and 
b/content/images/blog/kafka-distributedlog-bookkeeper-io.png differ

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/ccacde34/content/images/blog/kafka-distributedlog-dataflow.png
----------------------------------------------------------------------
diff --git a/content/images/blog/kafka-distributedlog-dataflow.png 
b/content/images/blog/kafka-distributedlog-dataflow.png
new file mode 100644
index 0000000..280f6a1
Binary files /dev/null and 
b/content/images/blog/kafka-distributedlog-dataflow.png differ

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/ccacde34/content/images/blog/kafka-distributedlog-segmentation.png
----------------------------------------------------------------------
diff --git a/content/images/blog/kafka-distributedlog-segmentation.png 
b/content/images/blog/kafka-distributedlog-segmentation.png
new file mode 100644
index 0000000..7a464b1
Binary files /dev/null and 
b/content/images/blog/kafka-distributedlog-segmentation.png differ

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/ccacde34/content/index.html
----------------------------------------------------------------------
diff --git a/content/index.html b/content/index.html
index 75c9d7e..82d7081 100644
--- a/content/index.html
+++ b/content/index.html
@@ -264,6 +264,8 @@ and memory.
     <h3>Blog</h3>
     <div class="list-group">
     
+    <a class="list-group-item" 
href="/technical-review/2015/09/19/kafka-vs-distributedlog.html">Sep 19, 2015 - 
A Technical Review of Kafka and DistributedLog</a>
+    
     </div>
   </div>
   <div class="col-md-6">

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/ccacde34/content/technical-review/2015/09/19/kafka-vs-distributedlog.html
----------------------------------------------------------------------
diff --git a/content/technical-review/2015/09/19/kafka-vs-distributedlog.html 
b/content/technical-review/2015/09/19/kafka-vs-distributedlog.html
new file mode 100644
index 0000000..d449689
--- /dev/null
+++ b/content/technical-review/2015/09/19/kafka-vs-distributedlog.html
@@ -0,0 +1,413 @@
+<!DOCTYPE html>
+<html lang="en">
+
+  <head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+
+  <title>A Technical Review of Kafka and DistributedLog</title>
+  <meta name="description" content="We open sourced DistributedLog in May 
2016.It generated a lot of interest in the community. One frequent question we 
are asked is how does DistributedLogcomp...">
+
+  <link rel="stylesheet" href="/styles/site.css">
+  <link rel="stylesheet" href="/css/theme.css">
+  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <link rel="canonical" 
href="http://distributedlog.incubator.apache.org/technical-review/2015/09/19/kafka-vs-distributedlog.html";
 data-proofer-ignore>
+  <link rel="alternate" type="application/rss+xml" title="Apache 
DistributedLog (incubating)" 
href="http://distributedlog.incubator.apache.org/feed.xml";>
+  <script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new 
Date();a=s.createElement(o),
+  
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
+
+  ga('create', 'UA-83870961-1', 'auto');
+  ga('send', 'pageview');
+
+  </script> 
+  <link rel="shortcut icon" type="image/x-icon" href="/images/favicon.ico">
+</head>
+
+
+  <body role="document">
+
+    <nav class="navbar navbar-default navbar-fixed-top">
+  <div class="container">
+    <div class="navbar-header">
+      <a href="/" class="navbar-brand" >
+        <img alt="Brand" style="height: 28px" 
src="/images/distributedlog_logo_navbar.png">
+      </a>
+      <button type="button" class="navbar-toggle collapsed" 
data-toggle="collapse" data-target="#navbar" aria-expanded="false" 
aria-controls="navbar">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+    </div>
+    <div id="navbar" class="navbar-collapse collapse">
+      <ul class="nav navbar-nav">
+        <!-- Overview -->
+        <li><a href="/docs/latest/basics/introduction">Overview</a></li>
+        <!-- Downloads -->
+        <li><a href="/docs/latest/start/download">Downloads</a></li>
+        <!-- Quick Start -->
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown" 
role="button" aria-haspopup="true" aria-expanded="false">Quick Start<span 
class="caret"></span></a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/latest/start/quickstart">Setup & Run 
Example</a></li>
+            <li role="separator" class="divider"></li>
+            <li class="dropdown-header">Tutorials</li>
+            <li>
+              <a href="/docs/latest/tutorials/main#id3">
+              <small><span class="glyphicon glyphicon-pencil"></span></small>
+              Basic
+              </a>
+            </li>
+            <li>
+              <a href="/docs/latest/tutorials/main#id4">
+              <small><span class="glyphicon glyphicon-envelope"></span></small>
+              Messaging
+              </a>
+            </li>
+            <li>
+              <a href="/docs/latest/tutorials/main#id6">
+              <small><span class="glyphicon glyphicon-stats"></span></small>
+              Analytics
+              </a>
+            </li>
+          </ul>
+        </li>
+        <!-- Documentation -->
+        <li class="dropdown">
+                     <a href="#" class="dropdown-toggle" 
data-toggle="dropdown" role="button" aria-haspopup="true" 
aria-expanded="false">Documentation<span class="caret"></span></a>
+          <ul class="dropdown-menu">
+            <li class="dropdown-header">Snapshot (Developement)</li>
+            <li><a href="/docs/latest">Latest</a></li>
+            <li role="separator" class="divider"></li>
+            <li>
+              <a href="https://cwiki.apache.org/confluence/display/DL/Project+Ideas">
+                <small><span class="glyphicon 
glyphicon-new-window"></span></small>
+                Project Ideas
+              </a>
+            </li>
+          </ul>
+        </li>
+        <!-- FAQ -->
+        <li><a href="/faq">FAQ</a></li>
+      </ul>
+      <!-- Right Side -->
+      <ul class="nav navbar-nav navbar-right">
+        <!-- Blog -->
+        <li><a href="/blog">Blog</a></li>
+        <!-- Community -->
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown" 
role="button" aria-haspopup="true" aria-expanded="false">Community<span 
class="caret"></span></a>
+          <ul class="dropdown-menu">
+            <li class="dropdown-header">Community</li>
+            <li><a href="/community/#mailing-lists">Mailing Lists</a></li>
+            <li><a href="/community/#source-code">Source Code</a></li>
+            <li><a href="/community/#issue-tracker">Issue Tracking</a></li>
+            <li><a href="/community/team/">Team</a></li>
+            <li role="separator" class="divider"></li>
+            <li class="dropdown-header">Contribute</li>
+            <li><a href="https://cwiki.apache.org/confluence/display/DL/Developer+Setup">Developer Setup</a></li>
+            <li><a href="https://cwiki.apache.org/confluence/display/DL/Contributing+to+DistributedLog">Contributing to DistributedLog</a></li>
+            <li><a href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65867477">Coding Guide</a></li>
+          </ul>
+        </li>
+        <!-- Project -->
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown" 
role="button" aria-haspopup="true" aria-expanded="false">Project<span 
class="caret"></span></a>
+          <ul class="dropdown-menu">
+                             <li class="dropdown-header">Project</li>
+            <li><a href="/project/presentations/">Presentations</a></li>
+            <li>
+              <a href="https://twitter.com/distributedlog";>
+                <small><span class="glyphicon 
glyphicon-new-window"></span></small>
+                Twitter
+              </a>
+            </li>
+            <li>
+              <a href="https://github.com/apache/incubator-distributedlog";>
+                <small><span class="glyphicon 
glyphicon-new-window"></span></small>
+                Github
+              </a>
+            </li>
+            <li>
+              <a href="https://cwiki.apache.org/confluence/display/DL/Apache+DistributedLog+Home">
+                <small><span class="glyphicon 
glyphicon-new-window"></span></small>
+                Wiki
+              </a>
+            </li>
+          </ul>
+        </li>
+      </ul>
+    </div><!--/.nav-collapse -->
+  </div>
+</nav>
+
+
+<link rel="stylesheet" href="">
+
+
+    <div class="container" role="main">
+
+      <div class="row">
+        
+
+<article class="post" itemscope itemtype="http://schema.org/BlogPosting";>
+
+  <div class="col-md-8 col-md-offset-2">
+    <header class="post-header">
+      <h1 class="post-title" itemprop="name headline">A Technical Review of 
Kafka and DistributedLog</h1>
+      <p class="post-meta"><time datetime="2015-09-19T18:00:00+08:00" 
itemprop="datePublished">Sep 19, 2015</time> •  Sijie Guo [<a 
href="https://twitter.com/sijieg";>@sijieg</a>]
+</p>
+    </header>
+
+    <div class="post-content" itemprop="articleBody">
+
+      <p>We open sourced <a href="http://DistributedLog.io">DistributedLog</a> in May 2016,
+and it generated a lot of interest in the community. One frequent question we are asked is how DistributedLog
+compares to <a href="http://kafka.apache.org/">Apache Kafka</a>. Technically DistributedLog is not a full-fledged partitioned
+pub/sub system like Apache Kafka. DistributedLog is a replicated log stream store that uses <a href="http://bookKeeper.apache.org/">Apache BookKeeper</a> as its log segment store.
+It focuses on offering <em>durability</em>, <em>replication</em> and <em>strong consistency</em> as essentials for building reliable
+real-time systems. One can use DistributedLog to build and experiment with different messaging models
+(such as Queue, Pub/Sub).</p>
+
+<p>Since the two systems share a very similar data model around a <code class="highlighter-rouge">log</code>, this blog post discusses the differences
+between Apache Kafka and DistributedLog from a technical perspective. We will try our best to be objective.
+However, since we are not experts in Apache Kafka, we may have made incorrect assumptions about it.
+Please do correct us if that is the case.</p>
+
+<p>First, let’s have a quick overview of Kafka and DistributedLog.</p>
+
+<h2 id="what-is-kafka">What is Kafka?</h2>
+
+<p>Kafka is a distributed messaging system originally built at LinkedIn and now part of the <a href="http://apache.org/">Apache Software Foundation</a>.
+It is a partition-based pub/sub system. The key abstraction in Kafka is a topic.
+A topic is partitioned into multiple partitions. Each partition is stored and 
replicated in multiple brokers.
+Producers publish their records to partitions of a topic (round-robin or 
partitioned by keys), and consumers
+consume the published records of that topic. Data is published to a leader 
broker for a partition and
+replicated to follower brokers; all read requests are subsequently served by 
the leader broker.
+The follower brokers are used for redundancy and backup in case the leader can 
no longer serve requests.
+The left diagram in Figure 1 shows the data flow in Kafka.</p>
+
+<h2 id="what-is-distributedlog">What is DistributedLog?</h2>
+
+<p>Unlike Kafka, DistributedLog is not a partitioned pub/sub system. It is a replicated log stream store.
+The key abstraction in DistributedLog is a continuous replicated log stream. A log stream is segmented
+into multiple log segments. Each log segment is stored as a <a href="http://BookKeeper.apache.org/docs/r4.1.0/BookKeeperOverview.html">ledger</a> in Apache BookKeeper,
+whose data is replicated and distributed evenly across multiple bookies (a bookie is a storage node in Apache BookKeeper).
+All the records of a log stream are sequenced by the owner of the log stream - a set of write proxies that
+manage the ownership of log streams <sup id="fnref:corelibrary"><a href="#fn:corelibrary" class="footnote">1</a></sup>. Each log record appended to a log stream
+is assigned a sequence number. Readers can start reading the log stream from any provided sequence number
+(see the reader sketch after Figure 1). Read requests are load balanced across the storage replicas of that stream.
+The right diagram in Figure 1 shows the data flow in DistributedLog.</p>
+
+<p><img class="center-block" 
src="/images/blog/kafka-distributedlog-dataflow.png" alt="Kafka vs 
DistributedLog" width="500" /></p>
+<div class="text-center">
+<i>Figure 1 - Apache Kafka vs Apache DistributedLog</i>
+</div>
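+
+<p>As noted above, readers can start from any provided position. To make that concrete, here is a minimal sketch of tailing a stream with the DistributedLog core library. The class and method names follow the core library API as we understand it at the time of writing; treat the exact signatures and the namespace URI as illustrative.</p>
+
+<pre><code class="language-java">// Sketch: open a stream from a namespace and tail it from the beginning.
+// "distributedlog://zk1:2181/messaging/my-namespace" is a hypothetical namespace URI.
+DistributedLogNamespace namespace = DistributedLogNamespaceBuilder.newBuilder()
+    .conf(new DistributedLogConfiguration())
+    .uri(URI.create("distributedlog://zk1:2181/messaging/my-namespace"))
+    .build();
+DistributedLogManager dlm = namespace.openLog("my-stream");
+AsyncLogReader reader = FutureUtils.result(dlm.openAsyncLogReader(DLSN.InitialDLSN));
+LogRecordWithDLSN record = FutureUtils.result(reader.readNext());
+System.out.println("Read record at position " + record.getDlsn());
+</code></pre>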
+
+<h2 id="how-do-kafka-and-distributedlog-differ">How do Kafka and 
DistributedLog differ?</h2>
+
+<p>For an apples-to-apples comparison, we will only compare Kafka partitions 
with DistributedLog streams in this blog post. The table below lists the most 
important differences between the two systems.</p>
+
+<table border="1px">
+  <thead>
+    <tr>
+      <th>&nbsp;</th>
+      <th><strong>Kafka</strong></th>
+      <th><strong>DistributedLog</strong></th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><strong>Data Segmentation/Distribution</strong></td>
+      <td>The data of a Kafka partition lives in a set of brokers; The data is 
segmented into log segment files locally on each broker.</td>
+      <td>The data of a DL stream is segmented into multiple log segments. The 
log segments are evenly distributed across the storage nodes.</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Data Retention</strong></td>
+      <td>Data expires after a configured retention time, or is compacted to keep the latest value for each key.</td>
+      <td>Data can either expire after a configured retention time or be explicitly truncated to a given position.</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Write</strong></td>
+      <td>Multiple-Writers via Broker</td>
+      <td>Multiple-Writers via Write Proxy; Single-Writer via Core Library</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Read</strong></td>
+      <td>Read from the leader broker</td>
+      <td>Read from any storage node that has the data</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Replication</strong></td>
+      <td>ISR (In-Sync-Replica) Replication - Both leader and followers are 
brokers.</td>
+      <td>Quorum vote replication - Leaders are write proxies; Followers are 
bookies.</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Replication Repair</strong></td>
+      <td>Done by adding a new replica that copies data from the leader broker.</td>
+      <td>Repair is done by the bookie AutoRecovery mechanism, which restores the replication factor.</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Cluster Expansion</strong></td>
+      <td>Partitions need to be re-distributed (rebalanced) when adding new brokers. The size of each partition must be tracked carefully so that rebalancing does not saturate network and I/O.</td>
+      <td>No data redistribution when adding write proxies or bookies; new log segments are automatically allocated on newly added bookies.</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Storage</strong></td>
+      <td>File (set of files) per partition</td>
+      <td>Interleaved Storage Format</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>Durability</strong></td>
+      <td>Writes only to the filesystem page cache. Configurable to wait for acknowledgement from the leader or from all replicas in the ISR.</td>
+      <td>All writes are persisted to disk via explicit fsync. Waits for acknowledgements from a configurable quorum of bookies.</td>
+    </tr>
+  </tbody>
+  <tbody>
+    <tr>
+      <td><strong>I/O Isolation</strong></td>
+      <td>No physical I/O isolation. Rely on filesystem page cache.</td>
+      <td>Physical I/O isolation. Separate journal disks for writes and ledger disks for reads.</td>
+    </tr>
+  </tbody>
+</table>
+
+<h3 id="data-model">Data Model</h3>
+
+<p>A Kafka partition is a log stored as a (set of) file(s) on the broker’s disks.
+Each record is a key/value pair (the key can be omitted for round-robin publishes).
+The key is used for assigning the record to a Kafka partition and also for <a href="https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction">log compaction</a>.
+All the data of a partition is stored only on a set of brokers, replicated from the leader broker to the follower brokers.</p>
+
+<p>A DistributedLog stream is a <code class="highlighter-rouge">virtual</code> stream stored as a list of log segments.
+Each log segment is stored as a BookKeeper ledger and replicated to multiple bookies.
+Only one active log segment accepts writes at any time.
+The active log segment is sealed and a new one is opened when a given time period elapses,
+when the segment reaches a configured size threshold (based on the configured log segment rolling policy),
+or when the owner of the log stream fails.</p>
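+
+<p>For illustration, the rolling and retention behavior described above is driven by per-stream configuration. The following is a hedged sketch using DistributedLogConfiguration setter names as we understand them; the exact names and units may differ across versions.</p>
+
+<pre><code class="language-java">// Sketch: configure log segment rolling and retention for a stream.
+DistributedLogConfiguration conf = new DistributedLogConfiguration();
+conf.setLogSegmentRollingIntervalMinutes(120);  // seal and roll a new segment every 2 hours
+conf.setMaxLogSegmentBytes(256 * 1024 * 1024);  // ... or once the active segment reaches 256 MB
+conf.setRetentionPeriodHours(72);               // expire sealed segments after 3 days
+</code></pre>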
+
+<p>The differences in data segmentation and distribution of Kafka partition 
and DistributedLog stream result in differences in data retention policies and 
cluster operations (e.g. cluster expansion).</p>
+
+<p>Figure 2 shows the differences between the DistributedLog and Kafka data models.</p>
+
+<p><img class="center-block" 
src="/images/blog/kafka-distributedlog-segmentation.png" alt="Kafka Partition 
and DistributedLog Stream" width="500" /></p>
+<div class="text-center">
+<i>Figure 2 - Kafka Partition and DistributedLog Stream</i>
+</div>
+
+<h4 id="data-retention">Data Retention</h4>
+
+<p>All the data of a Kafka partition is stored on one broker (replicated to 
other brokers). Data is expired and deleted after a configured retention 
period. Additionally, a Kafka partition can be configured to do log compaction 
to keep only the latest values for keys.</p>
+
+<p>Similar to Kafka, DistributedLog allows configuring retention periods for individual streams, deleting log segments after they expire. Beyond that, DistributedLog also provides an explicit-truncation mechanism: applications can explicitly truncate a log stream to a given position in the stream. This is important for building replicated state machines, as replicated state machines require persisting state before deleting log records.
+<a href="https://blog.twitter.com/2016/strong-consistency-in-manhattan">Manhattan</a> is one example of a system that uses this functionality.</p>
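+
+<p>As a minimal sketch of explicit truncation: a replicated state machine first persists a checkpoint of its state, then truncates the stream up to the position (DLSN) of the last record folded into that checkpoint. The <code class="highlighter-rouge">truncate</code> call follows the core library's AsyncLogWriter API as we understand it; the checkpoint helper is hypothetical.</p>
+
+<pre><code class="language-java">// Sketch: persist state first, then truncate the log up to that position.
+// Assumes "dlm" from the reader sketch earlier in this post.
+DLSN checkpointDlsn = persistStateAndGetLastAppliedDLSN(); // hypothetical, application-specific helper
+AsyncLogWriter writer = FutureUtils.result(dlm.openAsyncLogWriter());
+boolean truncated = FutureUtils.result(writer.truncate(checkpointDlsn));
+</code></pre>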
+
+<h4 id="operations">Operations</h4>
+
+<p>The differences in data segmentation/distribution also result in different experiences in operating the cluster; one example is expanding the cluster.</p>
+
+<p>When expanding a Kafka cluster, the partitions typically need to be 
rebalanced. The rebalancing moves Kafka partitions around to different replicas 
to try to achieve an even distribution. This involves copying the entire stream 
data from one replica to another. It has been mentioned in several talks that 
rebalancing needs to be executed carefully to avoid saturating disk and network 
resources.</p>
+
+<p>Expanding a DistributedLog cluster works in a very different way. DistributedLog consists of two layers: a storage layer - Apache BookKeeper - and a serving layer - the write and read proxies. When expanding the storage layer, we typically just add more bookies. The new bookies are discovered by the writers of the log streams and immediately become available for writing new log segments. There is no rebalancing involved when expanding the data storage layer. Rebalancing only comes into play when adding capacity to the serving layer; however, this rebalancing only moves the ownership of log streams so that network bandwidth is used evenly across proxies. There is no data copying involved, only accounting of ownership. This separation of storage and serving layers not only makes it easy to integrate the system with any auto-scaling mechanism, but also allows scaling resources independently.</p>
+
+<h3 id="writer--producer">Writer &amp; Producer</h3>
+
+<p>As shown in Figure 1, Kafka producers write batches of records to the leader broker of a Kafka partition. The follower brokers in the <a href="https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication">ISR (in-sync-replica) set</a> will replicate the records from the leader broker. A record is considered committed only when the leader receives acknowledgments from all the replicas in the ISR. The producer can be configured to wait for the response from the leader broker or from all brokers in the ISR.</p>
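+
+<p>For reference, this durability knob is the Kafka producer's <code class="highlighter-rouge">acks</code> setting. A minimal sketch with the Java producer client (broker address hypothetical):</p>
+
+<pre><code class="language-java">// Sketch: a Kafka producer that waits for acknowledgements from all ISR replicas.
+// Uses java.util.Properties and org.apache.kafka.clients.producer.*.
+Properties props = new Properties();
+props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
+props.put("acks", "all");  // "1" would wait for the leader only
+props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
+props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
+KafkaProducer&lt;String, String&gt; producer = new KafkaProducer&lt;&gt;(props);
+producer.send(new ProducerRecord&lt;&gt;("my-topic", "key", "value"));
+</code></pre>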
+
+<p>There are two ways to write log records to a DistributedLog stream: using a thin thrift client to write records through the write proxies (multi-writer semantics), or using the DistributedLog core library to talk directly to the storage nodes (single-writer semantics). The first approach is common for building messaging systems, while the second is common for building replicated state machines. You can check the <a href="http://distributedlog.incubator.apache.org/docs/latest/user_guide/api/practice">Best Practices</a> section of the DistributedLog documentation for more details on which to use.</p>
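+
+<p>Here is a minimal sketch of single-writer semantics via the core library, reusing the namespace setup from the reader sketch earlier in this post. Names follow the core library API as we understand it; error handling is omitted.</p>
+
+<pre><code class="language-java">// Sketch: single-writer semantics via the DistributedLog core library.
+AsyncLogWriter writer = FutureUtils.result(dlm.openAsyncLogWriter());
+long txid = 1L; // transaction ids must increase; assigned by the single writer
+LogRecord record = new LogRecord(txid, "hello distributedlog".getBytes(StandardCharsets.UTF_8));
+DLSN dlsn = FutureUtils.result(writer.write(record));
+System.out.println("Record committed at position " + dlsn);
+</code></pre>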
+
+<p>The owner of a log stream writes a batch of records as BookKeeper entries in parallel to the bookies and waits for responses from a quorum of bookies. The size of the quorum is known as the <code class="highlighter-rouge">ack_quorum_size</code> of a BookKeeper ledger, and it can be configured per DistributedLog stream. This provides durability flexibility similar to that of Kafka producers. We will discuss the differences in replication algorithms further in the ‘Replication’ section.</p>
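+
+<p>Below is a hedged sketch of the per-stream quorum settings, using the DistributedLogConfiguration setters we are aware of; treat the names as illustrative.</p>
+
+<pre><code class="language-java">// Sketch: each entry is written to writeQuorumSize bookies out of an ensemble,
+// and the write completes once ackQuorumSize bookies have acknowledged.
+DistributedLogConfiguration conf = new DistributedLogConfiguration();
+conf.setEnsembleSize(5);     // bookies a log segment's ledger is spread over
+conf.setWriteQuorumSize(3);  // replicas written per entry
+conf.setAckQuorumSize(2);    // acknowledgements required before a write completes
+</code></pre>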
+
+<p>Both Kafka and DistributedLog support end-to-end batching and compression. One slight difference is that all writes in DistributedLog are flushed (via fsync) to disk before being acknowledged (we are not aware of Kafka providing a similar guarantee).</p>
+
+<h3 id="reader--consumer">Reader &amp; Consumer</h3>
+
+<p>Kafka consumers read the records from the leader broker. The assumption of 
this design is that the leader has all the latest data in filesystem page cache 
most of the time. This is a good choice to fully utilize the filesystem page 
cache and achieve high performance.</p>
+
+<p>DistributedLog takes a different approach. As there is no distinguished <code class="highlighter-rouge">leadership</code> on the storage nodes, DistributedLog reads records from any of the storage nodes that store the data. To achieve predictable low latency, DistributedLog deploys a speculative read mechanism that retries reading from a different replica after a configured speculative read timeout. This results in more read traffic to the storage nodes than Kafka’s approach. However, configuring the speculative read timeout to be around the 99th percentile of read latency can mask latency variability on a storage node at the cost of about 1% additional traffic.</p>
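+
+<p>The mechanism itself is easy to sketch. The following is a hypothetical illustration of speculative reads (not DistributedLog's actual implementation): fire the read at one replica, and if it has not completed within the speculative timeout, fire the same read at the next replica; the first successful response wins.</p>
+
+<pre><code class="language-java">// Hypothetical sketch of a speculative read across replicas.
+// Uses java.util.List, java.util.function.Supplier and java.util.concurrent.*.
+static CompletableFuture&lt;byte[]&gt; speculativeRead(
+    List&lt;Supplier&lt;CompletableFuture&lt;byte[]&gt;&gt;&gt; replicas,
+    long timeoutMs, ScheduledExecutorService scheduler) {
+  CompletableFuture&lt;byte[]&gt; result = new CompletableFuture&lt;&gt;();
+  tryReplica(replicas, 0, timeoutMs, scheduler, result);
+  return result;
+}
+
+static void tryReplica(List&lt;Supplier&lt;CompletableFuture&lt;byte[]&gt;&gt;&gt; replicas, int i,
+    long timeoutMs, ScheduledExecutorService scheduler, CompletableFuture&lt;byte[]&gt; result) {
+  if (i &gt;= replicas.size() || result.isDone()) {
+    return; // out of replicas, or an earlier attempt already succeeded
+  }
+  replicas.get(i).get().thenAccept(result::complete); // first success wins
+  // If nothing has completed within the speculative timeout, try the next replica.
+  scheduler.schedule(() -&gt; tryReplica(replicas, i + 1, timeoutMs, scheduler, result),
+      timeoutMs, TimeUnit.MILLISECONDS);
+}
+</code></pre>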
+
+<p>The differences in read mechanisms are mostly due to the differences in the replication mechanism and the I/O system on the storage nodes, both discussed below.</p>
+
+<h3 id="replication">Replication</h3>
+
+<p>Kafka uses an ISR replication algorithm - a broker is elected as the leader. All writes are published to the leader broker, and all the followers in the ISR set read and replicate data from the leader. The leader maintains a high watermark (HW), which is the offset of the last committed record for a partition. The high watermark is continuously propagated to the followers and periodically checkpointed to disk on each broker for recovery. The HW is updated when all replicas in the ISR have successfully written the records to the filesystem (not necessarily to disk) and acknowledged back to the leader.</p>
+
+<p>The ISR mechanism allows adding and dropping replicas to trade off between availability and performance. However, a side effect of allowing the replica set to grow and shrink is an increased probability of <a href="https://aphyr.com/posts/293-jepsen-kafka">data loss</a>.</p>
+
+<p>DistributedLog uses a quorum-vote replication algorithm, as typically seen in consensus algorithms like Zab, Raft and Viewstamped Replication. The owner of the log stream writes records to all the storage nodes in parallel and waits until a configured quorum of storage nodes has acknowledged before the records are considered committed. The storage nodes acknowledge a write request only after the data has been persisted to disk by explicitly calling flush. The owner of the log stream also maintains the offset of the last committed record for the log stream, known as the LAC (LastAddConfirmed) in Apache BookKeeper. The LAC is piggybacked on entries (to save extra RPC calls) and continuously propagated to the storage nodes. The size of the replica set in DistributedLog is configured per stream and fixed per log segment. A change of replication settings affects only newly allocated log segments, not existing ones.</p>
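+
+<p>As a hypothetical sketch of the commit rule described above (not BookKeeper's actual code): the writer counts acknowledgements per entry, marks an entry committed once it reaches the ack quorum, and advances the LAC only over a contiguous prefix of committed entries.</p>
+
+<pre><code class="language-java">// Hypothetical sketch of LAC advancement on the stream owner.
+// Uses java.util.{Map, HashMap, Set, HashSet}.
+class LacTracker {
+  private final int ackQuorumSize;
+  private final Map&lt;Long, Integer&gt; acks = new HashMap&lt;&gt;();
+  private final Set&lt;Long&gt; committed = new HashSet&lt;&gt;();
+  private long lac = -1L; // last add confirmed
+
+  LacTracker(int ackQuorumSize) { this.ackQuorumSize = ackQuorumSize; }
+
+  synchronized void onBookieAck(long entryId) {
+    if (acks.merge(entryId, 1, Integer::sum) == ackQuorumSize) {
+      committed.add(entryId);
+      // The LAC advances only over a contiguous prefix of committed entries.
+      while (committed.contains(lac + 1)) {
+        lac++;
+      }
+    }
+  }
+
+  synchronized long lastAddConfirmed() { return lac; }
+}
+</code></pre>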
+
+<h3 id="storage">Storage</h3>
+
+<p>Each Kafka partition is stored as a (set of) file(s) on the broker’s disks. It achieves good performance by leveraging the filesystem page cache and I/O scheduling. This is also how Kafka leverages Java’s sendfile API to get data in and out of brokers efficiently. However, it can be exposed to unpredictable performance when page cache evictions become frequent for various reasons (e.g. a consumer falls behind, or random writes/reads occur).</p>
+
+<p>DistributedLog uses a different I/O model. Figure 3 illustrates the I/O system on the bookies (BookKeeper’s storage nodes). The three common I/O workloads - writes (blue lines), tailing-reads (red lines) and catchup-reads (purple lines) - are separated into three physically distinct I/O subsystems. All writes are sequentially appended to journal files on the journal disks and group-committed to disk. After the writes are durably persisted, they are put into a memtable and acknowledged back to the clients. The data in the memtable is asynchronously flushed into an interleaved indexed data structure: the entries are appended to entry log files and the offsets are indexed by entry ids in the ledger index files. The latest data is guaranteed to be in the memtable to serve tailing reads. Catchup reads read from the entry log files. With this physical isolation, a bookie node can fully utilize the network ingress bandwidth and sequential write bandwidth on the journal disk for writes, and the network egress bandwidth and IOPS on multiple ledger disks for reads, without the two interfering with each other.</p>
+
+<p><img class="center-block" 
src="/images/blog/kafka-distributedlog-bookkeeper-io.png" alt="BookKeeper I/O 
Isolation" width="500" /></p>
+<div class="text-center">
+<i>Figure 3 - BookKeeper I/O Isolation</i>
+</div>
+
+<h2 id="summary">Summary</h2>
+
+<p>Both Kafka and DistributedLog are designed for log-stream-related workloads. They share some similarities, but different design principles for storage and replication lead to different implementations. Hopefully this blog post clarifies the differences and addresses some of the questions. We are also working on follow-up blog posts to explain the performance characteristics of DistributedLog.</p>
+
+<h2 id="references">References</h2>
+
+<div class="footnotes">
+  <ol>
+    <li id="fn:corelibrary">
+      <p>Applications can also use the core library directly to append log 
records. This is very useful for use cases like replicated state machines that 
require ordering and exclusive write semantics. <a href="#fnref:corelibrary" 
class="reversefootnote">&#8617;</a></p>
+    </li>
+  </ol>
+</div>
+
+
+    </div>
+  </div>
+
+</article>
+
+      </div>
+
+
+    <hr>
+  <div class="row">
+      <div class="col-xs-12">
+          <footer>
+              <p class="text-center">&copy; Copyright 2016
+                  <a href="http://www.apache.org";>The Apache Software 
Foundation.</a> All Rights Reserved.
+              </p>
+              <p class="text-center">
+                  <a href="/feed.xml">RSS Feed</a>
+              </p>
+          </footer>
+      </div>
+  </div>
+  <!-- container div end -->
+</div>
+
+
+  </body>
+
+</html>
