Repository: kylin
Updated Branches:
  refs/heads/document 8989d052c -> 38cd798d0

Add blog for new streaming


Branch: refs/heads/document
Commit: 38cd798d0e656cb202f2d530e6fdd44491f6353a
Parents: 8989d05
Author: shaofengshi <>
Authored: Tue Oct 18 14:59:48 2016 +0800
Committer: shaofengshi <>
Committed: Tue Oct 18 14:59:48 2016 +0800

 .../_posts/blog/ |  56 +++++++++++++++++++
 website/images/blog/new-streaming.png           | Bin 0 -> 152820 bytes
 .../images/blog/offset-as-partition-value.png   | Bin 0 -> 121697 bytes
 website/images/blog/streaming-adapter.png       | Bin 0 -> 117956 bytes
 website/images/blog/streaming-monitor.png       | Bin 0 -> 227754 bytes
 website/images/blog/streaming-twitter.png       | Bin 0 -> 90771 bytes
 6 files changed, 56 insertions(+)
diff --git a/website/_posts/blog/ 
new file mode 100644
index 0000000..e1cb845
--- /dev/null
+++ b/website/_posts/blog/
@@ -0,0 +1,56 @@
+layout: post-blog
+title:  New NRT Streaming in Apache Kylin
+date:   2016-10-18 17:30:00
+author: Shaofeng Shi
+categories: blog
+In 1.5.0 Apache Kylin introduced the Streaming Cubing feature, which can consume data from a Kafka topic directly. This [blog](/blog/2016/02/03/streaming-cubing/) introduces how that was implemented, and this [tutorial](/docs15/tutorial/cube_streaming.html) introduces how to use it.
+However, that implementation was marked as "experimental" because it has the following limitations:
+ * Not scalable: it starts a standalone Java process for each micro-batch cube build instead of leveraging a computing framework, so if too many messages arrive at once the build may fail with an OutOfMemory error.
+ * May lose data: it uses a time window to seek the approximate start/end offsets on the Kafka topic, which means messages that arrive too late or too early are skipped, so queries cannot guarantee 100% accuracy.
+ * Difficult to monitor: the streaming cubing runs outside the job engine's scope, so users cannot monitor the jobs with the Web GUI or REST API.
+ * Others: hard to recover from accidents, difficult to maintain the code, etc.
+To overcome these limitations, the Apache Kylin team developed a new streaming design ([KYLIN-1726]( with the Kafka 0.10 API. It has been tested internally for some time and will be released to the public soon.
+The new design is a natural fit for Kylin 1.5's "plug-in" architecture: a Kafka topic is treated as a "Data Source" just like a Hive table, and an adapter extracts the data to HDFS; the subsequent steps are almost the same as building from Hive. Figure 1 shows a high-level architecture of the new design.
+![Kylin New Streaming Framework Architecture](/images/blog/new-streaming.png)
+The adapter that reads Kafka messages is modified from [kafka-hadoop-loader](, which is open sourced under Apache License v2.0. It starts one mapper for each Kafka partition, reading and then saving the messages to HDFS; in the subsequent steps Kylin can leverage an existing computing framework such as MapReduce to do the processing, which makes the solution scalable and fault-tolerant.
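The one-mapper-per-partition idea can be sketched as follows (a minimal illustration with hypothetical names, not Kylin's actual classes): each Kafka partition's [start, end) offset range becomes one input split, so the read work scales with the number of partitions.

```python
# Hypothetical sketch of "one mapper per Kafka partition": each
# partition's [start, end) offset range becomes one input split.

def make_splits(partition_offsets):
    """partition_offsets: {partition_id: (start_offset, end_offset)}"""
    splits = []
    for pid, (start, end) in sorted(partition_offsets.items()):
        if end > start:  # skip partitions with no new messages
            splits.append({"partition": pid, "start": start, "end": end})
    return splits

# one split (one mapper) per non-empty partition
splits = make_splits({0: (100, 250), 1: (90, 90), 2: (0, 400)})
```

Because each split is an independent offset range, a failed mapper can simply be re-run over the same range without affecting the others.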
+To overcome the "data loss" problem, Kylin records the start/end offsets on each cube segment and uses the offsets as the partition value (no overlap is allowed); this ensures no data is lost and each message is consumed at most once. To make late/early messages queryable, cube segments are allowed to overlap on the partition time dimension: Kylin will scan all segments whose time range includes the queried time. Figure 2 illustrates this.
+![Use Offset to Cut Segments](/images/blog/offset-as-partition-value.png)
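The scheme above can be sketched in a few lines (illustrative data model, not Kylin's internal one): segments never overlap in offsets, but their event-time ranges may overlap, so a query for a given time scans every segment whose time range covers it.

```python
# Sketch of "offsets cut segments, time ranges may overlap".
# Offsets are contiguous and non-overlapping; event-time ranges are not.

segments = [
    # (offset_start, offset_end, min_event_time, max_event_time)
    (0,    1000, "2016-10-18 10:00", "2016-10-18 10:07"),
    (1000, 2000, "2016-10-18 10:05", "2016-10-18 10:12"),  # time overlap OK
]

def segments_for_time(segments, t):
    """Return all segments whose event-time range includes t."""
    return [s for s in segments if s[2] <= t <= s[3]]

# a query at 10:06 must scan both segments, so a late-arriving
# message stored in the second segment is still found
hits = segments_for_time(segments, "2016-10-18 10:06")
```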
+Other changes/enhancements made in the new streaming:
+ * Allow multiple segments to be built/merged concurrently
+ * Automatically seek the start/end offsets (if the user doesn't specify them) from the previous segment or from Kafka
+ * Support embedded properties in JSON messages
+ * Add a REST API to trigger a streaming cube's building
+ * Add a REST API to check and fill segment holes
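A build trigger might look like the following sketch. The `/kylin/api/cubes/{cube}/build2` endpoint and payload fields follow the Kylin streaming tutorial for this release; treat the exact endpoint, credentials, and field names as assumptions and check the docs for your version.

```python
# Sketch of triggering a streaming cube build over Kylin's REST API.
# Endpoint/payload are assumptions taken from the streaming tutorial.
import base64
import json
from urllib import request

def build_request(host, cube, user, password):
    url = "http://%s/kylin/api/cubes/%s/build2" % (host, cube)
    # 0 .. Long.MAX_VALUE asks Kylin to seek the actual offsets itself
    payload = json.dumps({
        "sourceOffsetStart": 0,
        "sourceOffsetEnd": 9223372036854775807,
        "buildType": "BUILD",
    }).encode("utf-8")
    req = request.Request(url, data=payload, method="PUT")
    req.add_header("Content-Type", "application/json;charset=utf-8")
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req

req = build_request("localhost:7070", "my_streaming_cube", "ADMIN", "KYLIN")
# request.urlopen(req) would submit the build job
```

Because offsets (not wall-clock time) delimit segments, calling this from a crontab at whatever frequency you like is safe: each build resumes from the previous segment's end offset.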
+The integration test results show big improvements over the previous version:
+ * Scalability: it can easily process up to hundreds of millions of records in one build
+ * Flexibility: trigger the build at any time, with whatever frequency you want, e.g., every 5 minutes during the day and every hour at night; Kylin manages the offsets so it can resume from the last position
+ * Stability: pretty stable, no OutOfMemory errors
+ * Management: users can check all jobs' status through Kylin's "Monitor" page
+ * Build performance: in a testing cluster (8 AWS instances consuming Twitter streams at about 10 thousand messages per second, with a 9-dimension cube with 3 measures defined), when the build interval is 2 minutes the job finishes in around 3 minutes; with a 5-minute interval the build finishes in around 4 minutes
+Here are a couple of screenshots from this test:
+![Streaming Job Monitoring](/images/blog/streaming-monitor.png)
+![Streaming Adapter](/images/blog/streaming-adapter.png)
+![Streaming Twitter Sample](/images/blog/streaming-twitter.png)
diff --git a/website/images/blog/new-streaming.png 
new file mode 100644
index 0000000..2e68639
Binary files /dev/null and b/website/images/blog/new-streaming.png differ
diff --git a/website/images/blog/offset-as-partition-value.png 
new file mode 100644
index 0000000..3691fce
Binary files /dev/null and b/website/images/blog/offset-as-partition-value.png differ
diff --git a/website/images/blog/streaming-adapter.png 
new file mode 100644
index 0000000..be227e3
Binary files /dev/null and b/website/images/blog/streaming-adapter.png differ
diff --git a/website/images/blog/streaming-monitor.png 
new file mode 100644
index 0000000..6d4750b
Binary files /dev/null and b/website/images/blog/streaming-monitor.png differ
diff --git a/website/images/blog/streaming-twitter.png 
new file mode 100644
index 0000000..9c346f7
Binary files /dev/null and b/website/images/blog/streaming-twitter.png differ
