kylin git commit: minor update on the blog

shaofengshi Tue, 18 Oct 2016 07:26:52 -0700

Repository: kylin
Updated Branches:
  refs/heads/document 38cd798d0 -> 6e3a4bc00



minor update on the blog


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/6e3a4bc0
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/6e3a4bc0
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/6e3a4bc0

Branch: refs/heads/document
Commit: 6e3a4bc004ca2bbb050d3851cdac4b5a21794dd4
Parents: 38cd798
Author: shaofengshi <shaofeng...@apache.org>
Authored: Tue Oct 18 22:25:37 2016 +0800
Committer: shaofengshi <shaofeng...@apache.org>
Committed: Tue Oct 18 22:25:37 2016 +0800

----------------------------------------------------------------------
 .../_posts/blog/2016-10-18-new-nrt-streaming.md   | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/6e3a4bc0/website/_posts/blog/2016-10-18-new-nrt-streaming.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2016-10-18-new-nrt-streaming.md 
b/website/_posts/blog/2016-10-18-new-nrt-streaming.md
index e1cb845..ad75636 100644
--- a/website/_posts/blog/2016-10-18-new-nrt-streaming.md
+++ b/website/_posts/blog/2016-10-18-new-nrt-streaming.md
@@ -18,16 +18,16 @@ While, that implementation was marked as "experimental" 
because it has the follo
 
  * Others: hard to recover from accident, difficult to maintain the code, etc.
 
-To overcome these limitations, the Apache Kylin team developed the new 
streaming ([KYLIN-1726](https://issues.apache.org/jira/browse/KYLIN-1726)) with 
Kafka 0.10 API, it has been tested internally for some time, will release to 
public soon.
+To overcome these limitations, the Apache Kylin team developed the new 
streaming ([KYLIN-1726](https://issues.apache.org/jira/browse/KYLIN-1726)) with 
Kafka 0.10, it has been tested internally for some time, will release to public 
soon.
 
-The new design is a perfect implementation under Kylin 1.5's "Plug-in" 
architecture: treat Kafka topic as a "Data Source" like Hive table, using an 
adapter to extract the data to HDFS; the next steps are almost the same as from 
Hive. Figure 1 is a high level architecture of the new design.
+The new design is a perfect implementation under Kylin 1.5's "plug-in" 
architecture: treat Kafka topic as a "Data Source" like Hive table, using an 
adapter to extract the data to HDFS; the next steps are almost the same as 
other cubes. Figure 1 is a high level architecture of the new design.
 
 
 ![Kylin New Streaming Framework Architecture](/images/blog/new-streaming.png)
 
-The adapter to read Kafka messages is modified from 
[kafka-hadoop-loader](https://github.com/amient/kafka-hadoop-loader), which is 
open sourced under Apache License V2.0; it starts a mapper for each Kafka 
partition, reading and then saving the messages to HDFS; in next steps Kylin 
will be able to leverage existing framework like MR to do the processing, this 
makes the solution scalable and fault-tolerant. 
+The adapter to read Kafka messages is modified from 
[kafka-hadoop-loader](https://github.com/amient/kafka-hadoop-loader), the 
author Michal Harish open sourced it under Apache License V2.0; it starts a 
mapper for each Kafka partition, reading and then saving the messages to HDFS; 
so Kylin will be able to leverage existing framework like MR to do the 
processing, this makes the solution scalable and fault-tolerant. 
 
-To overcome the "data loss" problem, Kylin adds the start/end offset 
information on each Cube segment, and then use the offsets as the partition 
value (no overlap is allowed); this ensures no data be lost and 1 message be 
consumed at most once. To let the late/early message can be queried, Cube 
segments allow overlap for the partition time dimension: Kylin will scan all 
segments which include the queried time. Figure 2 illurates this.
+To overcome the "data loss" limitation, Kylin adds the start/end offset 
information on each Cube segment, and then use the offsets as the partition 
value (no overlap allowed); this ensures no data be lost and 1 message be 
consumed at most once. To let the late/early message can be queried, Cube 
segments allow overlap for the partition time dimension: each segment has a 
"min" date/time and a "max" date/time; Kylin will scan all segments which 
matched with the queried time scope. Figure 2 illurates this.
 
 ![Use Offset to Cut Segments](/images/blog/offset-as-partition-value.png)
 
@@ -39,18 +39,20 @@ Other changes/enhancements are made in the new streaming:
  * Add REST API to trigger streaming cube's building
  * Add REST API to check and fill the segment holes
 
-The integration test result shows big improvements than the previous version:
+The integration test result is promising:
 
  * Scalability: it can easily process up to hundreds of million records in one 
build; 
- * Flexibility: trigger the build at any time with the frequency you want, 
e.g: every 5 minutes in day and every hour in night; Kylin manages the offsets 
so it can resume from the last position;
- * Stability: pretty stable, no OutOfMemory error;
+ * Flexibility: you can trigger the build at any time, with the frequency you 
want; for example: every 5 minutes in day time but every hour in night time, 
and even pause when you need do a maintenance; Kylin manages the offsets so it 
can automatically continue from the last position;
+ * Stability: pretty stable, no OutOfMemoryError;
  * Management: user can check all jobs' status through Kylin's "Monitor" page 
or REST API; 
  * Build Performance: in a testing cluster (8 AWS instances to consume Twitter 
streams), 10 thousands arrives per second, define a 9-dimension cube with 3 
measures; when build interval is 2 mintues, the job finishes in around 3 
minutes; if change interval to 5 mintues, build finishes in around 4 minutes;
 
 
-Here are a couple of screenshots in this test:
+Here are a couple of screenshots in this test, we may compose it as a 
step-by-step tutorial in the future:
 ![Streaming Job Monitoring](/images/blog/streaming-monitor.png)
 
 ![Streaming Adapter](/images/blog/streaming-adapter.png)
 
 ![Streaming Twitter Sample](/images/blog/streaming-twitter.png)
+
+In short, this is a more robust Near Real Time Streaming OLAP solution 
(compared with the previous version). Nextly, the Apache Kylin team will move 
toward a Real Time engine.

kylin git commit: minor update on the blog

Reply via email to