Repository: kylin Updated Branches: refs/heads/document 38cd798d0 -> 6e3a4bc00
minor update on the blog Project: http://git-wip-us.apache.org/repos/asf/kylin/repo Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/6e3a4bc0 Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/6e3a4bc0 Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/6e3a4bc0 Branch: refs/heads/document Commit: 6e3a4bc004ca2bbb050d3851cdac4b5a21794dd4 Parents: 38cd798 Author: shaofengshi <shaofeng...@apache.org> Authored: Tue Oct 18 22:25:37 2016 +0800 Committer: shaofengshi <shaofeng...@apache.org> Committed: Tue Oct 18 22:25:37 2016 +0800 ---------------------------------------------------------------------- .../_posts/blog/2016-10-18-new-nrt-streaming.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/kylin/blob/6e3a4bc0/website/_posts/blog/2016-10-18-new-nrt-streaming.md ---------------------------------------------------------------------- diff --git a/website/_posts/blog/2016-10-18-new-nrt-streaming.md b/website/_posts/blog/2016-10-18-new-nrt-streaming.md index e1cb845..ad75636 100644 --- a/website/_posts/blog/2016-10-18-new-nrt-streaming.md +++ b/website/_posts/blog/2016-10-18-new-nrt-streaming.md @@ -18,16 +18,16 @@ While, that implementation was marked as "experimental" because it has the follo * Others: hard to recover from accident, difficult to maintain the code, etc. -To overcome these limitations, the Apache Kylin team developed the new streaming ([KYLIN-1726](https://issues.apache.org/jira/browse/KYLIN-1726)) with Kafka 0.10 API, it has been tested internally for some time, will release to public soon. +To overcome these limitations, the Apache Kylin team developed the new streaming ([KYLIN-1726](https://issues.apache.org/jira/browse/KYLIN-1726)) with Kafka 0.10, it has been tested internally for some time, will release to public soon. -The new design is a perfect implementation under Kylin 1.5's "Plug-in" architecture: treat Kafka topic as a "Data Source" like Hive table, using an adapter to extract the data to HDFS; the next steps are almost the same as from Hive. Figure 1 is a high level architecture of the new design. +The new design is a perfect implementation under Kylin 1.5's "plug-in" architecture: treat Kafka topic as a "Data Source" like Hive table, using an adapter to extract the data to HDFS; the next steps are almost the same as other cubes. Figure 1 is a high level architecture of the new design. ![Kylin New Streaming Framework Architecture](/images/blog/new-streaming.png) -The adapter to read Kafka messages is modified from [kafka-hadoop-loader](https://github.com/amient/kafka-hadoop-loader), which is open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading and then saving the messages to HDFS; in next steps Kylin will be able to leverage existing framework like MR to do the processing, this makes the solution scalable and fault-tolerant. +The adapter to read Kafka messages is modified from [kafka-hadoop-loader](https://github.com/amient/kafka-hadoop-loader), the author Michal Harish open sourced it under Apache License V2.0; it starts a mapper for each Kafka partition, reading and then saving the messages to HDFS; so Kylin will be able to leverage existing framework like MR to do the processing, this makes the solution scalable and fault-tolerant. -To overcome the "data loss" problem, Kylin adds the start/end offset information on each Cube segment, and then use the offsets as the partition value (no overlap is allowed); this ensures no data be lost and 1 message be consumed at most once. To let the late/early message can be queried, Cube segments allow overlap for the partition time dimension: Kylin will scan all segments which include the queried time. Figure 2 illurates this. +To overcome the "data loss" limitation, Kylin adds the start/end offset information on each Cube segment, and then use the offsets as the partition value (no overlap allowed); this ensures no data be lost and 1 message be consumed at most once. To let the late/early message can be queried, Cube segments allow overlap for the partition time dimension: each segment has a "min" date/time and a "max" date/time; Kylin will scan all segments which matched with the queried time scope. Figure 2 illurates this. ![Use Offset to Cut Segments](/images/blog/offset-as-partition-value.png) @@ -39,18 +39,20 @@ Other changes/enhancements are made in the new streaming: * Add REST API to trigger streaming cube's building * Add REST API to check and fill the segment holes -The integration test result shows big improvements than the previous version: +The integration test result is promising: * Scalability: it can easily process up to hundreds of million records in one build; - * Flexibility: trigger the build at any time with the frequency you want, e.g: every 5 minutes in day and every hour in night; Kylin manages the offsets so it can resume from the last position; - * Stability: pretty stable, no OutOfMemory error; + * Flexibility: you can trigger the build at any time, with the frequency you want; for example: every 5 minutes in day time but every hour in night time, and even pause when you need do a maintenance; Kylin manages the offsets so it can automatically continue from the last position; + * Stability: pretty stable, no OutOfMemoryError; * Management: user can check all jobs' status through Kylin's "Monitor" page or REST API; * Build Performance: in a testing cluster (8 AWS instances to consume Twitter streams), 10 thousands arrives per second, define a 9-dimension cube with 3 measures; when build interval is 2 mintues, the job finishes in around 3 minutes; if change interval to 5 mintues, build finishes in around 4 minutes; -Here are a couple of screenshots in this test: +Here are a couple of screenshots in this test, we may compose it as a step-by-step tutorial in the future: ![Streaming Job Monitoring](/images/blog/streaming-monitor.png) ![Streaming Adapter](/images/blog/streaming-adapter.png) ![Streaming Twitter Sample](/images/blog/streaming-twitter.png) + +In short, this is a more robust Near Real Time Streaming OLAP solution (compared with the previous version). Nextly, the Apache Kylin team will move toward a Real Time engine.