Repository: kylin Updated Branches: refs/heads/document 8989d052c -> 38cd798d0
Add blog for new streaming Project: http://git-wip-us.apache.org/repos/asf/kylin/repo Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/38cd798d Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/38cd798d Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/38cd798d Branch: refs/heads/document Commit: 38cd798d0e656cb202f2d530e6fdd44491f6353a Parents: 8989d05 Author: shaofengshi <shaofeng...@apache.org> Authored: Tue Oct 18 14:59:48 2016 +0800 Committer: shaofengshi <shaofeng...@apache.org> Committed: Tue Oct 18 14:59:48 2016 +0800 ---------------------------------------------------------------------- .../_posts/blog/2016-10-18-new-nrt-streaming.md | 56 +++++++++++++++++++ website/images/blog/new-streaming.png | Bin 0 -> 152820 bytes .../images/blog/offset-as-partition-value.png | Bin 0 -> 121697 bytes website/images/blog/streaming-adapter.png | Bin 0 -> 117956 bytes website/images/blog/streaming-monitor.png | Bin 0 -> 227754 bytes website/images/blog/streaming-twitter.png | Bin 0 -> 90771 bytes 6 files changed, 56 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/kylin/blob/38cd798d/website/_posts/blog/2016-10-18-new-nrt-streaming.md ---------------------------------------------------------------------- diff --git a/website/_posts/blog/2016-10-18-new-nrt-streaming.md b/website/_posts/blog/2016-10-18-new-nrt-streaming.md new file mode 100644 index 0000000..e1cb845 --- /dev/null +++ b/website/_posts/blog/2016-10-18-new-nrt-streaming.md @@ -0,0 +1,56 @@ +--- +layout: post-blog +title: New NRT Streaming in Apache Kylin +date: 2016-10-18 17:30:00 +author: Shaofeng Shi +categories: blog +--- + +In 1.5.0 Apache Kylin introduces the Streaming Cubing feature, which can consume data from Kafka topic directly. This [blog](/blog/2016/02/03/streaming-cubing/) introduces how that be implemented, and this [tutorial](/docs15/tutorial/cube_streaming.html) introduces how to use it. + +While, that implementation was marked as "experimental" because it has the following limitations: + + * Not scalable: it starts a Java process for a micro-batch cube building, instead of leveraging any computing framework; If too many messages arrive at one time, the build may fail with OutOfMemory error; + + * May loss data: it uses a time window to seek the approximate start/end offsets on Kafka topic, which means too late/early arrived messages will be skipped; Then the query couldn't ensure 100% accuracy. + + * Difficult to monitor: the streaming cubing is out of the Job engine's scope, user can not monitor the jobs with Web GUI or REST API. + + * Others: hard to recover from accident, difficult to maintain the code, etc. + +To overcome these limitations, the Apache Kylin team developed the new streaming ([KYLIN-1726](https://issues.apache.org/jira/browse/KYLIN-1726)) with Kafka 0.10 API, it has been tested internally for some time, will release to public soon. + +The new design is a perfect implementation under Kylin 1.5's "Plug-in" architecture: treat Kafka topic as a "Data Source" like Hive table, using an adapter to extract the data to HDFS; the next steps are almost the same as from Hive. Figure 1 is a high level architecture of the new design. + + +![Kylin New Streaming Framework Architecture](/images/blog/new-streaming.png) + +The adapter to read Kafka messages is modified from [kafka-hadoop-loader](https://github.com/amient/kafka-hadoop-loader), which is open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading and then saving the messages to HDFS; in next steps Kylin will be able to leverage existing framework like MR to do the processing, this makes the solution scalable and fault-tolerant. + +To overcome the "data loss" problem, Kylin adds the start/end offset information on each Cube segment, and then use the offsets as the partition value (no overlap is allowed); this ensures no data be lost and 1 message be consumed at most once. To let the late/early message can be queried, Cube segments allow overlap for the partition time dimension: Kylin will scan all segments which include the queried time. Figure 2 illurates this. + +![Use Offset to Cut Segments](/images/blog/offset-as-partition-value.png) + +Other changes/enhancements are made in the new streaming: + + * Allow multiple segments being built/merged concurrently + * Automatically seek start/end offsets (if user doesn't specify) from previous segment or Kafka + * Support embeded properties in JSON message + * Add REST API to trigger streaming cube's building + * Add REST API to check and fill the segment holes + +The integration test result shows big improvements than the previous version: + + * Scalability: it can easily process up to hundreds of million records in one build; + * Flexibility: trigger the build at any time with the frequency you want, e.g: every 5 minutes in day and every hour in night; Kylin manages the offsets so it can resume from the last position; + * Stability: pretty stable, no OutOfMemory error; + * Management: user can check all jobs' status through Kylin's "Monitor" page or REST API; + * Build Performance: in a testing cluster (8 AWS instances to consume Twitter streams), 10 thousands arrives per second, define a 9-dimension cube with 3 measures; when build interval is 2 mintues, the job finishes in around 3 minutes; if change interval to 5 mintues, build finishes in around 4 minutes; + + +Here are a couple of screenshots in this test: +![Streaming Job Monitoring](/images/blog/streaming-monitor.png) + +![Streaming Adapter](/images/blog/streaming-adapter.png) + +![Streaming Twitter Sample](/images/blog/streaming-twitter.png) http://git-wip-us.apache.org/repos/asf/kylin/blob/38cd798d/website/images/blog/new-streaming.png ---------------------------------------------------------------------- diff --git a/website/images/blog/new-streaming.png b/website/images/blog/new-streaming.png new file mode 100644 index 0000000..2e68639 Binary files /dev/null and b/website/images/blog/new-streaming.png differ http://git-wip-us.apache.org/repos/asf/kylin/blob/38cd798d/website/images/blog/offset-as-partition-value.png ---------------------------------------------------------------------- diff --git a/website/images/blog/offset-as-partition-value.png b/website/images/blog/offset-as-partition-value.png new file mode 100644 index 0000000..3691fce Binary files /dev/null and b/website/images/blog/offset-as-partition-value.png differ http://git-wip-us.apache.org/repos/asf/kylin/blob/38cd798d/website/images/blog/streaming-adapter.png ---------------------------------------------------------------------- diff --git a/website/images/blog/streaming-adapter.png b/website/images/blog/streaming-adapter.png new file mode 100644 index 0000000..be227e3 Binary files /dev/null and b/website/images/blog/streaming-adapter.png differ http://git-wip-us.apache.org/repos/asf/kylin/blob/38cd798d/website/images/blog/streaming-monitor.png ---------------------------------------------------------------------- diff --git a/website/images/blog/streaming-monitor.png b/website/images/blog/streaming-monitor.png new file mode 100644 index 0000000..6d4750b Binary files /dev/null and b/website/images/blog/streaming-monitor.png differ http://git-wip-us.apache.org/repos/asf/kylin/blob/38cd798d/website/images/blog/streaming-twitter.png ---------------------------------------------------------------------- diff --git a/website/images/blog/streaming-twitter.png b/website/images/blog/streaming-twitter.png new file mode 100644 index 0000000..9c346f7 Binary files /dev/null and b/website/images/blog/streaming-twitter.png differ