[Proposal] Kylin Streaming Cube Builder

Luke Han Tue, 23 Dec 2014 06:32:07 -0800

Hi all,
    Please refer to the new proposal for a Kylin Streaming Cube Builder from
Branky Shao:

https://github.com/KylinOLAP/Kylin/wiki/%5BProposal%5D-Kylin-Streaming-Cube-Builder
.

Please reply here with any suggestions or ideas.

Thanks.

Luke

--------Text copy--------

Kylin Streaming Cube Builder

-- By Branky Shao <https://github.com/branky>, 2014-12-22

Proposal

Although Kylin provides sub-second OLAP analysis latency
<http://en.wikipedia.org/wiki/Real-time_business_intelligence>, its data latency
<http://en.wikipedia.org/wiki/Real-time_business_intelligence> is still
very long, because it uses Hive tables generated by batch ETL as the source
for building cubes. The ETL process usually takes hours to finish, and the
cube building process itself also needs a few hours. Users therefore cannot
use Kylin to analyze up-to-the-minute data.

Currently, Kylin's cube builder uses cube metadata (defined by the cube admin
or designer) to build a cube from one fact table and several dimension
tables in Hive. A cube is stored as one or more HTables in HBase; each
HTable is called a segment of the cube. The metadata is also stored in HBase.
The dimension tables are imported into HBase as well and stored as snapshots.
Kylin's query engine parses the SQL query from the client and fetches the
required data from HBase.
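
As a rough illustration of what "building a cube" means here, the following Python sketch pre-aggregates a tiny fact table for every combination of dimensions and then answers a GROUP BY by lookup. It is not Kylin code: the row data, the names build_cuboids and query, and the in-memory dictionaries (standing in for HTables in HBase) are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows, assumed to be already joined with dimension tables.
FACT_ROWS = [
    {"country": "US", "category": "books", "price": 10.0},
    {"country": "US", "category": "toys", "price": 5.0},
    {"country": "CN", "category": "books", "price": 7.0},
]
DIMENSIONS = ("country", "category")
MEASURE = "price"


def build_cuboids(rows, dimensions, measure):
    """Return {dimension subset: {dimension values: SUM(measure)}}."""
    cuboids = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cuboids[dims] = dict(totals)
    return cuboids


def query(cuboids, group_by):
    """Answer a GROUP BY by looking up the matching precomputed cuboid."""
    return cuboids[tuple(d for d in DIMENSIONS if d in group_by)]


cuboids = build_cuboids(FACT_ROWS, DIMENSIONS, MEASURE)
print(query(cuboids, {"country"}))  # {('US',): 15.0, ('CN',): 7.0}
```

Because every aggregate is computed ahead of time, query latency stays low, but the data is only as fresh as the last build, which is the data latency problem described above.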

To fundamentally reduce the data latency, we can build the cube from stream
data instead of static data (generated by batch ETL). The feasibility of
performing OLAP analysis on high volumes of stream data was studied a
decade ago (Online Analytical Processing Stream Data: Is It Feasible?
<http://www.cse.wustl.edu/~ychen/public/J82.pdf>). The authors proposed a
method called stream_cube, which uses a *tilt time frame*, explores only
cuboids from the *minimal interesting layer* and the *observation layer*, and
adopts an algorithm called *popular path* to partially materialize the cube.
The study showed that the approach is cost-effective and realistic.
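
The tilt time frame idea can be sketched as follows: recent aggregates are kept at fine granularity, and older ones are merged into coarser slots so that storage stays bounded. This is a simplified illustration, not the paper's exact algorithm and not Kylin code. The class name TiltedTimeFrame, the level sizes (4 quarter hours per hour, 24 hours per day, 31 days retained), and the use of SUM as the measure are assumptions.

```python
class TiltedTimeFrame:
    """Keeps recent aggregates fine-grained and older ones coarse-grained.

    LEVELS is an assumption for illustration: 4 quarter-hour slots roll up
    into 1 hour slot, 24 hour slots into 1 day slot, and 31 days are kept.
    """

    LEVELS = (("quarter", 4), ("hour", 24), ("day", 31))

    def __init__(self):
        self.slots = {name: [] for name, _ in self.LEVELS}

    def add_quarter(self, value):
        """Register the aggregate (here a SUM) of one finished quarter hour."""
        self._add(0, value)

    def _add(self, level, value):
        name, capacity = self.LEVELS[level]
        slots = self.slots[name]
        slots.append(value)
        is_coarsest = level == len(self.LEVELS) - 1
        if is_coarsest:
            if len(slots) > capacity:
                slots.pop(0)  # forget the oldest day
        elif len(slots) == capacity:
            # A full level is merged into one slot of the next coarser level.
            merged = sum(slots)
            slots.clear()
            self._add(level + 1, merged)


frame = TiltedTimeFrame()
for _ in range(4 * 24 + 5):  # 101 quarter hours, each with a count of 1
    frame.add_quarter(1)
print(frame.slots)  # {'quarter': [1], 'hour': [4], 'day': [96]}
```

In this run, 101 quarter-hour values are held in three slots, and no count is lost until a day ages out of the coarsest level.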

We can ingest up-to-date data from a real-time messaging system (e.g. Apache
Kafka <http://kafka.apache.org/>) and implement stream_cube as a topology
in Apache Storm <https://storm.apache.org/> to build Kylin cube segments
continuously. This could solve the data latency problem for
Kylin. Below is a high-level architecture diagram of this solution.
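
Separately from the diagram, the continuous data flow can be sketched in plain Python. A generator stands in for the Storm topology and a list of tuples for the Kafka topic; no Kafka or Storm API is used. The function name build_segments, the 60-second window, the event layout, and the assumption that events arrive in time order are all hypothetical choices for the example.

```python
from collections import defaultdict


def build_segments(events, window_seconds):
    """Group a time-ordered event stream into per-window cube segments.

    Each yielded segment is (window_start, {dimension value: SUM(measure)}).
    """
    current_start = None
    totals = defaultdict(float)
    for timestamp, dimension_value, measure in events:
        start = timestamp - timestamp % window_seconds
        if current_start is not None and start != current_start:
            yield current_start, dict(totals)  # window closed: emit segment
            totals = defaultdict(float)
        current_start = start
        totals[dimension_value] += measure
    if current_start is not None:
        yield current_start, dict(totals)  # flush the last open window


# Hypothetical events: (epoch seconds, country, price).
EVENTS = [(0, "US", 10.0), (30, "CN", 7.0), (61, "US", 5.0), (119, "US", 1.0)]
segments = list(build_segments(EVENTS, window_seconds=60))
print(segments)  # [(0, {'US': 10.0, 'CN': 7.0}), (60, {'US': 6.0})]
```

Each emitted segment corresponds to a small slice of time, so new data becomes queryable once its window closes rather than after a multi-hour batch build.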
