Sean-Gu commented on a change in pull request #982: Realtime doc for lambda mode
URL: https://github.com/apache/kylin/pull/982#discussion_r357558963
##########
File path: website/_docs30/tutorial/lambda_mode_and_timezone_realtime_olap.md
##########
@@ -0,0 +1,174 @@
+---
+layout: docs30
+title: Lambda mode and Timezone in Real-time OLAP
+categories: tutorial
+permalink: /docs30/tutorial/lambda_mode_and_timezone_realtime_olap.html
+---
+
+Kylin v3.0.0 will release the real-time OLAP feature. With the power of the newly added streaming receiver cluster, Kylin can query streaming data with sub-second latency. You can check [this tech blog](/blog/2019/04/12/rt-streaming-design/) for the overall design and core concepts.
+
+If you want a step-by-step tutorial, please check [this tutorial](/docs30/tutorial/realtime_olap.html).
+In this article, we will introduce how to update segments and how to set the timezone for the derived time columns in a real-time OLAP cube.
+
+# Background
+
+Say we have a Kafka message which looks like this:
+
+{% highlight Groff markup %}
+{
+ "s_nation":"SAUDI ARABIA",
+ "lo_supplycost":74292,
+ "p_category":"MFGR#0910",
+ "local_day_hour_minute":"09_21_44",
+ "event_time":"2019-12-09 08:44:50.000-0500",
+ "local_day_hour":"09_21",
+ "lo_quantity":12,
+ "lo_revenue":1411548,
+ "p_brand":"MFGR#0910051",
+ "s_region":"MIDDLE EAST",
+ "lo_discount":5,
+ "customer_info":{
+ "CITY":"CHINA 057",
+ "REGION":"ASIA",
+ "street":"CHINA 05721",
+ "NATION":"CHINA"
+ },
+ "d_year":1994,
+ "d_weeknuminyear":30,
+ "p_mfgr":"MFGR#09",
+ "v_revenue":7429200,
+ "d_yearmonth":"Jul1994",
+ "s_city":"SAUDI ARA15",
+ "profit_ratio":0.05263157894736842,
+ "d_yearmonthnum":199407,
+ "round":1
+}
+{% endhighlight %}
+
+This sample comes from SSB, with some additional fields such as *event_time*, which is the timestamp of the current event.
+We also assume that events come from countries in different timezones; "2019-12-09 08:44:50.000-0500" indicates an event which comes from the 'America/New_York' timezone. You may have events which come from 'Asia/Shanghai' as well.
+
+*local_day_hour_minute* is a column whose value is in the producer's local timezone; in this sample it is in "GMT+8".
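As a quick sanity check, the sample values line up: converting *event_time* (which carries a -0500 offset) to GMT+8 reproduces the *local_day_hour* and *local_day_hour_minute* strings above. A minimal Python sketch (the format strings are our assumption for illustration, not part of Kylin):

```python
from datetime import datetime, timedelta, timezone

# Parse the sample event_time; %z consumes the "-0500" offset
event = datetime.strptime("2019-12-09 08:44:50.000-0500",
                          "%Y-%m-%d %H:%M:%S.%f%z")

# Shift to the producer's local timezone, GMT+8 in this sample
local = event.astimezone(timezone(timedelta(hours=8)))

print(local.strftime("%d_%H"))     # local_day_hour        -> 09_21
print(local.strftime("%d_%H_%M"))  # local_day_hour_minute -> 09_21_44
```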
+
+### Question
+We want to do some real-time OLAP analysis, so you may consider using Real-time OLAP. But you may have some concerns, including:
+
+1. Given that events come from different timezones, will this cause trouble or incorrect query results?
+2. In some cases, the Kafka message contains values which are not what you actually want, say a misspelled dimension value; how can you make corrections? (Or you may want to retrieve some long-late messages which were dropped.)
+3. My query only hits a small time range; how should I write the filter condition to make sure unused segments are purged or skipped from the scan?
+
+### Quick Answer
+Firstly, you can always get correct results in the right timezone for your location: just set *kylin.stream.event.timezone=GMT+N* for all Kylin processes. By default, UTC is used for the *derived time columns*.
+
+Secondly, you cannot update a normal streaming cube, but you can update a streaming cube which is in lambda mode. All you need to prepare is a Hive table which maps to your Kafka events.
+
+Thirdly, yes, you can achieve this by adding *derived time columns* like *MINUTE_START*/*DAY_START* to your filter condition.
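For example, a query whose WHERE clause bounds *DAY_START* lets Kylin skip the segments outside that range instead of scanning them. A hypothetical query against the sample schema (column names taken from the Hive table later in this article):

```sql
SELECT S_REGION, SUM(LO_REVENUE)
FROM LAMBDA_FLAT_TABLE
WHERE DAY_START >= '2019-12-09' AND DAY_START < '2019-12-10'
GROUP BY S_REGION;
```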
+
+# How to do it
+
+### Configure timezone
+We know messages may come from different timezones, but you want the query result to stick to one specific timezone.
+If you live somewhere in GMT+2, please set *kylin.stream.event.timezone=GMT+2* for all Kylin processes.
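For instance, the setting goes into kylin.properties on every node (a sketch; GMT+8 is used here to match the sample data's local timezone):

```
# kylin.properties -- apply on every Kylin process (job/query servers and
# streaming receivers); a restart is typically needed for it to take effect
kylin.stream.event.timezone=GMT+8
```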
+
+
+### Create lambda table
+
+You should create a Hive table in the *default* namespace, and this table should contain all the columns in your cube's dimensions and measures. Please
+be aware that any *derived time columns* like *MINUTE_START*/*DAY_START* which you are interested in should also be included if they appear among your cube's dimension columns.
+
+Depending on the granularity at which you want to update segments, you can choose *HOUR_START* or *DAY_START* as the partition column of this Hive table.
+
+{% highlight Groff markup %}
+use default;
+CREATE EXTERNAL TABLE IF NOT EXISTS lambda_flat_table
+(
+-- event timestamp and debug-purpose columns
+EVENT_TIME timestamp
+,ROUND bigint COMMENT "For debugging purposes: in which round this event was sent by the producer"
+,LOCAL_DAY_HOUR string COMMENT "For debugging purposes, e.g. checking the timezone"
+,LOCAL_MINUTE string COMMENT "For debugging purposes, e.g. checking the timezone"
+
+-- dimension column on fact table
+,LO_QUANTITY bigint
+,LO_DISCOUNT bigint
+
+-- dimension column on dimension table
+,C_REGION string
+,C_NATION string
+,C_CITY string
+
+,D_YEAR int
+,D_YEARMONTH string
+,D_WEEKNUMINYEAR int
+,D_YEARMONTHNUM int
+
+,S_REGION string
+,S_NATION string
+,S_CITY string
+
+,P_CATEGORY string
+,P_BRAND string
+,P_MFGR string
+
+
+-- measure column on fact table
+,V_REVENUE bigint
+,LO_SUPPLYCOST bigint
+,LO_REVENUE bigint
+,PROFIT_RATIO double
+
+-- for kylin used
+,MINUTE_START timestamp
+,HOUR_START timestamp
+,MONTH_START date
+)
+PARTITIONED BY (DAY_START date)
+STORED AS SEQUENCEFILE
+LOCATION 'hdfs:///LacusDir/lambda_flat_table';
+{% endhighlight %}
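To see how the derived time columns in this table relate to each other, here is an illustrative Python sketch that truncates the sample event_time into MINUTE_START/HOUR_START/DAY_START/MONTH_START in the configured GMT+8 timezone (our reading of the truncation semantics; Kylin computes these columns internally):

```python
from datetime import datetime, timedelta, timezone

# Illustrative only: truncate event_time into the derived time columns
# in the timezone set by kylin.stream.event.timezone (GMT+8 here).
tz = timezone(timedelta(hours=8))
event = datetime.strptime("2019-12-09 08:44:50.000-0500",
                          "%Y-%m-%d %H:%M:%S.%f%z").astimezone(tz)

minute_start = event.replace(second=0, microsecond=0)
hour_start   = minute_start.replace(minute=0)
day_start    = hour_start.date()                # the partition column
month_start  = day_start.replace(day=1)

print(minute_start.isoformat(), hour_start.isoformat(),
      day_start, month_start)
```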
+
+
+### Create streaming cube in Kylin
+The first step is to add the broker list and topic name;
+after that, you should paste a sample message into the left panel and let Kylin auto-detect the column names and column types.
+You may find that some data types are not correct; please change them to the real data type and make sure they align with the Hive table.
+
+For example, you should change the data type of event_time from varchar to timestamp.
+And some column names are not the same as in the Hive table, so please correct them too, such as `customer_info_REGION` to `C_REGION`.
Review comment:
And some column names are...
so please correct them too, such as `customer_info_REGION` to `C_REGION`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services