Hi Yang, The real-time streaming feature is also under review and testing now. I think when both (the new storage and real-time streaming) are ready to be merged, we can propose bumping the version to 3.0.
Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng....@kyligence.io
Kyligence Inc: https://kyligence.io/
Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org

Li Yang <liy...@apache.org> wrote on Tue, Jan 1, 2019 at 12:40 PM:

> From the discussion, apparently a new storage will be added sooner or later.
>
> Will it be a new major version of Kylin, like Apache Kylin 3.0? Also, how
> about migration from the old storage? I assume old cube data has to be
> transformed and loaded into the new storage.
>
> Yang
>
> On Sat, Dec 29, 2018 at 5:52 PM ShaoFeng Shi <shaofeng...@apache.org> wrote:
>
>> Thanks very much for Yiming's and Jiatao's comments; they're very valuable.
>> There are many improvements that can be made to this new storage. We welcome
>> all kinds of contributions and would like to improve it together with the
>> community in 2019!
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Work email: shaofeng....@kyligence.io
>> Kyligence Inc: https://kyligence.io/
>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> Join Kylin user mail group: user-subscr...@kylin.apache.org
>> Join Kylin dev mail group: dev-subscr...@kylin.apache.org
>>
>> JiaTao Tao <taojia...@gmail.com> wrote on Wed, Dec 19, 2018 at 8:44 PM:
>>
>> > Hi all,
>> >
>> > Truly agreed with Yiming, and here I expand a little more on
>> > "distributed computing".
>> >
>> > As Yiming mentioned, Kylin parses the query into an execution plan
>> > using Calcite (Kylin changes the execution plan because the data in cubes
>> > is already aggregated, so we cannot use the original plan directly). It's
>> > a tree structure: each node represents a specific calculation, and data
>> > flows from bottom to top, applying all these calculations.
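The bottom-up folding of such a plan tree into a DataFrame call chain can be sketched in miniature. This is a toy illustration in plain Python with no Spark dependency; the `Node` and `FakeDF` names are invented for the sketch and are not Kylin code:

```python
# Toy sketch (illustrative only, not Kylin/Spark code): fold a relational
# plan tree into a DataFrame-style call chain by visiting nodes bottom-up.

class Node:
    def __init__(self, kind, arg=None, child=None):
        self.kind, self.arg, self.child = kind, arg, child

class FakeDF:
    """Stands in for a Spark DataFrame; it just records the call chain."""
    def __init__(self, chain):
        self.chain = chain
    def apply(self, op, arg):
        return FakeDF(f"{self.chain}.{op}({arg})")

def to_df(node):
    # Recurse until we reach the TableScan leaf, then "pop" operations back up.
    if node.kind == "TableScan":
        return FakeDF(f"read({node.arg!r})")
    df = to_df(node.child)
    return df.apply(node.kind.lower(), node.arg)

# Sort(Agg(Select(Filter(TableScan)))) -- the same shape as the diagram below
plan = Node("Sort", "x",
        Node("Agg", "sum(y)",
         Node("Select", "x, y",
          Node("Filter", "y > 0",
           Node("TableScan", "cube_table")))))
print(to_df(plan).chain)
# read('cube_table').filter(y > 0).select(x, y).agg(sum(y)).sort(x)
```

Each plan node contributes one call, so the leaf produces the initial DataFrame and every ancestor wraps it in its own operation on the way back up.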
>> > [image: image.png]
>> > (Pic from https://blog.csdn.net/yu616568/article/details/50838504, a
>> > really good blog.)
>> >
>> > At present, Kylin does almost all of these calculations only on its own
>> > node; in other words, we cannot fully use the power of the cluster, and
>> > it's a SPOF. Hence this design: we can visit this tree *and transform
>> > each node into operations on Spark DataFrames (i.e. "DF").*
>> >
>> > More specifically, we visit the nodes recursively until we meet the
>> > "TableScan" node (like a stack push operation). E.g., in the above
>> > diagram, the first node we meet is a "Sort" node; we just visit its
>> > child(ren), and we don't stop visiting each node's child(ren) until we
>> > meet a "TableScan" node.
>> >
>> > In the "TableScan" node, we generate the initial DF; the DF is then
>> > popped to the "Filter" node, and the "Filter" node applies its own
>> > operation like "df.filter(xxx)". Finally, we apply each node's operation
>> > to this DF, and the final call chain will look like:
>> > "df.filter(xxx).select(xxx).agg(xxx).sort(xxx)".
>> >
>> > Once we get the final DataFrame and trigger the calculation, all the
>> > rest is handled by Spark, and we gain tremendous benefits at the
>> > computation level. More details can be found in my previous post:
>> > http://apache-kylin.74782.x6.nabble.com/Re-DISCUSS-Columnar-storage-engine-for-Apache-Kylin-tc12113.html
>> >
>> > --
>> >
>> > Regards!
>> >
>> > Aron Tao
>> >
>> > 许益铭 <x1860...@gmail.com> wrote on Wed, Dec 19, 2018 at 11:40 AM:
>> >
>> >> Hi all!
>> >> Regarding the issues Chao Long raised, my views are as follows:
>> >>
>> >> 1. Our current architecture is divided into two layers: a storage layer
>> >> and a computing layer. In the storage layer we have already made some
>> >> optimizations, doing pre-aggregation there to reduce the amount of data
>> >> returned. However, runtime aggregations and joins happen on the Kylin
>> >> server side, so serialization is unavoidable, and this architecture
>> >> easily leads to a single-point bottleneck: if the runtime agg or join
>> >> involves a large amount of data, query performance drops sharply and
>> >> the Kylin server suffers heavy GC.
>> >>
>> >> 2. As for the dictionary problem, dropping dictionary encoding is a
>> >> good choice. The dictionary was originally designed to align rowkeys in
>> >> HBase and also to reduce some storage. But it introduces another
>> >> problem: HBase has difficulty handling variable-length string
>> >> dimensions. For a UHC (ultra-high-cardinality) variable-length
>> >> dimension, we can only build a very large dictionary or set a fairly
>> >> large fixed length, which doubles the storage; and because the
>> >> dictionary is large, query performance is greatly affected (GC). With
>> >> columnar storage we don't need to worry about this at all.
>> >>
>> >> 3. To use Parquet's page index, we must convert Kylin's TupleFilter
>> >> into a Parquet filter, which is no small amount of work. Moreover, our
>> >> data is encoded, and the page index filters only on each page's min/max
>> >> values, so binary data cannot be filtered at all.
>> >>
>> >> I think using Spark as our computing engine solves all of the above
>> >> problems:
>> >>
>> >> 1. Distributed computing
>> >> After SQL is parsed and optimized by Calcite, it becomes a tree of OLAP
>> >> rels; Spark's Catalyst likewise parses SQL into a tree and
>> >> automatically optimizes it into a DataFrame for computation. If
>> >> Calcite's plan can be converted into a Spark plan, we achieve
>> >> distributed computing: Calcite is responsible only for parsing SQL and
>> >> returning result sets, reducing the pressure on the Kylin server side.
>> >>
>> >> 2. Remove the dictionary
>> >> The dictionary is very effective at reducing storage for low- and
>> >> medium-cardinality dimensions, but it has the drawback that the data
>> >> files cannot be used independently of the dictionary. I suggest we
>> >> ignore dictionary-type encodings at first to keep the system as simple
>> >> as possible, and rely on Parquet's page-level dictionary by default.
>> >>
>> >> 3. Store columns in Parquet with their real types instead of binary
>> >> As above, Parquet's filtering on binary is extremely weak, while
>> >> primitive types can directly use Spark's vectorized read, speeding up
>> >> both data reading and computation.
>> >>
>> >> 4. Use Spark to work with Parquet
>> >> Current Spark is already adapted to Parquet, and Spark's pushed filters
>> >> are converted into filters Parquet can use. Here we only need to
>> >> upgrade the Parquet version and make minor modifications to get
>> >> Parquet's page index capability.
>> >>
>> >> 5. Index server
>> >> As JiaTao Tao described, the index server is divided into a file index
>> >> and a page index; dictionary-based filtering is just one kind of file
>> >> index, so we can insert an index server here.
>> >>
>> >> JiaTao Tao <taojia...@gmail.com> wrote on Wed, Dec 19, 2018 at 4:45 PM:
>> >>
>> >> > Hi Gang,
>> >> >
>> >> > In my opinion, segment/partition pruning is actually within the scope
>> >> > of the "index system": we can have an index system at the storage
>> >> > level, including a file index (for segment/partition pruning), a page
>> >> > index (for page pruning), etc. We can put all this in such a system
>> >> > and make the separation of duties cleaner.
>> >> >
>> >> > Ma Gang <mg4w...@163.com> wrote on Wed, Dec 19, 2018 at 6:31 AM:
>> >> >
>> >> > > Awesome! Looking forward to the improvement.
>> >> > > For the dictionary: keeping the dictionary in the query engine is
>> >> > > usually not good, since it brings a lot of pressure to the Kylin
>> >> > > server, but sometimes it has benefits. For example, some segments
>> >> > > can be pruned very early when the filter value is not in the
>> >> > > dictionary, and some queries can be answered directly from the
>> >> > > dictionary, as described in:
>> >> > > https://issues.apache.org/jira/browse/KYLIN-3490
>> >> > >
>> >> > > At 2018-12-17 15:36:01, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
>> >> > >
>> >> > > The dimension dictionary is a legacy design for the HBase storage,
>> >> > > I think; because HBase has no data types and everything is a byte
>> >> > > array, Kylin has to encode STRING and other types with some
>> >> > > encoding method like the dictionary.
>> >> > >
>> >> > > Now, with a storage like Parquet, the storage itself decides how to
>> >> > > encode the data at the page or block level. Then we can drop the
>> >> > > dictionary after the cube is built. This will relieve the memory
>> >> > > pressure on Kylin query nodes and also benefit the UHC case.
>> >> > >
>> >> > > Best regards,
>> >> > >
>> >> > > Shaofeng Shi 史少锋
>> >> > > Apache Kylin PMC
>> >> > > Work email: shaofeng....@kyligence.io
>> >> > > Kyligence Inc: https://kyligence.io/
>> >> > > Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> >> > > Join Kylin user mail group: user-subscr...@kylin.apache.org
>> >> > > Join Kylin dev mail group: dev-subscr...@kylin.apache.org
>> >> > >
>> >> > > Chao Long <wayn...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:
>> >> > >
>> >> > >> In this PoC, we verified that Kylin on Parquet is viable, but the
>> >> > >> query performance still has room to improve.
>> >> > >> We can improve it in the following aspects:
>> >> > >>
>> >> > >> 1. Minimize result-set serialization time
>> >> > >> Since Kylin needs Object[] data for processing, we convert the
>> >> > >> Dataset to an RDD and then convert the "Row" type to Object[], so
>> >> > >> Spark needs to serialize the Object[] before returning it to the
>> >> > >> driver. This time needs to be avoided.
>> >> > >>
>> >> > >> 2. Query without the dictionary
>> >> > >> In this PoC, to use less storage, we keep dictionary-encoded
>> >> > >> values in the Parquet files for dict-encoded dimensions, so Kylin
>> >> > >> must load the dictionary to decode those values at query time. If
>> >> > >> we kept the original values for dict-encoded dimensions, the
>> >> > >> dictionary would be unnecessary. And we don't have to worry about
>> >> > >> storage use, because Parquet will encode the values itself. We
>> >> > >> should remove the dictionary from the query path.
>> >> > >>
>> >> > >> 3. Remove the query single-point issue
>> >> > >> In this PoC, we use Spark to read and process cube data, which is
>> >> > >> distributed, but Kylin still processes the result data returned by
>> >> > >> Spark in a single JVM. We can try to make that distributed too.
>> >> > >>
>> >> > >> 4. Upgrade Parquet to 1.11 for the page index
>> >> > >> In this PoC, Parquet doesn't have a page index, so we get poor
>> >> > >> filter performance. We need to upgrade Parquet to version 1.11,
>> >> > >> which has a page index, to improve filter performance.
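The page-index idea in point 4 can be sketched in miniature: each page carries min/max statistics, and pages whose range cannot match the predicate are skipped without being read. This is a toy illustration in plain Python, not Parquet's actual column-index API; note that if the stored values were dictionary-encoded ids, as in point 2, min/max over those ids would say nothing about the original values:

```python
# Toy sketch of page-level min/max pruning (illustrative only, not the
# real Parquet 1.11 column-index API). Page layout is invented.

pages = [
    {"min": 1,   "max": 99,  "rows": [5, 42, 99]},
    {"min": 100, "max": 250, "rows": [100, 180, 250]},
    {"min": 251, "max": 400, "rows": [251, 399]},
]

def scan(pages, lo, hi):
    """Return rows with lo <= value <= hi, using min/max stats to skip pages."""
    out = []
    for page in pages:
        # The page is skipped when its [min, max] range misses [lo, hi].
        if page["max"] < lo or page["min"] > hi:
            continue  # pruned: this page is never read or decoded
        out.extend(v for v in page["rows"] if lo <= v <= hi)
    return out

print(scan(pages, 150, 260))  # [180, 250, 251] -- the first page is pruned
```

With real types stored (point 3), the min/max comparison works directly on the column's domain, which is exactly why binary-encoded data defeats this pruning.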
>> >> > >>
>> >> > >> ------------------
>> >> > >> Best Regards,
>> >> > >> Chao Long
>> >> > >>
>> >> > >> ------------------ Original Message ------------------
>> >> > >> *From:* "ShaoFeng Shi"<shaofeng...@apache.org>;
>> >> > >> *Date:* Friday, Dec 14, 2018, 4:39 PM
>> >> > >> *To:* "dev"<dev@kylin.apache.org>;"user"<u...@kylin.apache.org>;
>> >> > >> *Subject:* Evaluate Kylin on Parquet
>> >> > >>
>> >> > >> Hello Kylin users,
>> >> > >>
>> >> > >> The first version of the Kylin on Parquet [1] feature has been
>> >> > >> staged in the Kylin code repository for public review and
>> >> > >> evaluation. You can check out the "kylin-on-parquet" branch [2] to
>> >> > >> read the code, and can also make a binary build to run an example.
>> >> > >> When creating a cube, you can select "Parquet" as the storage on
>> >> > >> the "Advanced Setting" page. Both the MapReduce and Spark engines
>> >> > >> support this new storage. A tech blog on the design and
>> >> > >> implementation is being drafted.
>> >> > >>
>> >> > >> Thanks so much to Chao Long and Yichen Zhou for their hard work!
>> >> > >>
>> >> > >> This is not the final version; there is room for improvement in
>> >> > >> many aspects: Parquet, Spark, and Kylin. It can be used for PoC at
>> >> > >> this moment. Your comments are welcome. Let's improve it together.
>> >> > >> >> >> > >> [1] https://issues.apache.org/jira/browse/KYLIN-3621 >> >> > >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet >> >> > >> >> >> > >> Best regards, >> >> > >> >> >> > >> Shaofeng Shi 史少锋 >> >> > >> Apache Kylin PMC >> >> > >> Work email: shaofeng....@kyligence.io >> >> > >> Kyligence Inc: https://kyligence.io/ >> >> > >> >> >> > >> Apache Kylin FAQ: >> >> https://kylin.apache.org/docs/gettingstarted/faq.html >> >> > >> Join Kylin user mail group: user-subscr...@kylin.apache.org >> >> > >> Join Kylin dev mail group: dev-subscr...@kylin.apache.org >> >> > >> >> >> > >> >> >> > >> >> >> > > >> >> > > >> >> > > >> >> > >> >> > >> >> > -- >> >> > >> >> > >> >> > Regards! >> >> > >> >> > Aron Tao >> >> > >> >> >> > >> > >> >
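Ma Gang's early-pruning point above (KYLIN-3490) can also be sketched in miniature: when the filter value is absent from a segment's dimension dictionary, the whole segment can be skipped before any data is read. This is a toy illustration in plain Python; the segment layout and names are invented for the sketch and are not Kylin's actual data structures:

```python
# Toy sketch (illustrative only) of dictionary-based segment pruning:
# a segment whose dictionary lacks the filter value cannot contain it.

segments = {
    "seg_2018_11": {"dict": {"beijing", "shanghai"}, "rows": 1_000_000},
    "seg_2018_12": {"dict": {"shanghai", "shenzhen"}, "rows": 2_000_000},
}

def segments_to_scan(segments, filter_value):
    """Keep only segments whose dimension dictionary contains the value."""
    return [name for name, seg in segments.items()
            if filter_value in seg["dict"]]

print(segments_to_scan(segments, "shenzhen"))  # ['seg_2018_12']
print(segments_to_scan(segments, "chengdu"))   # [] -- nothing to scan at all
```

This is the benefit of keeping the dictionary around at query time that the thread weighs against its memory pressure: the check costs one set lookup per segment, yet in the miss case it avoids scanning the segment entirely.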