Hi Yang, The real-time streaming feature is also under review and testing now. I think when both (the new storage and real-time streaming) are ready to be merged, we can propose bumping the version to 3.0.
Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng....@kyligence.io
Kyligence Inc: https://kyligence.io/
Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org

Li Yang <liy...@apache.org> wrote on Tue, Jan 1, 2019 at 12:40 PM:

> From the discussion, apparently a new storage will be added sooner or later.
>
> Will it be a new major version of Kylin, like Apache Kylin 3.0? Also, how
> about migration from the old storage? I assume old cube data has to be
> transformed and loaded into the new storage.
>
> Yang
>
> On Sat, Dec 29, 2018 at 5:52 PM ShaoFeng Shi <shaofeng...@apache.org> wrote:
>
>> Thanks very much for Yiming's and Jiatao's comments; they're very valuable.
>> There are many improvements that can be made to this new storage. We welcome
>> all kinds of contributions and would like to improve it together with the
>> community in 2019!
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Work email: shaofeng....@kyligence.io
>> Kyligence Inc: https://kyligence.io/
>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> Join Kylin user mail group: user-subscr...@kylin.apache.org
>> Join Kylin dev mail group: dev-subscr...@kylin.apache.org
>>
>> JiaTao Tao <taojia...@gmail.com> wrote on Wed, Dec 19, 2018 at 8:44 PM:
>>
>> > Hi all,
>> >
>> > Truly agreed with Yiming, and here I expand a little more on
>> > "distributed computing".
>> >
>> > As Yiming mentioned, Kylin parses the query into an execution plan
>> > using Calcite (Kylin changes the execution plan because the data in cubes
>> > is already aggregated, so we cannot use the original plan directly). It's
>> > a tree structure: each node represents a specific calculation, and data
>> > flows from bottom to top, applying all these calculations.
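The bottom-up folding of such a plan tree into a DataFrame call chain can be sketched in miniature. This is a toy illustration in plain Python with no Spark dependency; the `Node` and `FakeDF` names are invented for the sketch and are not Kylin code:

```python
# Toy sketch (illustrative only, not Kylin/Spark code): fold a relational
# plan tree into a DataFrame-style call chain by visiting nodes bottom-up.

class Node:
    def __init__(self, kind, arg=None, child=None):
        self.kind, self.arg, self.child = kind, arg, child

class FakeDF:
    """Stands in for a Spark DataFrame; it just records the call chain."""
    def __init__(self, chain):
        self.chain = chain
    def apply(self, op, arg):
        return FakeDF(f"{self.chain}.{op}({arg})")

def to_df(node):
    # Recurse until we reach the TableScan leaf, then "pop" operations back up.
    if node.kind == "TableScan":
        return FakeDF(f"read({node.arg!r})")
    df = to_df(node.child)
    return df.apply(node.kind.lower(), node.arg)

# Sort(Agg(Select(Filter(TableScan)))) -- the same shape as the diagram below
plan = Node("Sort", "x",
        Node("Agg", "sum(y)",
         Node("Select", "x, y",
          Node("Filter", "y > 0",
           Node("TableScan", "cube_table")))))
print(to_df(plan).chain)
# read('cube_table').filter(y > 0).select(x, y).agg(sum(y)).sort(x)
```

Each plan node contributes one call, so the leaf produces the initial DataFrame and every ancestor wraps it in its own operation on the way back up.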
>> > [image: image.png]
>> > (Pic from https://blog.csdn.net/yu616568/article/details/50838504, a
>> > really good blog.)
>> >
>> > At present, Kylin does almost all of these calculations only on its own
>> > node; in other words, we cannot fully use the power of the cluster, and
>> > it's a SPOF. Hence this design: we can visit this tree *and transform
>> > each node into operations on Spark DataFrames (i.e. "DF").*
>> >
>> > More specifically, we visit the nodes recursively until we meet the
>> > "TableScan" node (like a stack push operation). E.g., in the above
>> > diagram, the first node we meet is a "Sort" node; we just visit its
>> > child(ren), and we don't stop visiting each node's child(ren) until we
>> > meet a "TableScan" node.
>> >
>> > In the "TableScan" node, we generate the initial DF; the DF is then
>> > popped to the "Filter" node, and the "Filter" node applies its own
>> > operation like "df.filter(xxx)". Finally, we apply each node's operation
>> > to this DF, and the final call chain will look like:
>> > "df.filter(xxx).select(xxx).agg(xxx).sort(xxx)".
>> >
>> > Once we get the final DataFrame and trigger the calculation, all the
>> > rest is handled by Spark, and we gain tremendous benefits at the
>> > computation level. More details can be found in my previous post:
>> > http://apache-kylin.74782.x6.nabble.com/Re-DISCUSS-Columnar-storage-engine-for-Apache-Kylin-tc12113.html
>> >
>> > --
>> >
>> > Regards!
>> >
>> > Aron Tao
>> >
>> > 许益铭 <x1860...@gmail.com> wrote on Wed, Dec 19, 2018 at 11:40 AM:
>> >
>> >> Hi all!
>> >> Regarding the issues Chao Long raised, my views are as follows:
>> >>
>> >> 1. Our current architecture is divided into two layers: a storage layer
>> >> and a computing layer. In the storage layer we have already made some
>> >> optimizations, doing pre-aggregation there to reduce the amount of data
>> >> returned. However, runtime aggregations and joins happen on the Kylin
>> >> server side, so serialization is unavoidable, and this architecture
>> >> easily leads to a single-point bottleneck: if the runtime agg or join
>> >> involves a large amount of data, query performance drops sharply and
>> >> the Kylin server suffers heavy GC.
>> >>
>> >> 2. As for the dictionary problem, dropping dictionary encoding is a
>> >> good choice. The dictionary was originally designed to align rowkeys in
>> >> HBase and also to reduce some storage. But it introduces another
>> >> problem: HBase has difficulty handling variable-length string
>> >> dimensions. For a UHC (ultra-high-cardinality) variable-length
>> >> dimension, we can only build a very large dictionary or set a fairly
>> >> large fixed length, which doubles the storage; and because the
>> >> dictionary is large, query performance is greatly affected (GC). With
>> >> columnar storage we don't need to worry about this at all.
>> >>
>> >> 3. To use Parquet's page index, we must convert Kylin's TupleFilter
>> >> into a Parquet filter, which is no small amount of work. Moreover, our
>> >> data is encoded, and the page index filters only on each page's min/max
>> >> values, so binary data cannot be filtered at all.
>> >>
>> >> I think using Spark as our computing engine solves all of the above
>> >> problems:
>> >>
>> >> 1. Distributed computing
>> >> After SQL is parsed and optimized by Calcite, it becomes a tree of OLAP
>> >> rels; Spark's Catalyst likewise parses SQL into a tree and
>> >> automatically optimizes it into a DataFrame for computation. If
>> >> Calcite's plan can be converted into a Spark plan, we achieve
>> >> distributed computing: Calcite is responsible only for parsing SQL and
>> >> returning result sets, reducing the pressure on the Kylin server side.
>> >>
>> >> 2. Remove the dictionary
>> >> The dictionary is very effective at reducing storage for low- and
>> >> medium-cardinality dimensions, but it has the drawback that the data
>> >> files cannot be used independently of the dictionary. I suggest we
>> >> ignore dictionary-type encodings at first to keep the system as simple
>> >> as possible, and rely on Parquet's page-level dictionary by default.
>> >>
>> >> 3. Store columns in Parquet with their real types instead of binary
>> >> As above, Parquet's filtering on binary is extremely weak, while
>> >> primitive types can directly use Spark's vectorized read, speeding up
>> >> both data reading and computation.
>> >>
>> >> 4. Use Spark to work with Parquet
>> >> Current Spark is already adapted to Parquet, and Spark's pushed filters
>> >> are converted into filters Parquet can use. Here we only need to
>> >> upgrade the Parquet version and make minor modifications to get
>> >> Parquet's page index capability.
>> >>
>> >> 5. Index server
>> >> As JiaTao Tao described, the index server is divided into a file index
>> >> and a page index; dictionary-based filtering is just one kind of file
>> >> index, so we can insert an index server here.
>> >>
>> >> JiaTao Tao <taojia...@gmail.com> wrote on Wed, Dec 19, 2018 at 4:45 PM:
>> >>
>> >> > Hi Gang,
>> >> >
>> >> > In my opinion, segment/partition pruning is actually within the scope
>> >> > of the "index system": we can have an index system at the storage
>> >> > level, including a file index (for segment/partition pruning), a page
>> >> > index (for page pruning), etc. We can put all this in such a system
>> >> > and make the separation of duties cleaner.
>> >> >
>> >> > Ma Gang <mg4w...@163.com> wrote on Wed, Dec 19, 2018 at 6:31 AM:
>> >> >
>> >> > > Awesome! Looking forward to the improvement.
>> >> > > For the dictionary: keeping the dictionary in the query engine is
>> >> > > usually not good, since it brings a lot of pressure to the Kylin
>> >> > > server, but sometimes it has benefits. For example, some segments
>> >> > > can be pruned very early when the filter value is not in the
>> >> > > dictionary, and some queries can be answered directly from the
>> >> > > dictionary, as described in:
>> >> > > https://issues.apache.org/jira/browse/KYLIN-3490
>> >> > >
>> >> > > At 2018-12-17 15:36:01, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
>> >> > >
>> >> > > The dimension dictionary is a legacy design for the HBase storage,
>> >> > > I think; because HBase has no data types and everything is a byte
>> >> > > array, Kylin has to encode STRING and other types with some
>> >> > > encoding method like the dictionary.
>> >> > >
>> >> > > Now, with a storage like Parquet, the storage itself decides how to
>> >> > > encode the data at the page or block level. Then we can drop the
>> >> > > dictionary after the cube is built. This will relieve the memory
>> >> > > pressure on Kylin query nodes and also benefit the UHC case.
>> >> > >
>> >> > > Best regards,
>> >> > >
>> >> > > Shaofeng Shi 史少锋
>> >> > > Apache Kylin PMC
>> >> > > Work email: shaofeng....@kyligence.io
>> >> > > Kyligence Inc: https://kyligence.io/
>> >> > > Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> >> > > Join Kylin user mail group: user-subscr...@kylin.apache.org
>> >> > > Join Kylin dev mail group: dev-subscr...@kylin.apache.org
>> >> > >
>> >> > > Chao Long <wayn...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:
>> >> > >
>> >> > >> In this PoC, we verified that Kylin on Parquet is viable, but the
>> >> > >> query performance still has room to improve.
>> >> > >> We can improve it in the following aspects:
>> >> > >>
>> >> > >> 1. Minimize result-set serialization time
>> >> > >> Since Kylin needs Object[] data for processing, we convert the
>> >> > >> Dataset to an RDD and then convert the "Row" type to Object[], so
>> >> > >> Spark needs to serialize the Object[] before returning it to the
>> >> > >> driver. This time needs to be avoided.
>> >> > >>
>> >> > >> 2. Query without the dictionary
>> >> > >> In this PoC, to use less storage, we keep dictionary-encoded
>> >> > >> values in the Parquet files for dict-encoded dimensions, so Kylin
>> >> > >> must load the dictionary to decode those values at query time. If
>> >> > >> we kept the original values for dict-encoded dimensions, the
>> >> > >> dictionary would be unnecessary. And we don't have to worry about
>> >> > >> storage use, because Parquet will encode the values itself. We
>> >> > >> should remove the dictionary from the query path.
>> >> > >>
>> >> > >> 3. Remove the query single-point issue
>> >> > >> In this PoC, we use Spark to read and process cube data, which is
>> >> > >> distributed, but Kylin still processes the result data returned by
>> >> > >> Spark in a single JVM. We can try to make that distributed too.
>> >> > >>
>> >> > >> 4. Upgrade Parquet to 1.11 for the page index
>> >> > >> In this PoC, Parquet doesn't have a page index, so we get poor
>> >> > >> filter performance. We need to upgrade Parquet to version 1.11,
>> >> > >> which has a page index, to improve filter performance.
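The page-index idea in point 4 can be sketched in miniature: each page carries min/max statistics, and pages whose range cannot match the predicate are skipped without being read. This is a toy illustration in plain Python, not Parquet's actual column-index API; note that if the stored values were dictionary-encoded ids, as in point 2, min/max over those ids would say nothing about the original values:

```python
# Toy sketch of page-level min/max pruning (illustrative only, not the
# real Parquet 1.11 column-index API). Page layout is invented.

pages = [
    {"min": 1,   "max": 99,  "rows": [5, 42, 99]},
    {"min": 100, "max": 250, "rows": [100, 180, 250]},
    {"min": 251, "max": 400, "rows": [251, 399]},
]

def scan(pages, lo, hi):
    """Return rows with lo <= value <= hi, using min/max stats to skip pages."""
    out = []
    for page in pages:
        # The page is skipped when its [min, max] range misses [lo, hi].
        if page["max"] < lo or page["min"] > hi:
            continue  # pruned: this page is never read or decoded
        out.extend(v for v in page["rows"] if lo <= v <= hi)
    return out

print(scan(pages, 150, 260))  # [180, 250, 251] -- the first page is pruned
```

With real types stored (point 3), the min/max comparison works directly on the column's domain, which is exactly why binary-encoded data defeats this pruning.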
>> >> > >>
>> >> > >> ------------------
>> >> > >> Best Regards,
>> >> > >> Chao Long
>> >> > >>
>> >> > >> ------------------ Original Message ------------------
>> >> > >> *From:* "ShaoFeng Shi"<shaofeng...@apache.org>;
>> >> > >> *Date:* Friday, Dec 14, 2018, 4:39 PM
>> >> > >> *To:* "dev"<dev@kylin.apache.org>;"user"<u...@kylin.apache.org>;
>> >> > >> *Subject:* Evaluate Kylin on Parquet
>> >> > >>
>> >> > >> Hello Kylin users,
>> >> > >>
>> >> > >> The first version of the Kylin on Parquet [1] feature has been
>> >> > >> staged in the Kylin code repository for public review and
>> >> > >> evaluation. You can check out the "kylin-on-parquet" branch [2] to
>> >> > >> read the code, and can also make a binary build to run an example.
>> >> > >> When creating a cube, you can select "Parquet" as the storage on
>> >> > >> the "Advanced Setting" page. Both the MapReduce and Spark engines
>> >> > >> support this new storage. A tech blog on the design and
>> >> > >> implementation is being drafted.
>> >> > >>
>> >> > >> Thanks so much to Chao Long and Yichen Zhou for their hard work!
>> >> > >>
>> >> > >> This is not the final version; there is room for improvement in
>> >> > >> many aspects: Parquet, Spark, and Kylin. It can be used for PoC at
>> >> > >> this moment. Your comments are welcome. Let's improve it together.
>> >> > >> >> >> > >> [1] https://issues.apache.org/jira/browse/KYLIN-3621 >> >> > >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet >> >> > >> >> >> > >> Best regards, >> >> > >> >> >> > >> Shaofeng Shi 史少锋 >> >> > >> Apache Kylin PMC >> >> > >> Work email: shaofeng....@kyligence.io >> >> > >> Kyligence Inc: https://kyligence.io/ >> >> > >> >> >> > >> Apache Kylin FAQ: >> >> https://kylin.apache.org/docs/gettingstarted/faq.html >> >> > >> Join Kylin user mail group: user-subscr...@kylin.apache.org >> >> > >> Join Kylin dev mail group: dev-subscr...@kylin.apache.org >> >> > >> >> >> > >> >> >> > >> >> >> > > >> >> > > >> >> > > >> >> > >> >> > >> >> > -- >> >> > >> >> > >> >> > Regards! >> >> > >> >> > Aron Tao >> >> > >> >> >> > >> > >> >
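Ma Gang's early-pruning point above (KYLIN-3490) can also be sketched in miniature: when the filter value is absent from a segment's dimension dictionary, the whole segment can be skipped before any data is read. This is a toy illustration in plain Python; the segment layout and names are invented for the sketch and are not Kylin's actual data structures:

```python
# Toy sketch (illustrative only) of dictionary-based segment pruning:
# a segment whose dictionary lacks the filter value cannot contain it.

segments = {
    "seg_2018_11": {"dict": {"beijing", "shanghai"}, "rows": 1_000_000},
    "seg_2018_12": {"dict": {"shanghai", "shenzhen"}, "rows": 2_000_000},
}

def segments_to_scan(segments, filter_value):
    """Keep only segments whose dimension dictionary contains the value."""
    return [name for name, seg in segments.items()
            if filter_value in seg["dict"]]

print(segments_to_scan(segments, "shenzhen"))  # ['seg_2018_12']
print(segments_to_scan(segments, "chengdu"))   # [] -- nothing to scan at all
```

This is the benefit of keeping the dictionary around at query time that the thread weighs against its memory pressure: the check costs one set lookup per segment, yet in the miss case it avoids scanning the segment entirely.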