From the discussion, apparently a new storage will be added sooner or later.
Will it be a new big version of Kylin, like Apache Kylin 3.0? Also, how about the migration from the old storage? I assume old cube data has to be transformed and loaded into the new storage.

Yang

On Sat, Dec 29, 2018 at 5:52 PM ShaoFeng Shi <shaofeng...@apache.org> wrote:

Thanks very much for Yiming's and Jiatao's comments; they're very valuable. There are many improvements that can be made to this new storage. We welcome all kinds of contributions and would like to improve it together with the community in the year 2019!

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng....@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org

JiaTao Tao <taojia...@gmail.com> wrote on Wed, Dec 19, 2018 at 8:44 PM:

Hi all,

Truly agreed with Yiming, and here I will expand a little more on "distributed computing".

As Yiming mentioned, Kylin parses a query into an execution plan using Calcite (Kylin rewrites the execution plan because the data in cubes is already aggregated, so we cannot use the original plan directly). It is a tree structure: each node represents a specific calculation, and data flows from bottom to top, applying all these calculations.

[image: execution plan tree]
(Pic from https://blog.csdn.net/yu616568/article/details/50838504, a really good blog.)

At present, Kylin does almost all of these calculations on its own node; in other words, we cannot fully use the power of the cluster, and it is a SPOF. Hence this design: we can visit this tree and transform each node into operations on Spark's DataFrames (i.e. "DF").

More specifically, we visit the nodes recursively until we meet the "TableScan" node (like pushing onto a stack). E.g.:
In the above diagram, the first node we meet is a "Sort" node; we just visit its child(ren), and we do not stop visiting each node's child(ren) until we meet a "TableScan" node.

In the "TableScan" node, we generate the initial DF; then the DF is popped up to the "Filter" node, and the "Filter" node applies its own operation, like "df.filter(xxx)". Finally, we apply each node's operation to this DF, and the final call chain will look like: "df.filter(xxx).select(xxx).agg(xxx).sort(xxx)".

After we get the final DataFrame and trigger the calculation, all the rest is handled by Spark. And we can gain tremendous benefits at the computation level; more details can be seen in my previous post:
http://apache-kylin.74782.x6.nabble.com/Re-DISCUSS-Columnar-storage-engine-for-Apache-Kylin-tc12113.html

--

Regards!

Aron Tao

许益铭 <x1860...@gmail.com> wrote on Wed, Dec 19, 2018 at 11:40 AM:

Hi all!

Regarding the issues CHAO LONG raised, I have the following views:

1. Our current architecture is divided into two layers: a storage layer and a computing layer. In the storage layer we have already made some optimizations, doing pre-aggregation there to reduce the amount of data returned. But the runtime aggregations and joins happen on the Kylin server side, where serialization is unavoidable, and this architecture easily leads to a single-point bottleneck. If the runtime agg or join involves a large amount of data, query performance drops sharply and the Kylin server suffers heavy GC.

2. About the dictionary: the dictionary was originally introduced to align rowkeys in HBase and also to reduce some storage. But it brings another problem: it is hard to handle variable-length string dimensions. For a high-cardinality variable-length dimension, we can only build a very large dictionary or assign a rather large fixed length, which doubles the storage; and because the dictionary is large, query performance is badly affected (GC). If we use columnar storage, we do not need to worry about this at all.

3. To use Parquet's page index, we must convert TupleFilter into Parquet filters, which is no small amount of work. Moreover, our data is all encoded; the page index only filters by the min/max values on a page, so for binary data no filtering can be done.

I think using Spark as our computation engine can solve all of the above problems:

1. Distributed computing
After Calcite parses and optimizes the SQL, it generates a tree of OLAP rels; Spark's Catalyst likewise parses SQL into a tree and automatically optimizes it into a DataFrame for computation. If the Calcite plan can be converted into a Spark plan, then we achieve distributed computing: Calcite is only responsible for parsing SQL and returning the result set, reducing the pressure on the Kylin server side.
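The plan-to-DataFrame conversion discussed above can be sketched as a toy model. This is illustrative Python only, not Kylin code: the class names (PlanNode, Frame) are made up, and in the real design the nodes would be Calcite RelNodes and Frame would be an actual Spark DataFrame.

```python
# Toy model of the bottom-up plan visit: recurse down to the TableScan
# (the "stack pushing"), create the initial "DataFrame", then apply each
# parent node's operation on the way back up the tree.

class Frame:
    """Stand-in for a Spark DataFrame; it just records the call chain."""
    def __init__(self, chain):
        self.chain = chain

    def apply(self, op):
        # Corresponds to df.filter(...), df.select(...), etc.
        return Frame(self.chain + "." + op)

class PlanNode:
    def __init__(self, kind, op, child=None):
        self.kind, self.op, self.child = kind, op, child

def to_frame(node):
    if node.kind == "TableScan":
        return Frame("df")  # initial DataFrame generated from the scan
    return to_frame(node.child).apply(node.op)

# Sort -> Agg -> Project -> Filter -> TableScan, as in the diagram.
plan = PlanNode("Sort", "sort(xxx)",
       PlanNode("Agg", "agg(xxx)",
       PlanNode("Project", "select(xxx)",
       PlanNode("Filter", "filter(xxx)",
       PlanNode("TableScan", None)))))

print(to_frame(plan).chain)  # -> df.filter(xxx).select(xxx).agg(xxx).sort(xxx)
```

Once the final chain is built, triggering it hands all execution to Spark's own scheduler, which is where the distribution comes from.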
2. Remove the dictionary
The dictionary is very good at reducing storage pressure for low- and medium-cardinality columns, but it has the drawback that the data files cannot be used independently of the dictionary. I suggest that at the beginning we do not consider dictionary-type encodings at all and keep the system as simple as possible; using Parquet's page-level dictionary by default is enough.

3. Store the real column types in Parquet instead of binary
As above, Parquet's filtering ability on binary is extremely weak, while primitive types can directly use Spark's vectorized read, speeding up both data reading and computation.

4. Use Spark's Parquet integration
Spark is already adapted to Parquet: Spark's pushed filters are converted into filters that Parquet can use. We only need to upgrade the Parquet version and make minor modifications to get Parquet's page index capability.

5. Index server
As JiaTao Tao described, the index system divides into a file index and a page index; dictionary-based filtering is nothing but a kind of file index, so we can insert an index server here.

JiaTao Tao <taojia...@gmail.com> wrote on Wed, Dec 19, 2018 at 4:45 PM:

Hi Gang,

In my opinion, segment/partition pruning is actually in the scope of the "index system": we can have an index system at the storage level, including a file index (for segment/partition pruning), a page index (for page pruning), etc. We can put all of this in such a system and make the separation of duties cleaner.

Ma Gang <mg4w...@163.com> wrote on Wed, Dec 19, 2018 at 6:31 AM:

Awesome! Looking forward to the improvement. For the dictionary: keeping the dictionary in the query engine is usually not good, since it puts a lot of pressure on the Kylin server, but sometimes it has benefits. For example, some segments can be pruned very early when a filter value is not in the dictionary, and some queries can be answered directly from the dictionary, as described in: https://issues.apache.org/jira/browse/KYLIN-3490

At 2018-12-17 15:36:01, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:

The dimension dictionary is a legacy design for HBase storage, I think; because HBase has no data types and everything is a byte array, Kylin has to encode STRING and other types with some encoding method like the dictionary.

Now with a storage like Parquet, the storage decides how to encode the data at the page or block level. Then we can drop the dictionary after the cube is built. This will relieve the memory pressure on Kylin query nodes and also benefit the UHC case.
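The page-level encoding idea above can be illustrated with a small sketch. This is plain Python and purely illustrative of the concept, not Parquet's actual on-disk format or API: each page dictionary-encodes its values independently and carries its own small dictionary, so pages are self-describing and no global dictionary has to be loaded into the query node.

```python
# Sketch: per-page dictionary encoding. Each page stores (page_dict,
# codes) and can be decoded on its own -- unlike a global dictionary,
# which every reader must keep in memory.

def encode_page(values):
    """Dictionary-encode one page; returns (page_dict, codes)."""
    mapping = {}
    codes = [mapping.setdefault(v, len(mapping)) for v in values]
    page_dict = sorted(mapping, key=mapping.get)  # code -> value
    return page_dict, codes

def decode_page(page_dict, codes):
    return [page_dict[c] for c in codes]

page1 = ["CN", "US", "CN", "CN"]
page2 = ["UK", "UK", "CN"]  # encoded independently of page1

d1, c1 = encode_page(page1)
d2, c2 = encode_page(page2)

# Each page decodes with only its own dictionary -- nothing global.
assert decode_page(d1, c1) == page1
assert decode_page(d2, c2) == page2
```

Because each page is self-contained, a cube segment written this way stays readable even after Kylin's global dictionary is dropped, which is the point Shaofeng makes about the UHC case.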
Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng....@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org

Chao Long <wayn...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:

In this PoC we verified that Kylin on Parquet is viable, but the query performance still has room for improvement. We can improve it in the following ways:

1. Minimize result-set serialization time
Since Kylin needs Object[] data for processing, we convert the Dataset to an RDD and then convert the "Row" type to Object[], so Spark has to serialize the Object[] values before returning them to the driver. This time needs to be avoided.

2. Query without the dictionary
In this PoC, to use less storage, we keep the dictionary-encoded values in the Parquet files for dict-encoded dimensions, so Kylin must load the dictionary to decode those values at query time. If we keep the original values for dict-encoded dimensions, the dictionary is unnecessary. And we don't have to worry about storage use, because Parquet will encode the values itself. We should remove the dictionary from the query path.

3. Remove the query single-point issue
In this PoC we use Spark to read and process cube data, which is distributed, but Kylin also needs to process the result data Spark returns in a single JVM. We can try to make that distributed too.

4. Upgrade Parquet to 1.11 for the page index
In this PoC Parquet doesn't have a page index, so we get poor filter performance.
We need to upgrade Parquet to version 1.11, which has a page index, to improve filter performance.

------------------
Best Regards,
Chao Long

------------------ Original Message ------------------
From: "ShaoFeng Shi" <shaofeng...@apache.org>
Sent: Friday, December 14, 2018 4:39 PM
To: "dev" <dev@kylin.apache.org>; "user" <u...@kylin.apache.org>
Subject: Evaluate Kylin on Parquet

Hello Kylin users,

The first version of the Kylin on Parquet [1] feature has been staged in the Kylin code repository for public review and evaluation. You can check out the "kylin-on-parquet" branch [2] to read the code, and you can also make a binary build to run an example. When creating a cube, you can select "Parquet" as the storage on the "Advanced setting" page. Both the MapReduce and Spark engines support this new storage. A tech blog on the design and implementation is being drafted.

Thanks so much for the hard work of the engineers Chao Long and Yichen Zhou!

This is not the final version; there is room for improvement in many aspects: Parquet, Spark, and Kylin. It can be used for PoC at this moment. Your comments are welcome. Let's improve it together.
> >> > >> > >> > >> [1] https://issues.apache.org/jira/browse/KYLIN-3621 > >> > >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet > >> > >> > >> > >> Best regards, > >> > >> > >> > >> Shaofeng Shi 史少锋 > >> > >> Apache Kylin PMC > >> > >> Work email: shaofeng....@kyligence.io > >> > >> Kyligence Inc: https://kyligence.io/ > >> > >> > >> > >> Apache Kylin FAQ: > >> https://kylin.apache.org/docs/gettingstarted/faq.html > >> > >> Join Kylin user mail group: user-subscr...@kylin.apache.org > >> > >> Join Kylin dev mail group: dev-subscr...@kylin.apache.org > >> > >> > >> > >> > >> > >> > >> > > > >> > > > >> > > > >> > > >> > > >> > -- > >> > > >> > > >> > Regards! > >> > > >> > Aron Tao > >> > > >> > > > > >