hi All!

Regarding the several issues Chao Long mentioned, here are my thoughts:

1. Our current architecture is divided into two layers: a storage layer
and a compute layer. In the storage layer we have already made some
optimizations, pre-aggregating there to reduce the amount of data
returned. But runtime aggregation and joins still happen on the Kylin
server side, so serialization is unavoidable, and this architecture
easily becomes a single-point bottleneck: if the runtime agg or join
involves a large amount of data, query performance drops sharply and the
Kylin server suffers severe GC.

2. On the dictionary question: the dictionary was originally introduced
to align rowkeys in HBase and also to save some storage. But it brings
another problem: HBase struggles with variable-length string dimensions.
For an ultra-high-cardinality variable-length dimension, we are usually
forced to either build a very large dictionary or assign a generous fixed
length, which can double the storage; and because the dictionary is so
large, query performance suffers badly (GC). With columnar storage, we no
longer need to worry about this.

3. To use Parquet's page index, we must convert TupleFilter into
Parquet's own filters, and that is no small amount of work. Moreover, our
data is all encoded; Parquet's page index only filters on each page's
min/max statistics, so binary data cannot be filtered at all.
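
To make the conversion concrete, here is a minimal sketch in Scala. The
case classes are a simplified stand-in for Kylin's TupleFilter tree (the
real class is org.apache.kylin.metadata.filter.TupleFilter); only the
FilterApi calls are the actual Parquet API:

  import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

  // Simplified, illustrative stand-in for Kylin's filter tree.
  sealed trait Tuple
  case class Eq(col: String, value: Int)    extends Tuple
  case class Gt(col: String, value: Int)    extends Tuple
  case class And(left: Tuple, right: Tuple) extends Tuple
  case class Or(left: Tuple, right: Tuple)  extends Tuple

  // Recursively translate the tree into a FilterPredicate that Parquet
  // can evaluate against its page-level min/max statistics.
  def toParquet(f: Tuple): FilterPredicate = f match {
    case Eq(c, v)  => FilterApi.`eq`(FilterApi.intColumn(c), Int.box(v))
    case Gt(c, v)  => FilterApi.gt(FilterApi.intColumn(c), Int.box(v))
    case And(l, r) => FilterApi.and(toParquet(l), toParquet(r))
    case Or(l, r)  => FilterApi.or(toParquet(l), toParquet(r))
  }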

I think using Spark as our compute engine can solve all of the above
problems:

1. Distributed computing
After SQL is parsed and optimized by Calcite, it becomes a tree of OLAP
rels; Spark's Catalyst likewise parses SQL into a tree and then optimizes
it into a DataFrame for execution. If the Calcite plan can be converted
into a Spark plan, we get fully distributed computation: Calcite is then
only responsible for parsing SQL and returning the result set, reducing
the pressure on the Kylin server side. A rough sketch of what one
translated node could look like follows.
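
This is illustrative only, not the actual translator; the path and
column names are made up:

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.functions._

  // Hand-written equivalent of a single Calcite aggregate node; a real
  // translator would walk the OLAPRel tree (project/filter/aggregate/
  // join) and emit the corresponding DataFrame operators.
  def translateAggregate(spark: SparkSession,
                         cuboidPath: String, // hypothetical cuboid path
                         groupCols: Seq[String],
                         measureCol: String): DataFrame =
    spark.read.parquet(cuboidPath)
      .groupBy(groupCols.map(col): _*)
      // the aggregation runs on executors, not on the Kylin server
      .agg(sum(measureCol).as(measureCol))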

2.去掉字典
字典有个很好的作用就是在中低基数下减少储存压力,但是也有一个坏处就是其数据文件无法脱离字典单独使用,我建议刚开始可以不考虑字典类型的encoding,让系统尽可能的简单,默认使用parquet的page级别的dictionary即可.
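
A minimal sketch of what that means at write time; the paths are made
up, and "parquet.enable.dictionary" is the standard parquet-hadoop
option (already true by default):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("cube-build").master("local[*]") // local run for the demo
    .getOrCreate()
  val df = spark.read.parquet("/kylin/cube1/flat_table") // hypothetical

  // Let Parquet build its own page-level dictionaries at write time
  // instead of maintaining a Kylin-managed dictionary.
  df.write
    .option("parquet.enable.dictionary", "true") // default, shown for clarity
    .parquet("/kylin/cube1/cuboid_255")          // hypothetical output path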

3. Parquet should store each column's real type instead of binary
As above, Parquet's ability to filter on binary is extremely weak, while
primitive types let us use Spark's vectorized read directly, speeding up
both data reading and computation.
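
A small sketch of the read side; the path is made up, and
spark.sql.parquet.enableVectorizedReader is a real Spark setting that
already defaults to true:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder()
    .appName("query").master("local[*]")
    .getOrCreate()

  // Vectorized Parquet reading only kicks in for primitive column
  // types; set explicitly here just to make the point.
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

  val df = spark.read.parquet("/kylin/cube1/cuboid_255") // hypothetical
  df.filter(col("price") > 100).show() // typed comparison, no dictionary decode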

4. Use Spark to integrate with Parquet
Spark is already well adapted to Parquet: Spark's pushed filters are
converted into filters that Parquet can use. Here we only need to upgrade
the Parquet version and make small modifications to gain Parquet's page
index capability.
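
For instance (hypothetical path; spark.sql.parquet.filterPushdown is a
real setting that defaults to true):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder()
    .appName("pushdown").master("local[*]")
    .getOrCreate()
  spark.conf.set("spark.sql.parquet.filterPushdown", "true") // default

  val rows = spark.read.parquet("/kylin/cube1/cuboid_255") // hypothetical
    .filter(col("seller_id") === 10000042L) // pushed down to Parquet
  // the scan node lists the predicate under "PushedFilters"
  rows.explain()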

5. Index server
As JiaTao Tao described, the index server is divided into a file index
and a page index; dictionary-based filtering is really just one kind of
file index, so this is a natural place to plug in an index server.
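
One hypothetical shape for such a layer (the trait and method names are
mine, not an existing API):

  import org.apache.parquet.filter2.predicate.FilterPredicate

  // File-level and page-level pruning behind one interface.
  trait IndexServer {
    // Drop whole files/segments whose statistics (or dictionary) rule
    // out the filter.
    def pruneFiles(filter: FilterPredicate, candidates: Seq[String]): Seq[String]
    // Within one file, return the page indexes that can still match.
    def prunePages(filter: FilterPredicate, file: String): Seq[Int]
  }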



JiaTao Tao <taojia...@gmail.com> wrote on Wed, Dec 19, 2018 at 4:45 PM:

> Hi Gang
>
> In my opinion, segments/partition pruning is actually in the scope of the
> "Index system": we can have an "Index system" at the storage level,
> including a file index (for segment/partition pruning), a page index (for
> page pruning), etc. We can put all this stuff in such a system and make
> the separation of duties cleaner.
>
>
> Ma Gang <mg4w...@163.com> wrote on Wed, Dec 19, 2018 at 6:31 AM:
>
> > Awesome! Looking forward to the improvement. As for the dictionary,
> > keeping it in the query engine is usually not good since it brings a lot
> > of pressure to the Kylin server, but sometimes it has benefits: for
> > example, some segments can be pruned very early when the filter value is
> > not in the dictionary, and some queries can be answered directly using
> > the dictionary, as described in:
> > https://issues.apache.org/jira/browse/KYLIN-3490
> >
> > At 2018-12-17 15:36:01, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
> >
> > The dimension dictionary is a legacy design for HBase storage, I think;
> > because HBase has no data types and everything is a byte array, Kylin
> > has to encode STRING and other types with some encoding method like the
> > dictionary.
> >
> > Now with storage like Parquet, the storage itself decides how to encode
> > the data at the page or block level. Then we can drop the dictionary
> > after the cube is built. This will relieve the memory pressure on Kylin
> > query nodes and also benefit the UHC case.
> >
> > Best regards,
> >
> > Shaofeng Shi 史少锋
> > Apache Kylin PMC
> > Work email: shaofeng....@kyligence.io
> > Kyligence Inc: https://kyligence.io/
> >
> > Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> > Join Kylin user mail group: user-subscr...@kylin.apache.org
> > Join Kylin dev mail group: dev-subscr...@kylin.apache.org
> >
> >
> >
> >
> > Chao Long <wayn...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:
> >
> >>  In this PoC, we verified that Kylin on Parquet is viable, but the query
> >> performance still has room to improve. We can improve it in the
> >> following aspects:
> >>
> >>  1, Minimize result set serialization time
> >>  Since Kylin needs Object[] data to process, we convert the Dataset to
> >> an RDD and then convert the "Row" type to Object[], so Spark needs to
> >> serialize Object[] before returning it to the driver. This time needs
> >> to be avoided.
> >>
> >>  2, Query without dictionary
> >>  In this PoC, to use less storage, we keep the dict-encoded values in
> >> the Parquet files for dict-encoded dimensions, so Kylin must load the
> >> dictionary to decode the values at query time. If we keep the original
> >> values for dict-encoded dimensions, the dictionary is unnecessary. And
> >> we don't have to worry about the storage use, because Parquet will
> >> encode it. We should remove the dictionary from the query path.
> >>
> >>  3, Remove the query single-point issue
> >>  In this PoC, we use Spark to read and process Cube data, which is
> >> distributed, but Kylin also needs to process the result data Spark
> >> returns in a single JVM. We can try to make that distributed too.
> >>
> >>  4, Upgrade Parquet to 1.11 for the page index
> >>  In this PoC, Parquet doesn't have a page index, so we get poor filter
> >> performance. We need to upgrade Parquet to version 1.11, which has a
> >> page index, to improve filter performance.
> >>
> >> ------------------
> >> Best Regards,
> >> Chao Long
> >>
> >> ------------------ Original Message ------------------
> >> *From:* "ShaoFeng Shi"<shaofeng...@apache.org>;
> >> *Sent:* Friday, Dec 14, 2018, 4:39 PM
> >> *To:* "dev"<dev@kylin.apache.org>;"user"<u...@kylin.apache.org>;
> >> *Subject:* Evaluate Kylin on Parquet
> >>
> >> Hello Kylin users,
> >>
> >> The first version of the Kylin on Parquet [1] feature has been staged
> >> in the Kylin code repository for public review and evaluation. You can
> >> check out the "kylin-on-parquet" branch [2] to read the code, and can
> >> also make a binary build to run an example. When creating a cube, you
> >> can select "Parquet" as the storage on the "Advanced setting" page.
> >> Both the MapReduce and Spark engines support this new storage. A tech
> >> blog on the design and implementation is being drafted.
> >>
> >> Thanks so much to the engineers for their hard work: Chao Long and
> >> Yichen Zhou!
> >>
> >> This is not the final version; there is room to improve in many
> >> aspects: Parquet, Spark, and Kylin. It can be used for PoC at this
> >> moment. Your comments are welcome. Let's improve it together.
> >>
> >> [1] https://issues.apache.org/jira/browse/KYLIN-3621
> >> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
> >>
> >> Best regards,
> >>
> >> Shaofeng Shi 史少锋
> >> Apache Kylin PMC
> >> Work email: shaofeng....@kyligence.io
> >> Kyligence Inc: https://kyligence.io/
> >>
> >> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> >> Join Kylin user mail group: user-subscr...@kylin.apache.org
> >> Join Kylin dev mail group: dev-subscr...@kylin.apache.org
> >>
> >>
> >>
> >
> >
> >
>
>
> --
>
>
> Regards!
>
> Aron Tao
>
