Re: Re: Evaluate Kylin on Parquet

JiaTao Tao Wed, 19 Dec 2018 00:45:53 -0800

Hi Gang

In my opinion, segment/partition pruning actually falls within the scope of
an "index system". We can have an index system at the storage level that
includes a file index (for segment/partition pruning), a page index (for
page pruning), etc. We can put all of this into one such system and make the
separation of duties cleaner.
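
To illustrate the idea, here is a minimal Python sketch of a two-level
index, assuming each level stores only min/max statistics for a single
integer column and the query is an equality filter. The names (FileStats,
PageStats, prune, seg_a, seg_b) are hypothetical and are not part of the
Kylin or Parquet APIs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PageStats:
    """Min/max of one column within a single page (hypothetical structure)."""
    min_value: int
    max_value: int


@dataclass
class FileStats:
    """Min/max of one column for a whole segment/partition file."""
    name: str
    min_value: int
    max_value: int
    pages: List[PageStats] = field(default_factory=list)


def may_contain(min_value: int, max_value: int, target: int) -> bool:
    """A range can only hold the target if the target lies within [min, max]."""
    return min_value <= target <= max_value


def prune(files: List[FileStats], target: int) -> List[Tuple[str, List[int]]]:
    """Return (file name, surviving page numbers) for an equality filter.

    The file index is consulted first, so pages of a pruned file are
    never inspected; the page index then narrows the remaining files.
    """
    plan = []
    for f in files:
        # File index: skip the whole segment/partition if it cannot match.
        if not may_contain(f.min_value, f.max_value, target):
            continue
        # Page index: keep only the pages whose range can hold the target.
        pages = [
            i for i, p in enumerate(f.pages)
            if may_contain(p.min_value, p.max_value, target)
        ]
        if pages:
            plan.append((f.name, pages))
    return plan


files = [
    FileStats("seg_a", 0, 99, [PageStats(0, 49), PageStats(50, 99)]),
    FileStats("seg_b", 100, 199, [PageStats(100, 149), PageStats(150, 199)]),
]
print(prune(files, 120))
```

Keeping both levels behind one prune() entry point is the point of the
sketch: the query engine asks a single component which files and pages to
read, instead of carrying its own segment-pruning logic.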



On Wed, Dec 19, 2018 at 6:31 AM, Ma Gang <mg4w...@163.com> wrote:

> Awesome! Looking forward to the improvement. As for the dictionary,
> keeping it in the query engine is not good most of the time, since it puts
> a lot of pressure on the Kylin server. Sometimes it does have benefits,
> though: for example, some segments can be pruned very early when the filter
> value is not in the dictionary, and some queries can be answered directly
> from the dictionary, as described in:
> https://issues.apache.org/jira/browse/KYLIN-3490
>
> At 2018-12-17 15:36:01, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
>
> The dimension dictionary is a legacy design for HBase storage, I think.
> Because HBase has no data types and everything is a byte array, Kylin has
> to encode STRING and other types with some encoding method such as the
> dictionary.
>
> Now, with a storage format like Parquet, the storage itself decides how to
> encode the data at the page or block level, so we can drop the dictionary
> after the cube is built. This will relieve the memory pressure on Kylin
> query nodes and also benefit the UHC case.
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Work email: shaofeng....@kyligence.io
> Kyligence Inc: https://kyligence.io/
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscr...@kylin.apache.org
> Join Kylin dev mail group: dev-subscr...@kylin.apache.org
>
>
>
>
> On Mon, Dec 17, 2018 at 1:23 PM, Chao Long <wayn...@qq.com> wrote:
>
>>  In this PoC, we verified that Kylin on Parquet is viable, but the query
>> performance still has room to improve. We can improve it in the
>> following aspects:
>>
>>  1, Minimize result set serialization time
>>  Since Kylin needs Object[] data to process, we convert the Dataset to an
>> RDD and then convert the "Row" type to Object[], so Spark needs to
>> serialize the Object[] before returning it to the driver. This time
>> should be avoided.
>>
>>  2, Query without dictionary
>>  In this PoC, to reduce storage use, we keep the dict-encoded values in
>> the Parquet file for dict-encoded dimensions, so Kylin must load the
>> dictionary to convert the dict values at query time. If we keep the
>> original values for dict-encoded dimensions, the dictionary is
>> unnecessary. And we don't have to worry about storage use, because
>> Parquet will encode the values itself. We should remove the dictionary
>> from the query path.
>>
>>  3, Remove query single-point issue
>>  In this PoC, we use Spark to read and process cube data, which is
>> distributed, but Kylin also needs to process the result data that Spark
>> returns in a single JVM. We can try to make that step distributed too.
>>
>>  4, Upgrade Parquet to 1.11 for page index
>>  In this PoC, Parquet doesn't have a page index, so we get poor filter
>> performance. We need to upgrade Parquet to version 1.11, which has a page
>> index, to improve filter performance.
>>
>> ------------------
>> Best Regards,
>> Chao Long
>>
>> ------------------ Original Message ------------------
>> *From:* "ShaoFeng Shi"<shaofeng...@apache.org>;
>> *Sent:* Friday, December 14, 2018, 4:39 PM
>> *To:* "dev"<dev@kylin.apache.org>;"user"<u...@kylin.apache.org>;
>> *Subject:* Evaluate Kylin on Parquet
>>
>> Hello Kylin users,
>>
>> The first version of the Kylin on Parquet [1] feature has been staged in
>> the Kylin code repository for public review and evaluation. You can check
>> out the "kylin-on-parquet" branch [2] to read the code, and you can also
>> make a binary build to run an example. When creating a cube, you can
>> select "Parquet" as the storage on the "Advanced setting" page. Both the
>> MapReduce and Spark engines support this new storage. A tech blog on the
>> design and implementation is being drafted.
>>
>> Thanks so much to the engineers for their hard work: Chao Long and Yichen Zhou!
>>
>> This is not the final version; there is room to improve in many aspects:
>> Parquet, Spark, and Kylin. It can be used for PoC at this moment. Your
>> comments are welcome. Let's improve it together.
>>
>> [1] https://issues.apache.org/jira/browse/KYLIN-3621
>> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Work email: shaofeng....@kyligence.io
>> Kyligence Inc: https://kyligence.io/
>>
>>
>>
>>
>
>
>


-- 


Regards!

Aron Tao
