Hi Gang,

In my opinion, segment/partition pruning actually falls within the scope of an "index system". We can have an index system at the storage level that includes a file index (for segment/partition pruning), a page index (for page pruning), and so on. Putting all of this in one system makes the separation of duties cleaner.
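
To make the idea concrete, here is a minimal sketch in Python of what a two-level index could look like for an equality filter on one column. It is only an illustration under my own assumptions: the names `FileIndex`, `PageIndex`, and `plan_scan` are hypothetical and are not part of Kylin or Parquet, and the optional `distinct_values` set merely stands in for the pruning role the dictionary plays today.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PageIndex:
    """Min/max statistics for one page of a column (hypothetical structure)."""
    page_id: int
    min_value: str
    max_value: str

    def may_contain(self, value: str) -> bool:
        return self.min_value <= value <= self.max_value


@dataclass
class FileIndex:
    """File-level index for one segment/partition file (hypothetical structure)."""
    segment: str
    min_value: str
    max_value: str
    # Optional set of distinct values; assumed here to replace dictionary lookups.
    distinct_values: Optional[Set[str]] = None
    pages: List[PageIndex] = field(default_factory=list)

    def may_contain(self, value: str) -> bool:
        if not (self.min_value <= value <= self.max_value):
            return False
        if self.distinct_values is not None:
            return value in self.distinct_values
        return True


def plan_scan(files: List[FileIndex], value: str) -> Dict[str, List[int]]:
    """Return {segment: [page ids to read]} for the filter `column = value`."""
    plan: Dict[str, List[int]] = {}
    for f in files:
        if not f.may_contain(value):
            continue  # file index: skip the whole segment/partition
        page_ids = [p.page_id for p in f.pages if p.may_contain(value)]
        if page_ids:  # page index: keep only pages whose range covers the value
            plan[f.segment] = page_ids
    return plan


files = [
    FileIndex("seg_2018_11", "A", "F", {"A", "C", "F"},
              [PageIndex(0, "A", "C"), PageIndex(1, "F", "F")]),
    FileIndex("seg_2018_12", "B", "M", None,
              [PageIndex(0, "B", "D"), PageIndex(1, "E", "M")]),
]

print(plan_scan(files, "D"))
```

With this layout the query engine only asks the index system which segments and pages to read, and it does not need to hold any dictionary in memory for pruning.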

Ma Gang <mg4w...@163.com> wrote on Wed, Dec 19, 2018 at 6:31 AM:

> Awesome! Looking forward to the improvement. As for the dictionary, keeping
> it in the query engine is usually not a good idea, since it puts a lot of
> pressure on the Kylin server. Sometimes it does help, though: for example,
> some segments can be pruned very early when the filter value is not in the
> dictionary, and some queries can be answered directly from the dictionary,
> as described in: https://issues.apache.org/jira/browse/KYLIN-3490
>
> At 2018-12-17 15:36:01, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
>
> The dimension dictionary is a legacy design for HBase storage, I think.
> Because HBase has no data types and everything is a byte array, Kylin has
> to encode STRING and other types with some encoding method such as the
> dictionary.
>
> Now, with a storage format like Parquet, the storage decides how to encode
> the data at the page or block level. We can then drop the dictionary after
> the cube is built. This will relieve the memory pressure on Kylin query
> nodes and also benefit the UHC case.
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Work email: shaofeng....@kyligence.io
> Kyligence Inc: https://kyligence.io/
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscr...@kylin.apache.org
> Join Kylin dev mail group: dev-subscr...@kylin.apache.org
>
> Chao Long <wayn...@qq.com> wrote on Mon, Dec 17, 2018 at 1:23 PM:
>
>> In this PoC, we verified that Kylin on Parquet is viable, but the query
>> performance still has room to improve. We can improve it in the
>> following aspects:
>>
>> 1, Minimize result set serialization time
>> Since Kylin needs Object[] data to process, we convert the Dataset to an
>> RDD and then convert the "Row" type to Object[], so Spark needs to
>> serialize the Object[] before returning it to the driver. That time
>> should be avoided.
>>
>> 2, Query without dictionary
>> In this PoC, to reduce storage use, we keep dict-encoded values in the
>> Parquet file for dict-encoded dimensions, so Kylin must load the
>> dictionary to convert dict values at query time. If we keep the original
>> values for dict-encoded dimensions, the dictionary is unnecessary. We also
>> don't have to worry about storage use, because Parquet will encode the
>> values. We should remove the dictionary from the query path.
>>
>> 3, Remove query single-point issue
>> In this PoC, we use Spark to read and process cube data, which is
>> distributed, but Kylin also needs to process the result data that Spark
>> returns in a single JVM. We can try to make that distributed too.
>>
>> 4, Upgrade Parquet to 1.11 for page index
>> In this PoC, Parquet doesn't have a page index, so we get poor filter
>> performance. We need to upgrade Parquet to version 1.11, which has a page
>> index, to improve filter performance.
>>
>> ------------------
>> Best Regards,
>> Chao Long
>>
>> ------------------ Original Message ------------------
>> *From:* "ShaoFeng Shi"<shaofeng...@apache.org>;
>> *Sent:* Friday, Dec 14, 2018, 4:39 PM
>> *To:* "dev"<dev@kylin.apache.org>;"user"<u...@kylin.apache.org>;
>> *Subject:* Evaluate Kylin on Parquet
>>
>> Hello Kylin users,
>>
>> The first version of the Kylin on Parquet [1] feature has been staged in
>> the Kylin code repository for public review and evaluation. You can check
>> out the "kylin-on-parquet" branch [2] to read the code, and you can also
>> make a binary build to run an example. When creating a cube, you can
>> select "Parquet" as the storage on the "Advanced setting" page. Both the
>> MapReduce and Spark engines support this new storage. A tech blog on the
>> design and implementation is being drafted.
>>
>> Thanks so much to the engineers for their hard work: Chao Long and
>> Yichen Zhou!
>>
>> This is not the final version; there is room to improve in many aspects:
>> Parquet, Spark, and Kylin. It can be used for PoC at this moment. Your
>> comments are welcome. Let's improve it together.
>>
>> [1] https://issues.apache.org/jira/browse/KYLIN-3621
>> [2] https://github.com/apache/kylin/tree/kylin-on-parquet
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC

--
Regards!

Aron Tao