Hello, Kylin users,
Here is my proposal for implementing Cube Planner phase one in Kylin 4. The design document is here:
https://cwiki.apache.org/confluence/display/KYLIN/KIP-3+Support+Cube+Planner+Phase+One+for+Kylin+4
If you have any suggestions, please let me know. Thank you.
KIP-3 Support Cube Planner Phase One for Kylin 4
Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.
Q2. What problem is this proposal NOT designed to solve?
Q3. How is it done today, and what are the limits of current practice?
Q4. What is new in your approach and why do you think it will be successful?
Q5. Who cares? If you are successful, what difference will it make?
Q6. What are the risks?
Q7. How long will it take?
Q8. How it works?
Reference
Q1. What are you trying to do? Articulate your objectives using absolutely no
jargon.
In Apache Kylin 4, the Kylin team has implemented a new build engine and a new
query engine to provide better performance; please refer to KIP-1: Parquet
Storage if you are interested. But the current cuboid pruning tool (Cube
Planner) is not compatible with the new build engine, so I want to make the new
build engine support Cube Planner.
Q2. What problem is this proposal NOT designed to solve?
I am not going to support Cube Planner phase 2 at the moment, because phase 2
depends on some metrics in CubeVisitService.java (aggRowCount & totalRowCount)
to infer the row count of unbuilt/new cuboids. HBase storage has been removed
in Kylin 4, so we have to find another way to infer the row count of
unbuilt/new cuboids. Besides, the System Cube (metrics system) needs to be
refactored, and the metrics in METRICS_QUERY_RPC are deprecated because the
storage has changed (we no longer have HBase region servers).
Q3. How is it done today, and what are the limits of current practice?
It is almost done in my patch; please check or review it at
https://github.com/apache/kylin/pull/1485 .
Adding a new step to calculate each cuboid's HyperLogLog counter does degrade
build performance slightly, but it looks acceptable to me.
Q4. What is new in your approach and why do you think it will be successful?
It is not a new approach; the main logic of the newly added code resembles the
original logic in FactDistinctColumnsMapper.java.
We know that Cube Planner phase 1 depends on the row count of each cuboid to
calculate BPUS (benefit per unit space). By introducing a new step that
calculates a HyperLogLog counter for each candidate cuboid, we can enable Cube
Planner phase 1.
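To make the BPUS idea concrete, here is a small, self-contained Java sketch of a greedy selection step driven by BPUS. It is illustrative only: the class and method names are mine, not Kylin's, and cuboids are modeled as plain bitmasks over dimensions, with row counts taken from the (HLL-estimated) statistics.

```java
import java.util.*;

public class BpusSketch {
    // Hypothetical greedy step of Cube Planner phase 1 (illustrative only).
    // A cuboid is a bitmask over dimensions; cuboid c can answer query cuboid q
    // iff q's dimensions are a subset of c's, i.e. (c & q) == q.
    static long pickBest(Map<Long, Long> rowCounts, Set<Long> selected,
                         List<Long> candidates) {
        long best = -1;
        double bestBpus = 0;
        for (long cand : candidates) {
            if (selected.contains(cand)) continue;
            long candRows = rowCounts.get(cand);
            long benefit = 0;
            for (long q : rowCounts.keySet()) {
                if ((cand & q) != q) continue;       // cand cannot answer q
                long cur = currentCost(q, selected, rowCounts);
                if (candRows < cur) benefit += cur - candRows;
            }
            double bpus = (double) benefit / candRows;  // benefit per unit space
            if (bpus > bestBpus) { bestBpus = bpus; best = cand; }
        }
        return best;
    }

    // Cost of answering q today: row count of the cheapest selected ancestor.
    static long currentCost(long q, Set<Long> selected, Map<Long, Long> rowCounts) {
        long cost = Long.MAX_VALUE;
        for (long s : selected)
            if ((s & q) == q) cost = Math.min(cost, rowCounts.get(s));
        return cost;
    }

    public static void main(String[] args) {
        // Dimensions A=1, B=2, C=4; base cuboid ABC=7 is always materialized.
        Map<Long, Long> rowCounts = new HashMap<>();
        rowCounts.put(7L, 1_000_000L);  // ABC
        rowCounts.put(3L, 10_000L);     // AB
        rowCounts.put(1L, 100L);        // A
        Set<Long> selected = new HashSet<>(Collections.singleton(7L));
        System.out.println(pickBest(rowCounts, selected,
                new ArrayList<>(rowCounts.keySet())));
    }
}
```

With a base cuboid ABC (1,000,000 rows) and candidates AB (10,000 rows) and A (100 rows), the sketch picks A first, because its benefit per unit space is highest.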
Q5. Who cares? If you are successful, what difference will it make?
After this task is done, Kylin 4 will support Cube Planner phase 1, which will
make cuboid pruning much easier than the current state (not supported at all).
Q6. What are the risks?
So far so good.
Q7. How long will it take?
I have spent about three weeks reading the original source code, writing my
code, and testing it. It is almost done.
Q8. How it works?
- Use Spark to calculate each cuboid's HLLCounter for the first segment and
persist it to HDFS.
- Re-enable Cube Planner by default, but do not support Cube Planner phase two.
- Do not merge cuboid statistics (HLLCounter) when merging segments.
- By default, only calculate cuboid statistics for the FIRST segment. (Not
necessary for later segments because phase two is not supported.)
- Cuboid statistics use HLLCounter with precision 14.
- Calculate cuboid statistics using 100% of the input flat table data. (We may
sample the input RDD in the future.)
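The statistics step above can be sketched as follows. This is an illustrative, self-contained Java version: each flat-table row is projected onto every candidate cuboid, and the projected key is added to a per-cuboid distinct counter. In the actual patch this runs as a distributed Spark step and feeds the keys into Kylin's HLLCounter with precision 14 (standard error roughly 1.04/sqrt(2^14), about 0.8%); here an exact HashSet stands in for the HLLCounter so the example runs without Kylin dependencies.

```java
import java.util.*;

public class CuboidStatsSketch {
    // Estimate the row count of each candidate cuboid from flat-table rows.
    // A cuboid is a bitmask over the dimension columns; its row count is the
    // number of distinct projected dimension keys. (Kylin 4 uses an HLLCounter
    // with precision 14 instead of the exact HashSet used here.)
    static Map<Long, Long> cuboidRowCounts(List<String[]> rows, List<Long> cuboids) {
        Map<Long, Set<String>> counters = new HashMap<>();
        for (long c : cuboids) counters.put(c, new HashSet<>());
        for (String[] row : rows) {
            for (long c : cuboids) {
                // Project the row onto the cuboid's dimensions to build its key.
                StringBuilder key = new StringBuilder();
                for (int d = 0; d < row.length; d++)
                    if ((c & (1L << d)) != 0) key.append(row[d]).append('\u0000');
                counters.get(c).add(key.toString());
            }
        }
        Map<Long, Long> result = new HashMap<>();
        counters.forEach((c, keys) -> result.put(c, (long) keys.size()));
        return result;
    }

    public static void main(String[] args) {
        // Flat table with dimensions [country, city];
        // cuboid 3 = {country, city}, cuboid 1 = {country}.
        List<String[]> rows = Arrays.asList(
                new String[]{"CN", "Beijing"},
                new String[]{"CN", "Shanghai"},
                new String[]{"US", "Boston"});
        System.out.println(cuboidRowCounts(rows, Arrays.asList(3L, 1L)));
    }
}
```

For the three sample rows, cuboid {country, city} has 3 distinct keys while cuboid {country} has only 2, which is exactly the per-cuboid row-count signal that BPUS needs.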
Reference
https://github.com/apache/kylin/pull/1485
--
Best wishes to you !
From: Xiaoxiang Yu