Hello, Kylin users, 
    Here is my proposal for implementing Cube Planner phase one in Kylin 4; the full write-up is at 
https://cwiki.apache.org/confluence/display/KYLIN/KIP-3+Support+Cube+Planner+Phase+One+for+Kylin+4.
If you have any suggestions, please let me know. Thank you. 




KIP-3 Support Cube Planner Phase One for Kylin 4



Q1. What are you trying to do? Articulate your objectives using absolutely no 
jargon.

In Apache Kylin 4, the Kylin team has implemented a new build engine and a new 
query engine to provide better performance; please refer to KIP-1: Parquet 
Storage if you are interested. However, the current cuboid pruning tool (Cube 
Planner) is not compatible with the new build engine, so I want to make the new 
build engine support Cube Planner.

Q2. What problem is this proposal NOT designed to solve?

I am not going to support Cube Planner phase two at the moment, because phase 
two depends on metrics collected in CubeVisitService.java (aggRowCount & 
totalRowCount) to infer the row count of unbuilt/new cuboids. HBase storage has 
been removed in Kylin 4, so we would have to find another way to infer the row 
count of unbuilt/new cuboids. Besides, the System Cube (metrics system) needs 
to be refactored, and the metrics in METRICS_QUERY_RPC are deprecated because 
the storage has changed (we no longer have HBase region servers).

Q3. How is it done today, and what are the limits of current practice?

The work is almost done in my patch; please review it at 
https://github.com/apache/kylin/pull/1485 .
Adding a new step to calculate each cuboid's HyperLogLog counter degrades build 
performance slightly, which looks acceptable to me.

Q4. What is new in your approach and why do you think it will be successful?

The approach itself is not new; the main logic of the newly added code follows 
the original implementation in FactDistinctColumnsMapper.java .
Cube Planner phase one depends on the row count of each cuboid to calculate 
BPUS (benefit per unit space). By introducing a new step that calculates a 
HyperLogLog counter for each candidate cuboid, we can now enable Cube Planner 
phase one. A simplified sketch of how these row counts feed the greedy 
selection is shown below.
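For readers unfamiliar with phase one, here is a minimal sketch (in Scala, not 
the code in the PR) of the greedy benefit-per-unit-space idea it builds on: the 
per-cuboid row-count estimates from the HLL step drive the selection. All names 
below (greedySelect, the covers predicate, the maxCuboids cap) are illustrative 
only; the real planner applies its own space/expansion budget instead of a 
simple cuboid-count cap.

    // Simplified greedy cuboid selection by benefit per unit space (BPUS).
    // rowCount: estimated row count of every candidate cuboid, keyed by cuboid id
    //           (these estimates come from the per-cuboid HyperLogLog step).
    // covers(c, w): true if cuboid c can answer queries on cuboid w, i.e. c's
    //               dimension set is a superset of w's (assumed reflexive).
    def greedySelect(rowCount: Map[Long, Long],
                     baseCuboid: Long,
                     covers: (Long, Long) => Boolean,
                     maxCuboids: Int): Set[Long] = {

      // Cheapest way to answer cuboid w with the cuboids selected so far.
      // The base cuboid covers every w, so the filtered set is never empty.
      def costOf(w: Long, selected: Set[Long]): Long =
        selected.filter(c => covers(c, w)).map(rowCount).min

      var selected = Set(baseCuboid)   // the base cuboid is always built
      var done = false
      while (!done && selected.size < maxCuboids) {
        val candidates = rowCount.keySet -- selected
        if (candidates.isEmpty) done = true
        else {
          // Score every remaining candidate by BPUS: rows saved across all
          // cuboids it can answer, divided by its own estimated size.
          val (best, bestBpus) = candidates.map { c =>
            val benefit = rowCount.keys
              .filter(w => covers(c, w))
              .map(w => math.max(0L, costOf(w, selected) - rowCount(c)))
              .sum
            c -> benefit.toDouble / rowCount(c)
          }.maxBy(_._2)
          if (bestBpus <= 0) done = true else selected += best
        }
      }
      selected
    }
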
Q5. Who cares? If you are successful, what difference will it make?

After this task is done, Kylin 4 will support Cube Planner phase one, which 
will make cuboid pruning much easier than it is today (currently Cube Planner 
is not supported at all in Kylin 4).

Q6. What are the risks?

So far so good. The only cost observed so far is the slight build-time overhead 
of the new statistics step mentioned in Q3.

Q7. How long will it take?

I have spent about three weeks reading the original source code, writing my 
code, and testing it. It is almost done.

Q8. How does it work?

- Use Spark to calculate each cuboid's HLLCounter for the first segment and 
persist the counters to HDFS.
- Re-enable Cube Planner by default, but do not support Cube Planner phase two.
- Do not merge cuboid statistics (HLLCounter) when merging segments.
- By default, only calculate cuboid statistics for the FIRST segment (later 
segments are not necessary because phase two is not supported).
- The HLLCounter used for cuboid statistics has precision 14 (2^14 registers, 
roughly 0.8% standard error).
- Calculate cuboid statistics on 100% of the input flat table data (we may 
sample the input RDD in the future).

A rough sketch of the statistics step follows.
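To make the statistics step concrete, here is a minimal sketch using stock 
Spark APIs (again Scala, and again not the code in the PR): a cuboid's row 
count is the number of distinct combinations of its dimension values in the 
flat table, and Spark's approx_count_distinct (HyperLogLog-based, like Kylin's 
HLLCounter) can estimate all of them in one pass. The object and method names 
are made up for the illustration; the actual step builds HLLCounter objects 
with precision 14 and persists them to HDFS.

    import org.apache.spark.sql.{Column, DataFrame, Row}
    import org.apache.spark.sql.functions.{approx_count_distinct, col, concat_ws}

    object CuboidStatsSketch {

      // Estimate the row count of every candidate cuboid in a single scan of
      // the flat table. Each cuboid is described by the names of its dimension
      // columns; its row count is the (approximate) number of distinct
      // combinations of those columns.
      def estimateRowCounts(flatTable: DataFrame,
                            cuboids: Seq[Seq[String]]): Map[Seq[String], Long] = {
        val aggs: Seq[Column] = cuboids.zipWithIndex.map { case (dims, i) =>
          // Concatenate the dimension values with a separator that should not
          // occur in the data, then count distinct combinations with HLL++.
          approx_count_distinct(concat_ws("\u0001", dims.map(col): _*)).as(s"c$i")
        }
        val row: Row = flatTable.select(aggs: _*).head()
        cuboids.zipWithIndex.map { case (dims, i) => dims -> row.getLong(i) }.toMap
      }
    }

approx_count_distinct also accepts an rsd parameter if a tighter error bound is 
wanted; the real step controls accuracy through the HLLCounter precision instead.
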
Reference
https://github.com/apache/kylin/pull/1485

--

Best wishes to you!
From: Xiaoxiang Yu
