[ 
https://issues.apache.org/jira/browse/KYLIN-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Calaba updated KYLIN-1836:
----------------------------------
    Description: 
After reading the tech blog post 
https://kylin.apache.org/blog/2016/02/18/new-aggregation-group/ by Hongbin Ma, 
I have a few ideas (listed below) to help cube designers understand the 
impact of their cube design on build and query performance.

BTW: Thank you for putting this blog post together, and thank you for linking 
to it from the Kylin UI (in the Aggregation Groups section). It is a very 
powerful optimization technique.

Idea 1
=====

It would be great if the Advanced Settings section of the UI could calculate 
the exact number of cuboids defined by every Aggregation Group (# of 
combinations; # of pruned combinations, based on Hierarchy/Joint and Mandatory 
Dimensions) and also show the overall total of cuboids across ALL the defined 
Aggregation Groups.
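
To make Idea 1 concrete, here is a minimal sketch in Python of how such a 
count could be computed. It is not Kylin's actual cuboid scheduler: the pruning 
rules are my simplified reading of the blog post, and the function name 
group_cuboids and the dimension names A..H are made up for illustration.

```python
from itertools import product


def group_cuboids(includes, mandatory=(), hierarchies=(), joints=()):
    """Enumerate the dimension combinations allowed by one aggregation group.

    Simplified model (an assumption, not Kylin's real implementation):
      - mandatory dimensions appear in every cuboid;
      - a hierarchy (A, B, C) only allows the prefixes {}, {A}, {A,B}, {A,B,C};
      - a joint group appears either completely or not at all;
      - every other included dimension is free (present or absent).
    """
    mandatory = frozenset(mandatory)
    constrained = set(mandatory)
    choices = []
    for hierarchy in hierarchies:
        constrained.update(hierarchy)
        choices.append(
            [frozenset(hierarchy[:i]) for i in range(len(hierarchy) + 1)]
        )
    for joint in joints:
        constrained.update(joint)
        choices.append([frozenset(), frozenset(joint)])
    for dim in includes:
        if dim not in constrained:
            choices.append([frozenset(), frozenset([dim])])
    cuboids = {mandatory.union(*combo) for combo in product(*choices)}
    # Assumption: a cuboid needs at least one dimension.
    cuboids.discard(frozenset())
    return cuboids


group1 = group_cuboids(
    includes=["A", "B", "C", "D", "E", "F", "G"],
    mandatory=["A"],
    hierarchies=[("B", "C")],
    joints=[("D", "E")],
)
group2 = group_cuboids(includes=["A", "F", "G", "H"])

unpruned = 2 ** 7 - 1          # non-empty combinations of group1's 7 dimensions
total = len(group1 | group2)   # union: cuboids shared by both groups count once

print(f"group1: {len(group1)} cuboids, {unpruned - len(group1)} pruned")
print(f"group2: {len(group2)} cuboids")
print(f"overall: {total} distinct cuboids")
```

Under these assumptions group1 yields 24 cuboids instead of 127, and the 
overall total is 35 rather than 24 + 15, because four cuboids belong to both 
groups. That overlap is why the overall number should be shown separately and 
not left for the designer to add up.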

Idea 2
=====

The Aggregation Group section is about optimizing the # of necessary cuboids, 
assuming you know the query patterns. This is sometimes easy, but for more 
complex dashboards where multiple people work on defining the queries it is 
hard to control and guess. Thus I would suggest adding a new tab in the Monitor 
section of the Kylin UI, next to Job and Slow Queries: a "Non-satisfied 
Queries" tab showing the queries which Kylin was not able to evaluate, i.e. 
queries which end with a "No Realization" exception. Together with the query 
SQL (including all the parameters) it would help to show the "missing dimension 
name" used in the query which was the cause for not finding a proper cuboid.
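
As a rough illustration of what such a tab could report, here is a small 
Python sketch. Everything in it is hypothetical: the failed-query log, the 
table and column names, and the helper missing_dimensions. It also assumes the 
columns used for grouping and filtering have already been extracted from the 
SQL, and that a missing dimension is the only reason for the failure, which in 
practice may not hold.

```python
from collections import Counter


def missing_dimensions(query_columns, cube_dimensions):
    """Return the query columns that the cube does not define as dimensions."""
    known = {d.upper() for d in cube_dimensions}
    return sorted({c.upper() for c in query_columns} - known)


# Hypothetical cube definition and log of queries that found no realization.
# Each entry: (SQL text, columns used in WHERE / GROUP BY).
cube_dimensions = ["REGION", "SELLER_ID", "PART_DT"]
failed_queries = [
    ("SELECT region, channel, SUM(price) FROM sales GROUP BY region, channel",
     ["region", "channel"]),
    ("SELECT seller_id, COUNT(*) FROM sales "
     "WHERE buyer_country = 'US' GROUP BY seller_id",
     ["seller_id", "buyer_country"]),
    ("SELECT channel, SUM(price) FROM sales GROUP BY channel",
     ["channel"]),
]

# Count how often each missing dimension caused a query to fail.
report = Counter()
for sql, columns in failed_queries:
    for dim in missing_dimensions(columns, cube_dimensions):
        report[dim] += 1

for dim, hits in report.most_common():
    print(f"{dim}: missing in {hits} failed queries")
```

Sorting by frequency would tell the cube designer which dimension to add 
first.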


Idea 3
=====
Can anyone also document the Rowkeys section in the same part of the UI 
(Advanced Settings)? It is not really clear what effect it will have if I start 
playing with the Rowkeys section (adding/removing dimension fields, adding 
non-dimension fields, ...). All I understand is that the "Rowkeys" section has 
an impact only on the HBase storage of calculated cuboids. Thus it doesn't have 
much impact on cube build time (except that the Trie for the dictionary needs 
to be built for every rowkey specified on this tab). I understand that the 
major impact of the Rowkeys section is thus only on HBase size / region splits 
and thus also on query execution time. 

What I am confused about is whether I can define a high-cardinality dimension 
in the cube and remove it from the Rowkeys section. What would happen to HBase 
storage and expected query time? Would that dimension still be queryable?

The closest explanation I found is this reply from Yu Feng: 
http://apache-kylin.74782.x6.nabble.com/Relationship-between-rowkey-column-length-and-cube-size-td3174.html
==========================================================
Reply: Cube size determines how to split regions for the table in HBase after 
generating all cuboid files. For example, if all of your cuboid files total 
100GB, your cube size is set to "SMALL", and the property for SMALL is 10GB, 
Kylin will create the HBase table with 10 regions. It will calculate the start 
rowkey and end rowkey of every region before creating the HTable, then create 
the table with that split information. 

Rowkey column length is another thing. You can choose either to use a 
dictionary or to set a rowkey column length for every dimension. If you use a 
dictionary, Kylin will build a dictionary for this column (Trie tree), which 
means every value of the dimension will be encoded as a unique number. Because 
the dimension value is part of the HBase rowkey, the dictionary reduces the 
HBase table size. However, Kylin stores the dictionary in memory, so if the 
dimension cardinality is large, this becomes a problem. If you set the rowkey 
column length to N for a dimension, Kylin will not build a dictionary for it, 
and every value will be cut to an N-length string: no dictionary in memory, 
but the rowkey in the HBase table will be longer. 
==========================================================
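
To check my understanding of that reply, here is a toy Python sketch of the 
region arithmetic and of the two encodings. It is not Kylin code: the function 
names, the zero-byte padding and the sorted integer IDs are my assumptions, and 
a real Trie dictionary is more involved (it may, for example, reserve IDs for 
null values).

```python
import math


def region_count(total_cuboid_gb, region_size_gb):
    """Number of regions if each region holds at most region_size_gb."""
    return max(1, math.ceil(total_cuboid_gb / region_size_gb))


def build_dictionary(values):
    """Map each distinct value to a small integer ID (toy stand-in for the Trie)."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def dictionary_width(cardinality):
    """Bytes needed in the rowkey to hold one ID out of `cardinality` values."""
    bits = (cardinality - 1).bit_length()
    return max(1, (bits + 7) // 8)


def encode_fixed(value, length):
    """Fixed-length encoding: truncate long values, pad short ones (assumed \\x00)."""
    raw = value.encode("utf-8")[:length]
    return raw.ljust(length, b"\x00")


# The example from the reply: 100GB of cuboid files, 10GB per region.
print(region_count(100, 10))                # 10

# Dictionary encoding: small IDs in the rowkey, dictionary kept in memory.
ids = build_dictionary(["NYC", "San Francisco", "Berlin", "NYC"])
print(ids)
print(dictionary_width(10_000_000))         # 3 bytes per value in the rowkey

# Fixed-length encoding: no dictionary, but wider rowkeys and truncation.
print(encode_fixed("San Francisco", 8))     # b'San Fran'
print(encode_fixed("NYC", 8))
```

So for a dimension with 10 million distinct values, the dictionary keeps the 
rowkey at about 3 bytes per value but has to hold all 10 million entries in 
memory, while a fixed length of, say, 8 bytes needs no dictionary but makes 
every rowkey longer and silently truncates long values.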

An additional, very light explanation of the Rowkeys section is here: 
https://kylin.apache.org/docs15/tutorial/create_cube.html
=====================================================
Rowkeys: the rowkeys are composed of the dimension encoded values. “Dictionary” 
is the default encoding method; if a dimension is not a good fit for dictionary 
encoding (e.g., cardinality > 10 million), select “false” and then enter the 
fixed length for that dimension, usually the max. length of that column; if a 
value is longer than that size it will be truncated. Please note, without 
dictionary encoding, the cube size might be much bigger.

You can drag & drop a dimension column to adjust its position in the rowkey. 
Put the mandatory dimensions at the beginning, followed by the dimensions that 
are heavily involved in filters (where conditions). Put high cardinality 
dimensions ahead of low cardinality dimensions.
=====================================================

I.e. the advice "Put high cardinality dimensions ahead of low cardinality 
dimensions", if it is really important, seems to be missing from the UI!
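
If the UI wanted to surface that advice, it could even suggest an order. Below 
is a minimal Python sketch of one possible reading of the tutorial text 
(mandatory first, then filter-heavy dimensions, then by descending 
cardinality). The Dimension class, the filtered_often flag and the sample 
dimensions are hypothetical, and how the three rules should be combined is my 
own guess.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str
    cardinality: int
    mandatory: bool = False
    filtered_often: bool = False   # hypothetical flag set by the cube designer


def suggest_rowkey_order(dimensions):
    """Order: mandatory first, then filter-heavy, then high cardinality first."""
    return sorted(
        dimensions,
        key=lambda d: (not d.mandatory, not d.filtered_often, -d.cardinality),
    )


dims = [
    Dimension("COUNTRY", 200),
    Dimension("BUYER_ID", 5_000_000),
    Dimension("CATEGORY", 50, filtered_often=True),
    Dimension("PART_DT", 365, mandatory=True),
    Dimension("SELLER_ID", 1_000_000, filtered_often=True),
]

for position, dim in enumerate(suggest_rowkey_order(dims), start=1):
    print(position, dim.name, dim.cardinality)
```

With these sample values the suggested order is PART_DT, SELLER_ID, CATEGORY, 
BUYER_ID, COUNTRY.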


> Kylin 1.5+ New Aggregation Group - UI improvements
> --------------------------------------------------
>
>                 Key: KYLIN-1836
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1836
>             Project: Kylin
>          Issue Type: Improvement
>    Affects Versions: v1.5.0, v1.5.1, v1.5.2, v1.5.3, v1.5.2.1
>            Reporter: Richard Calaba
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
