[jira] [Updated] (KYLIN-187) Data Statistics Analyzer

Roger Shi (JIRA) Wed, 08 Mar 2017 00:21:59 -0800

     [ https://issues.apache.org/jira/browse/KYLIN-187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Roger Shi updated KYLIN-187:
----------------------------
    Description: 
## 1. Overview 
We need the statistics data for the following domains:
* Design cube metadata based on query log
* Design HBase row-key based on data distribution (e.g. histogram and 
cardinality)
* Choose execution plan based on cuboid data

## 2. Data Analyzer 
We need to analyze the Hive data and the cube data in two phases. First, we 
analyze the Hive data to guide the first-round design of the row key. Then we 
analyze the cube data to refine the row key design and to estimate query cost.

#### 2.1. Analyze Hive Data 
We need to collect the following statistics on the Hive table:
* Cardinality of each dimension
* Cardinality of dimension combinations (optional)
* Value distribution of each dimension (optional)

Based on these statistics, we can order the row key groups from 
high-cardinality dimensions to low-cardinality dimensions. We should also 
split the dimensions evenly across the row key groups, which reduces the 
number of cuboids.
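
The Hive-side analysis can be sketched roughly as follows. This is only an illustration, not Kylin code: the function names, the sample rows, and the in-memory row format are all hypothetical, and the cuboid estimate assumes that a group of n dimensions contributes up to 2^n cuboids, which the description above does not state.

```python
from collections import defaultdict


def dimension_cardinalities(rows, dimensions):
    """Count the distinct values of each dimension in a list of dict rows."""
    seen = defaultdict(set)
    for row in rows:
        for dim in dimensions:
            seen[dim].add(row[dim])
    return {dim: len(seen[dim]) for dim in dimensions}


def order_row_key(cardinalities):
    """Order dimensions from high to low cardinality (ties broken by name)."""
    return sorted(cardinalities, key=lambda d: (-cardinalities[d], d))


def split_evenly(ordered_dims, group_count):
    """Split the ordered dimensions into contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(ordered_dims), group_count)
    groups, start = [], 0
    for i in range(group_count):
        size = base + (1 if i < extra else 0)
        groups.append(ordered_dims[start:start + size])
        start += size
    return groups


def cuboid_upper_bound(groups):
    """Assumption: each group of n dimensions yields at most 2**n cuboids."""
    return sum(2 ** len(group) for group in groups)


# Hypothetical sample rows standing in for a Hive table.
rows = [
    {"user_id": 1, "country": "US", "device": "ios", "gender": "F"},
    {"user_id": 2, "country": "US", "device": "android", "gender": "M"},
    {"user_id": 3, "country": "CN", "device": "ios", "gender": "F"},
    {"user_id": 4, "country": "DE", "device": "web", "gender": "M"},
]
dims = ["user_id", "country", "device", "gender"]

cards = dimension_cardinalities(rows, dims)
ordered = order_row_key(cards)
even_groups = split_evenly(ordered, 2)
uneven_groups = [ordered[:3], ordered[3:]]
```

Under the stated assumption, the even split of four dimensions into two groups gives a smaller bound than a 3 + 1 split, which is one way to read the remark that an even split reduces the number of cuboids.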

#### 2.2. Analyze Cube Data 
We need to collect the following statistics on the cube data:
* Count of each cuboid
* Group ratio of each cuboid = current cuboid count / lower group base cuboid count
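
A minimal sketch of the two cube statistics, for illustration only. The description does not define "lower group base cuboid", so the code takes the base cuboid as an explicit argument; the example passes the cuboid containing every dimension, which is an assumption. The function names and sample rows are hypothetical.

```python
from itertools import combinations


def cuboid_counts(rows, dimensions):
    """Map each non-empty dimension subset (cuboid) to its number of distinct rows."""
    counts = {}
    for size in range(1, len(dimensions) + 1):
        for cuboid in combinations(dimensions, size):
            distinct = {tuple(row[d] for d in cuboid) for row in rows}
            counts[cuboid] = len(distinct)
    return counts


def group_ratio(counts, cuboid, base_cuboid):
    """Group ratio = current cuboid count / base cuboid count."""
    return counts[cuboid] / counts[base_cuboid]


# Hypothetical sample rows; the last row repeats the first on purpose.
rows = [
    {"country": "US", "device": "ios", "gender": "F"},
    {"country": "US", "device": "android", "gender": "M"},
    {"country": "CN", "device": "ios", "gender": "F"},
    {"country": "US", "device": "ios", "gender": "F"},
]
dims = ("country", "device", "gender")

counts = cuboid_counts(rows, dims)
ratio = group_ratio(counts, ("device", "gender"), dims)
```

Enumerating every subset is exponential in the number of dimensions, so this only suits a small dimension set or a sample; a real analyzer would presumably read the counts from the built cube instead.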

## 3. Query Analyzer 
TBD

---------------- Imported from GitHub ----------------
Url: https://github.com/KylinOLAP/Kylin/issues/318
Created by: [lukehan|https://github.com/lukehan]
Labels: newfeature, 
Milestone: Backlog
Created at: Fri Dec 26 15:21:24 CST 2014
State: open


  was:
#Overview 
We need the statistics data for the following domains:
* Design cube metadata based on query log
* Design HBase row-key based on data distribution (e.g. histogram and 
cardinality)
* Choose execution plan based on cuboid data

#Data Analyzer 
We need to analyzer the hive data and cube data in 2 phases. Firstly, we will 
analyze the hive to guide the 1st round design of row key. Then we will analyze 
the cube data to refine the design of row key and to estimate the cost of query.

##Analyze Hive Data 
We need to analyze the following statistics data on hive table:
* Cardinality of each dimension
* Cardinality of dimension combination (optional)
* Value distribution of each dimension (optional)
Based on the statistics of hive data, we can design row key group from high 
cardinality dimension to low cardinality dimension. BTW, we should evenly split 
dimension into the row key group that will reduce the number of cuboid.

##Analyze Cube Data 
We need to analyze the following statistics on data cube:
* Count of each cuboid
* Group ratio of each cuboid = current cuboid count / lower group base cuboid 
count 

# Query Analyzer 
TBD

---------------- Imported from GitHub ----------------
Url: https://github.com/KylinOLAP/Kylin/issues/318
Created by: [lukehan|https://github.com/lukehan]
Labels: newfeature, 
Milestone: Backlog
Created at: Fri Dec 26 15:21:24 CST 2014
State: open



> Data Statistics Analyzer 
> -------------------------
>
>                 Key: KYLIN-187
>                 URL: https://issues.apache.org/jira/browse/KYLIN-187
>             Project: Kylin
>          Issue Type: New Feature
>          Components: Tools, Build and Test
>            Reporter: Luke Han
>              Labels: github-import
>             Fix For: Backlog



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
