[ 
https://issues.apache.org/jira/browse/KYLIN-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hongbin ma updated KYLIN-1237:
------------------------------
    Description: 
Currently CreateHTableJob.estimateCuboidStorageSize does not take HBase 
encoding and compression into consideration. From our observation on real 
cubes, the estimate can be tens of times bigger than the actual size:

Here are some stats (estimated size => real size):

Cube 1 (w/o HLL, holistic distinct count): 1051G => 161G
Cube 2 (w/o HLL):                          2118G => 504G
Cube 3 (w/o HLL):                          3507G => 791G
Cube 4 (w/ 2 HLL15):                       188T  => 2T
Cube 5 (w/ 2 HLL15):                       28T   => 0.7T
Cube 6 (w/ 1 HLL16):                       172G  => 30G

From the stats we can see that for cubes without HLL the estimate can be 4~5 
times bigger, while for cubes with HLL it can be more than 50 times bigger! 
(It's worth studying why cube 6 is over-estimated by only about 6x; maybe it's 
related to the HLL precision level, or maybe it's due to the data.)

To reduce region counts, we will apply an estimation discount as follows:

    if (isMemoryHungry) {
        logger.info("Cube is memory hungry, storage size multiply 0.05");
        ret *= 0.05;
    } else {
        logger.info("Cube is not memory hungry, storage size multiply 0.25");
        ret *= 0.25;
    }

Let's see how it works.
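As a sanity check, the discount can be exercised against the observed numbers above. The sketch below is illustrative only, not actual Kylin code; the applyDiscount method, the class name, and the unit handling are assumptions:

```java
// Hypothetical standalone sketch of the proposed discount (not Kylin's API).
// The 0.05 / 0.25 factors roughly invert the observed over-estimation:
// more than 50x for memory-hungry (HLL) cubes, ~4-5x for plain cubes.
public class CubeSizeEstimate {

    // rawEstimate and the return value are in the same unit (e.g. GB).
    static double applyDiscount(double rawEstimate, boolean isMemoryHungry) {
        return isMemoryHungry ? rawEstimate * 0.05 : rawEstimate * 0.25;
    }

    public static void main(String[] args) {
        // cube 3 (w/o HLL): 3507G estimated, 791G actual
        System.out.println(applyDiscount(3507, false)); // 876.75, close to 791
        // cube 6 (w/ 1 HLL16): 172G estimated, 30G actual
        System.out.println(applyDiscount(172, true));   // ~8.6, an under-estimate
    }
}
```

Note that with the 0.05 factor, cube 6 (over-estimated by only ~6x) flips to an under-estimate, which is exactly the kind of case worth watching after the change.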



> Revisit on cube size estimation
> -------------------------------
>
>                 Key: KYLIN-1237
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1237
>             Project: Kylin
>          Issue Type: Improvement
>    Affects Versions: v2.1, v2.0
>            Reporter: hongbin ma
>            Assignee: hongbin ma
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
