hongbin ma created KYLIN-1237:
---------------------------------

             Summary: Revisit on cube size estimation
                 Key: KYLIN-1237
                 URL: https://issues.apache.org/jira/browse/KYLIN-1237
             Project: Kylin
          Issue Type: Improvement
    Affects Versions: v2.1, v2.0
            Reporter: hongbin ma
            Assignee: hongbin ma


Currently CreateHTableJob.estimateCuboidStorageSize does not take HBase 
encoding and compression into consideration. From our observation on real 
cubes, the estimation can be tens of times bigger than the actual size.

Here are some stats (estimated size => actual size):

Cube 1 (w/o hll, holistic distinct count)
1051G => 161G

Cube 2 (w/o hll)
2118G => 504G

Cube 3 (w/o hll)
3507G => 791G

Cube 4 (w 2 hll15)
188T => 2T

Cube 5 (w 2 hll15)
28T => 0.7T

Cube 6 (w 1 hll16)
172G => 30G

From the stats we can see that for cubes without hll the estimation can be 4~5 
times bigger, while for cubes with hll it can be more than 50 times bigger! 
(It's worth studying why cube 6 is estimated only 6 times bigger; maybe it's 
related to the hll precision level, or maybe due to the data.)
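For reference, the over-estimation factors can be computed directly from the stats above. This is a minimal illustrative sketch, not Kylin code; the class and method names are made up:

```java
// Illustrative only: reproduces the over-estimation ratios from the stats above.
public class EstimationRatio {

    static double ratio(double estimated, double actual) {
        return estimated / actual;
    }

    public static void main(String[] args) {
        // {estimated, actual} storage sizes in GB, from the stats above (1T = 1024G)
        double[][] cubes = {
            {1051, 161},              // cube 1 (w/o hll)
            {2118, 504},              // cube 2 (w/o hll)
            {3507, 791},              // cube 3 (w/o hll)
            {188 * 1024, 2 * 1024},   // cube 4 (w 2 hll15)
            {28 * 1024, 0.7 * 1024},  // cube 5 (w 2 hll15)
            {172, 30}                 // cube 6 (w 1 hll16)
        };
        for (double[] c : cubes) {
            System.out.printf("estimated %.0fG, actual %.0fG, over-estimated %.1fx%n",
                    c[0], c[1], ratio(c[0], c[1]));
        }
    }
}
```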

To reduce region counts, we will apply an estimation discount as follows:

if (isMemoryHungry) {
    logger.info("Cube is memory hungry, storage size multiply 0.05");
    ret *= 0.05;
} else {
    logger.info("Cube is not memory hungry, storage size multiply 0.25");
    ret *= 0.25;
}

Let's see how it works.
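Applying the proposed discount to two of the cubes above gives a feel for how close the adjusted estimate lands. This is a hypothetical sketch around the snippet above; the class and method names are illustrative, and isMemoryHungry stands for a cube containing memory-hungry measures such as hll distinct count:

```java
// Hypothetical sketch of the proposed discount; names are illustrative, not Kylin API.
public class CubeSizeEstimate {

    static double applyDiscount(double rawEstimateGB, boolean isMemoryHungry) {
        // Memory-hungry cubes (e.g. with hll counters) over-estimate far more,
        // so they get the much steeper discount.
        return isMemoryHungry ? rawEstimateGB * 0.05 : rawEstimateGB * 0.25;
    }

    public static void main(String[] args) {
        // cube 2 from the stats: 2118G raw estimate, 504G actual
        System.out.println(applyDiscount(2118, false));         // 529.5G, close to the 504G actual
        // cube 4 from the stats: 188T raw estimate, 2T actual
        System.out.println(applyDiscount(188 * 1024, true));    // ~9.4T, still above the 2T actual
    }
}
```

So the 0.25 factor brings non-hll cubes very close, while even the 0.05 factor leaves some headroom for heavily hll cubes, which errs on the side of more regions rather than too few.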



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
