hongbin ma created KYLIN-1237:
---------------------------------
Summary: Revisit on cube size estimation
Key: KYLIN-1237
URL: https://issues.apache.org/jira/browse/KYLIN-1237
Project: Kylin
Issue Type: Improvement
Affects Versions: v2.1, v2.0
Reporter: hongbin ma
Assignee: hongbin ma
Currently CreateHTableJob.estimateCuboidStorageSize does not take HBase
encoding and compression into consideration. From our observation on real
cubes, the estimation can be tens of times bigger than the actual size.
Here are some stats (estimated => actual):
cube1 (w/o hll, holistic distinct count): 1051G => 161G
cube2 (w/o hll):                          2118G => 504G
cube3 (w/o hll):                          3507G => 791G
cube4 (w 2 hll15):                        188T  => 2T
cube5 (w 2 hll15):                        28T   => 0.7T
cube6 (w 1 hll16):                        172G  => 30G
From the stats we can see that for cubes without HLL the estimation can be 4~5
times bigger, while for cubes with HLL it can be more than 50 times bigger!
(It's worth studying why cube6 is over-estimated only ~6 times; maybe it's
related to the HLL precision level, maybe to the data itself.)
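For reference, the over-estimation ratios quoted above follow directly from the stats; this is plain arithmetic on the listed numbers, not Kylin code:

```java
public class EstimationRatios {
    public static void main(String[] args) {
        // {estimated, actual} sizes in GB, taken from the stats above
        double[][] cubes = {
            {1051, 161},                // cube1 (w/o hll)
            {2118, 504},                // cube2 (w/o hll)
            {3507, 791},                // cube3 (w/o hll)
            {188 * 1024, 2 * 1024},     // cube4 (w 2 hll15)
            {28 * 1024, 0.7 * 1024},    // cube5 (w 2 hll15)
            {172, 30},                  // cube6 (w 1 hll16)
        };
        for (double[] c : cubes) {
            // prints roughly 6.5, 4.2, 4.4, 94.0, 40.0, 5.7
            System.out.printf("estimated/actual = %.1f%n", c[0] / c[1]);
        }
    }
}
```

So "4~5 times" holds for the non-HLL cubes (cube1 is a bit higher at ~6.5x), and the two hll15 cubes are 94x and 40x over-estimated.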
To reduce region counts, we will apply an estimation discount as follows:

    if (isMemoryHungry) {
        logger.info("Cube is memory hungry, storage size multiply 0.05");
        ret *= 0.05;
    } else {
        logger.info("Cube is not memory hungry, storage size multiply 0.25");
        ret *= 0.25;
    }
Let's see how it works.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)