[
https://issues.apache.org/jira/browse/KYLIN-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
hongbin ma updated KYLIN-1237:
------------------------------
Description:
Currently CreateHTableJob.estimateCuboidStorageSize does not take HBase
encoding and compression into consideration. From our observations on real
cubes, the estimate can be tens of times larger than the actual size.
Here are some stats (estimated size => real size):
Cube 1 (w/o HLL, holistic distinct count): 1051G => 161G
Cube 2 (w/o HLL): 2118G => 504G
Cube 3 (w/o HLL): 3507G => 791G
Cube 4 (w 2 HLL15): 188T => 2T
Cube 5 (w 2 HLL15): 28T => 0.7T
Cube 6 (w 1 HLL16): 172G => 30G
From the stats we can see that for cubes without HLL the estimate can be 4~5
times larger, while for cubes with HLL it can be more than 50 times larger!
(It's worth studying why Cube 6 is overestimated by only ~6x; it may be
related to the HLL precision level, or to the data itself.)
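For reference, the overestimation factors implied by the stats above can be computed directly. A minimal sketch (sizes copied from the table, with terabytes converted to gigabytes at 1T = 1024G; the class and method names are illustrative only):

```java
public class RatioCheck {

    // How many times larger the estimate is than the actual size.
    public static double ratio(double estimatedGb, double actualGb) {
        return estimatedGb / actualGb;
    }

    public static void main(String[] args) {
        // estimated => actual, both in GB
        double[][] cubes = {
            {1051, 161},              // Cube 1
            {2118, 504},              // Cube 2
            {3507, 791},              // Cube 3
            {188 * 1024, 2 * 1024},   // Cube 4 (188T => 2T)
            {28 * 1024, 0.7 * 1024},  // Cube 5 (28T => 0.7T)
            {172, 30},                // Cube 6
        };
        for (int i = 0; i < cubes.length; i++) {
            System.out.printf("Cube %d: %.1fx%n", i + 1, ratio(cubes[i][0], cubes[i][1]));
        }
    }
}
```

This confirms the pattern described above: the non-HLL cubes land in roughly the 4~6x range, while the HLL15 cubes are off by 40~94x.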
To reduce region counts, we will apply an estimation discount as follows:
if (isMemoryHungry) {
    logger.info("Cube is memory hungry, storage size multiply 0.05");
    ret *= 0.05;
} else {
    logger.info("Cube is not memory hungry, storage size multiply 0.25");
    ret *= 0.25;
}
Let's see how it works.
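The discount above can be sketched as a standalone helper, applied to two cubes from the stats. This is a hypothetical rewrite for illustration; the actual change lives inside CreateHTableJob.estimateCuboidStorageSize:

```java
public class EstimateDiscount {

    // Apply the discount from the description: cubes with memory-hungry
    // measures (e.g. HLL) keep 5% of the raw estimate, others keep 25%.
    public static double discount(double estimatedSize, boolean isMemoryHungry) {
        return estimatedSize * (isMemoryHungry ? 0.05 : 0.25);
    }

    public static void main(String[] args) {
        // Cube 4 (2 HLL15 measures => memory hungry): raw estimate 188T, actual 2T
        System.out.printf("Cube 4 discounted: %.1fT%n", discount(188, true));
        // Cube 2 (no HLL => not memory hungry): raw estimate 2118G, actual 504G
        System.out.printf("Cube 2 discounted: %.1fG%n", discount(2118, false));
    }
}
```

With these multipliers, Cube 4's 188T estimate becomes 9.4T (vs 2T actual) and Cube 2's 2118G becomes 529.5G (vs 504G actual), so the discounted figures land much closer to reality while still leaving headroom.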
> Revisit on cube size estimation
> -------------------------------
>
> Key: KYLIN-1237
> URL: https://issues.apache.org/jira/browse/KYLIN-1237
> Project: Kylin
> Issue Type: Improvement
> Affects Versions: v2.1, v2.0
> Reporter: hongbin ma
> Assignee: hongbin ma
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)