The issue is very likely related to https://issues.apache.org/jira/browse/KYLIN-1624. You can wait for v1.5.2, or pick the commits related to HLL (on the master branch) made by Yang yesterday.
2016-04-26 17:49 GMT+08:00 ShaoFeng Shi <[email protected]>:

> Hi Dayue,
>
> Could you please open a JIRA for this and make it configurable? As far as
> I know, Kylin now allows cube-level configurations to override
> kylin.properties, so with that you could customize the magic number at the
> cube level.
>
> Thanks;
>
> 2016-04-25 15:01 GMT+08:00 Li Yang <[email protected]>:
>
>> The magic coefficient is due to HBase compression on keys and values: the
>> final cube size is much smaller than the sum of all keys and all values.
>> That's why we multiply by the coefficient. It's purely empirical at the
>> moment, and it should vary depending on the key encoding and compression
>> applied to the HTable.
>>
>> At the minimum, we should make it configurable, I think.
>>
>> On Mon, Apr 18, 2016 at 4:38 PM, Dayue Gao <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> I made several cubing tests on 1.5 and found that most of the time was
>>> spent on the "Convert Cuboid Data to HFile" step due to a lack of
>>> reducer parallelism. It seems that the estimated cube size is too small
>>> compared to the actual size, which leads to a small number of regions
>>> (hence reducers) being created. The setup and results of the tests are:
>>>
>>> Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25,
>>> region_cut=5GB, #regions=2, actual_size=49GB
>>> Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05,
>>> region_cut=10GB, #regions=2, actual_size=144GB
>>>
>>> The "coefficient" comes from CubeStatsReader#estimateCuboidStorageSize,
>>> which looks mysterious to me. Currently the formula for cuboid size
>>> estimation is:
>>>
>>> size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
>>> where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
>>>
>>> Why do we multiply by the coefficient? And why is it five times smaller
>>> in the memory-hungry case? Could someone explain the rationale behind it?
>>>
>>> Thanks, Dayue

--
Best regards,

Shaofeng Shi
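For reference, the estimation formula quoted in the thread, and the way an undersized estimate translates into too few regions, can be sketched as below. This is an illustrative sketch, not the actual Kylin source: the method names, the MB units, and the ceil-based region count are assumptions for the example.

```java
// Illustrative sketch of the cuboid size estimation discussed above.
// Not the actual Kylin code; names and units are assumed for the example.
public class CuboidSizeSketch {

    // size(cuboid) = rows(cuboid) * row_size(cuboid) * coefficient
    // coefficient = has_memory_hungry_measures ? 0.05 : 0.25
    static double estimateCuboidSizeMB(long rows, double rowSizeBytes,
                                       boolean hasMemoryHungryMeasures) {
        double coefficient = hasMemoryHungryMeasures ? 0.05 : 0.25;
        return rows * rowSizeBytes * coefficient / (1024.0 * 1024.0);
    }

    // Regions (and hence HFile reducers) derived from the estimate:
    // at least one region, one more per region_cut of estimated data.
    static int regionCount(double estimatedSizeMB, double regionCutMB) {
        return Math.max(1, (int) Math.ceil(estimatedSizeMB / regionCutMB));
    }

    public static void main(String[] args) {
        // Cube#1 from the thread: an 8805 MB estimate with a 5 GB cut
        // yields only 2 regions, even though the actual size is 49 GB.
        System.out.println(regionCount(8805, 5 * 1024));      // prints 2
        // Had the estimate matched the 49 GB actual size:
        System.out.println(regionCount(49 * 1024, 5 * 1024)); // prints 10
    }
}
```

The five-fold gap between the 2 regions actually created and the ~10 regions the true size would warrant is exactly the reducer-parallelism shortfall Dayue observed.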
