Hi Dayue, could you please open a JIRA for this and make it configurable? As far as I know, Kylin now allows cube-level configurations to override kylin.properties, so with this you could customize the magic number at the cube level.
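For illustration, a cube-level override might look like the snippet below. The property key here is hypothetical; the real name would be settled in the JIRA:

```
# Hypothetical property key -- the actual name would be decided in the JIRA.
# Placed in the cube's "Configuration Overrides", it would take precedence
# over the global default in kylin.properties:
kylin.cube.size.estimate.coefficient=0.05
```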
Thanks!

2016-04-25 15:01 GMT+08:00 Li Yang <[email protected]>:

> The magic coefficient is due to HBase compression on keys and values; the
> final cube size is much smaller than the sum of all keys and all values.
> That's why we multiply by the coefficient. It's purely empirical at the
> moment. It should vary depending on the key encoding and compression
> applied to the HTable.
>
> At a minimum, we should make it configurable, I think.
>
> On Mon, Apr 18, 2016 at 4:38 PM, Dayue Gao <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I made several cubing tests on 1.5 and found that most of the time was
> > spent on the "Convert Cuboid Data to HFile" step due to a lack of
> > reducer parallelism. It seems that the estimated cube size is too small
> > compared to the actual size, which leads to a small number of regions
> > (and hence reducers) being created. The setup and results of the tests
> > were:
> >
> > Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25,
> > region_cut=5GB, #regions=2, actual_size=49GB
> > Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05,
> > region_cut=10GB, #regions=2, actual_size=144GB
> >
> > The "coefficient" comes from CubeStatsReader#estimateCuboidStorageSize,
> > which looks mysterious to me. Currently the formula for cuboid size
> > estimation is:
> >
> > size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
> > where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
> >
> > Why do we multiply by the coefficient? And why is it five times smaller
> > in the memory-hungry case? Could someone explain the rationale behind
> > it?
> >
> > Thanks, Dayue

-- 
Best regards,

Shaofeng Shi
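For reference, the estimation formula quoted in the thread can be sketched as a small standalone Java method. The class and method names here are illustrative, not Kylin's actual CubeStatsReader API; only the formula and the two coefficient values come from the discussion above:

```java
// Sketch of the cuboid size estimate discussed in the thread.
// Names are hypothetical; the formula is:
//   size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
// where coefficient = has_memory_hungry_measures ? 0.05 : 0.25
public class CuboidSizeEstimate {

    static double estimateCuboidStorageMB(long rows, double rowSizeBytes,
                                          boolean hasMemoryHungryMeasures) {
        // Empirical compression coefficient: five times smaller when the
        // cube has memory-hungry measures (per the thread, by experience).
        double coefficient = hasMemoryHungryMeasures ? 0.05 : 0.25;
        return rows * rowSizeBytes * coefficient / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Cube#2 from the thread (row size here is a made-up placeholder,
        // since the thread does not state it):
        double mb = estimateCuboidStorageMB(123_908_390L, 100.0, true);
        System.out.println("estimated cuboid size (MB): " + mb);
    }
}
```

With a fixed coefficient like this, any systematic mismatch between the estimate and the actual HFile size (as in the 49GB and 144GB results above) directly shrinks the region and reducer count, which is why making the coefficient configurable helps.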
