Hi everyone,
I ran several cubing tests on 1.5 and found that most of the time was spent in the "Convert Cuboid Data to HFile" step due to a lack of reducer parallelism. It seems the estimated cube size is much smaller than the actual size, which leads to a small number of regions (and hence reducers) being created. The test setups and results were:

Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25, region_cut=5GB, #regions=2, actual_size=49GB
Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05, region_cut=10GB, #regions=2, actual_size=144GB

The "coefficient" comes from CubeStatsReader#estimateCuboidStorageSize, which looks mysterious to me. Currently the formula for cuboid size estimation is

    size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient

where

    coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25

Why do we multiply by the coefficient? And why is it five times smaller in the memory-hungry case? Could someone explain the rationale behind it?

Thanks,
Dayue
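P.S. To make the parallelism impact concrete, here is a quick sketch of how the estimated size drives the region (and thus reducer) count. This is my own illustration of the arithmetic, not Kylin source code; the method names and the ceil(estimate / region_cut) rule are assumptions for the example:

```java
public class RegionCountSketch {

    // Mirrors the formula quoted above: rows x row_size x coefficient.
    static double estimateCuboidMB(long rows, double rowSizeBytes, boolean memoryHungry) {
        double coefficient = memoryHungry ? 0.05 : 0.25;
        return rows * rowSizeBytes * coefficient / (1024.0 * 1024.0);
    }

    // Assumed rule: one region per region_cut of estimated size, at least 1.
    static int regionCount(double estimatedMB, double regionCutGB) {
        return Math.max(1, (int) Math.ceil(estimatedMB / (regionCutGB * 1024.0)));
    }

    public static void main(String[] args) {
        // Cube#1 numbers above: 8805 MB estimated with a 5 GB cut -> 2 regions.
        System.out.println(regionCount(8805, 5));      // 2
        // Had the actual 49 GB been used, we would get far more reducers.
        System.out.println(regionCount(49 * 1024, 5)); // 10
    }
}
```

So with the estimate off by roughly 5x, the HFile step runs with 2 reducers where the actual data would justify about 10, which matches the slowdown I'm seeing.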
