Hi everyone,

I ran several cubing tests on 1.5 and found that most of the time was spent in the 
"Convert Cuboid Data to HFile" step due to a lack of reducer parallelism. It 
seems that the estimated cube size is much smaller than the actual size, 
which leads to a small number of regions (and hence reducers) being created. The 
setup and results of the tests are as follows:


Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25, region_cut=5GB, #regions=2, actual_size=49GB
Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05, region_cut=10GB, #regions=2, actual_size=144GB
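
For perspective, if regions were cut by the actual size instead of the estimate, the same cubes would need far more regions (my own back-of-the-envelope arithmetic from the numbers above):

  Cube#1: 49GB  / 5GB  region_cut -> ~10 regions (vs. 2 created)
  Cube#2: 144GB / 10GB region_cut -> ~15 regions (vs. 2 created)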


The "coefficient" is from CubeStatsReader#estimateCuboidStorageSize, which 
looks mysterious to me. Currently the formula for cuboid size estimation is


  size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
  where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
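
For reference, a minimal sketch of how I read that estimation (simplified, not the actual CubeStatsReader code; the real estimateCuboidStorageSize takes more inputs such as row key and measure sizes, and the method name/parameters below are just illustrative):

  // Simplified sketch, assuming the estimate is just rows x row size
  // scaled by the coefficient in question, then converted to MB.
  double estimateCuboidSizeMB(long rowCount, long rowSizeBytes, boolean hasMemoryHungryMeasures) {
      double coefficient = hasMemoryHungryMeasures ? 0.05 : 0.25;
      return rowCount * rowSizeBytes * coefficient / (1024.0 * 1024.0);
  }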


Why do we multiply by the coefficient? And why is it five times smaller in the memory 
hungry case? Could someone explain the rationale behind it?


Thanks, Dayue
