The issue is very likely related to https://issues.apache.org/jira/browse/KYLIN-1624. You can wait for v1.5.2, or pick the commits related to HLL (on the master branch) made by Yang yesterday.
2016-04-26 17:49 GMT+08:00 ShaoFeng Shi <[email protected]>:

> Hi Dayue,
>
> Could you please open a JIRA for this and make it configurable? As far as
> I know, Kylin now allows cube-level configurations to override
> kylin.properties, so with that you could customize the magic number at the
> cube level.
>
> Thanks;
>
> 2016-04-25 15:01 GMT+08:00 Li Yang <[email protected]>:
>
>> The magic coefficient is due to HBase compression on keys and values: the
>> final cube size is much smaller than the sum of all keys and all values.
>> That's why we multiply by the coefficient. It's purely empirical at the
>> moment, and it should vary depending on the key encoding and compression
>> applied to the HTable.
>>
>> At the minimum, we should make it configurable, I think.
>>
>> On Mon, Apr 18, 2016 at 4:38 PM, Dayue Gao <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> I made several cubing tests on 1.5 and found that most of the time was
>>> spent on the "Convert Cuboid Data to HFile" step due to a lack of
>>> reducer parallelism. It seems that the estimated cube size is too small
>>> compared to the actual size, which leads to a small number of regions
>>> (hence reducers) being created. The setup and results of the tests are:
>>>
>>> Cube#1: source_record=11998051, estimated_size=8805MB, coefficient=0.25,
>>> region_cut=5GB, #regions=2, actual_size=49GB
>>> Cube#2: source_record=123908390, estimated_size=4653MB, coefficient=0.05,
>>> region_cut=10GB, #regions=2, actual_size=144GB
>>>
>>> The "coefficient" comes from CubeStatsReader#estimateCuboidStorageSize,
>>> which looks mysterious to me. Currently the formula for cuboid size
>>> estimation is:
>>>
>>> size(cuboid) = rows(cuboid) x row_size(cuboid) x coefficient
>>> where coefficient = has_memory_hungry_measures(cube) ? 0.05 : 0.25
>>>
>>> Why do we multiply by the coefficient? And why is it five times smaller
>>> in the memory-hungry case? Could someone explain the rationale behind it?
>>>
>>> Thanks, Dayue

--
Best regards,

Shaofeng Shi
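For reference, the estimation formula quoted in the thread, and the way an undersized estimate translates into too few regions, can be sketched as below. This is an illustrative sketch, not the actual Kylin source: the method names, the MB units, and the ceil-based region count are assumptions for the example.

```java
// Illustrative sketch of the cuboid size estimation discussed above.
// Not the actual Kylin code; names and units are assumed for the example.
public class CuboidSizeSketch {

    // size(cuboid) = rows(cuboid) * row_size(cuboid) * coefficient
    // coefficient = has_memory_hungry_measures ? 0.05 : 0.25
    static double estimateCuboidSizeMB(long rows, double rowSizeBytes,
                                       boolean hasMemoryHungryMeasures) {
        double coefficient = hasMemoryHungryMeasures ? 0.05 : 0.25;
        return rows * rowSizeBytes * coefficient / (1024.0 * 1024.0);
    }

    // Regions (and hence HFile reducers) derived from the estimate:
    // at least one region, one more per region_cut of estimated data.
    static int regionCount(double estimatedSizeMB, double regionCutMB) {
        return Math.max(1, (int) Math.ceil(estimatedSizeMB / regionCutMB));
    }

    public static void main(String[] args) {
        // Cube#1 from the thread: an 8805 MB estimate with a 5 GB cut
        // yields only 2 regions, even though the actual size is 49 GB.
        System.out.println(regionCount(8805, 5 * 1024));      // prints 2
        // Had the estimate matched the 49 GB actual size:
        System.out.println(regionCount(49 * 1024, 5 * 1024)); // prints 10
    }
}
```

The five-fold gap between the 2 regions actually created and the ~10 regions the true size would warrant is exactly the reducer-parallelism shortfall Dayue observed.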
