Assume the average size of this column's values is 32 bytes; a cardinality of
50 million then means roughly 1.5 GB of raw data. In the 'Extract Fact Table
Distinct Columns' step, the mappers read from the intermediate table and
remove duplicate values (this is done in the Combiner). However, the job
starts more than one mapper and only one reducer, so the input to that single
reducer is still more than 1.5 GB, and in the reduce function Kylin creates a
new Set to hold all the unique values, which costs roughly another 1.5 GB on
top of that.
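To get a feel for the numbers, here is a rough back-of-envelope sketch (plain
Java for illustration, not Kylin code; the per-entry overhead figure is just an
assumption for a 64-bit JVM, not a measurement) of how much heap one reducer
needs to hold 50 million distinct 32-byte values in a Set:

public class DistinctColumnHeapEstimate {
    public static void main(String[] args) {
        long cardinality = 50_000_000L;   // distinct values in the column
        long avgValueBytes = 32L;         // average size of one value

        long rawData = cardinality * avgValueBytes;   // ~1.6 GB of raw bytes

        // Each entry also pays JVM overhead: the String object and char[]
        // headers plus a HashSet/HashMap entry and table slot. Roughly
        // 60-80 bytes per entry is a common rule of thumb (assumption).
        long perEntryOverhead = 70L;
        long heapNeeded = rawData + cardinality * perEntryOverhead;

        System.out.printf("raw data      : %.1f GB%n", rawData / 1e9);
        System.out.printf("estimated heap: %.1f GB%n", heapNeeded / 1e9);
    }
}

With JVM object and hash-table overhead on top of the raw bytes, the real heap
demand is easily 4-5 GB, which is why the reducer dies with 'GC overhead limit
exceeded' at the default heap size.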
I have encountered this problem myself and had to change the MR configuration
properties for every job. I modified these properties:
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6000M</value>
  <description>Larger heap-size for child jvms of reduces.</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8000</value>
  <description>Larger resource limit for reduces.</description>
</property>
You can check the values of these properties currently in use and increase
them.
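If you want to check them programmatically rather than digging through
mapred-site.xml, something like this (just an illustration, assuming the
cluster's Hadoop configuration files are on the classpath) will print the
effective values:

import org.apache.hadoop.mapred.JobConf;

public class ShowReduceMemory {
    public static void main(String[] args) {
        // JobConf picks up mapred-site.xml (and yarn-site.xml) from the
        // classpath, so this shows the values the MR jobs actually run with.
        JobConf conf = new JobConf();
        System.out.println("mapreduce.reduce.memory.mb = "
                + conf.get("mapreduce.reduce.memory.mb", "(not set)"));
        System.out.println("mapreduce.reduce.java.opts = "
                + conf.get("mapreduce.reduce.java.opts", "(not set)"));
    }
}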
Finally, ask yourself whether you really need all the detailed values of those
two columns. If not, you can create a view to change the source data, or
simply not use a dictionary when building the cube and instead set a fixed
length for those columns in the 'Advanced Setting' step.
Hope this is helpful to you.
2016-01-09 6:17 GMT+08:00 zhong zhang <[email protected]>:
> Hi All,
>
> There are two ultra-high-cardinality columns in our cube. Both of them are
> over 50 million in cardinality. When building the cube, it keeps giving us
> the error "GC overhead limit exceeded" for the reduce jobs at the step
> Extract Fact Table Distinct Columns.
>
> We've just updated to version 1.2.
>
> Can anyone give some ideas to solve this issue?
>
> Best regards,
> Zhong
>