[
https://issues.apache.org/jira/browse/KYLIN-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
PENG Zhengshuai updated KYLIN-4083:
-----------------------------------
Description:
In the Fact Distinct Column step, Kylin uses MR to de-duplicate column values.
If a column is a UHC (ultra high cardinality) column and the property
*kylin.engine.mr.uhc-reducer-count* is set to a value greater than *1*, the
Mapper task distributes that column's values across several reducers via
*FactDistinctColumnPartitioner*, according to the reducer id.
The reducer id is computed from a hash in
*FactDistinctColumnsReducerMapping#getReducerIdForCol()*, where *the
reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount*
When value.hashCode() is Integer.MIN_VALUE, Math.abs(value.hashCode()) also
returns Integer.MIN_VALUE, so the computed reducer id may be negative.
This may cause the FactDistinctColumn step to fail, or the UHC column value may
be routed to a reducer that does not belong to the UHC column.
For example:
If a UHC column value is "539019926", its hash code is Integer.MIN_VALUE:
"539019926".hashCode() == Integer.MIN_VALUE == -2147483648, and
Math.abs(-2147483648) returns -2147483648.
So the reducerId = beginIndex + (-2147483648) % uhcReducerCount.
If beginIndex is 8 and uhcReducerCount is 35, then (-2147483648) % 35 == -23,
and *FactDistinctColumnsReducerMapping#getReducerIdForCol()* returns
8 + (-23) = -15.
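The arithmetic in this example can be reproduced with a few lines of Java. The
following is a minimal sketch, not the Kylin source: the class and method names
are hypothetical, and only the formula is taken from the description above.

```java
// Hypothetical demo class; only the formula mirrors the one quoted in this issue.
final class NegativeReducerIdDemo {

    // reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount
    // Math.abs(int) returns Integer.MIN_VALUE unchanged, and Java's % keeps the
    // sign of the dividend, so the remainder can be negative.
    static int reducerIdWithIntAbs(String value, int reducerBeginIndex, int uhcReducerCount) {
        return reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount;
    }
}
```

Calling NegativeReducerIdDemo.reducerIdWithIntAbs("539019926", 8, 35) evaluates
to -15, matching the value reported above.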
To fix it: convert the hashCode() value from *int* to *long* before calling
Math.abs(), which avoids the Integer.MIN_VALUE overflow.
Because hashCode() returns an int, the widened long value can never be
Long.MIN_VALUE, so Math.abs(long) is safe here.
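One way the proposed fix might look is sketched below. This is an illustration
under assumptions, not the committed patch: the class and method names are
hypothetical, and only the widening to long before Math.abs() comes from the
description above.

```java
// Hypothetical sketch of the proposed fix; names are illustrative only.
final class LongAbsReducerIdSketch {

    static int reducerIdWithLongAbs(String value, int reducerBeginIndex, int uhcReducerCount) {
        // Widen to long first: |Integer.MIN_VALUE| = 2147483648 fits in a long,
        // so Math.abs(long) cannot overflow for any int hash code.
        long nonNegativeHash = Math.abs((long) value.hashCode());
        // The remainder lies in [0, uhcReducerCount), so the cast back to int is safe.
        return reducerBeginIndex + (int) (nonNegativeHash % uhcReducerCount);
    }
}
```

With beginIndex 8 and uhcReducerCount 35, the value "539019926" now maps to
reducer id 31, which lies inside the range 8..42 reserved for the UHC column.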
was:
In the Fact Distinct Column step, Kylin uses MR to de-duplicate column values.
If a column is a UHC (ultra high cardinality) column and the property
*kylin.engine.mr.uhc-reducer-count* is set to a value greater than *1*, the
Mapper task distributes that column's values across several reducers via
*FactDistinctColumnPartitioner*, according to the reducer id.
The reducer id is computed from a hash in
*FactDistinctColumnsReducerMapping#getReducerIdForCol()*, where *the
reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount*
When value.hashCode() is Integer.MIN_VALUE, Math.abs(value.hashCode()) also
returns Integer.MIN_VALUE, so the computed reducer id may be negative.
This may cause the FactDistinctColumn step to fail, or the UHC column value may
be routed to a reducer that does not belong to the UHC column.
> Fact Distinct Column step may fail or lose values when the hashCode of a UHC
> column value is Integer.MIN_VALUE
> ---------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-4083
> URL: https://issues.apache.org/jira/browse/KYLIN-4083
> Project: Kylin
> Issue Type: Bug
> Reporter: PENG Zhengshuai
> Assignee: PENG Zhengshuai
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)