[ 
https://issues.apache.org/jira/browse/KYLIN-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894975#comment-16894975
 ] 

ASF GitHub Bot commented on KYLIN-4083:
---------------------------------------

ZhengshuaiPENG commented on pull request #742: #KYLIN-4083, Fact Distinct 
Column Step may be failed or value lost whe…
URL: https://github.com/apache/kylin/pull/742
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


> Fact Distinct Column Step may fail or lose values when the hashCode of a UHC 
> column value is Integer.MIN_VALUE
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-4083
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4083
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: PENG Zhengshuai
>            Assignee: PENG Zhengshuai
>            Priority: Major
>
> In the Fact Distinct Column Step, Kylin uses MapReduce (MR) to de-duplicate 
> column values.
> If a column is a UHC (ultra high cardinality) column and the property 
> *kylin.engine.mr.uhc-reducer-count* is set to a value greater than *1*, 
> the Mapper task distributes that column's values across several reducers 
> through *FactDistinctColumnPartitioner*, according to the reducer id.
> The reducer id is computed from a hash of the value in 
> *FactDistinctColumnsReducerMapping#getReducerIdForCol()*: 
> *reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % 
> uhcReducerCount*
> When value.hashCode() is Integer.MIN_VALUE, Math.abs(value.hashCode()) also 
> returns Integer.MIN_VALUE, because +2147483648 cannot be represented as an 
> int. The computed reducer id is then smaller than reducerBeginIndex and can 
> even be negative. This may cause the Fact Distinct Column step to fail, or 
> the UHC column value may be routed to a reducer that does not belong to the 
> UHC column.
> For example:
> If a UHC column value is "539019926", its hash code is Integer.MIN_VALUE: 
> "539019926".hashCode() == Integer.MIN_VALUE == -2147483648, and 
> Math.abs(-2147483648) returns -2147483648. 
> So reducerId = beginIndex + (-2147483648) % uhcReducerCount.
> If beginIndex is 8 and uhcReducerCount is 35, (-2147483648) % 35 is -23, so 
> *FactDistinctColumnsReducerMapping#getReducerIdForCol()* returns 
> 8 + (-23) = -15.
> To fix it: convert the hashCode() value to *long* before calling Math.abs(), 
> so that the *int* overload is never applied to Integer.MIN_VALUE.
> Because hashCode() returns an int, the widened long value can never be 
> Long.MIN_VALUE, so Math.abs(long) is safe here.
> After the fix, *FactDistinctColumnsReducerMapping#getReducerIdForCol()* 
> returns 8 + 23 = 31 for the example above.
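
The arithmetic in the description can be reproduced with a small standalone sketch. This is not the Kylin source or the actual patch: the class ReducerIdSketch and the methods buggyReducerId and fixedReducerId are hypothetical names chosen for illustration, and only the formula, the sample value "539019926", beginIndex 8 and uhcReducerCount 35 come from the issue text.

```java
// Hypothetical illustration only; names are not taken from the Kylin code base.
final class ReducerIdSketch {

    // Formula as quoted in the issue: Math.abs(int) returns Integer.MIN_VALUE
    // unchanged, so the remainder (and possibly the reducer id) is negative.
    static int buggyReducerId(String value, int reducerBeginIndex, int uhcReducerCount) {
        return reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount;
    }

    // Proposed direction of the fix: widen to long first so Math.abs(long) is
    // used. An int widened to long can never be Long.MIN_VALUE, so the result
    // is always non-negative and the remainder fits back into an int.
    static int fixedReducerId(String value, int reducerBeginIndex, int uhcReducerCount) {
        long absHash = Math.abs((long) value.hashCode());
        return reducerBeginIndex + (int) (absHash % uhcReducerCount);
    }

    public static void main(String[] args) {
        String value = "539019926";
        System.out.println(value.hashCode());             // -2147483648
        System.out.println(buggyReducerId(value, 8, 35)); // -15
        System.out.println(fixedReducerId(value, 8, 35)); // 31
    }
}
```

With the widened computation the reducer id always stays in the range reducerBeginIndex .. reducerBeginIndex + uhcReducerCount - 1, which is the range the description implies is reserved for the UHC column.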



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)