[ 
https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444513#comment-16444513
 ] 

ASF GitHub Bot commented on HELIX-690:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/181


> Batch message should not share same NotificationContext object to update 
> CurrentState
> -------------------------------------------------------------------------------------
>
>                 Key: HELIX-690
>                 URL: https://issues.apache.org/jira/browse/HELIX-690
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> The current batch message handling has two bugs:
> 1. Batch messages trigger many duplicated state transition messages from the 
> controller, resulting in "state does not match" errors on the participant 
> side. This in turn creates many ERROR znodes in ZK, which adds to the 
> read/write workload on both the participant and the controller.
> 2. We also see many concurrent update exceptions:
> {noformat}
> [2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> java.util.ConcurrentModificationException
>     at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
>     at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
>     at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
>     at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
>     at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
>     at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
>     at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
>     at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
>     at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> Both errors are caused by the fact that in HelixTaskExecutor, all HelixTask 
> objects from the same batch of messages share the same changeContext 
> object. For a batch message, HelixTask creates a current state update map to 
> record current state updates, which results in a race condition in current 
> state recording. Because of this bug, it is common that a resource's current 
> state changes on the participant side but is not updated in ZK; after the 
> message is removed, the controller still thinks the state transition has not 
> finished and sends a duplicated state transition message.
>  
> The error is only triggered under high load, so it is not covered by our 
> unit / e2e tests.
> To fix the issue, we should create a deep copy of the NotificationContext 
> object for each HelixTask in HelixTaskExecutor. I tested this fix with large 
> data sets, and it worked.
>  
>  
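
The fix described in the issue (one deep-copied context per task instead of one shared context) can be illustrated with a minimal sketch. This is not the Helix code or the change made in the pull request: `TaskContext`, `deepCopy`, `runBatch` and the partition names are hypothetical stand-ins, and the only assumption carried over from the issue is that each task records its updates in a non-thread-safe map held by its context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskContextSketch {

    // Hypothetical stand-in for a per-message context; NOT the Helix NotificationContext API.
    static final class TaskContext {
        // TreeMap is not thread-safe, so one instance must not be mutated by several tasks.
        final Map<String, String> stateUpdates = new TreeMap<>();

        TaskContext deepCopy() {
            TaskContext copy = new TaskContext();
            // Keys and values are immutable Strings, so copying the entries is a deep copy here.
            copy.stateUpdates.putAll(stateUpdates);
            return copy;
        }
    }

    // Runs taskCount tasks concurrently; each task writes only to its own context copy.
    static Map<String, String> runBatch(int taskCount) {
        TaskContext template = new TaskContext();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<TaskContext>> results = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                final String partition = "partition_" + i;
                final TaskContext own = template.deepCopy(); // copy made before the task starts
                results.add(pool.submit(() -> {
                    own.stateUpdates.put(partition, "ONLINE");
                    return own;
                }));
            }
            // Merge on the calling thread only, after each task has finished.
            Map<String, String> merged = new TreeMap<>();
            for (Future<TaskContext> result : results) {
                merged.putAll(result.get().stateUpdates);
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batch execution failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> merged = runBatch(100);
        System.out.println(merged.size() + " updates recorded");
    }
}
```

Because no map is ever written by one thread while another thread reads or writes it, the iteration during the merge cannot hit a `ConcurrentModificationException`, and no update is lost. If all tasks instead wrote to `template.stateUpdates` directly, both failure modes from the issue would be possible under load.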



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
