[ https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444513#comment-16444513 ]
ASF GitHub Bot commented on HELIX-690:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/helix/pull/181

> Batch message should not share same NotificationContext object to update CurrentState
> -------------------------------------------------------------------------------------
>
>                 Key: HELIX-690
>                 URL: https://issues.apache.org/jira/browse/HELIX-690
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> Currently batch message handling has two bugs:
> 1. Batch messages trigger many duplicated state transition messages from
> the controller, which result in "state does not match" errors on the
> participant side. This in turn creates many ERROR znodes in ZK, which adds
> to the read/write workload on both participant and controller.
> 2. We also see many concurrent update exceptions:
> {noformat}
> 9909348:[2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> 9909349-java.util.ConcurrentModificationException
> 9909350-    at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
> 9909351-    at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
> 9909352-    at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
> 9909353-    at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
> 9909354-    at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
> 9909355-    at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
> 9909356-    at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
> 9909357-    at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
> 9909358-    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> Both errors stem from the fact that in HelixTaskExecutor, all HelixTask
> objects created from the same batch of messages share the same
> changeContext object. For a batch message, HelixTask creates a current
> state update map to record current state updates, so sharing the context
> results in a race condition in current state recording. Because of this
> bug, it is common that a resource's current state changes on the
> participant side but is not updated in ZK; after the message is removed,
> the controller still thinks the state transition has not finished and
> sends a duplicated state transition message.
>
> The error only occurs under high load, so it is not covered by our
> unit / e2e tests.
>
> To fix the issue, we should create deep copies of the NotificationContext
> object for each HelixTask in HelixTaskExecutor. I tried this fix using
> large data sets, and it worked.
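
The idea behind the proposed fix can be sketched in a few lines of Java. This is only an illustration, not the Helix patch: TaskContext, its stateUpdates map, and runBatch are hypothetical stand-ins for NotificationContext, the current state update map, and the task submission in HelixTaskExecutor. Each task receives its own copy of the context, so no two threads ever touch the same TreeMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskContextSketch {

    /** Hypothetical stand-in for NotificationContext (not the Helix class). */
    public static final class TaskContext {
        // Hypothetical stand-in for the per-message current state update map.
        public final Map<String, String> stateUpdates;

        public TaskContext() {
            this.stateUpdates = new TreeMap<>();
        }

        /** Copy constructor: the new context gets its own map instance. */
        public TaskContext(TaskContext other) {
            this.stateUpdates = new TreeMap<>(other.stateUpdates);
        }
    }

    /**
     * Runs one task per partition. Every task records its update in a private
     * copy of the template context instead of in the shared template itself.
     */
    public static List<TaskContext> runBatch(TaskContext template, List<String> partitions)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<TaskContext> contexts = new ArrayList<>();
            List<Future<Void>> futures = new ArrayList<>();
            for (String partition : partitions) {
                TaskContext own = new TaskContext(template); // copy per task
                contexts.add(own);
                Callable<Void> task = () -> {
                    own.stateUpdates.put(partition, "ONLINE");
                    return null;
                };
                futures.add(pool.submit(task));
            }
            for (Future<Void> future : futures) {
                future.get(); // wait for completion and surface task failures
            }
            return contexts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<TaskContext> contexts =
                runBatch(new TaskContext(), Arrays.asList("p0", "p1", "p2"));
        for (TaskContext context : contexts) {
            System.out.println(context.stateUpdates);
        }
    }
}
```

If every task were handed the template itself rather than a copy, all threads would write into one unsynchronized TreeMap while another thread might be iterating it, which is the pattern that can produce the ConcurrentModificationException shown in the stack trace.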