[ 
https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439716#comment-16439716
 ] 

ASF GitHub Bot commented on HELIX-690:
--------------------------------------

GitHub user zhan849 opened a pull request:

    https://github.com/apache/helix/pull/181

    [HELIX-690] batch message execution should not share same context

    In this PR, I added deep copy methods to NotificationContext so that,
when processing messages in batch, different threads do not share the same
notification context.

    This fixes the problem in BatchMessage processing: each thread now has
its own current state delta to work on, so current states are no longer
corrupted.

    I also modified some log messages to make them more useful for debugging.
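
The idea of giving each task its own copy of the context can be sketched as
below. This is only an illustration under stated assumptions: TaskContext,
recordState, delta and contextsForBatch are hypothetical names invented for
this sketch, not the real NotificationContext API or the code in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a per-message context (NOT Helix's NotificationContext).
final class TaskContext {
    // Read-mostly attributes; values are assumed to be immutable in this sketch.
    private final Map<String, Object> attributes = new HashMap<>();
    // Mutable per-task record of state changes: partition name -> new state.
    private final Map<String, String> currentStateDelta = new TreeMap<>();

    void setAttribute(String key, Object value) { attributes.put(key, value); }
    Object getAttribute(String key) { return attributes.get(key); }
    void recordState(String partition, String state) { currentStateDelta.put(partition, state); }
    Map<String, String> delta() { return currentStateDelta; }

    // Copies both maps so the copy never shares a mutable container with the original.
    TaskContext deepCopy() {
        TaskContext copy = new TaskContext();
        copy.attributes.putAll(attributes);
        copy.currentStateDelta.putAll(currentStateDelta);
        return copy;
    }
}

class BatchContextSketch {
    // Builds one independent context per task instead of handing out the shared one.
    static List<TaskContext> contextsForBatch(TaskContext shared, int taskCount) {
        List<TaskContext> contexts = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            contexts.add(shared.deepCopy());
        }
        return contexts;
    }

    public static void main(String[] args) {
        TaskContext shared = new TaskContext();
        shared.setAttribute("batchId", "batch-1");
        List<TaskContext> contexts = contextsForBatch(shared, 2);
        // A write through one task's context is invisible to the other task.
        contexts.get(0).recordState("partition_0", "ONLINE");
        System.out.println(contexts.get(1).delta().isEmpty());
    }
}
```

With per-task copies, each task writes only to its own delta map, so no two
threads touch the same map while the updates are later merged.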

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhan849/helix harry/batch-msg-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #181
    
----
commit bb7751b0f52aadcf04b7813fa3e99c8e266a3d0b
Author: Harry Zhang <zhan849@...>
Date:   2018-04-16T16:55:43Z

    [HELIX-690] batch message execution should not share same context

----


> Batch message should not share same NotificationContext object to update 
> CurrentState
> -------------------------------------------------------------------------------------
>
>                 Key: HELIX-690
>                 URL: https://issues.apache.org/jira/browse/HELIX-690
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Hao Zhang
>            Priority: Major
>
> Currently batch message handling has two bugs:
> 1. Batch messages trigger many duplicate state transition messages
> from the controller, resulting in "state does not match" errors on the
> participant side. This in turn creates many ERROR znodes in ZK, which
> increases the read/write workload on both participant and controller.
> 2. We also see many concurrent update exceptions:
> {noformat}
> 9909348:[2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> 9909349-java.util.ConcurrentModificationException
> 9909350- at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
> 9909351- at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
> 9909352- at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
> 9909353- at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
> 9909354- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
> 9909355- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
> 9909356- at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
> 9909357- at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
> 9909358- at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> Both errors are caused by the fact that in HelixTaskExecutor, all
> HelixTask objects from the same batch of messages share the same
> changeContext object. For a batch message, HelixTask creates a current state
> update map to record current state updates, which results in a race
> condition in current state recording. Because of this bug, it is common
> that a resource's current state changes on the participant side but is
> not updated in ZK; after the message is removed, the controller still
> thinks the state transition is not finished and sends a duplicate state
> transition message.
>  
> The error is only triggered when the load is high, so it is not
> covered by our unit / e2e tests.
> To fix the issue, we should create a deep copy of the NotificationContext
> object for each HelixTask in HelixTaskExecutor. I tried this fix using
> large data sets, and it worked.
>  
>  
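
The ConcurrentModificationException in the stack trace quoted above is raised
by TreeMap's fail-fast iterator, which throws when the map is structurally
modified while it is being iterated. The sketch below reproduces that check
deterministically in a single thread. In the reported bug the modification
presumably came from another thread sharing the same map, which is timing
dependent and therefore harder to reproduce. FailFastSketch and its method
names are hypothetical and are not part of Helix.

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.TreeMap;

class FailFastSketch {
    // Iterates the map's keys while inserting new keys into the same map.
    // Returns true if the fail-fast iterator raised ConcurrentModificationException.
    static boolean iterateWhileMutating(Map<String, String> map) {
        try {
            for (String key : map.keySet()) {
                map.put(key + "_new", "x"); // structural modification during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Iterates a private snapshot instead, so writes to the original map
    // cannot invalidate the iterator.
    static boolean iterateSnapshotWhileMutating(Map<String, String> map) {
        Map<String, String> snapshot = new TreeMap<>(map);
        try {
            for (String key : snapshot.keySet()) {
                map.put(key + "_new", "x");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    static Map<String, String> sampleMap() {
        Map<String, String> map = new TreeMap<>();
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        return map;
    }

    public static void main(String[] args) {
        System.out.println(iterateWhileMutating(sampleMap()));         // shared map
        System.out.println(iterateSnapshotWhileMutating(sampleMap())); // private copy
    }
}
```

Iterating a private copy is the same principle as giving each HelixTask its
own context: no other writer can touch the map that is being read.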



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
