[GitHub] helix pull request #183: Fix race-condition issue that could block ZkClient ...

2018-04-16 Thread lei-xia
GitHub user lei-xia opened a pull request:

https://github.com/apache/helix/pull/183

Fix race-condition issue that could block ZkClient event thread in 
CallbackHandler.

1) Replace the native batch callback handling thread with 
DedupEventCallbackProcessor, so the queue is deduplicated when an event is 
enqueued rather than when it is dequeued.

2) Shut down and clean up the processor properly when the CallbackHandler is 
reset().

3) Add a flag to indicate whether the CallbackHandler has been reset. If it 
has, do not enqueue further events into the batch callback queue.
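
For readers unfamiliar with the idea in points 1) and 3), here is a minimal sketch of a queue that deduplicates at enqueue time and rejects events after a reset. It is not the DedupEventCallbackProcessor code from this PR; the class name `DedupQueueSketch`, its methods, and the key/event types are hypothetical and chosen only for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the Helix implementation.
final class DedupQueueSketch<K, E> {
  // Insertion-ordered map: holds at most one pending event per key.
  private final Map<K, E> pending = new LinkedHashMap<>();
  // Set once the owner has been reset; later offers are rejected.
  private boolean isReset = false;

  // Deduplicates at enqueue time: a newer event replaces the pending one
  // for the same key and keeps that key's original position in the queue.
  synchronized boolean offer(K key, E event) {
    if (isReset) {
      return false;
    }
    pending.put(key, event);
    return true;
  }

  // Removes and returns the oldest pending event, or null if none is pending.
  synchronized E poll() {
    Iterator<Map.Entry<K, E>> it = pending.entrySet().iterator();
    if (!it.hasNext()) {
      return null;
    }
    E event = it.next().getValue();
    it.remove();
    return event;
  }

  // Marks the queue as reset and drops everything still pending.
  synchronized void reset() {
    isReset = true;
    pending.clear();
  }

  synchronized int size() {
    return pending.size();
  }
}
```

Because duplicates are collapsed in offer(), the number of pending entries is bounded by the number of distinct keys, regardless of how fast events arrive.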

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lei-xia/helix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/183.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #183


commit a25305eb83c3226d028e39c27a70293c2576756e
Author: Lei Xia 
Date:   2018-04-16T18:38:26Z

Fix race-condition issue that could block ZkClient event thread in 
CallbackHandler.




---


[GitHub] helix pull request #178: A few minor fixes.

2018-04-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/helix/pull/178


---


[GitHub] helix issue #179: Unique thread id for the threads that execute Tasks

2018-04-16 Thread lei-xia
Github user lei-xia commented on the issue:

https://github.com/apache/helix/pull/179
  
Could you please rebase onto the current HEAD of the master branch? Thanks.


---


[jira] [Commented] (HELIX-695) Add Helix Manager listener for new connection notification

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439730#comment-16439730
 ] 

ASF GitHub Bot commented on HELIX-695:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/182

[HELIX-695] add helix manager listener for new connection notification

This PR invokes `stateListener.onConnected()` in ZkHelixManager once the 
manager is connected, and adds related tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-manager-onconnected

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/182.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #182


commit 65e84713503437c542e545abd521c2ba6d26
Author: Harry Zhang 
Date:   2018-04-16T17:05:30Z

[HELIX-695] add helix manager listener for new connection notification




> Add Helix Manager listener for new connection notification
> --
>
> Key: HELIX-695
> URL: https://issues.apache.org/jira/browse/HELIX-695
> Project: Apache Helix
> Issue Type: Task
> Reporter: Hao Zhang
> Priority: Major
>
> Currently HelixManager does not notify state listeners when a connection is 
> established. Adding this notification is useful because HelixManager exposes 
> a method to get the ZkClient, and a new ZkClient is created whenever the 
> connection is re-established. Users who obtained the client through that 
> method should be notified so that they can refresh their reference.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] helix pull request #182: [HELIX-695] add helix manager listener for new conn...

2018-04-16 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/182

[HELIX-695] add helix manager listener for new connection notification

This PR invokes `stateListener.onConnected()` in ZkHelixManager once the 
manager is connected, and adds related tests.
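
To illustrate why such a callback helps, here is a minimal sketch in which a caller caches a client reference and refreshes it whenever the manager reports a new connection. Everything in it (`SketchManager`, `SketchClient`, `ConnectionListenerSketch`, `CachingUser`) is a hypothetical stand-in and not the Helix API or the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface for this sketch only.
interface ConnectionListenerSketch {
  void onConnected(SketchManager manager);
}

// Stand-in for a client object that is recreated on every (re)connection.
final class SketchClient {
  final int generation;

  SketchClient(int generation) {
    this.generation = generation;
  }
}

// Stand-in for a manager that owns the client and notifies listeners.
final class SketchManager {
  private final List<ConnectionListenerSketch> listeners = new ArrayList<>();
  private SketchClient client;
  private int generation = 0;

  void addListener(ConnectionListenerSketch listener) {
    listeners.add(listener);
  }

  SketchClient getClient() {
    return client;
  }

  // Creates a fresh client, then tells every listener about the new connection.
  void connect() {
    client = new SketchClient(++generation);
    for (ConnectionListenerSketch listener : listeners) {
      listener.onConnected(this);
    }
  }
}

// A user that keeps its own reference to the client and refreshes it on connect.
final class CachingUser implements ConnectionListenerSketch {
  private SketchClient cached;

  @Override
  public void onConnected(SketchManager manager) {
    cached = manager.getClient();
  }

  SketchClient cached() {
    return cached;
  }
}
```

Without the notification, a `CachingUser` that fetched the client once would keep using the client from the previous connection after a reconnect.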

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-manager-onconnected

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/182.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #182


commit 65e84713503437c542e545abd521c2ba6d26
Author: Harry Zhang 
Date:   2018-04-16T17:05:30Z

[HELIX-695] add helix manager listener for new connection notification




---


[jira] [Created] (HELIX-695) Add Helix Manager listener for new connection notification

2018-04-16 Thread Hao Zhang (JIRA)
Hao Zhang created HELIX-695:
---

 Summary: Add Helix Manager listener for new connection notification
 Key: HELIX-695
 URL: https://issues.apache.org/jira/browse/HELIX-695
 Project: Apache Helix
 Issue Type: Task
 Reporter: Hao Zhang


Currently HelixManager does not notify state listeners when a connection is 
established. Adding this notification is useful because HelixManager exposes 
a method to get the ZkClient, and a new ZkClient is created whenever the 
connection is re-established. Users who obtained the client through that 
method should be notified so that they can refresh their reference.





[jira] [Commented] (HELIX-690) Batch message should not share same NotificationContext object to update CurrentState

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439716#comment-16439716
 ] 

ASF GitHub Bot commented on HELIX-690:
--

GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/181

[HELIX-690] batch message execution should not share same context

In this PR, I added deep copy methods to NotificationContext so that, when 
processing messages in batch, different threads do not share the same 
notification context.

With this change, each thread processing a BatchMessage works on its own 
current state delta, so current states are no longer corrupted.
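
As a rough illustration of the idea, the sketch below gives each task its own copy of a context object instead of a shared reference. `ContextSketch` and its methods are hypothetical and far simpler than NotificationContext; the partition names and states are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a context that carries a mutable state delta.
final class ContextSketch {
  private final Map<String, String> stateDelta;

  ContextSketch() {
    this.stateDelta = new HashMap<>();
  }

  // Copy constructor: the new context gets its own map. Keys and values are
  // immutable Strings, so copying the map is enough to decouple the copies.
  ContextSketch(ContextSketch other) {
    this.stateDelta = new HashMap<>(other.stateDelta);
  }

  void record(String partition, String state) {
    stateDelta.put(partition, state);
  }

  Map<String, String> delta() {
    return stateDelta;
  }
}

final class PerTaskContextDemo {
  public static void main(String[] args) {
    ContextSketch base = new ContextSketch();
    base.record("p0", "OFFLINE");

    // One copy per task, rather than handing the same object to both tasks.
    ContextSketch forTaskA = new ContextSketch(base);
    ContextSketch forTaskB = new ContextSketch(base);

    forTaskA.record("p1", "ONLINE");
    forTaskB.record("p2", "ONLINE");

    System.out.println(forTaskA.delta().containsKey("p2")); // false
    System.out.println(base.delta().size());                // 1
  }
}
```

Since no two tasks touch the same map, one task can no longer modify a map while another task is iterating over it.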

Also modified some log messages to make them more useful when debugging.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/batch-msg-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #181


commit bb7751b0f52aadcf04b7813fa3e99c8e266a3d0b
Author: Harry Zhang 
Date:   2018-04-16T16:55:43Z

[HELIX-690] batch message execution should not share same context




> Batch message should not share same NotificationContext object to update 
> CurrentState
> -
>
> Key: HELIX-690
> URL: https://issues.apache.org/jira/browse/HELIX-690
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> Currently batch message handling has two bugs:
> 1. A batch message triggers many duplicate state transition messages from 
> the controller, resulting in "state does not match" errors on the 
> participant side. This in turn creates many ERROR znodes in ZK, which adds 
> to the read/write workload on both the participant and the controller.
> 2. We also see many concurrent update exceptions:
> {noformat}
> 9909348:[2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> 9909349-java.util.ConcurrentModificationException
> 9909350- at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
> 9909351- at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
> 9909352- at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
> 9909353- at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
> 9909354- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
> 9909355- at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
> 9909356- at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
> 9909357- at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
> 9909358- at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> Both errors above result from the fact that, in HelixTaskExecutor, all 
> HelixTask objects from the same batch of messages share the same 
> changeContext object. For a batch message, HelixTask creates a current state 
> update map to record current state updates, which leads to a race condition 
> when recording current state. Because of this bug, it is common for a 
> resource's current state to change on the participant side without being 
> updated in ZK; after the message is removed, the controller still thinks the 
> state transition has not finished and sends a duplicate state transition 
> message.
> 
> The error is only triggered under high load, so it is not covered by our 
> unit / e2e tests.
> To fix the issue, we should create a deep copy of the NotificationContext 
> object for each HelixTask in HelixTaskExecutor. I tried this fix using large 
> data sets, and it worked.
>  
>  


