Hi Jiajun, Kishore and others,

Thanks for looking into this. We used 0.8.1 in that setup and now we
upgraded it to 0.8.2. However this was a frequently occurring issue causing
controller to manually kill and restart and we expect it to happen again if
it isn't fixed in 0.8.2. Main reason for that is we are relying on a single
Zookeper node across a fairly unreliable network for our staging
environment. We will let you know if we see that issue again. In the
meantime, I will try to reproduce it in 0.8.2 in my local helix deployment.

Thanks
Dimuthu

On Tue, Apr 2, 2019 at 1:58 AM ericwang1985 <ericwang1...@gmail.com> wrote:

> Could you please confirm the Helix version that is used, Dimuthu?
> The thing is that we have fixed several potential ZkHelixManager
> concurrency issues in 0.8.2. Basically, that was a race condition in which
> the disconnect method could get a disconnected non-null zkclient. In this
> case, reset handler will never finish.
>
> Please let us know if you are already using 0.8.2 or a later version. That
> probably means we have a new bug to fix.
>
> Cheers,
> -Jiajun
>
> On Apr 1, 2019, at 13:15, kishore g <g.kish...@gmail.com> wrote:
>
> This is a good catch. @Wang Jiajun <ericwang1...@gmail.com> the stack
> trace is good enough to fix this right. We just have to look at all the
> paths we can get into this method and make sure resetHandler is thread safe
> and validates the state of the zkConnection and handlers.
>
> On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun <ericwang1...@gmail.com>
> wrote:
>
>> Hi Dimuthu,
>>
>> Did you stop the controller when the connection is flapping or when it is
>> normal?
>> Could you please list all the steps that you have done in order?
>>
>> Best Regards,
>> Jiajun
>>
>>
>> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <
>> dimuthu.upeks...@gmail.com>
>> wrote:
>>
>> > Hi Folks,
>> >
>> > In helix controller, we have seen below log line and by looking at the
>> > code, I understood that it is due to ZkHelixManager is failing to
>> connect
>> > to zookeeper for 5 times. So I tried to stop the controller and in the
>> stop
>> > logic, we have a call to ZkHelixManager.disconnect() method and it
>> hangs. I
>> > got a thread dump and you can see where it is waiting. Can you please
>> > advice as better approach to solve this?
>> >
>> > I noticed that ZkHelixManager disconnects [1] it self when a flapping is
>> > detected. Is calling disconnect() twice the reason for that?
>> >
>> > 2019-03-29 15:19:56,832 [
>> > ZkClient-EventThread-14-api.staging.scigap.org:2181
>> <http://zkclient-eventthread-14-api.staging.scigap.org:2181/>]
>> > ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
>> > flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
>> > 300000ms.
>> >
>> > Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
>> > nativeId (decimal):25585 - state:BLOCKED
>> > stackTrace:
>> > java.lang.Thread.State: BLOCKED (on object monitor)
>> > at
>> >
>> >
>> org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
>> > - waiting to lock <0x00000006c7e08110> (a
>> > org.apache.helix.manager.zk.ZKHelixManager)
>> > at
>> >
>> >
>> org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
>> > at
>> >
>> >
>> org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
>> > at
>> >
>> >
>> org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
>> > Source)
>> > at java.lang.Thread.run(Thread.java:748)
>> > Locked ownable synchronizers:
>> > - None
>> >
>> > [1]
>> >
>> >
>> https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
>> > Thanks
>> > Dimuthu
>> >
>>
>
>

Reply via email to