[ 
https://issues.apache.org/jira/browse/IGNITE-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534641#comment-16534641
 ] 

Sergey Chugunov commented on IGNITE-8131:
-----------------------------------------

[~garus.d.g],

I reviewed the change and it looks somewhat reasonable to me; the tests look 
fine as well. But I still have a feeling that we are masking the root cause of 
the problem rather than fixing it (most likely it is some kind of race, since 
introducing a delay helps).

What makes me think so (again, from analysis of the attached logs) is that in 
the failure example I don't see even a report of the disconnected event, as if 
the client was never able to detect that it had disconnected from the topology.
Your analysis doesn't explain the lack of a disconnected event; it talks only 
about the reconnect process.
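
To make that check repeatable, here is a minimal sketch that lists the client 
state transitions found in a log. The class and method names are hypothetical 
(this is not Ignite code), and it assumes only the state-change message format 
visible in the attached logs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Ignite: lists ZooKeeper client state transitions in a log. */
class ZkStateLogScan {
    // Assumed message format, taken from the attached logs.
    private static final Pattern STATE_CHANGE = Pattern.compile(
        "ZooKeeper client state changed \\[prevState=(\\w+), newState=(\\w+)\\]");

    /** Returns "prev->new" for every state-change message in the given log text. */
    static List<String> transitions(String log) {
        List<String> res = new ArrayList<>();
        Matcher m = STATE_CHANGE.matcher(log);

        while (m.find())
            res.add(m.group(1) + "->" + m.group(2));

        return res;
    }

    /** Returns true if the log reports a transition into the Disconnected state. */
    static boolean sawDisconnected(String log) {
        for (String t : transitions(log)) {
            if (t.endsWith("->Disconnected"))
                return true;
        }

        return false;
    }
}
```

Running sawDisconnected over the contents of each attached log would show 
whether the failure log ever reports the transition.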

Could you please explain the sequence of events, as you understand it, in as 
much detail as possible? Maybe even with references to the code.

I ask because I see in the logs that in the successful scenario the client 
detects the connection loss almost immediately and switches its state to 
Disconnected:
{noformat}
[2018-06-09 20:12:35,312][INFO ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient] ZooKeeper client state changed [prevState=Connected, newState=Disconnected]
{noformat}
In the failure scenario the client does something different at what is 
probably a similar moment in time:
{noformat}
[2018-06-09 20:12:45,591][WARN ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient] Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052, state=Connected]
[2018-06-09 20:12:45,591][WARN ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient] ZooKeeper operation failed, will retry [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052, retryTimeout=2000, connLossTimeout=2000, path=/apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052, remainingWaitTime=2000]
{noformat}
It seems to me that in the failure scenario the client receives ConnectionLoss 
while executing code that is not ready for this exception and handles it 
incorrectly.
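
To illustrate what I mean by code that is "ready" for the exception, here is a 
rough sketch, not the actual ZookeeperClient implementation. All names are 
hypothetical, and ConnectionLossException below is a local stand-in for 
KeeperException.ConnectionLossException. The assumption is that every 
operation retries on connection loss until a timeout (like connLossTimeout in 
the log above) elapses, and that exhausting the timeout always goes through 
the same path that signals the disconnect.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch, not Ignite code: retry on connection loss, then signal a disconnect. */
class ConnLossRetrySketch {
    /** Local stand-in for KeeperException.ConnectionLossException. */
    static class ConnectionLossException extends Exception { }

    /** Thrown to the caller once the connection-loss timeout is exhausted. */
    static class ClientDisconnectedException extends Exception { }

    /**
     * Runs op, retrying on connection loss every retryDelayMs milliseconds.
     * Once connLossTimeoutMs has elapsed, runs the disconnect callback
     * and stops retrying.
     */
    static <T> T callWithRetry(Callable<T> op, long connLossTimeoutMs, long retryDelayMs,
        Runnable onDisconnected) throws Exception {
        long deadline = System.nanoTime() + connLossTimeoutMs * 1_000_000L;

        while (true) {
            try {
                return op.call();
            }
            catch (ConnectionLossException e) {
                // The key point: timeout exhaustion must always reach the disconnect path.
                if (System.nanoTime() - deadline >= 0) {
                    onDisconnected.run();

                    throw new ClientDisconnectedException();
                }

                Thread.sleep(retryDelayMs);
            }
        }
    }
}
```

If some operation in the real client catches ConnectionLoss outside such a 
common path, it could keep retrying without the disconnected event ever being 
generated, which would match what the failure log shows.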

Another possibility is that on connection loss the client cannot do the 
necessary cleanup in ZooKeeper, and when it establishes a new connection to ZK 
it cannot figure out that it has to generate a disconnected event and make a 
reconnect attempt.

Thanks.

> ZookeeperDiscoverySpiTest#testClientReconnectSessionExpire* tests fail on TC
> ----------------------------------------------------------------------------
>
>                 Key: IGNITE-8131
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8131
>             Project: Ignite
>          Issue Type: Bug
>          Components: zookeeper
>            Reporter: Sergey Chugunov
>            Assignee: Denis Garus
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.7
>
>         Attachments: ZK_client_reconnect_failure.log, 
> ZK_client_reconnect_success.log
>
>
> Two tests always fail on TC with the assertion
> {noformat}
> junit.framework.AssertionFailedError: Failed to wait for disconnect/reconnect event.
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.waitReconnectEvent(ZookeeperDiscoverySpiTest.java:4221)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.reconnectClientNodes(ZookeeperDiscoverySpiTest.java:4183)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.clientReconnectSessionExpire(ZookeeperDiscoverySpiTest.java:2231)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.testClientReconnectSessionExpire1_1(ZookeeperDiscoverySpiTest.java:2206)
> {noformat}
> from the client disconnect/reconnect events check. Obviously the client 
> doesn't generate these events as it is supposed to.
> (TC runs can be found 
> [here|https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_IgniteZooKeeperDiscovery&branch_IgniteTests24Java8=pull%2F3730%2Fhead&tab=buildTypeStatusDiv]).
> It is possible to reproduce the test failure locally as well, but with low 
> probability: one failure per 50 or even 300 successful executions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
