[
https://issues.apache.org/jira/browse/IGNITE-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790530#comment-16790530
]
Andrew Mashenkov commented on IGNITE-11460:
-------------------------------------------
[~NSAmelchev], got it.
It seems the root of the issue is that we process NODE_FAILED events after
CLIENT_DISCONNECT happens.
To resolve this, we should ignore all topology change events between
onDisconnected() and the next onLocalJoin(), which is what your fix does.
I've found that the kernalContext.clientDisconnected flag is set to 'true' in
onDisconnected() and set back to 'false' in onLocalJoin().
I think we can use this flag and skip all topology change events in the
onDiscovery() method via a simple check: "if (ctx.clientDisconnected) return".
If any reordering between those events is possible (e.g. due to events being
processed from different threads), then it looks like a bug in discovery.
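
To illustrate the idea, here is a minimal, self-contained sketch of such a guard.
It is not the actual Ignite code: the class CoordinatorTracker, its fields, and the
use of a plain long as the topology version are hypothetical stand-ins, and only
the clientDisconnected flag and the three callback names come from the discussion
above.

```java
/**
 * Hypothetical model of the proposed guard (not Ignite's MvccProcessorImpl).
 * Topology versions are simplified to plain longs for illustration.
 */
class CoordinatorTracker {
    /** Sentinel meaning "no coordinator assigned" (assumption of this sketch). */
    static final long NO_COORDINATOR = -1L;

    /** Mirrors the clientDisconnected flag mentioned in the comment. */
    private boolean clientDisconnected;

    /** Topology version of the currently known coordinator. */
    private long curCrdTopVer = NO_COORDINATOR;

    /** Client lost connection: raise the flag and reset the coordinator. */
    synchronized void onDisconnected() {
        clientDisconnected = true;
        curCrdTopVer = NO_COORDINATOR;
    }

    /** Client (re)joined: clear the flag and accept the new coordinator. */
    synchronized void onLocalJoin(long topVer) {
        clientDisconnected = false;
        curCrdTopVer = topVer;
    }

    /** Topology change event (e.g. NODE_FAILED): ignored while disconnected. */
    synchronized void onDiscovery(long topVer) {
        if (clientDisconnected)
            return; // Stale event delivered after CLIENT_DISCONNECT: skip it.

        curCrdTopVer = topVer;
    }

    synchronized long coordinatorTopologyVersion() {
        return curCrdTopVer;
    }
}
```

With this guard, a delayed NODE_FAILED event that arrives between onDisconnected()
and the next onLocalJoin() can no longer override the coordinator state reset on
disconnect. The methods are synchronized here only to keep the sketch safe when
called from two threads; the real code may rely on different guarantees.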
> MVCC: Possible race on coordinator changing on client reconnection.
> -------------------------------------------------------------------
>
> Key: IGNITE-11460
> URL: https://issues.apache.org/jira/browse/IGNITE-11460
> Project: Ignite
> Issue Type: Bug
> Reporter: Amelchev Nikita
> Assignee: Amelchev Nikita
> Priority: Major
> Labels: MakeTeamcityGreenAgain
> Fix For: 2.8
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I found that the wrong coordinator can be set in case of client reconnect:
> {noformat}
> assert newCrd.topologyVersion().compareTo(curCrd.topologyVersion()) > 0;
> java.lang.AssertionError
> at org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl.onCoordinatorChanged(MvccProcessorImpl.java:541)
> at org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl.onLocalJoin(MvccProcessorImpl.java:416)
> at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:851)
> at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.lambda$onDiscovery$0(GridDiscoveryManager.java:601)
> at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body0(GridDiscoveryManager.java:2681)
> at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body(GridDiscoveryManager.java:2719)
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I have attached a reproducer in the PR.
> The main reason is that the coordinator can be changed from the discovery event
> thread after the client has already disconnected (the disconnection is processed
> in the notifier thread, which changes the coordinator in the onDisconnected method).
> The coordinator can be changed in two places:
> 1. notifier disco thread: onDisconnected method
> 2. event disco thread: onDiscovery listener.
> Events can be processed with some delay and override the coordinator that was
> set in the notifier thread.