[
https://issues.apache.org/jira/browse/IGNITE-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578496#comment-17578496
]
Ignite TC Bot commented on IGNITE-17507:
----------------------------------------
{panel:title=Branch: [pull/10187/head] Base: [master] : No blockers
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/10187/head] Base: [master] : New Tests
(1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#00008b}Cache 5{color} [[tests
1|https://ci.ignite.apache.org/viewLog.html?buildId=6723321]]
* {color:#013220}IgniteCacheTestSuite5:
CacheLateAffinityAssignmentTest.testDelayAssignmentAffinityChangedUnexpectedPME
- PASSED{color}
{panel}
[TeamCity *--> Run :: All*
Results|https://ci.ignite.apache.org/viewLog.html?buildId=6723417&buildTypeId=IgniteTests24Java8_RunAll]
> Failed to wait for partition map exchange on some clients
> ---------------------------------------------------------
>
> Key: IGNITE-17507
> URL: https://issues.apache.org/jira/browse/IGNITE-17507
> Project: Ignite
> Issue Type: Bug
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We have a scenario with several client and server nodes in which client
> nodes can get stuck on PME after start:
> * Start some server nodes
> * Trigger rebalance
> * Start some client and server nodes
> * Some of the client nodes get stuck with _Failed to wait for partition map
> exchange [topVer=AffinityTopologyVersion…_
> A detailed investigation of the logs showed that the root cause of the stuck
> PME on the client is a race between a new client node joining and that client
> receiving a stale _CacheAffinityChangeMessage_. The message triggers a PME on
> the client, but the other, older nodes skip the same
> _CacheAffinityChangeMessage_ because of an optimization.
> The optimization is in the method
> _CacheAffinitySharedManager#onDiscoveryEvent_, where we save
> _lastAffVer = topVer_ for old nodes. Because of the race, _lastAffVer_ on the
> problem client node is still null when we reach
> _CacheAffinitySharedManager#onCustomEvent_, so we schedule an invalid PME in
> _msg.exchangeNeeded(exchangeNeeded)_, while the other nodes skip this PME.
> A possible fix is to make the _CacheAffinityChangeMessage_ mutable (a mutable
> discovery custom message). This allows the message to be modified before it
> is sent across the ring. With this approach, client nodes do not have to
> decide whether to apply or skip the message; the required flag is
> transferred from a server node. With Zookeeper Discovery, there is no general
> ability to mutate discovery messages. However, it is possible to mutate the
> message on the coordinator node (this requires adding the _stopProcess_ flag
> to _DiscoveryCustomMessage_, which was removed by IGNITE-12400). This is
> sufficient for our case.
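
The race and the proposed fix described above can be illustrated with a small toy model. This is only a sketch and not Ignite code: the classes AffinityChangeMsg and Node, the methods localDecisions and coordinatorDecisions, and the staleness check (a node schedules an exchange when its lastAffVer is null or equal to the message version) are all assumptions made for illustration. It shows why a freshly joined client with lastAffVer == null can disagree with older nodes when each node decides locally, and why the disagreement disappears once the decision is stamped into the message by a single node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the race. None of these types are Ignite classes. */
public class StaleAffinityMessageSketch {
    /** Hypothetical stand-in for CacheAffinityChangeMessage. */
    static final class AffinityChangeMsg {
        final long topVer;
        /** Decision stamped by the coordinator; null until it is set. */
        Boolean exchangeNeeded;

        AffinityChangeMsg(long topVer) { this.topVer = topVer; }
    }

    /** Hypothetical node that remembers the last affinity version it saw. */
    static final class Node {
        final String name;
        /** Null on a client that has just joined (the problem case). */
        final Long lastAffVer;

        Node(String name, Long lastAffVer) {
            this.name = name;
            this.lastAffVer = lastAffVer;
        }

        /** Assumed local check: a node without a recorded version cannot detect staleness. */
        boolean decideLocally(AffinityChangeMsg msg) {
            return lastAffVer == null || lastAffVer.equals(msg.topVer);
        }
    }

    /** Problematic flow: every node decides on its own. */
    static List<Boolean> localDecisions(List<Node> nodes, AffinityChangeMsg msg) {
        List<Boolean> res = new ArrayList<>();
        for (Node n : nodes)
            res.add(n.decideLocally(msg));
        return res;
    }

    /** Proposed flow: the coordinator mutates the message, everyone follows the flag. */
    static List<Boolean> coordinatorDecisions(Node crd, List<Node> nodes, AffinityChangeMsg msg) {
        msg.exchangeNeeded = crd.decideLocally(msg);
        List<Boolean> res = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++)
            res.add(msg.exchangeNeeded);
        return res;
    }

    public static void main(String[] args) {
        Node srv1 = new Node("server-1", 5L);
        Node srv2 = new Node("server-2", 5L);
        Node client = new Node("client-new", null);
        List<Node> nodes = Arrays.asList(srv1, srv2, client);

        // The message refers to version 4, but the old nodes are already at version 5.
        AffinityChangeMsg stale = new AffinityChangeMsg(4L);

        System.out.println("local:       " + localDecisions(nodes, stale));             // [false, false, true]
        System.out.println("coordinator: " + coordinatorDecisions(srv1, nodes, stale)); // [false, false, false]
    }
}
```

In the local flow only the new client schedules an exchange, which nobody else joins. In the coordinator flow all nodes agree, because the flag travels with the message instead of being recomputed on each node.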