[jira] [Commented] (IGNITE-17507) Failed to wait for partition map exchange on some clients

Ignite TC Bot (Jira) Thu, 11 Aug 2022 07:12:07 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578496#comment-17578496
 ]


Ignite TC Bot commented on IGNITE-17507:
----------------------------------------

{panel:title=Branch: [pull/10187/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/10187/head] Base: [master] : New Tests 
(1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#00008b}Cache 5{color} [[tests 
1|https://ci.ignite.apache.org/viewLog.html?buildId=6723321]]
* {color:#013220}IgniteCacheTestSuite5: 
CacheLateAffinityAssignmentTest.testDelayAssignmentAffinityChangedUnexpectedPME 
- PASSED{color}

{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=6723417&buildTypeId=IgniteTests24Java8_RunAll]

> Failed to wait for partition map exchange on some clients
> ---------------------------------------------------------
>
>                 Key: IGNITE-17507
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17507
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vyacheslav Koptilin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have a scenario with several client and server nodes in which some 
> clients can get stuck on PME after start:
> * Start some server nodes
> * Trigger rebalance
> * Start some client and server nodes
> * Some of the client nodes get stuck with _Failed to wait for partition map 
> exchange [topVer=AffinityTopologyVersion…_
> A deep investigation of the logs showed that the root cause of the stuck PME 
> on the client is a race between a new client node joining and that client 
> receiving a stale _CacheAffinityChangeMessage_. The message triggers a PME on 
> the new client, but the older nodes skip it when they receive it because of 
> an optimization.
> The optimization can be found in the method 
> _CacheAffinitySharedManager#onDiscoveryEvent_: we save _lastAffVer = topVer_ 
> for old nodes. Because of the race, however, _lastAffVer_ on the problem 
> client node is still null when we reach 
> _CacheAffinitySharedManager#onCustomEvent_, so we schedule an invalid PME via 
> _msg.exchangeNeeded(exchangeNeeded)_, while the other nodes skip this PME.
> A possible fix is to make the _CacheAffinityChangeMessage_ mutable (a mutable 
> discovery custom message). This allows the message to be modified before it 
> is sent across the ring. With this approach, client nodes do not need to 
> decide whether to apply or skip the message: the required flag is 
> transferred from a server node. With Zookeeper Discovery, there is no ability 
> to mutate discovery messages. However, it is possible to mutate the message 
> on the coordinator node (this requires re-adding the _stopProcess_ flag, 
> removed by IGNITE-12400, to _DiscoveryCustomMessage_). This is sufficient for 
> our case.
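
The disagreement described in the issue can be illustrated with a toy model. 
This is only a sketch under stated assumptions: none of the types below are 
Ignite classes, the names _Msg_, _Node_, _needsExchange_, _decideLocally_ and 
_decideOnCoordinator_ are hypothetical, and the local rule (exchange if 
_lastAffVer_ is null or equal to the message version) is an assumed 
simplification of the real check. It contrasts every node deciding on its own 
with one node stamping the decision onto the message for everyone else.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the race; none of these types are Ignite classes. */
public class AffinityChangeSketch {
    /** Hypothetical stand-in for an affinity change message. */
    static final class Msg {
        final long topVer;       // topology version the message refers to
        Boolean exchangeNeeded;  // null until a node stamps a decision on the message

        Msg(long topVer) { this.topVer = topVer; }
    }

    /** Hypothetical node; lastAffVer == null models a client that has just joined. */
    static final class Node {
        final String name;
        final Long lastAffVer;

        Node(String name, Long lastAffVer) {
            this.name = name;
            this.lastAffVer = lastAffVer;
        }

        /** Assumed local rule: exchange unless a different version was already recorded. */
        boolean needsExchange(Msg msg) {
            return lastAffVer == null || lastAffVer == msg.topVer;
        }
    }

    /** Every node decides on its own, so nodes may disagree about the exchange. */
    static List<Boolean> decideLocally(List<Node> nodes, Msg msg) {
        List<Boolean> res = new ArrayList<>();
        for (Node n : nodes)
            res.add(n.needsExchange(msg));
        return res;
    }

    /** The first node (standing in for a server) decides once; the rest reuse the flag. */
    static List<Boolean> decideOnCoordinator(List<Node> nodes, Msg msg) {
        List<Boolean> res = new ArrayList<>();
        for (Node n : nodes) {
            if (msg.exchangeNeeded == null)
                msg.exchangeNeeded = n.needsExchange(msg); // mutate the message once
            res.add(msg.exchangeNeeded);
        }
        return res;
    }

    /** Two old nodes that already recorded version 5, plus a freshly joined client. */
    static List<Node> scenario() {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("server-1", 5L));
        nodes.add(new Node("client-old", 5L));
        nodes.add(new Node("client-new", null));
        return nodes;
    }

    public static void main(String[] args) {
        // The message refers to version 4, which is stale for the old nodes.
        System.out.println("local:       " + decideLocally(scenario(), new Msg(4L)));
        System.out.println("coordinator: " + decideOnCoordinator(scenario(), new Msg(4L)));
    }
}
```

In this model the local variant leaves the new client as the only node that 
schedules an exchange, which corresponds to the hang in the report, while the 
stamped flag makes all three nodes agree to skip the stale message.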



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
