[ 
https://issues.apache.org/jira/browse/IGNITE-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Ozerov updated IGNITE-5056:
------------------------------------
    Fix Version/s:     (was: 2.1)
                   2.2

> Implement communication backpressure control
> --------------------------------------------
>
>                 Key: IGNITE-5056
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5056
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Yakov Zhdanov
>            Assignee: Yakov Zhdanov
>            Priority: Critical
>             Fix For: 2.2
>
>
> Problem
> Currently backpressure control relies on semaphore on sending side that 
> ensures that sending queue cannot be  overflown and a special counter on 
> receiving side that stops reading from the socket when unprocessed message 
> count outgrows limit config parameter.
> In some scenarios it may lead to a distributed deadlock. E.g. we send many 
> async jobs to remote nodes which in turn do sync cache operations. If task 
> master node is a backup or primary for some cache updates and has already 
> scheduled too many job requests for send it will not be able to respond to 
> cache requests thus remote jobs would never complete.
> Solution
> Reading from socket should never stop
> Design notes
> * add IgniteConfiguration.maxAsyncRequests and propagate it via node 
> attributes to all nodes of the cluster. All nodes may have different value 
> (however this is unlikely).
> * add a flag to GridIoMessage.async. If flag is false then sender node 
> assumed to synchronously wait for response and does not wait otherwise.
> * all sent async messages should be tracked on sender node on per-receiver 
> basis.
> * all received async messages should be tracked on receiver nodes
> * nodes should add flag to communication acks on whether they can more async 
> messages or not 
> * sender should never exceed IgniteConfiguration.maxAsyncRequests async 
> requests per node
> * if IgniteConfiguration.maxAsyncRequests is exceeded or node sets flag in 
> communication ack then all async messages become sync
> * above means:
> ** next compute job from the task is sent to the node only after response for 
> some previous comes
> ** next dht update request (for primary sync or full async) is sent, but node 
> doesn't send response to near node unless it has not received response for 
> former operation from remote backup or for this operation
> ** next cache operation becomes sync - we force user code to wait on 
> operation future.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to