[ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9764:
--------------------------------
    Attachment: image-2021-11-22-11-52-23-586.png

> Request-Response Messaging Should Time Out
> ------------------------------------------
>
>                 Key: GEODE-9764
>                 URL: https://issues.apache.org/jira/browse/GEODE-9764
>             Project: Geode
>          Issue Type: Improvement
>          Components: messaging
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>         Attachments: image-2021-11-22-11-52-23-586.png
>
>
> There is a weakness in the P2P/DirectChannel messaging architecture, in that 
> it never gives up on a request (in a request-response scenario). As a result 
> a bug (software fault) anywhere from the point where the requesting thread 
> hands off the {{DistributionMessage}} e.g. to 
> {{ClusterDistributionManager.putOutgoing(DistributionMessage)}}, to the 
> point where that request is ultimately fulfilled on a (one) receiver, can 
> result in a hang (of some task on the send side, which is waiting for a 
> response).
> It's actually a little worse than that, because any code in the return (response) 
> path can also cause disruption of the (response) flow, thereby leaving the 
> requesting task hanging.
> If the code in the request path (primarily in P2P messaging) and the code in 
> the response path (P2P messaging and TBD higher-level code) were perfect this 
> might not be a problem. But there is a fair amount of code there and we have 
> some evidence that it is currently not perfect, nor do we expect it to become 
> perfect and stay that way. That being the case it seems prudent to institute 
> response timeouts so that bugs of this sort (which disrupt request-response 
> message flow) don't result in hangs.
> It's TBD if we want to go a step further and institute retries. The latter 
> would entail introducing duplicate-suppression (conflation) in P2P messaging. 
> We might also add exponential backoff (open-loop) or back-pressure 
> (closed-loop) to prevent a flood of retries when the system is at or near the 
> point of thrashing.
> But even without retries, a configurable timeout might have good ROI as a 
> first step. This would entail:
>  * adding a configuration parameter to specify the timeout value
>  * changing ReplyProcessor21 and others TBD to "give up" after the timeout 
> has elapsed
>  * changing higher-level code dependent on request-reply messaging so it 
> properly handles the situations where we might have to "give up"
> This issue affects all versions of Geode.
> h2. Counterpoint
> Not everybody thinks timeouts are a good idea. Here are some alternative ideas:
>  
> Make the request-response primitive better. Make it so that only bugs in our core 
> messaging framework could cause a lack of response, rather than our current 
> approach where a bug in a class like “RemotePutMessage” could cause a lack of 
> a response.
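
As a rough illustration of the "configurable timeout" first step listed in the description, here is a minimal sketch of a reply waiter that gives up once the timeout has elapsed. It is not Geode's ReplyProcessor21: the class name TimedReplyWaiter, its methods, and the system property example.reply-timeout-ms are all hypothetical, chosen only to show the shape of the idea. Retries and duplicate suppression are deliberately left out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical stand-in for a reply processor that gives up after a timeout.
 * This is NOT Geode's ReplyProcessor21; every name here is illustrative.
 */
final class TimedReplyWaiter {
  /** Assumed configuration parameter; the property name is made up for this sketch. */
  static final String TIMEOUT_PROPERTY = "example.reply-timeout-ms";

  private final CountDownLatch pendingReplies;
  private final long timeoutMillis;

  TimedReplyWaiter(int expectedReplies, long timeoutMillis) {
    if (timeoutMillis <= 0) {
      throw new IllegalArgumentException("timeout must be positive");
    }
    this.pendingReplies = new CountDownLatch(expectedReplies);
    this.timeoutMillis = timeoutMillis;
  }

  /** Reads the timeout from the (hypothetical) system property, defaulting to 15 seconds. */
  static TimedReplyWaiter fromConfiguration(int expectedReplies) {
    return new TimedReplyWaiter(expectedReplies, Long.getLong(TIMEOUT_PROPERTY, 15_000L));
  }

  /** Called by the receive path once for each reply that arrives. */
  void replyReceived() {
    pendingReplies.countDown();
  }

  /**
   * Blocks the requesting thread until every expected reply has arrived,
   * or throws TimeoutException instead of hanging once the timeout elapses.
   */
  void waitForReplies() throws InterruptedException, TimeoutException {
    if (!pendingReplies.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
      throw new TimeoutException("gave up after " + timeoutMillis + " ms with "
          + pendingReplies.getCount() + " replies outstanding");
    }
  }
}
```

In this sketch the higher-level caller has to catch TimeoutException and decide what "giving up" means for its operation, which corresponds to the third bullet in the description: the hard part is the calling code, not the wait itself.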



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
