> On 21 Dec 2016, at 16:52 , Sven Van Caekenberghe <[email protected]> wrote:
> 
>> 
>> On 21 Dec 2016, at 16:04, Henrik Johansen <[email protected]> 
>> wrote:
>> 
>>> 
>>> On 21 Dec 2016, at 14:57 , Sven Van Caekenberghe <[email protected]> wrote:
>>> 
>>> Henrik,
>>> 
>>> Thank you for this detailed feedback.
>>> 
>>>> On 21 Dec 2016, at 12:44, Henrik Johansen <[email protected]> 
>>>> wrote:
>>>> 
>>>> Hi Sven!
>>>> One thing I noticed when testing the RabbitMQ client with keepalive > 0 
>>>> was the connection being closed and all subscriptions lost when receiving 
>>>> large payloads, because the keepalive packet could not be sent before the 
>>>> timeout while waiting to receive the next object.
>>> 
>>> Indeed, I used my Stamp (STOMP) Rabbit MQ client as a model for the MQTT 
>>> one - there is a lot of similarity, especially in concept.
>>> 
>>> Keep alive processing is not that easy. I tried to do it by using read 
>>> timeouts as a source of regular opportunities to check whether the keep 
>>> alive logic needs to run. But of course, if you have no outstanding read 
>>> (in a loop), that won't work.
>>> 
>>> The fact that receiving a large payload would trigger an actual keep alive 
>>> timeout is not something that I have seen myself. It seems weird that 
>>> reading/transferring incoming data would not count as activity against the 
>>> keep alive, no?
>> 
>> I never had a chance to investigate fully, but I distinctly remember having 
>> the same reaction!
>> It was quite a while ago now, so my memory might be hazy; take the following 
>> with an appropriate number of grains of salt.
>> 
>> The first times I encountered it, it seemed quite random, occurring after 
>> extended periods of client inactivity after receiving only small payloads...
>> Setting a much shorter keepalive timeout than the default was/is very useful 
>> for reproducing/verifying the issue.
>> The timeouts then occurred relatively shortly after I'd received a single 
>> payload with no other activity, and disappeared once I removed the resetting 
>> of the lastActivity timestamp on reads, indicating that, at least for RabbitMQ 
>> (3.5 was the version at the time, I believe), receiving data was *not* 
>> counted as keep-alive activity.
>> 
>> The issue of large incoming payloads blocking the keepalive write was still 
>> unresolved: the connection was consistently cut off in the middle of (its 
>> own!) multi-MB payload transfers because the keepalive packet was not sent 
>> in time.
>> I couldn't see a solution other than abandoning the elegant single-threaded 
>> approach and doing keepalive as a separate high-priority process, but the 
>> architectural choice for the app changed at this point to not include an MQ 
>> in the first delivery, so the more involved rewrite needed before deploying 
>> in production kinda got stranded :/
> 
> Instinctively it feels like messaging and multi-MB payloads would not be a 
> good fit, at least I would suspect that they are an edge case. But maybe I am 
> wrong.
> 
> I think that the code in StampMedium>>#readBodyBytes: and 
> StampMedium>>#readBodyString: could be refactored to use a loop with chunk 
> buffers and, at the same time, check the elapsed time to fire back keep alive 
> pings if necessary. One ping too many would not hurt; better safe than sorry. 
> But it is hard to catch any/all network slowdowns, since any IO operation 
> could hang/timeout.
> 
> But all this would assume that the problematic situation can be recreated.

Absolutely; it is far from certain that the MQTT server implementations work 
the same way.
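Something like the chunked-read refactor you describe might look roughly like 
this; a hypothetical sketch only, where #chunkSize, #keepAliveDue and 
#sendKeepAlive are assumed selectors, not actual Stamp API:

```smalltalk
readBodyBytes: count
	"Hypothetical sketch: read a large body in chunks, checking between
	 chunks whether a keep alive ping is due. #chunkSize, #keepAliveDue
	 and #sendKeepAlive are assumed selectors, not actual Stamp API."
	| stream remaining chunk |
	remaining := count.
	stream := WriteStream on: (ByteArray new: count).
	[ remaining > 0 ] whileTrue: [
		chunk := remaining min: self chunkSize.
		stream nextPutAll: (self readStream next: chunk).
		remaining := remaining - chunk.
		self keepAliveDue ifTrue: [ self sendKeepAlive ] ].
	^ stream contents
```

That way even a multi-MB transfer gives regular opportunities to fire a ping, 
at the cost of copying through the chunk buffer.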

This is the test code I used for rabbit: 

consumer1 := StampClient new
	timeout: 1;
	heartbeat: 5000;
	login: 'guest';
	passcode: 'guest';
	open;
	subscribeTo: '/exchange/testall';
	yourself.

[ [ consumer1 runWith: [ :message | Transcript crShow: '1 read: ', message body ] ]
	ensure: [ consumer1 close ] ] forkAt: Processor userInterruptPriority.

producer := StampClient new
	login: 'guest';
	passcode: 'guest';
	open;
	yourself.

[ 1 to: 10000 do: [ :i | producer sendText: i asString to: '/exchange/testall' ] ]
	forkAt: Processor systemBackgroundPriority.

Sometimes, but not always, this would fail, either during processing, or some 
time after.
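For completeness, the separate high-priority keepalive process I mentioned 
could be sketched roughly like this (untested; #heartbeatSeconds, 
#keepAliveDue and #sendKeepAlive are assumed selectors, not actual client API):

```smalltalk
startKeepAliveProcess
	"Hypothetical sketch: ping from a dedicated high-priority process,
	 independent of the read loop, so a long blocking read cannot
	 starve the keepalive. Selectors are assumptions, not Stamp API."
	keepAliveProcess := [
		[ self isConnected ] whileTrue: [
			(Delay forSeconds: self heartbeatSeconds) wait.
			self keepAliveDue ifTrue: [ self sendKeepAlive ] ] ]
		forkAt: Processor highIOPriority
		named: 'MQ keepalive'
```

The write side would of course need a lock shared with the main client 
process, so a ping cannot interleave with a frame being written.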

A screenshot of how it looks when a disconnect happens during processing due to 
heartbeat failing can be found at:
https://tresor.it/s#HN4RG9PEX1SYyKy4NcM31w

Cheers,
Henry

