Python client enters busy loop on heartbeat timeout
---------------------------------------------------

                 Key: QPID-3860
                 URL: https://issues.apache.org/jira/browse/QPID-3860
             Project: Qpid
          Issue Type: Bug
          Components: Python Client
    Affects Versions: 0.14
         Environment: Client: Linux radudan.activesoft.ro 3.2.3-2.fc16.x86_64 
#1 SMP Fri Feb 3 20:08:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Server: Linux ophelia.activesoft.ro 2.6.32-220.2.1.el6.x86_64 #1 SMP Fri Dec 23 
02:21:33 CST 2011 x86_64 x86_64 x86_64 GNU/Linux

            Reporter: Radu Dan
            Priority: Critical


Let's asume the following code, that's linked against and connected to qpid 
0.14:

#!/bin/env python
import sys, time, logging
from qpid.messaging import *
logging.basicConfig(level=logging.INFO)

connection = Connection(
        host="ophelia.activesoft.ro",
        port=5672,
        heartbeat=1,
        reconnect=True,
        reconnect_interval_min=1,
        reconnect_interval_max=1,
        reconnect_limit=5,
        reconnect_timeout=5
)

connection.open()
session = connection.session()
sender = session.sender("test; {create: always}")

id = 0
while True:
        try:
                sender.send({
                        "hello": "world"
                })
                logging.info("Sent %d" % id)
                id += 1
        except KeyboardInterrupt:
                break
        except:
                logging.info("Error")
        finally:
                time.sleep(1)

The script will try to publish a message on the "test" queue to a broker, using 
a heartbeat value of 1s, and will reconnect at most 5 times with a timeout of 
1s before giving up and reporting the message as failed. This works as expected 
when gracefully shutting the broker down (service qpidd stop):

<<daemon is running>>
INFO:root:Sent 0
INFO:root:Sent 1
INFO:root:Sent 2
INFO:root:Sent 3
INFO:root:Sent 4
<<daemon has been stopped>>
WARNING:qpid.messaging:recoverable error[attempt 0]: connection aborted
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
<<daemon is still stopped, yet a reconnect apparently succeeds>>
WARNING:qpid.messaging:reconnect succeeded: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 0]: [Errno 104] Connection 
reset by peer
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 1]: [Errno 111] Connection 
refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 2]: [Errno 111] Connection 
refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 3]: [Errno 111] Connection 
refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 4]: [Errno 111] Connection 
refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 5]: [Errno 111] Connection 
refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
<<gives up here after 5 attempts>>
INFO:root:Error
INFO:root:Error
INFO:root:Error
<<daemon has been started>>
INFO:root:Error
INFO:root:Error

I don't know how exactly it gets a reconnect succeeded right after the daemon 
is killed, but it seems to reconnect 5 times, afterwards it will fail with some 
exception. However, it makes no more attempts to establish a connection when 
this state is reached. Now I'm not entirely sure about the philosophy behind 
the API, but wouldn't it be more useful to try and reconnect if down whenever a 
new message is sent? Right now the default behaviour is to block until the 
connection is back up; Tweaking the reconnect parameters will yield a call that 
throws an exception after a while, but that invalidates the entire 
connection>session>sender chain

This behaviour can be duplicated by kill -9, because the kernel will shutdown 
and close the TCP sockets associated with any killed processes. However, an 
even greater problem arises when disconnecting the interface (I just pulled the 
ether cable out of my test box, but service network stop should probably do the 
same thing). This way, the socket never really closes because it's waiting for 
the other end to shutdown(), which may never come because the other end might 
have already given up on the connection (via a hard reset for example). Since 
most linux TCP stacks are EXTREMELY patient with regards to socket timeouts, 
it's up to the heartbeat mechanism to detect a connection failure and switch to 
the same state as a shutdown/close would.

However, what happens is something like this:

INFO:root:Sent 0
INFO:root:Sent 1
INFO:root:Sent 2
INFO:root:Sent 3
INFO:root:Sent 4
WARNING:qpid.messaging:recoverable error[attempt 0]: heartbeat timeout

At this point, the python process enters a busy loop of some sort from which it 
will never recover (plugging the ether cable back in doesn't seem to do 
anything). Switching logging to debug doesn't output anything past the 
heartbeat timeout.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]

Reply via email to