On 11/07/2012 07:24 AM, Gordon Sim wrote:
> On 11/07/2012 11:52 AM, Gary Kotton wrote:
>> On 11/07/2012 01:49 PM, Gordon Sim wrote:
>>> On 11/07/2012 11:36 AM, Gary Kotton wrote:
>>>> On 11/07/2012 01:12 PM, Gordon Sim wrote:
>>>>> On 11/07/2012 08:08 AM, Gary Kotton wrote:
>>>>>> Hi,
>>>>>> I'm Gary and working on the OpenStack project. We have run into a
>>>>>> problem with eventlet monkey patching QPID. We are using python. If the
>>>>>> service that I am running starts before qpidd then it hangs and does not
>>>>>> recover. Even when the qpidd daemon starts it does not recover.
>>>>>
>>>>> If qpidd is running before you start the patched client, but is then
>>>>> stopped and restarted, does the client manage to reconnect?
>>>>
>>>> Sadly not.
>>>
>>> Does setting the reconnect option to false have any effect? Does the
>>> connection then fail as expected, or does it still hang?
>>
>> The connect fails and does not hang.
>
> Ok, so in essence reconnect doesn't work when the socket library is
> patched. I'm not exactly sure how eventlet works, but perhaps if I
> describe roughly what the python client is doing that may shed some
> light...
>
> When reconnect is on and a connection can't be established, the
> application thread in the open() call will be waiting on a condition
> variable (waiting essentially for the connection object to be marked open).
>
> That condition is usually met by the driver thread periodically
> rechecking the connection, and when it can be established it sets the
> condition.
>
> Could the patching of the socket be causing the driver thread to exit?
> Perhaps some slightly different error condition is raised? Are you
> using SSL?
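For concreteness, here is a minimal sketch of the wait pattern described above: the application thread blocks in open() on a condition variable while a driver thread keeps retrying the connection and signals the condition once it succeeds. All of the names here are hypothetical; this is just the shape of the pattern, not the actual python-qpid implementation.

# Hypothetical sketch of the pattern described above. The application
# thread blocks in open() on a condition variable; a driver thread
# keeps retrying the connection and notifies the condition when it
# succeeds. Not the actual python-qpid code.
import socket
import threading
import time


class DummyConnection(object):
    def __init__(self, host, port, retry_interval=2.0):
        self.host = host
        self.port = port
        self.retry_interval = retry_interval
        self.opened = False
        self._cv = threading.Condition()
        self._driver = threading.Thread(target=self._driver_loop)
        self._driver.daemon = True

    def open(self):
        """Application thread: wait until the driver marks the connection open."""
        self._driver.start()
        with self._cv:
            while not self.opened:
                self._cv.wait()   # this is where the reported hang is observed

    def _driver_loop(self):
        """Driver thread: periodically recheck the connection and set the
        condition once it can be established."""
        while True:
            try:
                sock = socket.create_connection((self.host, self.port))
            except socket.error:
                time.sleep(self.retry_interval)
                continue
            with self._cv:
                self.opened = True
                self._cv.notify_all()
            sock.close()
            return


if __name__ == '__main__':
    # Blocks in open() until something is listening on localhost:5672.
    DummyConnection('localhost', 5672).open()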
I don't think I've gotten to the root of the problem, but I've spent some time trying to understand it better.

First, eventlet is an odd beast. It patches a number of modules in the Python standard library so that it can do its own thread scheduling at various convenient points. select() is one of the functions that eventlet replaces.

As you described, the thread doing connection.open() is using a Condition to wait for the connection to be opened. The Condition uses a pipe internally: it does a select() on the read end of the pipe with a timeout, and if something happens that means the condition should be re-checked, the write end of the pipe is written to so that the waiter unblocks from select() and goes back to check the condition it's waiting for.

The lock-up occurs because select() reports that the pipe is ready to be read from before anything has been written to it. The code then proceeds to do a read() and, of course, blocks. Because of how eventlet's greenthreads are scheduled, hitting this blocking read() means no other threads ever run again.

I added a second select() to the code as a sanity check, basically asking eventlet "are you *really* sure this fd was ready for reading?!". That "fixes" the issue: the second select() responds as it should. Sigh.

I've tried writing some simple test code using pipes and eventlet's select(), but I can't reproduce the broken behavior, so I can't say I understand the root cause here.

Disabling the automatic reconnect code is certainly something we can do in the short term. We actually have our own reconnect code written into the module anyway, and it at least avoids the immediate issue by making much less use of pipes to signal between the thread internal to python-qpid and our code. It avoids the root cause rather than addressing it, though.

The only other idea I have right now is to patch the patched select() (ugh) so that after eventlet's version returns a result, we make a non-blocking call to the native system's select() with the fds that eventlet's select() reported, and only return what the system's select() says is actually ready for reading/writing.

--
Russell Bryant
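To make the failure mode and the workaround described above concrete, here is a rough sketch of a pipe-backed waiter of the sort involved, with the "second select() sanity check" folded in. The class and method names are made up for illustration; this is not the actual python-qpid Condition/Waiter code.

# Rough sketch of a pipe-backed waiter plus the second-select() sanity
# check described above. Illustrative only; not the python-qpid code.
import os
import select


class PipeWaiter(object):
    def __init__(self):
        # The read end is select()ed on; the write end wakes the waiter.
        self._read_fd, self._write_fd = os.pipe()

    def wakeup(self):
        """Called from another thread when the condition should be re-checked."""
        os.write(self._write_fd, b'.')

    def wait(self, timeout=None):
        """Wait until woken up or until the timeout expires."""
        readable, _, _ = select.select([self._read_fd], [], [], timeout)
        if not readable:
            return False  # timed out
        # Workaround: ask again with a zero timeout before read()ing, in
        # case the (monkey patched) select() reported readiness spuriously.
        readable, _, _ = select.select([self._read_fd], [], [], 0)
        if not readable:
            return False  # the first select() result was bogus; don't block in read()
        os.read(self._read_fd, 1)
        return True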

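And a rough sketch of what that wrapper around the patched select() might look like, assuming eventlet's patcher.original() can be used to get at the unpatched select module. This is just an illustration of the idea, not tested code.

# Sketch of the "patch the patched select()" idea: after eventlet's
# select() returns, re-verify its answer with a zero-timeout call to the
# unpatched system select() and only report the fds both agree on.
# Assumes eventlet.patcher.original() hands back the unpatched module.
import select as patched_select          # eventlet's version, once monkey patched

from eventlet import patcher

_original_select = patcher.original('select')   # the real, unpatched module


def verified_select(rlist, wlist, xlist, timeout=None):
    r, w, x = patched_select.select(rlist, wlist, xlist, timeout)
    if not (r or w or x):
        return r, w, x
    # Ask the native select(), non-blocking, whether the reported fds
    # really are ready, and drop any that it disagrees about.
    return _original_select.select(r, w, x, 0)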