On 11/07/2012 07:24 AM, Gordon Sim wrote:
> On 11/07/2012 11:52 AM, Gary Kotton wrote:
>> On 11/07/2012 01:49 PM, Gordon Sim wrote:
>>> On 11/07/2012 11:36 AM, Gary Kotton wrote:
>>>> On 11/07/2012 01:12 PM, Gordon Sim wrote:
>>>>> On 11/07/2012 08:08 AM, Gary Kotton wrote:
>>>>>> Hi,
>>>>>> I'm Gary, and I'm working on the OpenStack project. We have run
>>>>>> into a problem with eventlet monkey patching QPID. We are using
>>>>>> Python. If the service that I am running starts before qpidd, it
>>>>>> hangs and does not recover, even once the qpidd daemon starts.
>>>>>
>>>>> If qpidd is running before you start the patched client, but is then
>>>>> stopped and restarted, does the client manage to reconnect?
>>>>
>>>> Sadly not.
>>>
>>> Does setting the reconnect option to false have any effect? Does the
>>> connection then fail as expected, or does it still hang?
>>
>> The connection attempt fails and does not hang.
> 
> Ok, so in essence reconnect doesn't work when the socket library is
> patched. I'm not exactly sure how eventlet works, but perhaps if I
> describe roughly what the python client is doing that may shed some
> light...
> 
> When reconnect is on and a connection can't be established, the
> application thread in the open() call will be waiting on a condition
> variable (waiting essentially for the connection object to be marked open).
> 
> That condition is usually met by the driver thread periodically
> rechecking the connection, and when it can be established it sets the
> condition.
> 
> Could the patching of the socket be causing the driver thread to exit?
> Perhaps some slightly different error condition is raised? Are you using
> SSL?

I don't think I've gotten to the root of the problem yet, but I've
spent some time trying to understand it better.

First, eventlet is an odd beast.  It patches much of the Python
standard library so that it can do its own green-thread scheduling at
various convenient blocking points.  select() is one of the functions
eventlet replaces.
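
To make that concrete, here's a minimal sketch of what the patching
means in practice:

    import eventlet
    eventlet.monkey_patch()   # replaces select, socket, thread, time, ...

    import select
    # select.select is now eventlet's green version, which yields to
    # eventlet's hub instead of blocking in the kernel:
    print(select.select)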

As you described, the thread doing the connection.open() uses a
Condition to wait for the connection to be opened.  The Condition uses
a pipe internally: it does a select() on the read end of the pipe with
a timeout.  If something happens that means the condition should be
re-checked, the write end of the pipe is written to, which unblocks the
select() so the waiter can go back and check the condition it's
waiting for.
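
Roughly, the wait/notify pattern looks like this (hypothetical names,
not the actual python-qpid source):

    import os
    import select

    class PipeWaiter(object):
        def __init__(self):
            # Wakeup channel: notify() writes a byte, wait() selects
            # on the read end.
            self.read_fd, self.write_fd = os.pipe()

        def wait(self, predicate, timeout):
            while not predicate():
                ready, _, _ = select.select([self.read_fd], [], [],
                                            timeout)
                if self.read_fd in ready:
                    # The read that blocks when select() misreports:
                    os.read(self.read_fd, 1)

        def notify(self):
            # One byte is enough to wake the select() in wait().
            os.write(self.write_fd, b'\x00')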

The lock-up occurs because select() reports that the pipe is ready to
be read before anything has actually been written to it.  So, the code
proceeds to do a read() and of course blocks.  Because eventlet's
greenthreads are scheduled cooperatively, hitting this blocking read()
means no other thread ever runs again.
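
That scheduling behavior is easy to demonstrate on its own (this hangs
deliberately; it does not reproduce the bad select() result itself):

    import os
    import eventlet

    def ticker():
        while True:
            print('tick')
            eventlet.sleep(0.5)

    eventlet.spawn(ticker)
    eventlet.sleep(0)  # yield once so the ticker gets going

    r, w = os.pipe()
    # Nothing is ever written to w, so this read blocks in the kernel.
    # A blocking syscall never yields back to eventlet's hub, so the
    # ticker (and every other greenthread) stops running for good.
    os.read(r, 1)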

I added a second select() in the code as a sanity check to basically ask
eventlet ... "are you *really* sure this fd was ready for reading?!".
That "fixes" the issue.  The second select() responds as it should.  Sigh.

I've tried writing some simple test code using pipes and eventlet's
select, but can't reproduce broken behavior, so I can't say I understand
the root cause here.

Disabling the automatic reconnect code is something we can certainly do
short term; we actually have our own reconnect code written into the
module anyway.  That at least avoids the immediate issue, because it
makes much less use of pipes to signal between the thread internal to
python-qpid and our code.  It's avoiding the root cause, though.

The only other idea I have right now is to patch the patched select()
(ugh): after eventlet's version returns a result, make a non-blocking
call to the native system's select() on just the fds that eventlet
reported, check that the two agree, and return only what the system's
select() said was ready for reading/writing.
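
Something along these lines, using eventlet.patcher.original() to get
at the unpatched module (checked_select is a made-up name, and this
assumes monkey_patch() has already been called):

    import select as green_select   # patched after monkey_patch()
    from eventlet import patcher

    _real_select = patcher.original('select')  # the unpatched module

    def checked_select(rlist, wlist, xlist, timeout=None):
        # Let eventlet's select() do the (green) blocking ...
        r, w, x = green_select.select(rlist, wlist, xlist, timeout)
        if not (r or w or x):
            return r, w, x
        # ... then confirm with a zero-timeout native select() on just
        # the fds eventlet claimed were ready, and trust only that.
        return _real_select.select(r, w, x, 0)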

-- 
Russell Bryant
