I'm second-guessing the webrev that I just sent out.

This is the stack that caused the problem:

              libc.so.1`__lwp_park+0xb
              libc.so.1`sema_wait+0x10
              libc.so.1`sem_wait+0x35
              libpython2.6.so.1.0`PyThread_acquire_lock+0x4b
              libpython2.6.so.1.0`lock_PyThread_acquire_lock+0x54
              libpython2.6.so.1.0`PyCFunction_Call+0x19e
              libpython2.6.so.1.0`call_function+0x41c
              libpython2.6.so.1.0`PyEval_EvalFrameEx+0x3029
                
[/usr/lib/python2.6/vendor-packages/pkg/client/transport/fileobj.py:62 (close) ]
              libpython2.6.so.1.0`fast_function+0x108
              libpython2.6.so.1.0`call_function+0xee
              libpython2.6.so.1.0`PyEval_EvalFrameEx+0x3029
                
[/usr/lib/python2.6/vendor-packages/pkg/client/transport/fileobj.py:51 (__del__) ]
              libpython2.6.so.1.0`PyEval_EvalCodeEx+0x91c
              libpython2.6.so.1.0`PyEval_EvalCodeEx+0x91c
              libpython2.6.so.1.0`function_call+0x191
              libpython2.6.so.1.0`PyObject_Call+0x64
              libpython2.6.so.1.0`instancemethod_call+0x160
              libpython2.6.so.1.0`PyObject_Call+0x64
              libpython2.6.so.1.0`PyEval_CallObjectWithKeywords+0xb7
              libpython2.6.so.1.0`slot_tp_del+0x5f
              libpython2.6.so.1.0`subtype_dealloc+0x146
              libpython2.6.so.1.0`instancemethod_dealloc+0x89
              pycurl.so`util_curl_xdecref+0x140
              pycurl.so`do_curl_reset+0x37
              libpython2.6.so.1.0`call_function+0x33c
              libpython2.6.so.1.0`PyEval_EvalFrameEx+0x3029
                
[/usr/lib/python2.6/vendor-packages/pkg/client/transport/engine.py:737 (__teardown_handle) ]
              libpython2.6.so.1.0`fast_function+0x108
              libpython2.6.so.1.0`call_function+0xee
              libpython2.6.so.1.0`PyEval_EvalFrameEx+0x3029
                
[/usr/lib/python2.6/vendor-packages/pkg/client/transport/engine.py:300 (__cleanup_requests) ]
              libpython2.6.so.1.0`fast_function+0x108
              libpython2.6.so.1.0`call_function+0xee
              libpython2.6.so.1.0`PyEval_EvalFrameEx+0x3029
                
[/usr/lib/python2.6/vendor-packages/pkg/client/transport/engine.py:478 (run) ]
              libpython2.6.so.1.0`fast_function+0x108
              libpython2.6.so.1.0`call_function+0xee
              libpython2.6.so.1.0`PyEval_EvalFrameEx+0x3029
                
[/usr/lib/python2.6/vendor-packages/pkg/client/transport/fileobj. ]

The deadlock in the proposed fix was caused by the caller grabbing the
transport lock before calling run(), and then trying to grab the same
lock again in close().  If you look carefully at this code, it's calling
__cleanup_requests(), which gathers status about transport requests that
have finished.  The __teardown_handle method resets the saved state
about the transport handle, calling into pycurl for part of this.  It
looks like the call into pycurl triggers garbage collection of an
orphaned StreamingFileObj.  I'm assuming that the transport handle's
reference to the fileobj's callback functions is why this happens now.
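To make the failure mode concrete, here's a minimal sketch (class and
function names are mine, not the actual pkg code): a plain
threading.Lock is not reentrant, so a garbage-collection-triggered
close() that wants the transport lock while the same thread already
holds it can never succeed.  Instead of actually deadlocking, the
sketch probes the lock non-blockingly to show the acquire would hang:

```python
import threading

transport_lock = threading.Lock()

class StreamingFileObjSketch:
    """Stand-in for StreamingFileObj; close() wants the transport lock."""
    def close(self):
        # A real close() would block forever here; probe instead.
        got_it = transport_lock.acquire(False)   # non-blocking attempt
        if got_it:
            transport_lock.release()
        self.would_deadlock = not got_it

def run_sketch():
    with transport_lock:            # caller holds the lock around run()
        obj = StreamingFileObjSketch()
        obj.close()                 # same thread: __del__-style cleanup
        return obj.would_deadlock

print(run_sketch())   # True: close() could not get the lock
```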

This stack is different from the one that generated the bug to begin
with:

Exception pycurl.error: error('curl object not on this multi-stack',) in <bound
method StreamingFileObj.__del__ of
<pkg.client.transport.fileobj.StreamingFileObj object at 0xdd6fa2c>> ignored

The exception above occurs when two threads try to remove the same
easy-handle from the multi-handle at once.  This allows us to infer that
the initial exception was generated by a race between the following
functions: remove_handle(), reset(), and __cleanup_requests().
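The shape of that race can be sketched without pycurl (MultiSketch is a
hypothetical stand-in, not the real multi handle): once one path has
removed an easy handle, any second removal attempt fails, which is
exactly the "curl object not on this multi-stack" failure mode:

```python
import threading

class MultiSketch:
    """Toy multi handle: tracks which easy handles are attached."""
    def __init__(self):
        self._handles = set()
        self._lock = threading.Lock()

    def add_handle(self, h):
        with self._lock:
            self._handles.add(h)

    def remove_handle(self, h):
        with self._lock:
            if h not in self._handles:
                # Mirrors pycurl raising pycurl.error here.
                raise ValueError('curl object not on this multi-stack')
            self._handles.remove(h)

m = MultiSketch()
m.add_handle('easy1')
m.remove_handle('easy1')      # first remover, e.g. __cleanup_requests()
try:
    m.remove_handle('easy1')  # second remover, e.g. a racing __del__
except ValueError as e:
    print(e)
```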

The __del__ method will call __teardown_handle() if remove_requests()
finds an active transport handle for the fileobj's URL.  This means
that a __del__ can modify the state of a running transport operation,
which is confirmed by the exception that caused this bug to be filed.
The second deadlock occurred because I failed to account for the case
where garbage collection occurs in the same thread as a running
transport operation.

The current fix covers both the synchronous and asynchronous cases.  In
the synchronous case, __del__ doesn't acquire the lock because the
thread already has it.  In the asynchronous case, the __del__ must wait
for the lock to be released if another thread is using the transport.
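One way to get both behaviors (a sketch of the assumed design, not the
actual webrev) is a lock that remembers its owning thread: close()
skips the acquire when the current thread already holds the lock, and
blocks otherwise:

```python
import threading

class OwnedLock:
    """A lock that remembers which thread holds it."""
    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def acquire(self):
        self._lock.acquire()
        self._owner = threading.current_thread()

    def release(self):
        self._owner = None
        self._lock.release()

    def held_by_me(self):
        # _owner can only equal the current thread if this thread set
        # it and hasn't released yet, so no extra locking is needed.
        return self._owner is threading.current_thread()

transport_lock = OwnedLock()

def close_sketch():
    """Tear down a handle, taking the lock only if we don't hold it."""
    need_lock = not transport_lock.held_by_me()
    if need_lock:
        transport_lock.acquire()
    try:
        pass  # ... remove the handle, reset saved state ...
    finally:
        if need_lock:
            transport_lock.release()
```

In the synchronous case close_sketch() runs under the caller's lock and
returns without touching it; in the asynchronous case it acquires and
releases the lock normally.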

That accounts for most of our state problems, but what about the side
effects of a synchronous __del__?  In the stack above, the caller is
cleaning up requests.  Could __del__ modify loop state that's needed
for correct operation when the call returns to the original frame?
Even if we assume that locking rules out a race, it looks like reset()
can trigger a synchronous __del__ in much the same way that
__cleanup_requests() does above.  Both of these functions could be
calling __teardown_handle() on a handle that they have yet to place in
the freelist.  Since __del__ calls remove_handle(), which also checks
the freelist, this path could also trigger the multi-stack removal
exception: reset or cleanup would have already removed the handle, but
not yet placed it on the freelist.

I believe that locking still needs to stay in the close() path, but
Danek proposed a solution where orphaned requests are placed in a list
or dictionary and cleaned up later.  That may be the only sane way to
solve this problem for the __del__ case.  The list/dictionary should
still be manipulated under a lock, so that asynchronous __del__s don't
race with a caller removing processed URLs from the object.

Does this sound like a sane solution?

I'm going to re-post parts of this analysis to the bug report.

-j

_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
