Re: [ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Hefty, Sean Thu, 24 Jun 2010 15:19:14 -0700

> I'm performing experiments to find sailent points of interest, not looking
> for a solution by moving code....
> Is it GetThreadId() inside of the lock?
> Is it the thread_id = 0 ?
> What's magic?


thread_id is being used in an attempt to detect if a call into cl_timer_start 
and/or cl_timer_stop are being made from the timer callback.  It does this to 
know if it can block waiting for the timer callback thread to complete.  If 
we're calling from the timer callback thread, we can't block.

Currently the cl_timer callback routine can be invoked multiple times at once.  
If thread_id is set outside of a lock then:

thread 1 - sets thread_id = 1
thread 2 - overwrites thread_id = 2
thread 1 grabs the lock and invokes the user's callback
thread 1 calls cl_timer_stop, which compares GetThreadId() with thread_id
           the two do not match, so the code falsely assumes that we are not
           in a callback

To prevent this, thread_id must be set under the same lock used to serialize 
the callback.

The second issue is whether thread_id needs to be set to 0.  If thread_id is 
NOT reset to 0:

thread 1 - sets thread_id = 1 for timer X
thread 1 invokes the user's callback and returns

thread 1 is part of a system thread pool.  If that thread is used for some 
other purpose, for example, to process some other timer Y:

thread 1 is handling timer Y
thread 1 calls cl_timer_stop for timer X
thread 1 is NOT in the callback for timer X, but cl_timer_stop assumes that it 
is

To prevent this, thread_id must be reset to 0, and it needs to be under the 
same lock as above to prevent overwriting some other thread setting it.

I'm working on a patch to make the cl_timer code simpler based on my previous 
email.
_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw

Re: [ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Reply via email to