Re: [ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Hefty, Sean Tue, 22 Jun 2010 12:26:47 -0700

> The idea here is that after start, no one can call any other field until the 
> call back is finished so
> I fealt that there is no need for protection.


Here's the call trace that I'm thinking of:

app calls cl_timer_start or cl_timer_trim
timer_callback #1 is invoked
app calls cl_timer_start again (possible from callback #1)
timer_callback #2 is invoked - this changes the thread_id

at this point, the thread_id could refer to either timer_callback #1 or 
timer_callback #2.  We have no way of knowing. 

I think we need a larger fix to the timer code to ensure that all accesses to 
the cl_timer_t structure is fully synchronized.  There's no way that the user 
of the timer can provide this synchronization.  Secondly, the timer_callback 
should be fully serialized.  This is easily done by holding a mutex around the 
callback itself, with the mutex only used for this purpose.

Tracking the thread_id only seems useful if the timer_callbacks are serialized.

 
> Although there can be 2 callbacks at the same time, the only way that this 
> could happen is this:

In theory, I don't see that there's any limit to the number of callbacks that 
can be running at once.

> This is the model that opensm is using the timer.

Can you explain why opensm is using the timer?  Is it simply to kick off sweeps?

I'm more concerned about the mad code, which makes heavy use of timers to retry 
and time out mads, and we want fixes to both the user space and kernel code.
_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw

Re: [ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Reply via email to