Stan, can you repost just the user space changes to cl_timer? > > If thread_id is set to zero under the cb_serialize lock, immediately > > after the return from the timer callback pfn_callback(), when compute > > nodes reboot they do not get IPv4 addresses by DHCP? > > Remove just the p_timer->thread_id = 0 after the callback and DHCP > > starts assigning addresses upon reboot? Who would have thought? > > This is only the user-mode changes? If you back those changes out on only > the node running OpenSM, do things work? That is, is it a problem with > OpenSM not bringing things up properly?
I gave this more thought, and I'm convinced that we need thread_id set as this: lock cb_serialize thread_id = GetID() callback() thread_id = 0 unlock cb_serialize if we want cl_timer_stop to block. Without this, the checks in cl_timer_stop/start won't work correctly. Personally, I don't see any reason to keep the check in cl_timer_stop to see if we're calling from the callback. Nothing that I see actually does this anyway, and it's just adding more complexity to the code. > > Also, DAPL tests hang when cl_timer callbacks are serialized. I suspect > > there has been a long standing bug that was never noticed.....sigh. > > Nothing wrong with uncovering bugs. I believe that's called progress. :) > > > Perhaps opensm should have it's own user-mode implementation of cl_timer > ? I thought about this, and it may be necessary if we need to serialize all timers. But I still believe that the other cl_timer implementation needs to be fixed. > I think we all agree that the multiple callback model for timers is broken > - each timer should have at most one callback executing at a time. So we > need to change cl_timer both in user and kernel to at least fix that. If > other code ends up broken, that's a side effect. The calling code should definitely not rely on a single timer calling them back multiple times. > One thing to keep in mind is that with the lock to serialize the callback > threads, you may block threadpool threads that need to run something else, > so the deadlock might be a side effect of serializing the callbacks with a > lock. What thread pool are you referring to? I believe all users of cl_timer acquire a lock to serialize access to their own lists. In theory, all we've done is move the locking down into the cl_timer abstraction. - Sean _______________________________________________ ofw mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
