Hefty, Sean wrote: > Stan, can you repost just the user space changes to cl_timer? > >>> If thread_id is set to zero under the cb_serialize lock, immediately >>> after the return from the timer callback pfn_callback(), when >>> compute nodes reboot they do not get IPv4 addresses by DHCP? >>> Remove just the p_timer->thread_id = 0 after the callback and DHCP >>> starts assigning addresses upon reboot? Who would have thought? >> >> This is only the user-mode changes? If you back those changes out >> on only the node running OpenSM, do things work? That is, is it a >> problem with OpenSM not bringing things up properly? > > I gave this more thought, and I'm convinced that we need thread_id > set as this: > > lock cb_serialize > thread_id = GetID() > callback() > thread_id = 0 > unlock cb_serialize
The above code sequence in user mode cl_timer.c fails DHCP address assignment upon compute node reboot. HCA ports are 'ACTIVE' but no DHCP assignment. Kernel cl_timer V2 patches installed. If you go back to Tzachi's patch, then you get DHCP address assignment correctly . thread_id = GetThreadId(); lock cb_serialize callback() unlock cb_serialize Currently building/testing without the lock/unlock cb_serialize. Will also test with thread_id = GetThreadId(); lock cb_serialize callback() thread_id = 0 unlock cb_serialize stay tuned. > > if we want cl_timer_stop to block. Without this, the checks in > cl_timer_stop/start won't work correctly. Personally, I don't see > any reason to keep the check in cl_timer_stop to see if we're calling > from the callback. Nothing that I see actually does this anyway, and > it's just adding more complexity to the code. > >>> Also, DAPL tests hang when cl_timer callbacks are serialized. I >>> suspect there has been a long standing bug that was never >>> noticed.....sigh. >> >> Nothing wrong with uncovering bugs. I believe that's called >> progress. :) Turns out the DAPL test hangs are node count dependent. 4 nodes OK, 11 nodes OK, 21 nodes fail. After the 21 node failure, 11 & 4 nodes now fail....sigh. >> >>> Perhaps opensm should have it's own user-mode implementation of >>> cl_timer ? > > I thought about this, and it may be necessary if we need to serialize > all timers. But I still believe that the other cl_timer > implementation needs to be fixed. > >> I think we all agree that the multiple callback model for timers is >> broken - each timer should have at most one callback executing at a >> time. So we need to change cl_timer both in user and kernel to at >> least fix that. If other code ends up broken, that's a side effect. > > The calling code should definitely not rely on a single timer calling > them back multiple times. agreed. > >> One thing to keep in mind is that with the lock to serialize the >> callback threads, you may block threadpool threads that need to run >> something else, so the deadlock might be a side effect of >> serializing the callbacks with a lock. > > What thread pool are you referring to? > > I believe all users of cl_timer acquire a lock to serialize access to > their own lists. In theory, all we've done is move the locking down > into the cl_timer abstraction. > > - Sean _______________________________________________ ofw mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
