Hefty, Sean wrote:
>> The above code sequence in user mode cl_timer.c fails DHCP address
>> assignment upon compute node reboot.
>> HCA ports are 'ACTIVE' but no DHCP assignment. Kernel cl_timer V2
>> patches installed.
>>
>> If you go back to Tzachi's patch, then you get DHCP address
>> assignment correctly.
>>
>> thread_id = GetThreadId();
>> lock cb_serialize
>> callback()
>> unlock cb_serialize
>>
>> Currently building/testing without the lock/unlock cb_serialize.
>>
>> Will also test with
>>
>> thread_id = GetThreadId();
>> lock cb_serialize
>> callback()
>> thread_id = 0
>> unlock cb_serialize
>>
>> stay tuned.
>
> We are likely hitting another issue here. If thread_id is not reset
> to 0 and not set under the cb_serialize lock, then the check in
> cl_timer_stop will not work reliably. Moving code around until some
> test case passes isn't the approach we should be using. Both code
> segments above are racy. We're dealing with some race conditions
> that aren't going to be easy to reproduce.

I'm performing experiments to find the salient points of interest, not looking for a solution by moving code around. Is it GetThreadId() inside of the lock? Is it the thread_id = 0? What's the magic?

> Tzachi successfully identified races in cl_timer. We need to fix
> those, and if the fallout is that other bugs are more easily exposed,
> with consistent failures, then that's a good thing.
>
> - Sean
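
For reference, a minimal sketch of the pattern Sean describes: record the callback thread only after cb_serialize is taken, and clear it again before the lock is released, so that a stop routine can tell "called from inside the callback" apart from "called from some other thread". This is not the cl_timer code. It uses POSIX threads as a portable stand-in, and every name in it (sketch_timer_t, sketch_timer_fire, sketch_timer_stop, state_lock, in_callback, demo_callback) is hypothetical. The extra state_lock and the in_callback flag are assumptions added here so the check itself does not read the thread id unlocked.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a timer object; NOT the real cl_timer_t. */
typedef struct sketch_timer {
    pthread_mutex_t cb_serialize;  /* held for the whole callback */
    pthread_mutex_t state_lock;    /* guards the three fields below */
    pthread_t       cb_thread;     /* meaningful only while in_callback */
    bool            in_callback;   /* plays the role of thread_id != 0 */
    bool            stopped;
    void          (*callback)(struct sketch_timer *);
} sketch_timer_t;

void sketch_timer_init(sketch_timer_t *t, void (*cb)(sketch_timer_t *))
{
    pthread_mutex_init(&t->cb_serialize, NULL);
    pthread_mutex_init(&t->state_lock, NULL);
    t->cb_thread = pthread_self();
    t->in_callback = false;
    t->stopped = false;
    t->callback = cb;
}

/* Run the callback. The owner is recorded after cb_serialize is taken
 * and cleared before it is released. */
void sketch_timer_fire(sketch_timer_t *t)
{
    assert(t->callback != NULL);
    pthread_mutex_lock(&t->cb_serialize);

    pthread_mutex_lock(&t->state_lock);
    bool run = !t->stopped;
    if (run) {
        t->cb_thread = pthread_self();
        t->in_callback = true;
    }
    pthread_mutex_unlock(&t->state_lock);

    if (run) {
        t->callback(t);
        pthread_mutex_lock(&t->state_lock);
        t->in_callback = false;    /* analogue of thread_id = 0 */
        pthread_mutex_unlock(&t->state_lock);
    }

    pthread_mutex_unlock(&t->cb_serialize);
}

/* Returns true if called from inside the callback (must not wait, or it
 * would deadlock on cb_serialize); false if it waited for any callback
 * running on another thread to finish. */
bool sketch_timer_stop(sketch_timer_t *t)
{
    pthread_mutex_lock(&t->state_lock);
    bool from_callback = t->in_callback &&
                         pthread_equal(t->cb_thread, pthread_self());
    t->stopped = true;
    pthread_mutex_unlock(&t->state_lock);

    if (!from_callback) {
        /* Taking and dropping cb_serialize waits out a running callback. */
        pthread_mutex_lock(&t->cb_serialize);
        pthread_mutex_unlock(&t->cb_serialize);
    }
    return from_callback;
}

/* Demo callback: stops its own timer and records what stop reported. */
int  demo_calls = 0;
bool demo_stop_result = false;

void demo_callback(sketch_timer_t *t)
{
    demo_calls++;
    demo_stop_result = sketch_timer_stop(t);
}
```

In this sketch, if in_callback were never cleared, a later sketch_timer_stop from the thread that last ran the callback would wrongly take the "inside the callback" path and return without waiting. That is the kind of unreliable check the missing thread_id = 0 would cause.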
