Hefty, Sean wrote:
>> The above code sequence in user mode cl_timer.c fails DHCP address
>> assignment upon compute node reboot.
>> HCA ports are 'ACTIVE' but no DHCP assignment. Kernel cl_timer V2
>> patches installed.
>>
>> If you go back to Tzachi's patch, then you get DHCP address
>> assignment correctly.
>>
>> thread_id = GetThreadId();
>> lock cb_serialize
>> callback()
>> unlock cb_serialize
>>
>> Currently building/testing without the lock/unlock cb_serialize.
>>
>> Will also test with
>>
>> thread_id = GetThreadId();
>> lock cb_serialize
>> callback()
>> thread_id = 0
>> unlock cb_serialize
>>
>> stay tuned.
>
> We are likely hitting another issue here.  If thread_id is not reset
> to 0 and not set under the cb_serialize lock, then the check in
> cl_timer_stop will not work reliably.  Moving code around until some
> test case passes isn't the approach we should be using.  Both code
> segments above are racy.  We're dealing with some race conditions
> that aren't going to be easy to reproduce.

I'm running experiments to find the salient points of interest, not looking
for a solution by moving code around.
Is it GetThreadId() inside of the lock?
Is it the thread_id = 0 ?
What's magic?
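
To make those questions concrete, here is a minimal sketch of the ordering
under discussion: thread_id is set under cb_serialize, cleared before the
unlock, and checked on the stop path. This is not the cl_timer.c code. It
assumes pthreads and C11 atomics in place of the Windows primitives, and every
sketch_* name is hypothetical. Actually cancelling the timer is left out; only
the wait-for-callback part of a stop is shown.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*sketch_cb_t)(void *context);

/* Hypothetical stand-in for the user-mode timer, not the real cl_timer_t. */
typedef struct sketch_timer {
    pthread_mutex_t   cb_serialize; /* held for the whole callback */
    _Atomic uintptr_t thread_id;    /* 0 means no callback is running */
    sketch_cb_t       callback;
    void             *context;
} sketch_timer_t;

/* Address of a thread-local object: nonzero and distinct per live thread.
 * Assumption: used here as a portable substitute for a Windows thread id. */
static _Thread_local char sketch_self_marker;

static uintptr_t sketch_self(void)
{
    return (uintptr_t)&sketch_self_marker;
}

void sketch_timer_init(sketch_timer_t *t, sketch_cb_t cb, void *context)
{
    pthread_mutex_init(&t->cb_serialize, NULL);
    atomic_init(&t->thread_id, (uintptr_t)0);
    t->callback = cb;
    t->context = context;
}

/* Expiry path: the id is published and cleared while the lock is held. */
void sketch_timer_fire(sketch_timer_t *t)
{
    pthread_mutex_lock(&t->cb_serialize);
    atomic_store(&t->thread_id, sketch_self());
    t->callback(t->context);
    atomic_store(&t->thread_id, (uintptr_t)0);
    pthread_mutex_unlock(&t->cb_serialize);
}

/* Stop path. Returns 1 if called from inside the callback (no wait, since
 * taking cb_serialize again would self-deadlock), 0 if it waited for any
 * running callback to finish. */
int sketch_timer_stop(sketch_timer_t *t)
{
    if (atomic_load(&t->thread_id) == sketch_self())
        return 1;
    pthread_mutex_lock(&t->cb_serialize);
    pthread_mutex_unlock(&t->cb_serialize);
    return 0;
}

/* Usage: a callback that stops its own timer and records what stop saw. */
typedef struct sketch_demo_ctx {
    sketch_timer_t *timer;
    int             stop_result;
} sketch_demo_ctx_t;

void sketch_demo_cb(void *context)
{
    sketch_demo_ctx_t *ctx = context;
    ctx->stop_result = sketch_timer_stop(ctx->timer);
}
```

In this sketch, if the store of 0 is dropped, a stale id stays behind after
the callback returns. A later stop from that same thread would then match it
and skip the wait, even though it is not running inside the callback.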

>
> Tzachi successfully identified races in cl_timer.  We need to fix
> those, and if the fallout is that other bugs are more easily exposed,
> with consistent failures, then that's a good thing.
>
> - Sean

_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
