Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux

Vlad Mon, 23 Apr 2012 10:13:14 -0400

This one seems fine, too.

Note that it should always be possible to read at least the current thread's 
/proc data. In my workaround, if I run out of retries, I fall back to 
hwloc_get_last_cpu_location(... HWLOC_CPUBIND_THREAD), since presumably that 
can't fail and the result is technically valid given 
hwloc_get_last_cpu_location() semantics (it reads state that's inherently 
transient).
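
For anyone wanting to reproduce the workaround, here is a minimal sketch of the
retry-then-fall-back pattern described above. It is not code from hwloc or from
the actual application: the callback type, query_with_fallback(), struct
demo_state and the two stand-in query functions are all hypothetical names made
up for illustration, and the stand-ins only simulate a call that fails with
ENOSYS a few times.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*location_query_fn)(void *ctx);

/*
 * Try the process-wide query up to 1 + max_retries times while it keeps
 * failing with ENOSYS. Any other error is reported immediately. Once the
 * retries are exhausted, fall back to the thread-only query.
 */
static int query_with_fallback(location_query_fn process_wide,
                               location_query_fn thread_only,
                               void *ctx, int max_retries)
{
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (process_wide(ctx) == 0)
            return 0;
        if (errno != ENOSYS)
            return -1;          /* not the transient case, do not retry */
    }
    return thread_only(ctx);
}

/* Stand-ins for illustration only (assumption: not real hwloc calls). */
struct demo_state {
    int failures_left;   /* how many more times the process query fails */
    int fallback_used;   /* set to 1 when the thread-only path ran */
};

static int flaky_process_query(void *ctx)
{
    struct demo_state *st = ctx;
    if (st->failures_left > 0) {
        st->failures_left--;
        errno = ENOSYS;         /* simulate the transient failure */
        return -1;
    }
    return 0;
}

static int thread_query(void *ctx)
{
    struct demo_state *st = ctx;
    st->fallback_used = 1;
    return 0;
}
```

In real code the two callbacks would wrap hwloc_get_last_cpu_location() with
HWLOC_CPUBIND_PROCESS and HWLOC_CPUBIND_THREAD respectively, with the topology
and cpuset passed through ctx. The retry count is a tuning knob; in my case one
retry is usually enough.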


On Apr 23, 2012, at 7:53 AM, Brice Goglin wrote:

> On 21/04/2012 23:36, Vlad wrote:
>> 
>> On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
>> 
>>> On 21/04/2012 23:08, Vlad wrote:
>>>> 
>>>> Greetings,
>>>> 
>>>> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible 
>>>> concurrency issue not covered by the "Thread Safety" guidelines:
>>>> 
>>>> - I start a small number (4) of threads, each of which does some work and 
>>>> periodically executes hwloc_get_last_cpu_location() with 
>>>> HWLOC_CPUBIND_PROCESS
>>>> - occasionally, one or two of those threads will see the call fail with 
>>>> ENOSYS (even though the same call has already executed successfully a 
>>>> number of times)
>>>> 
>>>> These errors are transient and seem to occur only when some of the threads 
>>>> in the group are terminating. I've skimmed through the implementation in 
>>>> topology-linux.c and it seems plausible to me that the errors could be 
>>>> caused by failure to read /proc state "atomically" in the presence of 
>>>> concurrent thread starts/exits.
>>>> 
>>>> Of course, the latter is hard (impossible?) to do because the state 
>>>> always changes and a snapshot can only be obtained with a single read() 
>>>> (which in turn would require knowing how many thread entries to expect in 
>>>> advance). However, returning ENOSYS in such cases does not seem intended; 
>>>> it looks more like a flaw in the retry logic. Similar issues may be present with other 
>>>> API methods that rely on hwloc_linux_foreach_proc_tid() or 
>>>> hwloc_linux_get_proc_tids().
>>> 
>>> Can you try the attached patch? It doesn't abort the loop immediately on 
>>> per-tid errors anymore. This may work better when threads disappear. I 
>>> don't remember if the retry logic was written while thinking about adding 
>>> threads only or about adding and removing threads.
>>> 
>>> If the patch doesn't help, can you send your code to help debug things?
>> 
>> Will try this within a day or two. At the moment I am simply using a retry 
>> loop on ENOSYS and usually no more than one retry is needed.
>> 
> 
> Here's a possibly better patch. It lets the retry logic happen before 
> checking whether we should return ENOSYS and friends.
> 
> Brice
> 
> <fix_tids.patch>
> _______________________________________________
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
