This one seems fine, too. Note that it should always be possible to read at least the current thread's /proc data. In my workaround, if I run out of retries I fall back to hwloc_get_last_cpu_location(... HWLOC_CPUBIND_THREAD), since presumably that can't fail and the result is still technically valid given hwloc_get_last_cpu_location() semantics (it reads state that is inherently transient).
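A minimal sketch of that retry-then-fallback pattern follows. It is an illustration, not the actual workaround code: the helper name, the callback type and the two stand-in query functions are all hypothetical, and the stand-ins exist only so the sketch compiles without hwloc. In real use the primary callback would presumably wrap hwloc_get_last_cpu_location(topology, set, HWLOC_CPUBIND_PROCESS) and the fallback the same call with HWLOC_CPUBIND_THREAD.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type: returns 0 on success, or -1 with errno set. */
typedef int (*location_query_fn)(void *ctx);

/*
 * Call `primary` up to `max_attempts` times while it fails with ENOSYS.
 * Any other error is returned at once so that real failures are not masked.
 * Once the attempts are used up, return whatever `fallback` returns.
 */
static int query_with_retry(location_query_fn primary,
                            location_query_fn fallback,
                            void *ctx, int max_attempts)
{
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (primary(ctx) == 0)
            return 0;
        if (errno != ENOSYS)
            return -1;
    }
    return fallback(ctx);
}

/* Stand-ins for the hwloc calls (assumptions made for this sketch only). */
struct fake_state {
    int failures_left;  /* how many more times the primary query fails */
    int used_fallback;  /* set to 1 once the fallback has been called */
};

/* Simulates the process-wide query failing transiently with ENOSYS. */
static int fake_process_query(void *ctx)
{
    struct fake_state *s = ctx;
    if (s->failures_left > 0) {
        s->failures_left--;
        errno = ENOSYS;
        return -1;
    }
    return 0;
}

/* Simulates the per-thread query, assumed here to always succeed. */
static int fake_thread_query(void *ctx)
{
    struct fake_state *s = ctx;
    s->used_fallback = 1;
    return 0;
}
```

Returning immediately on anything other than ENOSYS is a design choice of the sketch: only the error seen in this thread is treated as transient.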
On Apr 23, 2012, at 7:53 AM, Brice Goglin wrote:

> On 21/04/2012 23:36, Vlad wrote:
>>
>> On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
>>
>>> On 21/04/2012 23:08, Vlad wrote:
>>>>
>>>> Greetings,
>>>>
>>>> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible
>>>> concurrency issue not covered by the "Thread Safety" guidelines:
>>>>
>>>> - I start a small number (4) of threads, each of which does some work and
>>>> periodically executes hwloc_get_last_cpu_location() with
>>>> HWLOC_CPUBIND_PROCESS
>>>> - occasionally, one or two of those threads will see the call fail with
>>>> ENOSYS (even though the same call has already executed successfully a
>>>> number of times)
>>>>
>>>> These errors are transient and seem to occur only when some of the threads
>>>> in the group are terminating. I've skimmed through the implementation in
>>>> topology-linux.c and it seems plausible to me that the errors could be
>>>> caused by failure to read /proc state "atomically" in the presence of
>>>> concurrent thread starts/exits.
>>>>
>>>> Of course, the latter is hard (impossible ?) to do because the state
>>>> always changes and a snapshot can only be obtained with a single read()
>>>> (which in turn would require knowing how many thread entries to expect in
>>>> advance). However, returning ENOSYS in such cases does not seem intended
>>>> but rather a flaw in retry logic. Similar issues may be present with other
>>>> API methods that rely on hwloc_linux_foreach_proc_tid() or
>>>> hwloc_linux_get_proc_tids().
>>>
>>> Can you try the attached patch? It doesn't abort the loop immediately on
>>> per-tid errors anymore. This may work better when threads disappear. I
>>> don't remember if the retry logic was written while thinking about adding
>>> threads only or about adding and removing threads.
>>>
>>> If the patch doesn't help, can you send your code to help debug things?
>>
>> Will try this within a day or two. At the moment I am simply using a retry
>> loop on ENOSYS and usually no more than one retry is needed.
>>
>
> Here's a possibly better patch. It lets the retry logic happen before
> checking whether we should return ENOSYS and friends.
>
> Brice
>
> <fix_tids.patch>