Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux

Brice Goglin Mon, 23 Apr 2012 10:23:36 -0400

On 23/04/2012 16:13, Vlad wrote:

This one seems fine, too.
Note that it should always be possible to read at least the current thread's /proc data.

This code also works when the task reading the cpubinding/location is not part of the process it looks at.


Brice

In my workaround, should I run out of retries I default to hwloc_get_last_cpu_location(... HWLOC_CPUBIND_THREAD) -- since presumably that can't fail and the result is technically valid given hwloc_get_last_cpu_location() semantics (it reads state that's inherently transient).
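A minimal sketch of the workaround described above: retry on ENOSYS, then fall back. The names query_fn, query_with_retry, flaky_query and always_ok are hypothetical and exist only for illustration; the hwloc calls appear only in comments, so the block builds without hwloc installed.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*query_fn)(void *ctx);

/*
 * Call `primary` up to `max_tries` times, retrying only while it fails
 * with ENOSYS. If every attempt fails that way, call `fallback` once.
 * Any other error from `primary` is returned immediately so that it is
 * not masked. `*used_fallback` (if non-NULL) reports which path ran last.
 *
 * With hwloc, `primary` would wrap
 *   hwloc_get_last_cpu_location(topology, set, HWLOC_CPUBIND_PROCESS)
 * and `fallback` would wrap the same call with HWLOC_CPUBIND_THREAD.
 */
static int query_with_retry(query_fn primary, query_fn fallback, void *ctx,
                            int max_tries, int *used_fallback)
{
    if (used_fallback)
        *used_fallback = 0;

    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (primary(ctx) == 0)
            return 0;
        if (errno != ENOSYS)
            return -1;
    }

    if (used_fallback)
        *used_fallback = 1;
    return fallback(ctx);
}

/* Demo stand-in for the process-wide query: fails with ENOSYS
 * `*ctx` more times, then succeeds. */
static int flaky_query(void *ctx)
{
    int *fails_left = ctx;
    if (*fails_left > 0) {
        (*fails_left)--;
        errno = ENOSYS;
        return -1;
    }
    return 0;
}

/* Demo stand-in for the per-thread query, assumed not to fail. */
static int always_ok(void *ctx)
{
    (void)ctx;
    return 0;
}
```

Calling query_with_retry(flaky_query, always_ok, &fails, 3, &fb) with fails set to 1 succeeds on the second attempt without touching the fallback. With fails set to 5, all three attempts fail and the fallback supplies the result.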
On Apr 23, 2012, at 7:53 AM, Brice Goglin wrote:
On 21/04/2012 23:36, Vlad wrote:
On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
On 21/04/2012 23:08, Vlad wrote:
Greetings,
I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible concurrency issue not covered by the "Thread Safety" guidelines:
- I start a small number (4) of threads, each of which does some work and periodically executes hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS
- occasionally, one or two of those threads will see the call fail with ENOSYS (even though the same call has already executed successfully a number of times)
These errors are transient and seem to occur only when some of the threads in the group are terminating. I've skimmed through the implementation in topology-linux.c and it seems plausible to me that the errors could be caused by failure to read /proc state "atomically" in the presence of concurrent thread starts/exits.
Of course, the latter is hard (impossible?) to do because the state always changes and a snapshot can only be obtained with a single read() (which in turn would require knowing how many thread entries to expect in advance). However, returning ENOSYS in such cases does not seem intended but rather a flaw in the retry logic. Similar issues may be present with other API methods that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
Can you try the attached patch? It doesn't abort the loop immediately on per-tid errors anymore. This may work better when threads disappear. I don't remember if the retry logic was written while thinking about adding threads only or about adding and removing threads.
If the patch doesn't help, can you send your code to help debug things?
Will try this within a day or two. At the moment I am simply using a retry loop on ENOSYS and usually no more than one retry is needed.
Here's a possibly better patch. It lets the retry logic happen before checking whether we should return ENOSYS and friends.
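For readers without the attachment, the idea discussed in this thread can be sketched roughly as follows: list the tids under /proc/<pid>/task, visit each one while tolerating per-tid failures, then list again and retry if the set changed. This is not the code from fix_tids.patch or topology-linux.c. The names read_tids, foreach_tid_stable, tid_fn and count_tid are hypothetical, and so are the max_rounds limit and the choice of EAGAIN.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the tids listed in /proc/<pid>/task into a malloc'd array.
 * Returns the count, or -1 on error. */
static int read_tids(pid_t pid, pid_t **out)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/task", (long)pid);
    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    size_t cap = 16, n = 0;
    pid_t *tids = malloc(cap * sizeof *tids);
    struct dirent *e;
    while (tids && (e = readdir(dir)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                       /* skip "." and ".." */
        if (n == cap) {
            pid_t *tmp = realloc(tids, 2 * cap * sizeof *tids);
            if (!tmp) { free(tids); tids = NULL; break; }
            tids = tmp;
            cap *= 2;
        }
        tids[n++] = (pid_t)atol(e->d_name);
    }
    closedir(dir);
    if (!tids)
        return -1;
    *out = tids;
    return (int)n;
}

/* Hypothetical per-thread callback: 0 on success, -1 on failure. */
typedef int (*tid_fn)(pid_t tid, void *ctx);

/* Visit every thread of `pid`. A failing tid (for example a thread that
 * just exited) is skipped rather than aborting the loop. The listing is
 * re-read afterwards; if it changed, the whole pass is repeated, and
 * success or failure is decided only once the listing is stable.
 * A mere reordering of entries also triggers a (harmless) retry. */
static int foreach_tid_stable(pid_t pid, tid_fn fn, void *ctx, int max_rounds)
{
    pid_t *before = NULL;
    int nbefore = read_tids(pid, &before);
    if (nbefore < 0)
        return -1;

    for (int round = 0; round < max_rounds; round++) {
        int ok = 0;
        for (int i = 0; i < nbefore; i++)
            if (fn(before[i], ctx) == 0)
                ok++;

        pid_t *after = NULL;
        int nafter = read_tids(pid, &after);
        if (nafter < 0) { free(before); return -1; }

        int same = nafter == nbefore &&
                   memcmp(before, after, (size_t)nbefore * sizeof *before) == 0;
        free(before);
        before = after;
        nbefore = nafter;
        if (same) {
            free(before);
            return ok > 0 ? 0 : -1;
        }
    }
    free(before);
    errno = EAGAIN;                         /* never saw a stable listing */
    return -1;
}

/* Demo callback: counts the threads it is shown. */
static int count_tid(pid_t tid, void *ctx)
{
    (void)tid;
    (*(int *)ctx)++;
    return 0;
}
```

Because the callback runs again on every round, a caller that accumulates results in ctx would need to reset them between rounds. On Linux, foreach_tid_stable(getpid(), count_tid, &seen, 5) should visit at least the calling thread.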
Brice

<fix_tids.patch>
_______________________________________________
hwloc-users mailing list
hwloc-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users