On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
> On 21/04/2012 23:08, Vlad wrote: >> >> Greetings, >> >> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible concurrency >> issue not covered by the "Thread Safety" guidelines: >> >> - I start a small number (4) of threads, each of which does some work and >> periodically executes hwloc_get_last_cpu_location() with >> HWLOC_CPUBIND_PROCESS >> - occasionally, one or two of those threads will see the call fail with >> ENOSYS (even though the same call has already executed successfully a number >> of times) >> >> These errors are transient and seem to occur only when some of the threads >> in the group are terminating. I've skimmed through the implementation in >> topology-linux.c and it seems plausible to me that the errors could be >> caused by failure to read /proc state "atomically" in the presence of >> concurrent thread starts/exits. >> >> Of course, the latter is hard (impossible ?) to do because the state always >> changes and a snapshot can only be obtained with a single read() (which in >> turn would require knowing how many thread entries to expect in advance). >> However, returning ENOSYS in such cases does not seems intended but rather a >> flaw in retry logic. Similar issues may be present with other API methods >> that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids(). > > Can you try the attached patch? It doesn't abort the loop immediately on > per-tid errors anymore. This may work better when threads disappear. I don't > remember if the retry logic was written while thinking about adding threads > only or about adding and removing threads. > > If the patch doesn't help, can you send your code to help debug things? Will try this within a day or two. At the moment I am simply using a retry loop on ENOSYS and usually no more than one retry is needed. > >> An alternative explanation could be that the retry logic is correct but the >> implementation relies on readdir(), which is documented to not be >> thread-safe: >> http://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html >> > > I don't this can happen. Your threads should not be accessing the same DIR > stream here. You are probably correct. I was thinking of this code from https://svn.open-mpi.org/trac/hwloc/browser/trunk/src/topology-linux.c: 425 hwloc_linux_get_proc_tids(DIR *taskdir, unsigned *nr_tidsp, pid_t ** tidsp) 426 { 427 struct dirent *dirent; 428 unsigned nr_tids = 0; 429 unsigned max_tids = 32; 430 pid_t *tids; 431 struct stat sb; 432 433 /* take the number of links as a good estimate for the number of tids */ 434 if (fstat(dirfd(taskdir), &sb) == 0) 435 max_tids = sb.st_nlink; 436 437 tids = malloc(max_tids*sizeof(pid_t)); 438 if (!tids) { 439 errno = ENOMEM; 440 return -1; 441 } 442 443 rewinddir(taskdir); 444 445 while ((dirent = readdir(taskdir)) != NULL) { "taskdir" here is /proc/<pid>/task, correct? In which case the threads will be doing readdir() on the same DIR stream... > > Thanks > Brice > > <fix_tids.patch>_______________________________________________ > hwloc-users mailing list > hwloc-us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users