On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:

> On 21/04/2012 23:08, Vlad wrote:
>> 
>> Greetings,
>> 
>>  I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible concurrency 
>> issue not covered by the "Thread Safety" guidelines:
>> 
>> - I start a small number (4) of threads,  each of which does some work and 
>> periodically executes hwloc_get_last_cpu_location() with 
>> HWLOC_CPUBIND_PROCESS
>> - occasionally, one or two of those threads will see the call fail with 
>> ENOSYS (even though the same call has already executed successfully a number 
>> of times)
>> 
>> These errors are transient and seem to occur only when some of the threads 
>> in the group are terminating. I've skimmed through the implementation in 
>> topology-linux.c and it seems plausible to me that the errors could be 
>> caused by failure to read /proc state "atomically" in the presence of 
>> concurrent thread starts/exits.
>> 
>> Of course, the latter is hard (impossible ?) to do because the state always 
>> changes and a snapshot can only be obtained with a single read() (which in 
>> turn would require knowing how many thread entries to expect in advance). 
>> However, returning ENOSYS in such cases does not seems intended but rather a 
>> flaw in retry logic. Similar issues may be present with other API methods 
>> that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
> 
> Can you try the attached patch? It doesn't abort the loop immediately on 
> per-tid errors anymore. This may work better when threads disappear. I don't 
> remember if the retry logic was written while thinking about adding threads 
> only or about adding and removing threads.
> 
> If the patch doesn't help, can you send your code to help debug things?

Will try this within a day or two. At the moment I am simply using a retry loop 
on ENOSYS and usually no more than one retry is needed.

> 
>> An alternative explanation could be that the retry logic is correct but the 
>> implementation relies on readdir(), which is documented to not be 
>> thread-safe: 
>> http://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
>> 
> 
> I don't this can happen. Your threads should not be accessing the same DIR 
> stream here.

You are probably correct. I was thinking of this code from 
https://svn.open-mpi.org/trac/hwloc/browser/trunk/src/topology-linux.c:

425     hwloc_linux_get_proc_tids(DIR *taskdir, unsigned *nr_tidsp, pid_t ** 
tidsp)
426     {
427       struct dirent *dirent;
428       unsigned nr_tids = 0;
429       unsigned max_tids = 32;
430       pid_t *tids;
431       struct stat sb;
432     
433       /* take the number of links as a good estimate for the number of tids 
*/
434       if (fstat(dirfd(taskdir), &sb) == 0)
435         max_tids = sb.st_nlink;
436     
437       tids = malloc(max_tids*sizeof(pid_t));
438       if (!tids) {
439         errno = ENOMEM;
440         return -1;
441       }
442     
443       rewinddir(taskdir);
444     
445       while ((dirent = readdir(taskdir)) != NULL) {

"taskdir" here is /proc/<pid>/task, correct? In which case the threads will be 
doing readdir() on the same DIR stream...




> 
> Thanks
> Brice
> 
> <fix_tids.patch>_______________________________________________
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users

Reply via email to