Sure, I'll put it in hwloc v1.10 after a bit more testing.

Brice
On 30/03/2014 18:00, Tim Creech wrote:
> Thanks! This is very helpful. With the patch in place I see very
> reasonable output.
>
> Might this patch (eventually) make it into a hwloc release?
>
> -Tim
>
> On Sun, Mar 30, 2014 at 05:32:38PM +0200, Brice Goglin wrote:
>> Don't worry, binding multithreaded processes is not a corner case. I was
>> rather talking about the general case of "distributing fewer processes
>> than there are objects and returning cpusets as large as possible".
>>
>> The attached patch should help. Please let me know.
>>
>> Brice
>>
>> On 30/03/2014 17:08, Tim Creech wrote:
>>> Hi Brice,
>>> First, my apologies if this email starts a new thread. For some reason I
>>> never received your response through Mailman and can only see it through
>>> the web archive interface, so I'm constructing this response without
>>> headers like "In-Reply-To".
>>>
>>> Thank you for your very helpful response. I'll use your explanation of
>>> the algorithm and try to understand the implementation. I was indeed
>>> expecting hwloc-distrib to help me bind multithreaded processes,
>>> although I certainly can understand that this is considered a corner
>>> case. Could you please consider fixing this?
>>>
>>> Thanks,
>>> Tim
>>>
>>> Brice Goglin wrote:
>>>> Hello,
>>>>
>>>> This is the main corner case of hwloc-distrib. It can return objects
>>>> only, not groups of objects.
>>>> The distrib algorithm is:
>>>> 1) Start at the root, where there are M children and you have to
>>>> distribute N processes.
>>>> 2) If there are no children, or if N is 1, return the entire object.
>>>> 3) Split N into M pieces Ni (with N = sum of the Ni) based on each
>>>> child's weight (the number of PUs under it).
>>>> If N >= M, every Ni can be > 0, so all children get some processes;
>>>> if N < M, you can't split N into M non-zero integer pieces, so some
>>>> Ni will be 0 and those objects won't get any process.
>>>> 4) Go back to (2) and recurse into each child object with Ni instead
>>>> of N.
>>>>
>>>> Your case is step 3 with N=2 and M=4. It basically means that we
>>>> distribute across cores without "assembling groups of cores if
>>>> needed".
>>>>
>>>> In your case, when you bind to 2 cores of 4 PUs each, your task only
>>>> uses one PU in the end; 1 core and 3 PUs are ignored as well. They
>>>> *may* be used, but the operating system scheduler is free to ignore
>>>> them. So binding to 2 cores, binding to 1 core, or binding to 1 PU
>>>> are almost equivalent. At least the latter is included in the former,
>>>> and most people pass --single to get a single PU anyway.
>>>>
>>>> The case where it's not equivalent is when you bind multithreaded
>>>> processes. If you have 8 threads, it's better to use 2 cores than a
>>>> single one. If this case matters to you, I will look into fixing
>>>> this corner case.
>>>>
>>>> Brice
>>>>
>>>> On 30/03/2014 07:56, Tim Creech wrote:
>>>>> Hello,
>>>>> I would like to use hwloc_distrib for a project, but I'm having some
>>>>> trouble understanding how it distributes. Specifically, it seems to
>>>>> avoid distributing multiple processes across cores, and I'm not sure
>>>>> why.
>>>>>
>>>>> As an example, consider the actual output of:
>>>>>
>>>>> $ hwloc-distrib -i "4 4" 2
>>>>> 0x0000000f
>>>>> 0x000000f0
>>>>>
>>>>> I'm expecting hwloc-distrib to tell me how to distribute 2 processes
>>>>> across the 16 PUs (4 cores by 4 PUs), but the answer only involves 8
>>>>> PUs, leaving the other 8 unused. If there were more cores on the
>>>>> machine, then potentially the vast majority of them would be unused.
>>>>>
>>>>> In other words, I might expect the output to use all of the PUs
>>>>> across cores, for example:
>>>>>
>>>>> $ hwloc-distrib -i "4 4" 2
>>>>> 0x000000ff
>>>>> 0x0000ff00
>>>>>
>>>>> Why does hwloc-distrib leave PUs unused? I'm using hwloc-1.9. Any
>>>>> help in understanding where I'm going wrong is greatly appreciated!
>>>>>
>>>>> Thanks,
>>>>> Tim
>>>>>
>>>>> _______________________________________________
>>>>> hwloc-users mailing list
>>>>> hwloc-users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>> diff --git a/include/hwloc/helper.h b/include/hwloc/helper.h
>> index 750f404..62fbba4 100644
>> --- a/include/hwloc/helper.h
>> +++ b/include/hwloc/helper.h
>> @@ -685,6 +685,7 @@ hwloc_distrib(hwloc_topology_t topology,
>>  {
>>    unsigned i;
>>    unsigned tot_weight;
>> +  unsigned given, givenweight;
>>    hwloc_cpuset_t *cpusetp = set;
>>
>>    if (flags & ~HWLOC_DISTRIB_FLAG_REVERSE) {
>> @@ -697,23 +698,40 @@ hwloc_distrib(hwloc_topology_t topology,
>>      if (roots[i]->cpuset)
>>        tot_weight += hwloc_bitmap_weight(roots[i]->cpuset);
>>
>> -  for (i = 0; i < n_roots && tot_weight; i++) {
>> -    /* Give to roots[] a portion proportional to its weight */
>> +  for (i = 0, given = 0, givenweight = 0; i < n_roots; i++) {
>> +    unsigned chunk, weight;
>>      hwloc_obj_t root = roots[flags & HWLOC_DISTRIB_FLAG_REVERSE ? n_roots-1-i : i];
>> -    unsigned weight = root->cpuset ? hwloc_bitmap_weight(root->cpuset) : 0;
>> -    unsigned chunk = (n * weight + tot_weight-1) / tot_weight;
>> -    if (!root->arity || chunk == 1 || root->depth >= until) {
>> +    hwloc_cpuset_t cpuset = root->cpuset;
>> +    if (!cpuset)
>> +      continue;
>> +    weight = hwloc_bitmap_weight(cpuset);
>> +    if (!weight)
>> +      continue;
>> +    /* Give to roots[] a chunk proportional to its weight.
>> +     * If previous chunks got rounded-up, we'll get a bit less. */
>> +    chunk = (( (givenweight+weight) * n + tot_weight-1) / tot_weight)
>> +          - (( givenweight * n + tot_weight-1) / tot_weight);
>> +    if (!root->arity || chunk <= 1 || root->depth >= until) {
>>        /* Got to the bottom, we can't split any more, put everything there. */
>> -      unsigned j;
>> -      for (j=0; j<n; j++)
>> -        cpusetp[j] = hwloc_bitmap_dup(root->cpuset);
>> +      if (chunk) {
>> +        /* Fill cpusets with ours */
>> +        unsigned j;
>> +        for (j=0; j < chunk; j++)
>> +          cpusetp[j] = hwloc_bitmap_dup(cpuset);
>> +      } else {
>> +        /* We got no chunk, just add our cpuset to a previous one
>> +         * so that we don't get ignored.
>> +         * (the first chunk cannot be empty). */
>> +        assert(given);
>> +        hwloc_bitmap_or(cpusetp[-1], cpusetp[-1], cpuset);
>> +      }
>>      } else {
>>        /* Still more to distribute, recurse into children */
>>        hwloc_distrib(topology, root->children, root->arity, cpusetp, chunk, until, flags);
>>      }
>>      cpusetp += chunk;
>> -    tot_weight -= weight;
>> -    n -= chunk;
>> +    given += chunk;
>> +    givenweight += weight;
>>    }
>>
>>    return 0;