Sure, I'll put it in hwloc v1.10 after a bit more testing.

Brice
On 30/03/2014 18:00, Tim Creech wrote:
> Thanks! This is very helpful. With the patch in place I see very
> reasonable output.
>
> Might this patch (eventually) make it into a hwloc release?
>
> -Tim
>
> On Sun, Mar 30, 2014 at 05:32:38PM +0200, Brice Goglin wrote:
>> Don't worry, binding multithreaded processes is not a corner case. I was
>> rather talking about the general case of "distributing fewer processes
>> than there are objects and returning cpusets as large as possible".
>>
>> The attached patch should help. Please let me know.
>>
>> Brice
>>
>> On 30/03/2014 17:08, Tim Creech wrote:
>>> Hi Brice,
>>> First, my apologies if this email starts a new thread. For some reason I
>>> never received your response through Mailman and can only see it through
>>> the web archive interface, so I'm constructing this response without
>>> headers like "In-Reply-To".
>>>
>>> Thank you for your very helpful response. I'll use your explanation of
>>> the algorithm and try to understand the implementation. I was indeed
>>> expecting hwloc-distrib to help me bind multithreaded processes,
>>> although I certainly can understand that this is considered a corner
>>> case. Could you please consider fixing this?
>>>
>>> Thanks,
>>> Tim
>>>
>>> Brice Goglin wrote:
>>>> Hello,
>>>>
>>>> This is the main corner case of hwloc-distrib. It can return objects
>>>> only, not groups of objects.
>>>> The distrib algorithm is:
>>>> 1) Start at the root, where there are M children and you have to
>>>> distribute N processes.
>>>> 2) If there are no children, or if N is 1, return the entire object.
>>>> 3) Split N into M pieces Ni (with N = sum of the Ni) based on each
>>>> child's weight (the number of PUs under it).
>>>> If N >= M, every Ni can be > 0, so all children get some processes;
>>>> if N < M, you can't split N into M non-zero integer pieces, so some
>>>> Ni will be 0 and those objects won't get any process.
>>>> 4) Go back to (2) and recurse into each child object with Ni instead
>>>> of N.
>>>>
>>>> Your case is step 3 with N=2 and M=4. It basically means that we
>>>> distribute across cores without "assembling groups of cores if
>>>> needed".
>>>>
>>>> In your case, when you bind to 2 cores of 4 PUs each, your task only
>>>> uses one PU in the end; 1 core and 3 PUs are ignored as well. They
>>>> *may* be used, but the operating system scheduler is free to ignore
>>>> them. So binding to 2 cores, binding to 1 core, or binding to 1 PU
>>>> are almost equivalent. At least the latter is included in the former,
>>>> and most people pass --single to get a single PU anyway.
>>>>
>>>> The case where it's not equivalent is when you bind multithreaded
>>>> processes. If you have 8 threads, it's better to use 2 cores than a
>>>> single one. If this case matters to you, I will look into fixing
>>>> this corner case.
>>>>
>>>> Brice
>>>>
>>>> On 30/03/2014 07:56, Tim Creech wrote:
>>>>> Hello,
>>>>> I would like to use hwloc_distrib for a project, but I'm having some
>>>>> trouble understanding how it distributes. Specifically, it seems to
>>>>> avoid distributing multiple processes across cores, and I'm not sure
>>>>> why.
>>>>>
>>>>> As an example, consider the actual output of:
>>>>>
>>>>> $ hwloc-distrib -i "4 4" 2
>>>>> 0x0000000f
>>>>> 0x000000f0
>>>>>
>>>>> I'm expecting hwloc-distrib to tell me how to distribute 2 processes
>>>>> across the 16 PUs (4 cores by 4 PUs), but the answer only involves 8
>>>>> PUs, leaving the other 8 unused. If there were more cores on the
>>>>> machine, then potentially the vast majority of them would be unused.
>>>>>
>>>>> In other words, I might expect the output to use all of the PUs
>>>>> across cores, for example:
>>>>>
>>>>> $ hwloc-distrib -i "4 4" 2
>>>>> 0x000000ff
>>>>> 0x0000ff00
>>>>>
>>>>> Why does hwloc-distrib leave PUs unused? I'm using hwloc-1.9. Any
>>>>> help in understanding where I'm going wrong is greatly appreciated!
>>>>>
>>>>> Thanks,
>>>>> Tim
>>>>>
>>>>> _______________________________________________
>>>>> hwloc-users mailing list
>>>>> hwloc-users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>> diff --git a/include/hwloc/helper.h b/include/hwloc/helper.h
>> index 750f404..62fbba4 100644
>> --- a/include/hwloc/helper.h
>> +++ b/include/hwloc/helper.h
>> @@ -685,6 +685,7 @@ hwloc_distrib(hwloc_topology_t topology,
>>  {
>>    unsigned i;
>>    unsigned tot_weight;
>> +  unsigned given, givenweight;
>>    hwloc_cpuset_t *cpusetp = set;
>>
>>    if (flags & ~HWLOC_DISTRIB_FLAG_REVERSE) {
>> @@ -697,23 +698,40 @@ hwloc_distrib(hwloc_topology_t topology,
>>      if (roots[i]->cpuset)
>>        tot_weight += hwloc_bitmap_weight(roots[i]->cpuset);
>>
>> -  for (i = 0; i < n_roots && tot_weight; i++) {
>> -    /* Give to roots[] a portion proportional to its weight */
>> +  for (i = 0, given = 0, givenweight = 0; i < n_roots; i++) {
>> +    unsigned chunk, weight;
>>      hwloc_obj_t root = roots[flags & HWLOC_DISTRIB_FLAG_REVERSE ? n_roots-1-i : i];
>> -    unsigned weight = root->cpuset ? hwloc_bitmap_weight(root->cpuset) : 0;
>> -    unsigned chunk = (n * weight + tot_weight-1) / tot_weight;
>> -    if (!root->arity || chunk == 1 || root->depth >= until) {
>> +    hwloc_cpuset_t cpuset = root->cpuset;
>> +    if (!cpuset)
>> +      continue;
>> +    weight = hwloc_bitmap_weight(cpuset);
>> +    if (!weight)
>> +      continue;
>> +    /* Give to roots[] a chunk proportional to its weight.
>> +     * If previous chunks got rounded-up, we'll get a bit less. */
>> +    chunk = (( (givenweight+weight) * n + tot_weight-1) / tot_weight)
>> +          - (( givenweight * n + tot_weight-1) / tot_weight);
>> +    if (!root->arity || chunk <= 1 || root->depth >= until) {
>>        /* Got to the bottom, we can't split any more, put everything there. */
>> -      unsigned j;
>> -      for (j=0; j<n; j++)
>> -        cpusetp[j] = hwloc_bitmap_dup(root->cpuset);
>> +      if (chunk) {
>> +        /* Fill cpusets with ours */
>> +        unsigned j;
>> +        for (j=0; j < chunk; j++)
>> +          cpusetp[j] = hwloc_bitmap_dup(cpuset);
>> +      } else {
>> +        /* We got no chunk, just add our cpuset to a previous one
>> +         * so that we don't get ignored.
>> +         * (the first chunk cannot be empty). */
>> +        assert(given);
>> +        hwloc_bitmap_or(cpusetp[-1], cpusetp[-1], cpuset);
>> +      }
>>      } else {
>>        /* Still more to distribute, recurse into children */
>>        hwloc_distrib(topology, root->children, root->arity, cpusetp, chunk, until, flags);
>>      }
>>      cpusetp += chunk;
>> -    tot_weight -= weight;
>> -    n -= chunk;
>> +    given += chunk;
>> +    givenweight += weight;
>>    }
>>
>>    return 0;