Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Mike
>
> Ok then your mask 0x,0x,,,0x,0x
> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be
> displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".
>
> This would be OK if you're only spawning threads on the first socket. Do
> you see the same mask for threads on the other socket?
>
Yes, I do.
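(As an aside, hwloc bitmap strings like the one above are comma-separated 32-bit hex words, most significant word first, so they can be decoded by hand. A small sketch under that assumption; the `parse_hwloc_bitmap` helper is made up for illustration, it is not part of hwloc:)

```python
def parse_hwloc_bitmap(mask: str) -> list[int]:
    """Decode a hwloc bitmap string ("0x...,0x...") into the set PU indices."""
    bits = 0
    for word in mask.split(","):      # most significant 32-bit word comes first
        bits = (bits << 32) | int(word or "0", 16)   # empty word means 0x0
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# e.g. a mask covering PUs 0-7 plus SMT siblings 128-135:
parse_hwloc_bitmap("0xff,0x00000000,0x00000000,0x00000000,0x000000ff")
```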

Mike

Am Mi., 2. März 2022 um 09:53 Uhr schrieb Brice Goglin <
brice.gog...@inria.fr>:

> Le 02/03/2022 à 09:39, Mike a écrit :
>
> Hello,
>
> Please run "lstopo -.synthetic" to compress the output a lot. I will be
>> able to reuse it from here and understand your binding mask.
>>
> Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
> L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1
> PU:2(indexes=2*128:1*2)
>
>
> Ok then your mask 0x,0x,,,0x,0x
> corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be
> displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".
>
> This would be OK if you're only spawning threads on the first socket. Do
> you see the same mask for threads on the other socket?
>
> Brice
>
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users

Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Brice Goglin

Le 02/03/2022 à 09:39, Mike a écrit :

Hello,

Please run "lstopo -.synthetic" to compress the output a lot. I
will be able to reuse it from here and understand your binding mask.

Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432) 
L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768) 
Core:1 PU:2(indexes=2*128:1*2)




Ok then your mask 0x,0x,,,0x,0x 
corresponds exactly to NUMA node 0 (socket 0). Object cpusets can be 
displayed on the command-line with "lstopo --cpuset" or "hwloc-calc numa:0".


This would be OK if you're only spawning threads on the first socket. Do
you see the same mask for threads on the other socket?


Brice





Re: [hwloc-users] Problems with binding memory

2022-03-02 Thread Mike
Hello,

Please run "lstopo -.synthetic" to compress the output a lot. I will be
> able to reuse it from here and understand your binding mask.
>
Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1
PU:2(indexes=2*128:1*2)

Mike


Am Di., 1. März 2022 um 19:05 Uhr schrieb Brice Goglin <
brice.gog...@inria.fr>:

>
> Le 01/03/2022 à 17:34, Mike a écrit :
>
> Hello,
>
> Usually you would rather allocate and bind at the same time so that the
>> memory doesn't need to be migrated when bound. However, if you do not touch
>> the memory after allocation, pages are not actually physically allocated,
>> hence there's nothing to migrate. It might work, but keep this in mind.
>>
>
> I need all the data in one allocation, so that is why I opted to allocate
> and then bind via the area function. The way I understand it is that by
> using the memory binding policy HWLOC_MEMBIND_BIND with
> hwloc_set_area_membind() the pages will actually get allocated on the
> specified cores. If that is not the case I suppose the best solution would
> be to just touch the allocated data with my threads.
>
>
> set_area_membind() doesn't allocate pages, but it tells the operating
> system "whenever you allocate them, do it on that NUMA node". Anyway, what
> you're doing makes sense.
>
>
>
> Can you print memory binding like below instead of printing only the first
>> PU in the set returned by get_area_membind?
>>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> I tried that and now get_area_membind returns that all memory is bound to
> 0x,0x,,,0x,0x
>
>
> Please run "lstopo -.synthetic" to compress the output a lot. I will be
> able to reuse it from here and understand your binding mask.
> Brice
>
>

Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Brice Goglin


Le 01/03/2022 à 17:34, Mike a écrit :

Hello,

Usually you would rather allocate and bind at the same time so
that the memory doesn't need to be migrated when bound. However,
if you do not touch the memory after allocation, pages are not
actually physically allocated, hence there's nothing to migrate. It might
work, but keep this in mind.


I need all the data in one allocation, so that is why I opted to 
allocate and then bind via the area function. The way I understand it 
is that by using the memory binding policy HWLOC_MEMBIND_BIND with 
hwloc_set_area_membind() the pages will actually get allocated on the 
specified cores. If that is not the case I suppose the best solution 
would be to just touch the allocated data with my threads.



set_area_membind() doesn't allocate pages, but it tells the operating 
system "whenever you allocate them, do it on that NUMA node". Anyway, 
what you're doing makes sense.
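(To illustrate the semantics described above with a toy model, not hwloc code: set_area_membind() records a policy, and the physical placement only happens on first touch. All names here are illustrative.)

```python
policy = {}   # page -> NUMA node requested via the bind policy
placed = {}   # page -> NUMA node where it was physically allocated

def set_area_membind(pages, node):
    # Records where pages SHOULD go; no physical allocation happens here.
    for p in pages:
        policy[p] = node

def touch(page, touching_node):
    # First touch physically allocates: the recorded policy wins if present,
    # otherwise the page lands near the thread that touched it (default).
    if page not in placed:
        placed[page] = policy.get(page, touching_node)
    return placed[page]
```

With a policy set, the page lands on the bound node regardless of who touches it first; without one, first touch decides.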





Can you print memory binding like below instead of printing only
the first PU in the set returned by get_area_membind?

    char *s;
    hwloc_bitmap_asprintf(&s, set);
    /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound 
to 0x,0x,,,0x,0x




Please run "lstopo -.synthetic" to compress the output a lot. I will be 
able to reuse it from here and understand your binding mask.


Brice





Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Mike
Hello,

Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. It might work, but keep this in mind.
>

I need all the data in one allocation, so that is why I opted to allocate
and then bind via the area function. The way I understand it is that by
using the memory binding policy HWLOC_MEMBIND_BIND with
hwloc_set_area_membind() the pages will actually get allocated on the
specified cores. If that is not the case I suppose the best solution would
be to just touch the allocated data with my threads.

Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
char *s;
hwloc_bitmap_asprintf(&s, set);
/* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound to
0x,0x,,,0x,0x

>
> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated near
> the related threads (automatic by default). It works well when the number
> of threads is known in advance. You place one thread per core, they never
> move. As long as memory is big enough to store the data nearby, everybody's
> happy. If the number of threads varies at runtime, and/or if they need to
> move, things become more difficult.
>
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require moving things during the execution. Moving threads
> is usually cheaper. But oversubscribing cores with multiple threads is
> usually a bad idea, which is likely why people place one thread per core
> first.
>
My code is rather data-bound. My main motivation for binding the threads
is that I did not want hyperthreading on the cores, and I want to keep all
threads that operate on the same data within one L3 cache.

And send the output of lstopo on your machine so that I can understand it.
>
The machine has two sockets with 64 cores each. Cores 0-7 share one L3
cache, as do cores 8-15, and so on.
The output of lstopo is quite large, but if my description does not suffice
I can send it.
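A quick sketch of that layout, assuming the numbering above (8 cores per L3, SMT sibling os-indices offset by 128 as suggested by the synthetic topology string; the helper names are made up for illustration):

```python
CORES_PER_L3 = 8
TOTAL_CORES = 128          # 2 sockets x 64 cores

def l3_group(core: int) -> int:
    """Which L3 domain a core belongs to (cores 0-7 -> 0, 8-15 -> 1, ...)."""
    return core // CORES_PER_L3

def pu_os_indices(core: int) -> tuple[int, int]:
    """The two SMT sibling PUs of a core, assuming siblings sit at core+128."""
    return (core, core + TOTAL_CORES)
```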



Thanks for your time

Mike

Am Di., 1. März 2022 um 15:42 Uhr schrieb Brice Goglin <
brice.gog...@inria.fr>:

>
> Le 01/03/2022 à 15:17, Mike a écrit :
>
> Dear list,
>
> I have a program that utilizes Open MPI + multithreading and I want the
> freedom to decide on which hardware cores my threads should run. By using
> hwloc_set_cpubind() that already works, so now I also want to bind memory
> to the hardware cores. But I just can't get it to work.
>
> Basically, I wrote the memory binding into my allocator, so the memory
> will be allocated and then bound.
>
>
> Hello
>
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not touch
> the memory after allocation, pages are not actually physically allocated,
> hence there's nothing to migrate. It might work, but keep this in mind.
>
>
> I use hwloc 2.4.1, run the code on a Linux system, and I checked with
> “hwloc-info --support” whether hwloc_set_area_membind() and
> hwloc_get_area_membind() are supported; they are.
>
> Here is a snippet of my code, which runs through without any error. But
> the hwloc_get_area_membind() always returns that all memory is bound to PU
> 0, when I think it should be bound to different PUs. Am I missing something?
>
>
> Can you print memory binding like below instead of printing only the first
> PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> And send the output of lstopo on your machine so that I can understand it.
>
> Or you could print the smallest object that contains the binding by
> calling hwloc_get_obj_covering_cpuset(topology, set). It returns an object
> whose type may be printed as a C-string with
> hwloc_obj_type_string(obj->type).
>
> You may also do the same before set_area_membind() if you want to verify
> that you're binding where you really want.
>
>
>
> T* allocate(size_t n, hwloc_topology_t topology, int rank)
> {
>   // allocate memory
>   T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
>   // elements per thread
>   size_t ept = 1024;
>   hwloc_bitmap_t set;
>   size_t offset = 0;
>   size_t threadcount = 4;
>
>   set = hwloc_bitmap_alloc();
>   if(!set) {
>     fprintf(stderr, "failed to allocate a bitmap\n");
>   }
>   // bind memory to every thread
>   for(size_t i = 0; i < threadcount; i++)
>   {
>     // logical index of where to bind the memory
>     auto logid = (i + rank * threadcount) * 2;
>     auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);

Re: [hwloc-users] Problems with binding memory

2022-03-01 Thread Brice Goglin


Le 01/03/2022 à 15:17, Mike a écrit :


Dear list,

I have a program that utilizes Open MPI + multithreading and I want the
freedom to decide on which hardware cores my threads should run. By 
using hwloc_set_cpubind() that already works, so now I also want to 
bind memory to the hardware cores. But I just can't get it to work.


Basically, I wrote the memory binding into my allocator, so the memory 
will be allocated and then bound.




Hello

Usually you would rather allocate and bind at the same time so that the 
memory doesn't need to be migrated when bound. However, if you do not 
touch the memory after allocation, pages are not actually physically 
allocated, hence there's nothing to migrate. It might work, but keep this in mind.



I use hwloc 2.4.1, run the code on a Linux system, and I checked with
“hwloc-info --support” whether hwloc_set_area_membind() and
hwloc_get_area_membind() are supported; they are.


Here is a snippet of my code, which runs through without any error. 
But the hwloc_get_area_membind() always returns that all memory is 
bound to PU 0, when I think it should be bound to different PUs. Am I 
missing something?




Can you print memory binding like below instead of printing only the 
first PU in the set returned by get_area_membind?


char *s;
hwloc_bitmap_asprintf(&s, set);
/* s is now a C string of the bitmap, use it in your std::cout line */

And send the output of lstopo on your machine so that I can understand it.

Or you could print the smallest object that contains the binding by 
calling hwloc_get_obj_covering_cpuset(topology, set). It returns an 
object whose type may be printed as a C-string with 
hwloc_obj_type_string(obj->type).


You may also do the same before set_area_membind() if you want to verify 
that you're binding where you really want.





T* allocate(size_t n, hwloc_topology_t topology, int rank)
{
  // allocate memory
  T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
  // elements per thread
  size_t ept = 1024;
  hwloc_bitmap_t set;
  size_t offset = 0;
  size_t threadcount = 4;

  set = hwloc_bitmap_alloc();
  if(!set) {
    fprintf(stderr, "failed to allocate a bitmap\n");
  }
  // bind memory to every thread
  for(size_t i = 0; i < threadcount; i++)
  {
    // logical index of where to bind the memory
    auto logid = (i + rank * threadcount) * 2;
    auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
    hwloc_bitmap_only(set, logobj->os_index);
    // set the memory binding
    // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch
    // the memory first to allocate it
    auto err = hwloc_set_area_membind(topology, t + offset, sizeof(T) * ept,
        set, HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);

    if(err < 0)
      std::cout << "Error: memory binding failed" << std::endl;
    std::cout << "Rank=" << rank << " Tid=" << i << " on PU logical index="