Re: [hwloc-users] Problems with binding memory
> > Ok then your mask 0x,0x,,,0x,0x corresponds exactly to NUMA node 0
> > (socket 0). Object cpusets can be displayed on the command-line with
> > "lstopo --cpuset" or "hwloc-calc numa:0".
> >
> > This would be OK if you're only spawning threads to the first socket.
> > Do you see the same mask for threads on the other socket?

Yes, I do.

Mike

On Wed, Mar 2, 2022 at 09:53, Brice Goglin <brice.gog...@inria.fr> wrote:

> On 02/03/2022 at 09:39, Mike wrote:
>
> Hello,
>
>> Please run "lstopo -.synthetic" to compress the output a lot. I will be
>> able to reuse it from here and understand your binding mask.
>
> Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
> L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768)
> Core:1 PU:2(indexes=2*128:1*2)
>
> Ok then your mask 0x,0x,,,0x,0x corresponds exactly to NUMA node 0
> (socket 0). Object cpusets can be displayed on the command-line with
> "lstopo --cpuset" or "hwloc-calc numa:0".
>
> This would be OK if you're only spawning threads to the first socket. Do
> you see the same mask for threads on the other socket?
>
> Brice

___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
Re: [hwloc-users] Problems with binding memory
Hello,

> Please run "lstopo -.synthetic" to compress the output a lot. I will be
> able to reuse it from here and understand your binding mask.

Package:2 [NUMANode(memory=270369247232)] L3Cache:8(size=33554432)
L2Cache:8(size=524288) L1dCache:1(size=32768) L1iCache:1(size=32768)
Core:1 PU:2(indexes=2*128:1*2)

Mike

On Tue, Mar 1, 2022 at 19:05, Brice Goglin <brice.gog...@inria.fr> wrote:

> On 01/03/2022 at 17:34, Mike wrote:
>
> Hello,
>
>> Usually you would rather allocate and bind at the same time so that the
>> memory doesn't need to be migrated when bound. However, if you do not
>> touch the memory after allocation, pages are not actually physically
>> allocated, hence there's nothing to migrate. Might work but keep this
>> in mind.
>
> I need all the data in one allocation, so that is why I opted to
> allocate and then bind via the area function. The way I understand it is
> that by using the memory binding policy HWLOC_MEMBIND_BIND with
> hwloc_set_area_membind() the pages will actually get allocated on the
> specified cores. If that is not the case I suppose the best solution
> would be to just touch the allocated data with my threads.
>
> set_area_membind() doesn't allocate pages, but it tells the operating
> system "whenever you allocate them, do it on that NUMA node". Anyway,
> what you're doing makes sense.
>
>> Can you print memory binding like below instead of printing only the
>> first PU in the set returned by get_area_membind?
>>
>> char *s;
>> hwloc_bitmap_asprintf(&s, set);
>> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> I tried that and now get_area_membind returns that all memory is bound
> to 0x,0x,,,0x,0x
>
> Please run "lstopo -.synthetic" to compress the output a lot. I will be
> able to reuse it from here and understand your binding mask.
>
> Brice
Re: [hwloc-users] Problems with binding memory
Hello,

> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not
> touch the memory after allocation, pages are not actually physically
> allocated, hence there's nothing to migrate. Might work but keep this in
> mind.

I need all the data in one allocation, so that is why I opted to allocate
and then bind via the area function. The way I understand it is that by
using the memory binding policy HWLOC_MEMBIND_BIND with
hwloc_set_area_membind() the pages will actually get allocated on the
specified cores. If that is not the case I suppose the best solution would
be to just touch the allocated data with my threads.

> Can you print memory binding like below instead of printing only the
> first PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */

I tried that and now get_area_membind returns that all memory is bound to
0x,0x,,,0x,0x

> People often do the contrary. They bind threads, and then they have
> threads allocate/touch memory so that buffers are physically allocated
> near the related threads (automatic by default). It works well when the
> number of threads is known in advance. You place one thread per core,
> they never move. As long as memory is big enough to store the data
> nearby, everybody's happy. If the number of threads varies at runtime,
> and/or if they need to move, things become more difficult.
>
> Your approach is also correct. In the end, it's rather a question of
> whether your code is data-centric or compute-centric, and whether
> imbalances may require to move things during the execution. Moving
> threads is usually cheaper. But oversubscribing cores with multiple
> threads is usually a bad idea, that's likely why people place one thread
> per core first.

My code is rather data-bound and my main motivation for binding the
threads is because I did not want hyperthreading on cores and because I
want to keep all threads that operate on the same data in one L3 cache.

> And send the output of lstopo on your machine so that I can understand
> it.

The machine has two sockets and on each socket are 64 cores. Cores 0-7
share one L3 cache, so do cores 8-15 and so on. The output of lstopo is
quite large, but if my description does not suffice I can send it.

Thanks for your time
Mike

On Tue, Mar 1, 2022 at 15:42, Brice Goglin <brice.gog...@inria.fr> wrote:

> On 01/03/2022 at 15:17, Mike wrote:
>
> Dear list,
>
> I have a program that utilizes Open MPI + multithreading and I want the
> freedom to decide on which hardware cores my threads should run. By
> using hwloc_set_cpubind() that already works, so now I also want to bind
> memory to the hardware cores. But I just can't get it to work.
>
> Basically, I wrote the memory binding into my allocator, so the memory
> will be allocated and then bound.
>
> Hello
>
> Usually you would rather allocate and bind at the same time so that the
> memory doesn't need to be migrated when bound. However, if you do not
> touch the memory after allocation, pages are not actually physically
> allocated, hence there's nothing to migrate. Might work but keep this in
> mind.
>
> I use hwloc 2.4.1, run the code on a Linux system and I did check with
> "hwloc-info --support" if hwloc_set_area_membind() and
> hwloc_get_area_membind() are supported and they are.
>
> Here is a snippet of my code, which runs through without any error. But
> the hwloc_get_area_membind() always returns that all memory is bound to
> PU 0, when I think it should be bound to different PUs. Am I missing
> something?
>
> Can you print memory binding like below instead of printing only the
> first PU in the set returned by get_area_membind?
>
> char *s;
> hwloc_bitmap_asprintf(&s, set);
> /* s is now a C string of the bitmap, use it in your std::cout line */
>
> T* allocate(size_t n, hwloc_topology_t topology, int rank)
> {
>   // allocate memory
>   T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
>   // elements per thread
>   size_t ept = 1024;
>   hwloc_bitmap_t set;
>   size_t offset = 0;
>   size_t threadcount = 4;
>
>   set = hwloc_bitmap_alloc();
>   if(!set) {
>     fprintf(stderr, "failed to allocate a bitmap\n");
>   }
>   // bind memory to every thread
>   for(size_t i = 0; i < threadcount; i++)
>   {
>     // logical index of where to bind the memory
>     auto logid = (i + rank * threadcount) * 2;
>     auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
Re: [hwloc-users] Problems with binding memory
On 01/03/2022 at 15:17, Mike wrote:

> Dear list,
>
> I have a program that utilizes Open MPI + multithreading and I want the
> freedom to decide on which hardware cores my threads should run. By
> using hwloc_set_cpubind() that already works, so now I also want to bind
> memory to the hardware cores. But I just can't get it to work.
>
> Basically, I wrote the memory binding into my allocator, so the memory
> will be allocated and then bound.

Hello

Usually you would rather allocate and bind at the same time so that the
memory doesn't need to be migrated when bound. However, if you do not
touch the memory after allocation, pages are not actually physically
allocated, hence there's nothing to migrate. Might work but keep this in
mind.

> I use hwloc 2.4.1, run the code on a Linux system and I did check with
> "hwloc-info --support" if hwloc_set_area_membind() and
> hwloc_get_area_membind() are supported and they are.
>
> Here is a snippet of my code, which runs through without any error. But
> the hwloc_get_area_membind() always returns that all memory is bound to
> PU 0, when I think it should be bound to different PUs. Am I missing
> something?

Can you print memory binding like below instead of printing only the
first PU in the set returned by get_area_membind?

char *s;
hwloc_bitmap_asprintf(&s, set);
/* s is now a C string of the bitmap, use it in your std::cout line */

And send the output of lstopo on your machine so that I can understand it.

Or you could print the smallest object that contains the binding by
calling hwloc_get_obj_covering_cpuset(topology, set). It returns an
object whose type may be printed as a C string with
hwloc_obj_type_string(obj->type).

You may also do the same before set_area_membind() if you want to verify
that you're binding where you really want.

> T* allocate(size_t n, hwloc_topology_t topology, int rank)
> {
>   // allocate memory
>   T* t = (T*)hwloc_alloc(topology, sizeof(T) * n);
>   // elements per thread
>   size_t ept = 1024;
>   hwloc_bitmap_t set;
>   size_t offset = 0;
>   size_t threadcount = 4;
>
>   set = hwloc_bitmap_alloc();
>   if(!set) {
>     fprintf(stderr, "failed to allocate a bitmap\n");
>   }
>   // bind memory to every thread
>   for(size_t i = 0; i < threadcount; i++)
>   {
>     // logical index of where to bind the memory
>     auto logid = (i + rank * threadcount) * 2;
>     auto logobj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, logid);
>     hwloc_bitmap_only(set, logobj->os_index);
>     // set the memory binding
>     // I use HWLOC_MEMBIND_BIND as policy so I do not have to touch the
>     // memory first to allocate it
>     auto err = hwloc_set_area_membind(topology, t + offset,
>                                       sizeof(T) * ept, set,
>                                       HWLOC_MEMBIND_BIND,
>                                       HWLOC_MEMBIND_STRICT | HWLOC_MEMBIND_THREAD);
>     if(err < 0)
>       std::cout << "Error: memory binding failed" << std::endl;
>     std::cout << "Rank=" << rank << " Tid=" << i
>               << " on PU logical index=" << logid << std::endl;