The answer is "no", I don't have root access, but I suspect that that would be the right fix if it is currently set to [always] and either madvise or never would be good options. If it is of interest, I'll ask someone to try it and report back on what happens.
-----Original Message-----
From: Brice Goglin <brice.gog...@inria.fr>
Sent: 29 January 2019 15:39
To: Biddiscombe, John A. <biddi...@cscs.ch>; Hardware locality user list <hwloc-users@lists.open-mpi.org>
Subject: Re: [hwloc-users] unusual memory binding results

Only the one in brackets is set, the others are unset alternatives. If you write "madvise" in that file, it'll become "always [madvise] never".

Brice

On 29/01/2019 at 15:36, Biddiscombe, John A. wrote:
> On the 8 numa node machine
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
>
> is set already, so I'm not really sure what should go in there to disable it.
>
> JB
>
> -----Original Message-----
> From: Brice Goglin <brice.gog...@inria.fr>
> Sent: 29 January 2019 15:29
> To: Biddiscombe, John A. <biddi...@cscs.ch>; Hardware locality user
> list <hwloc-users@lists.open-mpi.org>
> Subject: Re: [hwloc-users] unusual memory binding results
>
> Oh, that's very good to know. I guess lots of people using first touch will
> be affected by this issue. We may want to add a hwloc memory flag doing
> something similar.
>
> Do you have root access to verify that writing "never" or "madvise" in
> /sys/kernel/mm/transparent_hugepage/enabled fixes the issue too?
>
> Brice
>
>
>
> On 29/01/2019 at 14:02, Biddiscombe, John A. wrote:
>> Brice
>>
>> madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE)
>>
>> seems to make things behave much more sensibly. I had no idea it was a
>> thing, but one of my colleagues pointed me to it.
>>
>> Problem seems to be solved for now. Thank you very much for your insights
>> and suggestions/help.
>>
>> JB
>>
>> -----Original Message-----
>> From: Brice Goglin <brice.gog...@inria.fr>
>> Sent: 29 January 2019 10:35
>> To: Biddiscombe, John A. <biddi...@cscs.ch>; Hardware locality user
>> list <hwloc-users@lists.open-mpi.org>
>> Subject: Re: [hwloc-users] unusual memory binding results
>>
>> Crazy idea: 512 pages could be replaced with a single 2MB huge page.
>> You're not requesting huge pages in your allocation but some systems
>> have transparent huge pages enabled by default (e.g. RHEL
>> https://access.redhat.com/solutions/46111)
>>
>> This could explain why 512 pages get allocated on the same node, but it
>> wouldn't explain the crazy patterns you've seen in the past.
>>
>> Brice
>>
>>
>>
>>
>> On 29/01/2019 at 10:23, Biddiscombe, John A. wrote:
>>> I simplified things and instead of writing to a 2D array, I allocate a 1D
>>> array of bytes and touch pages in a linear fashion.
>>> Then I call syscall(NR_move_pages, ...) and retrieve a status array for
>>> each page in the data.
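For reference, a rough sketch of the allocate / no-huge-page / first-touch / query sequence described in the quoted text. It is illustrative only, not the actual test code: the real test touches alternate pages from threads bound to different NUMA nodes, while here everything is touched from one thread, and the buffer size is just the 2MB case mentioned above.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define NPAGES 512   /* 512 x 4kB pages = 2MB, the size at which THP kicked in */

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = NULL;
    void *pages[NPAGES];
    int status[NPAGES];

    if (posix_memalign((void **)&buf, page, NPAGES * page))
        return 1;

    /* the workaround from this thread: opt the range out of THP before first touch */
    if (madvise(buf, NPAGES * page, MADV_NOHUGEPAGE))
        perror("madvise(MADV_NOHUGEPAGE)");

    /* first-touch one byte per page; the real test does this from threads
       bound to different NUMA nodes so the pages should interleave */
    for (size_t i = 0; i < NPAGES; i++) {
        buf[i * page] = 1;
        pages[i] = buf + i * page;
    }

    /* nodes == NULL turns move_pages into a pure "which node is each page on" query */
    if (syscall(SYS_move_pages, 0, (unsigned long)NPAGES, pages, NULL, status, 0) == 0) {
        for (size_t i = 0; i < NPAGES; i++)
            printf("%d ", status[i]);
        printf("\n");
    } else {
        perror("move_pages");
    }

    free(buf);
    return 0;
}

It only needs gcc and a Linux kernel with move_pages; no libnuma is required since the status query goes through syscall(), which is the same call the quoted test uses.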
>>> >>> When I allocate 511 pages and touch alternate pages on alternate >>> numa nodes >>> >>> Numa page binding 511 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 >>> 0 >>> 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> 1 >>> 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 >>> >>> but as soon as I increase to 512 pages, it breaks. >>> >>> Numa page binding 512 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> >>> On the 8 numa node machine it sometimes gives the right answer even with >>> 512 pages. >>> >>> Still baffled >>> >>> JB >>> >>> -----Original Message----- >>> From: hwloc-users <hwloc-users-boun...@lists.open-mpi.org> On Behalf Of >>> Biddiscombe, John A. >>> Sent: 28 January 2019 16:14 >>> To: Brice Goglin <brice.gog...@inria.fr> >>> Cc: Hardware locality user list <hwloc-users@lists.open-mpi.org> >>> Subject: Re: [hwloc-users] unusual memory binding results >>> >>> Brice >>> >>>> Can you print the pattern before and after thread 1 touched its pages, or >>>> even in the middle ? >>>> It looks like somebody is touching too many pages here. >>> Experimenting with different threads touching one or more pages, I >>> get unpredicatable results >>> >>> here on the 8 numa node device, the result is perfect. 
>>> I am only allowing threads 3 and 7 to write a single memory location
>>>
>>> get_numa_domain() 8 Domain Numa pattern
>>> --------
>>> --------
>>> --------
>>> 3-------
>>> --------
>>> --------
>>> --------
>>> 7-------
>>> ============================
>>>
>>> ============================
>>> Contents of memory locations
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 26 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 63 0 0 0 0 0 0 0
>>> ============================
>>>
>>> you can see that core 26 (numa domain 3) wrote to memory, and so did
>>> core 63 (numa domain 7)
>>>
>>> Now I run it a second time and look, it's rubbish
>>>
>>> get_numa_domain() 8 Domain Numa pattern
>>> 3-------
>>> 3-------
>>> 3-------
>>> 3-------
>>> 3-------
>>> 3-------
>>> 3-------
>>> 3-------
>>> ============================
>>>
>>> ============================
>>> Contents of memory locations
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 26 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 0 0 0 0 0 0 0 0
>>> 63 0 0 0 0 0 0 0
>>> ============================
>>>
>>> after allowing the data to be read by a random thread
>>>
>>> 37777777
>>> 37777777
>>> 37777777
>>> 37777777
>>> 37777777
>>> 37777777
>>> 37777777
>>> 37777777
>>>
>>> I'm baffled.
>>>
>>> JB
>>>
>>> _______________________________________________
>>> hwloc-users mailing list
>>> hwloc-users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users