Re: [hwloc-users] unusual memory binding results
The answer is "no", I don't have root access, but I suspect that would be the right fix: the setting is currently [always], and either madvise or never would be a good option. If it is of interest, I'll ask someone to try it and report back on what happens.

-----Original Message-----
From: Brice Goglin
Sent: 29 January 2019 15:39
To: Biddiscombe, John A.; Hardware locality user list
Subject: Re: [hwloc-users] unusual memory binding results

Only the one in brackets is set, others are unset alternatives. If you write "madvise" in that file, it'll become "always [madvise] never".

Brice
Re: [hwloc-users] unusual memory binding results
Only the one in brackets is set, others are unset alternatives. If you write "madvise" in that file, it'll become "always [madvise] never".

Brice


On 29/01/2019 at 15:36, Biddiscombe, John A. wrote:
> On the 8 numa node machine
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
>
> is set already, so I'm not really sure what should go in there to disable it.
>
> JB
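The bracket convention described above can be checked from a program as well as with cat. The following is a minimal sketch, not part of hwloc: the helper names thp_active_mode and thp_read_active_mode are made up for illustration, and the only thing taken from the thread is the sysfs path and the "[always] madvise never" line format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) token of a sysfs line such as
 * "[always] madvise never" into out. Returns 0 on success, -1 if no
 * bracketed token is found or out is too small. Hypothetical helper. */
int thp_active_mode(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);
    const char *open = strchr(line, '[');
    if (open == NULL)
        return -1;
    const char *close = strchr(open, ']');
    if (close == NULL)
        return -1;
    size_t n = (size_t)(close - open) - 1;   /* characters between [ and ] */
    if (n + 1 > outlen)
        return -1;
    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the sysfs file quoted in the thread and report the active mode.
 * Returns -1 if the file cannot be read, for example on a kernel built
 * without transparent huge page support. Hypothetical helper. */
int thp_read_active_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok != NULL ? thp_active_mode(line, out, outlen) : -1;
}
```

Reading the file needs no special privileges; changing the mode means writing one of the three words to the same file, which normally requires root.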
Re: [hwloc-users] unusual memory binding results
On the 8 numa node machine

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

is set already, so I'm not really sure what should go in there to disable it.

JB

-----Original Message-----
From: Brice Goglin
Sent: 29 January 2019 15:29
To: Biddiscombe, John A.; Hardware locality user list
Subject: Re: [hwloc-users] unusual memory binding results

Oh, that's very good to know. I guess lots of people using first touch will be affected by this issue. We may want to add a hwloc memory flag doing something similar.

Do you have root access to verify that writing "never" or "madvise" in /sys/kernel/mm/transparent_hugepage/enabled fixes the issue too?

Brice
Re: [hwloc-users] unusual memory binding results
Oh, that's very good to know. I guess lots of people using first touch will be affected by this issue. We may want to add a hwloc memory flag doing something similar.

Do you have root access to verify that writing "never" or "madvise" in /sys/kernel/mm/transparent_hugepage/enabled fixes the issue too?

Brice


On 29/01/2019 at 14:02, Biddiscombe, John A. wrote:
> Brice
>
> madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE)
>
> seems to make things behave much more sensibly. I had no idea it was a
> thing, but one of my colleagues pointed me to it.
>
> Problem seems to be solved for now. Thank you very much for your insights
> and suggestions/help.
>
> JB
Re: [hwloc-users] unusual memory binding results
Brice

madvise(addr, n * sizeof(T), MADV_NOHUGEPAGE)

seems to make things behave much more sensibly. I had no idea it was a thing, but one of my colleagues pointed me to it.

Problem seems to be solved for now. Thank you very much for your insights and suggestions/help.

JB

-----Original Message-----
From: Brice Goglin
Sent: 29 January 2019 10:35
To: Biddiscombe, John A.; Hardware locality user list
Subject: Re: [hwloc-users] unusual memory binding results

Crazy idea: 512 pages could be replaced with a single 2MB huge page. You're not requesting huge pages in your allocation but some systems have transparent huge pages enabled by default (e.g. RHEL https://access.redhat.com/solutions/46111)

This could explain why 512 pages get allocated on the same node, but it wouldn't explain crazy patterns you've seen in the past.

Brice
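For readers who want to try the madvise() hint from this message, here is a minimal sketch of how it might be applied to a fresh allocation. The function name alloc_without_thp and the choice of mmap() as the allocator are assumptions made for illustration; the thread only shows the madvise() call itself. madvise() needs a page-aligned address, which mmap() guarantees.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for MAP_ANONYMOUS and MADV_NOHUGEPAGE */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `bytes` of anonymous memory and ask the kernel not to back it with
 * transparent huge pages. Returns NULL if the mapping fails.
 * *advice_rc receives madvise()'s return value: 0 on success, -1 if the
 * kernel rejected the hint (for example when THP support is not built in).
 * Hypothetical helper, not an hwloc function. */
void *alloc_without_thp(size_t bytes, int *advice_rc)
{
    assert(bytes > 0 && advice_rc != NULL);
    void *addr = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return NULL;
    /* Issued before any page is touched, so the first-touch placement by
     * the worker threads happens with the hint already in place. */
    *advice_rc = madvise(addr, bytes, MADV_NOHUGEPAGE);
    return addr;
}
```

The caller releases the buffer with munmap(addr, bytes). A failed hint is reported rather than treated as fatal, since the buffer is still usable without it.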
Re: [hwloc-users] unusual memory binding results
I wondered something similar. The crazy patterns usually happen on columns of the 2D matrix and, as it is column major, it does loosely fit the idea (most of the time).

I will play some more (though I'm fed up with it now).

JB

-----Original Message-----
From: Brice Goglin
Sent: 29 January 2019 10:35
To: Biddiscombe, John A.; Hardware locality user list
Subject: Re: [hwloc-users] unusual memory binding results

Crazy idea: 512 pages could be replaced with a single 2MB huge page. You're not requesting huge pages in your allocation but some systems have transparent huge pages enabled by default (e.g. RHEL https://access.redhat.com/solutions/46111)

This could explain why 512 pages get allocated on the same node, but it wouldn't explain crazy patterns you've seen in the past.

Brice
Re: [hwloc-users] unusual memory binding results
Crazy idea: 512 pages could be replaced with a single 2MB huge page. You're not requesting huge pages in your allocation but some systems have transparent huge pages enabled by default (e.g. RHEL https://access.redhat.com/solutions/46111)

This could explain why 512 pages get allocated on the same node, but it wouldn't explain crazy patterns you've seen in the past.

Brice


On 29/01/2019 at 10:23, Biddiscombe, John A. wrote:
> I simplified things and instead of writing to a 2D array, I allocate a 1D
> array of bytes and touch pages in a linear fashion.
> Then I call syscall(__NR_move_pages, ...) and retrieve a status array for each
> page in the data.
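The 511 versus 512 threshold follows from simple arithmetic if one assumes 4 KiB base pages and 2 MiB huge pages (typical for x86-64, but an assumption here): 2 MiB / 4 KiB = 512. The sketch below counts how many huge-page-sized, huge-page-aligned regions fit inside a buffer. The function name thp_eligible_regions is hypothetical, and the real kernel decision depends on more than this (the mapping's alignment and the THP mode among other things), so treat it as an illustration of the arithmetic only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the regions of size `huge` that are aligned to `huge` and lie
 * entirely inside [start, start + len). Only such a region could be
 * backed by a single transparent huge page, which then lives on one
 * NUMA node regardless of which thread touches which 4 KiB piece.
 * Hypothetical helper for illustration. */
size_t thp_eligible_regions(uintptr_t start, size_t len, size_t huge)
{
    assert(huge > 0);
    uintptr_t end = start + len;
    uintptr_t first = (start + huge - 1) / huge * huge;  /* round start up */
    if (first >= end)
        return 0;
    return (size_t)((end - first) / huge);
}
```

With a 2 MiB aligned start address, 511 pages of 4 KiB give zero eligible regions and 512 pages give exactly one, matching the point at which the alternating pattern collapses onto a single node. A buffer of 512 pages that starts one page past a 2 MiB boundary gives zero again, which may be one reason the result is not always reproducible.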
Re: [hwloc-users] unusual memory binding results
I simplified things and instead of writing to a 2D array, I allocate a 1D array of bytes and touch pages in a linear fashion.
Then I call syscall(__NR_move_pages, ...) and retrieve a status array for each page in the data.

When I allocate 511 pages and touch alternate pages on alternate numa nodes

Numa page binding 511
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

but as soon as I increase to 512 pages, it breaks.
Numa page binding 512 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 On the 8 numa node machine it sometimes gives the right answer even with 512 pages. Still baffled JB -Original Message- From: hwloc-users On Behalf Of Biddiscombe, John A. Sent: 28 January 2019 16:14 To: Brice Goglin Cc: Hardware locality user list Subject: Re: [hwloc-users] unusual memory binding results Brice >Can you print the pattern before and after thread 1 touched its pages, or even >in the middle ? >It looks like somebody is touching too many pages here. Experimenting with different threads touching one or more pages, I get unpredicatable results here on the 8 numa node device, the result is perfect. 
I am only allowing thread 3 and 7 to write a single memory location get_numa_domain() 8 Domain Numa pattern 3--- 7--- Contents of memory locations 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 63 0 0 0 0 0 0 0 you can see that core 26 (numa domain 3) wrote to memory, and so did core 63 (domain 8) Now I run it a second time and look, its rubbish get_numa_domain() 8 Domain Numa pattern 3--- 3--- 3--- 3--- 3--- 3--- 3--- 3--- Contents of memory locations 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 63 0 0 0 0 0 0 0 after allowing the data to be read by a random thread 3777 3777 3777 3777 3777 3777 3777 3777 I'm baffled. JB ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/hwloc-users
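A note on the 511/512 threshold, following Brice's later suggestion in this thread that 512 pages could be replaced by a single 2MB huge page: 512 pages of 4096 bytes is exactly 2 MiB, while 511 pages falls just short. The sketch below only illustrates that arithmetic and the madvise(..., MADV_NOHUGEPAGE) call that was later reported to help. The helper names, the 4096-byte page size and the 2 MiB huge page size are assumptions for illustration, not code from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sys/mman.h>

// True when a buffer of num_pages base pages is large enough to contain one
// whole huge page. Alignment is ignored here, which is a simplification.
bool can_hold_huge_page(std::size_t num_pages,
                        std::size_t page_size = 4096,
                        std::size_t huge_page_size = 2 * 1024 * 1024)
{
    return num_pages * page_size >= huge_page_size;
}

// Hypothetical helper: allocate page-aligned memory and ask the kernel not to
// back it with transparent huge pages. Returns nullptr if allocation fails.
void* alloc_without_thp(std::size_t bytes, std::size_t page_size = 4096)
{
    void* addr = nullptr;
    if (posix_memalign(&addr, page_size, bytes) != 0) {
        return nullptr;
    }
#ifdef MADV_NOHUGEPAGE
    // The request is advisory, so the return value is deliberately ignored:
    // it can fail, for example on kernels built without transparent huge pages.
    (void) madvise(addr, bytes, MADV_NOHUGEPAGE);
#endif
    return addr;
}
```

With these defaults, can_hold_huge_page(511) is false and can_hold_huge_page(512) is true. Whether a 512-page buffer really ends up on one huge page presumably also depends on how it is aligned, which might explain why the result varies between runs.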
Re: [hwloc-users] unusual memory binding results
Brice

>Can you print the pattern before and after thread 1 touched its pages, or even
>in the middle ?
>It looks like somebody is touching too many pages here.

Experimenting with different threads touching one or more pages, I get unpredictable results.

Here on the 8 numa node device, the result is perfect. I am only allowing threads 3 and 7 to write a single memory location.

get_numa_domain() 8 Domain Numa pattern
3---
7---

Contents of memory locations
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0

You can see that core 26 (numa domain 3) wrote to memory, and so did core 63 (numa domain 7).

Now I run it a second time and look, it's rubbish.

get_numa_domain() 8 Domain Numa pattern
3---
3---
3---
3---
3---
3---
3---
3---

Contents of memory locations
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0

after allowing the data to be read by a random thread

3777
3777
3777
3777
3777
3777
3777
3777

I'm baffled.

JB
Re: [hwloc-users] unusual memory binding results
On 28/01/2019 at 11:28, Biddiscombe, John A. wrote:
> If I disable thread 0 and allow thread 1 then I get this pattern on 1 machine
> (clearly wrong)

Can you print the pattern before and after thread 1 touched its pages, or even in the middle? It looks like somebody is touching too many pages here.

Brice
Re: [hwloc-users] unusual memory binding results
If I disable thread 0 and allow thread 1 then I get this pattern on 1 machine (clearly wrong)

and on another I get

-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-

which is correct because the '-' is a negative status. I will run again and see if it's -14 or -2.

JB
Re: [hwloc-users] unusual memory binding results
Can you try again disabling the touching in one thread to check whether the other thread only touched its own pages? (others' status should be -2 (ENOENT))

Recent kernels have ways to migrate memory at runtime (CONFIG_NUMA_BALANCING) but this should only occur when it detects that some thread does a lot of remote access, which shouldn't be the case here, at least at the beginning of the program.

Brice
Re: [hwloc-users] unusual memory binding results
Brice

I might have been using the wrong params to hwloc_get_area_memlocation in my original version, but I bypassed it and have been calling

    int get_numa_domain(void *page)
    {
        HPX_ASSERT( (std::size_t(page) & 4095) ==0 );

        void *pages[1] = { page };
        int status[1] = { -1 };
        if (syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0) {
            if (status[0]>=0 && status[0]<=HPX_HAVE_MAX_NUMA_DOMAIN_COUNT) {
                return status[0];
            }
            return -1;
        }
        throw std::runtime_error("Failed to get numa node for page");
    }

this function instead. Just testing one page address at a time. I still see this kind of pattern

00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101
00101101010010101001010101011010011011010101110101110111010101010101

when I should see

01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010
01010101010101010101010101010101010101010101010101010101010101010101010101010101
10101010101010101010101010101010101010101010101010101010101010101010101010101010

I am deeply troubled by this and can't think of what to try next since I can see the memory contents hold the correct CPU ID of the thread that touched the memory, so either the syscall is wrong, or the kernel is doing something else. I welcome any suggestions on what might be wrong.

Thanks for trying to help.

JB
Re: [hwloc-users] unusual memory binding results
On 25/01/2019 at 23:16, Biddiscombe, John A. wrote:
>> move_pages() returning 0 with -14 in the status array? As opposed to
>> move_pages() returning -1 with errno set to 14, which would definitely be a
>> bug in hwloc.
> I think it was move_pages returning zero with -14 in the status array, and
> then hwloc returning 0 with an empty nodeset (which I then messed up by
> calling get bitmap first and assuming 0 meant numa node zero and not checking
> for an empty nodeset).
>
> I'm not sure why I get -EFAULT status rather than -ENOENT, but that's what I'm
> seeing in the status field when I pass the pointer returned from the
> alloc_membind call.

The only reason I see for getting -EFAULT there would be that you pass the buffer to move_pages (what hwloc_get_area_memlocation() wants, a start pointer and length) instead of a pointer to an array of page addresses (move_pages wants a void** pointing to individual pages).

Brice
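To make the distinction between the two calling conventions concrete, here is a minimal sketch of building the array of per-page addresses that move_pages expects and querying a whole buffer in one call. The names page_addresses and query_page_nodes are hypothetical and not part of hwloc or HPX. The sketch assumes Linux, a page-aligned base pointer, and that passing a null nodes array only reports each page's current location.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <sys/syscall.h>
#include <unistd.h>

// Build one address per page: this array, not the buffer itself, is what
// move_pages() takes as its "pages" argument.
std::vector<void*> page_addresses(void* base, std::size_t num_pages,
                                  std::size_t page_size)
{
    std::vector<void*> pages(num_pages);
    char* p = static_cast<char*>(base);
    for (std::size_t i = 0; i < num_pages; ++i) {
        pages[i] = p + i * page_size;
    }
    return pages;
}

// Query (without moving) the node of every page in the buffer.
// Returns false if the syscall itself fails. On success, status[i] holds a
// node number (>= 0) or a negative errno value for page i.
bool query_page_nodes(void* base, std::size_t num_pages,
                      std::size_t page_size, std::vector<int>& status)
{
    std::vector<void*> pages = page_addresses(base, num_pages, page_size);
    status.assign(num_pages, -1);
    // pid 0 means the calling process; a null "nodes" array requests a query.
    long rc = syscall(SYS_move_pages, 0L, num_pages, pages.data(),
                      nullptr, status.data(), 0L);
    return rc == 0;
}
```

Passing the buffer pointer itself where the pages array is expected would make the kernel read the buffer contents as addresses, which is one way to end up with -EFAULT entries in the status array.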
Re: [hwloc-users] unusual memory binding results
> move_pages() returning 0 with -14 in the status array? As opposed to
> move_pages() returning -1 with errno set to 14, which would definitely be a
> bug in hwloc.

I think it was move_pages returning zero with -14 in the status array, and then hwloc returning 0 with an empty nodeset (which I then messed up by calling get bitmap first, assuming 0 meant numa node zero, and not checking for an empty nodeset).

I'm not sure why I get -EFAULT status rather than -ENOENT, but that's what I'm seeing in the status field when I pass the pointer returned from the alloc_membind call.

JB
Re: [hwloc-users] unusual memory binding results
On 25/01/2019 at 14:17, Biddiscombe, John A. wrote:
> Dear List/Brice
>
> I experimented with disabling the memory touch on threads except for
> N=1,2,3,4 etc and found a problem in hwloc, which is that the function
> hwloc_get_area_memlocation was returning '0' when the status of the memory
> null move operation was -14 (#define EFAULT 14 /* Bad address */). This was
> when I call get area memlocation immediately after allocating and then 'not'
> touching. I think if the status is an error, then the function should
> probably return -1, but anyway. I'll file a bug and send a patch if this is
> considered to be a bug.

Just to be sure, are you talking about move_pages() returning 0 with -14 in the status array? As opposed to move_pages() returning -1 with errno set to 14, which would definitely be a bug in hwloc.

When the page is valid but not allocated yet, move_pages() is supposed to return status = -ENOENT. This case is not an error, so returning 0 with an empty nodeset looks fine to me (pages are not allocated, hence they are allocated on an empty set of nodes).

-EFAULT means that the page is invalid (you'd get a segfault if you touch it). I am not sure what we should return in that case. It's also true that pages are allocated nowhere :)

Anyway, if you get -EFAULT in status, it should mean that an invalid address or an invalid length was passed to hwloc_get_area_memlocation().

Brice
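The three cases described above (a node number, -ENOENT for a valid but untouched page, -EFAULT for an invalid address) can be summarised in a small helper. This is only an illustrative sketch: describe_page_status is a hypothetical name, and the wording of the returned strings is an assumption.

```cpp
#include <cassert>
#include <cerrno>
#include <string>

// Translate one entry of the move_pages() status array into words.
std::string describe_page_status(int status)
{
    if (status >= 0) {
        return "on node " + std::to_string(status);
    }
    if (status == -ENOENT) {
        return "valid address, page not allocated yet";
    }
    if (status == -EFAULT) {
        return "invalid address";
    }
    return "other error " + std::to_string(-status);
}
```

On Linux, ENOENT is 2 and EFAULT is 14, so the -2 and -14 values mentioned in the thread map to the second and third cases.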
Re: [hwloc-users] unusual memory binding results
Dear List/Brice

I experimented with disabling the memory touch on threads except for N=1,2,3,4 etc and found a problem in hwloc, which is that the function hwloc_get_area_memlocation was returning '0' when the status of the memory null move operation was -14 (#define EFAULT 14 /* Bad address */). This was when I call get area memlocation immediately after allocating and then 'not' touching. I think if the status is an error, then the function should probably return -1, but anyway. I'll file a bug and send a patch if this is considered to be a bug.

I then modified the test routine to write the value returned from sched_getcpu into the touched memory location to verify that the thread binding was doing the right thing. The output below from the AMD 8 numanode machine looks good, with threads 0,8,16 etc each touching memory which follows the pattern expected from the 8 numanode test. My get_numa_domain function, however, does not reflect the right numanode. It looks correct for the first column (matrices are stored in column major order), but after that it falls to pieces. In this test, I'm allocating tiles as 512x512 doubles, so 4096 bytes per tile column, giving one tile column per page and 512 pages per tile. All the memory locations check out and the patterns seem fine, but the call to

    // edited version of the one in hwloc source
    syscall(__NR_move_pages, 0, 1, pages, nullptr, status, 0) == 0)

is not returning the numanode that I expect to see from the first touch when it is enabled.

Either the syscall is wrong, or the first touch/nexttouch doesn't work (could the alloc routine be wrong?)

    hwloc_alloc_membind(topo, len, bitmap->get_bmp(),
        (hwloc_membind_policy_t)(policy),
        flags | HWLOC_MEMBIND_BYNODESET);

where the nodeset should match the numanode mask (I will double check that right now).

Any ideas on what to try next?

Thanks

JB

get_numa_domain() 8 Domain Numa pattern
00740640
10740640
20740640
30740640
40740640
50740640
60740640
70740640

Contents of memory locations = sched_getcpu()
0 8 16 24 32 40 48 56
8 16 24 32 40 48 56 0
16 24 32 40 48 56 0 8
24 32 40 48 56 0 8 16
32 40 48 56 0 8 16 24
40 48 56 0 8 16 24 32
48 56 0 8 16 24 32 40
56 0 8 16 24 32 40 48

Expected 8 Domain Numa pattern
01234567
12345670
23456701
34567012
45670123
56701234
67012345
70123456
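The expected 8-domain pattern above appears consistent with assigning tile (row, col) to domain (row + col) mod 8. The sketch below reproduces it under that assumption. The function name and the tiles_per_domain parameter are hypothetical and only meant to illustrate the indexing, not to reproduce the HPX code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Expected domain digits for one row of tiles, assuming the domain of tile
// (row, col) is (row / tiles_per_domain + col / tiles_per_domain) % domains.
// Only meaningful for up to 10 domains, since each domain prints as one digit.
std::string expected_row(std::size_t row, std::size_t cols,
                         std::size_t domains,
                         std::size_t tiles_per_domain = 1)
{
    std::string out;
    for (std::size_t col = 0; col < cols; ++col) {
        std::size_t dom =
            (row / tiles_per_domain + col / tiles_per_domain) % domains;
        out += static_cast<char>('0' + dom);
    }
    return out;
}
```

Calling expected_row(r, 8, 8) for r = 0..7 yields the eight rows listed under "Expected 8 Domain Numa pattern", which gives a reference to compare the get_numa_domain() output against.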
Re: [hwloc-users] unusual memory binding results
>One way to debug would be to disable touching in N-1 threads to check
>that everything is allocated on the right node.

I shall try that. Thanks
Re: [hwloc-users] unusual memory binding results
Brice

Apologies, I didn't explain it very well. I do make sure that if the tile size 256*8 < 4096 (pagesize), then I double the number of tiles per page; I just wanted to keep the explanation simple.

Here are some code snippets to give you the flavour of it.

Initializing the helper struct:

    matrix_numa_binder(std::size_t Ncols, std::size_t Nrows,
                       std::size_t Ntile, std::size_t Ntiles_per_domain,
                       std::size_t Ncolprocs=1, std::size_t Nrowprocs=1,
                       std::string pool_name="default")
      : cols_(Ncols), rows_(Nrows),
        tile_size_(Ntile), tiles_per_domain_(Ntiles_per_domain),
        colprocs_(Ncolprocs), rowprocs_(Nrowprocs)
    {
        using namespace hpx::compute::host;
        binding_helper::pool_name_ = pool_name;
        const int CACHE_LINE_SIZE = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
        const int PAGE_SIZE = sysconf(_SC_PAGE_SIZE);
        const int ALIGNMENT = std::max(PAGE_SIZE,CACHE_LINE_SIZE);
        const int ELEMS_ALIGN = (ALIGNMENT/sizeof(T));
        rows_page_ = ELEMS_ALIGN;
        leading_dim_ = ELEMS_ALIGN*((rows_*sizeof(T) + ALIGNMENT-1)/ALIGNMENT);
        tiles_per_domain_ = std::max(tiles_per_domain_, ELEMS_ALIGN/tile_size_);
    }

Operator called by the allocator, which returns the domain index to bind a page to:

    virtual std::size_t operator ()(
        const T * const base_ptr, const T * const page_ptr,
        const std::size_t pagesize, const std::size_t domains) const override
    {
        std::size_t offset = (page_ptr - base_ptr);
        std::size_t col = (offset / leading_dim_);
        std::size_t row = (offset % leading_dim_);
        std::size_t index = (col / (tile_size_ * tiles_per_domain_));

        if ((tile_size_*tiles_per_domain_*sizeof(T))>=pagesize) {
            index += (row / (tile_size_ * tiles_per_domain_));
        }
        else {
            HPX_ASSERT(0);
        }
        return index % domains;
    }

This function is called by each thread (one per numa domain), and if the domain returned by the page query matches the domain ID of the thread/task, then the first memory location on the page is written to:

    for (size_type i=0; i[...]operator()(p, page_ptr, pagesize, nodesets.size());
        if (dom==numa_domain) {
            // trigger a memory read and rewrite without changing contents
            volatile char* vaddr = (volatile char*) page_ptr;
            *vaddr = T(0); // *vaddr;
        }
        page_ptr += pageN;
    }

All of this has been debugged quite extensively and I can write numbers to memory and read them back, and the patterns always match the domains expected.

This function is called after all data is written, to attempt to verify (and display the patterns above):

    int topology::get_numa_domain(const void *addr) const
    {
    #if HWLOC_API_VERSION >= 0x00010b06
        hpx_hwloc_bitmap_wrapper *nodeset = topology::bitmap_storage_.get();
        if (nullptr == nodeset)
        {
            hwloc_bitmap_t nodeset_ = hwloc_bitmap_alloc();
            topology::bitmap_storage_.reset(new hpx_hwloc_bitmap_wrapper(nodeset_));
            nodeset = topology::bitmap_storage_.get();
        }
        //
        hwloc_nodeset_t ns = reinterpret_cast<hwloc_nodeset_t>(nodeset->get_bmp());

        int ret = hwloc_get_area_memlocation(topo, addr, 1, ns,
            HWLOC_MEMBIND_BYNODESET);
        if (ret<0) {
            std::string msg(strerror(errno));
            HPX_THROW_EXCEPTION(kernel_error
              , "hpx::threads::topology::get_numa_domain"
              , "hwloc_get_area_memlocation failed " + msg);
            return -1;
        }
        // this uses hwloc directly
        //int bit = hwloc_bitmap_first(ns);
        //return bit
        // this uses an alternative method, both give the same result AFAICT
        threads::mask_type mask = bitmap_to_mask(ns, HWLOC_OBJ_NUMANODE);
        return static_cast<int>(threads::find_first(mask));
    #else
        return 0;
    #endif
    }

Thanks for taking the time to look it over

JB
Re: [hwloc-users] unusual memory binding results
On 21/01/2019 at 17:08, Biddiscombe, John A. wrote:
> Dear list,
>
> I'm allocating a matrix of size (say) 2048*2048 on a node with 2 numa domains
> and initializing the matrix by using 2 threads, one pinned on each numa
> domain - with the idea that I can create tiles of memory bound to each numa
> domain rather than having pages assigned all to one, interleaved, or possibly
> random. The tiling pattern can be user defined, but I am using a simple
> strategy that touches pages based on a simple indexing scheme using (say) a
> tile size of 256 elements and should give a pattern like this

Hello John,

First idea:

A tile of 256 elements means you're switching between tiles every 2kB (if elements are double precision), hence half the page belongs to one thread and the other half to another thread, hence only the first one touching its tile will actually allocate locally.

One way to debug would be to disable touching in N-1 threads to check that everything is allocated on the right node.

Can you share the code, or at least part of it?

Brice
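Brice's point can be checked with a little arithmetic: 256 doubles occupy 2048 bytes, so two tile columns share one 4096-byte page. The sketch below is only an illustration of that calculation. The function names are hypothetical, and it assumes 8-byte elements, 4096-byte pages and a column that starts on a page boundary.

```cpp
#include <cassert>
#include <cstddef>

// How many whole tile columns fit in one page (0 if one tile column is
// larger than a page).
std::size_t tile_columns_per_page(std::size_t tile_size,
                                  std::size_t elem_bytes,
                                  std::size_t page_size)
{
    return page_size / (tile_size * elem_bytes);
}

// Index of the page that holds element `row` of a column starting on a
// page boundary.
std::size_t page_of_row(std::size_t row, std::size_t elem_bytes,
                        std::size_t page_size)
{
    return (row * elem_bytes) / page_size;
}
```

With a tile size of 256, rows 0 and 256 belong to different tiles but page_of_row places both on page 0, so whichever thread touches that page first decides where it is allocated. With a tile size of 512, each tile column fills exactly one page.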
[hwloc-users] unusual memory binding results
Dear list,

I'm allocating a matrix of size (say) 2048*2048 on a node with 2 numa domains and initializing the matrix by using 2 threads, one pinned on each numa domain - with the idea that I can create tiles of memory bound to each numa domain rather than having pages assigned all to one, interleaved, or possibly random. The tiling pattern can be user defined, but I am using a simple strategy that touches pages based on a simple indexing scheme using (say) a tile size of 256 elements and should give a pattern like this

Expected 2 Domain Numa pattern

where the 0's and 1's correspond to the numa node that touches the block of memory.

The memory is allocated using HWLOC_MEMBIND_FIRSTTOUCH (I also tried HWLOC_MEMBIND_NEXTTOUCH) and calls hwloc_alloc_membind_nodeset( ... );

On Broadwell nodes (Linux kernel 4.4.103-6.38_4.0.153-cray_ari_c), it seems to mostly work and when I display the memory binding using a call to hwloc_get_area_memlocation( ... ) I see a pattern that matches the one above. However, I do occasionally see 1's and 0's that are incorrect.

When I run the same code on a login node, Haswell (4.4.156-94.61.1.16335.0.PTF.1107299-default), I generally see patterns that are more like

0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010
0010101010101010

and are clearly wrong.
Testing on an AMD EPYC 7501 32-Core node (running 3.10.0-957.1.3.el7.x86_64), I should see a pattern of 8 nodes such as

Expected 8 Domain Numa pattern

0011223344556677
0011223344556677
1122334455667700
1122334455667700
2233445566770011
2233445566770011
3344556677001122
3344556677001122
4455667700112233
4455667700112233
5566770011223344
5566770011223344
6677001122334455
6677001122334455
7700112233445566
7700112233445566

but I'm actually seeing

0021322302001122
0021322302001122
1021322302001122
1021322302001122
2021322302001122
2021322302001122
3021322302001122
3021322302001122
4021322302001122
4021322302001122
5021322302001122
5021322302001122
6021322302001122
6021322302001122
7021322302001122
7021322302001122

I've checked and triple checked the thread bindings and address mappings and am 99% certain that the fault is either in the get_area_memlocation or in the touch pages not actually causing the page to be bound as expected.

Can anyone suggest what might be my problem - or a test I might try to help narrow down what's wrong?

Many thanks

JB

--
Dr. John Biddiscombe, email:biddisco @.at.@ cscs.ch
http://www.cscs.ch/
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Trevano 131, 6900 Lugano, Switzerland | Fax: +41 (91) 610.82.82
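The expected 8-domain grid above follows a simple rule: the domain of cell (row, column) is (column/2 + row/2) modulo 8. The sketch below regenerates that grid and counts cells where an observed grid differs, which may help when comparing runs. The function names, the 16x16 grid size and the repeat factor of 2 are read off the printed pattern; they are not taken from the poster's code, whose indexing scheme is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build the expected pattern: each domain digit is repeated `repeat` times
// along a row, and the sequence shifts by one domain every `repeat` rows.
std::vector<std::string> expected_pattern(int grid, int num_domains, int repeat = 2)
{
    assert(grid > 0 && repeat > 0);
    assert(num_domains > 0 && num_domains <= 10);   // single decimal digit per cell
    std::vector<std::string> rows;
    rows.reserve(static_cast<std::size_t>(grid));
    for (int r = 0; r < grid; ++r)
    {
        std::string row;
        for (int c = 0; c < grid; ++c)
        {
            const int domain = (c / repeat + r / repeat) % num_domains;
            row.push_back(static_cast<char>('0' + domain));
        }
        rows.push_back(row);
    }
    return rows;
}

// Count cells where the observed grid differs from the expected one.
// Both grids must have the same shape.
std::size_t count_mismatches(const std::vector<std::string>& expected,
                             const std::vector<std::string>& observed)
{
    assert(expected.size() == observed.size());
    std::size_t bad = 0;
    for (std::size_t r = 0; r < expected.size(); ++r)
    {
        assert(expected[r].size() == observed[r].size());
        for (std::size_t c = 0; c < expected[r].size(); ++c)
            if (expected[r][c] != observed[r][c])
                ++bad;
    }
    return bad;
}
```

Calling expected_pattern(16, 8) reproduces the 16 rows listed under "Expected 8 Domain Numa pattern". Passing that result and the grid built from hwloc_get_area_memlocation output to count_mismatches gives a single number to track while changing one variable at a time.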