Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
On 2019-Sep-25, at 20:27, Mark Millard wrote:

> On 2019-Sep-25, at 19:26, Mark Millard wrote:
>
>> On 2019-Sep-25, at 10:02, Mark Johnston wrote:
>>
>> . . . (much omitted material; the quoted exchange appears in full in the
>> messages below) . . .
Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
On 2019-Sep-25, at 19:26, Mark Millard wrote:

> On 2019-Sep-25, at 10:02, Mark Johnston wrote:
>
> . . . (much omitted material; the quoted exchange appears in full in the
> messages below) . . .
>
>> Another possibility is to use DTrace to trace the
>> requested domain in vm_page_alloc_domain_after().
>>
>> . . . (the DTrace one-liners are quoted in full in Mark Johnston's
>> message below) . . .

I'll think about this,
Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
On 2019-Sep-25, at 10:02, Mark Johnston wrote:

> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64
> wrote:
>
>> . . . (the original report is quoted in full in the message below) . . .
>
> The implication seems to be that your benchmark program is using pages
> from both domains despite a policy which preferentially allocates pages
> from domain 1, so you would first want to determine if this is actually
> what's happening. As far as I know we currently don't have a good way
> of characterizing per-domain memory usage within a process.
>
> If your benchmark uses a large fraction of the system's memory, you
> could use the vm.phys_free sysctl to get a sense of how much memory from
> each domain is free.

The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
domain. I've never configured the benchmark such that it even reaches
10 GiBytes on this hardware. (It stops for a time constraint first,
based on the values in use for the "adjustable" items.)

The benchmark runs the Hierarchical INTegration (HINT) kernel for a
sequence of larger and larger numbers of cells in the grid that it uses.
Each size is run in isolation before the next is tried, and each gets its
own timings. Each size gets its own kernel-vector allocations (and
deallocations), with the trials, and the laps within a trial, reusing the
same memory. Each lap in each trial gets its own thread creations (and
completions). The main thread combines the results when there are multiple
threads involved. (So I'm not sure of the main thread's behavior relative
to the cpuset commands.)

Thus there are lots of thread creations overall, as well as lots of
allocations of vectors for use in the integration-kernel code.

What it looks like to me is that std::async's internal thread creations
are not respecting the cpuset command's settings: in a sense, not
inheriting the cpuset information correctly (or such is being ignored).

For reference,
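To make that hypothesis directly testable, a small check along the
following lines could be used. This is a minimal sketch, not the
benchmark's code (the file name and the idea of having each worker return
its own mask are assumptions of mine): it launches workers the way the
benchmark is described as working, via std::async with std::launch::async,
and reports each thread's CPU affinity mask using FreeBSD's
cpuset_getaffinity(2). Run under, e.g., "cpuset -l 0-15", every reported
mask should be confined to CPUs 0-15 if the restriction is inherited.

    // Minimal sketch (not the benchmark): do threads created by std::async
    // inherit the CPU mask imposed by cpuset(1) on the process?
    // Build: c++ -std=c++11 -O2 async_mask.cc -o async_mask -lpthread
    // Run:   cpuset -l 0-15 ./async_mask
    #include <sys/param.h>
    #include <sys/cpuset.h>

    #include <cstdio>
    #include <future>
    #include <vector>

    // Return the calling thread's CPU affinity mask.
    static cpuset_t current_mask() {
        cpuset_t mask;
        CPU_ZERO(&mask);
        // With CPU_WHICH_TID, an id of -1 means "the calling thread".
        cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                           sizeof(mask), &mask);
        return mask;
    }

    static void report(const char *who, const cpuset_t &mask) {
        std::printf("%s:", who);
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                std::printf(" %d", cpu);
        std::printf("\n");
    }

    int main() {
        report("main thread", current_mask());

        // One std::async per "lap", as the benchmark is described as
        // doing; std::launch::async forces a real thread per call.
        std::vector<std::future<cpuset_t>> laps;
        for (int i = 0; i < 4; i++)
            laps.push_back(std::async(std::launch::async, current_mask));

        for (auto &lap : laps)
            report("async worker", lap.get());
        return 0;
    }

Note that this only checks the CPU-mask half of the question: the
"-n prefer:1" memory-domain policy is not part of the affinity mask (the
analogous cpuset_getdomain(2) call would be the thing to inspect for
that), so the DTrace counting in Mark Johnston's message below is still
what shows which domain allocations actually come from.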
Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64
wrote:

> Note: I have access to only one FreeBSD amd64 context, and it is also
> my only access to a NUMA context: 2 memory domains. A Threadripper
> 1950X context. Also: I have only a head FreeBSD context on any
> architecture, not 12.x or before. So I have limited compare/contrast
> material.
>
> I present the below basically to ask if the NUMA handling has been
> validated, or if it is going to be, at least for contexts that might
> apply to ThreadRipper 1950X and analogous contexts. My results suggest
> it has not been (or libc++'s now() times get messed up such that it
> looks like NUMA mishandling, since this is based on odd benchmark
> results that involve mean time for laps, using a median of such across
> multiple trials).
>
> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this 1950X and
> got expected results on Fedora but odd ones on FreeBSD. The benchmark
> is a variation on the old HINT benchmark, including the old
> multi-threading variation. I later tried Fedora because the FreeBSD
> results looked odd. The other architectures I tried FreeBSD
> benchmarking with did not look odd like this. (powerpc64 on an old
> PowerMac with 2 sockets and 2 cores per socket, aarch64 Cortex-A57
> Overdrive 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
> Ed. For these I used 4 threads, not more.)
>
> I tend to write in terms of plots made from the data instead of the
> raw benchmark data.
>
> FreeBSD testing based on:
> cpuset -l0-15 -n prefer:1
> cpuset -l16-31 -n prefer:1
>
> Fedora 30 testing based on:
> numactl --preferred 1 --cpunodebind 0
> numactl --preferred 1 --cpunodebind 1
>
> While I have more results, I reference primarily DSIZE and ISIZE being
> unsigned long long and also both being unsigned long as examples.
> Variations in results are not from the type differences for any LP64
> architectures. (But they give an idea of benchmark variability in the
> test context.)
>
> The Fedora results solidly show the bandwidth limitation of using one
> memory controller. They also show the latency consequences for the
> remote memory domain case vs. the local memory domain case. There is
> not a lot of variability between the examples of the 2 type-pairs used
> for Fedora.
>
> Not true for FreeBSD on the 1950X:
>
> A) The latency-constrained part of the graph looks to normally be
> using the local memory domain when -l0-15 is in use for 8 threads.
>
> B) Both the -l0-15 and the -l16-31 parts of the graph for 8 threads
> that should be bandwidth limited show mostly examples that would have
> to involve both memory controllers for the bandwidth to get the
> results shown, as far as I can tell. There is also wide variability,
> ranging between the expected 1-controller result and, say, what a
> 2-controller round-robin would be expected to produce.
>
> C) Even the single-threaded result shows a higher result for larger
> total bytes for the kernel vectors. Fedora does not.
>
> I think that (B) is the most solid evidence for something being odd.
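As an aside on the statistic used in the quoted report ("mean time for
laps, using a median of such across multiple trials"): it amounts to
something like the following sketch. This is illustrative only; the
function name and data layout are assumptions of mine, not the
benchmark's code.

    // Illustrative: for each trial, take the mean of its lap times; then
    // take the median of those per-trial means as the reported result.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // lap_seconds[t][l] = wall-clock seconds for lap l of trial t.
    double median_of_trial_means(std::vector<std::vector<double>> lap_seconds) {
        std::vector<double> means;
        for (const auto &trial : lap_seconds) {
            double sum = 0.0;
            for (double lap : trial)
                sum += lap;
            means.push_back(sum / trial.size());
        }
        std::sort(means.begin(), means.end());
        const std::size_t n = means.size();
        // Median: the middle mean, or the average of the two middle means.
        return (n % 2 != 0) ? means[n / 2]
                            : 0.5 * (means[n / 2 - 1] + means[n / 2]);
    }

Taking a median across trials keeps an occasional outlier trial (for
example, one perturbed by unrelated system activity) from dominating the
summary, which matters given the wide trial-to-trial variability reported
above.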
The implication seems to be that your benchmark program is using pages
from both domains despite a policy which preferentially allocates pages
from domain 1, so you would first want to determine if this is actually
what's happening. As far as I know we currently don't have a good way of
characterizing per-domain memory usage within a process.

If your benchmark uses a large fraction of the system's memory, you could
use the vm.phys_free sysctl to get a sense of how much memory from each
domain is free.

Another possibility is to use DTrace to trace the requested domain in
vm_page_alloc_domain_after(). For example, the following DTrace one-liner
counts the number of pages allocated per domain by ls(1):

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry
    /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
...
        0               71
        1               72
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry
    /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
...
        1              143
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry
    /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
...
        0              143

This approach might not work for various reasons depending on how exactly
your benchmark program works.
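As a worked example of the vm.phys_free suggestion: the sysctl produces a
human-readable table of the per-domain free-page lists, so one low-effort
check is to snapshot it immediately before and after a benchmark size and
compare which domain's free counts moved. Below is a minimal sketch,
assuming only that vm.phys_free remains a text-format sysctl; the helper
name is mine, and running "sysctl vm.phys_free" by hand yields the same
text.

    // Minimal sketch: snapshot the text output of the vm.phys_free sysctl
    // before and after a benchmark phase, to see which domain's free
    // lists shrink.
    #include <sys/types.h>
    #include <sys/sysctl.h>

    #include <cstdio>
    #include <string>

    static std::string phys_free_snapshot() {
        size_t len = 0;
        // A null buffer asks the kernel for the current output size.
        if (sysctlbyname("vm.phys_free", nullptr, &len, nullptr, 0) != 0)
            return "(vm.phys_free unavailable)";
        std::string buf(len, '\0');
        // A production version would retry if the size grew in between.
        if (sysctlbyname("vm.phys_free", &buf[0], &len, nullptr, 0) != 0)
            return "(vm.phys_free read failed)";
        buf.resize(len);
        return buf;
    }

    int main() {
        std::printf("before:\n%s\n", phys_free_snapshot().c_str());
        // ... run one benchmark size here, then diff the snapshots ...
        std::printf("after:\n%s\n", phys_free_snapshot().c_str());
        return 0;
    }

As noted earlier in the thread, the benchmark stays under roughly 10
GiBytes against 48 GiBytes per domain, so the movement in the free lists
may be small relative to the totals.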