OK, I disabled NUMA in the BIOS. There is a slight performance hit, but NetBSD is still much slower than Linux. This time I ran a single-threaded test first; the disparity grows with the number of threads.
NetBSD:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1 preflt=11285.07 msec, memcpy=3056.22 MiB/sec
Total transfer rate: 3056.22 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=16g -block=1k -threads=1
Thread 1 preflt=7319.33 msec, memcpy=5089.21 MiB/sec
Total transfer rate: 5089.21 MiB/sec

Note that pre-faulting 16 GiB of pages (touching 1 byte in every 4 KiB page) took NetBSD around 11 seconds, while Linux took about 7 seconds. With 16 concurrent threads, the NetBSD pre-fault takes roughly 15 times longer than on Linux (about 18 seconds versus 1.2 seconds per thread).

Maybe there is a global lock in the NetBSD VM subsystem that slows things down as the number of threads goes up. That would explain why the aggregate memcpy throughput is lower on NetBSD with more threads: the threads can't make progress until their pages are allocated, the global lock causes contention, and so they sit idle waiting.

Note below how memcpy for individual NetBSD threads is faster, yet the overall throughput is less than half that of Linux, because the NetBSD VM subsystem acts like a barrier and stalls those threads until pages are allocated.
NetBSD:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 5 preflt=16400.12 msec, memcpy=3130.44 MiB/sec
Thread 11 preflt=16931.65 msec, memcpy=3154.73 MiB/sec
Thread 9 preflt=17169.03 msec, memcpy=2514.06 MiB/sec
Thread 4 preflt=17632.37 msec, memcpy=2928.74 MiB/sec
Thread 14 preflt=17696.83 msec, memcpy=2146.89 MiB/sec
Thread 7 preflt=17885.63 msec, memcpy=2926.97 MiB/sec
Thread 1 preflt=17918.38 msec, memcpy=1338.85 MiB/sec
Thread 10 preflt=18316.65 msec, memcpy=2082.36 MiB/sec
Thread 15 preflt=18323.43 msec, memcpy=1338.62 MiB/sec
Thread 12 preflt=18310.89 msec, memcpy=1322.38 MiB/sec
Thread 6 preflt=18363.57 msec, memcpy=1507.58 MiB/sec
Thread 16 preflt=18360.23 msec, memcpy=1909.12 MiB/sec
Thread 8 preflt=18155.39 msec, memcpy=1478.17 MiB/sec
Thread 13 preflt=18236.67 msec, memcpy=1849.76 MiB/sec
Thread 3 preflt=18303.09 msec, memcpy=2116.50 MiB/sec
Thread 2 preflt=17960.70 msec, memcpy=1325.43 MiB/sec
Total transfer rate: 6087.94 MiB/sec

Linux:
$ ./sv_mem -mode=wr -size=1g -block=1k -threads=16
Thread 13 preflt=1182.27 msec, memcpy=902.88 MiB/sec
Thread 9 preflt=1183.55 msec, memcpy=903.02 MiB/sec
Thread 5 preflt=1191.65 msec, memcpy=899.32 MiB/sec
Thread 11 preflt=1186.96 msec, memcpy=897.64 MiB/sec
Thread 7 preflt=1195.46 msec, memcpy=898.71 MiB/sec
Thread 6 preflt=1207.12 msec, memcpy=904.71 MiB/sec
Thread 15 preflt=1194.18 msec, memcpy=896.05 MiB/sec
Thread 4 preflt=1216.37 msec, memcpy=909.09 MiB/sec
Thread 3 preflt=1210.41 msec, memcpy=897.77 MiB/sec
Thread 2 preflt=1210.36 msec, memcpy=896.36 MiB/sec
Thread 12 preflt=1210.59 msec, memcpy=898.79 MiB/sec
Thread 14 preflt=1209.41 msec, memcpy=898.01 MiB/sec
Thread 10 preflt=1210.00 msec, memcpy=896.88 MiB/sec
Thread 1 preflt=1216.32 msec, memcpy=899.56 MiB/sec
Thread 16 preflt=1209.18 msec, memcpy=899.34 MiB/sec
Thread 8 preflt=1231.36 msec, memcpy=910.00 MiB/sec
Total transfer rate: 13978.88 MiB/sec
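
For anyone who wants to reproduce the pre-fault numbers without sv_mem, here is a rough sketch of what "touch 1 byte at every 4 KiB page" amounts to. This is not the sv_mem source: the function name prefault_msec is made up for illustration, and it assumes the buffer comes from malloc() and that the allocation is large enough for the allocator to hand back fresh, untouched pages (small allocations may reuse memory that is already faulted in).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not taken from sv_mem: allocate `size` bytes, write
 * one byte into every page so the kernel has to back each page with
 * physical memory, and return the elapsed wall-clock time in milliseconds.
 * Returns a negative value if the allocation fails. */
double prefault_msec(size_t size)
{
    assert(size > 0);

    long page = sysconf(_SC_PAGESIZE);   /* typically 4096 on amd64 */
    if (page <= 0)
        page = 4096;                     /* fallback, an assumption */

    /* volatile keeps the compiler from optimizing the stores away */
    volatile unsigned char *buf = malloc(size);
    if (buf == NULL)
        return -1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t off = 0; off < size; off += (size_t)page)
        buf[off] = 1;                    /* first write to a page faults it in */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free((void *)buf);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Calling prefault_msec((size_t)1 << 30) from 16 threads at once should roughly correspond to the -size=1g -threads=16 run above, although the real tool may allocate and time things differently.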