I'm developing a small tool that tests memory throughput across different environments, and I'm seeing performance problems on NetBSD 8. The details are below.
The tool creates a number of concurrent threads. Each thread allocates a 1 GiB memory segment and a 1 KiB transfer block, pre-faults every page by writing a single byte at each 4 KiB offset, and then calls memcpy() in a loop, copying the 1 KiB block until the 1 GiB segment has been filled.
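In outline, each thread does the following (a simplified sketch of the logic just described; the identifiers and constants are illustrative, and timing and error handling are omitted):

    #include <stdlib.h>
    #include <string.h>

    #define SEG_SIZE   (1024UL * 1024UL * 1024UL)  /* 1 GiB segment per thread */
    #define BLOCK_SIZE 1024UL                      /* 1 KiB transfer block */
    #define PAGE_SIZE  4096UL                      /* pre-fault stride */

    /* Per-thread worker (pthread start routine). */
    static void *worker(void *arg)
    {
        (void)arg;
        char *seg   = malloc(SEG_SIZE);
        char *block = calloc(1, BLOCK_SIZE);

        /* Pre-fault: write one byte at every 4 KiB offset so page faults
         * are taken here rather than inside the timed memcpy() loop. */
        for (size_t off = 0; off < SEG_SIZE; off += PAGE_SIZE)
            seg[off] = 1;

        /* Timed phase: copy the 1 KiB block back-to-back until the
         * 1 GiB segment is filled. */
        for (size_t off = 0; off < SEG_SIZE; off += BLOCK_SIZE)
            memcpy(seg + off, block, BLOCK_SIZE);

        free(block);
        free(seg);
        return NULL;
    }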
NetBSD and Linux ship different versions of GCC, but I was hoping the following flags would keep optimization differences to a minimum:

    gcc -O1 -fno-builtin -march=westmere -Wall -pedantic -std=c11 \
        -D_FILE_OFFSET_BITS=64 -D_XOPEN_SOURCE=700 -D_DEFAULT_SOURCE

The machine has 48 GiB of RAM; this test uses 16 threads x 1 GiB = 16 GiB total. I'm seeing several issues on NetBSD:

1. Each thread calls mlock() to lock its pages, and when it later unlocks them, munlock() sometimes fails with ENOMEM. It doesn't happen every run, but it happens often, and I don't understand why munlock() in particular fails. The same code works correctly on Linux. (The locking calls are sketched at the end of this post.)

2. Performance with 16 concurrent threads is rather poor. Most threads are idle about 60% of the time (on Linux they are 100% busy), which suggests contention somewhere. With 16 threads, average throughput is around 5.8 GiB/sec on NetBSD versus around 15.3 GiB/sec on Linux.

3. This one affects both NetBSD and Linux: when mlock() is used to lock the pages before the memcpy() loop, overall throughput drops significantly. The threads appear to be serialized; while a few threads are running, the others are blocked for some reason. I don't know why mlock() has this effect.

If anyone has any thoughts on this, please let me know. Below are the details of the SMP architecture and the test results.

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Stepping:              2
CPU MHz:               1596.000
CPU max MHz:           2395.0000
CPU min MHz:           1596.0000
BogoMIPS:              4787.71
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15

NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:

Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
Thread 7 preflt=14277.53 msec, memcpy=2891.39 MiB/sec
Thread 3 preflt=14765.99 msec, memcpy=2553.72 MiB/sec
Thread 14 preflt=15036.90 msec, memcpy=2288.19 MiB/sec
Thread 1 preflt=15126.01 msec, memcpy=2315.53 MiB/sec
Thread 12 preflt=15333.82 msec, memcpy=2071.52 MiB/sec
Thread 5 preflt=15603.25 msec, memcpy=1880.64 MiB/sec
Thread 6 preflt=15704.05 msec, memcpy=1662.66 MiB/sec
Thread 10 preflt=15693.48 msec, memcpy=1642.44 MiB/sec
Thread 4 preflt=15571.64 msec, memcpy=1557.73 MiB/sec
Thread 15 preflt=15574.60 msec, memcpy=1571.76 MiB/sec
Thread 9 preflt=15750.08 msec, memcpy=2170.44 MiB/sec
Thread 13 preflt=15588.69 msec, memcpy=1900.24 MiB/sec
Thread 8 preflt=15587.50 msec, memcpy=2043.66 MiB/sec
Thread 16 preflt=15265.48 msec, memcpy=1884.74 MiB/sec
Thread 11 preflt=15294.87 msec, memcpy=2272.75 MiB/sec
Total transfer rate: 5817.56 MiB/sec

NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, with mlock:

Thread 2 preflt=5.27 msec, memcpy=2595.67 MiB/sec
Thread 3 preflt=5.37 msec, memcpy=2550.90 MiB/sec
Thread 16 preflt=5.02 msec, memcpy=2770.11 MiB/sec
Thread 4 preflt=4.12 msec, memcpy=3209.06 MiB/sec
Thread 15 preflt=5.31 msec, memcpy=2496.82 MiB/sec
Thread 13 preflt=7.46 msec, memcpy=3083.72 MiB/sec
Thread 5 preflt=5.49 msec, memcpy=2766.81 MiB/sec
Thread 14 preflt=6.94 msec, memcpy=2574.98 MiB/sec
Thread 8 preflt=6.53 msec, memcpy=2201.47 MiB/sec
Thread 12 preflt=4.90 msec, memcpy=2814.79 MiB/sec
Thread 10 preflt=4.41 msec, memcpy=2615.27 MiB/sec
Thread 6 preflt=6.18 msec, memcpy=2844.57 MiB/sec
Thread 9 preflt=5.38 msec, memcpy=2976.05 MiB/sec
Thread 7 preflt=4.81 msec, memcpy=2828.54 MiB/sec
Thread 11 preflt=5.10 msec, memcpy=2778.69 MiB/sec
Thread 1 preflt=3.84 msec, memcpy=3229.88 MiB/sec
Total transfer rate: 3789.33 MiB/sec

Linux: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:

Thread 5 preflt=1122.06 msec, memcpy=990.24 MiB/sec
Thread 2 preflt=1137.94 msec, memcpy=990.41 MiB/sec
Thread 15 preflt=1125.65 msec, memcpy=982.23 MiB/sec
Thread 4 preflt=1130.02 msec, memcpy=981.37 MiB/sec
Thread 9 preflt=1130.47 msec, memcpy=982.23 MiB/sec
Thread 13 preflt=1127.70 msec, memcpy=982.00 MiB/sec
Thread 3 preflt=1136.35 msec, memcpy=985.89 MiB/sec
Thread 12 preflt=1133.20 msec, memcpy=985.05 MiB/sec
Thread 8 preflt=1136.61 msec, memcpy=985.21 MiB/sec
Thread 11 preflt=1147.40 msec, memcpy=989.12 MiB/sec
Thread 14 preflt=1137.01 msec, memcpy=980.20 MiB/sec
Thread 7 preflt=1140.52 msec, memcpy=980.16 MiB/sec
Thread 6 preflt=1142.21 msec, memcpy=981.06 MiB/sec
Thread 10 preflt=1143.08 msec, memcpy=982.90 MiB/sec
Thread 16 preflt=1146.96 msec, memcpy=988.34 MiB/sec
Thread 1 preflt=1150.99 msec, memcpy=983.68 MiB/sec
Total transfer rate: 15314.12 MiB/sec

Linux: 16 threads x 1 GiB, using 1 KiB memcpy size, with mlock:

Thread 5 preflt=15.72 msec, memcpy=1555.03 MiB/sec
Thread 4 preflt=7.49 msec, memcpy=1548.15 MiB/sec
Thread 3 preflt=15.07 msec, memcpy=1471.69 MiB/sec
Thread 2 preflt=15.98 msec, memcpy=1517.09 MiB/sec
Thread 1 preflt=16.04 msec, memcpy=1533.20 MiB/sec
Thread 6 preflt=4.13 msec, memcpy=5191.23 MiB/sec
Thread 7 preflt=4.03 msec, memcpy=5825.18 MiB/sec
Thread 8 preflt=4.19 msec, memcpy=5265.08 MiB/sec
Thread 10 preflt=5.64 msec, memcpy=3359.36 MiB/sec
Thread 9 preflt=5.68 msec, memcpy=3354.28 MiB/sec
Thread 11 preflt=4.21 msec, memcpy=5255.38 MiB/sec
Thread 12 preflt=4.04 msec, memcpy=5250.94 MiB/sec
Thread 13 preflt=4.73 msec, memcpy=4224.99 MiB/sec
Thread 15 preflt=5.61 msec, memcpy=3311.98 MiB/sec
Thread 14 preflt=5.69 msec, memcpy=3312.76 MiB/sec
Thread 16 preflt=3.88 msec, memcpy=6158.48 MiB/sec
Total transfer rate: 2800.76 MiB/sec
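For completeness, the mlock variant brackets the copy phase roughly as in the sketch below (again with the illustrative names from the earlier sketch, error handling omitted). One assumption in this sketch: mlock()/munlock() operate on whole pages, and a malloc()ed segment need not start on a page boundary, so the range is rounded out to page boundaries before locking (POSIX permits implementations to reject unaligned addresses).

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Copy phase with the segment pinned in RAM. The lock/unlock range
     * is rounded out to page boundaries, as described above. */
    static void copy_locked(char *seg, size_t seg_size,
                            const char *block, size_t block_size)
    {
        uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)seg & ~(page - 1);
        size_t    span  = (size_t)((((uintptr_t)seg + seg_size + page - 1)
                                    & ~(page - 1)) - start);

        mlock((void *)start, span);          /* real code checks the result */

        for (size_t off = 0; off < seg_size; off += block_size)
            memcpy(seg + off, block, block_size);

        /* This is the call that intermittently fails with ENOMEM
         * on NetBSD. */
        munlock((void *)start, span);
    }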