On 01/14/18 19:44, Peter Veentjer wrote:
> I'm working on some very simple aggregations on huge chunks of
> offheap memory (500GB+) for a hackaton. This is done using a very
> simple stride; every iteration the address increases with 20 bytes.
> So the prefetcher should not have any problems with it.
>
> According to my calculations I'm currently processing 35 GB/s.
> However I'm not sure if I'm close to the maximum bandwidth of this
> machine. Specs: 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P 2x Intel(R)
> *Xeon*(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
>
> What is the best tool to determine the maximum bandwidth of a machine
> running Linux (RHEL 7)

I recently had the same question (out of curiosity, after reading about
Ryzen/EPYC memory performance) and still had my bookmarks, so here goes.

- The 'perf' utility usually used for performance measurements has a
  memory benchmark. It is somewhat fiddly with its parameters but OK
  for a quick test. It is single-threaded only, and you really need to
  pass larger memory blocks, otherwise you might only measure cache
  bandwidth.

- 'mbw' [1] is also single-threaded only, but quick & easy to run. Make
  sure to pass proper CFLAGS, otherwise it will build without any
  optimization at all.

- 'pmbw' [2] is a parallel version of mbw with assembly loops, SSE/AVX
  and many variants of accesses (forwards, backwards, sideways ;).
  Unfortunately it has completely unreadable output; this is offset by
  the built-in capability to pass the output to gnuplot and make pretty
  pictures. It also has pretty extensive benchmark results on the
  website.

- The "industry-standard" bandwidth benchmark is STREAM [3] by John
  McCalpin of SGI and comp.arch fame. Unfortunately the original code
  has been hacked on by various people, so different versions float
  around more or less unmaintained. I found two forks that are easy to
  use:

  - [4] is a cleaned-up version with optional OpenMP support that
    should build out of the box. Just get stream.c and build, with or
    without OpenMP. You REALLY need to pass much higher values for
    STREAM_ARRAY_SIZE (at least ~50-80x) and NTIMES (~10x), otherwise
    the run will be too short & meaningless on your machine.

  - [5] is another fork with NUMA support. This is relevant because you
    have two sockets and are probably running without NUMA affinity,
    effectively thrashing your caches not just from the local CPU but
    also from the other one, just like real applications without NUMA
    awareness tend to do. :(

  In any case make sure to build STREAM with -O3 -march=native. Pass
  -fopenmp to get default OpenMP support. The NUMA fork has both the
  OpenMP version and a version with "manual threading" and explicit
  NUMA awareness.
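
To make the STREAM_ARRAY_SIZE and NTIMES advice above concrete, here is
a minimal sketch of a STREAM-style "triad" measurement in C. It is NOT
the code from [3], [4] or [5]; the function names (triad_bytes,
triad_best_gbs) and the TRIAD_DEMO_MAIN macro are made up for this
illustration. It only shows the idea: stream over arrays much larger
than the caches, repeat several times, and report the best rate.

```c
/* Sketch of a STREAM-style triad bandwidth measurement.
 * Not the real STREAM code; all names here are illustrative. */
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Bytes moved by one triad pass over n doubles: read b, read c, write a. */
double triad_bytes(size_t n) {
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Monotonic wall-clock time in seconds. */
double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs a[i] = b[i] + 3.0 * c[i] over n doubles, ntimes times.
 * Returns the best observed rate in GB/s (1e9 bytes/s), or -1.0 if
 * allocation or the result check fails. */
double triad_best_gbs(size_t n, int ntimes) {
    assert(n > 0 && ntimes > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return -1.0; }

    /* Initialize with the same loop shape as the kernel, so that with
     * OpenMP each thread first touches the pages it will later use. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = 0.0;
    for (int k = 0; k < ntimes; k++) {
        double t0 = now_seconds();
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) a[i] = b[i] + 3.0 * c[i];
        double dt = now_seconds() - t0;
        if (dt > 0.0) {
            double gbs = triad_bytes(n) / dt / 1e9;
            if (gbs > best) best = gbs;
        }
    }

    /* Reading the result keeps the stores live and doubles as a sanity
     * check: 1.0 + 3.0 * 2.0 == 7.0 exactly. */
    int ok = (a[0] == 7.0 && a[n - 1] == 7.0);
    free(a); free(b); free(c);
    return ok ? best : -1.0;
}

#ifdef TRIAD_DEMO_MAIN /* hypothetical macro, see the build line below */
int main(void) {
    size_t n = (size_t)200 * 1000 * 1000; /* 3 arrays of 1.6 GB each */
    printf("triad best: %.1f GB/s\n", triad_best_gbs(n, 20));
    return 0;
}
#endif
```

Assuming the file is saved as triad.c, something like
"gcc -O3 -march=native -fopenmp -DTRIAD_DEMO_MAIN triad.c -o triad"
should build it; without -fopenmp the pragmas are ignored and it runs
single-threaded. Treat the numbers as a rough cross-check only and use
the real benchmarks for anything you want to quote.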

Happy benchmarking!

Holger

[1] https://github.com/raas/mbw
[2] https://github.com/bingmann/pmbw
[3] http://www.cs.virginia.edu/stream/
[4] https://github.com/jeffhammond/STREAM
[5] https://github.com/larsbergstrom/NUMA-STREAM
