On 01/14/18 19:44, Peter Veentjer wrote:
> I'm working on some very simple aggregations on huge chunks of
> offheap memory (500GB+) for a hackathon. This is done using a very
> simple stride; every iteration the address increases by 20 bytes.
> So the prefetcher should not have any problems with it.
> 
> According to my calculations I'm currently processing 35 GB/s.
> However I'm not sure if I'm close to the maximum bandwidth of this
> machine. Specs: 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P, 2x Intel(R)
> Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket
> 
> What is the best tool to determine the maximum bandwidth of a machine
> running Linux (RHEL 7)?

I recently had the same question (out of curiosity, after reading
about Ryzen/EPYC memory performance) and still had my bookmarks,
so here goes.

- The 'perf' utility, normally used for performance measurements, also
has a built-in memory benchmark ('perf bench mem'). Its parameters are
somewhat fiddly, but it is OK for a quick test. It is single-threaded
only, and you really need to pass a block size well beyond the
last-level cache, otherwise you might only measure cache bandwidth.

- 'mbw' [1] is also single-threaded only, but quick & easy to run.
Make sure to pass proper CFLAGS, otherwise it will build without
any optimization at all.

- 'pmbw' [2] is a parallel version of mbw with assembly loops, SSE/AVX
and many variants of accesses (forwards, backwards, sideways ;).
Unfortunately it has completely unreadable output; this is offset by
the built-in capability to pass the output to gnuplot and make pretty
pictures. Also has pretty extensive benchmark results on the website.

- The "industry-standard" bandwidth benchmark is STREAM [3] by
John McCalpin of SGI and comp.arch fame. Unfortunately the original code
has been hacked on by various people, so different versions float around
more or less unmaintained. I found two forks that are easy to use:

- [4] is a cleaned-up version with optional OpenMP support that
should build out of the box. Just get stream.c and build, with or
without OpenMP. You REALLY need to pass much higher values for
STREAM_ARRAY_SIZE (at least ~50-80x) and NTIMES (~10x), otherwise
the run will be too short & meaningless on your machine.

- [5] is another fork with NUMA support. This is relevant
because you have two sockets and are probably running without
NUMA affinity, effectively trashing your caches not just from the
local CPU but also from the other one, just like real applications
without NUMA awareness tend to do. :(
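
To put numbers on the STREAM_ARRAY_SIZE advice above, here is a rough
Python sketch of the arithmetic. Everything in it is an assumption rather
than something measured on this machine: the stock stream.c default of
10,000,000 elements, 8-byte doubles, three arrays, and 25 MB of L3 per
socket for the E5-2687W v3. The helper names are made up for illustration.
STREAM's own run rules ask for each array to be at least 4x the combined
cache size.

```python
# Assumptions (not measured): stream.c defaults and the L3 size per socket.
DEFAULT_ARRAY_SIZE = 10_000_000      # assumed stock STREAM_ARRAY_SIZE
BYTES_PER_ELEMENT = 8                # STREAM_TYPE is double by default
NUM_ARRAYS = 3                       # STREAM uses three arrays: a, b, c
L3_BYTES_PER_SOCKET = 25 * 2**20     # assumed 25 MB L3 per socket
SOCKETS = 2


def stream_footprint_gib(array_size):
    """Total memory used by the three STREAM arrays, in GiB."""
    return array_size * BYTES_PER_ELEMENT * NUM_ARRAYS / 2**30


def min_array_size(total_cache_bytes, factor=4):
    """Smallest element count so one array is `factor` times the cache."""
    return total_cache_bytes * factor // BYTES_PER_ELEMENT


if __name__ == "__main__":
    floor = min_array_size(L3_BYTES_PER_SOCKET * SOCKETS)
    print(f"cache rule minimum: {floor:,} elements per array")
    for multiplier in (1, 50, 80):
        size = DEFAULT_ARRAY_SIZE * multiplier
        print(f"{multiplier:>3}x default = {size:,} elements, "
              f"{stream_footprint_gib(size):.2f} GiB total")
```

Under these assumptions the cache rule alone is satisfied at roughly 26
million elements per array, so 50-80x the default is far above that floor;
the larger sizes are about making each timed pass long enough to measure.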

In any case make sure to build STREAM with -O3 -march=native.
Pass -fopenmp to get default OpenMP support. The NUMA fork has both
an OpenMP version and a version with "manual threading" and explicit
NUMA awareness.
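
For a ceiling to compare any benchmark result against, the theoretical
peak can be estimated from the DIMM speed. A hedged sketch in Python,
assuming 4 memory channels per socket, an 8-byte wide channel, and that
the DIMMs really run at 2133 MT/s (with 3 DIMMs per channel the BIOS may
clock them lower, so check the actual speed). The function name is
hypothetical.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels_per_socket,
                        sockets, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_s = (mega_transfers_per_s * 1e6 * bus_bytes
                   * channels_per_socket * sockets)
    return bytes_per_s / 1e9


# Assumed topology: 4 channels per socket, 2 sockets, DDR4-2133.
peak = peak_bandwidth_gb_s(2133, channels_per_socket=4, sockets=2)
measured = 35.0  # GB/s, the figure from the original question

print(f"theoretical peak: {peak:.1f} GB/s")
print(f"measured share:   {measured / peak:.0%}")
```

If those assumptions hold, the ceiling is about 136.5 GB/s across both
sockets, so 35 GB/s would be roughly a quarter of it. Sustained STREAM
numbers typically land some way below the theoretical figure.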

Happy benchmarking!

Holger

[1] https://github.com/raas/mbw
[2] https://github.com/bingmann/pmbw
[3] http://www.cs.virginia.edu/stream/
[4] https://github.com/jeffhammond/STREAM
[5] https://github.com/larsbergstrom/NUMA-STREAM
