On Thu, 12 Apr 2012, Hans Rosenfeld wrote:

On Thu, Apr 12, 2012 at 09:39:23AM -0500, Bob Friesenhahn wrote:
My OpenMP-based application definitely fits the description of a
potentially "problematic application" because it does execute the same
code in tight loops in both cores of a compute unit.  That is its
whole purpose.  The algorithms mostly qualify as "embarrassingly
parallel".  The code is part of the same application so the page
mappings should be identical.  If the shared inner loops fail to fit
in the L1 instruction cache or there is aliasing then the performance
would be poor.

Could that application be turned into a test case that I could use to
benchmark and debug this further?

The application is open source and available from "http://www.graphicsmagick.org/". It is one of the few OpenMP-based applications outside of the HPC space, and one of the very few OpenMP-based applications that one might find in a Linux distribution. A version of it is included in a popular benchmark suite.

I can send you a script, with a few inputs, which acts as a benchmark. The application also includes its own built-in benchmarking facility.

I was hoping to investigate GCC's bdver1 output (which does try to
address L1 instruction cache issues) on Illumos but I discovered that
Illumos is not currently capable of executing this code ("illegal
instruction").

Did you test this with the latest code from illumos-gate? The patches to
support the new instruction sets on Bulldozer just went in a few days
ago.

No. I don't have physical access to the system. I could update kernel binaries and remotely reboot the system if it is reasonably easy to install/modify the kernel. There is little time available though since the system will be gone tomorrow.

Could you compile your program with gcc and tuned for barcelona on Linux
and compare the runtime with Illumos on the same hardware?

I did this previously on a 16-core Opteron 6200 system (courtesy of the same hardware vendor) and have the results available. Linux and Illumos GCC results were quite similar. Performance with the AMD Open64 compiler on Linux (the one that AMD benchmarks with) was better and much more consistent. There was never anything close to a 16X performance boost, although the software can achieve linear speedup (12X speedup, or more, with 12 cores) for some algorithms on Intel Xeon CPUs.

64-core is pretty different from 16-core since there is a whole lot more contention going on for the same amount of work.
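
For reference, the speedup figures above are just ratios of wall-clock times. A minimal sketch of the arithmetic follows; the function names and the timings are hypothetical, chosen only to illustrate what "linear speedup" means, and are not measurements from any of the systems discussed.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-thread runtime to multi-thread runtime."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, cores):
    """Speedup divided by core count; 1.0 means linear scaling."""
    return speedup(t_serial, t_parallel) / cores


# Hypothetical timings in seconds (assumptions, not benchmark results).
t_one_thread = 120.0
t_twelve_threads = 10.0

print(speedup(t_one_thread, t_twelve_threads))         # 12.0
print(efficiency(t_one_thread, t_twelve_threads, 12))  # 1.0
```

An efficiency well below 1.0 at higher core counts would be consistent with the contention described above, although the number alone does not say where the contention comes from.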

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

