Very interesting.  I was running a parallel finite element code and seeing
great performance compared to Intel in most cases, but on larger runs it was 20x
slower.  This would explain it.

Do you know which commit, or anything else that might help find any related
discussion?  I tried a few Google searches without luck.

Is it specific to the 24-core?  The slowdown I described happened on a 32-core
single-socket Epyc as well as a dual-socket 24-core AMD Epyc system.
