On 05/25/2017 10:20 AM, Janne Blomqvist wrote:
On Thu, May 25, 2017 at 1:45 PM, Thomas Koenig <tkoe...@netcologne.de> wrote:
Hello world,

the attached patch speeds up the library version of matmul for AMD chips
by selecting AVX128 instructions and, depending on which instructions
are supported, either FMA3 (aka FMA) or FMA4.
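
A minimal sketch of the kind of runtime dispatch this implies is below; the variant names, the ordering of the checks, and the vendor split are assumptions for illustration only, not the actual libgfortran code.

/* Hypothetical dispatch between matmul variants; the real libgfortran
   symbols and selection logic differ.  */
extern void matmul_r8_avx2 (double *c, const double *a,
                            const double *b, int n);
extern void matmul_r8_avx128_fma3 (double *c, const double *a,
                                   const double *b, int n);
extern void matmul_r8_avx128_fma4 (double *c, const double *a,
                                   const double *b, int n);
extern void matmul_r8_vanilla (double *c, const double *a,
                               const double *b, int n);

typedef void (*matmul_fn) (double *, const double *, const double *, int);

matmul_fn
pick_matmul (void)
{
  __builtin_cpu_init ();

  /* AMD chips: 128-bit AVX plus whichever FMA flavour CPUID advertises.  */
  if (__builtin_cpu_is ("amd") && __builtin_cpu_supports ("avx"))
    {
      if (__builtin_cpu_supports ("fma"))
        return matmul_r8_avx128_fma3;
      if (__builtin_cpu_supports ("fma4"))
        return matmul_r8_avx128_fma4;
    }

  /* Everyone else keeps the existing 256-bit AVX2 path when available.  */
  if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("fma"))
    return matmul_r8_avx2;

  return matmul_r8_vanilla;
}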

Jerry tested this on his AMD systems, and found a speedup vs. the
current code of around 10%.

I have been unable to test this on a Ryzen system (the new compile farm
machines won't accept my login yet).  From the benchmarks I have read,
this method should also work fairly well on a Ryzen.

So, OK for trunk?

In some comments, you have -mprefer=avx128 whereas the option that gcc
understands is -mprefer-avx128. Also, have you verified that e.g.
contemporary Intel processors still use the avx256 codepath and don't
accidentally end up with avx128?
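
For reference, one way the two FMA flavours could be provided is per-function target attributes in a translation unit compiled with -mprefer-avx128 (note the dash); the sketch below reuses the hypothetical names from above and is not taken from the patch itself.

__attribute__ ((target ("avx,fma")))
void
matmul_r8_avx128_fma3 (double *c, const double *a, const double *b, int n)
{
  /* Plain loop nest; -mprefer-avx128 keeps the vectorizer at 128 bits
     and the target attribute permits FMA3.  c is assumed to be
     zero-initialized by the caller.  */
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
      for (int j = 0; j < n; j++)
        c[i * n + j] += a[i * n + k] * b[k * n + j];
}

__attribute__ ((target ("avx,fma4")))
void
matmul_r8_avx128_fma4 (double *c, const double *a, const double *b, int n)
{
  /* Identical source; only the permitted FMA flavour differs.  */
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
      for (int j = 0; j < n; j++)
        c[i * n + j] += a[i * n + k] * b[k * n + j];
}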

As for FMA4, are there sufficient numbers of processors supporting
FMA4 but not FMA3 around to justify bloating the library to support
them? I understood that this is only a single AMD CPU generation
("bulldozer" in 2011); the next one ("piledriver" in 2012) added FMA3
in addition to FMA4. And in the new Zen core (Ryzen, Epyc, etc.) AMD
has dropped support for FMA4; there are reports that it will still
execute FMA4 for backward compatibility even though it is no longer
advertised in CPUID, but in any case AMD seems to consider it a legacy
instruction that should not be used anymore (Intel never supported
it).


Good questions. I am testing this on Ryzen now. It does work as advertised. The CPU flags only advertise FMA.
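
A quick probe such as the one below (a standalone sketch, not part of the patch) can be built with GCC and run on each test machine to see which of the relevant features the builtins report:

#include <stdio.h>

int
main (void)
{
  __builtin_cpu_init ();
  printf ("avx  : %d\n", __builtin_cpu_supports ("avx") != 0);
  printf ("avx2 : %d\n", __builtin_cpu_supports ("avx2") != 0);
  printf ("fma  : %d\n", __builtin_cpu_supports ("fma") != 0);
  printf ("fma4 : %d\n", __builtin_cpu_supports ("fma4") != 0);
  printf ("amd  : %d\n", __builtin_cpu_is ("amd") != 0);
  return 0;
}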

So I will be testing the older AMD machine, which advertises FMA4 and FMA, with just the FMA flag, and likewise the Ryzen with FMA4 and FMA.

I want to see if there is any breakage between the two generations of AMD hardware I can access.

Also, the Ryzen will be tested with and without -mprefer-avx128.

I do not have an Intel box to test on.

Regards,

Jerry

