Have you found hyper-threading to have a positive impact on vector-width scalability? Among other things, I'm wondering if it can help speed up gathers. I could test this myself; it just isn't the easiest thing to turn on and off on production systems.
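(For reference: on recent Linux kernels, SMT/hyper-threading can be inspected and toggled at runtime via sysfs, without a reboot or BIOS change. A minimal sketch, assuming the standard `/sys/devices/system/cpu/smt/control` interface introduced in kernel 4.19; the path parameter is there only to make the function testable:)

```python
from pathlib import Path

def smt_state(control_path="/sys/devices/system/cpu/smt/control"):
    """Read the kernel's SMT control file (Linux >= 4.19).

    Returns the raw state string ("on", "off", "forceoff",
    "notsupported", ...), or None if the file is missing
    (older kernel or non-Linux system).
    """
    try:
        return Path(control_path).read_text().strip()
    except OSError:
        return None

# Toggling without a reboot (as root):
#   echo off > /sys/devices/system/cpu/smt/control
#   echo on  > /sys/devices/system/cpu/smt/control
```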
Cheers,
-Brian

On Wednesday, September 25, 2019 at 6:41:24 PM UTC-7, Dmitry Babokin wrote:
>
> I'm attaching perf measurements on an SKX machine for avx2-i32x8,
> avx2-i32x16, avx512skx-i32x8, and avx512skx-i32x16 targets.
>
> Note that some benchmarks behave better with "double pumped" targets
> (i.e. avx2-i32x16) than with targets matching the native architecture
> width. So it makes sense to have a closer look at individual benchmarks,
> rather than just looking at the geomean.
>
> Also, don't pay attention to "ISPC+tasks", as the examples are not really
> tuned for that wide a machine (my machine has 2x28 cores with HT enabled,
> i.e. 112 virtual cores). So these runs basically don't have enough work
> for that many cores.
>
> Raw speedup geomean (against the clang-8 compiler) is:
> avx2-i32x8: 6.85
> avx2-i32x16: 7.33
> avx512skx-i32x8: 7.13
> avx512skx-i32x16: 9.18
>
> Dmitry.
>
> On Wed, Sep 25, 2019 at 3:56 PM Dmitry Babokin <[email protected]> wrote:
>
>> Hi,
>>
>> We haven't updated the performance numbers for a while, thanks for
>> pointing this out.
>>
>> I'll make measurements on the machine that I have and will post the
>> results here. And we'll update the "official" numbers a bit later.
>>
>> AVX512 is indeed an ideal target for ISPC, though a few factors need to
>> be taken into account. AVX512 triggers lower frequencies than AVX2, which
>> contributes to a less-than-expected scaling factor when going to 16 lanes.
>>
>> We also have an AVX512 8-wide target, which doesn't trigger the
>> frequency problem.
>>
>> Right now the most common platform with AVX512 is the server Skylake
>> platform (Purley).
>>
>> AVX512 is also available on the client in Ice Lake chips (since
>> recently), but unfortunately I don't have such machines within my reach.
>> It would be very interesting to experiment with their performance. But
>> they are *client* chips, which means they have fewer AVX execution units
>> than server parts.
>>
>> I'll run tests on the Skylake server and post the results here.
>>
>> Dmitry.
>>
>> On Wed, Sep 25, 2019 at 3:43 PM bb3141 <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> the speedup of ISPC-vectorized code is very impressive, and for SSE and
>>> AVX, many of the examples show near-ideal 4x and 8x scaling with the
>>> SIMD width.
>>>
>>> So I'm very interested in the results of the 16-wide AVX512 for the
>>> common ISPC examples (like aobench and the ray tracer).
>>> In theory, AVX512 should be the ideal hardware for ISPC, with a
>>> potential sixteen-times speedup (the only limiting factor would be
>>> memory bandwidth).
>>>
>>> Am I missing something, or are the AVX512 results not reported in the
>>> "performance" paper?
>>>
>>> Does anyone have numbers that compare the speedup (esp. with respect
>>> to 8-wide AVX2), or has anyone built and run the examples on an
>>> AVX512-capable machine?
>>>
>>> Thanks & Regards,
>>> bb3141
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Intel SPMD Program Compiler Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/ispc-users/7405df17-7632-4b1e-bd8a-5da48ca8a8f9%40googlegroups.com
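(A side note on reading the numbers above: the geomean figures Dmitry reported already make the frequency point concrete. Stepping from avx512skx-i32x8 (7.13) to avx512skx-i32x16 (9.18) buys only about a 1.29x improvement, well short of the ideal 2x from doubling the lanes. A small sketch; the per-benchmark speedups in the geomean demo are hypothetical, while the 7.13 and 9.18 figures come from the thread:)

```python
import math

def geomean(xs):
    """Geometric mean: exp of the arithmetic mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-benchmark speedups: the geomean can hide a wide
# spread (2x vs 8x both report as 4x), which is why individual
# benchmarks deserve a closer look than the aggregate.
print(geomean([2.0, 8.0]))   # ~4.0

# Step from the 8-wide to the 16-wide AVX-512 target, using the
# reported geomeans:
print(9.18 / 7.13)           # ~1.29x, vs. an ideal 2x
```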
