On 7/17/21 12:08 AM, Wes McKinney wrote:
hi folks, I had a conversation with the developers of xsimd last week in Paris and was made aware that they are working on a substantial refactor of xsimd to improve its usability for cross-compilation and dynamic dispatch based on runtime processor capabilities. The branch with the refactor is located here:

https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring

In particular, the SIMD batch API is changing from

    template <class T, size_t N> class batch;

to

    template <class T, class arch> class batch;

So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch, you would write xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g. neon/neon64 for ARM ISAs) and then access the batch size through the batch::size static property.
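[Editor's note: a minimal sketch of what client code might look like under the new API, assuming the xsimd::arch::avx512 tag name from the branch described above; the exact names may differ in the final refactor.]

    #include <xsimd/xsimd.hpp>
    #include <cstdint>

    // Old API: the number of lanes is a template parameter
    //   xsimd::batch<uint32_t, 16> v(42u);   // 16 x uint32_t == 512 bits
    //
    // New API: the target architecture is the template parameter and the
    // lane count is derived from it via the batch::size static property
    using batch_avx512 = xsimd::batch<uint32_t, xsimd::arch::avx512>;

    batch_avx512 v(42u);   // broadcast 42 to all lanes

    static_assert(batch_avx512::size == 16,
                  "an AVX512 register holds 16 uint32_t lanes");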
Adding this 'arch' parameter is a bit strange at first glance, given that the purpose of a SIMD wrapper is to hide architecture-dependent code. But since the latest SIMD ISAs (SVE, AVX512) offer much richer features than simply widening the data width, it looks like architecture-aware code is a must.
I think this change won't cause trouble for existing xsimd client code.
A few comments for discussion / investigation:

* Firstly, we will have to prepare ourselves to migrate to this new API in the future.

* At some point, we will likely want to generate SIMD variants of our C++ math kernels usable via dynamic dispatch for each different CPU support level. It would be beneficial to author as much code as possible in an ISA-independent fashion that can be cross-compiled to generate binary code for each ISA. We should investigate whether the new approach in xsimd will provide what we need or if we need to take a different approach.

* We have some of our own dynamic dispatch code to enable runtime function pointer selection based on available SIMD levels. Can we benefit from any of the work that is happening in this xsimd refactor?
I think they have some overlap. Runtime dispatch at the xsimd level (per SIMD code block) looks better than dispatch at the kernel level, IIUC.
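[Editor's note: to make the comparison concrete, a hypothetical sketch of kernel-level dispatch built on the arch-parameterized API: the kernel body is written once against a generic Arch tag, instantiated per ISA, and selected through a function pointer at runtime. The arch tag names (xsimd::arch::avx512, avx2, sse2) follow the naming described in the branch above, and cpu_has_avx512() / cpu_has_avx2() stand in for whatever CPU-feature detection Arrow already has; none of this is the final xsimd API.]

    #include <xsimd/xsimd.hpp>
    #include <cstddef>

    // Placeholders for Arrow's runtime CPU feature checks (hypothetical).
    bool cpu_has_avx512();
    bool cpu_has_avx2();

    // ISA-independent kernel body, written once against a generic Arch tag.
    template <class Arch>
    float sum_f32(const float* in, std::size_t n) {
        using batch = xsimd::batch<float, Arch>;
        constexpr std::size_t lanes = batch::size;
        batch acc(0.0f);
        std::size_t i = 0;
        for (; i + lanes <= n; i += lanes) {
            acc += batch::load_unaligned(in + i);   // vectorized partial sums
        }
        float total = xsimd::hadd(acc);             // horizontal reduction
        for (; i < n; ++i) {                        // scalar tail
            total += in[i];
        }
        return total;
    }

    // Runtime selection of the pre-compiled instantiations, in the spirit of
    // Arrow's existing function-pointer dispatch.
    using sum_fn = float (*)(const float*, std::size_t);

    sum_fn pick_sum_kernel() {
        if (cpu_has_avx512()) return &sum_f32<xsimd::arch::avx512>;
        if (cpu_has_avx2())   return &sum_f32<xsimd::arch::avx2>;
        return &sum_f32<xsimd::arch::sse2>;
    }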
* We have some compute code (e.g. hash tables for aggregation / joins) that uses explicit AVX2 intrinsics. Can some of this code be ported to use generic xsimd APIs, or will we need a different fundamental algorithm design to yield maximum efficiency for each SIMD ISA?

Thanks,
Wes