On 7/17/21 12:08 AM, Wes McKinney wrote:
hi folks, I had a conversation with the developers of xsimd last week in Paris and was made aware that they are working on a substantial refactor of xsimd to improve its usability for cross-compilation and dynamic dispatch based on runtime processor capabilities. The branch with the refactor is located here:

https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring

In particular, the SIMD batch API is changing from

    template <class T, size_t N> class batch;

to

    template <class T, class arch> class batch;

So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch, you would write xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g. neon/neon64 for ARM ISAs) and then access the batch size through the batch::size static property.
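[Editor's note: a minimal sketch of what client code might look like under the new API, assuming the xsimd::arch::avx512 tag name from the branch described above; the exact names may differ in the final refactor.]

    #include <xsimd/xsimd.hpp>
    #include <cstdint>

    // Old API: the number of lanes is a template parameter
    //   xsimd::batch<uint32_t, 16> v(42u);   // 16 x uint32_t == 512 bits
    //
    // New API: the target architecture is the template parameter and the
    // lane count is derived from it via the batch::size static property
    using batch_avx512 = xsimd::batch<uint32_t, xsimd::arch::avx512>;

    batch_avx512 v(42u);   // broadcast 42 to all lanes

    static_assert(batch_avx512::size == 16,
                  "an AVX512 register holds 16 uint32_t lanes");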
Adding this 'arch' parameter is a bit strange at first glance, given that the purpose of a SIMD wrapper is to hide architecture-dependent code. But since the latest SIMD ISAs (SVE, AVX512) offer much richer features than simply widening the data width, it looks like architecture-aware code is a must.
I think this change won't cause trouble for existing xsimd client code.
A few comments for discussion / investigation:

* Firstly, we will have to prepare ourselves to migrate to this new API in the future.

* At some point, we will likely want to generate SIMD variants of our C++ math kernels usable via dynamic dispatch for each different CPU support level. It would be beneficial to author as much code as possible in an ISA-independent fashion that can be cross-compiled to generate binary code for each ISA. We should investigate whether the new approach in xsimd will provide what we need or if we need to take a different approach.

* We have some of our own dynamic dispatch code to enable runtime function pointer selection based on available SIMD levels. Can we benefit from any of the work that is happening in this xsimd refactor?
I think they have some overlap. Runtime dispatch at the xsimd level (per SIMD code block) looks better than dispatch at the kernel level, IIUC.
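[Editor's note: to make the comparison concrete, a hypothetical sketch of kernel-level dispatch built on the arch-parameterized API: the kernel body is written once against a generic Arch tag, instantiated per ISA, and selected through a function pointer at runtime. The arch tag names (xsimd::arch::avx512, avx2, sse2) follow the naming described in the branch above, and cpu_has_avx512() / cpu_has_avx2() stand in for whatever CPU-feature detection Arrow already has; none of this is the final xsimd API.]

    #include <xsimd/xsimd.hpp>
    #include <cstddef>

    // Placeholders for Arrow's runtime CPU feature checks (hypothetical).
    bool cpu_has_avx512();
    bool cpu_has_avx2();

    // ISA-independent kernel body, written once against a generic Arch tag.
    template <class Arch>
    float sum_f32(const float* in, std::size_t n) {
        using batch = xsimd::batch<float, Arch>;
        constexpr std::size_t lanes = batch::size;
        batch acc(0.0f);
        std::size_t i = 0;
        for (; i + lanes <= n; i += lanes) {
            acc += batch::load_unaligned(in + i);   // vectorized partial sums
        }
        float total = xsimd::hadd(acc);             // horizontal reduction
        for (; i < n; ++i) {                        // scalar tail
            total += in[i];
        }
        return total;
    }

    // Runtime selection of the pre-compiled instantiations, in the spirit of
    // Arrow's existing function-pointer dispatch.
    using sum_fn = float (*)(const float*, std::size_t);

    sum_fn pick_sum_kernel() {
        if (cpu_has_avx512()) return &sum_f32<xsimd::arch::avx512>;
        if (cpu_has_avx2())   return &sum_f32<xsimd::arch::avx2>;
        return &sum_f32<xsimd::arch::sse2>;
    }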
* We have some compute code (e.g. hash tables for aggregation / joins) that uses explicit AVX2 intrinsics. Can some of this code be ported to use generic xsimd APIs, or will we need a different fundamental algorithm design to yield maximum efficiency for each SIMD ISA?

Thanks,
Wes