https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86267

Bug ID: 86267
Summary: detect conversions between bitmasks and vector masks
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kretz at kde dot org
Target Milestone: ---

Testcase (cf. https://godbolt.org/g/gi6f7V):

#include <x86intrin.h>

auto f(__m256i a, __m256i b) {
  __m256i k = a < b;
  long long bitmask = _mm256_movemask_pd((__m256d)k) & 0xf;
  return _mm256_cmpgt_epi64(
    __m256i{bitmask, bitmask, bitmask, bitmask} & __m256i{1, 2, 4, 8},
    __m256i()
  );
}

This should be optimized to "return a < b;".

A more complex case also allows conversion of the vector mask
(cf. https://godbolt.org/g/FLAEgC):

#include <x86intrin.h>

auto f(__m256i a, __m256i b) {
  using V [[gnu::vector_size(16)]] = int;
  __m256i k = a < b;
  int bitmask = _mm256_movemask_pd((__m256d)k) & 0xf;
  return (V{bitmask, bitmask, bitmask, bitmask} & V{1, 2, 4, 8}) != 0;
}

I believe the most portable and readable strategy would be to introduce new
builtins that convert between bitmasks and vector masks. (This can be
especially helpful with AVX512, where the builtin comparison operators return
vector masks, but Intel intrinsics require bitmasks.)

E.g.:

using W [[gnu::vector_size(32)]] = long long;
using V [[gnu::vector_size(16)]] = int;

V f(W a, W b) {
  unsigned bitmask = __builtin_vector_to_bitmask(a < b);
  return __builtin_bitmask_to_vector(bitmask, V);
}

I'd define __builtin_vector_to_bitmask to only consider the MSB of each
element. And, to make optimization simpler, consider all remaining input bits
to be whatever the canonical mask representation on the target system is.
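
For illustration only, the semantics proposed above can be sketched as a portable scalar reference using the GNU vector extensions. The names vector_to_bitmask_ref, bitmask_to_vector_ref and convert_mask are hypothetical stand-ins, not existing GCC builtins, and the choice of all-ones (-1) for a set lane is an assumption about the canonical mask representation (it matches what the builtin comparison operators produce):

```cpp
using W [[gnu::vector_size(32)]] = long long;  // 4 lanes of 64 bits
using V [[gnu::vector_size(16)]] = int;        // 4 lanes of 32 bits

// Hypothetical reference for the proposed __builtin_vector_to_bitmask:
// bit i of the result is the MSB (sign bit) of lane i; other bits are ignored.
unsigned vector_to_bitmask_ref(const W &v) {
  unsigned bits = 0;
  for (int i = 0; i < 4; ++i)
    if (v[i] < 0)  // a negative signed lane has its MSB set
      bits |= 1u << i;
  return bits;
}

// Hypothetical reference for the proposed __builtin_bitmask_to_vector:
// lane i becomes all-ones if bit i is set, else zero (assumed canonical mask).
V bitmask_to_vector_ref(unsigned bits) {
  V r = {0, 0, 0, 0};
  for (int i = 0; i < 4; ++i)
    r[i] = ((bits >> i) & 1u) ? -1 : 0;
  return r;
}

// Mirrors the shape of f() in the report: a 4 x 64-bit comparison result
// is converted to a 4 x 32-bit vector mask via an intermediate bitmask.
V convert_mask(const W &a, const W &b) {
  W k = a < b;  // lanes are -1 where a < b, 0 otherwise
  return bitmask_to_vector_ref(vector_to_bitmask_ref(k));
}
```

With a = {1, 5, -3, 7} and b = {2, 4, 0, 7}, lanes 0 and 2 compare true, so the intermediate bitmask is 0x5 and the resulting V is {-1, 0, -1, 0}. An optimizer that recognized this round trip could, in principle, replace it with a direct narrowing of the comparison result.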