[Bug target/100627] missing optimization

2021-05-21 Thread g.peterhoff--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100627

--- Comment #2 from g.peterh...@t-online.de ---
Hello,
I found a better solution here:
https://stackoverflow.com/questions/41144668/how-to-efficiently-perform-double-int64-conversions-with-sse-avx
and ported it to "normal" C++ code (no intrinsics):
https://godbolt.org/z/scjEdze99. This has these advantages:
- it is constexpr
- it is flexible: it can be vectorized (autovectorization)

These implementations require C++20 (std::bit_cast and constexpr std::exp2),
but can easily be adapted to older C++ versions. Possibly this trick can
also be used for s/uint64 -> float32, which would save the detour s/uint64 ->
float64 -> float32.

However, I have observed:
- with -march=skylake-avx512 alone, no AVX512 code is generated
- AVX512 code is generated only with -march=skylake-avx512
-mprefer-vector-width=512 or -mavx512f -mavx512dq -mavx512vl
- for s/uint64 -> float32, no correct AVX512 code is generated either
(_mm512_cvtepi64_ps, _mm512_cvtepu64_ps)
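
To try the flag combinations listed above, a minimal sketch of the kind of
conversion loops involved may help. The function names match those in the
assembly listings in this thread, but the bodies and the element count N = 16
are assumptions (16 matches the 128 bytes read per input array in the aarch64
listings), not the code from the godbolt link.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed element count; the original test case may use a different size.
inline constexpr std::size_t N = 16;

// Compile for example with:
//   g++ -O3 -std=c++20 -march=skylake-avx512 -S
//   g++ -O3 -std=c++20 -march=skylake-avx512 -mprefer-vector-width=512 -S
// and compare the vector width of the generated conversion instructions.

// uint64 -> float64, plain static_cast left to the autovectorizer.
void cvt_f64_std(std::array<double, N>& dst,
                 std::array<std::uint64_t, N> const& src) {
    for (std::size_t i = 0; i < N; ++i)
        dst[i] = static_cast<double>(src[i]);
}

// uint64 -> float32, plain static_cast left to the autovectorizer.
void cvt_f32_std(std::array<float, N>& dst,
                 std::array<std::uint64_t, N> const& src) {
    for (std::size_t i = 0; i < N; ++i)
        dst[i] = static_cast<float>(src[i]);
}
```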

thx
Gero

[Bug target/100627] missing optimization

2021-05-16 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100627

--- Comment #1 from Andrew Pinski ---
This is a target issue concerning how uint64_t -> float/double conversions are
done.
On aarch64 for cvt_f64_std we get good code at -O3:
cvt_f64_std(std::array&, std::array const&):
        ldp     q7, q6, [x1]
        ldp     q5, q4, [x1, 32]
        ldp     q3, q2, [x1, 64]
        ldp     q1, q0, [x1, 96]
        ucvtf   v7.2d, v7.2d
        ucvtf   v6.2d, v6.2d
        ucvtf   v5.2d, v5.2d
        ucvtf   v4.2d, v4.2d
        ucvtf   v3.2d, v3.2d
        ucvtf   v2.2d, v2.2d
        stp     q7, q6, [x0]
        ucvtf   v1.2d, v1.2d
        stp     q5, q4, [x0, 32]
        ucvtf   v0.2d, v0.2d
        stp     q3, q2, [x0, 64]
        stp     q1, q0, [x0, 96]
        ret

The other function is converted one element at a time (scalar ucvtf plus ins):
cvt_f32_std(std::array&, std::array const&):
        ldp     x3, x2, [x1]
        ucvtf   s7, x2
        ucvtf   s3, x3
        ldp     x3, x2, [x1, 32]
        ins     v3.s[1], v7.s[0]
        ucvtf   s6, x2
        ucvtf   s2, x3
        ldp     x3, x2, [x1, 64]
        ins     v2.s[1], v6.s[0]
        ucvtf   s5, x2
        ucvtf   s1, x3
        ldp     x3, x2, [x1, 96]
        ins     v1.s[1], v5.s[0]
        ucvtf   s4, x2
        ucvtf   s0, x3
        ldr     x2, [x1, 48]
        ldr     x3, [x1, 16]
        ucvtf   s17, x2
        ldr     x2, [x1, 112]
        ucvtf   s18, x3
        ldr     x3, [x1, 80]
        ins     v0.s[1], v4.s[0]
        ucvtf   s4, x2
        ucvtf   s16, x3
        ldr     x2, [x1, 24]
        ldr     x3, [x1, 56]
        ucvtf   s7, x2
        ldr     x2, [x1, 88]
        ucvtf   s6, x3
        ldr     x1, [x1, 120]
        ucvtf   s5, x2
        ins     v3.s[2], v18.s[0]
        ins     v2.s[2], v17.s[0]
        ins     v1.s[2], v16.s[0]
        ins     v0.s[2], v4.s[0]
        ucvtf   s4, x1
        ins     v3.s[3], v7.s[0]
        ins     v2.s[3], v6.s[0]
        ins     v1.s[3], v5.s[0]
        ins     v0.s[3], v4.s[0]
        stp     q3, q2, [x0]
        stp     q1, q0, [x0, 32]
        ret