| Field | Value |
| --- | --- |
| Issue | 110308 |
| Summary | [AVX-512] clz(32 x u8) and clz(64 x u8) should use an algorithm similar to avx2 |
| Labels | new issue |
| Assignees | |
| Reporter | Validark |

This code:
```zig
export fn foo(x: @Vector(32, u8)) @TypeOf(x) {
    return @clz(x);
}
```
The LLVM IR version:
```llvm
define dso_local range(i8 0, 9) <32 x i8> @foo(<32 x i8> %0) local_unnamed_addr {
Entry:
  %1 = tail call range(i8 0, 9) <32 x i8> @llvm.ctlz.v32i8(<32 x i8> %0, i1 false)
  ret <32 x i8> %1
}

declare <32 x i8> @llvm.ctlz.v32i8(<32 x i8>, i1 immarg) #1
```
This results in the following assembly for Zen 4:
```asm
.LCPI0_0:
        .zero   32,24
foo:
        vpmovzxbd       zmm1, xmm0
        vextracti128    xmm0, ymm0, 1
        vpmovzxbd       zmm0, xmm0
        vplzcntd        zmm1, zmm1
        vplzcntd        zmm0, zmm0
        vpmovdb xmm1, zmm1
        vpmovdb xmm0, zmm0
        vinserti128     ymm0, ymm1, xmm0, 1
        vpsubb  ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        ret
```
LLVM-mca claims this Zen 4 sequence should take ~23 cycles per iteration.
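To make the cost easier to reason about, here is a scalar sketch of what one byte lane of the Zen 4 listing computes: zero-extend to 32 bits, count leading zeros there, and subtract the 24 padding bits (the `.zero 32,24` constant). The function name `clzViaWiden` is made up for illustration; this is a model of the arithmetic, not of LLVM's lowering code.

```zig
const std = @import("std");

/// Scalar model of one lane: vpmovzxbd (zero-extend), vplzcntd (32-bit clz),
/// then vpsubb with a splat of 24 to remove the padding bits.
/// `clzViaWiden` is a hypothetical name used only for this sketch.
fn clzViaWiden(x: u8) u8 {
    const wide: u32 = x; // zero-extension to 32 bits
    const count: u8 = @clz(wide); // always in 24..32, so it fits in a byte
    return count - 24;
}

test "widened clz minus 24 matches the 8-bit clz" {
    var i: u16 = 0;
    while (i < 256) : (i += 1) {
        const x: u8 = @intCast(i);
        try std.testing.expectEqual(@as(u8, @clz(x)), clzViaWiden(x));
    }
}
```

The result is correct, but a 32-byte input has to be split into two 512-bit halves, widened, and narrowed again.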
The ~23-cycle estimate is pretty unfortunate, because if we downgrade to Zen 3, we get:
```asm
.LCPI0_1:
        .zero   32,15
.LCPI0_2:
        .byte   4
        .byte   3
        .byte   2
        .byte   2
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
foo:
        vbroadcasti128  ymm1, xmmword ptr [rip + .LCPI0_2]
        vpxor   xmm3, xmm3, xmm3
        vpshufb ymm2, ymm1, ymm0
        vpsrlw  ymm0, ymm0, 4
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
        vpcmpeqb        ymm3, ymm0, ymm3
        vpshufb ymm0, ymm1, ymm0
        vpand   ymm2, ymm2, ymm3
        vpaddb  ymm0, ymm2, ymm0
        ret
```
LLVM-mca says Zen 3 should be able to compute this in ~11 cycles per iteration.
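For readers who do not want to trace the listing, here is a scalar sketch of what one byte lane of the Zen 3 sequence computes, as I read it. The names `nibble_clz` and `clzViaNibbleLut` are made up for illustration. Note that the mask comes from the high nibble being zero (the `vpcmpeqb` runs after the shift and `vpand`), not from the whole byte being zero.

```zig
const std = @import("std");

// Leading-zero count of each 4-bit value (the .LCPI0_2 table).
const nibble_clz = [16]u8{ 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

/// Scalar model of one byte lane; `clzViaNibbleLut` is a hypothetical name.
fn clzViaNibbleLut(x: u8) u8 {
    // vpshufb yields zero when bit 7 of the index is set, else table[index & 15].
    const lo: u8 = if ((x & 0x80) != 0) 0 else nibble_clz[x & 0x0F];
    const hi_nibble = x >> 4; // vpsrlw by 4, then vpand with 15
    const hi = nibble_clz[hi_nibble]; // second vpshufb
    // vpcmpeqb + vpand: keep the low-nibble count only when the high nibble is zero.
    const masked_lo: u8 = if (hi_nibble == 0) lo else 0;
    return masked_lo + hi; // vpaddb
}

test "nibble LUT clz matches @clz for every byte" {
    var i: u16 = 0;
    while (i < 256) : (i += 1) {
        const x: u8 = @intCast(i);
        try std.testing.expectEqual(@as(u8, @clz(x)), clzViaNibbleLut(x));
    }
}
```

Everything here stays at byte width, so no widening or narrowing is needed.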
We can reproduce the Zen 3 sequence in Zig like so:
```zig
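export fn foo2(x: @Vector(32, u8)) @TypeOf(x) {
    // Leading-zero count of each nibble value, repeated for every 128-bit lane.
    const vec: @TypeOf(x) = comptime std.simd.repeat(@sizeOf(@TypeOf(x)), [16]u8{4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0});
    const hi = x >> @splat(4);
    // The low-nibble count only applies where the high nibble is zero (the `vpcmpeqb` in the Zen 3 listing).
    return @select(u8, hi == @as(@TypeOf(x), @splat(0)), vpshufb(vec, x), @as(@TypeOf(x), @splat(0))) + vpshufb(vec, hi);
}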
```
Which gives us functionally equivalent assembly, albeit reordered. LLVM-mca says that we can 10 cycles per iteration with the instructions reordered. Not sure if there is anything to that. [Godbolt link here](https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgAzKVpMGoAF55gpJfWQE8Ayo3QBhVLQCuLBiABMpJwAyeAyYAHKeAEaYxBJcpAAOqAqE9gyuHl4SCUkpAkEh4SxRMQAcVpg2dgJCBEzEBOme3n7WmLapNXUE%2BWGR0bFWtfWNmXEKQ93BvUX9JQCUVqjuxMjsHGgM4wDU4%2BhbAKR6ACJbAAJ4LIn1EPs%2BPru3PnOHAEL7GgCCG9sR7nR2DAOxzOFyuBBud1%2B/2Cj2eejen3eHyogP4qAg6jOADU2kRiBA9H4tu55nMzgAVACe8UwAHkqBiyfsAOwIj5bDlbYiYAjLQGnZC0UyM15IllHMWfTCqMFbFFy1CoHwYkDY3EkAlEklzMmnKk0%2BmMg6spGcrbfAhbABubVVeupdIZqiZwLQlzsbB2BHQADpkixfdyaUxwadkqZHRB7QanTrSAcAKwvLgANn2CaOJJZL2kBj8fjihdIcQ0pFL5bLlfL4rhbLN3N5xH5rVsEBJ8dUQIlwNOTAUUf1ked8bD8SM4I0cet8QUCHcVAiEBtyA7CzOfYHDsNw7OCjHIYgk51Bx8L2ns/ni%2BXHaB2EO2F3%2B/Bkh1osRzIliM%2B8qtM7nC4gWoInELZjEpAgHXjYJ8FWBQ7UHQ0gPoY9o0jJDMCZE1PjNPAqC2KNglcd0LkwCBj2zU0zU5K06i5TAFHcWgCHgrcGWgvBYJdE53AYLAaBCdA33ZKiOX4Yh8I0H0fVOCJCCEPAI0NVDDXYziyQAei2eZjWcPAWWcY06xEqiLS2aDpS7MzeI4%2Bj0xePSMyE4yqO5BimLshyTkOE5cPw8zO3vbytg0VQSknYKtgqJQtnQjzeIs24E3wmS5IUyNlIZdDj00%2BZ027IyRPFSVhJEhs%2BTotyCCczkiq/ErzQEbY2AIBAMAUSzxmIdxbEMyiROlAholRflHloWgrRYH1QpTH0mCtVQEy4HwfT/ecfQiH1Fp8R4oxxWwNRTaRiXmEc9rxCBDvjbVdTOg6ju1ar%2BtUQamzlEa7jGiappKGa5tUZbVqodadtOW78UJK6TrVfbwa1UlofOiHjtrPqqIGoa3rOUbxsm6a/QUJQ9BWi8gY2paShBsGIFTSG11B9V8Rp5GboZ6mU1px7jXy4r6x5cqFAAd0IZAEGShDMqYYCMMwgqqPpmGLvu7TvPvMy8IgN14g9TAvV9ck6mAHlvpmqhMBDZZMCEHkAAkNyhJjgh9ZB4ncH1TfN1z41m%2BatoiAXj2a1r0AUaTsa%2BvG/q24n/3Wzalp29CoOszjItoaKBVQS46BwYhiA1R5nGUOQYoNnktiMZAAGt2oY%2BJZTE89/wiLbYVIVGzXlxG4a41XfI1zOtZI3WfX14hDYIY23bNxtLZtu2/gdhgnZdqePfor2/qeMlA7a0OPpxyfN%2BjtaIgTyX6CTmD6LJKKdYzrP6GwXP87uQvi6Gcfy6YKua/cOuSEtA3X8JMIg%2BATGmO4Cx26ck7hqJm2ouy93VprbWw9R7j0nu7GeVsCC237PbAEy9XZYItgoL2BNCYBx5EHEOpww64x%2BvjQmx9SY%2BnJmfKWl8bIKBvmnO%2Bmts5PzzviAuRcS5jzLhXauOw/71xII3S87DIFt2wsZW%2BiC9APnvvEQRz98S7D
diwCeKCSLKGIMEcEjwACSDAaK0DwHsA2nhGCWggjSLY8Q%2BxKD2EQBRC5VQUQ/OmZwDBHhe2zBlQC58MI1igaojkRV4Rig/BwBYtBOAJl4N4bgvBUCcAAFoWB2EsFYOtbh6D0LwZiHAtBxgQGbLAMQyKkEriARaPo9AlB8CmLgABOPQKYfDMikIYTgkheAsAkBoUs2StCkDyRwXgcFSzVNqaQOAsAYCIBQAPbOZAKD9wfv0UwBAuoMErnwf40Q4IQAiJoXgslmDEEpJwHgpBHl1EpLSCI2hcSvN4G6NgghaQMFoC8mpvAsC/GAM4MQad/mkCwCwYwwBxAQsRXgbk7QbRwXRdKNo7hBoIosRUe5hg8ARGIJ81wWAyWnIuAim0xAIhJEwEcTAyKTD2JMPchYVAjDAAUFiPAmABa0hpNkt5/BBAiDEOwEZ0r5BKDUGS3QcQjA8vMJYexEQ4KQAWKgQejVOAAFpaRbAAEoVDNkoAAYn2S0JqP5l2mgAfUOia7l7hOwmpYM7dw3lTC2OiJU%2BZTLzFYD1c0lsVRvAQCcCMbwcRAhTEKMULIiRkixsTRmnIsaehpv6GMa17RqgTBzcWyoHQJgFr6DEMY5a3BNCyOMLotaZj1oWAoEpqx9BpIyVkslCythapimcyu%2BFcCEHkeUvQcwqm8oWPUpgjTKALFaZIEoPoenMk6RoFMehmQ%2BB%2BhodVYyJltJmUOzgSyQArN5esrZEAkBEDcOQSgBtcW8E/coYwFQhCtQFpKgFuz6DEFCKwNYo7Tk8UrrwTA%2BA8QOP0DIGVohxAKtkIoFQ6h0VqtICwAQ39UBTrxHgl4mBGAfEJagcVjAEMIsI8wNApGSB0eAwRojaAaj4A47sYIN7Bi8d/SEWgAHUBAYRa%2B2gVj0AgE%2BiwX1ogYPnNIALKl8R/n9o4Jk0gszcmcAU0ppgI6ikqYnRAVj4lZ3xiIjosDJ49BcHnaQVZqSl0NP6M09JHBxkEcvXp69iyrB3tc4ulpIBN3bq4LusKB6j0nrPRwUN%2Bn5mCbc3GHzPhB3ooWQuiFcYmXJAcJIIAA%3D%3D)
Upgrading back to Zen 4, `foo2` gives us:
```asm
.LCPI0_2:
        .byte   4
        .byte   3
        .byte   2
        .byte   2
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
.LCPI0_3:
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   128
        .byte   64
        .byte   32
        .byte   16
foo2:
        vbroadcasti128  ymm1, xmmword ptr [rip + .LCPI0_2]
        vptestnmb       k1, ymm0, ymm0
        vpshufb ymm2, ymm1, ymm0
        vgf2p8affineqb  ymm0, ymm0, qword ptr [rip + .LCPI0_3]{1to4}, 0
        vpshufb ymm0, ymm1, ymm0
        vpaddb  ymm0 {k1}, ymm0, ymm2
        ret
```
LLVM-mca says this takes ~12 cycles per iteration, which, I notice, is higher than the ~10 cycles `foo2` gets on Zen 3.
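The `vgf2p8affineqb` with the `.LCPI0_3` matrix is standing in for the per-byte shift right by 4. Below is a scalar sketch of the instruction for a single byte, following my reading of Intel's pseudocode (result bit `i` is the parity of matrix byte `7 - i` ANDed with the input, XORed with bit `i` of the immediate). The name `gf2p8affine` is made up for illustration.

```zig
const std = @import("std");

/// Scalar model of GF2P8AFFINEQB for one byte (hypothetical helper name).
fn gf2p8affine(x: u8, matrix: [8]u8, imm: u8) u8 {
    var result: u8 = 0;
    var i: u4 = 0;
    while (i < 8) : (i += 1) {
        const bit: u3 = @intCast(i);
        const row = matrix[7 - i]; // matrix bytes are consumed in reverse order
        const parity: u8 = @popCount(row & x) & 1;
        const imm_bit: u8 = (imm >> bit) & 1;
        result |= (parity ^ imm_bit) << bit;
    }
    return result;
}

// The .LCPI0_3 constant from the listing, lowest-addressed byte first.
const shr4_matrix = [8]u8{ 0, 0, 0, 0, 128, 64, 32, 16 };

test "the .LCPI0_3 matrix is a per-byte shift right by 4" {
    var i: u16 = 0;
    while (i < 256) : (i += 1) {
        const x: u8 = @intCast(i);
        try std.testing.expectEqual(x >> 4, gf2p8affine(x, shr4_matrix, 0));
    }
}
```

So the instruction replaces the `vpsrlw` + `vpand` pair from the Zen 3 listing with a single operation that cannot leak bits across byte boundaries.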
Not sure what the best course of action is, but probably one of these things should be done:
1. The `foo2` implementation above should be used to implement `@clz(32 x u8)` and `@clz(64 x u8)`.
2. The Zen 3 implementation should be lifted directly to Zen 4.
Thank you!