| Field | Value |
| --- | --- |
| Issue | 109122 |
| Summary | [Aarch64] `clz` on a vector of 2 x u64 should be better optimized |
| Labels | new issue |
| Assignees | |
| Reporter | Validark |

This code ([Godbolt link](https://zig.godbolt.org/z/4j538eG1P)):
```zig
export fn clz(x: @Vector(2, u64)) @Vector(2, u64) {
    return @clz(x);
}
```
gives me this assembly for the Apple M3:
```asm
clz:
ushr v1.2d, v0.2d, #1
orr v0.16b, v0.16b, v1.16b
ushr v1.2d, v0.2d, #2
orr v0.16b, v0.16b, v1.16b
ushr v1.2d, v0.2d, #4
orr v0.16b, v0.16b, v1.16b
ushr v1.2d, v0.2d, #8
orr v0.16b, v0.16b, v1.16b
ushr v1.2d, v0.2d, #16
orr v0.16b, v0.16b, v1.16b
ushr v1.2d, v0.2d, #32
orr v0.16b, v0.16b, v1.16b
mvn v0.16b, v0.16b
cnt v0.16b, v0.16b
uaddlp v0.8h, v0.16b
uaddlp v0.4s, v0.8h
uaddlp v0.2d, v0.4s
ret
```
It seems to me we could combine `bitReverse` + `ctz` to get better output for `clz` on vectors where each element is a u64.
It's also conceivable that we could use `clz` at u32 granularity and then combine adjacent elements.
I think the lowering should look something like this:
```zig
export fn clz2(x: @Vector(2, u64)) @Vector(2, u64) {
    // Leading-zero count of each 32-bit half.
    const clz_with_u32_granularity: @Vector(4, u32) = @clz(@as(@Vector(4, u32), @bitCast(x)));
    // Per u64 lane: the count for the high half, shifted down into the low 32 bits.
    const base = @as(@Vector(2, u64), @bitCast(clz_with_u32_granularity)) >> @splat(32);
    // Keep the low half's count only where the high half is all zeros (count == 32).
    const mask = @select(u32, @as(@Vector(4, u32), @bitCast(base)) == @as(@Vector(4, u32), @splat(32)),
        clz_with_u32_granularity,
        @as(@Vector(4, u32), @splat(0)),
    );
    return base + @as(@Vector(2, u64), @bitCast(mask));
}
```
That gives us this assembly:
```asm
clz2:
clz v1.4s, v0.4s
ushr v0.2d, v1.2d, #32
movi v2.4s, #32
cmeq v0.4s, v0.4s, v2.4s
and v0.16b, v1.16b, v0.16b
usra v0.2d, v1.2d, #32
ret
```
Alternatively, the `usra` could probably have been an `add`.
Assuming I didn't mess anything up, Alive2 (via Z3) seems to confirm that this is a correct transformation: https://alive2.llvm.org/ce/z/878QXU
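
For a quick check that doesn't depend on the vector codegen, the per-lane arithmetic can also be modeled on scalars. This is only a sketch: `clz64ViaU32` is a hypothetical helper name (not from the compiler or the code above), and it assumes a recent Zig where `@truncate` infers its result type. The test also checks the `bitReverse` + `ctz` identity mentioned earlier.

```zig
const std = @import("std");

// Hypothetical scalar model of one u64 lane of `clz2`:
// count leading zeros using only 32-bit clz results.
fn clz64ViaU32(x: u64) u64 {
    const hi: u32 = @truncate(x >> 32);
    const lo: u32 = @truncate(x);
    const clz_hi: u64 = @clz(hi); // 0..32
    const clz_lo: u64 = @clz(lo); // 0..32
    // The low half only contributes when the high half is entirely zero.
    return clz_hi + (if (clz_hi == 32) clz_lo else 0);
}

test "clz64ViaU32 agrees with @clz" {
    const samples = [_]u64{
        0,
        1,
        0xFFFF_FFFF,
        0x1_0000_0000,
        0x8000_0000_0000_0000,
        std.math.maxInt(u64),
    };
    for (samples) |x| {
        const expected: u64 = @clz(x);
        try std.testing.expect(clz64ViaU32(x) == expected);
        // clz(x) == ctz(bitReverse(x)) for the same samples.
        try std.testing.expect(@as(u64, @ctz(@bitReverse(x))) == expected);
    }
}
```

Running this with `zig test` only exercises a handful of edge cases; it does not replace the Alive2 proof.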