[llvm-bugs] [Bug 109122] [Aarch64] `clz` on a vector of 2 x u64 should be better optimized

LLVM Bugs via llvm-bugs Wed, 18 Sep 2024 03:22:25 -0700

Issue	109122
Summary	[Aarch64] `clz` on a vector of 2 x u64 should be better optimized
Labels	new issue
Assignees
Reporter	Validark

This code ([Godbolt link](https://zig.godbolt.org/z/4j538eG1P)):

```zig
export fn clz(x: @Vector(2, u64)) @Vector(2, u64) {
    return @clz(x);
}
```


Gives me this output when targeting the Apple M3:

```asm
clz:
        ushr    v1.2d, v0.2d, #1
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #2
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #4
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #8
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #16
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #32
        orr     v0.16b, v0.16b, v1.16b
        mvn     v0.16b, v0.16b
        cnt     v0.16b, v0.16b
        uaddlp  v0.8h, v0.16b
        uaddlp  v0.4s, v0.8h
        uaddlp  v0.2d, v0.4s
        ret
```

It seems to me we could combine `bitReverse`+`ctz` to get better code for `clz` on vectors where each element is a u64.
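The identity behind that suggestion can be checked on scalars. The following is only a minimal sketch: `clzViaBitReverse` is a hypothetical helper name, and it shows that the two computations agree, not that the combination lowers to a better instruction sequence on AArch64.

```zig
const std = @import("std");

/// Hypothetical scalar helper: reversing the bits turns leading zeros into
/// trailing zeros, so counting trailing zeros afterwards yields clz.
fn clzViaBitReverse(x: u64) u7 {
    return @ctz(@bitReverse(x));
}

test "clz equals ctz of the bit-reversed value" {
    const samples = [_]u64{
        0,
        1,
        0xFFFF_FFFF,
        0x0000_0001_0000_0000,
        0x8000_0000_0000_0000,
        std.math.maxInt(u64),
    };
    for (samples) |x| {
        try std.testing.expectEqual(@clz(x), clzViaBitReverse(x));
    }
}
```

Running this with `zig test` compares the helper against the builtin `@clz` on a handful of edge cases, including zero, where both sides should report 64.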

It's also conceivable that we could use `clz` at u32 granularity and combine the counts of each adjacent pair of 32-bit elements.

I think it should do something like this:

```zig
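export fn clz2(x: @Vector(2, u64)) @Vector(2, u64) {
    // Count leading zeros in each 32-bit half independently.
    const clz_with_u32_granularity: @Vector(4, u32) = @clz(@as(@Vector(4, u32), @bitCast(x)));
    // With little-endian lanes, each u64 lane now holds the count of its upper half.
    const base = @as(@Vector(2, u64), @bitCast(clz_with_u32_granularity)) >> @splat(32);

    // Keep the lower half's count only where the upper half is all zeros (count == 32).
    const mask = @select(
        u32,
        @as(@Vector(4, u32), @bitCast(base)) == @as(@Vector(4, u32), @splat(32)),
        clz_with_u32_granularity,
        @as(@Vector(4, u32), @splat(0)),
    );

    return base + @as(@Vector(2, u64), @bitCast(mask));
}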
```

That gives us this assembly:

```asm
clz2:
        clz     v1.4s, v0.4s
        ushr    v0.2d, v1.2d, #32
        movi    v2.4s, #32
        cmeq    v0.4s, v0.4s, v2.4s
        and     v0.16b, v1.16b, v0.16b
        usra    v0.2d, v1.2d, #32
        ret
        ret
```

Alternatively, the `usra` could probably have been a plain `add` if the result of the earlier `ushr` were kept in its own register.

Assuming I didn't mess anything up, Alive2 (which uses Z3) seems to confirm that this is a correct transformation: https://alive2.llvm.org/ce/z/878QXU
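For intuition about why the rewrite holds, the per-lane arithmetic of `clz2` can be modelled on a single u64. This is only a sketch under that scalar simplification: `clz64FromHalves` is a hypothetical name, and the test spot-checks a few values rather than replacing the Alive2 proof.

```zig
const std = @import("std");

/// Hypothetical scalar model of one 64-bit lane of `clz2`.
fn clz64FromHalves(x: u64) u64 {
    const hi: u32 = @truncate(x >> 32);
    const lo: u32 = @truncate(x);

    // Corresponds to `base`: the leading-zero count of the upper half.
    const base: u64 = @clz(hi);
    // Corresponds to `mask`: the lower half's count only matters when the
    // upper half is entirely zero, i.e. when its count is exactly 32.
    const extra: u64 = if (base == 32) @clz(lo) else 0;

    return base + extra;
}

test "scalar model agrees with @clz on u64" {
    const samples = [_]u64{
        0,
        1,
        0xFFFF_FFFF,
        0x0000_0001_0000_0000,
        0x0000_8000_0000_0001,
        0x8000_0000_0000_0000,
        std.math.maxInt(u64),
    };
    for (samples) |x| {
        try std.testing.expectEqual(@as(u64, @clz(x)), clz64FromHalves(x));
    }
}
```

The two cases are the whole argument: if the upper half is nonzero, its count is already the answer and the lower half is ignored; if the upper half is zero, the answer is 32 plus the lower half's count.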

_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
