Issue 179584
Summary [RISC-V] suboptimal big integer comparisons
Labels new issue
Assignees
Reporter tom-rein
    When comparing integers wider than the register width on RISC-V targets with Zicond, useless extra `czero.eqz` instructions are emitted ([godbolt](https://godbolt.org/z/hqW4qGbKP)):
```C++
#include <cstddef>
#include <limits>

using I = std::size_t;
constexpr int digits = std::numeric_limits<I>::digits;
using L = unsigned _BitInt(2 * digits);

bool test_lt(L x, L y) {
    return x < y;
}
```
```
test_lt(unsigned _BitInt(128), unsigned _BitInt(128)):
        xor     a4, a1, a3
        sltu    a1, a1, a3
        sltu    a0, a0, a2
        czero.eqz       a1, a1, a4
        czero.nez       a0, a0, a4
        or      a0, a0, a1
        ret
```
`a1` is of course already zero whenever `a4` is zero, so the `czero.eqz` is redundant.

When there is a branch based on the comparison, the branch and the `or` can be combined: because the operands of the `or` are never both nonzero, a `beq`/`bne` can be used instead of `or` + `beqz`/`bnez`.

There is an alternative way to do branch-free compares with just base-ISA integer instructions:
```C++
bool test_lt2(I x0, I x1, I y0, I y1) {
    return ((y1 < x1) - (x1 < y1)) < (x0 < y0);
}
```
```
test_lt2(unsigned long, unsigned long, unsigned long, unsigned long):
        sltu    a4, a1, a3
        sltu    a1, a3, a1
        sub     a1, a1, a4
        sltu    a0, a0, a2
        slt     a0, a1, a0
        ret
```
This has a critical path length similar to the `czero` version's, except that the `czero.nez` may be fused, making the `czero` version potentially faster on some targets.
The code size is only two bytes larger, assuming the `xor` can be compressed.

For greater-or-equal comparisons, the compiler currently just appends an extra `xori 1` at the end, but there would be more parallelism if something like this were generated:
```
test_ge3(unsigned _BitInt(128), unsigned _BitInt(128)):
        sltu    a2, a0, a2
        sltu    a0, a1, a3
        xori    a0, a0, 1
        xor     a1, a1, a3
        czero.nez       a2, a2, a1
        czero.nez       a0, a0, a2
        ret
```
The alternative base integer version could use this:
```
test_ge2(unsigned _BitInt(128), unsigned _BitInt(128)):
        sltu    a0, a0, a2
        sltu    a2, a3, a1
        sltu    a1, a1, a3
        addi    a0, a0, -1
        sub     a1, a2, a1
        slt     a0, a0, a1
        ret
```
The `addi` can be compressed, unlike the `xori`.

While experimenting, I also noticed that some boolean combinations fail to use conditional zeroes:
```C++
bool test_lt3(I x0, I x1, I y0, I y1) {
    bool b0 = x0 < y0;
    bool b1 = x1 < y1;
    if (x1 != y1) b0 = false;
    b1 |= b0;
    return b1;
}
```
```
test_lt3(unsigned long, unsigned long, unsigned long, unsigned long):
        sltu    a0, a0, a2
        xor     a2, a1, a3
        seqz    a2, a2
        and     a0, a0, a2
        sltu    a1, a1, a3
        or      a0, a0, a1
        ret
```
Here `seqz` + `and` is used instead of `czero.nez`. I think `lowerSelectToBinOp()` in `RISCVISelLowering.cpp` is where this comes from.

There also seems to be a more general issue where, e.g., a conditional OR gets turned into a select, but the information that one operand is already conditionally zero gets lost, as in this example:
```C++
bool test_lt4(I x0, I x1, I y0, I y1) {
    bool b0 = x0 < y0;
    bool b1 = x1 < y1;
    if (x1 == y1) b1 |= b0;
    return b1;
}
```
```
test_lt4(unsigned long, unsigned long, unsigned long, unsigned long):
        sltu    a4, a1, a3
        xor     a1, a1, a3
        sltu    a0, a0, a2
        czero.eqz       a2, a4, a1
        czero.nez       a0, a0, a1
        or      a0, a0, a2
        ret
```
The above case could be pattern-matched in the lowering pass, but the general case with assumes probably needs the select to somehow preserve this information.
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs