| Issue |
179607
|
| Summary |
[X86] Missed Fold: `vgf2p8affineqb(vgf2p8affineqb(x, C1), C2)` => `vgf2p8affineqb(x, C3)`
|
| Labels |
new issue
|
| Assignees |
|
| Reporter |
WalterKruger
|
Given that `vgf2p8affineqb` computes each output bit as a XOR of input bits (an affine transform over GF(2)), and that XOR is associative, a composition of multiple `vgf2p8affineqb` operations can be folded into one:
```asm
revRunningParity_clang:
gf2p8affineqb xmm0, xmmword ptr [rip + .LCPI0_0], 0
gf2p8affineqb xmm0, xmmword ptr [rip + .LCPI0_1], 0
ret
```
```asm
revRunningParity_tgt:
gf2p8affineqb xmm0, xmmword ptr [rip + .LCPI1_0], 0
ret
```
https://godbolt.org/z/536MK4PTq
The matrix for such a fold can be calculated by affining the sub-matrix by the super-matrix at byte, rather than bit, granularity. Because `vgf2p8affineqb`'s matrix bytes determine the output bits in reverse order, this "byte affine" differs from `vgf2p8affineqb`: it must produce the output bytes in order while selecting the sub-matrix's bytes in reverse order. See the following pseudocode:
```
u64 byteGranularAffine(u64 subMatrix, u64 superMatrix) {
    u64 res = 0
    for i FROM 0...7 {
        u8 matrixRow = superMatrix.byte[i]
        for j FROM 0...7 {
            res.byte[i] ^= matrixRow.bit[j] ? subMatrix.byte[7-j] : 0
        }
    }
    return res
}
```
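As a sanity check, the pseudocode can be transcribed to plain C and verified against a software model of `gf2p8affineqb` (the model below follows the Intel SDM definition: output bit `i` of each byte is the parity of `matrix.byte[7-i] & srcByte`, XORed with bit `i` of the immediate). Helper names like `gf2p8affine64` and `parity8` are illustrative, not LLVM or intrinsic APIs:

```c
#include <stdint.h>

// Parity (XOR-reduction) of a byte.
static int parity8(uint8_t x) { x ^= x >> 4; x ^= x >> 2; x ^= x >> 1; return x & 1; }

// Software model of one 64-bit lane of gf2p8affineqb (per the Intel SDM):
// output bit i of each byte = parity(matrix.byte[7-i] & srcByte) ^ imm.bit[i].
static uint64_t gf2p8affine64(uint64_t src, uint64_t matrix, uint8_t imm) {
    uint64_t res = 0;
    for (int b = 0; b < 8; b++) {
        uint8_t x = (uint8_t)(src >> (8 * b)), out = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t row = (uint8_t)(matrix >> (8 * (7 - i)));
            out |= (uint8_t)((parity8(row & x) ^ ((imm >> i) & 1)) << i);
        }
        res |= (uint64_t)out << (8 * b);
    }
    return res;
}

// C transcription of the byte-granular affine pseudocode above:
// output byte i is taken in order, sub-matrix bytes are selected in reverse.
static uint64_t byteGranularAffine(uint64_t subMatrix, uint64_t superMatrix) {
    uint64_t res = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t matrixRow = (uint8_t)(superMatrix >> (8 * i));
        for (int j = 0; j < 8; j++)
            if ((matrixRow >> j) & 1)
                res ^= (uint64_t)(uint8_t)(subMatrix >> (8 * (7 - j))) << (8 * i);
    }
    return res;
}
```

With zero immediates, `gf2p8affine64(gf2p8affine64(v, C1, 0), C2, 0)` equals `gf2p8affine64(v, byteGranularAffine(C1, C2), 0)` for every input `v`; `0x0102040810204080` is the identity matrix in this encoding.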
The sub-operation's immediate also needs to be folded, which can be done by performing a normal `vgf2p8affineqb` on it using the super's matrix and immediate. See #179606.
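To illustrate the immediate fold at byte granularity under the same SDM byte-level model (`affineByte`, `foldImmediate`, and `byteGranularAffine` below are illustrative helper names, not LLVM APIs): the folded immediate is the sub-immediate pushed through the super's matrix, XORed with the super's immediate.

```c
#include <stdint.h>

// Parity (XOR-reduction) of a byte.
static int parity8(uint8_t x) { x ^= x >> 4; x ^= x >> 2; x ^= x >> 1; return x & 1; }

// One byte of gf2p8affineqb: bit i = parity(matrix.byte[7-i] & x) ^ imm.bit[i].
static uint8_t affineByte(uint64_t matrix, uint8_t x, uint8_t imm) {
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t row = (uint8_t)(matrix >> (8 * (7 - i)));
        out |= (uint8_t)((parity8(row & x) ^ ((imm >> i) & 1)) << i);
    }
    return out;
}

// Byte-granular affine (see the pseudocode above) for the folded matrix.
static uint64_t byteGranularAffine(uint64_t subMatrix, uint64_t superMatrix) {
    uint64_t res = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t matrixRow = (uint8_t)(superMatrix >> (8 * i));
        for (int j = 0; j < 8; j++)
            if ((matrixRow >> j) & 1)
                res ^= (uint64_t)(uint8_t)(subMatrix >> (8 * (7 - j))) << (8 * i);
    }
    return res;
}

// Folded immediate: a normal affine of the sub-immediate by the super's matrix
// and immediate, so that for every byte v:
//   affineByte(C2, affineByte(C1, v, imm1), imm2)
//     == affineByte(byteGranularAffine(C1, C2), v, foldImmediate(C2, imm1, imm2))
static uint8_t foldImmediate(uint64_t superMatrix, uint8_t subImm, uint8_t superImm) {
    return affineByte(superMatrix, subImm, superImm);
}
```

With the identity matrix `0x0102040810204080` as the super-matrix, the fold degenerates to `imm1 ^ imm2`, matching the plain XOR composition of the two affine offsets.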
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs