Although GCC provides the unsigned __int128 extension for integer computation
beyond 64 bits, some programmers still rely on portable but less efficient
emulation-based code to do 64x64->128 integer multiplication, as in the case
reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122488.
  void multiply_uint64_to_uint128(uint64_t op0, uint64_t op1, uint64_t *res)
  {
    uint64_t op0_lo = op0 & 0x00000000FFFFFFFF;
    uint64_t op1_lo = op1 & 0x00000000FFFFFFFF;
    uint64_t op0_hi = op0 >> 32;
    uint64_t op1_hi = op1 >> 32;
    uint64_t mul_lo = op0_lo * op1_lo;
    uint64_t mul_mi_01 = op0_hi * op1_lo;
    uint64_t mul_mi_10 = op1_hi * op0_lo;
    uint64_t mul_hi = op0_hi * op1_hi;
    uint64_t sum_mi = mul_mi_01 + mul_mi_10;
    uint64_t sum_mi_carry = sum_mi < mul_mi_01;
    uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0x00000000FFFFFFFF);
    res[1] = mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32) + (sum_32 >> 32);
    res[0] = (sum_32 << 32) | (mul_lo & 0x00000000FFFFFFFF);
  }
On aarch64 and x86, the above sequence can actually be optimized to a 64-bit
multiply-high plus a multiply, and we did observe a nice benefit on some
real-world applications when the simplification is applied. LLVM has recently
added support for matching this pattern: https://godbolt.org/z/7dsorWdKa

So I'd like to discuss recognition of the pattern in GCC, and could implement
the feature once a proposal is agreed upon. A few specifics make the problem a
little more complicated, so simply introducing a new AST-based pattern rule in
match.pd might not work.
o. The emulated multiplication is not necessarily a linear code sequence; it
could also be implemented using conditional statements, such as the variant
case in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107090. GCC does not
translate the conditional statement to a COND_EXPR in the middle end, and only
relies on the backend to map the statement to a conditional-select instruction.
  // Linear code
  sum = a + b;
  carry = sum < a;
  value += carry << 32;

  // Conditional code
  sum = a + b;
  if (sum < a)
    value += 1ULL << 32;
o. The final result is computed from additions of more than two parts, whose
association order can be arbitrary. The fixed-order constraint that naturally
arises from the match.pd mechanism cannot handle this gracefully; one possible
solution is to enumerate all combinations and create one pattern for each,
which tends to be a lengthy trick.

  result = ((part1 + part2) + part3) + part4;
  result = (part4 + part1) + (part3 + part2);
  ...
  result = part2 + ((part4 + part3) + part1);
At the same time, other additions that do not belong to the sequence may break
the entirety of the pattern when the reassociation transform takes effect.

  result = part1 + part2 + part3 + part4;
  final = other1 + result + other2;

  // After reassociation
  tmp0 = other1 + part1;
  tmp1 = part2 + part3;
  tmp2 = part4 + other2;
  final = tmp0 + tmp1 + tmp2;
o. The pattern consists of two sub-patterns, mult-to-highpart and
mult-to-lowpart; the latter is much simpler, being exactly equivalent to a
64x64->64 multiplication. But some intermediate values are shared between the
two sub-patterns, so they are not single-use temporaries, which makes
recognizing the two patterns independently somewhat harder; we have to
consider them together.
A possible approach to reference is the detection of table-based CTZ in the
forwprop pass, which might be a suitable place for this pattern, in that
forwprop is invoked more than once and some invocations happen before the
reassociation pass. But simplification of the 64x64->128 pattern should be
classified as a math-related optimization, and would better live in
tree-ssa-math-opts.cc for logical consistency. In that file,
pass_optimize_widening_mul is very close to what we want, but it runs too
late to keep the pattern intact against reassociation. So I am considering
adding a dedicated pass, similar to pass_cse_reciprocals, placed prior to
reassociation; the major matching procedure would be hand-coded, with only
some leaf patterns defined via match.pd.
Now I am working on a prototype implementation, and I would appreciate your
suggestions on this.
Thanks,
Feng