Issue 137912
Summary [Optimization opportunity] Optimize repeated inline assembly operations when semantically equivalent
Labels new issue
Assignees
Reporter 399833783
    I would like to propose an **optimization enhancement** regarding inline assembly operations in C++, specifically for cases where multiple non-volatile `asm` statements perform similar or nearly identical operations.

### Issue description
Currently, when a function calls multiple other functions containing similar inline assembly operations (e.g., one retrieving the quotient and another retrieving the remainder from a division operation), Clang generates code that **repeats the same assembly instruction multiple times**, even though **a single execution could provide all needed results**.

### Reproduction case

#### 64-bit version

##### C++ code:
```cpp
#include <cstdint>

using U = uint64_t;

struct A { U quo; U rem; };

U myDiv(U const& a, U const& b)
{
	U result, dummyDividendHigh;

	asm (
		"xor %[dividendHigh], %[dividendHigh]  \n\t"
		"divq %[divisor]  \n\t"
		: [quotient] "=a"(result), [dividendHigh] "=d" (dummyDividendHigh)
		: [dividendLow] "a"(a), [divisor] "rm"(b)
	);


	return result;
}

U myMod(U const& a, U const& b)
{    
	U result;

	asm (
		"xor %[dividendHigh], %[dividendHigh]  \n\t"
		"divq %[divisor]  \n\t"
		: [dividendHigh] "=d" (result)
		: [dividendLow] "a"(a), [divisor] "rm"(b)
	);

	return result;
}

A myDivMod1(U const& a, U const& b)
{
	A result;
	result.quo = myDiv(a, b);
	result.rem = myMod(a, b);
	return result;
}

A myDivMod2(U const& a, U const& b)
{
	A result;
	
	asm (
		"xor %[dividendHighOrRemainder], %[dividendHighOrRemainder]  \n\t"
		"divq %[divisor]  \n\t"
		: [quotient] "=a"(result.quo), [dividendHighOrRemainder] "=d" (result.rem)
		: [dividendLow] "a"(a), [divisor] "rm"(b)
	);
	
	return result;
}
```

##### Generated assembly:
```asm
myDiv(unsigned long const&, unsigned long const&):
 mov     rax, qword ptr [rdi]
    mov     rcx, qword ptr [rsi]
    mov qword ptr [rsp - 8], rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 8]

    ret

myMod(unsigned long const&, unsigned long const&):
    mov rax, qword ptr [rdi]
    mov     rcx, qword ptr [rsi]
    mov     qword ptr [rsp - 8], rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 8]

 mov     rax, rdx
    ret

myDivMod1(unsigned long const&, unsigned long const&):
    mov     rcx, qword ptr [rdi]
    mov     rdi, qword ptr [rsi]
    mov     qword ptr [rsp - 8], rdi
    mov     rax, rcx
    xor rdx, rdx
    div     qword ptr [rsp - 8]

    mov     rsi, rax
    mov qword ptr [rsp - 16], rdi
    mov     rax, rcx
    xor     rdx, rdx
 div     qword ptr [rsp - 16]

    mov     rax, rsi
 ret

myDivMod2(unsigned long const&, unsigned long const&):
    mov rax, qword ptr [rdi]
    mov     rcx, qword ptr [rsi]
    mov     qword ptr [rsp - 8], rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 8]

 ret
```

##### Godbolt link (64-bit version):
https://gcc.godbolt.org/z/rqfhxrGMf

#### 32/16/8-bit version

##### Godbolt link (32-bit version):
https://gcc.godbolt.org/z/MK6Y9rrd7

##### Godbolt link (16-bit version):
https://gcc.godbolt.org/z/aMcqqWz14

##### Godbolt link (8-bit version, **optimize correctly**):
https://gcc.godbolt.org/z/aoxPcWfxs

### Optimization suggestion
I suggest enhancing Clang to recognize situations where multiple non-volatile inline assembly blocks across function calls share identical or highly similar operations, and optimize them by merging the operations when semantically safe.
For the example above, the assembly code generated for `myDivMod1` should be no more complex than that of `myDivMod2` and contain no more than one `div` instruction.

### Theoretical basis
According to Clang's inline assembly documentation, **non-volatile `asm` statements can be reordered and optimized by the compiler**. In this case, both `myDiv` and `myMod` are essentially performing the same `div` operation but extracting different outputs (quotient vs. remainder). Since the x86 `div` instruction naturally produces both results simultaneously, executing it twice is redundant.
Another strong reason to support this optimization suggestion is that, in the case of `uint8_t`, Clang generates **identical code** for both `myDivMod1` and `myDivMod2`, which proves that Clang has effectively optimized the case in `myDivMod1`.

### Broader application
This optimization pattern would benefit not just this specific case but any scenario where:
- Multiple non-volatile inline assembly blocks across function calls contain similar operations.
- These operations could be merged without changing program semantics.
- The reuse of intermediate results would reduce instruction count.

Similar to function inlining optimizations, the compiler could analyze the context of inline assembly calls and merge redundant operations where safe.

### Expected benefit
Such optimization would result in more efficient code generation, particularly important for performance-critical applications using inline assembly for specialized operations like cryptography, multimedia processing, or low-level system operations.

I appreciate your consideration of this enhancement request and am happy to provide any additional information if needed.

Sincerely.
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

Reply via email to