| Issue |
55384
|
| Summary |
[libcxx] Miscompilation of string code when using new-struct-path-tbaa
|
| Labels |
new issue
|
| Assignees |
|
| Reporter |
brunodefraine
|
When using libcxx and clang 14, we observe a miscompilation of the following simple `std::string` code:
```
std::string r;
...
r = std::string("Hello, ") + "world";
```
We expect string r to be equal to "Hello, world" but we end up with "H" followed by 11 garbage characters, depending on the earlier contents of r. This is a [godbolt demo](https://godbolt.org/z/qdo7WPEzz). The only non-trivial factor to trigger the wrong behavior, is the use of the new TBAA struct path format (using clang option `-new-struct-path-tbaa`).
# Analysis
After appending to the left-hand side string, the statement triggers a move constructor and move assignment to get the result in place. Each move consists of a copy of the string representation, followed by a zero operation that clears the representation of the stolen string. It seems these memory access operations do not have proper TBAA information, leading to incorrect memory optimization.
## TBAA tags
In the [LLVM IR for the move assignment operator and move constructor](https://godbolt.org/z/fdver7dd5), we can find back the TBAA tags for the memory operations. These tags should be compatible as they may refer to overlapping memory access. The internal representation of a basic_string is the following structure:
```
union __ulx{__long __lx; __short __lxx;};
enum {__n_words = sizeof(__ulx) / sizeof(size_type)};
struct __raw
{
size_type __words[__n_words];
};
struct __rep
{
union
{
__long __l;
__short __s;
__raw __r;
};
};
__compressed_pair<__rep, allocator_type> __r_;
```
Type-punning using a union violates strict aliasing, but is supported by e.g. GCC [provided the memory is accessed through the union type](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Type-punning). Also in clang, access using a union is normally given the tag of "omnipotent char".
The move assignment operator is doing (in `__move_assign`):
```
__r_.first() = __str.__r_.first();
__str.__set_short_size(0);
```
The assignment at the level of `__rep` gives a `memcpy` intrinsic with TBAA base and access tag of `__rep`, which has a member (the anonymous union) of omnipotent char. `__set_short_size` uses the access _expression_ `__r_.first().__s.__size_`, the union access has base and access tag of omnipotent char.
The move constructor is doing:
```
: __r_(_VSTD::move(__str.__r_))
{
__str.__zero();
```
The move at the level of the `__compressed_pair` gives a `memcpy` intrinsic with TBAA base and access tag of the `__compressed_pair`, but this structure has no members in the TBAA info, **this seems incomplete**:
```
!8 = !{!9, !9, i64 0, i64 24}
!9 = !{!6, i64 24, !"_ZTSNSt3__117__compressed_pairINS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE5__repES5_EE"}
```
The function `__zero` is doing:
```
void __zero() _NOEXCEPT
{
size_type (&__a)[__n_words] = __r_.first().__r.__words;
for (unsigned __i = 0; __i < __n_words; ++__i)
__a[__i] = 0;
}
```
The store of the zero value in the loop body is given TBAA base and access tag of `long`, **this is clearly wrong**. However, function `__zero` violates the constraint "memory is accessed through the union type" from the GCC manual (cited above). The use of the intermediate reference variable `a` is equivalent to the counter-example (using an intermediate pointer variable `int* ip`) given there. If the source code is changed to remove the use of the reference variable, the base and access tag is omnipotent char, as expected.
# Workaround
Changing the code of `__move_assign` do the assignment at the level of `__raw`:
```
__r_.first().__r = __str.__r_.first().__r;
```
Gives a memcpy with TBAA tag of omnipotent char (since we are accessing a union member) and prevents the incorrect memory optimization. Presumably because this forces to alias with the `__compressed_pair` (but as noted above, that tag seems incomplete).
# Credits
Bhramar Bhushan Vatsa, Wouter Vermaelen and Jeroen Dobbelaere contributed to this bug report.
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs