https://bugs.llvm.org/show_bug.cgi?id=42550
Bug ID: 42550
Summary: Byte+shift loads are not autovectorized to movdqu on
SSE4.1
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected],
[email protected], [email protected]
Take the following code:
#ifdef BYTE_SHIFT
static uint32_t read32(uint8_t const *data, size_t offset)
{
    /* Assemble a little-endian 32-bit value byte by byte. */
    return (uint32_t) data[offset + 0]
        | ((uint32_t) data[offset + 1] << 8)
        | ((uint32_t) data[offset + 2] << 16)
        | ((uint32_t) data[offset + 3] << 24);
}
#else
static uint32_t read32(uint8_t const *data, size_t offset)
{
    uint32_t ret;
    memcpy(&ret, data + offset, sizeof(ret));
    return ret;
}
#endif
Both are perfectly valid ways to perform an unaligned load, and when SSE is
disabled they generate identical code.
However, when SSE4.1 is enabled and loops using these loads are autovectorized,
the memcpy version is lowered to a movdqu, while the byte+shift version is
expanded literally into a sequence of pslld, pinsrb, and pmovzxbd instructions.
Demo: https://godbolt.org/z/jCAm2o
These types of loads should be converted to movdqu as well.
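For context, a self-contained sketch of the kind of loop that exposes the issue (this is a hypothetical example, not the exact code from the Godbolt link; `sum32` is an illustrative name, and the two `read32` variants agree only on little-endian targets such as x86):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* memcpy-based unaligned 32-bit load: vectorizes to movdqu. */
static uint32_t read32(uint8_t const *data, size_t offset)
{
    uint32_t ret;
    memcpy(&ret, data + offset, sizeof(ret));
    return ret;
}

/* Hypothetical hot loop: with the byte+shift read32 instead, the
 * vectorizer currently emits pinsrb/pslld/pmovzxbd sequences rather
 * than a single unaligned vector load per chunk. */
uint32_t sum32(uint8_t const *data, size_t n_words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n_words; ++i)
        sum += read32(data, i * 4);
    return sum;
}
```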