https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77610
Bug ID: 77610
Summary: [sh] memcpy is wrongly inlined even for large copies
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bugdal at aerifal dot cx
Target Milestone: ---

The logic in sh-mem.cc does not suppress inlining of memcpy when the size is constant but large. This prevents use of a library memcpy, which may be much faster than the inline version once the copy is large enough to outweigh the function call overhead.

At present the reason this is problematic on J2 is that the cache is direct mapped, so that when source and dest are congruent modulo a large power of two (typical when both are page-aligned), each write to dest evicts src from the cache, making memcpy 4-5x slower than it should be. A library memcpy can handle this by copying a cache line or more at a time, but the inline memcpy can't.

Even if we have a set-associative cache on J-core in the future, I plan to have Linux provide a vdso memcpy function that can use DMA transfers, which are several times faster than what you can achieve with any cpu-driven memcpy and which free up the cpu for other work. However, it's impossible for such a function to get called as long as gcc is inlining memcpy.

Using -fno-builtin-memcpy is not desirable because we certainly want inline memcpy for small transfers that would be dominated by function call time (or where the actual memory accesses can be optimized out entirely and the copy performed in registers, like memcpy for type punning), just not for large copies.
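
To make the two cases concrete, here is a minimal sketch in C. The cache geometry (CACHE_SIZE, LINE_SIZE), the buffer size BIG, and all function names are hypothetical and chosen only for illustration; they are not taken from the J2 documentation or from sh-mem.cc. cache_index shows why two addresses that differ by a multiple of the cache size compete for the same line in a direct-mapped cache, float_bits is the kind of small copy that should stay inline, and copy_big is the kind of large constant-size copy that would ideally become a call to the library memcpy.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical direct-mapped cache geometry, for illustration only. */
#define CACHE_SIZE 8192u
#define LINE_SIZE  32u

/* Index of the cache line an address maps to in a direct-mapped cache. */
static unsigned cache_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % (CACHE_SIZE / LINE_SIZE));
}

/* Small constant-size copy used for type punning. Inlining is wanted
   here: the compiler can usually perform the copy in registers.
   Assumes float is 32 bits wide. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Large constant-size copy. The size is known at compile time, but a
   call to the library memcpy is preferable to an inline expansion. */
#define BIG 65536u
static unsigned char src_buf[BIG];
static unsigned char dst_buf[BIG];

static void copy_big(void)
{
    memcpy(dst_buf, src_buf, BIG);
}
```

With this assumed geometry, any src and dest that differ by a multiple of 8192 bytes map to the same cache index at every offset, so a load from src followed by a store to dest keeps evicting the line that was just fetched. Whether copy_big is emitted as a call or expanded inline can be checked by compiling with -S and looking for a memcpy call in the output.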