https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77610

            Bug ID: 77610
           Summary: [sh] memcpy is wrongly inlined even for large copies
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bugdal at aerifal dot cx
  Target Milestone: ---

The logic in sh-mem.cc does not suppress inlining of memcpy when the size is
constant but large. This prevents use of a library memcpy, which may be much
faster than the inline version once the copy is large enough to amortize the
function call overhead.

At present the reason this is problematic on J2 is that the cache is direct
mapped, so that when source and dest are aligned mod large powers of two
(typical when page-aligned), each write to dest evicts src from the cache,
making memcpy 4-5x slower than it should be. A library memcpy can handle this
by copying cache line size or larger at a time, but the inline memcpy can't.
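
A minimal sketch of the conflict and of the line-at-a-time approach follows.
The cache geometry (32-byte lines, 8 KiB total) and the names cache_slot and
copy_by_lines are assumptions made up for illustration; they are not J2's
actual parameters or any existing library's code, and a real implementation
would likely be written in assembly so the staged line is guaranteed to stay
in registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache geometry, for illustration only (not J2's real values). */
enum { LINE_BYTES = 32, CACHE_BYTES = 8192, LINE_WORDS = LINE_BYTES / 4 };

/* Line slot an address occupies in a direct-mapped cache of this geometry.
   Two addresses with the same slot evict each other. */
static unsigned cache_slot(uintptr_t addr)
{
    return (unsigned)((addr % CACHE_BYTES) / LINE_BYTES);
}

/* Copy whole lines: read one full line from src into locals, then write it
   to dst, so each source line is fetched once instead of once per word.
   nwords must be a multiple of LINE_WORDS. The compiler is expected, but
   not guaranteed, to keep w[] in registers. */
static void copy_by_lines(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i += LINE_WORDS) {
        uint32_t w[LINE_WORDS];
        for (size_t j = 0; j < LINE_WORDS; j++)
            w[j] = src[i + j];
        for (size_t j = 0; j < LINE_WORDS; j++)
            dst[i + j] = w[j];
    }
}
```

With these assumed values, two page-aligned buffers such as 0x10000 and
0x24000 map to the same slot, so a word-by-word copy between them would
alternate between evicting the source line and the destination line.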

Even if we have a set-associative cache on J-core in the future, I plan to have
Linux provide a vdso memcpy function that can use DMA transfers, which are
several times faster than what you can achieve with any cpu-driven memcpy and
which free up the cpu for other work. However, it's impossible for such a
function to get called as long as gcc is inlining it.

Using -fno-builtin-memcpy is not desirable because we certainly want inline
memcpy for small transfers that would be dominated by function call time (or
where the actual memory accesses can be optimized out entirely and the copy
performed in registers, like memcpy for type punning), just not for large
copies.
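
To illustrate the two cases, here is a hypothetical test file. The names
float_bits and copy_big and the 64 KiB size are made up for this example.
The first memcpy is the kind that should stay inline (ideally reduced to a
register move), while the second is the constant-but-large kind that I would
expect to become a call to the library function.

```c
#include <stdint.h>
#include <string.h>

/* Small fixed-size copy used for type punning: should remain inline and
   ideally be performed entirely in registers. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical large constant size, chosen only for illustration. */
enum { BIG = 1 << 16 };
static unsigned char big_src[BIG], big_dst[BIG];

/* Constant but large copy: a candidate for calling the library memcpy
   instead of expanding the copy inline. */
void copy_big(void)
{
    memcpy(big_dst, big_src, sizeof big_dst);
}
```

Compiling such a file with -O2 -S for an sh target and checking whether
copy_big contains a call to memcpy is one way to observe the behavior.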
