Oleg Endo <olegendo at gcc dot> changed:

           What    |Removed                     |Added
                 CC|                            |olegendo at gcc dot

--- Comment #4 from Oleg Endo <olegendo at gcc dot> ---
(In reply to Rich Felker from comment #0)
> Even if we have a set-associative cache on J-core in the future, I plan to
> have Linux provide a vdso memcpy function that can use DMA transfers, which
> are several times faster than what you can achieve with any cpu-driven
> memcpy and which free up the cpu for other work. However it's impossible for
> such a function to get called as long as gcc is inlining it.

Just a note on the side... the above can also be done on an off-the-shelf SH
MCU.  However, it is only going to be beneficial for large memory blocks, since
you'd have to synchronize (i.e. flush) the data cache lines of the memcpy'ed
regions.  For small blocks the DMA packet setup time will dominate, unless
you've got a dedicated DMA channel sitting around just waiting for memcpy
commands.  Normally it's better to avoid copying large memory blocks at all and
to use reference-counted buffers or something like that instead.  That is, of
course, unless you've got some special cache-coherent DMA machinery ready at
hand and memory is very fast :)
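To illustrate the reference-counted-buffer idea mentioned above, here is a
minimal sketch in C.  All names (`rcbuf`, `rcbuf_new`, etc.) are hypothetical,
not from any existing kernel or library API; the point is only that consumers
share one block and bump a count instead of memcpy'ing the payload around:

```c
#include <stdlib.h>
#include <string.h>

/* Minimal reference-counted buffer sketch (hypothetical names).  The payload
   is copied once at creation; further "handoffs" only adjust the count. */
struct rcbuf {
    size_t refs;            /* number of current owners */
    size_t len;             /* payload size in bytes */
    unsigned char data[];   /* flexible array member holding the payload */
};

struct rcbuf *rcbuf_new(const void *src, size_t len)
{
    struct rcbuf *b = malloc(sizeof *b + len);
    if (!b)
        return NULL;
    b->refs = 1;
    b->len = len;
    memcpy(b->data, src, len);  /* the one and only copy */
    return b;
}

struct rcbuf *rcbuf_ref(struct rcbuf *b)
{
    b->refs++;                  /* new owner; not thread-safe as written */
    return b;
}

void rcbuf_unref(struct rcbuf *b)
{
    if (b && --b->refs == 0)    /* last owner frees the block */
        free(b);
}
```

A real implementation would of course need atomic counts for SMP use, but even
this toy version shows why the large-block memcpy can often be avoided
entirely.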

(In reply to Rich Felker from comment #2)
> I'm testing a patch where I used 256 as the limit and it made the Linux
> kernel very slightly faster (~1-2%) and does not seem
> to hurt anywhere.

I'm curious: how did you measure the kernel's performance?  Which part in
particular got faster, and in which situation?
