Re: Optimized kernel memcpy/memset

Måns Rullgård Thu, 05 May 2011 10:45:43 -0700

David Gilbert <david.gilb...@linaro.org> writes:

> On 5 May 2011 16:08, Måns Rullgård <m...@mansr.com> wrote:
>> David Gilbert <david.gilb...@linaro.org> writes:
>>> Not quite:
>>>   a) Neon memcpy/memset is worse on A9 than non-neon versions (better
>>> on A8 typically)
>>
>> That is not my experience at all.  On the contrary, I've seen memcpy
>> throughput on A9 roughly double with use of NEON for large copies.
>> For small copies, plain ARM might be faster since the overhead of
>> preparing for a properly aligned NEON loop is avoided.
>>
>> What do you base your claims on?
>
> My tests here:
> https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy


At the top of the page: Do not rely on or use the numbers.

> at the bottom of the page are sets of graphs for A9 (left) and A8
> (right); on A9 the Neon memcpy's (red and green) top out much lower
> than their non-neon best equivalents (black and cyan).

That page is rather fuzzy on exactly what code was being tested as well
as how the tests were performed.  Without some actual code with which
one can reproduce the results, those figures should not be used as a basis
for any decisions.
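
For illustration, a reproducible test could look roughly like the sketch
below.  This is not the code behind the wiki page or behind any numbers
quoted in this thread.  The names copy_fn, byte_copy and bench_copy are
hypothetical, byte_copy is only a placeholder for whichever NEON or
plain ARM routine is being compared, and clock_gettime() assumes a POSIX
system.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every memcpy candidate under test (hypothetical name). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder candidate: a plain byte loop so the sketch builds anywhere.
 * Swap in the NEON or ARM routine that is actually being measured. */
static void *byte_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Copy `size` bytes `iters` times with `fn` and return the throughput in
 * MB/s (1 MB = 10^6 bytes), or -1.0 if the buffers cannot be allocated. */
static double bench_copy(copy_fn fn, size_t size, unsigned iters)
{
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    /* Calling through a volatile pointer keeps the compiler from
     * inlining the copy or dropping repeated calls. */
    copy_fn volatile call = fn;
    double start, elapsed;
    unsigned i;

    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so page faults are not part of the timing. */
    memset(src, 0xa5, size);
    memset(dst, 0, size);

    /* Warm-up pass, also used to check that the candidate copies correctly. */
    call(dst, src, size);
    assert(memcmp(dst, src, size) == 0);

    start = now_sec();
    for (i = 0; i < iters; i++)
        call(dst, src, size);
    elapsed = now_sec() - start;
    if (elapsed <= 0.0)
        elapsed = 1e-9;  /* guard against a zero reading on coarse clocks */

    free(src);
    free(dst);
    return (double)size * (double)iters / elapsed / 1e6;
}
```

Calling bench_copy() for each candidate over a range of sizes, from a
few bytes up to well beyond the L2 cache size, and publishing the
source together with compiler flags and hardware details would let
others check the figures.  This sketch only measures buffers as
malloc() returns them; a fuller test would also vary the source and
destination alignment.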

> Also, when I showed those numbers to the guys at ARM they all said it was
> a bad idea to use Neon on A9 for memory manipulation workloads.

I have heard many claims passed around concerning memcpy on A9, none of
which I have been able to reproduce myself.  Some allegedly came from
people at ARM.

> What code do you base your claims on :-)

My own testing wherein the Bionic NEON memcpy vastly outperformed both
glibc and Bionic ARMv5 memcpy.

>> I don't see the connection between Thumb2 and memcpy performance.
>> Thumb2 can do anything 32-bit ARM can.
>
> There are purists who say to write everything in Thumb2 now; however
> there is an interesting question of which is faster, and IMHO the ARM
> code is likely to be a bit faster in most cases.

Code with many conditional instructions may be faster in ARM mode since
it avoids the IT instructions.  Other than that I don't see why it
should matter.  The instruction prefetching should make possible
misalignment of 32-bit instructions irrelevant.  If anything, the
usually smaller Thumb2 code should decrease I-cache pressure and
increase performance.  

-- 
Måns Rullgård
m...@mansr.com
