On 14 June 2015 at 17:59, Robert Willy <[email protected]> wrote:

> I know generally C compiler can do an excellent work.
>

Well, to be honest, I've seen both GCC and Clang produce pretty idiotic
output often enough. However, they do save you from the large amount of
boilerplate you'd otherwise need, for example to produce code that supports
proper debugging and/or stack unwinding (compare the compiler output with
and without -g and/or -fexceptions). Also, manual instruction scheduling is
tedious, and a processor like the Cortex-A8 can be quite sensitive to it.


> I have done asm coding on one DSP core, one Synopsys ARC600 core. Now I
> feel that ARM processor is still very different from those.
>

I am unfamiliar with that particular architecture. The only DSP I have
experience with is TI's C674x, which I do like, but it is pretty weird in
comparison to a "regular" CPU architecture. ARM, by contrast, is a fairly
clean and standard RISC architecture, although it has accumulated more
cruft over time (as any architecture does).

> Do you know any small project which can be a good exercise to grasp ARM asm
> coding?
>

I'm sorry, I don't really know how to answer that.


> Second, I used SIMD on other cores in the past. When I compile a project
> having for loop, I do not see the generated ARM NEON
> assembly code in the disassembly window.
>

I get clearly vectorized output if I compile a simple dst[] += a * src[]
loop with: -mcpu=cortex-a8 -mfpu=neon -Ofast
(-Ofast is basically short for -O3 -ffast-math). I don't understand why
you're specifying a softfp ABI: hardfloat is the standard on modern ARM
targets.
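
A loop of that shape might look roughly like the sketch below. The function
name and signature are made up for illustration and are not the exact code
that was tested. Adding -S to the flags above makes gcc emit the assembly so
the NEON instructions can be inspected directly; older gcc versions may also
need -std=c99 for the loop-scoped declaration.

```c
#include <stddef.h>

/* Hypothetical test kernel: dst[i] += a * src[i] for i in [0, n).
 * Nothing here tells the compiler about alignment or aliasing, so expect
 * extra code around the vectorized loop for leftovers and overlap checks. */
void scaled_add(float *dst, const float *src, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```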

I've tested with both gcc 4.9.3 (linaro) and 5.1.1 (debian), and it works
for me both for floats and for int16. The vectorized kernel loop is usually
buried among a mass of code dealing with "leftovers" and/or misaligned
cases: much of it can be removed using things like alignment attributes,
the "restrict" qualifier, making sure the loop count is always a multiple
of some nice power of two, etc.
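
As a rough sketch of those hints (assumptions: the hypothetical function
name, the choice of 16-byte alignment, and the multiple-of-8 loop count are
all picked for illustration; __builtin_assume_aligned is a GCC/Clang builtin
used here in place of an alignment attribute on the buffers themselves):

```c
#include <stddef.h>

/* Hypothetical hinted kernel. The caller must pass non-overlapping,
 * 16-byte aligned buffers; violating that is undefined behaviour. */
void scaled_add_hinted(float *restrict dst, const float *restrict src,
                       float a, size_t n)
{
    /* Promise the compiler that both pointers are 16-byte aligned. */
    float *restrict d = __builtin_assume_aligned(dst, 16);
    const float *restrict s = __builtin_assume_aligned(src, 16);

    /* Round the count down to a multiple of 8 so no leftover loop is
     * needed. Any trailing n % 8 elements are deliberately NOT processed. */
    n &= ~(size_t)7;

    for (size_t i = 0; i < n; i++)
        d[i] += a * s[i];
}
```

Whether this actually trims the surrounding code depends on the compiler
version, so it is worth comparing the -S output with and without the hints.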

Still, the output doesn't look particularly good to me. Uhh, in fact the
output looks pretty invalid to me: it's putting :64 alignment specifiers on
vector loads, yet those addresses only increment by 16 each loop
iteration...

I would say GCC's auto-vectorization still looks like work-in-progress to
me. :P
