On 14 June 2015 at 17:59, Robert Willy <[email protected]> wrote:

> I know generally C compiler can do an excellent work.

Well, to be honest I've also seen both GCC and Clang produce pretty idiotic output often enough. However, they do save you from the large amount of boilerplate you'd need, for example, to produce code that supports proper debugging and/or stack unwinding (see the compiler output when compiling with -g and/or -fexceptions). Also, manual instruction scheduling is tedious, and a processor like the Cortex-A8 can be quite sensitive to it.

> I have done asm coding on one DSP core, one Synopsis ARC600 core. Now I
> feel that ARM processor is still very different from those.

I am unfamiliar with that particular architecture. The only DSP I have experience with is TI's C674x, which I do like, but it is pretty weird in comparison to a "regular" CPU architecture. ARM is a fairly clean and standard RISC architecture, although it has accumulated more cruft over time (as any architecture does).

> Do you know any small project which can be a good exercise to grasp ARM asm
> coding?

I'm sorry, I don't really know how to answer that.

> Second, I used SIMD on other cores in the past. When I compile a project
> having for loop, I do not see the generated ARM NEON
> assembly code in the disassembly window.

I get clearly vectorized output if I compile some simple dst[] += a * src[] loop with:

  -mcpu=cortex-a8 -mfpu=neon -Ofast

(-Ofast is basically short for -O3 -ffast-math). I don't understand why you're specifying a softfp ABI: hard-float is the standard on modern ARM targets. I've tested with both GCC 4.9.3 (Linaro) and 5.1.1 (Debian), and it works for me both for floats and for int16.

The vectorized kernel loop is usually buried among a mass of code dealing with "leftovers" and/or misaligned cases: much of it can be removed using things like alignment attributes, the "restrict" qualifier, making sure the loop count is always a multiple of some nice power of two, etc. Still, the output doesn't look particularly good to me.
Uhh, in fact the output looks pretty invalid to me: it's putting :64 alignment specifiers on vector loads, yet those addresses only increment by 16 each loop iteration... I would say GCC's auto-vectorization still looks like work-in-progress to me. :P
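
For illustration, here is a minimal sketch of the kind of dst[] += a * src[] loop discussed above, with the hints mentioned (the "restrict" qualifier, an alignment promise, and a loop count that is a multiple of a power of two). The function name scaled_add, the 16-byte alignment and the multiple-of-8 requirement are assumptions made up for this example, and __builtin_assume_aligned is a GCC/Clang extension rather than standard C. Whether these hints actually trim the "leftover" code depends on the compiler version, so treat this as something to experiment with rather than a guaranteed recipe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: dst[i] += a * src[i] for i in [0, n).
 * "restrict" promises the compiler that dst and src do not overlap. */
void scaled_add(float *restrict dst, const float *restrict src,
                float a, size_t n)
{
    /* Assumption for this sketch: the caller pads n to a multiple of 8,
     * so no scalar "leftover" iterations are needed. */
    assert(n % 8 == 0);

    /* GCC/Clang extension: promise that both pointers are 16-byte
     * aligned. Passing a misaligned pointer is undefined behaviour. */
    dst = __builtin_assume_aligned(dst, 16);
    src = __builtin_assume_aligned(src, 16);

    for (size_t i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```

Compiling this with the flags from the message (-mcpu=cortex-a8 -mfpu=neon -Ofast) plus -S writes the generated assembly to a file, which makes it easier to check whether a NEON loop was emitted than hunting for it in a disassembly window.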
