On 2/6/2020 6:23 AM, Bill Gaylord wrote:
Personally, it seems pretty fast: only about 9 us max for most ops on floats (doubles are a different story). It can be as low
as almost 1 us or as high as 55 us, depending on the operation.
I recommend evaluating the performance in terms of measured clock cycles
per DSP operation, such as this chart from Espressif
(https://docs.espressif.com/projects/esp-dsp/en/latest/esp-dsp-benchmarks.html):
*FFTs (CPU cycles)*                            *ESP32*   *ANSI C*
dsps_fft2r_fc32 for 64 complex points             5451       8187
dsps_fft2r_fc32 for 128 complex points           12400      18756
dsps_fft2r_fc32 for 256 complex points           27829      42381
dsps_fft2r_fc32 for 512 complex points           61755      94616
dsps_fft2r_fc32 for 1024 complex points         135745     209058
The ESP-DSP library is substantially faster than the non-optimized
ANSI C implementation (roughly 1.5x in the chart above). I can only
speculate that a human is better at exploiting this CPU than the
current generation of compilers.
Note that I'm talking about F32 data types here.
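To put a number on "substantially faster", here is a minimal sketch that computes the ANSI C to ESP32 cycle ratio for each row of the chart. The cycle counts are copied from the chart; the struct and function names are made up for illustration and are not part of ESP-DSP.

```c
#include <stddef.h>

/* Cycle counts copied from the Espressif chart quoted above. */
struct fft_bench {
    int points;                 /* complex FFT length */
    unsigned long esp32_cycles; /* optimized ESP-DSP column */
    unsigned long ansi_cycles;  /* ANSI C column */
};

static const struct fft_bench fft_table[] = {
    {   64,   5451UL,   8187UL },
    {  128,  12400UL,  18756UL },
    {  256,  27829UL,  42381UL },
    {  512,  61755UL,  94616UL },
    { 1024, 135745UL, 209058UL },
};

#define FFT_TABLE_LEN (sizeof fft_table / sizeof fft_table[0])

/* How many times faster the optimized routine is than the ANSI C one
   (hypothetical helper, not an ESP-DSP function). */
double fft_speedup(const struct fft_bench *row)
{
    return (double)row->ansi_cycles / (double)row->esp32_cycles;
}
```

Calling fft_speedup() on each row gives ratios between about 1.50 (64 points) and 1.54 (1024 points), so the advantage grows slightly with FFT size.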
You can find a similar chart from ARM in a white paper for Cortex-M4F,
but it doesn't contain the same specific cases as the Espressif chart, so even
a superficial comparison is not possible. Even if a superficial comparison
were possible, it's *likely to be misleading* for several reasons.
Notably in this case, the optimized ESP-DSP library is written in assembler,
so its cycle count will be constant regardless of the compiler
version/optimization.
The ARM CMSIS-DSP library is written in C, so its cycle counts will surely vary
with the compiler used (Keil vs GCC, for example).
There's likely quite a bit more variation in the Cortex-M benchmarks because
different vendors use the same Cortex-M cores in different ways, with varying
I-cache, D-cache, flash speed and memory-interconnect schemes. For this reason,
my guess is that ARM benchmarks are done at relatively low clock rates that
yield accurate cycle counts *of the CPU* without taking into account the impact
of memory schemes. Heck, the ARM cycle counts might be done in simulation
that assumes zero wait states at all times. [ Another example: it's tempting to
put FIR coefficients in flash (declare them 'const'), but some of the STM32F-series
parts will punish you with wait-states (it's quite measurable). ]
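One common way around the flash wait-state penalty is to keep the 'const' table as the master copy and run the filter from a RAM copy made at startup. A minimal sketch follows; the tap values, NUM_TAPS, and the function names are made up for illustration, and whether a 'const' array really lands in flash depends on your toolchain and linker script.

```c
#include <stddef.h>
#include <string.h>

#define NUM_TAPS 4  /* hypothetical filter length */

/* File-scope 'const' data is typically placed in flash by the linker. */
static const float fir_taps_flash[NUM_TAPS] = { 0.25f, 0.25f, 0.25f, 0.25f };

/* Working copy in RAM, so the inner loop never fetches taps from flash. */
static float fir_taps_ram[NUM_TAPS];

/* Call once at startup, before filtering. */
void fir_init(void)
{
    memcpy(fir_taps_ram, fir_taps_flash, sizeof fir_taps_ram);
}

/* Push one sample into the history (history[0] is the newest sample)
   and return one filtered output sample. */
float fir_step(float history[NUM_TAPS], float x)
{
    memmove(&history[1], &history[0], (NUM_TAPS - 1) * sizeof(float));
    history[0] = x;

    float acc = 0.0f;
    for (size_t i = 0; i < NUM_TAPS; i++) {
        acc += fir_taps_ram[i] * history[i];
    }
    return acc;
}
```

The cost is NUM_TAPS floats of RAM; measure both variants on your part before deciding it's worth it.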
I didn't look at how the ESP32 benchmark numbers are produced, but I'm willing
to bet they're done using IRAM for the code (for example).
Without digging up the notes I scribbled down at the time, I don't remember
exactly how the STM32F446 compared to the ESP32, but I recall that modem
processing used roughly the same percentage of real time on both the
ESP32 (@240MHz) and the 'F446 (@180MHz).
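Equal percentages at different clock rates mean different cycle counts. A minimal sketch of that arithmetic is below; the function names are made up, and any load or frame length you plug in is your own assumption, not a measurement from either chip.

```c
/* Fraction of real time consumed: cycles spent per frame divided by
   the cycles available in one frame period. */
double realtime_load(double cycles_per_frame, double clock_hz,
                     double frame_seconds)
{
    return cycles_per_frame / (clock_hz * frame_seconds);
}

/* Inverse of the above: cycles per frame that correspond to a given load. */
double cycles_for_load(double load, double clock_hz, double frame_seconds)
{
    return load * clock_hz * frame_seconds;
}
```

For any fixed load and frame length, cycles_for_load() at 240e6 Hz is 4/3 of the value at 180e6 Hz. In other words, equal real-time percentages would mean the ESP32 spent about 1.33x as many cycles as the 'F446 on the same work.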
Given the SM1000 uses an STM32F405 (Cortex-M4F, single-precision FPU), it
doesn't seem immediately ridiculous that the ESP32 + ESP-DSP could perform
similarly, but you'll want to do some testing on your chosen hardware.
Also, you might have a look at the AiThinker ESP32-A1S, which is the same form
factor as an ESP32-WROOM/WROVER but integrates an AC101 CODEC. It's what
I've been using; the only real wrinkle I've found is that the AC101 CODEC at 12ks/s
insists on sending each sample 4x at 48ks/s.
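Undoing the 4x repetition is just a matter of keeping one sample from each group. A minimal sketch, assuming mono 16-bit samples, exact repeats, and a buffer that starts on a group boundary (none of which is checked here); the function name is made up and nothing below touches the AC101 or I2S driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Keep the first sample of every group of `repeat` input samples.
   `out` must have room for in_len / repeat samples.
   Returns the number of samples written; a trailing partial group
   is ignored. */
size_t drop_repeated_samples(const int16_t *in, size_t in_len,
                             int16_t *out, size_t repeat)
{
    assert(repeat > 0);

    size_t n = 0;
    for (size_t i = 0; i + repeat <= in_len; i += repeat) {
        out[n++] = in[i];
    }
    return n;
}
```

With repeat = 4, a 48ks/s buffer from the CODEC comes back as the 12ks/s stream you asked for.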
Cheers,
Dana K6JQ
On Feb 6, 2020, at 12:45 AM, DANA MYERS <[email protected]> wrote:
On February 5, 2020 at 9:25 PM Bruce Perens via Freetel-codec2
<[email protected]> wrote:
Dana, the only thing you didn't make clear is whether your code is using the fixed or floating data type. If it's using the
floating one, it would be interesting to isolate why performance is so poor when more conventional code is generated by the
compiler. I can understand float code being slightly slower than double, if the hardware FPU is implemented in double size, as
it normally would be.
Yes, I am using floating types on both the Cortex-M4F and ESP32. My
apologies for calling out M4F without mentioning the significance of the
'F' :-).
MCU FPUs are, in my limited experience (Cortex-M4F and ESP32),
single-precision. IIRC, higher-end parts (Cortex-M7) may feature
double-precision.
I don't know why the optimized assembly is 2x faster than compiled code;
that would be a question for Espressif/Tensilica, I suppose.
The floating performance as previously benchmarked is poor enough that I wondered whether there was really hardware, or
whether some of that blobby code was processing float in an exception handler.
As did I. So I gave it a try.