Re: [Freetel-codec2] ESP32 port

Dana Myers Thu, 06 Feb 2020 10:23:18 -0800

On 2/6/2020 6:23 AM, Bill Gaylord wrote:

Personally it seems pretty fast, only like a max of 9 us for most ops on floats (doubles are a different story). It can be as low as almost 1 us or as high as 55 us depending on the operation.


I recommend evaluating the performance in terms of measured clock cycles
per DSP operation, such as this chart from Espressif
(https://docs.espressif.com/projects/esp-dsp/en/latest/esp-dsp-benchmarks.html):

FFTs (clock cycles)                           ESP32    ANSI C
dsps_fft2r_fc32 for 64 complex points          5451      8187
dsps_fft2r_fc32 for 128 complex points        12400     18756
dsps_fft2r_fc32 for 256 complex points        27829     42381
dsps_fft2r_fc32 for 512 complex points        61755     94616
dsps_fft2r_fc32 for 1024 complex points      135745    209058

The ESP-DSP library is substantially faster than the non-optimized
ANSI C implementation; I can only speculate that a human is better
at exploiting this CPU than the current generation of compilers.

Note that I'm talking about F32 data types here.
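
To put the chart in more familiar units, the cycle counts can be converted
to time at a given clock, and the two columns compared as a ratio. A minimal
sketch in portable C, assuming a 240 MHz clock (the ESP32 rate mentioned
further down); the helper names cycles_to_us and speedup are made up for
illustration and are not part of ESP-DSP:

```c
#include <stdint.h>

/* Convert a cycle count to microseconds at a given CPU clock (in Hz). */
double cycles_to_us(uint32_t cycles, double cpu_hz)
{
    return (double)cycles * 1e6 / cpu_hz;
}

/* Ratio of ANSI C cycles to optimized cycles; a value above 1 means
 * the optimized routine needs fewer cycles. */
double speedup(uint32_t ansi_cycles, uint32_t optimized_cycles)
{
    return (double)ansi_cycles / (double)optimized_cycles;
}
```

For the 1024-point row, cycles_to_us(135745, 240e6) comes to roughly 566 us,
and speedup(209058, 135745) to roughly 1.54. The 64-point row gives about
1.50, so the advantage is fairly steady across sizes.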

You can find a similar chart from ARM in a white paper for Cortex-M4F,
but it doesn't cover the same specific cases as the Espressif chart, so even
a superficial comparison is not possible. Even if a superficial comparison
were possible, it's *likely to be misleading* for several reasons.

Notably in this case, the optimized ESP-DSP library is written in assembler
and its cycle counts will be constant regardless of the compiler
version/optimization. The ARM CMSIS-DSP library is written in C and its
cycle counts will surely vary with the compiler used (Keil vs GCC for example).

There's likely quite a bit more variation in the Cortex-M benchmarks because
different vendors use the same Cortex-M cores in different ways, with varying
I-cache, D-cache, flash speed and memory-interconnect schemes. For this reason,
my guess is that ARM benchmarks are done at relatively low clock rates that
yield accurate cycle counts *of the CPU* without taking into account the impact
of memory schemes. Heck, the ARM cycle counts might be done in simulation
that assumes zero-wait at all times. [ Another example: it's tempting to put
FIR coefficients in flash (declare 'const') but some of the STM32F-series
parts will punish you with wait-states (it's quite measurable). ]
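
One way to sidestep the flash wait-state penalty is to keep the 'const'
table as the master copy and filter from a RAM copy made once at startup.
A minimal sketch in portable C; the tap values and the names fir_taps_flash,
fir_taps_ram, fir_init and fir_sample are placeholders, and whether 'const'
data actually lands in flash depends on the toolchain and linker script:

```c
#include <stddef.h>
#include <string.h>

#define FIR_NTAPS 4

/* 'const' data: on many MCUs the linker places this in flash. */
static const float fir_taps_flash[FIR_NTAPS] = { 0.25f, 0.25f, 0.25f, 0.25f };

/* Working copy in RAM, filled once at startup. */
static float fir_taps_ram[FIR_NTAPS];

/* Copy the coefficients out of the const table before filtering. */
void fir_init(void)
{
    memcpy(fir_taps_ram, fir_taps_flash, sizeof fir_taps_ram);
}

/* One output sample: dot product of the RAM taps with the newest
 * FIR_NTAPS input samples (history[0] is the most recent). */
float fir_sample(const float history[FIR_NTAPS])
{
    float acc = 0.0f;
    for (size_t i = 0; i < FIR_NTAPS; i++) {
        acc += fir_taps_ram[i] * history[i];
    }
    return acc;
}
```

The cost is FIR_NTAPS floats of RAM; whether that trade is worth it is
something to measure on the actual part.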

I didn't look how the ESP32 benchmark numbers are produced, but I'm willing
to bet they're done using IRAM for the code (for example).

Without digging up the notes I scribbled down at the time, I don't remember
exactly how the STM32F446 compared to the ESP32; as far as I recall, roughly
the same percentage of real-time was used by both the ESP32 (@240MHz) and
the 'F446 (@180MHz) for modem processing.
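
"Percentage of real-time" here can be read as cycles spent per frame divided
by the cycles available in one frame period. A minimal sketch of that
arithmetic, with made-up helper names; the frame period and load figures in
the note below are arbitrary illustrations, not measurements from either chip:

```c
/* Fraction of real time used: cycles spent on one frame divided by
 * the cycles available during one frame period. */
double realtime_load(double cycles_per_frame, double cpu_hz, double frame_seconds)
{
    return cycles_per_frame / (cpu_hz * frame_seconds);
}

/* Inverse: how many cycles per frame a given load fraction implies. */
double cycles_for_load(double load, double cpu_hz, double frame_seconds)
{
    return load * cpu_hz * frame_seconds;
}
```

For any equal load and frame period, cycles_for_load() at 240e6 divided by
cycles_for_load() at 180e6 is 240/180, about 1.33. So if the percentages
really were equal, the ESP32 would have been spending roughly a third more
cycles than the 'F446 on the same work.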

Given the SM1000 uses an STM32F405 (Cortex-M4F, single-precision FPU), it
doesn't seem immediately ridiculous that the ESP32 + ESP-DSP could perform
similarly, but you'll want to do some testing on your chosen hardware.

Also, you might have a look at the AiThinker ESP32-A1S, which is the same form
factor as an ESP32-WROOM/WROVER but integrates an AC101 CODEC. It's what
I've been using; the only real wrinkle I've found is the AC101 CODEC at 12ks/s
insists on sending each sample 4x at 48ks/s.
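
If each sample really does arrive four times in the 48ks/s stream, getting
back to the intended rate amounts to keeping one sample out of every four.
A minimal sketch, assuming mono 16-bit samples and a buffer that starts on
a group boundary; keep_every_nth is a made-up helper, not part of any AC101
or ESP-IDF driver:

```c
#include <stddef.h>
#include <stdint.h>

/* Keep the first sample of every group of `repeat` input samples.
 * `out` must have room for (in_len + repeat - 1) / repeat samples.
 * Returns the number of samples written. */
size_t keep_every_nth(const int16_t *in, size_t in_len,
                      int16_t *out, size_t repeat)
{
    size_t n = 0;

    if (repeat == 0) {
        return 0;   /* nothing sensible to do */
    }
    for (size_t i = 0; i < in_len; i += repeat) {
        out[n++] = in[i];
    }
    return n;
}
```

Calling keep_every_nth(buf, len, out, 4) on a 48ks/s buffer would leave
len/4 samples. No filtering is needed in this case because the dropped
samples are exact repeats rather than new data.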

Cheers,
Dana K6JQ

On Feb 6, 2020, at 12:45 AM, DANA MYERS <[email protected]> wrote:
On February 5, 2020 at 9:25 PM Bruce Perens via Freetel-codec2 <[email protected]> wrote:
Dana, the only thing you didn't make clear is whether your code is using the fixed or floating data type. If it's using the floating one, it would be interesting to isolate why performance is so poor when more conventional code is generated by the compiler. I can understand float code being slightly slower than double, if the hardware FPU is implemented in double size, as it normally would be.

Yes, I am using floating types on both the Cortex-M4F and ESP32. My
apologies for calling-out M4F without mentioning the significance of the
'F' :-).

MCU FPUs are, in my limited experience (Cortex-M4F and ESP32),
single-precision. IIRC, higher-end parts (Cortex-M7) may feature
double-precision.

I don't know why the optimized assembly is 2x faster than compiled code;
that would be a question for Espressif/Tensilica, I suppose.
The floating performance as previously benchmarked is poor enough that I wondered whether there was really hardware, or whether some of that blobby code was processing float in an exception handler.
As did I. So I gave it a try.

_______________________________________________
Freetel-codec2 mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/freetel-codec2
