Hi,

I'd like to share a few thoughts/ideas as well:

We have now removed most of the easy-to-fix performance hotspots. The exception is the kiss_fft calls, which now account for a major part of the overall runtime of modem and decoder. So I experimented with kiss_fft vs. the ARM DSP FFT on the STM32F4, using some of the newer libs.

It turns out that, at least for the real FFT (kiss_fftr vs. arm_rfft_fast_f32), there is no measurable time difference when used in our mcHF code. How does this relate to Glen's measurements (he measured better performance)? I believe this is because the ARM lib keeps some of its data in precomputed arrays in flash. Access to flash is slow (5 wait states), which reduces performance. kiss has all its data in RAM. Since Glen initialized the ARM DSP tables in RAM, he got speed gains at the expense of RAM. On an STM32F746 this is not as much of an issue (384K) as it is on the STM32F4 (the default MCU in the mcHF project has 192K RAM, and we have to fit the RAM needs of the full SDR firmware into this space). Speed is traded against RAM use. Since we have reached our goal timewise, memory reduction is now more in focus for us (but it should not get slower, of course).
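
The flash-versus-RAM trade-off can be sketched roughly as follows. This is not how the ARM lib is implemented; `twiddle_flash`, `twiddle_table_get` and `twiddle_table_release` are hypothetical names for illustration, and the table contents are made up. On a PC the `const` array is of course not in flash, so the sketch only shows the structure of the choice, not the timing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical precomputed table. On an STM32 a const array like this is
 * normally placed in flash by the linker; on a PC it is just read-only data. */
#define TWIDDLE_LEN 8
static const float twiddle_flash[TWIDDLE_LEN] = {
     1.0f,  0.7071068f, 0.0f, -0.7071068f,
    -1.0f, -0.7071068f, 0.0f,  0.7071068f
};

/* Return a pointer to the table the FFT should read from.
 * use_ram != 0: copy the table to the heap (faster reads, costs RAM).
 * use_ram == 0: read the const table in place (no RAM cost). */
static const float *twiddle_table_get(int use_ram, float **ram_copy)
{
    assert(ram_copy != NULL);
    *ram_copy = NULL;
    if (use_ram) {
        *ram_copy = malloc(sizeof(twiddle_flash));
        if (*ram_copy != NULL) {
            memcpy(*ram_copy, twiddle_flash, sizeof(twiddle_flash));
            return *ram_copy;
        }
        /* Out of memory: fall back to the const table. */
    }
    return twiddle_flash;
}

/* Release the RAM copy, if one was made (free(NULL) is a no-op). */
static void twiddle_table_release(float *ram_copy)
{
    free(ram_copy);
}
```

A build with little RAM would always pass `use_ram = 0`; a build with RAM to spare could pass 1 and release the copy when FreeDV is switched off.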

Because of that I would like to propose the following approach to keep the code easily readable while providing efficient solutions for the ARM MCUs both with little and more RAM:

1. We create an abstract interface for running FFTs in the codec2 sources. Initially this should closely resemble the existing kiss_fft calls. That makes introducing the interface easy, since we can use all the existing tests to verify that introducing the interface does not change a single bit in the output.

2. Once validated, we can introduce/activate the ARM DSP FFT, with some glue code to map between the abstract interface and the ARM DSP interface. Here again we have to validate that everything works, but I assume we will see some slight differences in the output. However, with it we can produce reference data for step 3.

3. Now we modify the existing code so that we can benefit from some nice properties of the ARM DSP FFT (in-place FFT), which will reduce RAM usage significantly (relative to 192K RAM).

4. We enable optional use of RAM instead of flash in the ARM code, so that depending on the amount of available memory you can get some extra boost.

For that to work nicely, we have to fix some issues in the existing code first, so here comes step 0:

0. As Glen pointed out, some of the #define constants have poorly chosen names. M is especially nasty (it is defined for 2 different purposes in defines.h and fdmdv_internal.h), and so is N in defines.h (there are some local variables named N, and the stm headers also get confused by it). So we need to change these to something unambiguous. I think Glen already suggested names for them.
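
The two kinds of validation in steps 1 and 2 could look roughly like this. It is only a sketch: `outputs_bit_identical`, `outputs_close` and the tolerance `max_abs_err` are hypothetical names, not part of the existing codec2 tests. Step 1 expects bit-identical output, while step 2 swaps the FFT implementation and therefore needs a tolerance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Step 1 check: the wrapper must not change a single bit of the output. */
static int outputs_bit_identical(const float *ref, const float *test, size_t n)
{
    assert(ref != NULL && test != NULL);
    return memcmp(ref, test, n * sizeof(float)) == 0;
}

/* Step 2 check: a different FFT implementation rounds differently, so
 * compare against a tolerance (max_abs_err is an assumed, tunable bound). */
static int outputs_close(const float *ref, const float *test, size_t n,
                         float max_abs_err)
{
    assert(ref != NULL && test != NULL);
    for (size_t i = 0; i < n; i++) {
        float d = ref[i] - test[i];
        if (d < 0.0f) {
            d = -d;
        }
        if (!(d <= max_abs_err)) {
            return 0;   /* also rejects NaN, which fails every comparison */
        }
    }
    return 1;
}
```

The output of step 2 that passes the tolerance check can then be stored as reference data, so that step 3 can again be checked bit-exactly against it.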


I would also like to point out that dynamic memory allocation (malloc/free) is necessary in our mcHF case, so I would like to keep it more or less as it is. The mcHF needs to be able to reuse the memory for other operational modes when FreeDV is not active. That does not mean I am against removing the internal use of malloc, but then it should be easy to create the required data structures "outside" the code using malloc. I.e. the use of static data structures for anything but const data is a no-go.
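
One way to have no malloc inside the code and still no static state is to let the caller supply the memory. The following is a minimal sketch of that idea only; `demo_state`, `demo_state_size` and `demo_state_init` are hypothetical names, and the real FreeDV structures are much larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical codec state, standing in for the real data structures. */
struct demo_state {
    int   nfft;
    float window[64];
};

/* Tell the caller how many bytes the state needs. */
static size_t demo_state_size(void)
{
    return sizeof(struct demo_state);
}

/* Initialise a state inside caller-supplied memory; no malloc in here.
 * The memory must be suitably aligned (malloc'ed memory always is).
 * Returns NULL if the buffer is missing or too small. */
static struct demo_state *demo_state_init(void *mem, size_t len, int nfft)
{
    assert(nfft > 0);
    if (mem == NULL || len < demo_state_size()) {
        return NULL;
    }
    struct demo_state *st = mem;
    memset(st, 0, sizeof(*st));
    st->nfft = nfft;
    return st;
}
```

On the mcHF the caller would malloc `demo_state_size()` bytes when FreeDV mode is entered and free them when it is left, while a target that prefers fixed buffers could pass its own memory instead.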

To support that discussion I created a draft suggestion for the interface (attached to this mail). Right now it is defined using inline code for the sake of simplicity. This may change later; I don't think there should be any issue with that. It essentially contains 3 functions each for the complex FFT and the real FFT (alloc, fft, free), plus the necessary data structures.

Danilo




On 16.09.2016 at 20:59, Dana Myers wrote:

Subject:        Re: [Freetel-codec2] more benching and thoughts
Date:   Fri, 16 Sep 2016 12:13:05 +1000
From:   glen english <g...@cortexrf.com.au>
Reply-To:       freetel-codec2@lists.sourceforge.net
To:     freetel-codec2@lists.sourceforge.net



Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.

I'm similar, though now that we have 32kB, 64kB (or even more) RAM
in embedded chips, they're basically like the systems that malloc()
was initially built on :-)

I still don't trust C++ heap allocators in embedded applications, though.

However, because the heap usage would be deterministic, it should be
fairly safe.
I have a similar project; a 1200 baud modem + TNC stack built in a PSoC 5LP. I use the Delta-Sigma ADC @ 9600s/s , 16-bits, the PSoC Digital Filter Block to do bandpass heavy-lifting and frequency response correction for the ADC. I use CMSIS-DSP for the rest of the DSP crunching required, and, wait for it,
wrap the whole thing in FreeRTOS (9.0.0 now). I use q31_t for all the DSP,
as long as I'm careful to avoid blowing out past the +/- 1.0 range, it's probably every bit as good as single-precision floating point. The PSoC 5LP has a Cortex-M3,
I'm running it at 80MHz.
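
The Q31 range caveat can be illustrated with a small sketch. This is plain C, not the CMSIS-DSP implementation; `float_to_q31_sat`, `q31_mul_sat` and `q31_add_sat` are hypothetical helper names. Q31 stores values in [-1.0, 1.0) as 32-bit integers scaled by 2^31, so results outside that range have to saturate rather than wrap.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a float to Q31, saturating values outside [-1.0, 1.0). */
static int32_t float_to_q31_sat(float x)
{
    assert(x == x);                 /* NaN has no Q31 representation */
    if (x >= 1.0f) {
        return INT32_MAX;
    }
    if (x <= -1.0f) {
        return INT32_MIN;
    }
    return (int32_t)(x * 2147483648.0f);
}

/* Q31 multiply: 64-bit product shifted back by 31 bits. Assumes the usual
 * arithmetic right shift for negative values. */
static int32_t q31_mul_sat(int32_t a, int32_t b)
{
    int64_t p = ((int64_t)a * (int64_t)b) >> 31;
    if (p > INT32_MAX) {
        return INT32_MAX;           /* only happens for (-1.0) * (-1.0) */
    }
    return (int32_t)p;
}

/* Q31 add: clamp instead of wrapping around when the sum leaves the range. */
static int32_t q31_add_sat(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) {
        return INT32_MAX;
    }
    if (s < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)s;
}
```

Adding 0.75 and 0.75, for example, clamps to just under 1.0 instead of wrapping to a negative value, which is the "blowing out" case to avoid.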

 My dynamic buffer implementation uses ...

*******************************************************************************
Take a look at the memory management routine heap_2.c in FreeRTOS
   (in fact, there are heap_1.c to heap_5.c, a few options... try heap_4.c, also)
-this is a much smarter memory alloc and dealloc routine that is fairly
cheap.
much better than usual brain dead malloc.
********************************************************************************
I'd recommend using that. It looks for blocks same size, existing used etc

heap_4.c and, while I've never explicitly profiled it, I've never had a reason to suspect the allocator is misbehaving. I commit 32KB to the heap and currently
never use more than about 3KB.
I would expect the same improvements on the F4 as the F7 using the CMSIS
library. The F7 is much faster on that sort of code.

I only got rid of the FFT malloc stuff; the huge stack additions are
still in there
and you could save 50% there ...

Without knowing the details of this application (I'm new here), I am quite
impressed with the quality of CMSIS-DSP, particularly in terms of exploiting
the ARM extensions.

Cheers,
Dana

-glen


On 16/09/2016 12:06 PM, Danilo Beuche wrote:
> Hi Glen,
>
> nice, would be interesting to see how much the STM32F4 gains by use of
> CMSIS FFT routines.
>
> BTW, I am not sure, but I think you mentioned the removal of malloc as
> one of your changes. For us with the mcHF it would not be good to have
> the memory for FreeDV code statically allocated since FreeDV is just one
> operation mode of the mcHF, and we need the memory at other times for
> other stuff, especially since it really eats a lot of memory (in
> relation to the STM32F4 RAM sizes). Even half of it is still a lot.
>
> Looking forward to gaining some more free cycles from your work.
>
> Regards,
> Danilo
>
>
> On 16.09.2016 at 03:53, glen english wrote:
>> Hi Danilo
>>
>> yeah, you have plenty in hand.
>>
>> OK so M7 and CMSIS FFT, about 2 x speed (same clock), 7.74 ms (1200bps)
>> for decode.
>>
>> On 16/09/2016 11:49 AM, Danilo Beuche wrote:
>>> Hi,
>>>
>>> regarding the times @mcHF (STM32F4, 168 MHz) some clarifications: We
>>> measured 17.3ms per 40ms interval for the voice decode part only (this
>>> is only happening once the modem is synced) and roughly 5ms of
>>> fdmdv_demod per 20ms interval (happens all the time). Which gives us in
>>> total some 27ms per 40ms once synced. This is about 68% load.




------------------------------------------------------------------------------


_______________________________________________
Freetel-codec2 mailing list
Freetel-codec2@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freetel-codec2

/*
 * codec2_fft.h
 *
 *  Created on: 17.09.2016
 *      Author: danilo
 */

#ifndef DRIVERS_FREEDV_CODEC2_FFT_H_
#define DRIVERS_FREEDV_CODEC2_FFT_H_

#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#ifdef ARM_MATH_CM4
  #include "stm32f4xx.h"
  #include "core_cm4.h"
  #include "arm_math.h"
  #include "arm_const_structs.h"
#endif

#include "defines.h"
#include "kiss_fft.h"
#include "kiss_fftr.h"
#include "comp.h"

#ifndef ARM_MATH_CM4
  typedef kiss_fftr_cfg codec2_fftr_cfg;
  typedef kiss_fft_cfg codec2_fft_cfg;
#else
  typedef struct {
      arm_rfft_fast_instance_f32* instance;
      int inverse;
  } codec2_fftr_struct;

  typedef codec2_fftr_struct* codec2_fftr_cfg;

  typedef struct {
      const arm_cfft_instance_f32* instance;
      int inverse;
  } codec2_fft_struct;

  typedef codec2_fft_struct* codec2_fft_cfg;
#endif

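/* Allocate a real-FFT configuration. Non-ARM builds forward to
 * kiss_fftr_alloc() (which takes a size_t* for lenmem). ARM builds ignore
 * mem and lenmem and always use the heap. Returns NULL on failure. */
static inline codec2_fftr_cfg codec2_fftr_alloc(int nfft, int inverse_fft, void* mem, size_t* lenmem)
{
    codec2_fftr_cfg retval;
#ifndef ARM_MATH_CM4
    retval = kiss_fftr_alloc(nfft, inverse_fft, mem, lenmem);
#else
    (void)mem;
    (void)lenmem;
    retval = malloc(sizeof(codec2_fftr_struct));
    if (retval == NULL) return NULL;
    retval->inverse  = inverse_fft;
    retval->instance = malloc(sizeof(arm_rfft_fast_instance_f32));
    if (retval->instance == NULL ||
        arm_rfft_fast_init_f32(retval->instance, nfft) != ARM_MATH_SUCCESS) {
        free(retval->instance);
        free(retval);
        return NULL;
    }
#endif
    return retval;
}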
typedef kiss_fft_scalar codec2_fft_scalar;
typedef COMP    codec2_fft_cpx;

static inline void codec2_fftr(codec2_fftr_cfg cfg, codec2_fft_scalar* in, codec2_fft_cpx* out)
{
#ifndef ARM_MATH_CM4
    kiss_fftr(cfg, in, (kiss_fft_cpx*)out);
#else
    arm_rfft_fast_f32(cfg->instance, in, (float*)out, cfg->inverse);
    out->imag = 0; // remove out[FFT_ENC/2]->real stored in out[0].imag
#endif
}

static inline void codec2_fftr_free(codec2_fftr_cfg cfg)
{
#ifndef ARM_MATH_CM4
    KISS_FFT_FREE(cfg);
#else
    free(cfg->instance);
    free(cfg);
#endif
}


static inline codec2_fft_cfg codec2_fft_alloc(int nfft, int inverse_fft, void* mem, size_t* lenmem)
{
    codec2_fft_cfg retval;
#ifndef ARM_MATH_CM4
    retval = kiss_fft_alloc(nfft, inverse_fft, mem, lenmem);
#else
    (void)mem;      /* caller-supplied memory is not used on ARM builds */
    (void)lenmem;
    retval = malloc(sizeof(codec2_fft_struct));
    if (retval == NULL) return NULL;
    retval->inverse  = inverse_fft;
    switch(nfft)
    {
    case 256:
        retval->instance = &arm_cfft_sR_f32_len256;
        break;
    case 512:
        retval->instance = &arm_cfft_sR_f32_len512;
        break;
    case 1024:
        retval->instance = &arm_cfft_sR_f32_len1024;
        break;
    default:
        abort();
    }
#endif
    return retval;
}

static inline void codec2_fft(codec2_fft_cfg cfg, codec2_fft_cpx* in, codec2_fft_cpx* out)
{
#ifndef ARM_MATH_CM4
    // complex FFT: kiss_fft, not kiss_fftr
    kiss_fft(cfg, (kiss_fft_cpx*)in, (kiss_fft_cpx*)out);
#else
    // TODO: this is not nice, but for now required to keep changes minimal.
    // However, since the main goal is to reduce memory usage, we should
    // convert to an in-place interface. On PC-like platforms the overhead
    // of using the "in-place" kiss_fft calls is negligible compared to the
    // gain in memory usage on STM32 platforms.
    // Copy first so that "in" is left untouched, as with kiss_fft.
    if (out != in) {
        memcpy(out, in, cfg->instance->fftLen*2*sizeof(float));
    }
    // bitReverseFlag = 1 gives output in natural (not bit-reversed) order
    arm_cfft_f32(cfg->instance, (float*)out, cfg->inverse, 1);
#endif
}

static inline void codec2_fft_free(codec2_fft_cfg cfg)
{
#ifndef ARM_MATH_CM4
    KISS_FFT_FREE(cfg);
#else
    free(cfg);
#endif
}

#endif /* DRIVERS_FREEDV_CODEC2_FFT_H_ */