Thanks Danilo,

That's an interesting explanation of why kiss_fft performs just as well 
as the optimised ARM FFT on the M4.  I also found no performance 
improvement when I tried the optimised ARM FFT a few years ago.

I'm inclined to keep malloc/free in codec 2.  Happy to look at patches 
for alternate memory allocators if it's an itch anyone really wants to 
scratch.

Danilo - I will get back to you on your other points and proposal shortly.

To the List in general - I'd like to publicly thank Danilo and the mcHF 
team for the fine patches he/they have submitted over the last few 
weeks.  The FreeDV 1600 decoder is now around 40% faster on the STM32F4.

Importantly - his suggestions have been backed by quality patches I can 
easily apply and test.  He has shortened my TODO list - not made it 
longer.  Very important for me and a fine example to anyone else who 
would like to contribute to Codec 2.  Thanks Danilo!

- David

On 18/09/16 10:10, Danilo Beuche wrote:
> Hi,
>
> I'd like to share a few thoughts/ideas as well:
>
> Since we have removed most of the easy-to-remove performance
> hotspots, with the exception of the kiss_fft calls, which now contribute a
> major part of the overall runtime of modem and decoder, I played around
> with kiss_fft vs. the ARM DSP FFT on the STM32F4 with some of the newer libs.
>
> It turns out that, at least for the real FFT (kiss_fftr vs.
> arm_rfft_fast_f32), there is no time difference when used in our mcHF
> code. How does this relate to Glen's measurements (he measured better
> performance)? I believe this is because the ARM lib stores some of its
> data in precomputed flash arrays. Access to flash is slow (5 wait
> states), which reduces performance. kiss_fft has all its data in RAM.
> Since Glen initialized the ARM DSP tables in RAM, he got speed gains
> at the expense of RAM. On an STM32F746, RAM is not as much of an
> issue (384K) as it is on the STM32F4 (the default MCU in the mcHF
> project has 192K RAM and we have to fit the full SDR firmware's RAM needs
> into this space). Speed is traded for reduced RAM use. Since we have
> reached our goal timewise, memory reduction is now more in focus for us
> (but it should not get slower, of course).
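
A minimal sketch of the flash-versus-RAM trade-off described above: the same lookup table is either used in place from a const array (which the linker normally puts in flash on an STM32) or copied once into heap RAM. The names twiddle_rom, table_t, table_acquire and table_release are hypothetical and are not part of kiss_fft or CMSIS-DSP. On a desktop both copies live in RAM, so this shows only the structure, not the speed difference.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical precomputed table: cos(k*pi/8) for k = 0..7.
   On an STM32, const data like this normally ends up in flash. */
#define TABLE_LEN 8
static const float twiddle_rom[TABLE_LEN] = {
    1.0f, 0.92388f, 0.70711f, 0.38268f, 0.0f, -0.38268f, -0.70711f, -0.92388f
};

typedef struct {
    const float *data;  /* table actually read by the DSP code      */
    float *ram_copy;    /* non-NULL only when we own a heap copy    */
} table_t;

/* use_ram != 0: spend sizeof(twiddle_rom) bytes of RAM to avoid flash
   reads. Falls back to the flash table if malloc fails. */
static void table_acquire(table_t *t, int use_ram)
{
    assert(t != NULL);
    t->data = twiddle_rom;
    t->ram_copy = NULL;
    if (use_ram) {
        t->ram_copy = malloc(sizeof twiddle_rom);
        if (t->ram_copy != NULL) {
            memcpy(t->ram_copy, twiddle_rom, sizeof twiddle_rom);
            t->data = t->ram_copy;
        }
    }
}

/* Give the RAM back, e.g. when switching to another operating mode. */
static void table_release(table_t *t)
{
    assert(t != NULL);
    free(t->ram_copy);
    t->ram_copy = NULL;
    t->data = twiddle_rom;
}
```

A build with little RAM would call table_acquire(&t, 0), and one with RAM to spare table_acquire(&t, 1). The code that reads t.data is the same in both cases.
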
>
> Because of that I would like to propose the following approach to keep
> the code easily readable while providing efficient solutions for the ARM
> MCUs both with little and more RAM:
>
> 1. We create an abstract interface for running fft in the codec2
> sources. Initially this should closely resemble the existing kiss_fft
> calls, which makes introducing the interface easy, since we may
> use all the existing tests and can verify that introducing the interface
> does not change a bit in the output.
>
> 2. Once validated, we can now introduce/activate the use of the arm DSP
> FFT with some glue code to map between the abstract interface and the
> arm DSP interface. Here again we have to validate everything is working
> nicely but we will see some slight differences I assume. However, with
> it we can produce reference data for step 3.
>
> 3. Now we modify the existing code so that we can benefit from some nice
> properties of the ARM DSP FFT (in-place FFT), which will reduce
> RAM usage significantly (in relation to 192K RAM).
>
> 4. We enable optional use of RAM instead of flash in the ARM code, so
> that depending on the amount of available memory you can get some extra
> boost.
>
> For that to work nicely, we have to fix some issues in the existing code
> first, so here comes
>
> 0. As Glen pointed out, some of the #define constants have poorly chosen
> names. M is especially nasty (it is defined for 2 different purposes in
> defines.h and fdmdv_internal.h), and so is N in defines.h (there are some
> local variables named N and the stm headers also get confused by it). So
> we need to change these to something unambiguous. I think Glen already
> suggested names for them.
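
To illustrate the clash in item 0, here is a small sketch. The prefixed macro names and their values are placeholders invented for this example, not the names Glen suggested and not necessarily the values in defines.h or fdmdv_internal.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefixed replacements for the bare macros N and M.
   The values are placeholders only. */
#define CODEC2_FRAME_N  80
#define FDMDV_SYMBOL_M  160

/* With a bare "#define N 80" in scope, the parameter N below would be
   rewritten by the preprocessor to "int 80" and fail to compile.
   With prefixed macro names, local use of N is harmless. */
static int count_nonzero(const short *x, int N)
{
    int count = 0;
    assert(x != NULL && N >= 0);
    for (int i = 0; i < N; i++) {
        if (x[i] != 0) {
            count++;
        }
    }
    return count;
}
```

The same applies to any vendor header that uses N or M as a parameter or struct member name.
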
>
>
> I would also like to point out that the use of dynamic memory allocation
> (malloc/free) is necessary in our mcHF case, so I would like to keep
> this more or less as it is. The mcHF needs the ability to reuse the
> memory for other operational modes, if FreeDV is not active. Which does
> not mean I am against removing the internal use of malloc, but then it
> should be possible to easily create the required data structures
> "outside" the code using malloc. I.e. the use of static data structures
> for anything but const data is a no go.
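
One common way to meet this requirement is to let the caller supply the memory, with a thin malloc wrapper for everyone else. The sketch below assumes a made-up demo_state_t; none of these names exist in codec2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical state; a real codec state would be much larger. */
typedef struct {
    float history[64];
    int   frames_decoded;
} demo_state_t;

/* Lets the caller find out how much memory to set aside. */
size_t demo_state_size(void)
{
    return sizeof(demo_state_t);
}

/* Initialise the state in caller-provided memory. Nothing is allocated
   here; the caller is responsible for suitable alignment. */
demo_state_t *demo_state_init(void *mem, size_t mem_size)
{
    if (mem == NULL || mem_size < sizeof(demo_state_t)) {
        return NULL;
    }
    demo_state_t *s = mem;
    memset(s, 0, sizeof *s);
    return s;
}

/* Convenience wrapper for callers that are happy with malloc/free. */
demo_state_t *demo_state_create(void)
{
    void *mem = malloc(demo_state_size());
    demo_state_t *s = demo_state_init(mem, demo_state_size());
    if (s == NULL) {
        free(mem);  /* free(NULL) is a no-op */
    }
    return s;
}

void demo_state_destroy(demo_state_t *s)
{
    free(s);
}
```

Firmware like the mcHF could pass in a buffer that it reuses for other modes when FreeDV is inactive, while desktop tools keep calling demo_state_create().
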
>
> To support that discussion, I created a draft suggestion for the interface
> (attached to this mail). It is right now defined using inline code for
> the sake of simplicity. This may change later, I don't think there
> should be any issue with that. It essentially contains 3 functions for
> complex fft and real fft each (alloc,fft,free) and the necessary data
> structures.
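
For readers without the attachment, here is a rough sketch of what an alloc/fft/free triple for the complex case might look like. It is not Danilo's draft: the names c2_cpx, c2_fft_cfg, c2_fft_alloc, c2_fft and c2_fft_free are hypothetical. The backend is a deliberately tiny stand-in that only handles sizes 1, 2 and 4, so that it needs no trig library. A real backend would forward to kiss_fft or the ARM DSP library.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { float r, i; } c2_cpx;   /* hypothetical complex sample */

typedef struct {
    int nfft;      /* transform size              */
    int inverse;   /* 0 = forward, non-zero = inverse */
} c2_fft_cfg;

/* Forward twiddles for a 4-point DFT: exp(-j*2*pi*k/4), k = 0..3. */
static const c2_cpx w4[4] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };

static c2_fft_cfg *c2_fft_alloc(int nfft, int inverse)
{
    if (nfft != 1 && nfft != 2 && nfft != 4) {
        return NULL;               /* limit of this stand-in backend */
    }
    c2_fft_cfg *cfg = malloc(sizeof *cfg);
    if (cfg != NULL) {
        cfg->nfft = nfft;
        cfg->inverse = inverse;
    }
    return cfg;
}

/* Out-of-place, unscaled transform; in and out must not overlap. */
static void c2_fft(const c2_fft_cfg *cfg, const c2_cpx *in, c2_cpx *out)
{
    assert(cfg != NULL && in != NULL && out != NULL && in != out);
    int step = 4 / cfg->nfft;      /* maps size-nfft twiddles onto w4 */
    for (int k = 0; k < cfg->nfft; k++) {
        c2_cpx acc = {0.0f, 0.0f};
        for (int n = 0; n < cfg->nfft; n++) {
            c2_cpx w = w4[(k * n * step) % 4];
            if (cfg->inverse) {
                w.i = -w.i;        /* conjugate twiddle for the inverse */
            }
            acc.r += in[n].r * w.r - in[n].i * w.i;
            acc.i += in[n].r * w.i + in[n].i * w.r;
        }
        out[k] = acc;
    }
}

static void c2_fft_free(c2_fft_cfg *cfg)
{
    free(cfg);
}
```

Callers see only the three functions and the config handle, so swapping the backend should not require touching the codec or modem code.
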
>
> Danilo
>
>
>
>
> Am 16.09.2016 um 20:59 schrieb Dana Myers:
>>
>>> Subject:    Re: [Freetel-codec2] more benching and thoughts
>>> Date:       Fri, 16 Sep 2016 12:13:05 +1000
>>> From:       glen english <g...@cortexrf.com.au>
>>> Reply-To:   freetel-codec2@lists.sourceforge.net
>>> To:         freetel-codec2@lists.sourceforge.net
>>>
>>>
>>>
>>> Hi Danilo
>>> Yeah, I guess being a very bare metal programmer from the old 128 byte
>>> RAM days, I dislike MALLOCs in embedded code on principle.
>>
>> I'm similar, though now that we have 32kB, 64kB (or even more) RAM
>> in embedded chips, they're basically like the systems that malloc()
>> was initially built on :-)
>>
>> I still don't trust C++ heap allocators in embedded applications, though.
>>
>>> However, because the heap usage would be deterministic, it should be
>>> fairly safe.
>> I have a similar project; a 1200 baud modem + TNC stack built in a
>> PSoC 5LP.
>> I use the Delta-Sigma ADC @ 9600s/s , 16-bits, the PSoC Digital Filter
>> Block
>> to do bandpass heavy-lifting and frequency response correction for the
>> ADC.
>> I use CMSIS-DSP for the rest of the DSP crunching required, and, wait
>> for it,
>> wrap the whole thing in FreeRTOS (9.0.0 now). I use q31_t for all the DSP,
>> as long as I'm careful to avoid blowing out past the +/- 1.0 range,
>> it's probably
>> every bit as good as single-precision floating point. The PSoC 5LP has
>> a Cortex-M3,
>> I'm running it at 80MHz.
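
The "+/- 1.0 range" caveat can be made concrete with a small fixed-point sketch. It uses a local q31 typedef rather than the CMSIS-DSP headers so that it compiles anywhere, and the helper names q31_add_sat and q31_mul_sat are made up for this example. It shows the idea only and is not a replacement for the CMSIS-DSP routines.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q31;   /* local stand-in: value = raw / 2^31, in [-1.0, 1.0) */

/* Saturating add: clamp instead of wrapping when the sum leaves the range. */
static q31 q31_add_sat(q31 a, q31 b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) {
        return INT32_MAX;
    }
    if (s < INT32_MIN) {
        return INT32_MIN;
    }
    return (q31)s;
}

/* Multiply: the 64-bit product is Q62, dividing by 2^31 brings it back
   to Q31. The only overflow case is -1.0 * -1.0, which would be +1.0
   and is not representable, so it is clamped to the largest value. */
static q31 q31_mul_sat(q31 a, q31 b)
{
    int64_t p = ((int64_t)a * b) / ((int64_t)1 << 31);
    if (p > INT32_MAX) {
        return INT32_MAX;
    }
    return (q31)p;
}
```

For example, 0.5 * 0.5 gives exactly 0.25, while adding 0.5 to the largest positive value clamps instead of wrapping around to a negative number.
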
>>
>>  My dynamic buffer implementation uses ...
>>
>>> *******************************************************************************
>>> Take a look at the memory management routine  heap2.c in freertos.c
>>>   (in fact, there are heap1,2,3,4,5 .c - a few options... try heap4, also)
>>> -this is a much smarter memory alloc and dealloc routine that is fairly
>>> cheap.
>>> much better than usual brain dead malloc.
>>> ********************************************************************************
>>> I'd recommend using that. It looks for blocks same size, existing used etc
>>
>> heap_4.c and, while I've never explicitly profiled it, I've never had
>> a reason to
>> suspect the allocator is misbehaving. I commit 32KB to the heap and
>> currently
>> never use more than about 3KB.
>>> I would expect the same improvements on the F4 as the F7 using the CMSIS
>>> library. The F7 is much faster on that sort of code.
>>>
>>> I only got rid of the FFT malloc stuff; the huge stack additions are
>>> still in there
>>> and you could save 50% there ...
>>
>> Without knowing the details of this application (I'm new here), I am quite
>> impressed with the quality of CMSIS-DSP, particularly in terms of
>> exploiting
>> the ARM extensions.
>>
>> Cheers,
>> Dana
>>
>>> -glen
>>>
>>>
>>> On 16/09/2016 12:06 PM, Danilo Beuche wrote:
>>> > Hi Glen,
>>> >
>>> > nice, would be interesting to see how much the STM32F4 gains by use of
>>> > CMSIS FFT routines.
>>> >
>>> > BTW, I am not sure, but I think you mentioned the removal of malloc as
>>> > one of your changes. For us with the mcHF it would not be good to have
>>> > the memory for FreeDV code statically allocated since FreeDV is just one
>>> > operation mode of the mcHF, and we need the memory at other times for
>>> > other stuff, especially since it really eats a lot of memory (in
>>> > relation to the STM32F4 RAM sizes). Even half of it is still a lot.
>>> >
>>> > Looking forward to gain some more free cycles with your work.
>>> >
>>> > Regards,
>>> > Danilo
>>> >
>>> >
>>> > Am 16.09.2016 um 03:53 schrieb glen english:
>>> >> Hi Danilo
>>> >>
>>> >> yeah, you have plenty in hand.
>>> >>
>>> >> OK so M7 and CMSIS FFT,  about 2x speed (same clock), 7.74 ms (1200bps)
>>> >> for decode.
>>> >>
>>> >> On 16/09/2016 11:49 AM, Danilo Beuche wrote:
>>> >>> Hi,
>>> >>>
>>> >>> regarding the times @mcHF (STM32F4, 168 MHz) some clarifications: We
>>> >>> measured 17.3ms per 40ms interval for the voice decode part only (this
>>> >>> is only happening once the modem is synced) and roughly 5ms of
>>> >>> fdmdv_demod per 20ms interval (happens all the time). Which gives us in
>>> >>> total some 27ms per 40ms once synced. This is about 68% load.
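
The load figure follows from putting both costs on the same 40 ms window: the demodulator runs twice per window, so 17.3 ms + 2 x 5 ms = 27.3 ms, and 27.3 / 40 is roughly 68%. A small sketch of that arithmetic, with made-up function names:

```c
#include <assert.h>

/* Fraction of CPU time spent busy within one window. */
static double cpu_load(double busy_ms, double window_ms)
{
    assert(window_ms > 0.0);
    return busy_ms / window_ms;
}

/* Figures quoted above: voice decode 17.3 ms per 40 ms interval,
   fdmdv_demod roughly 5 ms per 20 ms interval. */
static double freedv_load_synced(void)
{
    const double decode_ms = 17.3;
    const double demod_ms = 5.0;
    const double window_ms = 40.0;
    const double demod_calls = window_ms / 20.0;  /* two per window */
    return cpu_load(decode_ms + demod_calls * demod_ms, window_ms);
}
```
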
>>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>>
>>
>> _______________________________________________
>> Freetel-codec2 mailing list
>> Freetel-codec2@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/freetel-codec2
