Thanks Danilo,

That's an interesting explanation of why kiss_fft performs just as well
as the optimised ARM FFT on the M4. I also found no performance
improvement when I tried the optimised ARM FFT a few years ago.

I'm inclined to keep malloc/free in Codec 2. Happy to look at patches
for alternate memory allocators if it's an itch anyone really wants to
scratch.

Danilo - I will get back to you on your other points and proposal
shortly.

To the list in general - I'd like to publicly thank Danilo and the mcHF
team for the fine patches they have submitted over the last few weeks.
The FreeDV 1600 decoder is now around 40% faster on the STM32F4.

Importantly - his suggestions have been backed by quality patches I can
easily apply and test. He has shortened my TODO list - not made it
longer. That is very important for me and a fine example to anyone else
who would like to contribute to Codec 2.

Thanks Danilo!

- David

On 18/09/16 10:10, Danilo Beuche wrote:
> Hi,
>
> I'd like to share a few thoughts/ideas as well:
>
> Since we have removed most of the easy-to-remove performance hotspots,
> with the exception of the kiss_fft calls, which now contribute a major
> part of the overall runtime of modem and decoder, I played around with
> kiss_fft vs. the ARM DSP FFT on the STM32F4 with some of the newer
> libs.
>
> It turns out that, at least for the real FFT (kiss_fftr vs.
> arm_rfft_fast_f32), there is no time difference when used in our mcHF
> code. How does this relate to Glen's measurements (he measured better
> performance)? I believe this is because the ARM lib stores some of its
> data in precomputed flash arrays. Access to flash is slow (5 wait
> states), so this reduces performance. kiss_fft has all its data in
> RAM. Since Glen initialized the ARM DSP tables in RAM, he got speed
> gains at the expense of RAM. On an STM32F746, RAM is not as much of an
> issue (384K) as it is on the STM32F4 (the default MCU in the mcHF
> project has 192K RAM, and we have to fit the RAM needs of the full SDR
> firmware into this space). Speed is traded for RAM use reduction.
> Since we have reached our goal timewise, memory reduction is now more
> in focus for us (but it should not get slower, of course).
>
> Because of that I would like to propose the following approach to keep
> the code easily readable while providing efficient solutions for ARM
> MCUs with both little and more RAM:
>
> 1. We create an abstract interface for running FFTs in the codec2
> sources. Initially this should closely resemble the existing kiss_fft
> calls, which makes introducing the interface easy, since we can use
> all the existing tests and verify that introducing the interface does
> not change a bit in the output.
>
> 2. Once validated, we can introduce/activate the use of the ARM DSP
> FFT with some glue code to map between the abstract interface and the
> ARM DSP interface. Here again we have to validate that everything is
> working nicely, but I assume we will see some slight differences.
> However, with it we can produce reference data for step 3.
>
> 3. Now we modify the existing code so that we can benefit from some
> nice properties of the ARM DSP FFT (in-place FFT), which will reduce
> RAM usage significantly (in relation to 192K RAM).
>
> 4. We enable optional use of RAM instead of flash in the ARM code, so
> that depending on the amount of available memory you can get some
> extra boost.
>
> For that to work nicely, we have to fix some issues in the existing
> code first, so here comes
>
> 0. As Glen pointed out, some of the #define constants have not-so-good
> names. M in particular (defined for 2 different purposes in defines.h
> and fdmdv_internal.h) is nasty, and so is N in defines.h (there are
> some local variables named N, and the STM headers also get confused by
> it). So we need to change these to something unambiguous. I think Glen
> has already suggested names for them.
>
> I would also like to point out that the use of dynamic memory
> allocation (malloc/free) is necessary in our mcHF case, so I would
> like to keep this more or less as it is. The mcHF needs the ability to
> reuse the memory for other operational modes when FreeDV is not
> active. That does not mean I am against removing the internal use of
> malloc, but then it should be possible to easily create the required
> data structures "outside" the code using malloc. I.e. the use of
> static data structures for anything but const data is a no-go.
>
> To support that discussion I created a draft suggestion for the
> interface (attached to this mail). Right now it is defined using
> inline code for the sake of simplicity. This may change later; I don't
> think there should be any issue with that. It essentially contains 3
> functions each for complex FFT and real FFT (alloc, fft, free) and the
> necessary data structures.
>
> Danilo
>
>
> On 16.09.2016 at 20:59, Dana Myers wrote:
>>
>>> Subject: Re: [Freetel-codec2] more benching and thoughts
>>> Date: Fri, 16 Sep 2016 12:13:05 +1000
>>> From: glen english <g...@cortexrf.com.au>
>>> Reply-To: freetel-codec2@lists.sourceforge.net
>>> To: freetel-codec2@lists.sourceforge.net
>>>
>>> Hi Danilo
>>> Yeah, I guess being a very bare metal programmer from the old 128
>>> byte RAM days, I dislike mallocs in embedded code on principle.
>>
>> I'm similar, though now that we have 32kB, 64kB (or even more) RAM
>> in embedded chips, they're basically like the systems that malloc()
>> was initially built on :-)
>>
>> I still don't trust C++ heap allocators in embedded applications,
>> though.
>>
>>> However, because the heap usage would be deterministic, it should be
>>> fairly safe.
>>
>> I have a similar project: a 1200 baud modem + TNC stack built in a
>> PSoC 5LP. I use the Delta-Sigma ADC at 9600 samples/s, 16 bits, and
>> the PSoC Digital Filter Block to do the bandpass heavy lifting and
>> frequency response correction for the ADC.
>> I use CMSIS-DSP for the rest of the DSP crunching required, and,
>> wait for it, wrap the whole thing in FreeRTOS (9.0.0 now). I use
>> q31_t for all the DSP; as long as I'm careful to avoid blowing out
>> past the +/- 1.0 range, it's probably every bit as good as
>> single-precision floating point. The PSoC 5LP has a Cortex-M3, and
>> I'm running it at 80 MHz.
>>
>> My dynamic buffer implementation uses ...
>>
>>> ********************************************************************
>>> Take a look at the memory management routine heap_2.c in FreeRTOS
>>> (in fact, there are heap_1.c to heap_5.c, a few options... try
>>> heap_4.c, also). This is a much smarter memory alloc and dealloc
>>> routine that is fairly cheap, and much better than the usual
>>> brain-dead malloc.
>>> ********************************************************************
>>> I'd recommend using that. It looks for blocks of the same size,
>>> existing used ones, etc.
>>
>> heap_4.c and, while I've never explicitly profiled it, I've never had
>> a reason to suspect the allocator is misbehaving. I commit 32KB to
>> the heap and currently never use more than about 3KB.
>>
>>> I would expect the same improvements on the F4 as the F7 using the
>>> CMSIS library. The F7 is much faster on that sort of code.
>>>
>>> I only got rid of the FFT malloc stuff; the huge stack additions are
>>> still in there, and you could save 50% there ...
>>
>> Without knowing the details of this application (I'm new here), I am
>> quite impressed with the quality of CMSIS-DSP, particularly in terms
>> of exploiting the ARM extensions.
>>
>> Cheers,
>> Dana
>>
>>> -glen
>>>
>>>
>>> On 16/09/2016 12:06 PM, Danilo Beuche wrote:
>>> > Hi Glen,
>>> >
>>> > nice, it would be interesting to see how much the STM32F4 gains by
>>> > use of the CMSIS FFT routines.
>>> >
>>> > BTW, I am not sure, but I think you mentioned the removal of
>>> > malloc as one of your changes.
>>> > For us with the mcHF it would not be good to have the memory for
>>> > the FreeDV code statically allocated, since FreeDV is just one
>>> > operation mode of the mcHF and we need the memory at other times
>>> > for other stuff, especially since it really eats a lot of memory
>>> > (in relation to the STM32F4 RAM sizes). Even half of it is still a
>>> > lot.
>>> >
>>> > Looking forward to gaining some more free cycles with your work.
>>> >
>>> > Regards,
>>> > Danilo
>>> >
>>> >
>>> > On 16.09.2016 at 03:53, glen english wrote:
>>> >> Hi Danilo
>>> >>
>>> >> yeah, you have plenty in hand.
>>> >>
>>> >> OK so M7 and CMSIS FFT, about 2 x speed (same clock): 7.74 ms
>>> >> (1200 bps) for decode.
>>> >>
>>> >> On 16/09/2016 11:49 AM, Danilo Beuche wrote:
>>> >>> Hi,
>>> >>>
>>> >>> regarding the times on the mcHF (STM32F4, 168 MHz), some
>>> >>> clarifications: we measured 17.3 ms per 40 ms interval for the
>>> >>> voice decode part only (this only happens once the modem is
>>> >>> synced) and roughly 5 ms of fdmdv_demod per 20 ms interval
>>> >>> (happens all the time). This gives us some 27 ms per 40 ms in
>>> >>> total once synced, which is about 68% load.
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Freetel-codec2 mailing list
>> Freetel-codec2@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/freetel-codec2