Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS
Hi Glen,

I just verified: the cycle counts with the true 16 MHz XO are within 1% of those measured with the 8 MHz XO (but measurements now take half as long :-) ).

Danilo

--
___
Freetel-codec2 mailing list
Freetel-codec2@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freetel-codec2
Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS
Hi

OK. Well, anyway, decode 1200 (40 ms frame) takes 12.34 ms on my kit with the ARM FFT, and 19.86 ms using kiss-fft.
I think you estimated about 14.4 ms for a decode 1300 on your kit,

so I will be interested to see what you come up with using cfft.

My Codec 2 codebase is from August 2015.

cheers
Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS
Hi Glen,

I would not worry too much:

- Maybe gcc 5.4 vs 4.9: the difference is about -20% (depending on which end you are looking from). It is a lot, but not inexplicable.

- Maybe it is my test data. I don't know how much the kiss_fft run time varies when different data is presented. I am running "artificially" generated audio input (digitally captured codec2 frames from a single 750 Hz sine wave, also generated digitally).

- Maybe it is my strange way of running the mcHF firmware: the mcHF hardware has a 16 MHz XO, but the Discovery board which I have here for testing has an 8 MHz XO. I didn't bother to reconfigure the PLL, so everything takes twice the time. If the flash were asynchronously coupled, which I doubt (otherwise there would be no need for explicit wait state settings), it would have an influence. But here I am quite sure this is not the case. If the caches are asynchronous: maybe. Maybe I should remeasure with a fixed PLL setup so that the processor runs at a true 168 MHz. Will do that later and get back with updated numbers.

Danilo
Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS
Using environment: Rowley CrossStudio for ARM 3.6.4, GCC 4.9

using cycle counter (yes)

interrupt overhead: irrelevant, most likely, in my setup; the (asm) IRQs only set flags...

For kissfft, 5WS, F4, I wonder why you have 112500 cycles and I have 141000. Something for me to look at.

hmm

-O2, but I also have a bunch of debug symbol stuff in there. Dunno, I think it is only symbol data at DB2 which pushes up the image size.
Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS
Hi Glen,

what and how are you measuring? I would like to repeat the measurements in my environment.

We use the cycle counter of the Cortex processor (it seems you do the same). Interrupt overhead is removed (the counter is stopped during the major interrupt, which is the audio IRQ). I have some prerecorded "frames" stored in flash which are fed into the freedv_comprx routine. Measurements are taken every 50 frames, since initially the decoder needs to lock onto the data. After 50 frames I get stable measurements.

I did some hotspot analysis, and a single 512-point kiss_fft call was taking around 670 us (* 168 MHz) = 112500 cycles. So this is in line with your numbers.

My results for kiss_fft vs arm_fft may not be correct, since in fact I was running kiss_fftr vs. arm_rfft_fast_f32. Maybe there is a huge penalty for the arm_rfft_fast_f32 code. Let me try the kiss_fft vs. arm_cfft case to confirm your measurements.

Danilo
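For readers who want to repeat this kind of measurement, the following is a minimal sketch of reading the Cortex-M DWT cycle counter. The register addresses are the architectural ones for Cortex-M3/M4/M7; the function names and the `BENCH_ON_CORTEX_M` switch are assumptions made for illustration and are not taken from the mcHF or Codec 2 sources. Without that switch, `clock()` stands in so the code also builds on a PC.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Define BENCH_ON_CORTEX_M when building for a Cortex-M3/M4/M7 target.
 * Without it, clock() stands in so the sketch also builds on a PC. */
#ifdef BENCH_ON_CORTEX_M
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu) /* CoreDebug->DEMCR */
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u) /* DWT->CTRL */
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u) /* DWT->CYCCNT */

void cyc_init(void)
{
    DEMCR |= (1u << 24);   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;
    DWT_CTRL |= 1u;        /* CYCCNTENA: start counting core cycles */
}

uint32_t cyc_now(void) { return DWT_CYCCNT; }
#else
void cyc_init(void) { }
uint32_t cyc_now(void) { return (uint32_t)clock(); } /* host stand-in, not cycles */
#endif

/* Unsigned subtraction stays correct across one 32-bit counter wrap. */
uint32_t cyc_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a given core clock in Hz. */
double cyc_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0u);
    return (double)cycles * 1e6 / (double)core_hz;
}
```

A measurement would read `cyc_now()` before and after the call under test and pass both values to `cyc_elapsed()`. At 168 MHz the 32-bit counter wraps roughly every 25 seconds, so single-frame timings are safe. Pausing the counter inside the audio IRQ, as described above, is not shown here.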
[Freetel-codec2] M4 runs, summary :
-O2, debug level 2
1200 (40 ms frame), 5 wait states, STM32F405RGT6

kissfft
  encode : 2780314 cycles : 16.5 ms
  decode : 3335953 cycles : 19.85 ms
  kissFFT512cpxfloat : 141,032 cycles

ARMfft
  encode : 1579922 cycles : 9.4 ms
  decode : 2073979 cycles : 12.34 ms
  ARMfft : 50,597 cycles

So, I still see more than 2:1 on cycle count for the FFT in favour of the ARM FFT.

Can one of you guys get a cycle count on kiss-fft???

And the Cortex-M7 is about 2x the speed on that code, for the SAME clock.
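The figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below assumes the 168 MHz core clock mentioned elsewhere in the thread; the function names are made up for illustration. It converts cycle counts to milliseconds and forms the kiss-fft to ARM FFT ratios.

```c
#include <assert.h>
#include <stdio.h>

#define CORE_HZ 168000000.0   /* assumed core clock of the STM32F405 runs */

/* Convert a cycle count to milliseconds at CORE_HZ. */
double cycles_to_ms(double cycles)
{
    assert(cycles >= 0.0);
    return cycles * 1000.0 / CORE_HZ;
}

/* How many times more cycles the slow variant needs than the fast one. */
double ratio(double slow_cycles, double fast_cycles)
{
    assert(fast_cycles > 0.0);
    return slow_cycles / fast_cycles;
}

/* Print the decode times and ratios for the posted cycle counts. */
void print_summary(void)
{
    printf("decode, kissfft  : %.3f ms\n", cycles_to_ms(3335953.0));
    printf("decode, ARM fft  : %.3f ms\n", cycles_to_ms(2073979.0));
    printf("decode ratio     : %.2f\n", ratio(3335953.0, 2073979.0));
    printf("512-pt FFT ratio : %.2f\n", ratio(141032.0, 50597.0));
}
```

Worked through by hand, the FFT alone comes out at roughly 2.8:1 and the whole 1200 decode at roughly 1.6:1, which agrees with "more than 2:1" for the FFT. The posted millisecond values appear to be truncated rather than rounded.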
Re: [Freetel-codec2] F4 runs.....
Hi Glen,

difficult to say: I have data for freedv_comprx() for FreeDV 1600 (which translates to decode 1300, because some bits are used for data transmission), which does a modem decode every 20 ms and a voice decode every 40 ms. On average I have 12.2 ms per call, so 24.4 ms per 40 ms frame; 2 x 5 ms of that is the fdmdv_demod part, so decode 1300 is 14.4 ms.

I am using gcc 5.4, mostly -O2 (some files have -O3 enabled, but none of the codec2 files), and I run more or less the newest SVN code (codec2-dev 2875).

If you run somewhat older code, the numbers you gave match the performance of a few days ago. As David mentioned, we gained about 40% since r2842 (or so). So they could be right.

Danilo
[Freetel-codec2] arm fft F4 runs 6WS and 5WS
arm fft, STM32F405RGT6, 6WS
-O2, debug level 2.

1200 (40 ms frame)
  encode : 1745799 cycles : 10.39 ms
  decode : 2292497 cycles : 13.64 ms

so, 2x speed of my other runs

let's run 5WS

1200 (40 ms frame)
  encode : 1579922 cycles : 9.4 ms
  decode : 2073979 cycles : 12.34 ms

WOW, the wait states hurt on the M4
[Freetel-codec2] F4 runs.....
kiss fft per standard codec2 code...

-O2, debug level 2.

3200 (20 ms frame)
  encode : 1388964 cycles : 8.26 ms
  decode : 1807440 cycles : 10.75 ms

1200 (40 ms frame)
  encode : 3006497 cycles : 17.89 ms
  decode : 3669321 cycles : 21.8 ms

hmm, seems pretty slow
I wonder what I am doing wrong?
Or is that in line with others' measurements?

kissFFT512 : 155,483 cycles (versus 70,000 on the M7)

Next... arm asm on F4 (STM32F405RGT6)
Re: [Freetel-codec2] more benching and thoughts
Hi Glen,

On 18.09.2016 at 05:22, glen english wrote:
> Hi Danilo
>
> Yes, there is most-certainly a penalty for const access from flash, at
> least on the M4.
>
> and of course instruction cache is no use, is just that, only for
> instructions

Hence the name :-)

> I wonder what the bus matrix penalty is for data fetches from flash.
> There is an app-note about it somewhere I once read. It depends on what
> else is going on. The processor is pretty smart at interleaving the
> accesses so as not to stall the pipeline or bus matrix.
>
> As Danilo I am sure you know : (pointed out for others)
> You can force variables into sections (like forcing a static const ) by
> using
>
> __attribute__((section("name"))) to assign say into data.

Yes, indeed. We use that extensively in the mcHF to move certain parts to the CCM memory, which is otherwise not so easily accessible (and should be used with care, as it does not support DMA to/from peripherals). Works great.

> I will some time run up the code on an M4 and see what kiss-fft does.
>
> I am very very very surprised , and do not really believe that kissFFT
> is as fast as the arm assembler on the M4 - my immediate thoughts are
> "you are doing it wrong". so, I will investigate.

I would be happy if I am wrong. That would mean we get an even faster FFT on the M4 almost for free (minus investigation time, that is). No problem at all with me :-)

Danilo
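As an illustration of the section attribute discussed above, here is a minimal sketch that places a work buffer in a named section. The section name `.ccmram`, the macro `IN_CCM` and the buffer are hypothetical; the name has to match whatever output section the project's linker script defines for CCM. The ELF guard only keeps the sketch buildable with other toolchains.

```c
#include <assert.h>
#include <stddef.h>

/* ".ccmram" is a hypothetical section name: it must match an output section
 * in the project's linker script. On non-ELF toolchains the macro expands
 * to nothing and the buffer lands in the default data area. */
#if defined(__GNUC__) && defined(__ELF__)
#define IN_CCM __attribute__((section(".ccmram")))
#else
#define IN_CCM
#endif

#define FFT_SCRATCH_LEN 512

/* Work buffer placed in the named section instead of the default .bss.
 * CPU-only data is the right fit here, since CCM is not reachable by DMA. */
IN_CCM float fft_scratch[FFT_SCRATCH_LEN];

/* Write one element, checking the index first. */
void fft_scratch_set(size_t i, float v)
{
    assert(i < FFT_SCRATCH_LEN);
    fft_scratch[i] = v;
}

/* Zero the whole buffer and return the number of elements cleared. */
size_t fft_scratch_clear(void)
{
    size_t i;
    for (i = 0; i < FFT_SCRATCH_LEN; i++) {
        fft_scratch[i] = 0.0f;
    }
    return i;
}
```

On an embedded target, whether such a section is zeroed or initialised at startup depends on the linker script and startup code, so clearing the buffer explicitly before first use, as `fft_scratch_clear()` does, is the cautious choice.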
[Freetel-codec2] memory management options
David

Have a look at http://www.freertos.org/a00111.html

I use heap_2.c for most things... it works on the premise of repeated requests and returns of things of the same size.
heap_4 is good, also.
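One way such an allocator could be swapped in without touching the call sites is a pair of hook macros. The sketch below is only an illustration: `C2_MALLOC`, `C2_FREE`, `USE_FREERTOS_HEAP` and the `c2_workspace` type are hypothetical names, not part of Codec 2. `pvPortMalloc()` and `vPortFree()` are the FreeRTOS heap functions; on a PC the macros fall back to `malloc`/`free`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C2_MALLOC / C2_FREE are hypothetical hook names for illustration. */
#ifdef USE_FREERTOS_HEAP
#include "FreeRTOS.h"              /* declares pvPortMalloc() and vPortFree() */
#define C2_MALLOC(n) pvPortMalloc(n)
#define C2_FREE(p)   vPortFree(p)
#else
#include <stdlib.h>
#define C2_MALLOC(n) malloc(n)
#define C2_FREE(p)   free(p)
#endif

/* Hypothetical work area that is created when a mode becomes active and
 * released again when the firmware switches to another mode. */
typedef struct {
    float  *samples;
    size_t  n_samples;
} c2_workspace;

/* Allocate and zero a work area; returns NULL if the heap is exhausted. */
c2_workspace *c2_workspace_create(size_t n_samples)
{
    assert(n_samples > 0);
    c2_workspace *w = C2_MALLOC(sizeof *w);
    if (w == NULL) return NULL;
    w->samples = C2_MALLOC(n_samples * sizeof *w->samples);
    if (w->samples == NULL) { C2_FREE(w); return NULL; }
    memset(w->samples, 0, n_samples * sizeof *w->samples);
    w->n_samples = n_samples;
    return w;
}

/* Release a work area; passing NULL is allowed. */
void c2_workspace_destroy(c2_workspace *w)
{
    if (w == NULL) return;
    C2_FREE(w->samples);
    C2_FREE(w);
}
```

This create/destroy pattern with fixed sizes suits heap_2, which does not merge adjacent free blocks: blocks of the same size are simply reused. heap_4 does merge free blocks, so it copes better when request sizes vary.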
Re: [Freetel-codec2] more benching and thoughts
Hi Danilo

Yes, there is most certainly a penalty for const access from flash, at least on the M4.

And of course the instruction cache is no use here; it is just that, only for instructions.

I wonder what the bus matrix penalty is for data fetches from flash. There is an app note about it somewhere that I once read. It depends on what else is going on. The processor is pretty smart at interleaving the accesses so as not to stall the pipeline or bus matrix.

As Danilo, I am sure, knows (pointed out for others):
you can force variables into sections (like forcing a static const) by using

__attribute__((section("name")))

to assign them, say, into data.

I will some time run up the code on an M4 and see what kiss-fft does.

I am very very very surprised, and do not really believe that kissFFT is as fast as the ARM assembler on the M4 - my immediate thought is "you are doing it wrong". So, I will investigate.

g
Re: [Freetel-codec2] more benching and thoughts
Thanks Danilo,

That's an interesting explanation of why kiss_fft performs just as well as the optimised ARM FFT on the M4. I also found no performance improvement when I tried the optimised ARM FFT a few years ago.

I'm inclined to keep malloc/free in Codec 2. Happy to look at patches for alternate memory allocators if it's an itch anyone really wants to scratch.

Danilo - I will get back to you on your other points and proposal shortly.

To the list in general - I'd like to publicly thank Danilo and the mcHF team for the fine patches he/they have submitted over the last few weeks. The FreeDV 1600 decoder is now around 40% faster on the STM32F4.

Importantly - his suggestions have been backed by quality patches I can easily apply and test. He has shortened my TODO list - not made it longer. Very important for me and a fine example to anyone else who would like to contribute to Codec 2.

Thanks Danilo!

- David

On 18/09/16 10:10, Danilo Beuche wrote:
> Hi,
>
> I'd like to share a few thoughts/ideas as well:
>
> Since we have removed most of the easy-to-remove performance hotspots, with the exception of the kiss_fft calls, which now contribute a major part of the overall runtime of modem and decoder, I played around with kiss_fft vs. the ARM DSP FFT on the STM32F4 with some of the newer libs.
>
> It turns out that, at least for the real FFT (kiss_fftr vs. arm_rfft_fast_f32), there is no time difference when used in our mcHF code. How does this relate to Glen's measurements (he measured better performance)? I believe this is due to the fact that the ARM lib stores some of its data in precomputed flash arrays. Access to flash is slow (5 wait states), so this reduces the performance. kiss has all its data in RAM. Since Glen initialized the ARM DSP tables in RAM, he got speed gains at the expense of RAM. On an STM32F746, RAM is not as much of an issue (384K) as it is on the STM32F4 (the default MCU in the mcHF project has 192K RAM, and we have to fit the full SDR firmware's RAM needs into this space). Speed is traded for reduced RAM use. Since we have reached our goal time-wise, memory reduction is now more in focus for us (but it should not get slower, of course).
>
> Because of that I would like to propose the following approach to keep the code easily readable while providing efficient solutions for ARM MCUs with both little and more RAM:
>
> 1. We create an abstract interface for running the FFT in the codec2 sources. Initially this should closely resemble the existing kiss_fft calls, which makes introducing the interface easy, since we can use all the existing tests and verify that introducing the interface does not change a bit in the output.
>
> 2. Once validated, we can introduce/activate the use of the ARM DSP FFT with some glue code to map between the abstract interface and the ARM DSP interface. Here again we have to validate that everything is working nicely, but we will see some slight differences, I assume. However, with it we can produce reference data for step 3.
>
> 3. Now we modify the existing code so that we can benefit from some nice properties of the ARM DSP FFT (in-place FFT), which will reduce RAM usage significantly (in relation to 192K RAM).
>
> 4. We enable optional use of RAM instead of flash in the ARM code, so that depending on the amount of available memory you can get some extra boost.
>
> For that to work nicely, we have to fix some issues in the existing code first, so here comes
>
> 0. As Glen pointed out, some of the #define constants do not have very good names. Especially M (defined for 2 different purposes in defines.h and fdmdv_internal.h) is nasty, and also N in defines.h (there are some local variables named N, and the STM headers also get confused by it). So we need to change these to something unambiguous. I think Glen already suggested names for them.
>
> And I would like to point out that the use of dynamic memory allocation (malloc/free) is necessary in our mcHF case, so I would like to keep this more or less as it is. The mcHF needs the ability to reuse the memory for other operational modes if FreeDV is not active. This does not mean I am against removing the internal use of malloc, but then it should be possible to easily create the required data structures "outside" the code using malloc. I.e. the use of static data structures for anything but const data is a no-go.
>
> To support that discussion I created a draft suggestion for the interface (attached to this mail). It is right now defined using inline code for the sake of simplicity. This may change later; I don't think there should be any issue with that. It essentially contains 3 functions each for complex FFT and real FFT (alloc, fft, free) and the necessary data structures.
>
> Danilo
>
> On 16.09.2016 at 20:59, Dana Myers wrote:
>>
>>> Subject:Re: [Freetel-codec2] more benching and though
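The attachment with Danilo's draft interface is not part of this archive. As a rough idea of what an alloc/fft/free wrapper for the complex case could look like, here is a minimal sketch. All names (`c2_cpx`, `c2_fft_cfg`, `c2_fft_alloc`, `c2_fft`, `c2_fft_free`) are hypothetical and are not taken from the draft or from Codec 2. The backend is a direct O(N^2) DFT standing in for kiss_fft or the ARM DSP library, and the small Taylor-series helper exists only so the sketch has no libm dependency.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { float r, i; } c2_cpx;   /* hypothetical {re, im} pair */

typedef struct {
    int     nfft;
    c2_cpx *twiddle;   /* twiddle[k] = exp(-/+ j*2*pi*k/nfft), held in RAM */
} c2_fft_cfg;

/* Taylor-series sin/cos for |x| <= pi; only here to avoid needing libm. */
static void sincos_small(double x, double *s, double *c)
{
    double term = 1.0;               /* x^n / n! */
    *s = 0.0;
    *c = 0.0;
    for (int n = 0; n < 30; n++) {
        switch (n & 3) {
        case 0:  *c += term; break;
        case 1:  *s += term; break;
        case 2:  *c -= term; break;
        default: *s -= term; break;
        }
        term *= x / (double)(n + 1);
    }
}

/* Allocate a config and its twiddle table; returns NULL on failure. */
c2_fft_cfg *c2_fft_alloc(int nfft, int inverse)
{
    const double pi = 3.14159265358979323846;
    assert(nfft > 0);
    c2_fft_cfg *cfg = malloc(sizeof *cfg);
    if (cfg == NULL) return NULL;
    cfg->twiddle = malloc((size_t)nfft * sizeof *cfg->twiddle);
    if (cfg->twiddle == NULL) { free(cfg); return NULL; }
    cfg->nfft = nfft;
    for (int k = 0; k < nfft; k++) {
        double phase = (inverse ? 2.0 : -2.0) * pi * (double)k / (double)nfft;
        if (phase > pi)  phase -= 2.0 * pi;   /* fold into [-pi, pi] */
        if (phase < -pi) phase += 2.0 * pi;
        double s, c;
        sincos_small(phase, &s, &c);
        cfg->twiddle[k].r = (float)c;
        cfg->twiddle[k].i = (float)s;
    }
    return cfg;
}

/* Placeholder backend: direct DFT, out-of-place, unscaled. A real port
 * would forward to kiss_fft or the ARM DSP library here instead. */
void c2_fft(const c2_fft_cfg *cfg, const c2_cpx *in, c2_cpx *out)
{
    for (int k = 0; k < cfg->nfft; k++) {
        float sr = 0.0f, si = 0.0f;
        for (int n = 0; n < cfg->nfft; n++) {
            const c2_cpx w = cfg->twiddle[(k * n) % cfg->nfft];
            sr += in[n].r * w.r - in[n].i * w.i;
            si += in[n].r * w.i + in[n].i * w.r;
        }
        out[k].r = sr;
        out[k].i = si;
    }
}

/* Release a config; passing NULL is allowed. */
void c2_fft_free(c2_fft_cfg *cfg)
{
    if (cfg == NULL) return;
    free(cfg->twiddle);
    free(cfg);
}
```

Keeping the signature close to the existing kiss_fft calls (config, input, output) is what would let step 1 be verified bit-for-bit against the current tests. Because the config is heap-allocated, the memory can be handed back when FreeDV is not active, which matches the malloc/free requirement stated in the quoted message.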
Re: [Freetel-codec2] more benching and thoughts
Hi Glen,

just checked: it seems to me that we have all the caches running in the mcHF:

https://github.com/df8oe/mchf-github/blob/670f94a2e69a55a03f099ad25390925c84c09201/mchf-eclipse/cmsis_boot/system_stm32f4xx.c#L424

So even with these caches/buffers enabled, the M4 loses performance on data reads from flash. I haven't checked the manual/internet, but maybe the flash caching works well for code but not in the same way for data.

Danilo
[Freetel-codec2] array subscript is above array bounds
Not a very useful posting, but I found an old-style C bug in my encoder translation. Fixing it reduced my bad data by quite a bit, but there are still some persistent errors.

Anyway:

    #include <stdio.h>

    static int snr[4];

    int main() {
        int big = 40;

        snr[big] = 7;   /* bug: index 40 is far outside snr[0..3] */
        fprintf(stderr, "snr[%d] = %d\n", big, snr[big]);
        return 0;
    }

If you run this it will print:

    snr[40] = 7

Ha, anyway, I haven't made this error in years. So I tried:

    -mpx -fcheck-pointer-bounds

as added switches on GCC and it did flag the warning.

One major bug down, one left... argh...
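A defensive alternative to relying on compiler switches is to route writes through a small checked accessor. The sketch below is only an illustration of that idea; `snr_set`, `snr_get` and `ARRAY_LEN` are made-up names and not from the poster's encoder.

```c
#include <assert.h>
#include <stdio.h>

#define SNR_LEN 4
static int snr[SNR_LEN];

/* Number of elements of a true array (does not work on pointers). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Store v at snr[idx]; returns 0 on success, -1 if idx is out of range. */
int snr_set(size_t idx, int v)
{
    if (idx >= ARRAY_LEN(snr)) {
        fprintf(stderr, "snr index %zu out of range\n", idx);
        return -1;
    }
    snr[idx] = v;
    return 0;
}

/* Read snr[idx]; an out-of-range index trips the assert in debug builds. */
int snr_get(size_t idx)
{
    assert(idx < ARRAY_LEN(snr));
    return snr[idx];
}
```

With this, the write to index 40 from the example above is rejected instead of silently landing in unrelated memory. For hosted test builds, the GCC and Clang sanitizer options `-fsanitize=address` and `-fsanitize=bounds` catch the same class of error at run time; MPX support, by contrast, was removed again in later GCC releases.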
Re: [Freetel-codec2] more benching and thoughts
Hi Glen,

On 18.09.2016 03:12, glen english wrote:
> On the fft:
>
> I am not surprised that the ARM lib hand optimized assembler is that
> much faster.
> more than 2x faster in fact.

As I said, on the M4 kiss fft and ARM DSP are on par; there is not much difference. And using RAM vs. flash makes a lot of difference on the mcHF M4 for code which uses tables heavily. We gained a lot from removing the need to go to flash twice in fir_filter vs. fir_filter2. Maybe the mcHF startup configuration is not enabling all the caches. Will check that.

> The question is how much optimization is enough. I am tempted NOT to
> optimize any more, although I feel (just by looking at it ) I can get
> another 2x out of it. Why- well there is no real pressing need.
> Going too far away from the reference code will island the code a bit.
> However, if you run out of modem cycles/ modem ram, then we can probably
> get a bit more...
>
> cheers

Yes, unfortunately we have an M4 at hand, so a little more RAM would be nice :-(

Regards,
Danilo
Re: [Freetel-codec2] more benching and thoughts
Hi Danilo

Good thoughts and points.

While on the RAM subject: required reading for every serious programmer is "Memory":
https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich
Read all 7 parts (100 pages), but if you only have an hour, just read "part 2 - cache".

On the FFT: I am not surprised that the ARM lib's hand-optimized assembler is that much faster, more than 2x faster in fact. I don't think kiss-fft is particularly suitable for this sort of platform either; I'll hold back what I really think :-) .

The 5WS on flash (actually 6WS, as I am running @ 168M) does not really affect the performance too much. In fact I can vary the WS count +/- 2 without much change: the ART, the prefetch and the instruction and data caches are doing their job, so there is very little difference with the const values in RAM or cache.

In fact, most FFT implementations are very tough on a machine with cache. Have you read the paper on how FFTW works? It is very cache aware and adaptive to the architecture; that is why it does trial runs and picks the best.

The M7 is very impressive. It is certainly impressive work by ARM. However, the M4 is what all of you have to work with, so we can stay focussed on that.

I think the RAM usage will also be significantly less with the ARM FFT because of the re-entrant kiss-fft behaviour.

The M4 is quite a different beast, and having no D-cache can improve performance relative to the M7 for some (inaptly) written applications (not this one, but as a generalization for applications grabbing a byte from memory randomly and all over a large dataset).

Large matrix operations are where cache machines fall over, that is, once the dataset is bigger than the cache.

The question is how much optimization is enough. I am tempted NOT to optimize any more, although I feel (just by looking at it) I can get another 2x out of it. Why? Well, there is no real pressing need. Going too far away from the reference code will island the code a bit. However, if you run out of modem cycles / modem RAM, then we can probably get a bit more...

cheers
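
The access-pattern remark above (an application "grabbing a byte from memory randomly and all over a large dataset") can be illustrated with a small host-side sketch. The function names are made up for this illustration and nothing here comes from the codec2 or mcHF sources. Both routines read every byte exactly once and return the same sum; only the order of the reads differs. On a core with a data cache, the strided walk would be expected to miss far more often once the buffer exceeds the cache size, while on a part without a D-cache the gap should be much smaller. The sketch carries no timing figures because none were measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk the buffer in address order: consecutive reads share cache lines. */
uint32_t sum_sequential(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Visit every byte exactly once, but jump `stride` bytes between reads,
 * so neighbouring reads rarely land in the same cache line. */
uint32_t sum_strided(const uint8_t *buf, size_t len, size_t stride)
{
    assert(stride > 0);
    uint32_t sum = 0;
    for (size_t start = 0; start < stride && start < len; start++)
        for (size_t i = start; i < len; i += stride)
            sum += buf[i];
    return sum;
}
```

Wrapping each call in a cycle-counter read on the target would show how much the read order matters on a given part.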
Re: [Freetel-codec2] more benching and thoughts
Hi,

I'd like to share a few thoughts/ideas as well:

Since we finished and removed most of the easy-to-remove performance hotspots, with the exception of the kiss_fft calls, which now contribute a major part of the overall runtime of modem and decoder, I played around with kiss_fft vs. the ARM DSP FFT on the STM32F4 with some of the newer libs. It turns out that, at least for the real FFT (kiss_fftr vs. arm_rfft_fast_f32), there is no time difference when used in our mcHF code.

How does this relate to Glen's measurements (he measured better performance)? I believe this is due to the fact that the ARM lib stores some of its data in precomputed flash arrays. Access to flash is slow (5 wait states), so this reduces the performance. kiss has all its data in RAM. Since Glen initialized the ARM DSP tables in RAM, he got speed gains at the expense of RAM. On an STM32F746, RAM is not as much of an issue (384K) as it is on the STM32F4 (the default MCU in the mcHF project has 192K RAM, and we have to fit the full SDR firmware's RAM needs into this space). Speed is traded for RAM use reduction.

Since we have reached our goal timewise, memory reduction is now more in focus for us (but it should not get slower, of course).

Because of that I would like to propose the following approach to keep the code easily readable while providing efficient solutions for ARM MCUs with both little and more RAM:

1. We create an abstract interface for running the FFT in the codec2 sources. Initially this should closely resemble the existing kiss_fft calls, which makes introducing the interface easy, since we may use all the existing tests and can verify that introducing the interface does not change a bit in the output.

2. Once validated, we can introduce/activate the use of the ARM DSP FFT with some glue code to map between the abstract interface and the ARM DSP interface. Here again we have to validate that everything is working nicely, but I assume we will see some slight differences. However, with it we can produce reference data for step 3.

3. Now we modify the existing code so that we can benefit from some nice properties of the ARM DSP FFT (in-place FFT), which will reduce RAM usage significantly (in relation to 192K RAM).

4. We enable optional use of RAM instead of flash in the ARM code, so that, depending on the amount of available memory, you can get some extra boost.

For that to work nicely, we have to fix some issues in the existing code first, so here comes:

0. As Glen pointed out, some of the #define constants do not have good names. Especially M (defined for 2 different purposes in defines.h and fdmdv_internal.h) is nasty, and so is N in defines.h (there are some local variables named N, and the STM headers also get confused by it). So we need to change these to something unambiguous. I think Glen already suggested names for them.

And I would like to point out that the use of dynamic memory allocation (malloc/free) is necessary in our mcHF case, so I would like to keep this more or less as it is. The mcHF needs the ability to reuse the memory for other operational modes if FreeDV is not active. This does not mean I am against removing the internal use of malloc, but then it should be possible to easily create the required data structures "outside" the code using malloc. I.e. the use of static data structures for anything but const data is a no-go.

To support that discussion I created a draft suggestion for the interface (attached to this mail). It is right now defined using inline code for the sake of simplicity. This may change later; I don't think there should be any issue with that. It essentially contains 3 functions each for complex FFT and real FFT (alloc, fft, free) and the necessary data structures.
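
The attachment is not part of this archive. As a rough illustration of the alloc/fft/free shape described above, here is a minimal sketch under stated assumptions: every name (`xfft_cfg`, `xfft_alloc`, `xfft_run`, `xfft_free`, `XFFT_N`) is hypothetical, the length is fixed at 8 to keep it short, and the transform is a placeholder O(n^2) DFT standing in for whichever backend (kiss_fft or the ARM DSP library) a real build would call. It also shows the idea from step 4: the twiddle table is const data, and the caller decides at alloc time whether to work from that table or from a heap copy in RAM that is returned on free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define XFFT_N 8  /* hypothetical fixed length, keeps the sketch short */

typedef struct { float re, im; } xfft_cpx;

/* Forward twiddles exp(-j*2*pi*k/8); const data like this would
 * normally be placed in flash by the linker. */
static const xfft_cpx xfft_flash_tw[XFFT_N] = {
    {  1.0f,         0.0f        }, {  0.70710678f, -0.70710678f },
    {  0.0f,        -1.0f        }, { -0.70710678f, -0.70710678f },
    { -1.0f,         0.0f        }, { -0.70710678f,  0.70710678f },
    {  0.0f,         1.0f        }, {  0.70710678f,  0.70710678f }
};

typedef struct {
    int inverse;         /* 0 = forward, 1 = inverse (unscaled) */
    const xfft_cpx *tw;  /* points at the const table or at the RAM copy */
    xfft_cpx *ram_tw;    /* non-NULL only when a RAM copy was made */
} xfft_cfg;

/* alloc: all state comes from the heap, so it can be handed back later. */
xfft_cfg *xfft_alloc(int inverse, int tables_in_ram)
{
    xfft_cfg *cfg = malloc(sizeof *cfg);
    if (cfg == NULL) return NULL;
    cfg->inverse = inverse;
    cfg->tw = xfft_flash_tw;
    cfg->ram_tw = NULL;
    if (tables_in_ram) {
        cfg->ram_tw = malloc(sizeof xfft_flash_tw);
        if (cfg->ram_tw == NULL) { free(cfg); return NULL; }
        memcpy(cfg->ram_tw, xfft_flash_tw, sizeof xfft_flash_tw);
        cfg->tw = cfg->ram_tw;
    }
    return cfg;
}

/* fft: placeholder direct DFT, not in-place; `out` must not alias `in`. */
void xfft_run(const xfft_cfg *cfg, const xfft_cpx *in, xfft_cpx *out)
{
    assert(cfg != NULL && in != out);
    for (int k = 0; k < XFFT_N; k++) {
        float re = 0.0f, im = 0.0f;
        for (int t = 0; t < XFFT_N; t++) {
            xfft_cpx w = cfg->tw[(k * t) % XFFT_N];
            if (cfg->inverse) w.im = -w.im;  /* conjugate twiddle for the inverse */
            re += in[t].re * w.re - in[t].im * w.im;
            im += in[t].re * w.im + in[t].im * w.re;
        }
        out[k].re = re;
        out[k].im = im;
    }
}

/* free: releases the optional RAM table and the config itself. */
void xfft_free(xfft_cfg *cfg)
{
    if (cfg != NULL) { free(cfg->ram_tw); free(cfg); }
}
```

A caller would allocate one config per transform direction when FreeDV becomes active and free them when the mode is left, which keeps the RAM available for other operating modes as requested above.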
Danilo

On 16.09.2016 at 20:59, Dana Myers wrote:

Subject: Re: [Freetel-codec2] more benching and thoughts
Date: Fri, 16 Sep 2016 12:13:05 +1000
From: glen english
Reply-To: freetel-codec2@lists.sourceforge.net
To: freetel-codec2@lists.sourceforge.net

Hi Danilo

Yeah, I guess being a very bare metal programmer from the old 128 byte RAM days, I dislike MALLOCs in embedded code on principle.

I'm similar, though now that we have 32kB, 64kB (or even more) RAM in embedded chips, they're basically like the systems that malloc() was initially built on :-) I still don't trust C++ heap allocators in embedded applications, though. However, because the heap usage would be deterministic, it should be fairly safe.

I have a similar project: a 1200 baud modem + TNC stack built in a PSoC 5LP. I use the Delta-Sigma ADC @ 9600 s/s, 16 bits, and the PSoC Digital Filter Block to do the bandpass heavy lifting and frequency response correction for the ADC. I use CMSIS-DSP for the rest of the DSP crunching required and, wait for it, wrap the whole thing in FreeRTOS (9.0.0 now). I use q31_t for all the DSP, as long as I'm careful to avoid blowing out
Re: [Freetel-codec2] cascaded ulaw, Alaw and,AMBE etc
On 17/09/16 18:28, glen english wrote:
> Yeah.
>
> Cascading codecs is always trouble.
>
> BTW My understanding of the word TRANSCODING is going between one encoded
> method and another without going back all the way to uncompressed. Going
> between different video encoding methods is usually done by transcoding.
> Most video encoding is all DCT / macroblock based, so there is probably more
> relationship between codecs than between speech codec variations.
>
> In this problematic voip case: uncoded PCM (microphone) >> AMBE2 >>
> over air >> decoded >> PCM >> uLaw encode >> ulaw decode >> headset. (and
> reverse) .

Transcoding is going from one format to another, and when the destination is a lossy format, that will mean a reduction in quality. Decoding to uncompressed shouldn't incur loss compared with a hypothetical conversion from AMBE2 direct to µ-law; more likely the assumptions made by the µ-law CODEC don't hold true for the synthesized voice from the AMBE2 CODEC, and that would be why it sounds so terrible.

Regards,
--
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
Re: [Freetel-codec2] cascaded ulaw, Alaw and,AMBE etc
Yeah.

Cascading codecs is always trouble.

BTW My understanding of the word TRANSCODING is going between one encoded method and another without going back all the way to uncompressed. Going between different video encoding methods is usually done by transcoding. Most video encoding is all DCT / macroblock based, so there is probably more relationship between codecs than between speech codec variations.

In this problematic voip case: uncoded PCM (microphone) >> AMBE2 >> over air >> decoded >> PCM >> uLaw encode >> ulaw decode >> headset. (and reverse) .

regards

On 17/09/2016 3:46 PM, David Rowe wrote:
> It's a good question Glen, off the top of my head I'm not sure. When
> one or more codecs are combined it's called transcoding and IIRC often
> causes problems.
>
> Alaw/mulaw are rather non-linear operations. That could upset the
> parameter estimation algorithms.
>
> - David
>
> On 17/09/16 11:38, glen english wrote:
>> David
>> you are the man to ask this one
>>
>> Why do (or why do you think) relatively benign companding algorithms
>> like Alaw, uLaw, that are commonly used for encoding for VOIP links for
>> radio systems (and other), sound so awful when they pass AMBE / AMBE2
>> etc processed speech?
>>
>> Must be something to do with the redistribution of the quantization noise
>> characteristics?
>>
>> Not sure what the effect is yet with codec2, but that would be a cinch
>> to test up.
>>
>>
>> g
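
To make the "rather non-linear" remark about A-law/µ-law concrete, here is a minimal sketch of a µ-law encode/decode pair, written in the style of the widely circulated G.711 reference code. It is not taken from the thread or from any codec named in it, and the names `ulaw_encode` and `ulaw_decode` are chosen for this illustration only. The point it shows is that the quantization step depends on amplitude: 8 counts (in 16-bit PCM units) in the lowest segment, doubling per segment up to 1024 counts in the highest.

```c
#include <assert.h>
#include <stdint.h>

#define ULAW_BIAS 0x84   /* 132, added before the segment search */
#define ULAW_CLIP 32635  /* largest magnitude that still fits in 15 bits after the bias */

/* Compress one 16-bit linear PCM sample to one mu-law byte. */
uint8_t ulaw_encode(int16_t pcm)
{
    int sign = (pcm < 0) ? 0x80 : 0x00;
    int mag = (pcm < 0) ? -(int)pcm : (int)pcm;
    if (mag > ULAW_CLIP) mag = ULAW_CLIP;
    mag += ULAW_BIAS;

    /* Segment number = position of the highest set bit, counted from bit 7. */
    int exponent = 7;
    for (int mask = 0x4000; (mag & mask) == 0 && exponent > 0; mask >>= 1)
        exponent--;

    /* Four bits below the leading one select the step inside the segment. */
    int mantissa = (mag >> (exponent + 3)) & 0x0F;
    return (uint8_t)~(sign | (exponent << 4) | mantissa);
}

/* Expand one mu-law byte back to 16-bit linear PCM. */
int16_t ulaw_decode(uint8_t code)
{
    code = (uint8_t)~code;
    int exponent = (code >> 4) & 0x07;
    int mantissa = code & 0x0F;
    int mag = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS;
    return (int16_t)((code & 0x80) ? -mag : mag);
}
```

With this sketch, a sample of 1000 comes back as 988 and a sample of 30000 comes back as 30076, so the absolute error grows with level. That level-dependent rounding is applied on top of whatever the preceding vocoder has already synthesized.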