Hi Glen,
On 18.09.2016 03:12, glen english wrote:
> Hi Danilo
>
> Good thoughts and points.
>
> while on the RAM subject : required reading for every serious programmer :
> "Memory"
> https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich
>
> read all 7 parts, 100 pages, but if you only have an hour, just read
> "part 2 - cache"
>
> On the fft:
>
> I am not surprised that the ARM lib hand optimized assembler is that
> much faster.
> more than 2x faster.... in fact.

As I said, on the M4 kiss fft and arm dsp are on par, not much difference.
And using RAM vs. flash makes a lot of difference on the mcHF M4 for code
which uses tables heavily. We gained a lot from removing the need to go to
flash twice in the fir_filter vs. fir_filter2. Maybe the mcHF startup
configuration is not enabling all the caches. Will check that.

> I don't think kiss-fft is particularly suitable for this sort of platform,
> either, I'll hold back what I really think :-) .
>
> The 5WS on flash (actually 6WS I am running @ 168M) does not really
> affect the performance too much. In fact I can vary the WS count +/- 2
> without much change- the ART and the prefetch and the instruction and
> data caches are doing their job, so there is very little difference with
> the const values in ram or cache.
>
> In fact, most FFT implementations are very tough on a machine with cache .
> Have you read the paper on how FFTW works ? It is very cache aware- and
> adaptive to the architecture- that is why it does trial runs and picks
> the best.
>
> The M7 is very impressive. It is certainly impressive work by ARM.
>
> However, the M4 is what all of you have to work with so we can stay
> focussed on that.
>
> I think also the ram usage will be significantly less with the arm FFT
> because of the re-entrant Kiss-fft behaviour.
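For the startup check mentioned above, here is a minimal sketch of how the flash accelerator settings could be decoded. It assumes the STM32F4 FLASH_ACR layout as I remember it from the reference manual (LATENCY in bits 2:0, PRFTEN bit 8, ICEN bit 9, DCEN bit 10); the names flash_accel_t, decode_flash_acr and flash_accel_all_on are made up for illustration and are not taken from the mcHF sources.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed STM32F4 FLASH_ACR bit layout; verify against the reference manual. */
#define ACR_LATENCY_MASK 0x7u        /* bits 2:0: flash wait states */
#define ACR_PRFTEN       (1u << 8)   /* prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Hypothetical helper type: decoded view of the accelerator settings. */
typedef struct {
    unsigned wait_states;
    bool prefetch;
    bool icache;
    bool dcache;
} flash_accel_t;

/* Split a raw FLASH_ACR value into its fields. */
static flash_accel_t decode_flash_acr(uint32_t acr)
{
    flash_accel_t s;
    s.wait_states = (unsigned)(acr & ACR_LATENCY_MASK);
    s.prefetch    = (acr & ACR_PRFTEN) != 0;
    s.icache      = (acr & ACR_ICEN) != 0;
    s.dcache      = (acr & ACR_DCEN) != 0;
    return s;
}

/* True only if prefetch, I-cache and D-cache are all switched on. */
static bool flash_accel_all_on(uint32_t acr)
{
    const uint32_t all = ACR_PRFTEN | ACR_ICEN | ACR_DCEN;
    return (acr & all) == all;
}
```

On the target one would pass in the live register value, e.g. decode_flash_acr(FLASH->ACR) with the CMSIS device header included, right after clock setup, and print or assert on the result.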
>
> The m4 is quite a different beast, and no D-cache can improve
> performance over the M7 for some (inaptly) written applications (not
> this one- but as a generalization for applications grabbing a byte from
> memory randomly and all over a large dataset)
>
> Large matrix operations are where cache machines fall over- that is once
> the dataset is bigger than the cache....
>
> The question is how much optimization is enough. I am tempted NOT to
> optimize any more, although I feel (just by looking at it ) I can get
> another 2x out of it..... Why- well there is no real pressing need.
> Going too far away from the reference code will island the code a bit.
> However, if you run out of modem cycles/ modem ram, then we can probably
> get a bit more...
>
> cheers

Yes, unfortunately we have an M4 at hand, so a little more RAM would be
nice :-(

Regards,
Danilo

------------------------------------------------------------------------------
_______________________________________________
Freetel-codec2 mailing list
Freetel-codec2@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freetel-codec2