Hi Glen,

On 18.09.2016 03:12, glen english wrote:
> Hi Danilo
>
> Good thoughts and points.
>
> While on the RAM subject: required reading for every serious programmer:
> "Memory" (Ulrich Drepper, "What every programmer should know about memory")
> https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich
>
> Read all 7 parts (100 pages), but if you only have an hour, just read
> "part 2 - cache".
>
> On the FFT:
>
> I am not surprised that the hand-optimized assembler in the ARM lib is that
> much faster. More than 2x faster, in fact.
As I said, on the M4 kiss-fft and the ARM DSP library are on par, with not
much difference. And using RAM instead of flash makes a lot of difference in
the mcHF M4 code for routines which use tables heavily. We gained a lot from
removing the need to go to flash twice in fir_filter vs. fir_filter2.
Maybe the mcHF startup configuration is not enabling all the caches.
Will check that.
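
A minimal sketch of how such a check could look, assuming the STM32F4
FLASH_ACR layout (LATENCY in bits 2:0, PRFTEN bit 8, ICEN bit 9, DCEN
bit 10). The helper names are hypothetical and the bit positions should be
verified against the device header; on the target one would pass in the
value read from FLASH->ACR.

```c
#include <stdint.h>

/* Assumed FLASH_ACR bit layout for the STM32F405/407; verify against the
   reference manual or the CMSIS device header before relying on it. */
#define ACR_LATENCY_MASK 0x7u        /* bits 2:0, flash wait states */
#define ACR_PRFTEN       (1u << 8)   /* prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Hypothetical helper type: decoded view of the register value. */
typedef struct {
    unsigned wait_states;
    int prefetch;
    int icache;
    int dcache;
} flash_acr_info;

/* Decode a raw FLASH_ACR value into its individual settings. */
static flash_acr_info decode_flash_acr(uint32_t acr)
{
    flash_acr_info info;
    info.wait_states = (unsigned)(acr & ACR_LATENCY_MASK);
    info.prefetch = (acr & ACR_PRFTEN) != 0;
    info.icache = (acr & ACR_ICEN) != 0;
    info.dcache = (acr & ACR_DCEN) != 0;
    return info;
}

/* Returns 1 only if prefetch and both flash caches are switched on. */
static int flash_accel_fully_enabled(uint32_t acr)
{
    const uint32_t all = ACR_PRFTEN | ACR_ICEN | ACR_DCEN;
    return (acr & all) == all;
}
```

If flash_accel_fully_enabled() returns 0 for the value found after startup,
that would support the suspicion that not all caches are enabled.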
> I don't think kiss-fft is particularly suitable for this sort of platform
> either; I'll hold back what I really think :-).
>
> The 5 WS on flash (actually 6 WS; I am running at 168 MHz) do not really
> affect the performance much. In fact, I can vary the WS count by +/- 2
> without much change: the ART, the prefetch, and the instruction and
> data caches are doing their job, so there is very little difference between
> having the const values in RAM or in flash.
>
> In fact, most FFT implementations are very tough on a machine with cache.
> Have you read the paper on how FFTW works? It is very cache-aware and
> adaptive to the architecture; that is why it does trial runs and picks
> the best.
>
> The M7 is very impressive. It is certainly impressive work by ARM.
>
> However, the M4 is what all of you have to work with so we can stay 
> focussed on that.
>
> I also think the RAM usage will be significantly lower with the ARM FFT
> because of the re-entrant kiss-fft behaviour.
>
> The M4 is quite a different beast, and having no D-cache can improve
> performance relative to the M7 for some (ineptly) written applications (not
> this one, but as a generalization: applications grabbing a byte from
> memory randomly, all over a large dataset).
>
> Large matrix operations are where cache machines fall over, that is, once
> the dataset is bigger than the cache.
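
The point about access patterns and datasets larger than the cache can be
sketched with two traversals of the same matrix. This is only an
illustration with hypothetical function names, not code from the modem:
both functions compute the same sum, but the second one jumps
cols * sizeof(int) bytes on every access, so once the matrix no longer fits
in the cache it tends to miss on nearly every load.

```c
#include <stddef.h>

/* Walk the matrix in storage order: consecutive addresses, so each
   cache line that is fetched gets fully used before moving on. */
static long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long s = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Same arithmetic, but the inner loop strides by a whole row, so
   successive accesses land in different cache lines. */
static long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long s = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```

Timing the two on the target with a matrix several times larger than the
cache would show how much the traversal order costs on that particular
memory system.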
>
> The question is how much optimization is enough. I am tempted NOT to
> optimize any more, although I feel (just by looking at it) I can get
> another 2x out of it. Why? Well, there is no real pressing need.
> Going too far away from the reference code will island the code a bit.
> However, if you run out of modem cycles or modem RAM, then we can probably
> get a bit more.
>
> cheers
Yes, unfortunately we have an M4 at hand, so a little more RAM would be
nice :-(

Regards,
Danilo
> ------------------------------------------------------------------------------
> _______________________________________________
> Freetel-codec2 mailing list
> Freetel-codec2@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/freetel-codec2