Hi Glen,

just checked, and it seems to me that we have all the caches running in the
mcHF:

https://github.com/df8oe/mchf-github/blob/670f94a2e69a55a03f099ad25390925c84c09201/mchf-eclipse/cmsis_boot/system_stm32f4xx.c#L424

So even with these caches/buffers enabled, the M4 loses performance on
data reads from flash. I haven't checked the manual/internet yet, but maybe
the flash caching works well for code and not in the same way for data.
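
For reference, the cache setup in that file boils down to something like
this (a sketch of the usual ST SystemInit/SetSysClock sequence, not copied
verbatim from that line):

    #include "stm32f4xx.h"

    /* Enable the ART accelerator: instruction cache, data cache and
       prefetch buffer, plus the flash wait states for 168 MHz @ 3.3 V. */
    static void flash_art_enable(void)
    {
        FLASH->ACR = FLASH_ACR_ICEN          /* instruction cache on */
                   | FLASH_ACR_DCEN          /* data cache on        */
                   | FLASH_ACR_PRFTEN        /* prefetch buffer on   */
                   | FLASH_ACR_LATENCY_5WS;  /* 5 wait states        */
    }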


Danilo

On 18.09.2016 at 03:12, glen english wrote:
> Hi Danilo
>
> Good thoughts and points.
>
> while on the RAM subject: required reading for every serious programmer:
> Ulrich Drepper's "What every programmer should know about memory"
> https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich
>
> Read all 7 parts (about 100 pages), but if you only have an hour, just
> read part 2 (the one on caches).
>
> On the fft:
>
> I am not surprised that the ARM lib's hand-optimized assembler is that
> much faster.
> More than 2x faster, in fact.
>
> I don't think kiss-fft is particularly suitable for this sort of platform
> either; I'll hold back what I really think :-).
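>
> For anyone who wants to see what the swap looks like, the CMSIS-DSP real
> FFT call is roughly this (a minimal sketch, assuming arm_rfft_fast_f32
> at a 512 point length; not taken from the actual codec2/mcHF sources):
>
>     #include "arm_math.h"
>
>     #define FFT_LEN 512                      /* example length only     */
>
>     static arm_rfft_fast_instance_f32 rfft;  /* small static state      */
>     static float32_t fft_in[FFT_LEN];        /* real input (overwritten) */
>     static float32_t fft_out[FFT_LEN];       /* packed complex spectrum */
>
>     void fft_demo(void)
>     {
>         /* point the instance at the const twiddle/bit-reversal tables */
>         arm_rfft_fast_init_f32(&rfft, FFT_LEN);
>         /* last argument: 0 = forward transform, 1 = inverse */
>         arm_rfft_fast_f32(&rfft, fft_in, fft_out, 0);
>     }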
>
> The 5WS on flash (actually 6WS, as I am running @ 168M) does not really
> affect the performance too much. In fact I can vary the WS count +/- 2
> without much change: the ART accelerator's prefetch and its instruction
> and data caches are doing their job, so there is very little difference
> between having the const values in RAM and fetching them through the cache.
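>
> For reference, varying the WS count is just a rewrite of the LATENCY
> field in FLASH->ACR; a minimal sketch:
>
>     #include "stm32f4xx.h"
>
>     /* Set the flash wait-state count; LATENCY is the low bit-field of
>        FLASH->ACR on the F4. */
>     static void flash_set_wait_states(uint32_t ws)
>     {
>         FLASH->ACR = (FLASH->ACR & ~FLASH_ACR_LATENCY)
>                    | (ws & FLASH_ACR_LATENCY);
>     }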
>
> In fact, most FFT implementations are very tough on a machine with cache.
> Have you read the paper on how FFTW works? It is very cache-aware and
> adaptive to the architecture; that is why it does trial runs and picks
> the best plan.
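>
> The "trial runs" are FFTW's planner; on a PC it looks roughly like this
> (standard fftw3 API, purely for illustration):
>
>     #include <fftw3.h>
>
>     void fftw_demo(void)
>     {
>         int n = 1024;
>         fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
>         fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
>
>         /* FFTW_MEASURE times several candidate decompositions on this
>            machine and keeps the fastest; FFTW_ESTIMATE would skip the
>            trial runs and just guess. */
>         fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD,
>                                        FFTW_MEASURE);
>
>         /* ... fill 'in' with samples (planning may clobber it) ... */
>         fftw_execute(p);
>
>         fftw_destroy_plan(p);
>         fftw_free(in);
>         fftw_free(out);
>     }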
>
> The M7 is very impressive; certainly good work by ARM.
>
> However, the M4 is what all of you have to work with, so we can stay
> focussed on that.
>
> I think the RAM usage will also be significantly less with the ARM FFT,
> because of kiss-fft's re-entrant behaviour (each plan carries its own
> run-time-allocated state).
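>
> For comparison, each kiss-fft plan allocates its own state from the heap
> at run time, while the CMSIS instance in the sketch above is one small
> static struct; roughly (standard kiss_fft API, size purely illustrative):
>
>     #include "kiss_fft.h"
>
>     #define NFFT 512   /* example size only */
>
>     static kiss_fft_cpx in[NFFT], out[NFFT];  /* complex buffers */
>
>     void kiss_demo(void)
>     {
>         /* the plan's twiddle tables and state come off the heap, which
>            is where the extra RAM goes */
>         kiss_fft_cfg cfg = kiss_fft_alloc(NFFT, 0 /* forward */, NULL, NULL);
>
>         /* ... fill 'in' ... */
>         kiss_fft(cfg, in, out);
>
>         kiss_fft_free(cfg);   /* a thin wrapper around free() */
>     }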
>
> The M4 is quite a different beast, and having no D-cache can actually
> improve performance over the M7 for some (ineptly) written applications
> (not this one, but as a generalization, for applications grabbing a byte
> at random from all over a large dataset).
>
> Large matrix operations are where machines with caches fall over; that
> is, once the dataset is bigger than the cache....
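>
> As a sketch of that worst-case access pattern (the random byte-grabbing
> case above, illustrative only):
>
>     #include <stddef.h>
>     #include <stdint.h>
>
>     /* Every read lands on a different cache line of a dataset much
>        larger than the cache, so each access pays a full line fill for
>        one useful byte. */
>     uint32_t scattered_sum(const uint8_t *data, size_t len,
>                            const uint32_t *idx, size_t n)
>     {
>         uint32_t sum = 0;
>         for (size_t i = 0; i < n; i++)
>             sum += data[idx[i] % len];   /* random, cache-hostile reads */
>         return sum;
>     }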
>
> The question is how much optimization is enough. I am tempted NOT to
> optimize any more, although I feel (just by looking at it) I can get
> another 2x out of it. Why? Well, there is no real pressing need, and
> going too far away from the reference code will island the code a bit.
> However, if you run out of modem cycles / modem RAM, then we can probably
> get a bit more...
>
> cheers
>



------------------------------------------------------------------------------
_______________________________________________
Freetel-codec2 mailing list
Freetel-codec2@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freetel-codec2
