date:20160917

Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread Danilo Beuche

Hi Glen,

I just verified: the cycle counts for the true 16 MHz setup are within 1%
of those from the 8 MHz operation (but measurements now take half as long :-) )

Danilo


--
___
Freetel-codec2 mailing list
Freetel-codec2@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freetel-codec2

Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread glen english

Hi
OK.
Well, anyway, decode 1200 (40 ms frame) takes 12.34 ms on my kit with the
ARM FFT, and 19.86 ms using kiss-fft.
I think you estimated about 14.4 ms for a decode 1300 on your kit.

So, I will be interested to see what you come up with using cfft.

My Codec 2 codebase is from August 2015.

cheers






Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread Danilo Beuche

Hi Glen, I would not worry too much:

- Maybe gcc 5.4 vs 4.9: the difference is about -20% (depending on which
end you look from). That is a lot, but not inexplicable.

- Maybe it is my test data. I don't know how much the kiss_fft run time
varies when different data is presented. I am running "artificially"
generated audio input (digitally captured codec2 frames from a single
750 Hz sine wave, also generated digitally).

- Maybe it is my strange way of running the mcHF firmware: the mcHF
hardware has a 16 MHz XO, but the Discovery board which I have here for
testing has an 8 MHz XO. I didn't bother to reconfigure the PLL, so
everything takes twice the time. If the flash were asynchronously
coupled, which I doubt (otherwise there would be no need for explicit
wait state settings), it would have an influence. But here I am quite
sure this is not the case. If the caches are asynchronous: maybe. Maybe I
should remeasure with a fixed PLL setup so that the processor runs at a
true 168 MHz. Will do that later and get back with updated numbers.
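
To make the clock effect described above concrete, here is a small sketch of
the STM32F4 main PLL arithmetic (VCO = input / PLLM * PLLN, SYSCLK = VCO /
PLLP). The divider values 16 / 336 / 2 are an assumption, chosen so that a
16 MHz crystal gives 168 MHz; they are not taken from the mcHF sources, and
the type and function names are made up for illustration.

```c
#include <stdint.h>

/* Main PLL dividers. The values used with this struct in the note below
 * are assumptions for illustration, not read from the mcHF firmware. */
typedef struct {
    uint32_t pllm;  /* input divider */
    uint32_t plln;  /* VCO multiplier */
    uint32_t pllp;  /* output divider for SYSCLK */
} pll_cfg_t;

/* Resulting core clock in Hz for a given crystal frequency. */
uint32_t pll_sysclk_hz(uint32_t xo_hz, pll_cfg_t cfg)
{
    uint32_t vco_in = xo_hz / cfg.pllm;   /* 16 MHz / 16  = 1 MHz   */
    uint32_t vco    = vco_in * cfg.plln;  /* 1 MHz * 336  = 336 MHz */
    return vco / cfg.pllp;                /* 336 MHz / 2  = 168 MHz */
}
```

With these assumed dividers, pll_sysclk_hz(16000000, cfg) gives 168 MHz and
pll_sysclk_hz(8000000, cfg) gives 84 MHz, so the same cycle count takes
twice the wall-clock time on the 8 MHz board.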

Danilo






Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread glen english

Environment: Rowley CrossStudio for ARM 3.6.4, GCC 4.9.

Using the cycle counter: yes.

Interrupt overhead: most likely irrelevant in my setup; the (asm) IRQs
only set flags...

For kissfft at 5 WS on the F4, I wonder why you have 112500 cycles and I
have 141000. Something for me to look at.

Hmm.

-O2, but I also have a bunch of debug symbol stuff in there. I think at
debug level 2 it is only symbol data, which pushes up the image size.





Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread Danilo Beuche

Hi Glen,

What are you measuring, and how? I would like to repeat the measurements
in my environment.

We use the cycle counter of the CORTEX processor (seems you do the
same), interrupt overhead is removed (counter is stopped during the
major interrupt, which is the audio irq).
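
A minimal sketch of this kind of measurement, written so that it runs on a
host: the counter readings are passed in as plain integers, whereas on a
Cortex-M4 they would normally come from the DWT cycle counter (CYCCNT).
Note the assumption: instead of stopping the counter during the interrupt
as described above, this variant lets the counter run and subtracts the
interrupt time afterwards. All names are hypothetical.

```c
#include <stdint.h>

/* Cycles between two readings of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so a single wraparound between
 * the two readings still gives the right answer. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical bookkeeping for one measurement. */
typedef struct {
    uint32_t start;       /* counter value when the measurement began */
    uint32_t irq_cycles;  /* cycles spent in interrupts, to be excluded */
} cycle_meas_t;

void meas_begin(cycle_meas_t *m, uint32_t now)
{
    m->start = now;
    m->irq_cycles = 0;
}

/* Called with the counter values at interrupt entry and exit. */
void meas_add_irq(cycle_meas_t *m, uint32_t irq_start, uint32_t irq_end)
{
    m->irq_cycles += cycles_elapsed(irq_start, irq_end);
}

/* Net cycles since meas_begin(), with interrupt time removed. */
uint32_t meas_end(const cycle_meas_t *m, uint32_t now)
{
    return cycles_elapsed(m->start, now) - m->irq_cycles;
}
```

The subtraction variant gives the same net figure as stopping the counter,
as long as every interrupt that fires during the measurement is accounted
for.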

I have some prerecorded "frames" stored in flash which are fed into the
freedv_comprx routine. Measurements are taken every 50 frames, since
initially the decoder needs to lock onto the data.

After 50 frames I get stable measurements.

I did some hotspot analysis, and a single 512-point kiss_fft call was
taking around 670 us; at 168 cycles per us that is about 112500 cycles.

So this is in line with your numbers.
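
The conversion between cycle counts and times used throughout this thread
can be sketched as follows, assuming the 168 MHz core clock mentioned in
the messages. The function names are made up for illustration.

```c
#include <stdint.h>

#define SYSCLK_HZ 168000000u  /* 168 MHz core clock, as stated in the thread */

/* Cycles -> microseconds, rounded to the nearest microsecond. */
uint32_t cycles_to_us(uint32_t cycles)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u + SYSCLK_HZ / 2u) / SYSCLK_HZ);
}

/* Microseconds -> cycles. */
uint32_t us_to_cycles(uint32_t us)
{
    return (uint32_t)(((uint64_t)us * SYSCLK_HZ) / 1000000u);
}
```

For example, us_to_cycles(670) is 112560, and cycles_to_us(2073979) is
12345, i.e. about 12.3 ms.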

My results for kiss_fft vs arm_fft may not be correct, since in fact I
was running kiss_fftr vs. arm_rfft_fast_f32 in this case.

Maybe there is a huge penalty for the arm_rfft_fast_f32 code.

Let me try the kiss_fft vs. arm_cfft case to confirm your measurements.

Danilo


[Freetel-codec2] M4 runs, summary :

2016-09-17 Thread glen english

-O2, debug level 2
1200 mode (40 ms frame)
5 wait states

STM32F405RGT6

kissfft
encode : 2780314 cycles : 16.5 ms
decode : 3335953 cycles : 19.85 ms
kissFFT 512 cpx float : 141,032 cycles

ARM FFT
encode : 1579922 cycles : 9.4 ms
decode : 2073979 cycles : 12.34 ms
ARM FFT : 50,597 cycles


So, I still see more than 2:1 on cycle count for the FFT, in favour of the ARM FFT.
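
As a quick check of the ratios implied by the figures above, here is a
small integer-only helper (the name is made up for this note). It returns
the ratio scaled by 100, rounded down, so no floating point is needed.

```c
#include <stdint.h>

/* Ratio a/b scaled by 100 (two decimal places), rounded down. */
uint32_t ratio_x100(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * 100u) / b);
}
```

With the numbers from this message, ratio_x100(141032, 50597) is 278 for
the FFT alone, while the whole encode and decode come out at 175 and 160,
so the gain over a full frame is smaller than the gain on the FFT itself.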

Can one of you guys get a cycle count on kiss-fft?

And the Cortex-M7 is about 2x the speed on that code, for the SAME clock.






Re: [Freetel-codec2] F4 runs.....

2016-09-17 Thread Danilo Beuche

Hi Glen,

Difficult to say:

I have data for freedv_comprx() for FreeDV 1600 (which translates to decode
1300, because some bits are used for data transmission). It does a modem
decode every 20 ms and a voice decode every 40 ms.

On average one call takes 12.2 ms, so 24.4 ms per 40 ms voice frame. Of
that, 2 x 5 ms is the fdmdv_demod part, so decode 1300 is 14.4 ms.

I am using gcc 5.4, mostly at -O2 (some files have -O3 enabled, but none of
the codec2 files).

And I run more or less the newest SVN code (codec2-dev 2875).

If you run somewhat older code, the numbers you gave match the performance
of a few days ago. As David mentioned, we gained about 40% since r2842
(or so). So they could be right.

Danilo


[Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread glen english

ARM FFT, STM32F405RGT6, 6 WS

-O2, debug level 2.

1200 mode (40 ms frame), 6 WS
encode : 1745799 cycles : 10.39 ms
decode : 2292497 cycles : 13.64 ms

So, 2x the speed of my other runs.

Let's run 5 WS:

1200 mode (40 ms frame), 5 WS
encode : 1579922 cycles : 9.4 ms
decode : 2073979 cycles : 12.34 ms

WOW, the wait states hurt on the M4.







[Freetel-codec2] F4 runs.....

2016-09-17 Thread glen english

kiss fft as per the standard codec2 code...

-O2, debug level 2.

3200 mode (20 ms frame)
encode : 1388964 cycles : 8.26 ms
decode : 1807440 cycles : 10.75 ms

1200 mode (40 ms frame)
encode : 3006497 cycles : 17.89 ms
decode : 3669321 cycles : 21.8 ms

Hmm, seems pretty slow.
I wonder what I am doing wrong?
Or is that in line with others' measurements?

kissFFT 512 : 155,483 cycles (versus 70,000 on the M7)

Next... ARM asm on the F4 (STM32F405RGT6)





Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread Danilo Beuche

Hi Glen,

On 18.09.2016 at 05:22, glen english wrote:
> Hi Danilo
>
> Yes, there is most-certainly a penalty for const access from flash, at
> least on the M4.
>
> and of course instruction cache is no use, is just that, only for
> instructions
Hence the name :-)
> I wonder what the bus matrix penalty is for data fetches from flash.
> There is an app-note about it somewhere I once read. It depends on what
> else is going on. The processor is pretty smart and interleaving the
> accesses as not to stall the pipeline or bus matrix.
>
> As Danilo I am sure you know : (pointed out for others)
> You can force variables into sections (like forcing a static const ) by
> using
>
> __attribute__((section("name"))) to assign say into data.
Yes, indeed. We use that extensively in the mcHF firmware to move certain
parts to the CCM memory, which is otherwise not so easily accessible (and
should be used with care, as it does not support DMA to/from peripherals).
Works great.
> I will some time run up the code on an M4 and see what kiss-fft does.
>
> I am very very very surprised , and do not really believe that kissFFT
> is as fast as the arm assembler  on the M4 -  my immediate thoughts are
> "you are doing it wrong". so, I will investigate.
I would be happy to be wrong. That would mean we get an even faster FFT on
the M4 almost for free (minus investigation time, that is). No problem at
all with me :-)


Danilo




[Freetel-codec2] memory management options

2016-09-17 Thread glen english

David
Have a look at

http://www.freertos.org/a00111.html

I use heap_2.c for most things... it works on the premise of repeated
requests and returns of blocks of the same size.

heap_4.c is good, also.
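
To illustrate why "requests and returns of things the same size" is a
friendly pattern for a simple allocator, here is a minimal fixed-size block
pool. This is NOT the FreeRTOS implementation; it is a hypothetical sketch
with made-up names and sizes, showing that when every block has the same
size, a freed block can always satisfy the next request and fragmentation
cannot occur.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  64u  /* hypothetical block size in bytes */
#define POOL_BLOCK_COUNT 8u   /* hypothetical number of blocks */

typedef union pool_block {
    union pool_block *next;            /* link while the block is free */
    uint8_t payload[POOL_BLOCK_SIZE];  /* user data while allocated */
} pool_block_t;

typedef struct {
    pool_block_t blocks[POOL_BLOCK_COUNT];
    pool_block_t *free_list;
} pool_t;

/* Put every block on the free list. */
void pool_init(pool_t *p)
{
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        p->blocks[i].next = p->free_list;
        p->free_list = &p->blocks[i];
    }
}

/* Take one block, or return NULL if the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    pool_block_t *b = p->free_list;
    if (b != NULL) {
        p->free_list = b->next;
    }
    return b;
}

/* Return a block obtained from pool_alloc() to the pool. */
void pool_free(pool_t *p, void *ptr)
{
    pool_block_t *b = ptr;
    if (b != NULL) {
        b->next = p->free_list;
        p->free_list = b;
    }
}
```

Both operations are constant time, which is also attractive in a real-time
audio path.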




Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread glen english

Hi Danilo

Yes, there is most certainly a penalty for const access from flash, at
least on the M4.

And of course the instruction cache is no use here; it is just that, a
cache only for instructions.

I wonder what the bus matrix penalty is for data fetches from flash.
There is an app note about it somewhere that I once read. It depends on
what else is going on. The processor is pretty smart at interleaving the
accesses so as not to stall the pipeline or the bus matrix.

As I am sure you know, Danilo (pointed out for others):
You can force variables into sections (like forcing a static const) by
using

__attribute__((section("name"))) to assign it, say, into data.
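
A minimal sketch of that attribute in use with GCC. The section name
".fast_tables", the table contents and the function name are all made up
for this example. On a real target the name has to match an output section
in the linker script (for instance one mapped to SRAM or CCM), and the
startup code has to copy the initial values there; on a host ELF build the
linker simply places the extra section alongside the normal data.

```c
#include <stdint.h>

/* ".fast_tables" is a hypothetical section name for this sketch. */
#define IN_FAST_RAM __attribute__((section(".fast_tables")))

/* Deliberately not const: the table is meant to live in RAM, not flash.
 * The values are placeholders, not real codec2 data. */
int16_t window_q15[4] IN_FAST_RAM = { 0, 11585, 23170, 32767 };

/* Read one entry; the index wraps so it can never leave the table. */
int16_t window_lookup(unsigned i)
{
    return window_q15[i % 4u];
}
```

The code that uses the table does not change at all; only the placement
does, which is what makes this a cheap experiment when comparing flash and
RAM access times.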

I will at some point run up the code on an M4 and see what kiss-fft does.

I am very, very surprised, and do not really believe that kissFFT is as
fast as the ARM assembler on the M4. My immediate thought is "you are
doing it wrong", so I will investigate.

g










Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread David Rowe

Thanks Danilo,

That's an interesting explanation of why kiss_fft performs just as well 
as the optimised ARM FFT on the M4.  I also found no performance 
improvement when I tried the optimised ARM FFT a few years ago.

I'm inclined to keep malloc/free in Codec 2.  Happy to look at patches
for alternate memory allocators if it's an itch anyone really wants to
scratch.

Danilo - I will get back to you on your other points and proposal shortly.

To the list in general - I'd like to publicly thank Danilo and the mcHF
team for the fine patches they have submitted over the last few
weeks.  The FreeDV 1600 decoder is now around 40% faster on the STM32F4.

Importantly - his suggestions have been backed by quality patches I can 
easily apply and test.  He has shortened my TODO list - not made it 
longer.  Very important for me and a fine example to anyone else who 
would like to contribute to Codec 2.  Thanks Danilo!

- David

On 18/09/16 10:10, Danilo Beuche wrote:
> Hi,
>
> I'd like to share a few thoughts/ideas as well:
>
> Since we finished and removed most of the easy to remove performance
> hotspots with the exception of the kiss_fft calls which now contribute a
> major part to the overall runtime of modem and decoder, I played around
> with kiss_fft vs. arm DSP fft on the STM32F4 with some of the newer libs.
>
> It turns out that, at least for the real FFT (kiss_fftr vs.
> arm_rfft_fast_f32), there is no time difference when used in our mcHF
> code. How does this relate to Glen's measurements (he measured better
> performance)? I believe this is due to the fact that the ARM lib stores
> some of its data in precomputed flash arrays. Access to flash is slow
> (5 wait states), so this reduces the performance. kiss has all its data
> in RAM. Since Glen initialized the ARM DSP tables in RAM, he got speed
> gains at the expense of RAM. On an STM32F746, RAM is not as much of an
> issue (384K) as it is on the STM32F4 (the default MCU in the mcHF
> project has 192K RAM and we have to fit the full SDR firmware RAM needs
> into this space). Speed is traded for RAM use reduction. Since we have
> reached our goal timewise, memory reduction is now more in focus for us
> (but it should not get slower, of course).
>
> Because of that I would like to propose the following approach to keep
> the code easily readable while providing efficient solutions for the ARM
> MCUs both with little and more RAM:
>
> 1. We create an abstract interface for running fft in the codec2
> sources. Initially this should closely resemble the existing kiss_fft
> calls, which makes introducing the interface easy, since we may
> use all the existing tests and can verify that introducing the interface
> does not change a bit in the output.
>
> 2. Once validated, we can now introduce/activate the use of the arm DSP
> FFT with some glue code to map between the abstract interface and the
> arm DSP interface. Here again we have to validate everything is working
> nicely but we will see some slight differences I assume. However, with
> it we can produce reference data for step 3
>
> 3. Now we modify the existing code so that we can benefit from some nice
> properties of the arm DSP fft (inplace FFT) which means this will reduce
> RAM usage significantly (in relation to 192K RAM).
>
> 4. We enable optional use of RAM instead of flash in the ARM code, so
> that depending on the amount of available memory you can get some extra
> boost.
>
> For that to work nicely, we have to fix some issues in the existing code
> first, so here comes
>
> 0. As Glen pointed out, some of the #define constants have not so good
> names. Especially M (defined for 2 different purposes in defines.h and
> fdmdv_internal.h) is nasty, and so is N in defines.h (there are some
> local variables named N and the stm headers also get confused by it). So
> we need to change these to something unambiguous. I think Glen already
> suggested names for them.
>
>
> And I would like to point out, that the use of dynamic memory allocation
> (malloc/free) is necessary in our mcHF case, so I would like to keep
> this more or less as it is. The mcHF needs the ability to reuse the
> memory for other operational modes, if FreeDV is not active. Which does
> not mean I am against removing the internal use of malloc, but then it
> should be possible to easily create the required data structures
> "outside" the code using malloc. I.e. the use of static data structures
> for anything but const data is a no go.
>
> To support that discussion I created a draft suggestion for the interface
> (attached to this mail). It is right now defined using inline code for
> the sake of simplicity. This may change later; I don't think there
> should be any issue with that. It essentially contains 3 functions each
> for complex fft and real fft (alloc, fft, free) and the necessary data
> structures.
>
> Danilo
>
>
>
>
> On 16.09.2016 at 20:59, Dana Myers wrote:
>>
>>> Subject:Re: [Freetel-codec2] more benching and though

Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread Danilo Beuche

Hi Glen,

just checked, it seems to me that we have all the caches running in the 
mcHF:

https://github.com/df8oe/mchf-github/blob/670f94a2e69a55a03f099ad25390925c84c09201/mchf-eclipse/cmsis_boot/system_stm32f4xx.c#L424

So even with these caches/buffers enabled, the M4 loses performance on
data reads from flash. I haven't checked the manual/internet, but maybe
the flash caching works well for code but not in the same way for data.


Danilo


[Freetel-codec2] array subscript is above array bounds

2016-09-17 Thread Steve

Not a very useful posting, but I found an old-style C bug in my encoder
translation. Fixing it reduced my bad data by quite a bit, but there are
still some persistent errors.

Anyway:

#include <stdio.h>

static int snr[4];

int main(void) {
    int big = 40;

    /* Deliberate bug: index 40 is far outside snr[0..3] */
    snr[big] = 7;
    fprintf(stderr, "snr[%d] = %d\n", big, snr[big]);

    return 0;
}

If you run this it will print:

snr[40] = 7

Ha, anyway, I haven't made this error in years. So I tried:

-mpx -fcheck-pointer-bounds

as added switches on GCC and it did flag the warning. One major bug down,
one left... argh...
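
One way to catch this class of bug at run time, without relying on
compiler switches, is an explicit index check around the array. The sketch
below is only an illustration: the helpers snr_set and snr_get and the
ARRAY_LEN macro are made up here and are not part of the codec2 sources.

```c
#include <stddef.h>
#include <stdio.h>

/* Number of elements in a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

static int snr[4];

/* Store value at snr[index] only if index is in range.
 * Returns 0 on success, -1 if the index is out of bounds. */
int snr_set(size_t index, int value)
{
    if (index >= ARRAY_LEN(snr)) {
        fprintf(stderr, "snr index %zu out of range (max %zu)\n",
                index, ARRAY_LEN(snr) - 1);
        return -1;
    }
    snr[index] = value;
    return 0;
}

/* Read snr[index] into *value. Returns 0 on success, -1 if out of bounds. */
int snr_get(size_t index, int *value)
{
    if (index >= ARRAY_LEN(snr)) {
        return -1;
    }
    *value = snr[index];
    return 0;
}
```

With this in place, snr_set(40, 7) is rejected and reported instead of
silently writing past the end of the array.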

Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread Danilo Beuche

Hi Glen,


On 18.09.2016 03:12, glen english wrote:
> Hi Danilo
>
> Good thoughts and points.
>
> while on the RAM subject : require reading for every serious programmer :
> "Memory"
> https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich
>
> read all 7 parts, 100 pages, but if you only have an hour, just read 
> "part 2 - cache"
>
> On the fft:
>
> I am not surprised that the ARM lib  hand optimized assembler is that 
> much faster.
> more that 2x faster in fact.
As I said, on the M4 kiss fft and the ARM DSP fft are on par, not much
difference. And using RAM vs. flash makes a lot of difference in the mcHF
M4 code for code which uses tables heavily. We gained a lot from removing
the need to go to flash twice (fir_filter vs. fir_filter2).
Maybe the mcHF startup configuration is not enabling all the caches.
Will check that.
> I don't think kiss-fft is particular suitable for this sort of platform, 
> either, I'll hold back what I really think :-) .
>
> The 5WS on flash (actually 6WS I am running @ 168M) does not really 
> affect the performance too much.  In fact I can vary the WS count +/- 2 
> without much change- the ART and the prefetch and the instruction and 
> data caches are doing their job, so there is very little difference with 
> the const values in ram or cache.
>
> In fact, most  FFT implementations are very tough on a machine with cache .
> Have you read the paper on how FFTW works ? It is very cache aware- and 
> adaptive to the architecture- that is why it does trial runs and picks 
> the best.
>
> The M7 is very impressive. It is certainly impressive work by ARM.
>
> However, the M4 is what all of you have to work with so we can stay 
> focussed on that.
>
> I think also the ram usage will be significantly less with the arm FFT 
> because of the re-entrant Kiss-fft behaviour.
>
> The m4 is quite a different beast, and no D-cache can improve 
> performance over the M7 for some (inaptly) written applications (not 
> this one- but as a generalization for applications grabbing a byte from 
> memory randomly and all over a large dataset)
>
> Large matrix operations are where cache machines fall over- that is once 
> the dataset is bigger than the cache
>
> The question is how much optimization is enough. I am tempted NOT to 
> optimize any more, although I feel (just by looking at it )  I can get 
> another 2x out of it. Why- well there is no real pressing need. 
> Going too far away from the reference code will island the code a bit.
> However, if you run out of modem cycles/ modem ram, then we can probably 
> get a bit more...
>
> cheers
Yes, unfortunately we have a M4 at hand, so a little more RAM would be
nice :-(

Regards,
Danilo

Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread glen english

Hi Danilo

Good thoughts and points.

While on the RAM subject, required reading for every serious programmer:
"Memory"
https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich

Read all 7 parts, 100 pages, but if you only have an hour, just read
"part 2 - cache".

On the fft:

I am not surprised that the ARM lib hand-optimized assembler is that
much faster.
More than 2x faster, in fact.

I don't think kiss-fft is particularly suitable for this sort of platform
either; I'll hold back what I really think :-) .

The 5WS on flash (actually 6WS, as I am running @ 168M) does not really
affect the performance too much. In fact I can vary the WS count +/- 2
without much change: the ART, the prefetch, and the instruction and
data caches are doing their job, so there is very little difference with
the const values in RAM or cache.
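For reference, the knobs mentioned here (wait states, prefetch, instruction and data cache) all sit in one register on the STM32F4, FLASH_ACR. Below is a minimal sketch, assuming the bit layout of the STM32F405/407 parts (LATENCY in bits 2:0, PRFTEN in bit 8, ICEN in bit 9, DCEN in bit 10). The helper name flash_acr_value and the macro names are made up for illustration, and the function only computes the register value so that it can be checked on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed FLASH_ACR layout for STM32F405/407 class parts. */
#define ACR_LATENCY_MASK 0x7u        /* wait states, bits 2:0 */
#define ACR_PRFTEN       (1u << 8)   /* prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Hypothetical helper: build the register value for a given number of
 * wait states, with the accelerator features all on or all off. */
static uint32_t flash_acr_value(unsigned wait_states, int accel_on)
{
    assert(wait_states <= 7u);
    uint32_t acr = wait_states & ACR_LATENCY_MASK;
    if (accel_on) {
        acr |= ACR_PRFTEN | ACR_ICEN | ACR_DCEN;
    }
    return acr;
}
```

On the target, the result would be written with something like FLASH->ACR = flash_acr_value(6, 1); using the vendor device header, and the benchmark repeated with accel_on set to 0 to see how much the ART and caches are really hiding.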

In fact, most FFT implementations are very tough on a machine with cache.
Have you read the paper on how FFTW works? It is very cache aware and
adaptive to the architecture, which is why it does trial runs and picks
the best.

The M7 is very impressive. It is certainly impressive work by ARM.

However, the M4 is what all of you have to work with so we can stay 
focussed on that.

I also think the RAM usage will be significantly less with the ARM FFT
because of the re-entrant kiss-fft behaviour.

The M4 is quite a different beast, and having no D-cache can make it
perform better than the M7 for some (inaptly) written applications (not
this one; this is a generalization about applications that grab bytes
randomly from all over a large dataset).

Large matrix operations are where cache machines fall over, that is, once
the dataset is bigger than the cache.

The question is how much optimization is enough. I am tempted NOT to
optimize any more, although I feel (just by looking at it) I can get
another 2x out of it. Why? Well, there is no real pressing need.
Going too far away from the reference code will island the code a bit.
However, if you run out of modem cycles / modem RAM, then we can probably
get a bit more...

cheers










Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread Danilo Beuche


Hi,

I'd like to share a few thoughts/ideas as well:

Since we have removed most of the easy-to-remove performance
hotspots, with the exception of the kiss_fft calls, which now contribute a
major part of the overall runtime of modem and decoder, I played around
with kiss_fft vs. the ARM DSP FFT on the STM32F4 with some of the newer libs.


It turns out that, at least for the real FFT (kiss_fftr vs. arm_rfft_fast_f32),
there is no time difference when used in our mcHF code. How does
this relate to Glen's measurements (he measured better performance)?
I believe this is because the ARM lib stores some of its
data in precomputed flash arrays. Access to flash is slow (5 wait
states), so this reduces the performance. kiss has all its data in RAM.
Since Glen initialized the ARM DSP tables in RAM, he got speed gains
at the expense of RAM. On an STM32F746, RAM is not as much of an
issue (384K) as it is on the STM32F4 (the default MCU in the mcHF
project has 192K RAM and we have to fit the full SDR firmware RAM needs
into this space). Speed is traded for RAM use reduction. Since we have
reached our goal timewise, memory reduction is now more in focus for us
(but it should not get slower, of course).


Because of that I would like to propose the following approach to keep 
the code easily readable while providing efficient solutions for the ARM 
MCUs both with little and more RAM:


1. We create an abstract interface for running FFTs in the codec2
sources. Initially this should closely resemble the existing kiss_fft
calls, which makes introducing the interface easy, since we can use all
the existing tests to verify that introducing the interface
does not change a bit in the output.


2. Once validated, we can introduce/activate the use of the ARM DSP
FFT with some glue code to map between the abstract interface and the
ARM DSP interface. Here again we have to validate that everything is working
nicely, but I assume we will see some slight differences. However, with
it we can produce reference data for step 3.


3. Now we modify the existing code so that we can benefit from some nice
properties of the ARM DSP FFT (in-place FFT), which will reduce
RAM usage significantly (in relation to 192K RAM).


4. We enable optional use of RAM instead of flash in the ARM code, so 
that depending on the amount of available memory you can get some extra 
boost.
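Step 4 could look roughly like the following. This is only a sketch under assumptions: table_flash stands in for whatever precomputed array the linker puts in flash, and fft_tables, fft_tables_open and fft_tables_close are made-up names, not anything from the ARM library or codec2. Whether the RAM copy actually pays off has to be measured on the target.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a precomputed table that would live in flash. */
static const float table_flash[8] = {1, 2, 3, 4, 5, 6, 7, 8};

typedef struct {
    const float *table;  /* points at flash or at the RAM copy */
    float *ram_copy;     /* NULL while running from flash */
} fft_tables;

/* use_ram != 0 trades RAM for speed; falls back to flash if malloc fails. */
static void fft_tables_open(fft_tables *t, int use_ram)
{
    assert(t != NULL);
    t->table = table_flash;
    t->ram_copy = NULL;
    if (use_ram) {
        t->ram_copy = malloc(sizeof table_flash);
        if (t->ram_copy != NULL) {
            memcpy(t->ram_copy, table_flash, sizeof table_flash);
            t->table = t->ram_copy;
        }
    }
}

/* Give the RAM back, e.g. when FreeDV mode is left. */
static void fft_tables_close(fft_tables *t)
{
    free(t->ram_copy);
    t->ram_copy = NULL;
    t->table = table_flash;
}
```

The FFT code only ever reads through t->table, so the same binary can serve both the small-RAM and the large-RAM MCUs.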


For that to work nicely, we have to fix some issues in the existing code
first, so here comes step 0:


0. As Glen pointed out, some of the #define constants have poorly chosen
names. M is especially nasty (it is defined for 2 different purposes in
defines.h and fdmdv_internal.h), and so is N in defines.h (there are some
local variables named N, and the STM headers also get confused by it). So
we need to change these to something unambiguous. I think Glen already
suggested names for them.



And I would like to point out, that the use of dynamic memory allocation 
(malloc/free) is necessary in our mcHF case, so I would like to keep 
this more or less as it is. The mcHF needs the ability to reuse the 
memory for other operational modes, if FreeDV is not active. Which does 
not mean I am against removing the internal use of malloc, but then it 
should be possible to easily create the required data structures 
"outside" the code using malloc. I.e. the use of static data structures 
for anything but const data is a no go.
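What I mean by creating the data structures "outside" could be sketched like this. Everything here is hypothetical (demo_state, demo_state_size, demo_state_init, mode_enter, mode_leave); real codec2 state is much larger and differently shaped. The point is only that the library reports its needs and works on memory the caller owns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical codec state, for illustration only. */
typedef struct {
    float prev_frame[80];
    int   frame_count;
} demo_state;

/* The library reports how much memory it needs instead of allocating. */
static size_t demo_state_size(void)
{
    return sizeof(demo_state);
}

/* Initialise the state inside memory owned by the caller. */
static demo_state *demo_state_init(void *mem, size_t mem_size)
{
    if (mem == NULL || mem_size < sizeof(demo_state)) {
        return NULL;
    }
    demo_state *st = mem;
    memset(st, 0, sizeof *st);
    return st;
}

static void demo_process_frame(demo_state *st)
{
    assert(st != NULL);
    st->frame_count++;
}

/* Caller side: claim the memory when the FreeDV mode starts... */
static demo_state *mode_enter(void)
{
    void *mem = malloc(demo_state_size());
    demo_state *st = demo_state_init(mem, demo_state_size());
    if (st == NULL) {
        free(mem);
    }
    return st;
}

/* ...and hand it back so other operational modes can reuse it. */
static void mode_leave(demo_state *st)
{
    free(st);
}
```

A bare-metal user without a heap could pass a static buffer to demo_state_init instead, so both camps are served by the same library code.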


To support that discussion I created a draft suggestion for the interface
(attached to this mail). It is right now defined using inline code for
the sake of simplicity. This may change later; I don't think there
should be any issue with that. It essentially contains 3 functions each for
complex FFT and real FFT (alloc, fft, free) and the necessary data
structures.
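For readers without the attachment, the shape of such an interface might look like the sketch below. This is not the attached draft: all names (c2_cpx, c2_fft_cfg, c2_fft_alloc, c2_fft, c2_fft_free) are invented here, only the complex variant is shown, and the backend is a naive stand-in that handles lengths 1, 2 and 4 only (where the twiddle factors are exactly +/-1 and +/-j) so that it compiles on a host without any FFT library. A real build would forward these calls to kiss_fft or to the ARM DSP library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical complex sample type. */
typedef struct { float r, i; } c2_cpx;

/* Hypothetical config object; callers just pass it around. */
typedef struct {
    int nfft;     /* transform length */
    int inverse;  /* 0 = forward, 1 = inverse (unscaled) */
} c2_fft_cfg;

/* exp(-j*2*pi*m/4) for m = 0..3, exact in floating point. */
static const c2_cpx QUARTER_TURNS[4] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };

/* alloc: returns NULL for lengths the stand-in backend cannot handle. */
static inline c2_fft_cfg *c2_fft_alloc(int nfft, int inverse)
{
    if (nfft != 1 && nfft != 2 && nfft != 4) {
        return NULL;
    }
    c2_fft_cfg *cfg = malloc(sizeof *cfg);
    if (cfg != NULL) {
        cfg->nfft = nfft;
        cfg->inverse = inverse;
    }
    return cfg;
}

/* fft: out-of-place transform, computed here as a plain DFT. */
static inline void c2_fft(const c2_fft_cfg *cfg, const c2_cpx *in, c2_cpx *out)
{
    assert(cfg != NULL && in != out);
    const int step = 4 / cfg->nfft;
    for (int k = 0; k < cfg->nfft; k++) {
        c2_cpx acc = {0.0f, 0.0f};
        for (int n = 0; n < cfg->nfft; n++) {
            int idx = (k * n * step) % 4;
            if (cfg->inverse) {
                idx = (4 - idx) % 4;  /* conjugate twiddle */
            }
            const c2_cpx w = QUARTER_TURNS[idx];
            acc.r += in[n].r * w.r - in[n].i * w.i;
            acc.i += in[n].r * w.i + in[n].i * w.r;
        }
        out[k] = acc;
    }
}

/* free: releases everything alloc created. */
static inline void c2_fft_free(c2_fft_cfg *cfg)
{
    free(cfg);
}
```

Because the codec only sees alloc, fft and free, swapping the backend, or later adding an in-place variant, does not touch the calling code.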


Danilo




On 16.09.2016 at 20:59, Dana Myers wrote:



Subject:  Re: [Freetel-codec2] more benching and thoughts
Date:     Fri, 16 Sep 2016 12:13:05 +1000
From:     glen english
Reply-To: freetel-codec2@lists.sourceforge.net
To:       freetel-codec2@lists.sourceforge.net



Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.


I'm similar, though now that we have 32kB, 64kB (or even more) RAM
in embedded chips, they're basically like the systems that malloc()
was initially built on :-)

I still don't trust C++ heap allocators in embedded applications, though.


However, because the heap usage would be deterministic, it should be
fairly safe.
I have a similar project; a 1200 baud modem + TNC stack built in a
PSoC 5LP. I use the Delta-Sigma ADC @ 9600s/s, 16-bits, the PSoC Digital
Filter Block to do bandpass heavy-lifting and frequency response
correction for the ADC. I use CMSIS-DSP for the rest of the DSP crunching
required, and, wait for it, wrap the whole thing in FreeRTOS (9.0.0 now).
I use q31_t for all the DSP, as long as I'm careful to avoid blowing out

Re: [Freetel-codec2] cascaded ulaw, Alaw and,AMBE etc

2016-09-17 Thread Stuart Longland

On 17/09/16 18:28, glen english wrote:
> Yeah.
> 
> Cascading codecs is always trouble.
> 
> BTW My understand of the word TRANSCODING is going between one encoded 
> method and another without going back all the way to uncompressed. Going 
> between different video encoding methods usually done by transcoding. 
> Most video encoding is all DCT / macroblock based so is probably more 
> relationship between codecs than speech codec variations
> 
> 
> In this problematic voip case:  uncoded PCM (microphone)  >> AMBE2 >> 
> over air >> decoded >> PCM >> uLaw encode>> ulaw decode >> headset. (and 
> reverse) .

Transcoding is going from one format to another, and when the
destination is a lossy format, that will mean a reduction in quality.

Decoding to uncompressed shouldn't incur loss compared with a
hypothetical conversion from AMBE2 direct to µ-law; more likely the
assumptions made by the µ-law CODEC don't hold true for the synthesized
voice from the AMBE2 CODEC, and that would be why it sounds so terrible.
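To make the lossy step concrete, here is a small µ-law round trip in the style of the widely circulated G.711 segment encoders (8 bits: sign, 3-bit segment, 4-bit mantissa, bias of 0x84). It is a sketch for experimenting on a PC, the function names ulaw_encode and ulaw_decode are my own, and it says nothing about what AMBE2 output actually looks like.

```c
#include <assert.h>

#define ULAW_BIAS 0x84   /* 132, added before the segment search */
#define ULAW_CLIP 32635  /* keeps sample + bias inside 15 bits */

/* Encode one 16-bit linear PCM sample to an 8-bit mu-law byte. */
static unsigned char ulaw_encode(int sample)
{
    assert(sample >= -32768 && sample <= 32767);
    int sign = 0;
    if (sample < 0) {
        sign = 0x80;
        sample = -sample;
    }
    if (sample > ULAW_CLIP) {
        sample = ULAW_CLIP;
    }
    sample += ULAW_BIAS;
    /* Segment = position of the highest set bit, counted from bit 7. */
    int exponent = 7;
    for (int mask = 0x4000; (sample & mask) == 0 && exponent > 0; mask >>= 1) {
        exponent--;
    }
    int mantissa = (sample >> (exponent + 3)) & 0x0F;
    return (unsigned char)~(sign | (exponent << 4) | mantissa);
}

/* Decode a mu-law byte to linear PCM (midpoint of its quantizer step). */
static int ulaw_decode(unsigned char code)
{
    code = (unsigned char)~code;
    int exponent = (code >> 4) & 0x07;
    int mantissa = code & 0x0F;
    int sample = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS;
    return (code & 0x80) ? -sample : sample;
}
```

With this sketch a sample of 1000 comes back as 988, while 20000 comes back as 19836: the quantizer step grows with amplitude, which is fine for natural speech statistics but is exactly the kind of built-in assumption that a synthesized signal may not satisfy.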

Regards,
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.


Re: [Freetel-codec2] cascaded ulaw, Alaw and,AMBE etc

2016-09-17 Thread glen english

Yeah.

Cascading codecs is always trouble.

BTW, my understanding of the word TRANSCODING is going between one encoded
method and another without going back all the way to uncompressed. Going
between different video encoding methods is usually done by transcoding.
Most video encoding is DCT / macroblock based, so there is probably a closer
relationship between video codecs than between speech codec variations.

In this problematic VoIP case: uncoded PCM (microphone) >> AMBE2 >>
over air >> decoded >> PCM >> uLaw encode >> uLaw decode >> headset (and
reverse).

regards

On 17/09/2016 3:46 PM, David Rowe wrote:
> It's a good question Glen, off the top of my head I'm not sure.  When
> one or more codecs are combined it's called transcoding and IIRC often
> causes problems.
>
> Alaw/mulaw are rather non-linear operations.  That could upset the
> parameter estimation algorithms.
>
> - David
>
> On 17/09/16 11:38, glen english wrote:
>> David
>> you are the man to ask this one
>>
>> Why do (or why do you think) relatively benign companding algorithms
>> like Alaw, uLaw,  that are commonly used for encoding for VOIP links for
>> radio systems (and other) , sounds so awful when they pass AMBE/ AMBE2
>> etc processed speech ?
>>
>> Must be something to do we the re distribution of the quantizating noise
>> characteristics?
>>
>> Not sure what the effect is yet with codec2, but that would be a cinch
>> to test up.
>>
>>
>> g


