RE: DSP SBC encoder update

2008-07-10 Thread Simon Pickering
 
 I should add that running DSP tasks will move the CPU frequency to 330MHz,
 so this is probably not the answer to everyone's prayers with regard to
 freeing the CPU to do Xvid decoding or the like. There is a kernel patch to
 not force the CPU to 330MHz (the DSP runs slower) and I'll do some testing
 to see if the DSP task can run in real-time at the lower DSP clock speed.
 Then it will be significantly more useful. 

Right, I've tested the SBC encoder task with the ARM running at 400MHz (and
therefore the DSP running at 133MHz, rather than its top speed of 220MHz
with the ARM running at 330MHz). Thanks qwerty for the link to the patch.
Anyway the task runs and plays music, but there are far too many drop-outs
and the sound gets progressively deeper on the run up to each dropout (due
to the encoder being too slow). So it certainly needs more optimisation
before it could be considered for this role.

 The change which has allowed it to encode an entire song rather than just a
 few seconds was to move the input and output buffers from SDRAM (OMAP main
 memory) to SRAM (DSP fast single access memory). There are probably other
 things which would benefit from being moved, the sbc-priv data (or parts
 thereof) for one. This structure is pretty big so I allocated it in SDRAM,
 but at least parts of it might be better off in faster local memory. This is
 something to look at.

I looked at this yesterday evening (thanks to derf, crashanddie, and others
for answering my C questions), trying to move some parts of the priv
structure to SARAM (sorry for the SRAM typo above). Unfortunately just
moving the bare minimum (the X array) won't happen as there's not enough
SARAM (so dsp_dld tells me). I don't know where it's all gone, anyone have
any ideas?

I currently have a fast_in[] array in SARAM to which I copy part of the data
from the slow (SDRAM) X[] array in the sbc_analyze_eight/four() fns before
it's used in the _sbc_analyze_eight/four() fns. These two fns are inlined,
so this memcpy is performed in every loop through the code (called something
like 150,000 times in total for my test file iirc). I'm not sure if the
faster manipulation of the data makes up for the copy overhead (it is a
faster 32bit copy at least). No clocks available, so I'll try removing this
optimisation and testing what it sounds like.
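
A minimal sketch of the staging pattern described above, for readers who want to see the shape of it. It assumes an 80-word window (the fast_in[] size quoted elsewhere in this thread) and uses a hypothetical analyze_window() in place of the real _sbc_analyze_* functions. On the DSP, fast_in[] would be placed in SARAM with a section pragma; here it is an ordinary array so the code builds on any host.

```c
#include <stdint.h>
#include <string.h>

#define STAGE_LEN 80  /* 80 x 32-bit words, the fast_in[] size quoted in this thread */

/* On the DSP this buffer would live in SARAM; here it is a plain static array. */
static int32_t fast_in[STAGE_LEN];

/* Hypothetical stand-in for the analysis kernel: it just sums the window. */
static int64_t analyze_window(const int32_t *in, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += in[i];
    return acc;
}

/* Copy one window out of the slow array, then do all the work on the fast copy. */
int64_t analyze_staged(const int32_t *slow_x, size_t offset)
{
    memcpy(fast_in, slow_x + offset, sizeof fast_in);
    return analyze_window(fast_in, STAGE_LEN);
}
```

Whether the copy pays for itself depends on how many times the kernel touches each staged word, which is exactly what is hard to measure without a clock.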

More importantly, if the whole X array could be placed in SARAM, there'd be
no need for my memcpy anyway and I'd have the benefits of faster access. I'm
not too sure how to analyse the code to work out how much data is allocated
in SARAM (to work out if I'm close to fitting it or have no chance).

Talking about SARAM, the input and output buffers (which the dsp task uses
for bulk transfers) are in SARAM, this is what I changed to make the task
play in real-time so this obviously makes a difference. It would be good if
I could avoid having to copy from the input buffer into one of the priv
structure arrays (which holds the PCM data). This is probably not really a
big saving compared to optimising the main loop as the read fn is not called
all that often (~5000 times for my test file), but every little helps and
obviously did before. The input array is currently read into a 2D array, I
need to check and see the array dimensions and whether I could write the
data into it directly (and place it in SARAM rather than the input array).
The output array has data packed into it, so I'm not sure I'll get any
savings from fiddling with this.

There may yet be other little bits of code which would benefit from being
moved to faster memory (or intrinsic-ised), it's just a bit hard to quantify
the memcpy slowdown vs. any possible memory access speedup gains without any
way of timing individual parts of the code :(

I'm currently revisiting my attempt to re-write the inner loop to use lots
of DSP intrinsics and the like in the hope that this will provide some sort
of speed up. Again to be tested with the mk1 ear ;)

Anyway, that's about where I am. If anyone wants to take a look at the code
and suggest possible locations for optimisations I'm all ears :)

Thanks for reading,

Cheers,


Simon

___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: DSP SBC encoder update

2008-07-10 Thread Siarhei Siamashka
On Thu, Jul 10, 2008 at 2:06 PM, Simon Pickering
[EMAIL PROTECTED] wrote:
 The change which has allowed it to encode an entire song rather than just a
 few seconds was to move the input and output buffers from SDRAM (OMAP main
 memory) to SRAM (DSP fast single access memory). There are probably other
 things which would benefit from being moved, the sbc-priv data (or parts
 thereof) for one. This structure is pretty big so I allocated it in SDRAM,
 but at least parts of it might be better off in faster local memory. This is
 something to look at.

 I looked at this yesterday evening (thanks to derf, crashanddie, and others
 for answering my C questions), trying to move some parts of the priv
 structure to SARAM (sorry for the SRAM typo above). Unfortunately just
 moving the bare minimum (the X array) won't happen as there's not enough
 SARAM (so dsp_dld tells me). I don't know where it's all gone, anyone have
 any ideas?

Do you use any buffers allocated by malloc? My guess is that malloc
does allocation of DARAM and SARAM memory.
In any case, memory returned by malloc should be not worse than the
memory buffer explicitly statically placed to EXTMEM.


RE: DSP SBC encoder update

2008-07-10 Thread Simon Pickering
  I looked at this yesterday evening (thanks to derf, crashanddie, and others
  for answering my C questions), trying to move some parts of the priv
  structure to SARAM (sorry for the SRAM typo above). Unfortunately just
  moving the bare minimum (the X array) won't happen as there's not enough
  SARAM (so dsp_dld tells me). I don't know where it's all gone, anyone have
  any ideas?
 
 Do you use any buffers allocated by malloc? My guess is that malloc
 does allocation of DARAM and SARAM memory.
 In any case, memory returned by malloc should be not worse than the
 memory buffer explicitly statically placed to EXTMEM.

Yes, I think you're right, in the avs_kernelcfg.cmd file it talks about a
DARAM_heap and a SARAM_heap, presumably it's possible to allocate from
either somehow (using the CSL MEM_* calls probably, I don't know off hand
which heap is used for task data, but will have a look this evening). It
also talks about a/the stack being in SARAM.

To answer the question, only if the thing to be malloc'd is small. In this
case it's only a couple of structures (and they are large), so I've manually
created them in EXTMEM2. I know this is not ideal, but they won't fit in
SARAM.

Over lunch I had a play with the things I talked about in my last email.
Removing the memcpy (from the slow SDRAM X[] array to the fast SARAM
fast_in[] array) made the code marginally slower - at least there were more
drop outs, so it appears that the memcpy() overhead is less than the extra
time needed to access the data in SDRAM.

I shaved a few array elements off the output[] SARAM array (down from 100 to
78 elements, this fits the current Bluez encoder parameters, but if they
were changed upwards, both the input[] and output[] arrays would probably
need to be made bigger). I also removed the #PRAGMAs I had been using to
place the const data from sbc_tables.h in SARAM as, from looking at the
avs_kernelcfg.cmd file, .const data is already placed in SARAM (SARAM_DATA
section) and I thought this might free up some room to fit the X[] array in
SARAM directly. It didn't. Moving the const tables freed 72x32bits, removing
the fast_in[] array (not needed if X[] itself is fast) freed 80x32bits, but
the X[] array requires 2x160x32bits. It still doesn't fit :(
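
The arithmetic behind "it still doesn't fit" can be written out as a small sketch. The three figures come from the message above. The function names are made up for illustration, and the calculation assumes SARAM had no other free space before the two savings.

```c
#include <stdint.h>

/* Sizes in 32-bit words, taken from the figures quoted in the message. */
enum {
    FREED_CONST_WORDS   = 72,      /* const tables no longer duplicated */
    FREED_FAST_IN_WORDS = 80,      /* fast_in[] removed */
    X_WORDS             = 2 * 160  /* the X[] array */
};

/* Words still missing after both savings (0 if X[] would fit). */
int32_t saram_shortfall_words(void)
{
    int32_t freed = FREED_CONST_WORDS + FREED_FAST_IN_WORDS;
    return X_WORDS > freed ? X_WORDS - freed : 0;
}

/* Size of X[] in 8-bit bytes. */
uint32_t x_size_bytes(void)
{
    return X_WORDS * 4u;
}
```

Under those assumptions the two savings free 152 words against the 320 that X[] needs, so 168 words are still missing.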

Brad (Midgley) was talking earlier about implementing zero-copy; this would
be good as then at least some of the data could be left in the SARAM
input/output arrays (both faster because these are SARAM and because it
doesn't need a copy). I'll have a look at this.

So onto trying to optimise the inner loop...

Cheers,


Simon



RE: DSP SBC encoder update

2008-07-10 Thread Simon Pickering
 
  Do you use any buffers allocated by malloc? My guess is that malloc
  does allocation of DARAM and SARAM memory.
  In any case, memory returned by malloc should be not worse than the
  memory buffer explicitly statically placed to EXTMEM.
 
  Yes, I think you're right, in the avs_kernelcfg.cmd file it talks about a
  DARAM_heap and a SARAM_heap, presumably it's possible to allocate from
  either somehow (using the CSL MEM_* calls probably, I don't know off hand
  which heap is used for task data, but will have a look this evening). It
  also talks about a/the stack being in SARAM.
 
 I'm sorry if it was not clear enough. Just use normal malloc from C library
 without any CSL_MEM_* stuff. You can add some debugging prints for the addresses
 of allocated blocks and identify what kind of memory they are actually in (DARAM,
 SARAM, SDRAM). By the way, this information is especially important if you want
 to use DMA, as you need to specifically configure the type of memory (not just
 address) when setting up DMA transfer.
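
One way to act on the debugging-print suggestion is to compare each allocated address against the memory map. The sketch below only shows the shape of that check. The address ranges are PLACEHOLDERS and not the real OMAP layout, so take the actual boundaries from the linker command file. The type and function names are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { MEM_UNKNOWN, MEM_DARAM, MEM_SARAM, MEM_SDRAM } mem_kind;

typedef struct {
    mem_kind kind;
    uint32_t start;  /* first address in the region */
    uint32_t end;    /* one past the last address */
} mem_region;

/* PLACEHOLDER map for illustration only; substitute the real ranges. */
static const mem_region regions[] = {
    { MEM_DARAM, 0x000000u, 0x010000u },
    { MEM_SARAM, 0x010000u, 0x028000u },
    { MEM_SDRAM, 0x028000u, 0x800000u },
};

/* Return the kind of memory an address falls in, or MEM_UNKNOWN. */
mem_kind classify_addr(uint32_t addr)
{
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++)
        if (addr >= regions[i].start && addr < regions[i].end)
            return regions[i].kind;
    return MEM_UNKNOWN;
}
```

In a task, the pointer returned by malloc would be cast to an integer, passed to classify_addr(), and the result printed alongside the raw address.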

No, I understood, I was just mentioning that there appear to be two
heaps to choose from - presumably one is used by the DSP tasks (malloc is
probably #defined as one of the CSL MEM* fns in the DSP Gateway task
functions).

  Over lunch I had a play with the things I talked about in my last email.
  Removing the memcpy (from the slow SDRAM X[] array to the fast SARAM
  fast_in[] array) made the code marginally slower - at least there were more
  drop outs, so it appears that the memcpy() overhead is less than the extra
  time needed to access the data in SDRAM.
 
 Yes, accessing SDRAM memory is extremely slow. And if you access SDRAM
 memory using 16-bit accesses instead of 32-bit accesses, the overhead
 doubles. So if your data processing algorithm does not deal exclusively
 with 32-bit data accesses, you are better not to run it to process data
 in SDRAM memory. Copying data to a temporary buffer in DARAM or
 SARAM, processing it there and copying results back to SDRAM would be
 faster in this case.

The X[] array data type is an int32, so even accessing 32bit from SDRAM
is still slower than using a local buffer (depending on what you need to
do with it of course).

  I shaved a few array elements off the output[] SARAM array (down from 100 to
  78 elements, this fits the current Bluez encoder parameters, but if they
  were changed upwards, both the input[] and output[] arrays would probably
  need to be made bigger). I also removed the #PRAGMAs I had been using to
  place the const data from sbc_tables.h in SARAM as, from looking at the
  avs_kernelcfg.cmd file, .const data is already placed in SARAM (SARAM_DATA
  section) and I thought this might free up some room to fit the X[] array in
  SARAM directly. It didn't. Moving the const tables freed 72x32bits, removing
  the fast_in[] array (not needed if X[] itself is fast) freed 80x32bits, but
  the X[] array requires 2x160x32bits. It still doesn't fit :(
 
 2x160x32 bits is only 1280 bytes, which is hardly too big. Try to allocate
 buffers with malloc and copy constant tables there on initialization.

I know there's not much free SARAM memory (note that the DSP Gateway
kernel also appears to hold the majority of its data in the DSP internal
memory). But parts of the SARAM and DARAM are reserved for the stack and
heaps, which may well have free space on them. It appears that 0xc00
(8bit) bytes are reserved for the stack, the DARAM heap is 0xbf40 bytes
in size and the SARAM heap is 0xd000 bytes long. So this may mean that
there's enough space in the SARAM if I use malloc to get the memory from
the heap. I had thought that the heap would be fairly small so I was
#PRAGMAing my large structures to be placed in SDRAM; this was probably
the wrong thing to do.
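
A rough sketch of the malloc-based approach suggested in the quoted reply: allocate the working array from the heap and copy a constant table into heap memory once at initialisation. The table contents and all names are hypothetical stand-ins for what is in sbc_tables.h and the priv structure. Which heap malloc actually draws from on the DSP is not confirmed here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one of the const tables in sbc_tables.h. */
static const int32_t proto_table[4] = { 1, 2, 3, 4 };

typedef struct {
    int32_t *X;      /* 2 x 160 working samples, zero-initialised */
    int32_t *table;  /* heap copy of the const table */
} fast_state;

/* Allocate both buffers; returns 0 on success, -1 if either allocation fails. */
int fast_state_init(fast_state *s)
{
    s->X = calloc(2 * 160, sizeof *s->X);
    s->table = malloc(sizeof proto_table);
    if (s->X == NULL || s->table == NULL) {
        free(s->X);
        free(s->table);
        s->X = NULL;
        s->table = NULL;
        return -1;
    }
    memcpy(s->table, proto_table, sizeof proto_table);
    return 0;
}

/* Release both buffers and clear the pointers. */
void fast_state_free(fast_state *s)
{
    free(s->X);
    free(s->table);
    s->X = NULL;
    s->table = NULL;
}
```

At 1280 bytes the X[] array is small next to the heap sizes quoted above, so the allocation itself should not be the problem; the open question is which memory the returned block lands in.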

I'll test and report back.

Cheers,


Simon



Re: DSP SBC encoder update

2008-07-10 Thread Siarhei Siamashka
On Thu, Jul 10, 2008 at 6:57 PM, Simon Pickering
[EMAIL PROTECTED] wrote:
 No, I understood, I was just mentioning that there appear to be two
 heaps to choose from - presumably one is used by the DSP tasks (malloc is
 probably #defined as one of the CSL MEM* fns in the DSP Gateway task
 functions).

Maybe it is just clever/stupid enough to do the allocation
automatically. At least when I did some experiments with DSP before,
it was allocating DARAM memory. Surely, you might want to have better
control to put most performance critical data into DARAM, but malloc
is a standard C function and is more portable.

 Yes, accessing SDRAM memory is extremely slow. And if you access SDRAM
 memory using 16-bit accesses instead of 32-bit accesses, the overhead
 doubles. So if your data processing algorithm does not deal exclusively
 with 32-bit data accesses, you are better not to run it to process data
 in SDRAM memory. Copying data to a temporary buffer in DARAM or
 SARAM, processing it there and copying results back to SDRAM would be
 faster in this case.

 The X[] array data type is an int32, so even accessing 32bit from SDRAM
 is still slower than using a local buffer (depending on what you need to
 do with it of course).

It depends on how many times the data is accessed. For example, if you
have some algorithm that accesses this memory location 10 times, you
would have 2 SDRAM + 10 SRAM memory accesses by using
fetch/process/store pattern vs. just 10 SDRAM memory accesses if
working with this buffer directly in SDRAM. As SDRAM is an order of
magnitude slower (decimal order, not binary), you really want to avoid
dealing with SDRAM as much as possible.
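
The access-count argument above can be put into a toy cost model. The costs are arbitrary units rather than measured cycle counts, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Cost of touching one word `uses` times directly in slow memory. */
uint32_t cost_direct(uint32_t uses, uint32_t slow_cost)
{
    return uses * slow_cost;
}

/* Cost of fetching the word once, working on it in fast memory, and storing it back. */
uint32_t cost_staged(uint32_t uses, uint32_t slow_cost, uint32_t fast_cost)
{
    return 2u * slow_cost + uses * fast_cost;  /* one fetch + one store to slow memory */
}
```

With slow memory at 10 units and fast memory at 1, ten uses cost 100 units in place against 30 when staged. A word that is touched only once costs 10 in place against 21 staged, so staging only wins when the data is reused.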


DSP SBC encoder update

2008-07-07 Thread Simon Pickering
Hi all,

I'm happy to say I've got the DSP task working for more than 4s now, in fact
it even runs all the way to the end of the song, as expected ;).

You can download version 1.0.0 from here:
https://garage.maemo.org/projects/dsp-sbc/. This is for Diablo only.

This consists of a tarball containing the DSP task and command file, a
tweaked Bluez-utils which can use said DSP task for SBC encoding (so it will
just work with mplayer and the like) and an installation script which writes
some config data about the new task to the DSP dynamic loader conf file and
then extracts the tarball, installs the deb and tells you to reboot.

[Note to would-be DSP hackers: rather than rebooting, you can just run
dsp_dld in the terminal to restart the loader daemon, but make sure you've
made a symlink from /lib/dsp/dsp_dld_avs.conf -> /lib/dsp/dsp_dld.conf as
this is where it expects to find the conf file.]

If you want to go back to software encoding, rename the sbcenc.o file (in
/lib/dsp/modules) and it will automatically fall back to the original
software method (it falls back whenever the DSP fails, and renaming the task
will cause it to fail). I've not checked to see if the fallback method is as
quick as the original code, I'd be interested to know though if anyone is
bored. I should add some logic using an env var or similar to switch method
- anyone have some example code I could use?
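
On the env var question, a minimal sketch of one way it could be done with getenv(). The variable name SBC_ENCODER and the values "sw" and "software" are invented for this example and are not anything Bluez-utils currently reads. The DSP stays the default when the variable is unset or has any other value.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { SBC_USE_DSP, SBC_USE_SOFTWARE } sbc_method;

/* value is what getenv() returned: NULL when the variable is unset. */
sbc_method sbc_method_from_value(const char *value)
{
    if (value != NULL &&
        (strcmp(value, "sw") == 0 || strcmp(value, "software") == 0))
        return SBC_USE_SOFTWARE;
    return SBC_USE_DSP;
}

/* Read the (hypothetical) SBC_ENCODER variable and pick the method. */
sbc_method sbc_choose_method(void)
{
    return sbc_method_from_value(getenv("SBC_ENCODER"));
}
```

Running something like "SBC_ENCODER=sw mplayer file.mp3" would then select the software path without renaming the task file.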

You still need to enable a2dp with either johnx's a2dp deb which can be
found here: http://www.internettablettalk.com/forums/showthread.php?t=13468
or manually (use the deb, far easier).

I should add that running DSP tasks will move the CPU frequency to 330MHz,
so this is probably not the answer to everyone's prayers with regard to
freeing the CPU to do Xvid decoding or the like. There is a kernel patch to
not force the CPU to 330MHz (the DSP runs slower) and I'll do some testing
to see if the DSP task can run in real-time at the lower DSP clock speed.
Then it will be significantly more useful. In the meantime, it may or may
not use less power this way, please let me know if you do any testing.



Next bit is for those interested in the gory details:

This is pretty much the same code I had running a week ago or thereabouts,
and it was only encoding ~4s of audio in real-time (using bulk transfers and
ioctls for sync). I tested the SW encoder and it would encode a test file
more slowly than the DSP method but would output more seconds worth of audio
when testing with mplayer, which made me wonder if the DSP was just cursed
(or perhaps something to do with the CPU speed being set to 330MHz when the
DSP is running...). The released code is from my mk2 branch:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/mk2/?root=dsp-sbc

The change which has allowed it to encode an entire song rather than just a
few seconds was to move the input and output buffers from SDRAM (OMAP main
memory) to SRAM (DSP fast single access memory). There are probably other
things which would benefit from being moved, the sbc-priv data (or parts
thereof) for one. This structure is pretty big so I allocated it in SDRAM,
but at least parts of it might be better off in faster local memory. This is
something to look at.

I tested the speed of the bulk transfers (29s au file, took ~20s to encode
with the DSP and ~9s to just transfer the data), which are pretty slow as
you can see. I then decided to convert the task to use shared memory and
some polling and sleeping to synchronise (mk4 branch:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/mk4/?root=dsp-sbc). The
mk4 code takes absolutely forever to run though, the same test file which
takes ~20s with the bulk transfer method (mk2) takes ~45s using shared
memory. Unfortunately there appear to be no clocks available in the DSP
kernel (which makes benchmarking code quite tricky) and also means you can't
sleep() between polling memory.

So the DSP task sits in a tight polling loop (bad!) and the ARM sleeps for
1us and then polls the shared memory. Anyway, there's something not right
and I'm not sure what it might be (the DSP manages ~650 loops before the ARM
presents it with input data), the DSP then processes and the ARM sleeps for
1 loop (1us) before the DSP gives it back the encoded data, and so on. This
is not a good method for the task to use, but I am interested to know why
it's so slow, so may do some more work on it eventually.
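
For reference, the polling handshake described above can be modelled as a three-state mailbox. This is a single-address-space sketch and not the mk4 code. All names are hypothetical and the "encode" step is a placeholder. Real ARM/DSP shared memory would also need whatever cache and ordering handling the platform requires, which is not shown.

```c
#include <stdint.h>
#include <string.h>

#define MB_CAPACITY 64

enum { MB_IDLE = 0, MB_INPUT_READY = 1, MB_OUTPUT_READY = 2 };

typedef struct {
    volatile uint16_t state;    /* which side owns the buffer right now */
    uint16_t len;               /* number of valid samples in data[] */
    int16_t data[MB_CAPACITY];
} mailbox;

/* ARM side: hand a block to the DSP. Returns 0, or -1 if busy or too large. */
int arm_post(mailbox *mb, const int16_t *src, uint16_t n)
{
    if (mb->state != MB_IDLE || n > MB_CAPACITY)
        return -1;
    memcpy(mb->data, src, n * sizeof src[0]);
    mb->len = n;
    mb->state = MB_INPUT_READY;  /* publish the flag last */
    return 0;
}

/* DSP side: one pass of the polling loop. Returns 1 if a block was processed. */
int dsp_poll(mailbox *mb)
{
    if (mb->state != MB_INPUT_READY)
        return 0;
    for (uint16_t i = 0; i < mb->len; i++)
        mb->data[i] = (int16_t)(mb->data[i] / 2);  /* placeholder for the encoder */
    mb->state = MB_OUTPUT_READY;
    return 1;
}

/* ARM side: collect the result. Returns the sample count, or -1 if not ready. */
int arm_collect(mailbox *mb, int16_t *dst)
{
    if (mb->state != MB_OUTPUT_READY)
        return -1;
    memcpy(dst, mb->data, mb->len * sizeof dst[0]);
    int n = mb->len;
    mb->state = MB_IDLE;
    return n;
}
```

In this model every block costs at least one full poll interval on each side, so the sleep granularity on the ARM would be one place to look when hunting for the slowdown.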

Talking about a lack of clocks, the mk3 branch was my attempt to rewrite the
sbc conversion fns using DSP intrinsics, dual MACs, and the like. It doesn't
produce the correct output data (probably some issue with my Q15 arithmetic,
this was only the first hack at the code) but also didn't improve the speed
of the code (and with no clock fns it's hard to tell where the bottleneck
is) so I'm leaving it alone for the time being.
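
Since Q15 arithmetic is the suspected culprit, here is a portable reference multiply that intrinsic-based code could be checked against. It is a generic rounding, saturating Q15 multiply and not taken from the mk3 branch. The exact rounding and saturation behaviour of the DSP intrinsics may differ and should be checked against the compiler documentation.

```c
#include <stdint.h>

/* Multiply two Q15 values (range -1.0 .. just under 1.0) with rounding and saturation. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */

    /* Round to nearest and scale back to Q15. Right-shifting a negative value is
       implementation-defined in C; this assumes an arithmetic shift. */
    p = (p + 0x4000) >> 15;

    if (p > INT16_MAX) p = INT16_MAX;     /* only -1.0 * -1.0 reaches this */
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}
```

Running the same inputs through this and through the intrinsic version, then comparing outputs, is one way to narrow down where the results start to diverge.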

Last but not least, even when running at 165MHz (or whatever the
conservative governor produces) the sw fall back code doesn't produce any
error messages (when 

Re: DSP SBC encoder update

2008-07-07 Thread Faheem Pervez
Hi,

Glad you got it working, will grab a copy when I go home :)

Just some notes:
I found that patch I mentioned:
http://pastebin.com/m34ed3cd3

It supposedly adds a sysfs interface for modding the dsp speed. If that
fails, a modification to n800-dvfs.c will work :) (I think I've got a kernel
stashed away with DSP/CPU at 133/400.)

Using this snippet of code in the installer sh will automatically relaunch
the script as root if you are not root:
http://www.internettablettalk.com/forums/showpost.php?p=122754&postcount=5


Cheers,
Faheem

