RE: DSP SBC encoder update
I should add that running DSP tasks will move the CPU frequency to 330MHz, so this is probably not the answer to everyone's prayers with regard to freeing the CPU to do Xvid decoding or the like. There is a kernel patch to not force the CPU to 330MHz (the DSP runs slower) and I'll do some testing to see if the DSP task can run in real-time at the lower DSP clock speed. Then it will be significantly more useful.

Right, I've tested the SBC encoder task with the ARM running at 400MHz (and therefore the DSP running at 133MHz, rather than its top speed of 220MHz with the ARM running at 330MHz). Thanks qwerty for the link to the patch. Anyway, the task runs and plays music, but there are far too many drop-outs and the sound gets progressively deeper on the run-up to each drop-out (due to the encoder being too slow). So it certainly needs more optimisation before it could be considered for this role.

The change which has allowed it to encode an entire song rather than just a few seconds was to move the input and output buffers from SDRAM (OMAP main memory) to SRAM (DSP fast single access memory). There are probably other things which would benefit from being moved, the sbc-priv data (or parts thereof) for one. This structure is pretty big so I allocated it in SDRAM, but at least parts of it might be better off in faster local memory. This is something to look at.

I looked at this yesterday evening (thanks to derf, crashanddie, and others for answering my C questions), trying to move some parts of the priv structure to SARAM (sorry for the SRAM typo above). Unfortunately even moving the bare minimum (the X array) won't happen as there's not enough SARAM (so dsp_dld tells me). I don't know where it's all gone, anyone have any ideas?

I currently have a fast_in[] array in SARAM to which I copy part of the data from the slow (SDRAM) X[] array in the sbc_analyze_eight/four() fns before it's used in the _sbc_analyze_eight/four() fns.
These two fns are inlined, so this memcpy is performed in every loop through the code (called something like 150,000 times in total for my test file iirc). I'm not sure if the faster manipulation of the data makes up for the copy overhead (it is a faster 32bit copy at least). No clocks available, so I'll try removing this optimisation and testing what it sounds like. More importantly, if the whole X array could be placed in SARAM, there'd be no need for my memcpy anyway and I'd have the benefits of faster access. I'm not too sure how to analyse the code to work out how much data is allocated in SARAM (to work out if I'm close to fitting it or have no chance). Talking about SARAM, the input and output buffers (which the dsp task uses for bulk transfers) are in SARAM, this is what I changed to make the task play in real-time so this obviously makes a difference. It would be good if I could avoid having to copy from the input buffer into one of the priv structure arrays (which holds the PCM data). This is probably not really a big saving compared to optimising the main loop as the read fn is not called all that often (~5000 times for my test file), but every little helps and obviously did before. The input array is currently read into a 2D array, I need to check and see the array dimensions and whether I could write the data into it directly (and place it in SARAM rather than the input array). The output array has data packed into it, so I'm not sure I'll get any savings from fiddling with this. There may yet be other little bits of code which would benefit from being moved to faster memory (or intrinsic-ised), it's just a bit hard to quantify the memcpy slowdown vs. any possible memory access speedup gains without any way of timing individual parts of the code :( I'm currently revisiting my attempt to re-write the inner loop to use lots of DSP intrinsics and the like in the hope that this will provide some sort of speed up. 
Again to be tested with the mk1 ear ;) Anyway, that's about where I am. If anyone wants to take a look at the code and suggest possible locations for optimisations I'm all ears :) Thanks for reading, Cheers, Simon ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
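The fast_in[] staging described in the message above can be sketched roughly as follows. This is a minimal illustration and not the actual sbc code: the window length is an assumed value, and `analyze_window()` with its coefficient table is a hypothetical stand-in for the real analysis functions. On the DSP the `fast_in` buffer would also have to be placed in SARAM by the toolchain; that step is left out so the sketch compiles anywhere.

```c
#include <stdint.h>
#include <string.h>

#define WINDOW 80  /* assumed window length, for illustration only */

/* Stand-in for the fast (SARAM) staging buffer; section placement omitted. */
static int32_t fast_in[WINDOW];

/* Hypothetical analysis step: multiply-accumulate over the staged window. */
static int64_t analyze_window(const int32_t *win, const int32_t *coef)
{
    int64_t acc = 0;
    int i;
    for (i = 0; i < WINDOW; i++)
        acc += (int64_t)win[i] * coef[i];
    return acc;
}

/* Copy one window out of the slow array, then work only on the fast copy. */
int64_t stage_and_analyze(const int32_t *slow_x, size_t offset,
                          const int32_t *coef)
{
    memcpy(fast_in, slow_x + offset, sizeof fast_in);
    return analyze_window(fast_in, coef);
}
```

The copy costs one pass over the window in slow memory; whether that pays off depends on how often the analysis step touches each element afterwards.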
Re: DSP SBC encoder update
On Thu, Jul 10, 2008 at 2:06 PM, Simon Pickering [EMAIL PROTECTED] wrote:

The change which has allowed it to encode an entire song rather than just a few seconds was to move the input and output buffers from SDRAM (OMAP main memory) to SRAM (DSP fast single access memory). There are probably other things which would benefit from being moved, the sbc-priv data (or parts thereof) for one. This structure is pretty big so I allocated it in SDRAM, but at least parts of it might be better off in faster local memory. This is something to look at.

I looked at this yesterday evening (thanks to derf, crashanddie, and others for answering my C questions), trying to move some parts of the priv structure to SARAM (sorry for the SRAM typo above). Unfortunately just moving the bare minimum (the X array) won't happen as there's not enough SARAM (so dsp_dld tells me). I don't know where it's all gone, anyone have any ideas?

Do you use any buffers allocated by malloc? My guess is that malloc allocates from DARAM and SARAM memory. In any case, memory returned by malloc should be no worse than a buffer explicitly and statically placed in EXTMEM.
RE: DSP SBC encoder update
I looked at this yesterday evening (thanks to derf, crashanddie, and others for answering my C questions), trying to move some parts of the priv structure to SARAM (sorry for the SRAM typo above). Unfortunately just moving the bare minimum (the X array) won't happen as there's not enough SARAM (so dsp_dld tells me). I don't know where it's all gone, anyone have any ideas? Do you use any buffers allocated by malloc? My guess is that malloc does allocation of DARAM and SARAM memory. In any case, memory returned by malloc should be not worse than the memory buffer explicitly statically placed to EXTMEM. Yes, I think you're right, in the avs_kernelcfg.cmd file it talks about a DARAM_heap and a SARAM_heap, presumably it's possible to allocate from either somehow (using the CSL MEM_* calls probably, I don't know off hand which heap is used for task data, but will have a look this evening). It also talks about a/the stack being in SARAM. To answer the question, only if the thing to be malloc'd is small. In this case it's only a couple of structures (and they are large), so I've manually created them in EXTMEM2. I know this is not ideal, but they won't fit in SARAM. Over lunch I had a play with the things I talked about in my last email. Removing the memcpy (from the slow SDRAM X[] array to the fast SARAM fast_in[] array) made the code marginally slower - at least there were more drop outs, so it appears that the memcpy() overhead is less than the extra time needed to access the data in SDRAM. I shaved a few array elements off the output[] SARAM array (down from 100 to 78 elements, this fits the current Bluez encoder parameters, but if they were changed upwards, both the input[] and output[] arrays would probably need to be made bigger). 
I also removed the #PRAGMAs I had been using to place the const data from sbc_tables.h in SARAM, as from looking at the avs_kernelcfg.cmd file, .const data is already placed in SARAM (SARAM_DATA section), and I thought this might free up some room to fit the X[] array in SARAM directly. It didn't. Moving the const tables freed 72x32bits, removing the fast_in[] array (not needed if X[] itself is fast) freed 80x32bits, but the X[] array requires 2x160x32bits. It still doesn't fit :(

Brad (Midgley) was talking earlier about implementing zero-copy; this would be good as then at least some of the data could be left in the SARAM input/output arrays (faster both because these are in SARAM and because no copy is needed). I'll have a look at this.

So onto trying to optimise the inner loop...

Cheers,
Simon
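The arithmetic behind "it still doesn't fit" can be checked with a few lines of C. The word counts are the ones given in the message above; the function names are made up for this illustration. It only compares what was freed against what X[] needs, since the amount of SARAM that was free beforehand is not stated.

```c
/* Sizes in 32-bit words, taken from the figures quoted in the message. */
enum {
    FREED_CONST_TABLES = 72,      /* const tables moved out           */
    FREED_FAST_IN      = 80,      /* fast_in[] staging array removed  */
    NEEDED_X           = 2 * 160  /* X[] array: 2 x 160 words         */
};

/* Words still missing after the two savings (0 if X[] would fit). */
int saram_shortfall_words(void)
{
    int freed = FREED_CONST_TABLES + FREED_FAST_IN;
    return NEEDED_X > freed ? NEEDED_X - freed : 0;
}

/* Same shortfall expressed in 8-bit bytes. */
int saram_shortfall_bytes(void)
{
    return saram_shortfall_words() * 4;
}
```

With these figures the two savings cover 152 of the 320 words, leaving 168 words (672 bytes) to find elsewhere, assuming no other SARAM was free to begin with.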
RE: DSP SBC encoder update
Do you use any buffers allocated by malloc? My guess is that malloc does allocation of DARAM and SARAM memory. In any case, memory returned by malloc should be not worse than the memory buffer explicitly statically placed to EXTMEM.

Yes, I think you're right, in the avs_kernelcfg.cmd file it talks about a DARAM_heap and a SARAM_heap, presumably it's possible to allocate from either somehow (using the CSL MEM_* calls probably, I don't know off hand which heap is used for task data, but will have a look this evening). It also talks about a/the stack being in SARAM.

I'm sorry if it was not clear enough. Just use normal malloc from the C library without any CSL_MEM_* stuff. You can add some debugging prints for the addresses of allocated blocks and identify what kind of memory they are actually in (DARAM, SARAM, SDRAM). By the way, this information is especially important if you want to use DMA, as you need to specifically configure the type of memory (not just the address) when setting up a DMA transfer.

No, I understood, I was just mentioning that there appear to be two heaps to choose from - presumably one is used by the DSP tasks (malloc is probably #defined as one of the CSL MEM* fns in the DSP Gateway task functions).

Over lunch I had a play with the things I talked about in my last email. Removing the memcpy (from the slow SDRAM X[] array to the fast SARAM fast_in[] array) made the code marginally slower - at least there were more drop outs, so it appears that the memcpy() overhead is less than the extra time needed to access the data in SDRAM.

Yes, accessing SDRAM memory is extremely slow. And if you access SDRAM memory using 16-bit accesses instead of 32-bit accesses, the overhead doubles. So if your data processing algorithm does not deal exclusively with 32-bit data accesses, you are better off not running it on data in SDRAM memory. Copying data to a temporary buffer in DARAM or SARAM, processing it there and copying results back to SDRAM would be faster in this case.
The X[] array data type is an int32, so even accessing 32bit from SDRAM is still slower than using a local buffer (depending on what you need to do with it of course).

I shaved a few array elements off the output[] SARAM array (down from 100 to 78 elements, this fits the current Bluez encoder parameters, but if they were changed upwards, both the input[] and output[] arrays would probably need to be made bigger). I also removed the #PRAGMAs I had been using to place the const data from sbc_tables.h in SARAM, as from looking at the avs_kernelcfg.cmd file, .const data is already placed in SARAM (SARAM_DATA section), and I thought this might free up some room to fit the X[] array in SARAM directly. It didn't. Moving the const tables freed 72x32bits, removing the fast_in[] array (not needed if X[] itself is fast) freed 80x32bits, but the X[] array requires 2x160x32bits. It still doesn't fit :(

2x160x32 bits is only 1280 bytes, which is hardly too big. Try to allocate buffers with malloc and copy constant tables there on initialization.

I know there's not much free SARAM memory (note that the DSP Gateway kernel also appears to hold the majority of its data in the DSP internal memory). But parts of the SARAM and DARAM are reserved for the stack and heaps, which may well have free space in them. It appears that 0xc00 (8bit) bytes are reserved for the stack, the DARAM heap is 0xbf40 bytes in size and the SARAM heap is 0xd000 bytes long. So this may mean that there's enough space in the SARAM if I use malloc to get the memory from the heap. I had thought that the heap would be fairly small so I was #PRAGMAing my large structures to be placed in SDRAM; this was probably the wrong thing to do. I'll test and report back.

Cheers,
Simon
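A rough check of the numbers above, together with the malloc-and-print idea suggested in this exchange. The heap size is the 0xd000-byte SARAM heap figure quoted from avs_kernelcfg.cmd; whether malloc on the DSP really draws from that heap is exactly what the printed address is meant to reveal, so the sketch does not assume it. Sizes are computed in 8-bit bytes explicitly because sizeof may count in wider units on the DSP, and printf stands in for whatever debug output the task actually has. All function names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define X_WORDS          (2 * 160)  /* 32-bit words needed by X[]     */
#define SARAM_HEAP_BYTES 0xd000L    /* heap size quoted in the thread */

/* Size of X[] in 8-bit bytes: 2 x 160 x 32 bits / 8. */
long x_array_bytes(void)
{
    return (long)X_WORDS * 32 / 8;
}

/* 1 if a request of this many bytes is no larger than the heap itself. */
int could_fit_in_heap(long request_bytes, long heap_bytes)
{
    return request_bytes <= heap_bytes;
}

/* Allocate X[] from the C heap and print where it landed. */
int32_t *alloc_x_and_report(void)
{
    int32_t *x = malloc(X_WORDS * sizeof *x);
    if (x != NULL)
        printf("X[] allocated at %p\n", (void *)x);
    return x;
}
```

Comparing the printed address against the memory ranges in the linker command file would then show which kind of memory the block ended up in.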
Re: DSP SBC encoder update
On Thu, Jul 10, 2008 at 6:57 PM, Simon Pickering [EMAIL PROTECTED] wrote:

No, I understood, I was just mentioning that there appear to be two heaps to chose from - presumably one is used by the DSP tasks (malloc is probably #defined as one of the CSL MEM* fns in the DSP Gateway task functions).

Maybe it is just clever/stupid enough to do the allocation automatically. At least when I did some experiments with the DSP before, it was allocating DARAM memory. Sure, you might want better control in order to put the most performance-critical data into DARAM, but malloc is a standard C function and is more portable.

Yes, accessing SDRAM memory is extremely slow. And if you access SDRAM memory using 16-bit accesses instead of 32-bit accesses, the overhead doubles. So if your data processing algorithm does not deal exclusively with 32-bit data accesses, you are better not to run it to process data in SDRAM memory. Copying data to a temporary buffer in DARAM or SARAM, processing it there and copying results back to SDRAM would be faster in this case.

The X[] array data type is an int32, so even accessing 32bit from SDRAM is still slower than using a local buffer (depending on what you need to do with it of course).

It depends on how many times the data is accessed. For example, if you have some algorithm that accesses this memory location 10 times, you would have 2 SDRAM + 10 SRAM memory accesses using the fetch/process/store pattern vs. 10 SDRAM memory accesses if working with this buffer directly in SDRAM. As SDRAM is an order of magnitude slower (decimal order, not binary), you really want to avoid dealing with SDRAM as much as possible.
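The access-count argument above can be written down as a small cost model. This is only a sketch of the reasoning: the costs are abstract units rather than measured cycle counts, and the function names are invented for the example.

```c
/* Cost of n accesses made directly to SDRAM. */
long cost_direct(long n, long sdram_cost)
{
    return n * sdram_cost;
}

/* Cost of fetch/process/store: one SDRAM read, n fast accesses,
 * one SDRAM write. */
long cost_staged(long n, long sdram_cost, long fast_cost)
{
    return 2 * sdram_cost + n * fast_cost;
}
```

Taking SDRAM as 10 units per access and internal memory as 1, ten accesses cost 100 units directly against 30 when staged. At two accesses the two approaches are about level (20 against 22), so staging only pays off once each element is touched three or more times.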
DSP SBC encoder update
Hi all,

I'm happy to say I've got the DSP task working for more than 4s now, in fact it even runs all the way to the end of the song, as expected ;). You can download version 1.0.0 from here: https://garage.maemo.org/projects/dsp-sbc/. This is for Diablo only.

This consists of a tarball containing the DSP task and command file, a tweaked Bluez-utils which can use said DSP task for SBC encoding (so it will just work with mplayer and the like) and an installation script which writes some config data about the new task to the DSP dynamic loader conf file and then extracts the tarball, installs the deb and tells you to reboot. [Note to would-be DSP hackers: rather than rebooting, you can just run dsp_dld in the terminal to restart the loader daemon, but make sure you've made a symlink from /lib/dsp/dsp_dld_avs.conf -> /lib/dsp/dsp_dld.conf as this is where it expects to find the conf file.]

If you want to go back to software encoding, rename the sbcenc.o file (in /lib/dsp/modules) and it will automatically fall back to the original software method (it falls back whenever the DSP fails, and renaming the task will cause it to fail). I've not checked to see if the fallback method is as quick as the original code, I'd be interested to know though if anyone is bored. I should add some logic using an env var or similar to switch method - anyone have some example code I could use?

You still need to enable a2dp, either with johnx's a2dp deb, which can be found here: http://www.internettablettalk.com/forums/showthread.php?t=13468, or manually (use the deb, far easier).

I should add that running DSP tasks will move the CPU frequency to 330MHz, so this is probably not the answer to everyone's prayers with regard to freeing the CPU to do Xvid decoding or the like. There is a kernel patch to not force the CPU to 330MHz (the DSP runs slower) and I'll do some testing to see if the DSP task can run in real-time at the lower DSP clock speed. Then it will be significantly more useful.
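On the question above about switching methods with an env var, one possible shape is sketched below. The variable name SBC_ENCODER and its "sw" value are invented for this example and are not something the released Bluez-utils build reads; only getenv() and strcmp() are standard library calls. The string parsing is kept separate from the environment lookup so it can be tested on its own.

```c
#include <stdlib.h>
#include <string.h>

enum sbc_method { SBC_METHOD_DSP, SBC_METHOD_SW };

/* Map a setting string to a method; anything other than "sw" means DSP. */
enum sbc_method sbc_method_from_string(const char *value)
{
    if (value != NULL && strcmp(value, "sw") == 0)
        return SBC_METHOD_SW;
    return SBC_METHOD_DSP;
}

/* Read the (hypothetical) SBC_ENCODER variable, e.g. once at start-up. */
enum sbc_method sbc_method_from_env(void)
{
    return sbc_method_from_string(getenv("SBC_ENCODER"));
}
```

The variable only has an effect if it is set in the environment of the process that actually performs the encoding, and the existing fall-back on DSP failure would still apply when the DSP method is selected.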
In the meantime, it may or may not use less power this way, please let me know if you do any testing. Next bit is for those interested in the gory details: This is pretty much the same code I had running a week ago or thereabouts, and it was only encoding ~4s of audio in real-time (using bulk transfers ioctls for sync). I tested the SW encoder and it would encode a test file more slowly than the DSP method but would output more seconds worth of audio when testing with mplayer, which made me wonder if the DSP was just cursed (or perhaps something to do with the CPU speed being set to 330MHz when the DSP is running...). The released code is from my mk2 branch: https://garage.maemo.org/plugins/scmsvn/viewcvs.php/mk2/?root=dsp-sbc The change which has allowed it to encode an entire song rather than just a few seconds was to move the input and output buffers from SDRAM (OMAP main memory) to SRAM (DSP fast single access memory). There are probably other things which would benefit from being moved, the sbc-priv data (or parts thereof) for one. This structure is pretty big so I allocated it in SDRAM, but at least parts of it might be better off in faster local memory. This is something to look at. I tested the speed of the bulk transfers (29s au file, took ~20s to encode with the DSP and ~9s to just transfer the data), which are pretty slow as you can see. I then decided to convert the task to use shared memory and some polling and sleeping to synchronise (mk4 branch: https://garage.maemo.org/plugins/scmsvn/viewcvs.php/mk4/?root=dsp-sbc). The mk4 code takes absolutely forever to run though, the same test file which takes ~20s with the bulk transfer method (mk2) takes ~45s using shared memory. Unfortunately there appear to be no clocks available in the DSP kernel (which makes benchmarking code quite tricky) and also means you can't sleep() between polling memory. So the DSP task sits in a tight polling loop (bad!) and the ARM sleeps for 1us and then polls the shared memory. 
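The ARM side of the poll-and-sleep handshake described above might look something like this. It is a sketch only: the flag layout, the function name and the use of nanosleep() are assumptions, and mapping the shared memory itself is left out. One thing worth keeping in mind is that a sleep request of 1us usually lasts a good deal longer than 1us on Linux because of timer granularity, which may contribute to the mk4 timings; that is a guess, not a measurement.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/*
 * Poll a flag in shared memory until it equals `expected`, sleeping between
 * polls. Returns the number of sleeps taken, or -1 after max_polls attempts.
 */
long poll_shared_flag(const volatile unsigned short *flag,
                      unsigned short expected,
                      long max_polls, long sleep_ns)
{
    struct timespec ts;
    long polls;

    ts.tv_sec = 0;
    ts.tv_nsec = sleep_ns;

    for (polls = 0; polls < max_polls; polls++) {
        if (*flag == expected)
            return polls;          /* flag already in the wanted state */
        nanosleep(&ts, NULL);      /* give up the CPU before re-checking */
    }
    return -1;                     /* gave up: the other side never answered */
}
```

Counting the polls on the ARM side gives a crude timing figure even though no clock is available on the DSP.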
Anyway, there's something not right and I'm not sure what it might be (the DSP manages ~650 loops before the ARM presents it with input data), the DSP then processes and the ARM sleeps for 1 loop (1us) before the DSP gives it back the encoded data, and so on. This is not a good method for the task to use, but I am interested to know why it's so slow, so may do some more work on it eventually. Talking about a lack of clocks, the mk3 branch was my attempt to rewrite the sbc conversion fns using DSP intrinsics, dual MACs, and the like. It doesn't produce the correct output data (probably some issue with my Q15 arithmetic, this was only the first hack at the code) but also didn't improve the speed of the code (and with no clock fns it's hard to tell where the bottleneck is) so I'm leaving it alone for the time being. Last but not least, even when running at 165MHz (or whatever the conservative governor produces) the sw fall back code doesn't produce any error messages (when
Re: DSP SBC encoder update
Hi,

Glad you got it working, will grab a copy when I go home :)

Just some notes: I found that patch I mentioned: http://pastebin.com/m34ed3cd3 It supposedly adds a sysfs interface for modding the DSP speed. If that fails, a modification to n800-dvfs.c will work :) (I think I've got a kernel stashed away with DSP/CPU at 133/400.)

Using this snippet of code in the installer sh will automatically relaunch the script as root if you are not root: http://www.internettablettalk.com/forums/showpost.php?p=122754postcount=5

Cheers,
Faheem

On Mon, Jul 7, 2008 at 11:43 AM, Simon Pickering [EMAIL PROTECTED] wrote:

Hi all, I'm happy to say I've got the DSP task working for more than 4s now, in fact it even runs all the way to the end of the song, as expected ;). You can download version 1.0.0 from here: https://garage.maemo.org/projects/dsp-sbc/. This is for Diablo only. This consists of a tarball containing the DSP task and command file, a tweaked Bluez-utils which can use said DSP task for SBC encoding (so it will just work with mplayer and the like) and an installation script which writes some config data about the new task to the DSP dynamic loader conf file and then extracts the tarball, installs the deb and tells you to reboot. [Note to would-be DSP hackers: rather than rebooting, you can just run dsp_dld in the terminal to restart the loader daemon, but make sure you've made a symlink from /lib/dsp/dsp_dld_avs.conf -> /lib/dsp/dsp_dld.conf as this is where it expects to find the conf file.] If you want to go back to software encoding, rename the sbcenc.o file (in /lib/dsp/modules) and it will automatically fall back to the original software method (it falls back whenever the DSP fails, and renaming the task will cause it to fail). I've not checked to see if the fallback method is as quick as the original code, I'd be interested to know though if anyone is bored.
I should add some logic using an env var or similar to switch method - anyone have some example code I could use? You still need to enable a2dp with either johnx's a2dp deb which can be found here: http://www.internettablettalk.com/forums/showthread.php?t=13468 or manually (use the deb, far easier). I should add that running DSP tasks will move the CPU frequency to 330MHz, so this is probably not the answer to everyone's prayers with regard to freeing the CPU to do Xvid decoding or the like. There is a kernel patch to not force the CPU to 330MHz (the DSP runs slower) and I'll do some testing to see if the DSP task can run in real-time at the lower DSP clock speed. Then it will be significantly more useful. In the meantime, it may or may not use less power this way, please let me know if you do any testing. Next bit is for those interested in the gory details: This is pretty much the same code I had running a week ago or thereabouts, and it was only encoding ~4s of audio in real-time (using bulk transfers ioctls for sync). I tested the SW encoder and it would encode a test file more slowly than the DSP method but would output more seconds worth of audio when testing with mplayer, which made me wonder if the DSP was just cursed (or perhaps something to do with the CPU speed being set to 330MHz when the DSP is running...). The released code is from my mk2 branch: https://garage.maemo.org/plugins/scmsvn/viewcvs.php/mk2/?root=dsp-sbc The change which has allowed it to encode an entire song rather than just a few seconds was to move the input and output buffers from SDRAM (OMAP main memory) to SRAM (DSP fast single access memory). There are probably other things which would benefit from being moved, the sbc-priv data (or parts thereof) for one. This structure is pretty big so I allocated it in SDRAM, but at least parts of it might be better off in faster local memory. This is something to look at. 
I tested the speed of the bulk transfers (29s au file, took ~20s to encode with the DSP and ~9s to just transfer the data), which are pretty slow as you can see. I then decided to convert the task to use shared memory and some polling and sleeping to synchronise (mk4 branch: https://garage.maemo.org/plugins/scmsvn/viewcvs.php/mk4/?root=dsp-sbc). The mk4 code takes absolutely forever to run though, the same test file which takes ~20s with the bulk transfer method (mk2) takes ~45s using shared memory. Unfortunately there appear to be no clocks available in the DSP kernel (which makes benchmarking code quite tricky) and also means you can't sleep() between polling memory. So the DSP task sits in a tight polling loop (bad!) and the ARM sleeps for 1us and then polls the shared memory. Anyway, there's something not right and I'm not sure what it might be (the DSP manages ~650 loops before the ARM presents it with input data), the DSP then processes and the ARM sleeps for 1 loop (1us) before the DSP gives it back the encoded data, and so on. This is not a good method for the