Quick 2 cents of my own to re-emphasize a point Ross made: profile to
find out which approach is fastest if you aren't sure (although it's
good to ask too, in case different systems have oddities you don't
know about).

Also, if you have performance issues in the future, profile before
acting. Oftentimes what we suspect to be the bottleneck of our
application is in fact not the bottleneck at all. Happens to everyone :P

Lastly, copying buffers is an important thing to get right, but in
case you haven't heard this enough: when hitting performance problems
it's often better to do MACRO optimization than MICRO optimization.

Macro optimization means changing your algorithm, being smarter with
the resources you have etc.

Micro optimization means turning multiplications into bitshifts,
breaking out the assembly and things like that.

Oftentimes macro optimizations will get you a bigger win (don't
hand-tune a crappy sorting algorithm, just use a better sorting
algorithm and it'll be way faster), and they also result in more
maintainable, portable code, so you should prefer going that route
first.
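
To make that concrete with a (totally made-up) example:

    #include <algorithm>
    #include <vector>

    // Micro optimization: fiddle with the inner loop of your hand-rolled
    // bubble sort (shifts instead of multiplies, etc.) -- still O(n^2).
    //
    // Macro optimization: throw the bubble sort away and use a better
    // algorithm. One line, faster, and more maintainable:
    void sortSamples(std::vector<float>& v)
    {
        std::sort(v.begin(), v.end());   // O(n log n)
    }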

Hope this helps!

On Thu, Mar 7, 2013 at 2:48 PM, Ross Bencina <rossb-li...@audiomulch.com> wrote:
> Stephen,
>
>
> On 8/03/2013 9:29 AM, ChordWizard Software wrote:
>>
>> a) additive mixing of audio buffers
>> b) clearing to zero before additive processing
>
>
> You could also consider writing (rather than adding) the first signal to the
> buffer. That way you don't have to zero it first. It requires having a
> "write" and an "add" version of your generators. Depending on your code this
> may or may not be worth the trouble vs zeroing first.
>
> In the past I've sometimes used C++ templates to parameterise by the output
> operation (write/add) so you only have to write the code that generates the
> signals once.
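
To illustrate Ross's point, here's roughly what that can look like (untested
sketch, names made up):

    #include <cmath>

    // One generator body, compiled twice: once overwriting the buffer
    // (no zeroing needed for the first voice), once accumulating into it.
    struct WriteOp { void operator()(float& dest, float x) const { dest  = x; } };
    struct AddOp   { void operator()(float& dest, float x) const { dest += x; } };

    template <typename OutputOp>
    void generateSine(float* out, int n, float phase, float inc, OutputOp op)
    {
        for (int i = 0; i < n; ++i) {
            op(out[i], std::sin(phase));
            phase += inc;
        }
    }

    // generateSine(buf, frames, ph, inc, WriteOp());  // first signal: write
    // generateSine(buf, frames, ph, inc, AddOp());    // later signals: mix in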
>
>
>> c) copying from one buffer to another
>
> Of course you should avoid this wherever possible. Consider using
> (reference counted) buffer objects so you can share them instead of copying
> data. You could use reference counting, or just reclaim everything at the
> end of every cycle.
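
A hypothetical sketch of that shared-buffer idea using std::shared_ptr
(allocation and freeing would of course happen outside the realtime callback,
e.g. via a pool, as Ross suggests):

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Consumers hold a reference to the same buffer instead of memcpy'ing it.
    using Buffer    = std::vector<float>;
    using BufferRef = std::shared_ptr<Buffer>;

    BufferRef makeBuffer(std::size_t frames)
    {
        return std::make_shared<Buffer>(frames, 0.0f);
    }

    // BufferRef b = makeBuffer(512);
    // mixerInput0 = b;   // shared...
    // mixerInput1 = b;   // ...not copied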
>
>
>
>> d) converting between short and float formats
>>
>>
>> No surprises to any of you there I'm sure.  My question is, can you
>> give me a few pointers about making them as efficient as possible
>> within that critical realtime loop?
>>
>> For example, how does the efficiency of memset, or ZeroMemory,
>> compare to a simple for loop?
>
>
> Usually memset has a special case for writing zeros, so you shouldn't see
> too much difference between memset and ZeroMemory.
>
> memset vs simple loop will depend on your compiler.
>
> The usual wisdom is:
>
> 1) Use memset rather than writing your own. The library implementation will
> use SSE/whatever and will be fast. Of course this depends on the runtime.
>
> 2) Always profile and compare if you care.
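
Re point 1, for what it's worth: memset is also fine for zeroing float
buffers, since the all-zero bit pattern is 0.0f in IEEE 754. Something like:

    #include <string.h>

    /* All-zero bits == 0.0f for IEEE 754 floats, so memset is safe here. */
    static void clearBuffer(float *buf, size_t frames)
    {
        memset(buf, 0, frames * sizeof(float));
    }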
>
>
>
>> Or using HeapAlloc with the
>> HEAP_ZERO_MEMORY flag when the buffer is created (I know buffers
>> shouldn’t be allocated in a realtime callback, but just out of
>> interest, I assume an initial zeroing must come at a cost compared to
>> not using that flag)?
>
>
> It could happen in a few ways, but I'm not sure how it *does* happen on
> Windows and OS X.
>
> For example the MMU could map all the pages to a single zero page and then
> allocate+zero only when there is a write to the page.
>
>
>
>> I'm using Win32 but intend to port to OSX as well, so comments on the
>> merits of cross-platform options like the C RTL would be particularly
>> helpful.  I realise some of those I mention above are Win-specific.
>>
>> Also for converting sample formats, are there more efficient options
>> than simply using
>>
>> nFloat = (float)nShort / 32768.0
>
>
> Unless you have a good reason not to, you should prefer multiplication by
> the reciprocal for the first one:
>
> const float scale = (float)(1. / 32768.0);
> nFloat = (float)nShort * scale;
>
> You can do 4 at once if you use SSE/intrinsics.
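
For example, something along these lines (untested sketch; assumes SSE4.1 for
the sign-extend and that n is a multiple of 4):

    #include <smmintrin.h>  /* SSE4.1 */

    /* Convert 4 int16 samples to float per iteration. */
    void shortToFloat(const short *in, float *out, int n)
    {
        const __m128 scale = _mm_set1_ps(1.0f / 32768.0f);
        for (int i = 0; i < n; i += 4) {
            __m128i s16 = _mm_loadl_epi64((const __m128i *)(in + i)); /* 4 shorts */
            __m128i s32 = _mm_cvtepi16_epi32(s16);  /* sign-extend to 4 x int32    */
            __m128  f   = _mm_cvtepi32_ps(s32);     /* int32 -> float              */
            _mm_storeu_ps(out + i, _mm_mul_ps(f, scale));
        }
    }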
>
>
>> nShort = (short)(nFloat * 32768.0)
>
> Float => int conversion can be expensive depending on your compiler settings
> and supported processor architectures. There are various ways around this.
>
> Take a look at pa_converters.c and pa_x86_plain_converters.c in
> PortAudio. But you can do better with SSE.
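
And for the float -> short direction, a hypothetical SSE2 sketch (untested;
cvtps2dq rounds using the current rounding mode so there's no ftol penalty,
and the signed pack clips to [-32768, 32767] for free):

    #include <emmintrin.h>  /* SSE2 */

    /* Convert 4 float samples to int16 per iteration; n a multiple of 4. */
    void floatToShort(const float *in, short *out, int n)
    {
        const __m128 scale = _mm_set1_ps(32767.0f);
        for (int i = 0; i < n; i += 4) {
            __m128  f   = _mm_mul_ps(_mm_loadu_ps(in + i), scale);
            __m128i s32 = _mm_cvtps_epi32(f);         /* float -> int32, rounded */
            __m128i s16 = _mm_packs_epi32(s32, s32);  /* saturate to int16       */
            _mm_storel_epi64((__m128i *)(out + i), s16);
        }
    }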
>
>
>
>> for every sample?
>>
>> Are there any articles on this type of optimisation that can give me
>> some insight into what is happening behind the various memory
>> management calls?
>
>
> Probably. I would make sure you allocate aligned memory, maybe lock it in
> physical memory, and then use it -- and generally avoid OS-level memory
> calls from then on.
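
On the allocation side, a rough cross-platform sketch of "allocate aligned,
maybe lock it, then leave the OS alone" (untested, error handling mostly
omitted):

    #ifdef _WIN32
    #include <malloc.h>
    #include <windows.h>
    #else
    #include <stdlib.h>
    #include <sys/mman.h>
    #endif

    /* 16-byte-aligned buffer, optionally locked into physical memory.
       Call once at startup; no OS memory calls in the audio callback. */
    float *allocAudioBuffer(size_t frames)
    {
        size_t bytes = frames * sizeof(float);
    #ifdef _WIN32
        float *p = (float *)_aligned_malloc(bytes, 16);
        if (p) VirtualLock(p, bytes);
    #else
        void *mem = 0;
        if (posix_memalign(&mem, 16, bytes) != 0) return 0;
        float *p = (float *)mem;
        mlock(p, bytes);
    #endif
        return p;
    }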
>
> I would use memset() and memcpy(). These are optimised, and the compiler may
> even inline an even more optimal version.
>
> The alternative is to go low-level and benchmark everything and write your
> own code in SSE (and learn how to optimise it).
>
> If you really care you need a good profiler.
>
> That's my 2c.
>
> HTH
>
> Ross.