Re: Confusion with low-level audio libraries

AudioGames . net Forum — Developers room : camlorn via Audiogames-reflector Sun, 24 Jan 2021 17:39:29 -0800

The really short answer is to read Synthizer's source, which uses Miniaudio.

The slightly longer answer is it depends, so I will explain Synthizer:

First, you want everything to be the same samplerate at some point. Samplerate conversions are expensive: not so expensive that a couple of them matter, but enough that if you run one per sound you're going to start hurting. So you push that to the edge. E.g. Synthizer buffers are resampled on load. Unfortunately this isn't always possible for streaming or for audio output. In the streaming case you're of course limited to the samplerate of the original audio and have to insert a resampler there. For audio output, it's sometimes the case that running the library at a fixed samplerate internally is a big win, so you resample once at the output instead. For Synthizer, that's 44100 because HRTF datasets like to be that way, and resampling those at runtime is complicated.

The rest of this is basically summing arrays. You take your sources of audio, Synthizer generators for example. Sum those to sources. Pan them. Sum the output of those to the audio output buffers. At each stage of this process, you might have to convert between channel formats, especially mono->stereo and stereo->mono. Synthizer does that with specialized functions. The general case is a matrix multiplication, but stereo->mono is (l+r)/2 and mono->stereo is just copying to two arrays, so the specialized versions are worth it. Also, you want to specialize the no-conversion case, either to memcpy or to adding to an output buffer.
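
A minimal sketch of what those specialized conversions might look like. The function names and the interleaved sample layout are assumptions for illustration, not Synthizer's actual API. Every function adds to the output rather than overwriting it, and the general matrix case is there only for comparison:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers, not Synthizer's real functions.
// Assumption: multi-channel audio is interleaved (L, R, L, R, ...).
// Every function ADDS into `out` instead of overwriting it.

// Stereo -> mono: (l + r) / 2 per frame.
inline void mixStereoToMono(const float *in, float *out, std::size_t frames) {
  assert(in != nullptr && out != nullptr);
  for (std::size_t i = 0; i < frames; i++) {
    out[i] += (in[2 * i] + in[2 * i + 1]) * 0.5f;
  }
}

// Mono -> stereo: the same sample goes to both channels.
inline void mixMonoToStereo(const float *in, float *out, std::size_t frames) {
  assert(in != nullptr && out != nullptr);
  for (std::size_t i = 0; i < frames; i++) {
    out[2 * i] += in[i];
    out[2 * i + 1] += in[i];
  }
}

// No conversion needed: a plain add over every sample.
inline void mixSameFormat(const float *in, float *out, std::size_t samples) {
  assert(in != nullptr && out != nullptr);
  for (std::size_t i = 0; i < samples; i++) {
    out[i] += in[i];
  }
}

// General case: a row-major (outCh x inCh) mixing matrix applied per frame.
inline void mixMatrix(const float *in, std::size_t inCh, float *out,
                      std::size_t outCh, const float *matrix,
                      std::size_t frames) {
  assert(in != nullptr && out != nullptr && matrix != nullptr);
  for (std::size_t f = 0; f < frames; f++) {
    for (std::size_t o = 0; o < outCh; o++) {
      float acc = 0.0f;
      for (std::size_t i = 0; i < inCh; i++) {
        acc += matrix[o * inCh + i] * in[f * inCh + i];
      }
      out[f * outCh + o] += acc;
    }
  }
}
```

The specialized versions have no inner loops over channels and no matrix loads, which is the whole reason to bother with them.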

Synthizer is fast enough to run in prod in debug builds. You shouldn't, but you can, and that's been valuable for testing. If you optimize you can do very demanding audio on the CPU. Unfortunately this means running at maximum efficiency; Synthizer in a real world scenario can easily approach 50 megaflops or even a couple gigaflops or more, depending. That's not too bad except that you're sharing the system, and it's effectively a hard realtime requirement, even more so than graphics. To get that kind of efficiency, you have to address 3 aspects.

First, anything that might block the audio generation thread cannot happen on the audio generation thread. That's not a hard rule, because sometimes it's just not possible to avoid needing to do memory allocation or something, but mutexes/locks are not your friend at all in any fashion. You can make this a hard rule, but only if you add limitations like a maximum number of sources. Synthizer just says "here are some reasonable pre-allocated bits, if you do something crazy things might click while we grow the buffers", basically.

Second, memory bandwidth is a problem. Any excess zeroing of buffers, any excess buffers at all for that matter, will push things out of L1. Your worst case is that things get pushed all the way to RAM. So you don't want that. Synthizer deals with this in two ways. First, instead of making buffers per object, you can make buffers per invocation of a function and cache them. I estimated the number of buffers a typical generator->source->context stack would need, then wrote a fixed-size cache which can hand them out on request. Instead of putting a bunch of temporary buffers in your class for the intermediate steps, you ask the cache for buffers, and it's a stack so it's likely that the one you get is still in L1. Second, Synthizer establishes a convention that all audio processing will add to the specified output buffer rather than just writing. The naive way of doing this is one buffer per source: you fill all of those, then you loop over the sources and add each one's buffer. But this is one buffer per source, usually a few KB each, and you just did the worst thing you could: read all of them start to finish, pushing everything else out of the cache. Adding to the output buffer shifts this to some need to zero, but you've gone from O(n) buffers to roughly O(1) buffers. Synthizer isn't perfect about this, but it's fast enough and I'll probably only finish improving it when it becomes time to optimize for the Pi or something like that.
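
To make the two ideas concrete, here is a rough sketch of a fixed-size stack of scratch blocks plus a stage that adds into its output. All names and sizes here (BlockCache, Lease, BLOCK_SAMPLES, CACHE_SLOTS, renderConstantSource) are made up for illustration and are not Synthizer's real types. It assumes a single audio thread and that leases are released in reverse order of acquisition, which automatic variables give you for free:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sizes, chosen only for the example.
constexpr std::size_t BLOCK_SAMPLES = 512;
constexpr std::size_t CACHE_SLOTS = 8;

// Fixed-size stack of scratch blocks. Single-threaded by assumption.
class BlockCache {
  std::array<std::array<float, BLOCK_SAMPLES>, CACHE_SLOTS> blocks{};
  std::size_t top = 0;

  float *acquire() {
    assert(top < CACHE_SLOTS);  // a real library would need a fallback here
    float *block = blocks[top++].data();
    // Zero on hand-out so callers can immediately add into the block.
    std::fill(block, block + BLOCK_SAMPLES, 0.0f);
    return block;
  }

  void release() {
    assert(top > 0);
    top--;
  }

public:
  // RAII handle: takes the next block, gives it back when it goes out of scope.
  class Lease {
    BlockCache &cache;
    float *data;

  public:
    explicit Lease(BlockCache &c) : cache(c), data(c.acquire()) {}
    ~Lease() { cache.release(); }
    Lease(const Lease &) = delete;
    Lease &operator=(const Lease &) = delete;
    float *get() { return data; }
  };

  std::size_t inUse() const { return top; }
};

// Example stage: fills a scratch block, then ADDS it to `out`.
// The scratch block is returned to the cache when the function exits, so the
// next stage that asks for one gets the same, probably still cached, memory.
inline void renderConstantSource(BlockCache &cache, float value, float *out) {
  BlockCache::Lease scratch(cache);
  float *tmp = scratch.get();
  for (std::size_t i = 0; i < BLOCK_SAMPLES; i++) {
    tmp[i] = value;
  }
  for (std::size_t i = 0; i < BLOCK_SAMPLES; i++) {
    out[i] += tmp[i];
  }
}
```

However many sources call renderConstantSource in sequence, only one scratch block and one output block are ever touched.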

Also, pointer chasing is bad so allocate buffers inline when you can, but this is already long enough. And also also, it doesn't matter how much memory is allocated but how much you access, so allocate all day long as long as you don't have to read all of it all the time, but going into that is also probably beyond the scope of what is quickly becoming an essay. Suffice it to say that Synthizer does lots of arrays that are waaay oversized for what they need to be, but it's fine because you only access the front. In particular there's a hard-coded internal limit of 16 audio channels (why 16 is another topic, but I didn't pull it out of my ass and it's not related to CPU efficiency).

Third, you have to take advantage of the CPU. This means autovectorization or hand-vectorized code, and being friendly to the branch predictor and compiler optimizations. This gets a little bit speculative in that I haven't firmly benchmarked and a lot of what I did in Synthizer for this is based off mental heuristics, so grain of salt.

To make your code autovectorizable, prefer compile-time constants. Synthizer does this by having a hard-coded block size and hard-coded samplerate, which means that the compiler almost always knows the exact number of iterations of a loop. This gets you loop unrolling and stuff for free right away, and a lot of free autovectorization. Also, be aware that floating point math isn't optimized how you think: (a+b)+(c+d) is efficient, but a+b+c+d isn't, because it has to evaluate left to right, and since floating point addition isn't associative the compiler can't treat the two as equivalent. So you probably want -ffast-math or a good intuition of when/how to write efficient floating point (Synthizer goes off good intuition, but will turn on fast math eventually). For one concrete example, a/b in a loop is terrible, but a*(1/b) where b is a constant or (1/b) is computed before the loop is something like 2 to 10 times faster. You also get a lot here out of having loops that do iterations of 4 or 16, and everything being a power of 2.
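
A small sketch of those points, assuming a hypothetical compile-time BLOCK_SIZE (the value is made up, not Synthizer's). The first function hoists the reciprocal out of the loop; the second uses four independent accumulators so the additions have the (a+b)+(c+d) shape rather than one long left-to-right chain. Note that a*(1/b) is not always bit-identical to a/b, which is exactly why the compiler won't do this for you without fast math:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time block size; a power of 2 and a multiple of 4.
constexpr std::size_t BLOCK_SIZE = 256;
static_assert(BLOCK_SIZE % 4 == 0, "sumBlock assumes a multiple of 4");

// Divide every sample by `divisor` without a division inside the loop:
// the reciprocal is computed once, then the loop is a plain multiply.
inline void scaleBlock(float *block, float divisor) {
  assert(divisor != 0.0f);
  const float inv = 1.0f / divisor;
  for (std::size_t i = 0; i < BLOCK_SIZE; i++) {
    block[i] *= inv;
  }
}

// Sum a block with four independent accumulators. Each accumulator forms
// its own dependency chain, so the adds don't all wait on one another.
inline float sumBlock(const float *block) {
  float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
  for (std::size_t i = 0; i < BLOCK_SIZE; i += 4) {
    acc0 += block[i];
    acc1 += block[i + 1];
    acc2 += block[i + 2];
    acc3 += block[i + 3];
  }
  return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the trip count is a constant, the compiler is free to unroll and vectorize both loops without a scalar tail.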

X86 can do as many as 16 floating point operations per cycle, sometimes more, if and only if you can hit SIMD. Writing SIMD by hand is a pain and architecture specific; fortunately, using the stuff I'm describing here means you never have to. There are also vectorization pragmas and things like that, some trig tricks, etc.

To make your code friendly to the branch predictor, lift if statements out. A for loop inside an if/else is almost always better/faster than an if inside a for loop, because the former case evaluates it once but the latter case evaluates it once per iteration. The higher you push this up, the better. Branches are terribly expensive, especially when mispredicted; it's something like 30-50 math operations vs 1 branch. Synthizer uses the fact that C++ has constants as template parameters, and that with short-lived audio loops we don't care about the instruction cache for this purpose. For example, BufferGenerator does if statements in a helper template that are inside the loops and look terrible, to determine if the position should reset when the buffer reaches the end due to looping, but it's actually off a constant template parameter and thus trivially optimized by the compiler (you can reliably assume the compiler will do this; and in debug builds, it's a perfectly predicted branch). Higher up, an if tree determines which bools to turn on, then instantiates dedicated branchless functions off those (you know C++, just read src/generators/buffer.cpp, the code is *much* cleaner than this explanation and it's fairly obvious why I wrote it this way).
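
A stripped-down illustration of that pattern. This is not the code in src/generators/buffer.cpp; the function names and signature are invented for the example. The looping check sits inside the loop but depends only on a template parameter, so each instantiation compiles to a loop without that decision in it, and the runtime bool is examined exactly once in the dispatcher:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified helper. Adds up to `frames` mono samples from
// `src` into `out`, starting at `pos`, and returns the new position.
template <bool LOOPING>
std::size_t addFromBuffer(const float *src, std::size_t srcLen,
                          std::size_t pos, float *out, std::size_t frames) {
  assert(srcLen > 0);
  for (std::size_t i = 0; i < frames; i++) {
    if (pos >= srcLen) {
      // LOOPING is a compile-time constant, so this inner `if` disappears
      // in optimized builds and is perfectly predicted in debug builds.
      if (LOOPING) {
        pos = 0;
      } else {
        break;
      }
    }
    out[i] += src[pos++];
  }
  return pos;
}

// The runtime decision is made once, outside the loop, and picks a
// dedicated instantiation.
inline std::size_t addFromBufferDispatch(const float *src, std::size_t srcLen,
                                         std::size_t pos, float *out,
                                         std::size_t frames, bool looping) {
  if (looping) {
    return addFromBuffer<true>(src, srcLen, pos, out, frames);
  }
  return addFromBuffer<false>(src, srcLen, pos, out, frames);
}
```

With more than one flag, the dispatcher becomes a small if tree that selects among the instantiations, at the cost of some extra generated code.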

As a bonus, pushing if statements up makes more loops autovectorizable, because autovectorization isn't perfect and one of the ways in which it's not perfect is that the compiler can't always know when to push the if statement up itself, and SIMD has no facilities for branching (that's a lie to children, but I'm not going to go consult references I don't have memorized to show incredibly complicated examples of how you can do it, and doing it effectively is like an entire Saturday of research for a few lines. Suffice it to say your compiler can't be relied on for this, so move the if statement up or else).

Anyway, not sure if this answers the question or not, but there's a reason the third time's the charm with me and my audio library forays. Getting all of the above right can be anywhere from a 2x to 20x or more performance increase, depending on how wrong things were to start with, and a lot of the knowledge just comes from doing C/C++-level coding for a good while.

