Re: [nova-dev] Performance ideas

Phil Frost Wed, 26 Sep 2007 19:49:33 -0700

On Wed, Sep 26, 2007 at 07:18:53PM +0200, Tim Blechmann wrote:
> > For a while I played around with writing a synth software called Dubnium
> > <http://bitglue.com/dubnium>. I never developed it enough to actually be
> > useful, but I did learn some things, anyway.
> > 
> > I think the most original thing about dubnium was that the processing
> > units were not functions which took pointers to input and output buffers
> > and did their magic, but instead were chunks of C code which would be
> > inlined with the code from other modules, then compiled with gcc.
> 
> i think i see your point ... the problem with this approach is that it
> gets difficult when you try to change the dsp graph dynamically ...
> 
> so if you add two signals and then apply a filter, the most
> efficient way would probably be something like:
> - get input1 and input2 from memory to registers
> - add them
> - apply the filter in the registers
> - move the result from registers to memory ...
> 
> the less efficient way, which can be used in dynamic environments
> (without a compile cycle) is something like:
> - get input1 and input2 from memory to registers
> - add them
> - move the result from registers to memory
> - get the result from memory to registers
> - apply the filter
> - move the result back to memory
> 
> when written as a nova patch, the second approach is the only usable one ...
> 
> however there are some other aspects that one would have to keep in
> mind:
> - adding or multiplying two sample vectors can be done in steps of 4
> samples, while filtering is a sample-wise operation
> - if you have so many local variables that they no longer fit in your
> registers, they will be spilled to the stack, and instead of saving
> memory operations, you end up with even more ...
> 
> 
> > As I mentioned, dubnium would generate C code provided by the processing
> > units then feed it to gcc. What happened basically is that each input
> > and output of each processing unit was declared as a local variable in
> > one function that would become the aggregation of some subgraph of
> > processing units. So, if you had a patch which would multiply 4 numbers
> > with a tree of "multiply" widgets with 2 inputs each, it would generate
> > code something like:
> > 
> > the graph:
> > 
> >   in1         in2   in3        in4
> >      \       /         \       /
> >       [mult1]           [mult2]
> >              \         /
> >                [mult3]
> >                   |
> >                  out
> > 
> > static inline void multiply(float in1, float in2, float *out) {
> >     // the body of this function is provided by the
> >     // processing unit implementation
> >     *out = in1 * in2;
> > }
> 
> actually i am using a similar approach for implicit ugens that add
> memory chunks when you connect multiple signal outlets to one signal
> inlet (see: source/kernel/ugen/add_ugen.hpp) ...
> however, the tree structure you described is only efficient when
> your data is always located in the registers. for my Add_Ugen class, the
> maximum number of signal vectors that i add in one loop is 4, because of
> the number of floating point registers on the sse unit of x86 cpus.
> 
> i am not really sure how a cross-ugen optimization could be realized
> from a technical point of view, as it depends on the architecture
> (how many registers the cpu provides), the algorithm complexity (how
> many registers your algorithm needs) and the algorithm type
> (vectorizable or not).
> i somehow prefer to have efficient ugen implementations and
> cache-friendly code ...


Well, I don't know that it would be impossible to implement; maybe just
hard. You'd probably have to accept some delay when the processing graph
changes due to the recompile. However, if it's done in a separate
process, it could be done without any break in the audio processing.
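
A minimal sketch of how the hand-over might look once the new code has
been compiled and loaded. Everything here is an assumption for
illustration: the names process_fn, install_process and run_block are
hypothetical, and the use of a C11 atomic function pointer is one possible
mechanism, not something dubnium or nova is known to do. The audio
callback reads the pointer once per block, so a swap takes effect at the
next block boundary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical signature for a compiled subgraph: one block in, one out. */
typedef void (*process_fn)(const float *in, float *out, size_t n);

/* The routine the audio callback should currently run (NULL at startup). */
static _Atomic(process_fn) current_process;

/* Called from a non-realtime context once freshly compiled code is ready. */
static void install_process(process_fn fn) {
    assert(fn != NULL);
    atomic_store_explicit(&current_process, fn, memory_order_release);
}

/* Called by the audio callback for every block. */
static void run_block(const float *in, float *out, size_t n) {
    process_fn fn =
        atomic_load_explicit(&current_process, memory_order_acquire);
    if (fn == NULL) {
        /* Nothing installed yet: output silence. */
        for (size_t i = 0; i < n; ++i)
            out[i] = 0.0f;
        return;
    }
    fn(in, out, n);
}

/* Two stand-ins for "old" and "new" compiled graphs. */
static void gain_half(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * 0.5f;
}

static void gain_double(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}
```

The old routine keeps running until the store happens, so the compile
time shows up as latency of the graph change rather than as a dropout.
Freeing the old code safely is a separate problem that this sketch does
not address.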

You may be right that it would not provide a substantial optimization
for nontrivial processing units. In dubnium I actually implemented IIR
filters as abstractions of add, multiply, and 1 sample delay builtins,
so the complexity of the primitives was much less.
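
To illustrate what such an abstraction could look like, here is a sketch
of a one-pole IIR filter, y[n] = b0*x[n] + a1*y[n-1], written only in
terms of add, multiply and a one-sample delay. The primitive names
(prim_add, prim_mul, delay_read, delay_write) and the one_pole wrapper are
hypothetical and are not taken from the dubnium source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical primitives; each is trivial enough to be inlined. */
static inline float prim_add(float a, float b) { return a + b; }
static inline float prim_mul(float a, float b) { return a * b; }

/* One-sample delay, split into its two halves so it can sit in a
 * feedback path: read the previous sample first, write the new one last. */
static inline float delay_read(const float *state) { return *state; }
static inline void delay_write(float *state, float in) { *state = in; }

/* One-pole IIR built from the primitives above:
 *   y[n] = b0 * x[n] + a1 * y[n-1]
 * *state holds y[n-1] between calls. */
static void one_pole(const float *x, float *y, size_t n,
                     float b0, float a1, float *state) {
    assert(x != NULL && y != NULL && state != NULL);
    for (size_t i = 0; i < n; ++i) {
        float fb = delay_read(state);                  /* y[n-1] */
        float out = prim_add(prim_mul(b0, x[i]),
                             prim_mul(a1, fb));
        delay_write(state, out);                       /* becomes y[n-1] */
        y[i] = out;
    }
}
```

With b0 = a1 = 0.5 and a unit impulse as input, the output halves on
every sample, which is an easy way to check the wiring by hand.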

However, my intuition tells me that add and multiply nodes are common
enough that if there were a way to avoid the loads and stores for just
those, the performance gain could be significant. I'm not sure of the
best way to implement it, but I'll be sure to let you know if something
comes to mind :)
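
For concreteness, the difference between the two approaches for an add
node feeding a multiply node can be sketched as below. The function names
and the BLOCK size are made up for this example. The unfused version
mirrors the one-buffer-per-connection style, the fused version is what
inlining both nodes into a single loop would give.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK = 64 }; /* hypothetical block size */

/* Separate processing units: each reads and writes whole buffers. */
static void add_block(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

static void mul_block(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

/* Unfused: (a + b) * c with the intermediate sum stored to memory
 * and loaded again by the second unit. */
static void add_mul_unfused(const float *a, const float *b, const float *c,
                            float *out, size_t n) {
    float tmp[BLOCK];
    assert(n <= BLOCK);
    add_block(a, b, tmp, n);
    mul_block(tmp, c, out, n);
}

/* Fused: the sum never needs to leave a register between the two ops. */
static void add_mul_fused(const float *a, const float *b, const float *c,
                          float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = (a[i] + b[i]) * c[i];
}
```

Both versions compute the same values. Per sample, the fused loop skips
one store and one load, and it does not need the temporary buffer at all.
Whether that is a measurable win would have to be benchmarked on real
patches.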