[music-dsp] A theory of optimal splicing of audio in the time domain.

robert bristow-johnson Sun, 05 Dec 2010 23:59:23 -0800

This is a continuation of the thread started by Element Green titled: Algorithms for finding seamless loops in audio

As far as I know, it is not published anywhere. A few years ago, I was thinking of writing this up and publishing it (or submitting it for publication, probably to JAES), and had let it fall by the wayside. I'm "publishing" the main ideas here on music-dsp because of some possible interest here (and the hope it might be helpful to somebody), and so that "prior art" is established in case anyone like IVL is thinking of claiming it as their own. I really do not know how useful it will be in practice. It might not make any difference. It's just a theory.


______________________________________________________________________

Section 0:

This is about the generalization of the different ways we can splice and crossfade audio that has these two extremes:


   (1)  Splicing perfectly coherent and correlated signals
   (2)  Splicing completely uncorrelated signals

I sometimes call the first case the "constant-voltage crossfade" because the crossfade envelopes of the two signals being spliced add up to one. The two envelopes meet when both have a value of 1/2. In the second case, we use a "constant-power crossfade": the squares of the two envelopes add to one and they meet when both have a value of sqrt(1/2) ~= 0.707.

The questions I wanted to answer are: What does one do for cases in between, and how does one know from the audio which crossfade function to use? How does one quantify the answers to these questions? How much can we generalize the answer?


______________________________________________________________________

Section 1: Set up the problem.

We have two continuous-time audio signals, x(t) and y(t), and we want to splice from one to the other at time t=0. In pitch-shifting or time-scaling or any other looping, y(t) can be some delayed or advanced version of x(t).


    e.g.    y(t) = x(t-P)

where P is a period length or some other "good" splice displacement. We get that value from an algorithm we call a "pitch detector".

Also, it doesn't matter whether x(t) is getting spliced to y(t) or the other way around; it should work just as well for the audio played in reverse. And it should be no loss of generality that the splice happens at t=0; we define our coordinate system any damn way we damn well please.


The signal resulting from the splice is

    v(t)  =  a(t)*x(t) + a(-t)*y(t)

By restricting our result to be equivalent if run either forward or backward in time, we can conclude that the "fade-in" function (say that's a(t), which rises from 0 to 1 in the examples that follow) is the time-reversed copy of the "fade-out" function, a(-t).


For the correlated case   (1):   a(t)    +  a(-t)    = 1   for all t

For the uncorrelated case (2):  (a(t))^2 + (a(-t))^2 = 1   for all t

This crossfade function, a(t), has well-defined even and odd symmetry components:


                a(t)  =  e(t) + o(t)
where

    even part:  e(t) =  e(-t)  =  ( a(t) + a(-t) )/2
    odd part:   o(t) = -o(-t)  =  ( a(t) - a(-t) )/2

And it's clear that

                a(-t)  =  e(t) - o(t)  .

For example, if it's a simple linear crossfade (equivalent to splicing analog tape with a diagonally-oriented razor blade):


           { 0                 for   t <= -1
           {
    a(t) = { 1/2 + t/2         for  |t| < 1
           {
           { 1                 for   t >= 1

This is represented simply, in the even and odd components, as:

    e(t) = 1/2

           { t/2               for  |t| < 1
    o(t) = {
           { sgn(t)/2          for  |t| >= 1


    where  sgn(t) is the "sign function":  sgn(t) = t/|t| .

This is a constant-voltage crossfade, appropriate for perfectly correlated signals x(t) and y(t). There is no loss of generality in defining the crossfade to take place around t=0 and to be two time units in length. Both are simply a matter of offset and scaling of time.
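A minimal Python sketch of the linear crossfade and its even/odd split, just to make the definitions concrete. The function names (linear_fade, even_part, odd_part) are my own for illustration; the lower branch is taken as t <= -1 so that a(t) rises from 0 to 1.

```python
def linear_fade(t):
    """Linear crossfade a(t): 0 for t <= -1, 1 for t >= 1, a ramp in between."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return 0.5 + 0.5 * t


def even_part(a, t):
    """e(t) = (a(t) + a(-t)) / 2"""
    return 0.5 * (a(t) + a(-t))


def odd_part(a, t):
    """o(t) = (a(t) - a(-t)) / 2"""
    return 0.5 * (a(t) - a(-t))


for t in [-2.0, -1.0, -0.5, 0.0, 0.25, 1.0, 3.0]:
    e = even_part(linear_fade, t)
    o = odd_part(linear_fade, t)
    # For this crossfade the even part stays at 1/2, so a(t) + a(-t) = 1.
    print(f"t={t:+.2f}  a={linear_fade(t):.3f}  e={e:.3f}  o={o:+.3f}")
```

The even part coming out as 1/2 everywhere is exactly the constant-voltage property a(t) + a(-t) = 1.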

Another constant-voltage crossfade would be what I might call a "Hann crossfade" (after the Hann window):


    e(t) = 1/2

           { (1/2)*sin(pi/2 * t)     for  |t| < 1
    o(t) = {
           { sgn(t)/2                for  |t| >= 1

Some might like that better because the derivative is continuous everywhere. Extending this idea, one more constant-voltage crossfade is what I might call a "Flattened Hann crossfade":


    e(t) = 1/2

           { (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t) for |t| < 1
    o(t) = {
           { sgn(t)/2                                     for |t| >= 1

This splice is everywhere continuous in the zeroth, first, and second derivative. A very smooth crossfade.
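Here is a rough Python sketch comparing the three odd parts near the edge of the crossfade. The function names are made up for illustration. Note that in the flattened Hann the sin(3*pi/2 * t) term enters with a plus sign; that is what gives o(1) = 9/16 - 1/16 = 1/2 and makes the first and second derivatives vanish at |t| = 1.

```python
import math


def sgn(t):
    """Sign function: -1, 0, or +1."""
    return (t > 0) - (t < 0)


def odd_linear(t):
    return 0.5 * t if abs(t) < 1 else 0.5 * sgn(t)


def odd_hann(t):
    if abs(t) >= 1:
        return 0.5 * sgn(t)
    return 0.5 * math.sin(0.5 * math.pi * t)


def odd_flat_hann(t):
    if abs(t) >= 1:
        return 0.5 * sgn(t)
    return ((9 / 16) * math.sin(0.5 * math.pi * t)
            + (1 / 16) * math.sin(1.5 * math.pi * t))


# How far each o(t) still is from its final value 1/2 just inside the edge.
# A smoother join at t = 1 shows up as a much smaller gap.
for name, odd in [("linear", odd_linear),
                  ("hann", odd_hann),
                  ("flattened hann", odd_flat_hann)]:
    print(f"{name:15s} 0.5 - o(0.99) = {0.5 - odd(0.99):.3e}")
```

The gap shrinks from linear to Hann to flattened Hann, which is the "smoothness" being traded for a steeper slope in the middle of the fade.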

As another example, a constant-power crossfade would be the same as any of the above, but where the above a(t) is square rooted:


           { 0                 for   t <= -1
           {
    a(t) = { sqrt(1/2 + t/2)   for  |t| < 1
           {
           { 1                 for   t >= 1

This is what we might use to splice two completely uncorrelated signals together. We can separate this into even and odd parts as:



           { (1/2)*(sqrt(1/2 + t/2) + sqrt(1/2 - t/2))   for  |t| < 1
    e(t) = {
           {  1/2                                        for  |t| >= 1


           { (1/2)*(sqrt(1/2 + t/2) - sqrt(1/2 - t/2))   for  |t| < 1
    o(t) = {
           { sgn(t)/2                                    for  |t| >= 1
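A quick Python check of the square-rooted linear crossfade, as a sketch only (power_fade is a name I made up). It confirms that the squares of the two envelopes sum to one and that the envelopes meet at sqrt(1/2).

```python
import math


def power_fade(t):
    """Square root of the linear crossfade: a constant-power envelope."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return math.sqrt(0.5 + 0.5 * t)


for t in [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0]:
    a_fwd = power_fade(t)
    a_rev = power_fade(-t)
    e = 0.5 * (a_fwd + a_rev)   # even part
    o = 0.5 * (a_fwd - a_rev)   # odd part
    power = a_fwd ** 2 + a_rev ** 2
    print(f"t={t:+.1f}  e={e:.4f}  o={o:+.4f}  a(t)^2 + a(-t)^2 = {power:.4f}")
```

Unlike the constant-voltage cases, the even part is no longer a constant 1/2; it bulges up to sqrt(1/2) at t=0.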

______________________________________________________________________

Section 2:  Which crossfade function to use?

Now we shall make a definition and an assumption. We shall define an inner product of two general signals as:


                                 +inf
    <x,y> = <x(t), y(t)>  =  integral{ x(t)*y(t) * w(t) dt}
                                 -inf

w(t) is a window function that is symmetrical about t=0 and is probably wider than the crossfade. Strictly speaking, if you were coming at this from out of a graduate course in metric spaces or functional analysis, one of the components (probably y(t)) should be complex conjugated, but since x(t) and y(t) are always real in this whole theory, I will not bother with that notation.

This inner product is a degenerate case of the more general cross-correlation, evaluated with a lag of zero:


                                        +inf
    Rxy(tau) = <x(t), y(t+tau)>  =  integral{ x(t)*y(t+tau) * w(t) dt}
                                        -inf

If y(t) is a time-offset copy of x(t), then Rxy(tau) is the autocorrelation of x(t), Rxx(tau), but also accounting for the time offset in the lag, tau.


    So  <x,y>  =  Rxy(0)

A measure of signal energy or average power is:

                         +inf
    Rxx(0) = <x,x> = integral{ (x(t))^2 * w(t) dt}
                         -inf

Now, the assumption that we are going to toss in here is that the mean powers of the two signals that we are crossfading, x(t) and y(t), are equal.


    <x,x> = <y,y>

We are assuming that we're not crossfading this very quiet tone or sound to a very loud sound that is 60 dB louder. Similarly, the resulting spliced sound, v(t), has the same mean power as the two signals being spliced:


    <v,v> = <x,x> = <y,y>

So, assuming we lined up x(t) and y(t) so that we want to splice from one to the other at t=0, and scaled x(t) and y(t) so that they have the same mean power in the neighborhood of t=0, then the inner product is a measure of how well they are correlated. We shall define this normalized measure of correlation as:


    r  =  <x,y>/<x,x>  =  <x,y>/<y,y>

If r = 1, they are perfectly correlated and if r = 0, they are completely uncorrelated.
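In discrete time the windowed inner product becomes a weighted sum, and r can be estimated along these lines. This is only a sketch: the Hann window, the test tones, and the names inner and correlation are my assumptions. I also normalize by sqrt(<x,x>*<y,y>) rather than <x,x>, which is the same thing whenever the equal-power assumption <x,x> = <y,y> holds, and the window scaling cancels in the ratio.

```python
import math


def inner(x, y, w):
    """Discrete stand-in for <x,y>: sum of x[n] * y[n] * w[n]."""
    return sum(a * b * c for a, b, c in zip(x, y, w))


def correlation(x, y, w):
    """Normalized correlation r; equals <x,y>/<x,x> when <x,x> == <y,y>."""
    return inner(x, y, w) / math.sqrt(inner(x, x, w) * inner(y, y, w))


N = 256
t = [n / N for n in range(-N, N + 1)]                    # symmetric about t=0
w = [0.5 + 0.5 * math.cos(math.pi * u) for u in t]       # Hann window (assumed)
x = [math.sin(2 * math.pi * 5 * u) for u in t]
y_quad = [math.cos(2 * math.pi * 5 * u) for u in t]      # same tone, 90 degrees off

print("r(x, x)      =", correlation(x, x, w))            # perfectly correlated
print("r(x, y_quad) =", correlation(x, y_quad, w))       # essentially uncorrelated
```

The same tone against itself lands at r = 1, and against its quadrature copy it lands at essentially zero, the two extremes from Section 0.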

We will make the additional assumption that our pitch detection algorithm will find *some* lag where the correlation is at least zero. We should not have to deal with splicing *negatively* correlated audio (that would be quite a "glitch" or a bad splice). If the signals have no DC component, then their autocorrelations (and their cross-correlations to each other) must have no DC component. That means there will be values of tau for which Rxy(tau) is negative and values for which it is positive. If it was theoretical white noise, Rxx(tau) would be zero for |tau| > 0 and Rxx(0) would be the noise variance or power. But Rxx(tau) cannot be negative for *all* values of tau, even excluding tau=0.

We can find a value of tau so that Rxx(tau) is non-negative, and we want to choose the tau that gives the highest value of Rxx(tau). Then define


    y(t)  =  x(t + tau)

and then

    <x,y>  =  Rxy(0)  =  Rxx(tau)
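A brute-force sketch of that lag search in Python, under some loud assumptions: a Hann window centered on the splice point, a caller-supplied list of candidate lags, and normalization by sqrt(<x,x>*<y,y>) so the score is r directly. None of the names (hann, normalized_rxx, best_lag) come from any real pitch detector; a practical one would be far more efficient than this.

```python
import math


def hann(n, half_width):
    """Symmetric Hann window w[n], zero for |n| >= half_width."""
    if abs(n) >= half_width:
        return 0.0
    return 0.5 + 0.5 * math.cos(math.pi * n / half_width)


def normalized_rxx(x, center, lag, half_width):
    """Windowed correlation of x[n] against y[n] = x[n + lag] around center."""
    sxy = sxx = syy = 0.0
    for n in range(-half_width + 1, half_width):
        w = hann(n, half_width)
        a = x[center + n]
        b = x[center + n + lag]
        sxy += a * b * w
        sxx += a * a * w
        syy += b * b * w
    return sxy / math.sqrt(sxx * syy)


def best_lag(x, center, lags, half_width):
    """Return (r, lag) for the candidate lag with the highest correlation."""
    return max((normalized_rxx(x, center, lag, half_width), lag) for lag in lags)


# Test signal: two harmonics with a period of 50 samples.
x = [math.sin(2 * math.pi * n / 50) + 0.5 * math.sin(4 * math.pi * n / 50)
     for n in range(400)]
r, lag = best_lag(x, center=200, lags=range(20, 81), half_width=100)
print("best lag:", lag, " r:", r)
```

For this perfectly periodic test signal the search settles on the period itself, with r at essentially 1; real audio would give something lower, and that value of r is what the rest of the theory uses.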

Now we shall also assume that the crossfade function, a(t), is completely uncorrelated and even statistically independent from the two signals being spliced. a(t) is a volume control that varies in time, but is unaffected by anything in x(t) or y(t).

We shall also assume something called "ergodicity". This means that *time* averages of x(t) and y(t) (or combinations of x(t) and y(t)) are equal to *statistical* averages. If this window, w(t), is scaled (or normalized) so that its integral is 1,


         +inf
     integral{ w(t) dt} = 1
         -inf

then all these inner products can be related to "expectation values":

     <x,y> = E{ x(t) * y(t) }

If x(t) and y(t) are thought of as sorta "random" processes (rather than well defined deterministic functions), the expectation value is unmoved no matter what t is. But if the envelope a(t) is considered deterministic, then it simply scales x(t) or y(t) and is treated as a constant in the expectation. So at some particular time t0,


     <a(t0)*x,y>  =  E{ (a(t0)*x(t)) * y(t) }

                  =  a(t0) * E{ x(t) * y(t) }

                  = a(t0) * <x,y>

This is a little sloppy, mathematically, because I am "fixing" t for a(t) to be t0, but not fixing t for x(t) or y(t) (so that "time averages" for x(t) and y(t) can be meaningful and equated to statistical averages).


Recall that

    v(t)  =  a(t)*x(t) + a(-t)*y(t)

Then:

    <v,v> =  <(a(t)*x(t) + a(-t)*y(t)), (a(t)*x(t) + a(-t)*y(t))>

Using identities that we can apply to expectation values

    <v,v> = (a(t))^2*<x,x>  +  2*a(t)*a(-t)*<x,y>  +  (a(-t))^2*<y,y>

Since <v,v> = <x,x> = <y,y>, we can divide by <v,v> and get to the key equation of this whole theory:


    1  =  (a(t))^2  +  2*r*a(t)*a(-t)  +  (a(-t))^2

Given the normalized correlation measure, we want the above equation to be true all of the time. If r=0 (completely uncorrelated), one can see we get a constant-power crossfade:


    (a(t))^2 + (a(-t))^2  =  1

If r=1 (completely correlated), one can see that we get a constant-voltage crossfade:


    (a(t))^2 + (a(-t))^2 + 2*a(t)*a(-t)  =  ( a(t) + a(-t) )^2  =  1

or, assuming a(t) is non-negative,

    a(t) + a(-t) = 1 .

______________________________________________________________________

Section 3:  Generalizing the crossfade function

Recall that

                a(t)   =  e(t) + o(t)

                a(-t)  =  e(t) - o(t)

and substituting into

    (a(t))^2  +  (a(-t))^2  +  2*r*a(t)*a(-t)  =  1

results in

     (e(t) + o(t))^2  +  (e(t) - o(t))^2
          +  2*r*(e(t) + o(t))*(e(t) - o(t))  =  1

Blasting through that gets:

    (1+r)*(e(t))^2  +  (1-r)*(o(t))^2  =  1/2
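For anyone who wants the "blasting through" spelled out, here are the intermediate steps in LaTeX form, writing e and o for e(t) and o(t). Nothing new is assumed; it is just the algebra.

```latex
\begin{align*}
(e+o)^2 + (e-o)^2 &= 2e^2 + 2o^2 \\
2r\,(e+o)(e-o)    &= 2r\,(e^2 - o^2) \\
\text{so}\quad 2(1+r)\,e^2 + 2(1-r)\,o^2 &= 1 \\
(1+r)\,e^2 + (1-r)\,o^2 &= \tfrac{1}{2}
\end{align*}
```

The cross terms 2*e*o cancel in the first line, which is why the even and odd parts separate so cleanly.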

This means that, if r is measured and known (from the correlation function), we have the freedom to define either one of e(t) or o(t) arbitrarily (as long as the even or odd symmetry is kept) and solve for the other. We can see that square rooting is involved in solving for either e(t) or o(t) and there is an ambiguity for which sign to pick. We shall resolve that ambiguity by adding the additional assumption that the even-symmetry component, e(t), is non-negative.


    e(t)  =  e(-t)  >=  0

Given a general and bipolar odd-symmetry component function,

    o(t)  =  -o(-t)

then we solve for the even component (picking the non-negative square root):


    e(t)  =  sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )

The overall crossfade envelope would be

    a(t)  =  e(t)  +  o(t)

          =  sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )  +  o(t)
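A short Python sketch of that general envelope, using the linear o(t) as the odd part. The names odd_linear and crossfade are illustrative only, and the max(..., 0.0) guard against rounding is my own addition. The loop checks the key equation 1 = a(t)^2 + 2*r*a(t)*a(-t) + a(-t)^2 numerically for a few values of r.

```python
import math


def odd_linear(t):
    """Odd part of the linear crossfade: t/2 inside, sgn(t)/2 outside."""
    if abs(t) >= 1.0:
        return math.copysign(0.5, t)
    return 0.5 * t


def crossfade(t, r, odd=odd_linear):
    """a(t) = e(t) + o(t), with e(t) solved from (1+r)e^2 + (1-r)o^2 = 1/2."""
    o = odd(t)
    e_squared = (0.5 - (1.0 - r) * o * o) / (1.0 + r)
    e = math.sqrt(max(e_squared, 0.0))   # non-negative root; guard rounding
    return e + o


for r in [0.0, 0.5, 1.0]:
    for t in [-1.0, -0.4, 0.0, 0.7, 1.0]:
        a_fwd = crossfade(t, r)
        a_rev = crossfade(-t, r)
        total = a_fwd ** 2 + 2.0 * r * a_fwd * a_rev + a_rev ** 2
        print(f"r={r:.1f}  t={t:+.1f}  a(t)={a_fwd:.4f}  key eq. = {total:.6f}")
```

At r=1 this collapses to the plain linear crossfade, 1/2 + o(t). At r=0 it is a constant-power crossfade, though not the same curve as sqrt(1/2 + t/2); both satisfy a(t)^2 + a(-t)^2 = 1.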

______________________________________________________________________

Section 4:  Implementation:

Given a particular form for the odd part, o(t) (linear or Hann or Flattened Hann or whatever is your heart's desire), and for a variety of values of r, ranging from r=0 to r=1, a collection of envelope functions, a(t), is pre-calculated and stored in memory. Then, when pitch detection or loop matching is done, a splice displacement that is optimal is determined, and if autocorrelation of some form is used in determining a measure of goodness (or seamlessness, using Element's language) of that loop splice, that autocorrelation is normalized (by dividing by Rxx(0)) to get r, and that value of r is used to choose which pre-calculated a(t) from the above collection is used for the crossfade in the splice.
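One way the table lookup might be sketched in Python. Everything specific here is an assumption for illustration: the table resolution R_STEPS, the crossfade length FADE_LEN, nearest-neighbor selection in r, clamping r to [0, 1], and the linear odd part.

```python
import math


def odd_linear(t):
    return math.copysign(0.5, t) if abs(t) >= 1.0 else 0.5 * t


def envelope(r, num_points, odd=odd_linear):
    """Sample a(t) = e(t) + o(t) at num_points values of t spanning [-1, 1]."""
    env = []
    for k in range(num_points):
        t = -1.0 + 2.0 * k / (num_points - 1)
        o = odd(t)
        e = math.sqrt(max((0.5 - (1.0 - r) * o * o) / (1.0 + r), 0.0))
        env.append(e + o)
    return env


R_STEPS = 8      # hypothetical table resolution in r
FADE_LEN = 65    # hypothetical crossfade length in samples
TABLE = [envelope(i / R_STEPS, FADE_LEN) for i in range(R_STEPS + 1)]


def pick_envelope(r):
    """Nearest pre-calculated envelope for a measured r, clamped to [0, 1]."""
    r = min(max(r, 0.0), 1.0)
    return TABLE[round(r * R_STEPS)]


def splice(x, y, r):
    """v[n] = a(t_n)*x[n] + a(-t_n)*y[n] over FADE_LEN aligned samples."""
    a = pick_envelope(r)
    # a[-1 - n] is a(-t_n) because the t grid is symmetric about zero.
    return [a[n] * x[n] + a[-1 - n] * y[n] for n in range(FADE_LEN)]


ones = [1.0] * FADE_LEN
v = splice(ones, ones, r=1.0)
print("min/max of spliced identical signals:", min(v), max(v))
print("midpoint gain at r=0:", pick_envelope(0.0)[FADE_LEN // 2])
```

Since a(t) rises from 0 to 1, this output starts as y and ends as x. Interpolating between neighboring table entries instead of picking the nearest one would be an obvious refinement.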


______________________________________________________________________


--

r b-j                  r...@audioimagination.com

"Imagination is more important than knowledge."




