This is a continuation of the thread started by Element Green titled: Algorithms for finding seamless loops in audio

As far as I know, it is not published anywhere. A few years ago, I was thinking of writing this up and publishing it (or submitting it for publication, probably to JAES), and had let it fall by the wayside. I'm "publishing" the main ideas here on music-dsp because of some possible interest here (and the hope it might be helpful to somebody), and so that "prior art" is established in case anyone like IVL is thinking of claiming it as their own. I really do not know how useful it will be in practice. It might not make any difference. It's just a theory.

______________________________________________________________________

Section 0:

This is about the generalization of the different ways we can splice and crossfade audio that has these two extremes:

   (1)  Splicing perfectly coherent and correlated signals
   (2)  Splicing completely uncorrelated signals

I sometimes call the first case the "constant-voltage crossfade" because the crossfade envelopes of the two signals being spliced add up to one. The two envelopes meet when both have a value of 1/2. In the second case, we use a "constant-power crossfade": the squares of the two envelopes add to one, and they meet when both have a value of sqrt(1/2), about 0.707.

The questions I wanted to answer are: What does one do for cases in between, and how does one know from the audio, which crossfade function to use? How does one quantify the answers to these questions? How much can we generalize the answer?

______________________________________________________________________

Section 1: Set up the problem.

We have two continuous-time audio signals, x(t) and y(t), and we want to splice from one to the other at time t=0. In pitch-shifting or time-scaling or any other looping, y(t) can be some delayed or advanced version of x(t).

    e.g.    y(t) = x(t-P)

where P is a period length or some other "good" splice displacement. We get that value from an algorithm we call a "pitch detector".

Also, it doesn't matter whether x(t) is getting spliced to y(t) or the other way around, it should work just as well for the audio played in reverse. And it should be no loss of generality that the splice happens at t=0, we define our coordinate system any damn way we damn well please.

The signal resulting from the splice is

    v(t)  =  a(t)*x(t) + a(-t)*y(t)

By restricting our result to be equivalent if run either forward or backward in time, we can conclude that the "fade-out" function, a(-t), is the time-reversed copy of the "fade-in" function, a(t).

For the correlated case   (1):   a(t)    +  a(-t)    = 1   for all t

For the uncorrelated case (2):  (a(t))^2 + (a(-t))^2 = 1   for all t

This crossfade function, a(t), has well-defined even and odd symmetry components:

                a(t)  =  e(t) + o(t)
where

    even part:  e(t) =  e(-t)  =  ( a(t) + a(-t) )/2
    odd part:   o(t) = -o(-t)  =  ( a(t) - a(-t) )/2

And it's clear that

                a(-t)  =  e(t) - o(t)  .


For example, if it's a simple linear crossfade (equivalent to splicing analog tape with a diagonally-oriented razor blade):

           { 0                 for   t <= -1
           {
    a(t) = { 1/2 + t/2         for  |t| < 1
           {
           { 1                 for   t >= 1

This is represented simply, in the even and odd components, as:

    e(t) = 1/2

           { t/2               for  |t| < 1
    o(t) = {
           { sgn(t)/2          for  |t| >= 1


    where  sgn(t) is the "sign function":  sgn(t) = t/|t| .

This is a constant-voltage crossfade, appropriate for perfectly correlated signals, x(t) and y(t). There is no loss of generality in defining the crossfade to take place around t=0 and to be two time units in length. Both are simply a matter of offset and scaling of time.

Another constant-voltage crossfade would be what I might call a "Hann crossfade" (after the Hann window):

    e(t) = 1/2

           { (1/2)*sin(pi/2 * t)     for  |t| < 1
    o(t) = {
           { sgn(t)/2                for  |t| >= 1


Some might like that better because the derivative is continuous everywhere. Extending this idea, one more constant-voltage crossfade is what I might call a "Flattened Hann crossfade":

    e(t) = 1/2

           { (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t) for |t| < 1
    o(t) = {
           { sgn(t)/2                                     for |t| >= 1

This splice is everywhere continuous in the zeroth, first, and second derivative. A very smooth crossfade.
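
As a rough illustration (not part of the theory itself), the three constant-voltage crossfades above can be sketched in Python. The function names are made up for this example. The third-harmonic term of the Flattened Hann is written with the sign that makes o(t) reach exactly 1/2 at t = 1, which continuity with sgn(t)/2 requires.

```python
import math

def sgn(t):
    """Sign function: -1, 0 or +1."""
    return (t > 0) - (t < 0)

def odd_linear(t):
    """Odd part of the linear crossfade."""
    return t / 2 if abs(t) < 1 else sgn(t) / 2

def odd_hann(t):
    """Odd part of the "Hann crossfade"."""
    return 0.5 * math.sin(math.pi / 2 * t) if abs(t) < 1 else sgn(t) / 2

def odd_flat_hann(t):
    """Odd part of the "Flattened Hann crossfade".
    Sign of the second term chosen so that o(1) = 1/2 (assumption stated above)."""
    if abs(t) >= 1:
        return sgn(t) / 2
    return (9 / 16) * math.sin(math.pi / 2 * t) + (1 / 16) * math.sin(3 * math.pi / 2 * t)

def constant_voltage(odd, t):
    """a(t) = e(t) + o(t) with the even part fixed at e(t) = 1/2."""
    return 0.5 + odd(t)
```

With any of these, constant_voltage(odd, t) + constant_voltage(odd, -t) comes out as 1 for every t, which is the defining property of case (1).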

As another example, a constant-power crossfade would be the same as any of the above, but where the above a(t) is square rooted:

           { 0                 for   t <= -1
           {
    a(t) = { sqrt(1/2 + t/2)   for  |t| < 1
           {
           { 1                 for   t >= 1

This is what we might use to splice two completely uncorrelated signals together. We can separate this into even and odd parts as:


           { (1/2)*(sqrt(1/2 + t/2) + sqrt(1/2 - t/2))   for  |t| < 1
    e(t) = {
           {  1/2                                        for  |t| >= 1


           { (1/2)*(sqrt(1/2 + t/2) - sqrt(1/2 - t/2))   for  |t| < 1
    o(t) = {
           { sgn(t)/2                                    for  |t| >= 1
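
A small Python sketch of the square-rooted linear crossfade and its even/odd split, just to make the formulas above concrete. The names a_power and even_odd are invented for this example.

```python
import math

def a_power(t):
    """Constant-power envelope: square root of the linear crossfade."""
    if t <= -1:
        return 0.0
    if t >= 1:
        return 1.0
    return math.sqrt(0.5 + t / 2)

def even_odd(a, t):
    """Split an envelope a(.) into its even and odd parts at time t."""
    e = (a(t) + a(-t)) / 2
    o = (a(t) - a(-t)) / 2
    return e, o
```

For any t, a_power(t)**2 + a_power(-t)**2 evaluates to 1 (up to rounding), and at t = 0 the even part is sqrt(1/2) while the odd part is 0. Note that here the even part is no longer the constant 1/2 it was in the constant-voltage case.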

______________________________________________________________________

Section 2:  Which crossfade function to use?

Now we shall make a definition and an assumption. We shall define an inner product of two general signals as:

                                 +inf
    <x,y> = <x(t), y(t)>  =  integral{ x(t)*y(t) * w(t) dt}
                                 -inf

w(t) is a window function that is symmetrical about t=0 and is probably wider than the crossfade. Strictly speaking, if you were coming at this from out of a graduate course in metric spaces or functional analysis, one of the components (probably y(t)) should be complex conjugated, but since x(t) and y(t) are always real, in this whole theory, I will not bother with that notation.

This inner product is a degenerate case of the more general cross-correlation evaluated with a lag of zero:

                                        +inf
    Rxy(tau) = <x(t), y(t+tau)>  =  integral{ x(t)*y(t+tau) * w(t) dt}
                                        -inf

If y(t) is a time-offset copy of x(t), then Rxy(tau) is the autocorrelation of x(t), Rxx(tau), but also accounting for the time offset in the lag, tau.

    So  <x,y>  =  Rxy(0)

A measure of signal energy or average power is:

                         +inf
    Rxx(0) = <x,x> = integral{ (x(t))^2 * w(t) dt}
                         -inf

Now, the assumption that we are going to toss in here is that the mean power of the two signals that we are crossfading, x(t) and y(t), are equal.

    <x,x> = <y,y>

We are assuming that we're not crossfading this very quiet tone or sound to a very loud sound that is 60 dB louder. Similarly, the resulting spliced sound, v(t), has the same mean power as the two signals being spliced:

    <v,v> = <x,x> = <y,y>

So, assuming we lined up x(t) and y(t) so that we want to splice from one to the other at t=0, and scaled x(t) and y(t) so that they have the same mean power in the neighborhood of t=0, then the inner product is a measure of how well they are correlated. We shall define this normalized measure of correlation as:

    r  =  <x,y>/<x,x>  =  <x,y>/<y,y>

If r = 1, they are perfectly correlated and if r = 0, they are completely uncorrelated.
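
Here is a sketch of how r might be measured on sampled audio, in Python. Everything specific in it is an assumption for illustration: the integral becomes a sum, the window is a Hann shape scaled to sum to 1, and the denominator is sqrt(<x,x>*<y,y>) rather than <x,x>. The two agree when <x,x> = <y,y>, and the square-root form is only a guard for when the powers are not exactly equal.

```python
import math

def inner(x, y, w):
    """Discrete stand-in for <x,y>: sum of x[n]*y[n]*w[n]."""
    return sum(xi * yi * wi for xi, yi, wi in zip(x, y, w))

def hann_window(n):
    """Symmetric Hann-shaped window of n samples, scaled to sum to 1."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * (k + 0.5) / n) for k in range(n)]
    total = sum(w)
    return [wk / total for wk in w]

def normalized_correlation(x, y, w):
    """r = <x,y> / sqrt(<x,x>*<y,y>), equal to <x,y>/<x,x> when powers match."""
    return inner(x, y, w) / math.sqrt(inner(x, x, w) * inner(y, y, w))

# Example: a sine against itself, and against a cosine of the same frequency.
n = 1000
w = hann_window(n)
s = [math.sin(2 * math.pi * 10 * (k + 0.5) / n) for k in range(n)]
c = [math.cos(2 * math.pi * 10 * (k + 0.5) / n) for k in range(n)]
r_same = normalized_correlation(s, s, w)   # perfectly correlated
r_quad = normalized_correlation(s, c, w)   # quadrature, essentially uncorrelated
```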

We will make the additional assumption that our pitch detection algorithm will find *some* lag where the correlation is at least zero. We should not have to deal with splicing *negatively* correlated audio (that would be quite a "glitch" or a bad splice). If the signals have no DC component, then their autocorrelations (and their cross-correlations to each other) must have no DC component. That means there will be values of tau for which Rxy(tau) is negative and others for which it is positive. If it was theoretical white noise, Rxx(tau) would be zero for |tau| > 0 and Rxx(0) would be the noise variance or power. But Rxx(tau) cannot be negative for *all* values of tau, even excluding tau=0.

We can find a value of tau so that Rxx(tau) is non-negative, and we want to choose the tau that gives the highest value of Rxx(tau). Then define

    y(t)  =  x(t + tau)

and then

    <x,y>  =  Rxy(0)  =  Rxx(tau)

Now we shall also assume that the crossfade function, a(t), is completely uncorrelated and even statistically independent from the two signals being spliced. a(t) is a volume control that varies in time, but is unaffected by anything in x(t) or y(t).
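
The lag search described a little earlier (pick the tau with the highest Rxx(tau), then set y(t) = x(t + tau)) could be sketched like this in Python. This is a brute-force illustration, not a real pitch detector. The names, the flat window, and the min_lag/max_lag search limits (used to keep the trivial lag of zero out of the search) are all assumptions of this example.

```python
import math

def windowed_autocorr(x, center, lag, w):
    """Discrete Rxx(lag): sum of x[n]*x[n+lag]*w[n], window centered on `center`."""
    start = center - len(w) // 2
    return sum(x[start + k] * x[start + k + lag] * w[k] for k in range(len(w)))

def best_lag(x, center, w, min_lag, max_lag):
    """Return (lag, r): the lag in [min_lag, max_lag] with the largest Rxx(lag),
    and r = Rxx(lag)/Rxx(0) at that lag."""
    r0 = windowed_autocorr(x, center, 0, w)
    lag = max(range(min_lag, max_lag + 1),
              key=lambda m: windowed_autocorr(x, center, m, w))
    return lag, windowed_autocorr(x, center, lag, w) / r0

# Example: a sine with a period of 50 samples, flat window of 200 samples.
x = [math.sin(2 * math.pi * n / 50) for n in range(1000)]
w = [1 / 200] * 200
lag, r = best_lag(x, center=400, w=w, min_lag=20, max_lag=80)
```

For this test tone the search lands on the period (lag = 50) with r very close to 1, so a constant-voltage crossfade would be the right choice there.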

We shall also assume something called "ergodicity". This means that *time* averages of x(t) and y(t) (or combinations of x(t) and y(t)) are equal to *statistical* averages. If this window, w(t), is scaled (or normalized) so that its integral is 1,

         +inf
     integral{ w(t) dt} = 1
         -inf

then all these inner products can be related to "expectation values":

     <x,y> = E{ x(t) * y(t) }

If x(t) and y(t) are thought of as sorta "random" processes (rather than well defined deterministic functions), the expectation value is unmoved no matter what t is. But if the envelope a(t) is considered deterministic, then it simply scales x(t) or y(t) and is treated as a constant in the expectation. So at some particular time t0,

     <a(t0)*x,y>  =  E{ (a(t0)*x(t)) * y(t) }

                  =  a(t0) * E{ x(t) * y(t) }

                  = a(t0) * <x,y>

This is a little sloppy, mathematically, because I am "fixing" t for a(t) to be t0, but not fixing t for x(t) or y(t) (so that "time averages" for x(t) and y(t) can be meaningful and equated to statistical averages).

Recall that

    v(t)  =  a(t)*x(t) + a(-t)*y(t)

Then:

    <v,v> =  <(a(t)*x(t) + a(-t)*y(t)), (a(t)*x(t) + a(-t)*y(t))>

Using identities that we can apply to expectation values

    <v,v> = (a(t))^2*<x,x>  +  2*a(t)*a(-t)*<x,y>  +  (a(-t))^2*<y,y>

Since <v,v> = <x,x> = <y,y>, we can divide by <v,v> and get to the key equation of this whole theory:

    1  =  (a(t))^2  +  2*r*a(t)*a(-t)  +  (a(-t))^2

Given the normalized correlation measure, we want the above equation to be true all of the time. If r=0 (completely uncorrelated), one can see we get a constant-power crossfade:

    (a(t))^2 + (a(-t))^2  =  1

If r=1 (completely correlated), one can see that we get a constant-voltage crossfade:

    (a(t))^2 + (a(-t))^2 + 2*a(t)*a(-t)  =  ( a(t) + a(-t) )^2  =  1

or, assuming a(t) is non-negative,

    a(t) + a(-t) = 1 .
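
A quick numeric check of the key equation at the middle of the splice, in Python (function names invented here). At t=0 both envelopes are equal, a(0) = a(-0), so the key equation reduces to 2*(1+r)*(a(0))^2 = 1, which gives the value where the two envelopes should meet for any r.

```python
import math

def splice_power(a_fwd, a_rev, r):
    """Right-hand side of the key equation: relative power of v(t)
    for envelope values a(t) = a_fwd and a(-t) = a_rev."""
    return a_fwd ** 2 + 2 * r * a_fwd * a_rev + a_rev ** 2

def midpoint_gain(r):
    """Value of a(0) that keeps the relative power at 1 when a(0) = a(-0)."""
    return 1 / math.sqrt(2 * (1 + r))

# Using the constant-voltage midpoint (1/2) on uncorrelated audio (r = 0)
# leaves only half the power at the center of the splice.
dip_db = 10 * math.log10(splice_power(0.5, 0.5, 0.0))
```

midpoint_gain(1) is 1/2 and midpoint_gain(0) is sqrt(1/2), matching the two extremes, and dip_db comes out at about -3 dB, which is the familiar "hole" heard when a linear crossfade is used on unrelated material.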

______________________________________________________________________

Section 3:  Generalizing the crossfade function

Recall that

                a(t)   =  e(t) + o(t)

                a(-t)  =  e(t) - o(t)

and substituting into

    (a(t))^2  +  (a(-t))^2  +  2*r*a(t)*a(-t)  =  1

results in

     (e(t) + o(t))^2  +  (e(t) - o(t))^2
          +  2*r*(e(t) + o(t))*(e(t) - o(t))  =  1

Blasting through that gets:

    (1+r)*(e(t))^2  +  (1-r)*(o(t))^2  =  1/2


This means that, if r is measured and known (from the correlation function) we have the freedom to define either one of e(t) or o(t) arbitrarily (as long as the even or odd symmetry is kept) and solve for the other. We can see that square rooting is involved in solving for either e(t) or o(t) and there is an ambiguity for which sign to pick. We shall resolve that ambiguity by adding the additional assumption that the even-symmetry component, e(t), is non-negative.

    e(t)  =  e(-t)  >=  0

Given a general and bipolar odd-symmetry component function,

    o(t)  =  -o(-t)

then we solve for the even component (picking the non-negative square root):

    e(t)  =  sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )

The overall crossfade envelope would be

    a(t)  =  e(t)  +  o(t)

          =  sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )  +  o(t)
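
The general envelope can be sketched directly from that last formula. This Python version assumes the Hann odd part as the default choice of o(t) and clamps the radicand at zero purely as a guard against rounding; both are choices made for this example.

```python
import math

def odd_hann(t):
    """Odd part of the Hann crossfade, with |o(t)| <= 1/2."""
    if abs(t) >= 1:
        return math.copysign(0.5, t)
    return 0.5 * math.sin(math.pi / 2 * t)

def envelope(t, r, odd=odd_hann):
    """a(t) = e(t) + o(t), with e(t) solved from
    (1+r)*e^2 + (1-r)*o^2 = 1/2, taking the non-negative root."""
    o = odd(t)
    radicand = 0.5 / (1 + r) - (1 - r) / (1 + r) * o * o
    e = math.sqrt(max(0.0, radicand))  # max() only guards against rounding
    return e + o
```

For r=1 this collapses to 1/2 + o(t), the constant-voltage crossfade, and for r=0 it passes through sqrt(1/2) at t=0, the constant-power case. For any r in between, a(t) still runs from 0 at t=-1 to 1 at t=+1, because o = 1/2 or -1/2 always gives e = 1/2.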

______________________________________________________________________

Section 4:  Implementation:

Given a particular form for the odd part, o(t) (linear or Hann or Flattened Hann or whatever is your heart's desire), a collection of envelope functions, a(t), is pre-calculated for a variety of values of r, ranging from r=0 to r=1, and stored in memory. Then, when pitch detection or loop matching is done, an optimal splice displacement is determined. If autocorrelation of some form is used in determining a measure of goodness (or seamlessness, using Element's language) of that loop splice, that autocorrelation is normalized (by dividing by Rxx(0)) to get r. That value of r is then used to choose which pre-calculated a(t) from the above collection is used for the crossfade in the splice.
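
One possible shape for that table scheme, sketched in Python. The table sizes (11 values of r, 65 points per envelope), the Hann odd part, the nearest-neighbor choice of table, and the clamping of r to [0, 1] are all assumptions of this example, not something the theory dictates.

```python
import math

def odd_hann(t):
    """Odd part of the Hann crossfade."""
    if abs(t) >= 1:
        return math.copysign(0.5, t)
    return 0.5 * math.sin(math.pi / 2 * t)

def envelope(t, r, odd=odd_hann):
    """a(t) = e(t) + o(t) with e(t) = sqrt((1/2)/(1+r) - (1-r)/(1+r)*o(t)^2)."""
    o = odd(t)
    e = math.sqrt(max(0.0, 0.5 / (1 + r) - (1 - r) / (1 + r) * o * o))
    return e + o

def build_tables(num_r=11, num_points=65, odd=odd_hann):
    """Pre-calculate a(t) on -1 <= t <= 1 for r = 0, 1/(num_r-1), ..., 1."""
    ts = [-1 + 2 * k / (num_points - 1) for k in range(num_points)]
    return [[envelope(t, i / (num_r - 1), odd) for t in ts]
            for i in range(num_r)]

def pick_table(tables, r):
    """Choose the pre-calculated envelope whose r is nearest the measured r."""
    r = min(1.0, max(0.0, r))  # negative or >1 measurements are clamped
    return tables[round(r * (len(tables) - 1))]
```

At splice time one would measure r, call pick_table(tables, r) to get the fade-in samples, and read the same table backwards for the fade-out, since the fade-out is a(-t).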

______________________________________________________________________


--

r b-j                  r...@audioimagination.com

"Imagination is more important than knowledge."



