< a few mistakes are spotted and corrected before i forget >
This is a continuation of the thread started by Element Green titled:
Algorithms for finding seamless loops in audio
As far as I know, it is not published anywhere. A few years ago, I
was thinking of writing this up and publishing it (or submitting it
for publication, probably to JAES), and had let it fall by the
wayside. I'm "publishing" the main ideas here on music-dsp because of
some possible interest here (and the hope it might be helpful to
somebody), and so that "prior art" is established in case anyone
like IVL is thinking of claiming it as their own. I really do not
know how useful it will be in practice. It might not make any
difference. It's just a theory.
______________________________________________________________________
Section 0:
This is about the generalization of the different ways we can splice
and crossfade audio that has these two extremes:
(1) Splicing perfectly coherent and correlated signals
(2) Splicing completely uncorrelated signals
I sometimes call the first case the "constant-voltage crossfade"
because the crossfade envelopes of the two signals being spliced add
up to one. The two envelopes meet when both have a value of 1/2. In
the second case, we use a "constant-power crossfade": the squares of
the two envelopes add up to one and they meet when both have a value
of sqrt(1/2), which is about 0.707.
The questions I wanted to answer are: What does one do for cases in
between, and how does one know, from the audio, which crossfade
function to use? How does one quantify the answers to these
questions? How much can we generalize the answer?
______________________________________________________________________
Section 1: Set up the problem.
We have two continuous-time audio signals, x(t) and y(t), and we want
to splice from one to the other at time t=0. In pitch-shifting or
time-scaling or any other looping, y(t) can be some delayed or
advanced version of x(t).
e.g. y(t) = x(t-P)
where P is a period length or some other "good" splice
displacement. We get that value from an algorithm we call a "pitch
detector".
Also, it doesn't matter whether x(t) is getting spliced to y(t) or the
other way around, it should work just as well for the audio played in
reverse. And it should be no loss of generality that the splice
happens at t=0, we define our coordinate system any damn way we damn
well please.
The signal resulting from the splice is
v(t) = a(t)*x(t) + a(-t)*y(t)
By restricting our result to be equivalent if run either forward or
backward in time, we can conclude that the "fade-in" function (say
that's a(t)) is the time-reversed copy of the "fade-out" function,
a(-t).
For the correlated case (1): a(t) + a(-t) = 1 for all t
For the uncorrelated case (2): (a(t))^2 + (a(-t))^2 = 1 for all t
This crossfade function, a(t), has well-defined even and odd symmetry
components:
a(t) = e(t) + o(t)
where
even part: e(t) = e(-t) = ( a(t) + a(-t) )/2
odd part: o(t) = -o(-t) = ( a(t) - a(-t) )/2
And it's clear that
a(-t) = e(t) - o(t) .
For example, if it's a simple linear crossfade (equivalent to splicing
analog tape with a diagonally-oriented razor blade):
           { 0               for  t <= -1
           {
    a(t) = { 1/2 + t/2       for  -1 < t < 1
           {
           { 1               for  t >= 1
This is represented simply, in the even and odd components, as:
    e(t) = 1/2

           { t/2             for  |t| < 1
    o(t) = {
           { sgn(t)/2        for  |t| >= 1
where sgn(t) is the "sign function": sgn(t) = t/|t| .
This is a constant-voltage crossfade, appropriate for perfectly
correlated signals, x(t) and y(t). There is no loss of generality in
defining the crossfade to take place around t=0 and to be two time
units in length. Both are simply a matter of offset and scaling of
time.
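As a quick sanity check on the even/odd split, here is a minimal
Python sketch. The function names are made up for illustration and are
not part of the original write-up. It evaluates the linear crossfade
and splits it numerically, which should reproduce e(t) = 1/2 and the
clipped ramp for o(t).

```python
def linear_fade(t):
    """Linear crossfade envelope a(t): 0 for t <= -1, 1 for t >= 1."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return 0.5 + 0.5 * t


def even_part(a, t):
    """e(t) = ( a(t) + a(-t) )/2"""
    return 0.5 * (a(t) + a(-t))


def odd_part(a, t):
    """o(t) = ( a(t) - a(-t) )/2"""
    return 0.5 * (a(t) - a(-t))


# Print t, e(t), o(t) and the reconstructed a(t) = e(t) + o(t).
for t in (-1.5, -0.5, 0.0, 0.5, 1.5):
    e = even_part(linear_fade, t)
    o = odd_part(linear_fade, t)
    print(t, e, o, e + o)
```

The same two helpers work on any envelope a(t), so they can be reused
to inspect the other crossfades in this section.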
Another constant-voltage crossfade would be what I might call a "Hann
crossfade" (after the Hann window):
    e(t) = 1/2

           { (1/2)*sin(pi/2 * t)   for  |t| < 1
    o(t) = {
           { sgn(t)/2              for  |t| >= 1
Some might like that better because the derivative is continuous
everywhere. Extending this idea, one more constant-voltage crossfade
is what I might call a "Flattened Hann crossfade":
    e(t) = 1/2

           { (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t)   for  |t| < 1
    o(t) = {
           { sgn(t)/2                                        for  |t| >= 1
This splice is everywhere continuous in the zeroth, first, and second
derivative. A very smooth crossfade.
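The three odd parts so far (linear, Hann, Flattened Hann) can be
compared with a small Python sketch. The function names and the step
size h are my own choices for illustration. Clamping t to [-1, 1]
reproduces the sgn(t)/2 branch, since each formula reaches exactly 1/2
at t = 1. The one-sided slope just inside t = 1 shows how smoothly each
one lands on the flat part.

```python
import math


def odd_linear(t):
    """o(t) = t/2 for |t| < 1, sgn(t)/2 otherwise."""
    return 0.5 * max(-1.0, min(1.0, t))


def odd_hann(t):
    """o(t) = (1/2)*sin(pi/2 * t) for |t| < 1, sgn(t)/2 otherwise."""
    t = max(-1.0, min(1.0, t))
    return 0.5 * math.sin(0.5 * math.pi * t)


def odd_flat_hann(t):
    """o(t) = (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t) for |t| < 1."""
    t = max(-1.0, min(1.0, t))
    return (9.0 / 16.0) * math.sin(0.5 * math.pi * t) \
        + (1.0 / 16.0) * math.sin(1.5 * math.pi * t)


def edge_slope(odd, h=1e-3):
    """Backward-difference slope of o(t) just inside the edge t = 1."""
    return (odd(1.0) - odd(1.0 - h)) / h


for name, odd in (("linear", odd_linear),
                  ("Hann", odd_hann),
                  ("Flattened Hann", odd_flat_hann)):
    print(name, odd(1.0), edge_slope(odd))
```

The linear ramp arrives with a slope of 1/2 (a corner), while the two
sinusoidal versions arrive with a slope that shrinks toward zero as h
gets smaller, the Flattened Hann much faster than the Hann.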
As another example, a constant-power crossfade would be the same as
any of the above, but where the above a(t) is square rooted:
           { 0                 for  t <= -1
           {
    a(t) = { sqrt(1/2 + t/2)   for  -1 < t < 1
           {
           { 1                 for  t >= 1
This is what we might use to splice two completely uncorrelated signals
together. We can separate this into even and odd parts as:
           { (1/2)*(sqrt(1/2 + t/2) + sqrt(1/2 - t/2))   for  |t| < 1
    e(t) = {
           { 1/2                                         for  |t| >= 1

           { (1/2)*(sqrt(1/2 + t/2) - sqrt(1/2 - t/2))   for  |t| < 1
    o(t) = {
           { sgn(t)/2                                    for  |t| >= 1
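A short Python sketch (names are illustrative only) can confirm that
the square-rooted linear envelope really is constant-power, and that
its even part is no longer the constant 1/2 inside the crossfade:

```python
import math


def power_fade(t):
    """Square root of the linear crossfade: a(t) = sqrt(1/2 + t/2) for |t| < 1."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return math.sqrt(0.5 + 0.5 * t)


def power_sum(t):
    """(a(t))^2 + (a(-t))^2, which should equal 1 for every t."""
    return power_fade(t) ** 2 + power_fade(-t) ** 2


def even_part(t):
    """e(t) = ( a(t) + a(-t) )/2"""
    return 0.5 * (power_fade(t) + power_fade(-t))


for t in (-1.5, -0.9, -0.3, 0.0, 0.4, 0.99):
    print(t, power_fade(t), power_sum(t), even_part(t))
```

At t = 0 both envelopes sit at sqrt(1/2), so e(0) is sqrt(1/2) as well,
and e(t) falls back to 1/2 at the edges of the crossfade.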
______________________________________________________________________
Section 2: Which crossfade function to use?
Now we shall make a definition and an assumption. We shall define an
inner product of two general signals as:
                             +inf
    <x,y> = <x(t), y(t)> = integral{ x(t)*y(t) * w(t) dt}
                             -inf
w(t) is a window function that is symmetrical about t=0 and is
probably wider than the crossfade. Strictly speaking, if you were
coming at this from out of a graduate course in metric spaces or
functional analysis, one of the components (probably y(t)) should be
complex conjugated, but since x(t) and y(t) are always real, in this
whole theory, I will not bother with that notation.
This inner product is a degenerate case of the more general
cross-correlation evaluated with a lag of zero:
                                    +inf
    Rxy(tau) = <x(t), y(t+tau)> = integral{ x(t)*y(t+tau) * w(t) dt}
                                    -inf
If y(t) is a time-offset copy of x(t), then Rxy(tau) is the
autocorrelation of x(t), Rxx(tau), but also accounting for the time
offset in the lag, tau.
So <x,y> = Rxy(0)
A measure of signal energy or average power is:
                       +inf
    Rxx(0) = <x,x> = integral{ (x(t))^2 * w(t) dt}
                       -inf
Now, the assumption that we are going to toss in here is that the mean
powers of the two signals that we are crossfading, x(t) and y(t), are
equal.
<x,x> = <y,y>
We are assuming that we're not crossfading this very quiet tone or
sound to a very loud sound that is 60 dB louder. Similarly, the
resulting spliced sound, v(t), has the same mean power as the two
signals being spliced:
<v,v> = <x,x> = <y,y>
So, assuming we lined up x(t) and y(t) so that we want to splice from
one to the other at t=0, and scaled x(t) and y(t) so that they have
the same mean power in the neighborhood of t=0, then the inner product
is a measure of how well they are correlated. We shall define this
normalized measure of correlation as:
r = <x,y>/<x,x> = <x,y>/<y,y>
If r = 1, they are perfectly correlated and if r = 0, they are
completely uncorrelated.
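For sampled audio the integral becomes a sum. The following is a
minimal Python sketch of the windowed inner product and of r, under
assumptions of my own: a Hann shape for w (the text only asks for a
symmetric window), a block of 400 samples, and a sine test tone with a
40-sample period. None of these names or numbers come from the
original.

```python
import math


def hann_window(n):
    """Hann-shaped w, symmetric about the middle of the block, summing to 1."""
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * (k + 0.5) / n) for k in range(n)]
    total = sum(w)
    return [wk / total for wk in w]


def inner(x, y, w):
    """Discrete stand-in for <x,y> = integral{ x(t)*y(t) * w(t) dt }."""
    return sum(xk * yk * wk for xk, yk, wk in zip(x, y, w))


def normalized_correlation(x, y, w):
    """r = <x,y>/<x,x>, which relies on the assumption <x,x> = <y,y>."""
    return inner(x, y, w) / inner(x, x, w)


n, period = 400, 40
w = hann_window(n)
x = [math.sin(2.0 * math.pi * k / period) for k in range(n)]

# y is x delayed by `shift` samples; r should track cos(2*pi*shift/period).
for shift in (0, 5, 10, 40):
    y = [math.sin(2.0 * math.pi * (k - shift) / period) for k in range(n)]
    print(shift, normalized_correlation(x, y, w))
```

A delay of a whole period gives r very close to 1, a quarter period
gives r very close to 0, and an eighth of a period lands in between at
about 0.707.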
We will make the additional assumption that our pitch detection
algorithm will find *some* lag where the correlation is at least
zero. We should not have to deal with splicing *negatively*
correlated audio (that would be quite a "glitch" or a bad splice). If
the signals have no DC component, then their autocorrelations (and
their cross-correlations to each other) must have no DC component.
That means there will be values of tau such that Rxy(tau) are either
negative or positive. If it was theoretical white noise, Rxx(tau)
would be zero for |tau| > 0 and Rxx(0) would be the noise variance or
power. But Rxx(tau) cannot be negative for *all* values of tau, even
excluding tau=0.
We can find a value of tau so that Rxx(tau) is non-negative and we
want to choose the nonzero tau that has the highest value of Rxx(tau).
Then define
y(t) = x(t + tau)
and then
<x,y> = Rxy(0) = Rxx(tau)
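Here is a hedged Python sketch of that lag search on sampled audio. It
is a brute-force scan, not a real pitch detector, and the function
names, the Hann window, and the search range are all assumptions made
for illustration. The range must exclude tau=0, because Rxx(0) would
always win.

```python
import math


def windowed_autocorr(x, center, half, tau):
    """Rxx(tau) = sum of x[k]*x[k+tau]*w[k], with a Hann w centered on `center`."""
    acc = 0.0
    for k in range(-half, half + 1):
        w = 0.5 + 0.5 * math.cos(math.pi * k / (half + 1))
        acc += x[center + k] * x[center + k + tau] * w
    return acc


def best_lag(x, center, half, min_lag, max_lag):
    """Return (tau, r) for the lag in [min_lag, max_lag] with the largest Rxx(tau)."""
    lags = range(min_lag, max_lag + 1)
    tau = max(lags, key=lambda lag: windowed_autocorr(x, center, half, lag))
    r = windowed_autocorr(x, center, half, tau) / windowed_autocorr(x, center, half, 0)
    return tau, r


# Test tone: a 50-sample period plus a second harmonic.
x = [math.sin(2.0 * math.pi * k / 50) + 0.3 * math.sin(4.0 * math.pi * k / 50)
     for k in range(1000)]
tau, r = best_lag(x, center=500, half=150, min_lag=20, max_lag=80)
print(tau, r)
```

For this periodic test tone the scan should settle on the period
length, with r essentially equal to 1. Real audio will generally give
a smaller r, and that is the number the rest of the theory uses.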
Now we shall also assume that the crossfade function, a(t), is
completely uncorrelated and even statistically independent from the
two signals being spliced. a(t) is a volume control that varies in
time, but is unaffected by anything in x(t) or y(t).
We shall also assume something called "ergodicity". This means that
*time* averages of x(t) and y(t) (or combinations of x(t) and y(t))
are equal to *statistical* averages. If this window, w(t) is scaled
(or normalized) so that its integral is 1,
      +inf
    integral{ w(t) dt} = 1
      -inf
then all these inner products can be related to "expectation values":
<x,y> = E{ x(t) * y(t) }
If x(t) and y(t) are thought of as sorta "random" processes (rather
than well defined deterministic functions), the expectation value is
unmoved no matter what t is. But if the envelope a(t) is considered
deterministic, then it simply scales x(t) or y(t) and is treated as a
constant in the expectation. So at some particular time t0,
<a(t0)*x,y> = E{ (a(t0)*x(t)) * y(t) }
= a(t0) * E{ x(t) * y(t) }
= a(t0) * <x,y>
This is a little sloppy, mathematically, because I am "fixing" t for
a(t) to be t0, but not fixing t for x(t) or y(t) (so that "time
averages" for x(t) and y(t) can be meaningful and equated to
statistical averages).
Recall that
v(t) = a(t)*x(t) + a(-t)*y(t)
Then:
<v,v> = <(a(t)*x(t) + a(-t)*y(t)), (a(t)*x(t) + a(-t)*y(t))>
Using identities that we can apply to expectation values
<v,v> = (a(t))^2*<x,x> + 2*a(t)*a(-t)*<x,y> + (a(-t))^2*<y,y>
Since <v,v> = <x,x> = <y,y>, we can divide by <v,v> and get to the key
equation of this whole theory:
1 = (a(t))^2 + 2*r*a(t)*a(-t) + (a(-t))^2
Given the normalized correlation measure, we want the above equation
to be true all of the time. If r=0 (completely uncorrelated), one can
see we get a constant-power crossfade:
(a(t))^2 + (a(-t))^2 = 1
If r=1 (completely correlated), one can see that we get a constant-
voltage crossfade:
(a(t))^2 + (a(-t))^2 + 2*a(t)*a(-t) = ( a(t) + a(-t) )^2 = 1
or, assuming a(t) is non-negative,
a(t) + a(-t) = 1 .
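The expansion of <v,v> can be checked numerically. The Python sketch
below is only an illustration under assumptions of my own: x and y are
modeled as unit-power Gaussian noise with correlation r, and the two
envelope values are frozen at arbitrary gains a and b. The measured
power of a*x + b*y should then sit close to a^2 + 2*r*a*b + b^2.

```python
import math
import random


def mixed_power(a, b, r, n=100000, seed=1):
    """Estimate E{ (a*x + b*y)^2 } for unit-power x, y with correlation r."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        # y has unit power and E{ x*y } = r by construction.
        y = r * x + math.sqrt(1.0 - r * r) * z
        v = a * x + b * y
        acc += v * v
    return acc / n


a, b, r = 0.8, 0.3, 0.5
print(mixed_power(a, b, r), a * a + 2.0 * r * a * b + b * b)
```

The two printed numbers agree only up to the statistical error of the
estimate, which shrinks as n grows.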
______________________________________________________________________
Section 3: Generalizing the crossfade function
Recall that
a(t) = e(t) + o(t)
a(-t) = e(t) - o(t)
and substituting into
(a(t))^2 + (a(-t))^2 + 2*r*a(t)*a(-t) = 1
results in
(e(t) + o(t))^2 + (e(t) - o(t))^2
+ 2*r*(e(t) + o(t))*(e(t) - o(t)) = 1
Blasting through that gets:
(1+r)*(e(t))^2 + (1-r)*(o(t))^2 = 1/2
This means that, if r is measured and known (from the correlation
function) we have the freedom to define either one of e(t) or o(t)
arbitrarily (as long as the even or odd symmetry is kept) and solve
for the other. We can see that square rooting is involved in solving
for either e(t) or o(t) and there is an ambiguity for which sign to
pick. We shall resolve that ambiguity by adding the additional
assumption that the even-symmetry component, e(t), is non-negative.
e(t) = e(-t) >= 0
Given a general and bipolar odd-symmetry component function,
o(t) = -o(-t)
then we solve for the even component (picking the non-negative square
root):
e(t) = sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )
The overall crossfade envelope would be
a(t) = e(t) + o(t)
= sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 ) + o(t)
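The generalized envelope is short enough to write out directly. This
Python sketch assumes the Hann odd part and uses function names of my
own choosing. It also evaluates the left side of the key equation so
that the result can be checked against 1 for any t and r.

```python
import math


def odd_hann(t):
    """o(t) = (1/2)*sin(pi/2 * t) for |t| < 1, sgn(t)/2 otherwise."""
    t = max(-1.0, min(1.0, t))
    return 0.5 * math.sin(0.5 * math.pi * t)


def crossfade(t, r, odd=odd_hann):
    """a(t) = e(t) + o(t), with e(t) = sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )."""
    o = odd(t)
    e = math.sqrt((0.5 - (1.0 - r) * o * o) / (1.0 + r))
    return e + o


def key_equation(t, r, odd=odd_hann):
    """(a(t))^2 + 2*r*a(t)*a(-t) + (a(-t))^2, which should equal 1."""
    a = crossfade(t, r, odd)
    b = crossfade(-t, r, odd)
    return a * a + 2.0 * r * a * b + b * b


for r in (0.0, 0.5, 1.0):
    print(r, crossfade(0.0, r), key_equation(0.3, r))
```

The midpoint value a(0) slides from sqrt(1/2) at r=0 down to 1/2 at
r=1, which are the constant-power and constant-voltage meeting points
from Section 0.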
______________________________________________________________________
Section 4: Implementation:
Given a particular form for the odd part, o(t) (linear or Hann or
Flattened Hann or whatever is your heart's desire), and for a variety
of values of r, ranging from r=0 to r=1, a collection of envelope
functions, a(t), is pre-calculated and stored in memory. Then, when
pitch detection or loop matching is done, an optimal splice
displacement is determined. If autocorrelation of some form is used
in determining a measure of goodness (or seamlessness, using Element's
language) of that loop splice, that autocorrelation is normalized (by
dividing by Rxx(0)) to get r. That value of r is then used to choose
which pre-calculated a(t) from the above collection is used for the
crossfade in the splice.
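One way this could look in code is sketched below in Python. The table
size (9 values of r), the envelope length (65 samples), the linear odd
part, the nearest-neighbor lookup, and the clamping of r to [0, 1] are
all assumptions of mine, and every function name is hypothetical.

```python
import math


def odd_linear(t):
    """Odd part of the linear crossfade: t/2 inside |t| < 1, sgn(t)/2 outside."""
    return 0.5 * max(-1.0, min(1.0, t))


def envelope(t, r, odd=odd_linear):
    """a(t) = e(t) + o(t), with e(t) solved from the key equation."""
    o = odd(t)
    return math.sqrt((0.5 - (1.0 - r) * o * o) / (1.0 + r)) + o


def build_table(num_r=9, num_samples=65):
    """One sampled envelope per r, on an even grid from r=0 to r=1."""
    grid = [-1.0 + 2.0 * n / (num_samples - 1) for n in range(num_samples)]
    return [[envelope(t, i / (num_r - 1)) for t in grid] for i in range(num_r)]


def pick_envelope(table, r):
    """Nearest pre-calculated envelope; r is clamped to [0, 1] (an assumption)."""
    r = min(1.0, max(0.0, r))
    return table[round(r * (len(table) - 1))]


def splice(x, y, env):
    """v[n] = a(t_n)*x[n] + a(-t_n)*y[n]; reversing env gives a(-t_n)."""
    return [a * xn + b * yn for xn, yn, a, b in zip(x, y, env, reversed(env))]


table = build_table()
env = pick_envelope(table, 0.62)   # r as reported by the loop matcher
v = splice([1.0] * 65, [1.0] * 65, env)
print(env[0], env[32], env[-1])
```

Interpolating between the two nearest tables, or computing a(t) on the
fly from the measured r, would be alternatives to the nearest-neighbor
lookup used here.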
______________________________________________________________________
--
r b-j r...@audioimagination.com
"Imagination is more important than knowledge."