> On March 12, 2020 5:35 PM Ethan Duni <ethan.d...@gmail.com> wrote:
> 
> 
> Hi Robert
> 
> 
> On Wed, Mar 11, 2020 at 4:19 PM robert bristow-johnson 
> <r...@audioimagination.com> wrote:
> > 
> >  i don't think it's too generic for "STFT processing". step #4 is pretty 
> > generic.
> 
> I think the part that chafes my intuition is more that the windows in steps 
> #2 and #6 should "match" in some way, and obey an appropriate perfect 
> reconstruction condition.

i think we're supposed to multiply the analysis window with the synthesis 
window to get a net effective window, but i am not always persuaded that the 
analysis window is preserved in the frequency-domain modification operation.  
if it's a phase vocoder and you do the Miller Puckette thing and apply the same 
phase to an entire spectral peak, then supposedly the window shape is preserved 
on each sinusoidal component.  then i would use no synthesis window.  that's 
sorta what i was thinking in that Stack Exchange thing that i pointed to, but 
in a more general STFT process, i might want to use the Gaussian window, 
because the result of each sinusoidal component is a single peak with the 
Gaussian shape.  we can even measure the rate of change of frequency related to 
each peak.  being Gaussian, there shouldn't be side lobes to worry about.

> I think of STFT as intentionally wiping out any spill-over effects between 
> frames with synthesis windowing, to impose a particular time-frequency tiling.

yup.  and unlike wavelet, the tiles all have the same widths in time and 
frequency.

> Whereas fast convolution is defined by how it explicitly accounts for 
> spill-over between frames.

yup.  you don't even think of "windowing effects", even with overlap-add in 
which you are multiplying by 1 or 0, which is a rectangular window. but we 
consider the operation in the time domain, confirm linearity and can treat each 
frame as its own linear-time output and add the FIR output from frame m to 
that of frame m+1.  the end tail of frame m adds to the beginning of 
frame m+1.  it doesn't get affected by the windowing effects. 

> 
> My intuition isn't definitive, but that's what comes to mind. In any case, 
> "STFT processing" is a very generic term.

i think of it as the series of DFTs of windowed frames of audio.
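
as a rough sketch of that definition (analysis only), assuming a periodic Hann window and a particular frame-relative phase convention that may not match the Stack Exchange answer.  the function name stft and all the sizes here are illustrative:

```python
import cmath
import math

def stft(x, window, hop):
    """Series of DFTs of windowed frames: X[m][k] for frame m, bin k."""
    N = len(window)
    frames = []
    for start in range(0, len(x) - N + 1, hop):
        seg = [x[start + n] * window[n] for n in range(N)]
        frames.append([
            sum(s * cmath.exp(-2j * cmath.pi * k * n / N) for n, s in enumerate(seg))
            for k in range(N)
        ])
    return frames

N = 16
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # periodic Hann
x = [math.cos(2 * math.pi * 3 * n / N) for n in range(64)]              # sinusoid on bin 3
X = stft(x, window, hop=4)

# strongest bin in the lower half of each frame's spectrum
peak_bins = [max(range(N // 2 + 1), key=lambda k: abs(frame[k])) for frame in X]
print(len(X), peak_bins)
```

every frame should report its peak at bin 3, with only the phase changing from frame to frame.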

> > 
> >  here is my attempt to quantitatively define and describe the STFT:
> >  
> >  
> > https://dsp.stackexchange.com/questions/45625/is-windowed-fourier-transform-a-synonym-for-stft/45631#45631
> >  
> 
> 
> Cool, that's a helpful reference for this stuff.

but i didn't account for both analysis and synthesis windows.  but whatever is 
the resultant window for the output grain y_m[n], you just add them up, 
properly positioned in time.


> In terms of "what even is STFT", it seems there is more consensus on the 
> analysis part. Many STFT applications don't involve any synthesis or 
> filtering, but only frequency domain parameter estimation.

that's right.

> For analysis-only, probably everyone agrees that STFT consists of some 
> Constant OverLap Add (COLA) window scheme, followed by DFT.

well, no, i don't agree.  for analysis-only, i don't know why you need 
complementary windows (which is what i think you mean by COLA).  it's in the 
synthesis where overlap-adding is done.  assuming the sinusoidal components in 
your output are phase aligned between adjacent frames, if you don't want a dip 
in the amplitude of each sinusoidal component you want their windows to add to 
1.  but for analysis, there might be other properties of the window that are 
more important than being complementary.

again, i like the Gaussian window for analysis (because it has smooth Gaussian 
pulses for each sinusoidal component in the frequency domain), but it's not 
complementary.  if, after analysis, i am modifying each Gaussian pulse and 
inverse DFT back to the time domain, i will have a Gaussian window effectively 
on the output frame.  by multiplying by a Hann window and dividing by the 
original Gaussian window, the result has a Hann window shape and that should be 
complementary in the overlap-add.
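
a small numerical sketch of that re-windowing, under assumptions that are not in the original: frame length 64, 50% overlap, a Gaussian with sigma = N/6 truncated at the frame edges, and a periodic Hann.  it only checks the window shapes, not a full modify-and-resynthesize chain:

```python
import math

N = 64                 # frame length (assumed)
hop = N // 2           # 50% overlap, enough for a plain Hann to sum to 1
sigma = N / 6.0        # assumed Gaussian width; truncated at the frame edges

gauss = [math.exp(-0.5 * ((n - N / 2) / sigma) ** 2) for n in range(N)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # periodic Hann

# correction applied to the output frame: multiply by Hann, divide by Gaussian
correction = [w / g for w, g in zip(hann, gauss)]
net = [g * c for g, c in zip(gauss, correction)]   # effective window on the output

def overlap_sum(window, hop, n_frames=8):
    """Overlap-add shifted copies of window; return the fully overlapped middle."""
    total = [0.0] * (hop * (n_frames - 1) + len(window))
    for m in range(n_frames):
        for n, w in enumerate(window):
            total[m * hop + n] += w
    return total[len(window):-len(window)]

hann_sum = overlap_sum(net, hop)
gauss_sum = overlap_sum(gauss, hop)
hann_ripple = max(hann_sum) - min(hann_sum)      # complementary: essentially zero
gauss_ripple = max(gauss_sum) - min(gauss_sum)   # not complementary: visible ripple
print(hann_ripple, gauss_ripple)
```

one thing to watch: the division by the Gaussian boosts the frame edges, so how far the Gaussian is allowed to decay before truncation matters if the modified frame is not exactly Gaussian-shaped any more.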

> Rectangular windows are a perfectly valid choice here, albeit one with poor 
> sidelobe suppression.

but it doesn't matter with overlap-add fast convolution.  somehow, the sidelobe 
effects come out in the wash, because we can ensure (to finite precision) the 
correctness of the output with a time-domain analysis.

> Note that there are two potential layers of oversampling available: one from 
> overlapped windows, and another from zero-padding.
> 
> To summarize my understanding of your earlier remarks, the situation gets 
> fuzzier for synthesis. Broadly, there are two basic approaches. One is to 
> keep the COLA analysis and use raw (unwindowed) overlap-add for synthesis. 
> The other is to add synthesis windows, in which case the PR condition becomes 
> COLA on the product of the analysis and synthesis windows (I'd call this 
> "STFT filter bank" or maybe "FFT phase vocoder" depending on the 
> audience/application).

right.

> 
> The first approach has immediate problems if the DFT values are modified, 
> because the COLA condition is not enforced on the output.

well, Miller Puckette sorta did, but the Hann window has sidelobes and, if 
you want to apply the same phase shift factor to the entire sinusoid, you have 
to deal with the sidelobes.  this is one reason why i like Gaussian window for 
analysis; no sidelobes (or very tiny ones because the Gaussian window *does* 
get truncated at the sides).

> For the special case that the modification is multiplication by a DFT kernel 
> that corresponds to a length-K FIR filter, this can be accommodated by 
> zero-padding type oversampling,

so you're oversampling in the frequency domain because you're zero-padding in 
the time domain.

> which results in the Overlap-Add flavor of fast convolution to account for 
> the inter frame effects. Note that this implicitly extends the (raw) 
> overlap-add region in synthesis accordingly - the analysis windows obey COLA, 
> but the synthesis "windows" have different support and are not part of the PR 
> condition.

i think the thing that Jean Laroche did was a Hann for the analysis and another 
Hann for the synthesis, which becomes Hann^2, and then they show that with 75% 
overlap (the hop is 1/4 of the window length) the windows add to a 
constant.
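
that constant is easy to check numerically.  a sketch assuming an unnormalized periodic Hann of length 64 (the helper name overlapped_sum is made up); for this scaling the Hann^2 copies at hop N/4 should add to 1.5, while at hop N/2 they do not add to a constant:

```python
import math

N = 64
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # periodic Hann
net = [w * w for w in hann]   # analysis Hann times synthesis Hann

def overlapped_sum(window, hop):
    """Steady-state sum of all shifted copies of window that overlap one hop."""
    n_overlap = len(window) // hop
    return [sum(window[n + m * hop] for m in range(n_overlap)) for n in range(hop)]

sum_75 = overlapped_sum(net, N // 4)   # 75% overlap
sum_50 = overlapped_sum(net, N // 2)   # 50% overlap, for comparison

ripple_75 = max(sum_75) - min(sum_75)
ripple_50 = max(sum_50) - min(sum_50)
print(sum_75[0], ripple_75, ripple_50)
```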

> 
> As you point out, this works for any COLA analysis window scheme, not just 
> rectangular, although the efficiency is correspondingly reduced with overlap. 
> This system is equivalent to a SISO FIR, up to finite word length effects.

yup.  as long as the size of the FFT is at least as long as the sum of the FIR 
length and window width (strictly, that sum minus 1 is enough).
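
to see what goes wrong when the FFT is too short, here is a sketch that writes out in the time domain what multiplying two length-nfft DFTs computes, namely a circular convolution.  the frame values, the taps and the function names are arbitrary illustrations:

```python
def linear_conv(frame, h):
    """Ordinary (linear) convolution, length len(frame) + len(h) - 1."""
    y = [0.0] * (len(frame) + len(h) - 1)
    for i, xv in enumerate(frame):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y

def circular_conv(frame, h, nfft):
    """Circular convolution modulo nfft: what a product of length-nfft DFTs gives."""
    y = [0.0] * nfft
    for i, xv in enumerate(frame):
        for j, hv in enumerate(h):
            y[(i + j) % nfft] += xv * hv
    return y

frame = [1.0, -0.5, 0.25, 0.8, -0.3, 0.6, -0.9, 0.4]   # window width L = 8
h = [0.5, 0.3, -0.2, 0.1, 0.05]                        # FIR length K = 5
L, K = len(frame), len(h)

ref = linear_conv(frame, h)
ok = circular_conv(frame, h, L + K - 1)   # long enough: matches linear convolution
short = circular_conv(frame, h, L)        # too short: the tail wraps onto the head

ok_err = max(abs(a - b) for a, b in zip(ref, ok))
wrap_err = max(abs(a - b) for a, b in zip(ref, short))
print(ok_err, wrap_err)
```

with nfft = L + K - 1 the two agree; with nfft = L the last K - 1 output samples alias back onto the start of the frame.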

> Note that this equivalence happens because we are adding an additional 
> time-variant stage (zero-padding/raw OLA), to explicitly correct for the 
> time-variant effects of the underlying DFT operation. This is the block 
> processing analog of upsampling a scalar signal by K so that we can apply an 
> order-K polynomial nonlinearity without aliasing.

where is this polynomial nonlinearity?  and i am still not grokking the 
upsampling.
 
> The synthesis window approach is more general in the types of modifications 
> that can be accommodated (spectral subtraction, nonlinear operations, etc.). 
> This is because it allows time domain aliasing to occur, but explicitly 
> suppresses it by attenuating the frame edges. This is also throwing 
> oversampling at the problem, but of the overlap type instead of the 
> zero-padding type.
> 
> You can also apply zero-padding on top of synthesis windows to further 
> increase the margin for circular aliasing. However unlike fast convolution 
> you would still apply the synthesis windows to remove spill-over between 
> frames and not use raw OLA. This is required for the filterbank PR condition. 
> There is no equivalent SISO system in this case. The level of aliasing is 
> determined by how hard you push on the response, and how much 
> overlap/zero-padding you can afford. I.e., it's ultimately engineered/tuned 
> rather than designed out explicitly as in fast convolution.
> 
> We're all on the same page on this stuff, I hope?

i think so, but i wonder if i am understanding your "upsampling" and 
zero-padding.  we're zero-padding in the time domain (making the length longer 
to the DFT length N).  the factor of length increase in the time domain is the 
factor of oversampling in the frequency domain.  a lot of people make the 
sub-optimal decision to simply pad zeros equal to the length of the frame of 
audio.  then doubling the length would be like inserting oversampled frequency 
DFT bins between each of the "original" spectral points.
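
a tiny sketch of that last point, assuming an arbitrary 8-sample frame and a naive DFT standing in for an FFT: after doubling the length with zeros, the even-numbered bins should equal the original DFT and the odd-numbered bins are the new in-between points.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT; fine for a small demo."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))
            for k in range(n)]

frame = [0.0, 0.7, 1.0, 0.7, 0.0, -0.7, -1.0, -0.7]   # arbitrary frame of audio
padded = frame + [0.0] * len(frame)                    # zero-pad to twice the length

X = dft(frame)      # N "original" spectral points
X2 = dft(padded)    # 2N points: frequency axis oversampled by 2

# every second bin of the padded DFT coincides with an original bin
even_err = max(abs(X2[2 * k] - X[k]) for k in range(len(frame)))
print(len(X), len(X2), even_err)
```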

--
 
r b-j                  r...@audioimagination.com
 
"Imagination is more important than knowledge."
_______________________________________________
dupswapdrop: music-dsp mailing list
music-dsp@music.columbia.edu
https://lists.columbia.edu/mailman/listinfo/music-dsp
