Hello David,

Thank you for your explanations. Everything makes a little more sense to me
now. I think most of my confusion comes from the fact (correct me if I am
wrong) that Codec 2 is actually a combination of sinusoidal and LPC coding,
but in most materials you present it as a sinusoidal coder, which is a bit
misleading.

Let me try to sum up:

So you are basically performing LPC analysis with LSP quantization on the
source signal. Over the channel you transmit a fixed number of LSP-coded LPC
filter coefficients instead of a variable number of harmonic amplitudes. It
might be a bit misleading to call these "spectral magnitudes" under Bit
Allocation on the website. (Or are LPC coefficients actually also called
spectral magnitudes?)
Voicing is determined via MBE, but reduced to a single bit per 10 ms.
Pitch is estimated using the NLP algorithm (cf. chapter 4) and refined via the
MBE pitch estimator (for which you naturally need a DFT as well).
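
Just to check my understanding of the analysis side, here is a toy sketch in
Python/NumPy (entirely my own simplification, not the actual codec2 source;
the function name, LPC order and Hamming window are my assumptions): LPC
coefficients estimated from one windowed frame via the autocorrelation method
and Levinson-Durbin, which would then be converted to LSPs and quantized.

import numpy as np

def lpc_autocorrelation(frame, order=10):
    # Estimate LPC coefficients a[0..order] (with a[0] = 1) for one
    # speech frame using the autocorrelation method and the
    # Levinson-Durbin recursion.  Assumes a non-silent frame (r[0] > 0).
    w = frame * np.hamming(len(frame))
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a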

In the decoder, a signal is synthesized using the LPC synthesis filter. That
should already be audible speech. But since the audio quality of pure LPC
systems is low (you mention a "mechanical quality" in your thesis), you do the
following trick: you take the DFT of the LPC-produced signal and extract the
harmonic amplitudes, but from the RMS over each band, for reasons I don't yet
understand. You then apply the sinusoidal model (i.e. a "Reverse FFT"?),
including phase information derived from the voicing bits. Effectively, you
are enhancing the LPC-synthesized speech by correcting the harmonic phases,
which increases the quality.
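
For the synthesis side, the way I picture it is roughly the sketch below
(again only my own illustration; the overall LPC gain/energy term is left out,
and the phases phi are placeholders, since as I understand it the phases are
derived at the decoder rather than transmitted):

import numpy as np

def harmonic_amplitudes(a, Wo, L):
    # Sample the LPC synthesis filter magnitude, Pw(w) = 1/|A(e^jw)|^2,
    # at the harmonics m*Wo for m = 1..L, i.e. Am = 1/|A(e^{j*m*Wo})|.
    # Wo is in radians per sample.
    k = np.arange(len(a))
    A = np.array([np.sum(a * np.exp(-1j * m * Wo * k))
                  for m in range(1, L + 1)])
    return 1.0 / np.abs(A)

def synthesise(Am, Wo, phi, n_samples):
    # Plain harmonic sinusoidal synthesis from amplitudes and phases.
    n = np.arange(n_samples)
    s = np.zeros(n_samples)
    for m, (amp, ph) in enumerate(zip(Am, phi), start=1):
        s += amp * np.cos(m * Wo * n + ph)
    return s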

Did I get this right?
How exactly the harmonic amplitudes are extracted, I still don't fully
understand; it's described in chapter 5.2.1, and that's the part I am
struggling with. Also, I do not understand what the quantized energy
information is used for.
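
If I had to guess at the "RMS in that band" step from your reply below and the
blog post, it would be roughly the following (purely my own illustration, with
an arbitrary FFT size, so please correct me): evaluate the LPC power spectrum
on a dense grid and take the RMS over the band around each harmonic instead of
sampling exactly at the harmonic centre.

import numpy as np

def band_rms_amplitudes(a, Wo, L, nfft=512):
    # Pw(w) = 1/|A(e^jw)|^2 on a dense frequency grid (0..pi), then the
    # RMS of the magnitude spectrum over each harmonic band
    # [(m - 0.5)*Wo, (m + 0.5)*Wo), m = 1..L.  Wo in radians per sample;
    # nfft must be large enough that every band contains grid points.
    w = 2.0 * np.pi * np.arange(nfft // 2) / nfft
    k = np.arange(len(a))
    A = np.array([np.sum(a * np.exp(-1j * wi * k)) for wi in w])
    Pw = 1.0 / np.abs(A) ** 2
    Am = np.zeros(L)
    for m in range(1, L + 1):
        band = (w >= (m - 0.5) * Wo) & (w < (m + 0.5) * Wo)
        Am[m - 1] = np.sqrt(np.mean(Pw[band]))
    return Am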

This is not so bad for my small report: I can first outline a pure sinusoidal
model with its pros and cons, then LPC as an alternative with its pros and
cons, and finally Codec 2 as a clever synthesis of both.

I must say that this is the most complex DSP algorithm I have seen so far, but
I presume there is equally complex stuff going on in video coding.

Best regards,
Robin Haberkorn

PS: Off-topic, but what system did you write your thesis in? Is this TeX or 
Troff?

On 9/20/23 22:14, david wrote:
> Hi Robin,
> 
>> Regarding the quantization of sinusoidal magnitudes/amplitudes, you
>> write in a
>> blog post (https://www.rowetel.com/?p=130) that the "red line" Am is
>> quantized.
>> This is not the plain frequency curve (the green one Sw). How exactly
>> do you
>> derive Am from Sw?
> 
> By sampling the LPC synthesis filter Pw=1/|A(e^jw)|^2 at each harmonic.
> 
>> But in the Harmonic Sinusoidal Model, you need to have all L
>> amplitudes
>> available to synthesize the speech signal. How is that achieved? Are
>> you simply
>> synthesizing 10 harmonics with an appropriately scaled Wo no matter
>> what?
>>
> 
> The LSPs are converted back to LPC coefficients {ak}, which are used to
> create an LPC synthesis filter, which we sample.  Well actually we take
> the RMS value of the spectra in that band rather than sampling at the
> harmonic centre.  The blog post you linked to explains that a little
> further down, and I think it's in the thesis too.
> 
>> The fundamental frequency is determined by trying a number of
>> frequencies
>> between 50-500 Hz, determining the sinusoidal amplitudes, decoding
>> that data and
>> comparing it with the original signal? The fundamental frequency will
>> be the one
>> where that comparison yields the smallest error. This is the
>> algorithm described
>> in chapter 3.4 of your PhD thesis.
>>
> We use the non-linear pitch estimation algorithm (in the thesis), then
> the MBE pitch estimator (which you outlined above) is used for
> refinement of the pitch estimate.
> 
>> What's the algorithm you are using to estimate voicing?
> 
> The MBE algorithm, but the voicing of all bands is averaged to get a
> single metric which we compare to a threshold.
> 
>> Furthermore, LPC analysis is performed directly on the speech samples
>> (time
>> domain) according to the block diagram. How does that fit together
>> with using Am
>> which is obviously a feature in the frequency domain?
> 
> The Am are extracted using freq domain techniques for the purpose of
> estimating voicing.  In the LPC quantised modes, the Am are then
> discarded and the time-domain LPC coefficients are transformed to LSPs
> and sent to
> the decoder, where the Am are extracted.
>  
>> I do have a little bit of experience in signal/audio processing, but
>> still find
>> it hard to understand all of it. Okay I admit, I get terribly
>> confused.
> 
> Yes, we realise there is a gap here.  We plan to write a complete
> algorithm description to provide a reference in one place.
> 
> Cheers,
> David R

