zuo bf wrote:

Hi Steve,

inline

On 5/19/06, Steve Underwood <[EMAIL PROTECTED]> wrote:

    [EMAIL PROTECTED] wrote:

    > Hi,
    >
    > The theory is not based on the music; it's based on what is given
    > by ITU G.711 Appendix I (BTW: the music was converted to
    > 8kHz/mono/16-bit by CoolEdit).
    >
    What works well for music is very different from what works well
    for voice.

Yeah, but I don't think the difference is so big, unless you give me a voice file to prove me wrong. And again, the reason I prolong it is based on the theory given by G.711 Appendix I, which is said to be derived from experimentation at Bell.

Just because it's derived from Bell doesn't make it the word of God. For example, the pitch search only goes down to 66Hz. The F0 of my voice can go well below 50Hz, and the pitch is completely messed up. As I said, for music a far slower decay characteristic works a lot better. Also, windowing before the AMDF will give better temporal localisation of the pitch estimate. This is pretty much a waste of time for voice, but it helps stabilise the pitch for music, reducing the watery quality of the synthetic sound at higher pitches. All good modern codecs do some form of fractional pitch search to reduce wateriness in female (i.e. high pitched) voices. This PLC algorithm does everything in whole samples. I suspect the fractional pitch approach would noticeably help quality, but at substantial computational expense.
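For what it's worth, a fractional pitch search can be as simple as parabolic interpolation around the integer AMDF minimum. A minimal C sketch, assuming an AMDF curve has already been computed over the candidate lags; the function and variable names are illustrative, not anything in the Asterisk code:

#include <stddef.h>

/* Refine an integer pitch lag to a fractional one by fitting a parabola
 * through the AMDF values at lag-1, lag and lag+1, and returning the lag
 * at the parabola's minimum. best_lag must be an interior minimum. */
static double refine_pitch_lag(const double *amdf, size_t best_lag)
{
    double left = amdf[best_lag - 1];
    double mid = amdf[best_lag];
    double right = amdf[best_lag + 1];
    double denom = left - 2.0*mid + right;

    if (denom <= 0.0)
        return (double) best_lag;   /* no usable curvature; keep the integer lag */
    /* Vertex of the parabola through the three points */
    return best_lag + 0.5*(left - right)/denom;
}

The refined lag is then used with interpolated (sub-sample) waveform repetition, which is where the extra computation goes.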

    G.711 Appendix 1 and my code fade to silence over 50ms. For music
    much greater sustain to fill in the gaps works much better. With
    speech, that badly affects intelligibility.

I didn't change this. BTW, G.711 Appendix I fades to silence over 60ms, because it doesn't fade during the first erasure, but you did. I think, since you can't know whether the wave is going to rise or fall, you'd better keep the same level for the first erasure.

Ah, I forgot about this. It's something that isn't very sane in Appendix 1, and I never went back to experiment with it. Several areas of Appendix 1 are very much oriented to 10ms packets. In the real world hardly anyone uses 10ms packets. I suspect the decay rate should be different for 20ms or 30ms packets, and that requires investigation.

////////////////////////////////////////////////////////////////////////////////////////////////
G.711 Appendix I
I.2.4 Synthetic signal generation for first 10 ms
For the first 10 ms of the erasure, the best results are obtained by generating the synthesized signal
from the last pitch period with no attenuation.
/////////////////////////////////////////////////////////////////////////////////////////////////////
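To make that schedule concrete, here is a minimal C sketch of the Appendix I attenuation curve as described above: full level for the first 10ms, then a linear fade reaching silence at 60ms into the erasure. The gain() helper and the constants are illustrative, assuming 8kHz sampling; this is not code from Asterisk or spandsp:

#define SAMPLE_RATE     8000
#define FLAT_SAMPLES    (SAMPLE_RATE/100)   /* first 10ms: no attenuation */
#define FADE_SAMPLES    (SAMPLE_RATE/20)    /* then fade to zero over 50ms */

/* Gain applied to the synthetic signal, as a function of how many samples
 * of erasure have already been concealed. */
static float gain(int erased_samples)
{
    if (erased_samples < FLAT_SAMPLES)
        return 1.0f;                        /* hold full level for 10ms */
    erased_samples -= FLAT_SAMPLES;
    if (erased_samples >= FADE_SAMPLES)
        return 0.0f;                        /* silent after 60ms total */
    return 1.0f - (float) erased_samples/FADE_SAMPLES;
}

Fading over 50ms from the very start of the erasure, as described above for the spandsp code, just drops the FLAT_SAMPLES hold.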

    I used the Appendix 1 approach without experimenting. I suspect
    something other than linear attenuation would behave better.


From experimentation, I think that as long as the algorithm aims at generic linear concealment, you probably can't find one much better than this, unless you analyse some voice parameters from previous samples.

Actually, there are rather better concealment algorithms, but they require greater amounts of computation. Try a Google search. Several people have reported results using LPC analysis and synthesis which seem better, especially for longer erasures.


    > And the current PLC algorithm is similar to the G.711 Appendix I,
    > except:
    > 1. The pitch detection algorithm: G.711 Appendix I uses cross
    > correlation, but Asterisk uses AMDF, which is simpler and also
    > performs well
    >
    Correct.
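For anyone following along: AMDF just picks the lag that minimises the average magnitude difference between the most recent samples and the samples one lag earlier. A minimal C sketch with illustrative constants and names, not the actual Asterisk loop; note that the 66Hz pitch floor mentioned above corresponds to the maximum lag searched:

#include <stdint.h>
#include <stdlib.h>

#define MIN_PITCH   40      /* 200Hz at 8kHz */
#define MAX_PITCH   120     /* about 66Hz at 8kHz */

/* hist points just past the newest sample of a history buffer holding at
 * least len + MAX_PITCH samples. Returns the best pitch lag in samples. */
static int amdf_pitch(const int16_t *hist, int len)
{
    int best_lag = MIN_PITCH;
    int32_t best_acc = INT32_MAX;

    for (int lag = MIN_PITCH; lag <= MAX_PITCH; lag++)
    {
        /* Accumulate the magnitude difference at this candidate lag */
        int32_t acc = 0;
        for (int i = 0; i < len; i++)
            acc += abs(hist[-1 - i] - hist[-1 - i - lag]);
        if (acc < best_acc)
        {
            best_acc = acc;
            best_lag = lag;
        }
    }
    return best_lag;
}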

    > 2. The OLA window: G.711 updates the OLA window length when burst
    > loss occurs, but Asterisk doesn't
    >
    Wrong. They both use the same OLA strategy - 1/4 pitch period overlap.


G.711 will prolong the OLA window by 4ms until it reaches 10ms, but the Asterisk one doesn't?

////////////////////////////////////////////////////////////////////////////////////////////////
G.711 Appendix I
I.2.7 First good frame after an erasure
At the first good frame after an erasure, a smooth transition is needed between the synthesized erasure speech and the real signal. To do this, the synthesized speech from the pitch buffer is continued beyond the end of the erasure, and then mixed with the real signal using an OLA. The length of the OLA depends on both the pitch period and the length of the erasure. For short, 10 ms erasures, a 1/4 wavelength window is used. For longer erasures the window is increased by 4 ms per
10 ms of erasure, up to a maximum of the frame size, 10 ms.
////////////////////////////////////////////////////////////////////////////////////////////////
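The I.2.7 rule quoted above reduces to a small calculation. A minimal C sketch, assuming 8kHz sampling; the names are illustrative:

#define SAMPLE_RATE 8000
#define MS(x)       ((x)*SAMPLE_RATE/1000)

/* pitch and erasure_len are in samples. Returns the OLA length in samples:
 * 1/4 wavelength for a 10ms erasure, growing by 4ms per extra 10ms of
 * erasure, capped at the 10ms frame size. */
static int ola_length(int pitch, int erasure_len)
{
    int len = pitch/4;
    int extra_frames = (erasure_len - MS(10))/MS(10);

    if (extra_frames > 0)
        len += extra_frames*MS(4);
    if (len > MS(10))
        len = MS(10);
    return len;
}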

    > 3. The nearby field of the first erasure: G.711 delays the output
    > for 3.75 ms to compensate for the probable loss, but Asterisk just
    > uses the symmetrical part before the loss to do the OLA. The one
    > G.711 Appendix I utilized should be better, but it's not very
    > important, as human ears are really good at rejecting such
    > artefacts.
    >
    That 3.75ms delay is so the Appendix 1 algorithm can do a 1/4 pitch
    period of OLA when erasure commences. However, it incurs lots of
    buffer copying when there are no lost packets. What my code does is
    time reverse the last 1/4 pitch period and OLA with that. It sounds
    nasty, but listening tests with speech showed it was very close to
    the sound of the G.711 Appendix 1 algorithm, and improves efficiency
    a lot in the common case - no packets being lost.
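The time-reversal trick can be sketched in a few lines of C. This is only an illustration of the idea, with hypothetical names rather than the actual Asterisk code, and the linear cross-fade weighting is an assumption:

#include <stdint.h>

/* hist points just past the last good sample; synth holds the first
 * quarter pitch period of synthetic speech, cross-faded in place.
 * quarter_pitch must be greater than zero. */
static void ola_with_reversed_tail(const int16_t *hist, int16_t *synth,
                                   int quarter_pitch)
{
    for (int i = 0; i < quarter_pitch; i++)
    {
        /* Walk backwards through the last quarter pitch period... */
        int32_t reversed = hist[-1 - i];
        /* ...and linearly cross-fade from the reversed tail into the
         * synthetic signal. */
        int32_t mixed = (reversed*(quarter_pitch - i)
                         + (int32_t) synth[i]*i)/quarter_pitch;
        synth[i] = (int16_t) mixed;
    }
}

No look-ahead buffer is needed, which is why the no-loss path stays cheap.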


Yeah, the results are similar, but the difference is just a 3.75 ms delay. I didn't see more buffer copying than necessary; both algorithms save the same history (although G.711 keeps a longer one and delays for 3.75ms).
BTW: packet loss is very common, at least in China, and burst loss can last very long. For example, as the bandwidth between the two major carriers is very low, two users, one on each carrier, will experience packet loss very often if they use the public Internet rather than some softswitch network.

There is a lot more copying in the Appendix 1 algorithm. It not only saves a copy of the audio; it also has to rearrange the output buffer to delay it by 30 samples. When there are no erasures, the difference in compute requirements is substantial. Enough to make me rework the algorithm to optimise the common case. If you don't think no erasures is the common case, you have real problems. :-)
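The common-case argument is easiest to see from how a receiver drives this style of PLC API. A minimal sketch using the plc_init()/plc_rx()/plc_fillin() functions declared in plc.h; the include path and the 20ms frame size are assumptions:

#include <stdint.h>

#include "plc.h"    /* e.g. asterisk/include, or spandsp */

#define FRAME_SAMPLES 160   /* 20ms at 8kHz */

static plc_state_t plc;

void on_startup(void)
{
    plc_init(&plc);
}

/* Called once per frame slot, whether or not the packet arrived. */
void on_frame(int16_t amp[FRAME_SAMPLES], int arrived)
{
    if (arrived)
        plc_rx(&plc, amp, FRAME_SAMPLES);       /* cheap: save history, OLA after an erasure */
    else
        plc_fillin(&plc, amp, FRAME_SAMPLES);   /* synthesize the missing 20ms */
    /* ...play or forward amp... */
}

In the no-loss case only plc_rx() runs, so keeping that path free of buffer shuffling is what pays off.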

In Southern China my experience has been of very very low packet loss, and the full bandwidth of ADSL connections being available most of the time. International comms can be more congested, but there is a lot of local overcapacity. I don't know much about Northern China.

    > I prolong the pitch period to a maximum of 3 pitch periods, but
    > Asterisk only uses one, which saves memory but behaves badly on
    > burst loss.
    >
    For prolonged erasures G.711 Appendix 1 and my code act in exactly
    the same way. They linearly attenuate to zero over the first 50ms.
    In that period they repeat the last 1.25 pitch periods of real
    speech, with a quarter pitch period of overlap. When real speech
    restarts they both do a 1/4 pitch period of OLA, based on the last
    known pitch. The algorithms are identical beyond the initial 1/4
    pitch period of OLA. Why would anyone want to save memory here? It
    only uses a small amount. The algorithmic changes were to reduce
    the buffer manipulation in the common case.
> 4. Whether to prolong the pitch period during burst loss: G.711 Appendix

Not the same.

////////////////////////////////////////////////////////////////////////////////////////////////
G.711 Appendix I
I.2.5 Synthetic signal generation after 10 ms
If the next frame is also erased, the erasure will be at least 20 ms long and further action is required. While repeating a single pitch period works well for short erasures (e.g. 10 ms), on long erasures it introduces unnatural harmonic artifacts (beeps). This is especially noticeable if the erasure lands in an unvoiced region of speech, or in a region of rapid transition such as a stop. It was discovered by experimentation that these artifacts are significantly reduced by increasing the number of pitch periods used to synthesize the signal as the erasure progresses. Playing more pitch periods increases the variation in the signal. Although the pitch periods are not played in the order they occurred in the original signal, the resulting output still sounds natural. At 10 ms into the erasure the number of pitch periods used to synthesize the speech is increased to two, and at 20 ms a third pitch period is added. For erasures longer than 20 ms no additional modifications to the pitch buffer are made.
////////////////////////////////////////////////////////////////////////////////////////////////
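The I.2.5 schedule quoted above boils down to a tiny lookup. A minimal C sketch with illustrative names:

/* Number of pitch periods in the synthesis buffer, as a function of how
 * far into the erasure we are, per I.2.5. */
static int periods_in_pitch_buffer(int erased_ms)
{
    if (erased_ms < 10)
        return 1;   /* first 10ms: repeat a single pitch period */
    if (erased_ms < 20)
        return 2;   /* at 10ms, add a second period */
    return 3;       /* at 20ms, a third; no further growth */
}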

Actually, people complain that Appendix 1 PLC implementations also beep. You'll find that improvements in that area are one of the main claims for the LPC based PLC algorithms. I'd have to go back and check on this. It's a while since I wrote the code. If I diverged from the Appendix 1 algorithm I must have done so for a good reason, like it simplified something without noticeable impact on quality.

    I think the documentation for my PLC code is missing from the Asterisk

No, it's available in plc.h under asterisk/include. :)

    source code, but you can find it at
    http://www.soft-switch.org/spandsp-doc/plc_page.html

    Regards,
    Steve

As I said before, you really have to try voice, and not music. It makes a huge difference. If you try a continuous tone the PLC algorithm behaves terribly, but that's another case nobody cares about. :-)

Regards,
Steve
