zuo bf wrote:
Hi Steve,
Replies inline.
On 5/19/06, Steve Underwood <[EMAIL PROTECTED]> wrote:
[EMAIL PROTECTED] wrote:
> Hi,
>
> The theory is not based on the music; it's based on what is given in
> ITU G.711 Appendix I (BTW: the music was converted to 8kHz/mono/16-bit
> with CoolEdit).
>
What works well for music is very different from what works well for
voice.
Yeah, but I don't think the difference is that big, unless you give me a
voice file to prove me wrong. And again, I prolong it based on the theory
given in G.711 Appendix I, which is said to be derived from
experimentation at Bell.
Just because it's derived from Bell doesn't make it the word of God. For
example, the pitch search only goes down to 66Hz. The F0 of my voice can
go well below 50Hz, and then the pitch is completely messed up. As I
said, for music a far slower decay characteristic works a lot better.
Also, windowing before the AMDF will give better temporal localisation of
the pitch estimate. This is pretty much a waste of time for voice, but it
helps stabilise the pitch for music, reducing the watery quality of the
synthetic sound at higher pitches. All good modern codecs do some form of
fractional pitch search to reduce wateriness in female (i.e. high
pitched) voices. This PLC algorithm does everything in whole samples. I
suspect the fractional pitch approach would noticeably help quality, but
at substantial computational expense.
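For anyone following along: an AMDF pitch search just picks the lag that
minimises the average magnitude difference between the recent signal and
a delayed copy of itself. A rough C sketch - the constants and names here
are illustrative, not the actual Asterisk/spandsp code; it assumes 8kHz
samples and a search range of roughly 200Hz down to 66Hz:

    #include <stdint.h>
    #include <stdlib.h>

    #define MIN_PITCH   40      /* ~200Hz at 8kHz sampling */
    #define MAX_PITCH   120     /* ~66Hz at 8kHz sampling */
    #define WINDOW_LEN  160     /* compare over 20ms of history */

    /* hist must hold at least WINDOW_LEN + MAX_PITCH samples. */
    static int amdf_pitch(const int16_t *hist, int hist_len)
    {
        int lag;
        int i;
        int best_lag = MIN_PITCH;
        int32_t best_score = INT32_MAX;

        for (lag = MIN_PITCH;  lag <= MAX_PITCH;  lag++)
        {
            int32_t acc = 0;

            /* Sum |x[n] - x[n - lag]| over the last WINDOW_LEN samples */
            for (i = 0;  i < WINDOW_LEN;  i++)
                acc += abs(hist[hist_len - WINDOW_LEN + i]
                           - hist[hist_len - WINDOW_LEN + i - lag]);
            if (acc < best_score)
            {
                best_score = acc;
                best_lag = lag;
            }
        }
        return best_lag;    /* estimated pitch period, in whole samples */
    }

Windowing the history before this search, or interpolating around the
best lag for a fractional result, would address the points raised above.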
G.711 Appendix 1 and my code fade to silence over 50ms. For music a much
longer sustain to fill in the gaps works much better. With speech, that
badly affects intelligibility.
I didn't change this. BTW, G.711 Appendix I fades to silence over 60ms,
because it doesn't fade during the first erasure, but you did. And I
think, as you can't know whether the wave is going to rise or fall, you'd
better keep the same level for the first erasure.
Ah, I forgot about this. It's something that isn't very sane in Appendix
1, and I never went back to experiment with it. Several areas of Appendix
1 are very much oriented to 10ms packets. In the real world hardly anyone
uses 10ms packets. I suspect the decay rate should be different for 20ms
or 30ms packets, and that requires investigation.
////////////////////////////////////////////////////////////////////////////////////////////////
G.711 Appendix I
I.2.4 Synthetic signal generation for first 10 ms
For the first 10 ms of the erasure, the best results are obtained by
generating the synthesized signal from the last pitch period with no
attenuation.
////////////////////////////////////////////////////////////////////////////////////////////////
I used the Appendix 1 approach without experimenting. I suspect something
other than linear attenuation would behave better.
By experimentation, I think that as long as the algorithm aims at generic
linear concealment, you probably can't find one much better than this,
unless you analyse some voice parameters from previous samples.
Actually, there are rather better concealment algorithms, but they
require greater amounts of computation. Try a Google search. Several
people have reported results using LPC analysis and synthesis which seem
better, especially for longer erasures.
> And the current plc algorithm is similar to the G.711 Appendix I
> except:
> 1. The pitch detection algorithm: G.711 Appendix I uses cross
> correlation, but Asterisk uses AMDF, which is simpler and also performs
> well
>
Correct.
> 2. The OLA window: G.711 updates the OLA window length when burst loss
> occurs, but Asterisk doesn't
>
Wrong. They both use the same OLA strategy - 1/4 pitch period overlap.
G.711 will prolong the OLA window by 4ms until it reaches 10ms, but the
Asterisk one doesn't?
////////////////////////////////////////////////////////////////////////////////////////////////
G.711 Appendix I
I.2.7 First good frame after an erasure
At the first good frame after an erasure, a smooth transition is needed
between the synthesized erasure speech and the real signal. To do this,
the synthesized speech from the pitch buffer is continued beyond the end
of the erasure, and then mixed with the real signal using an OLA. The
length of the OLA depends on both the pitch period and the length of the
erasure. For short, 10 ms erasures, a 1/4 wavelength window is used. For
longer erasures the window is increased by 4 ms per 10 ms of erasure, up
to a maximum of the frame size, 10 ms.
////////////////////////////////////////////////////////////////////////////////////////////////
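For reference, one reading of that I.2.7 rule works out to something like
the sketch below, assuming 8kHz sampling (so 1ms is 8 samples); the
function and names are illustrative, not the reference code:

    /* OLA window length, in samples, for the first good frame after an
       erasure of erasure_ms milliseconds (erasure_ms >= 10 assumed). */
    static int ola_len(int pitch_period, int erasure_ms)
    {
        int len = pitch_period/4;           /* 1/4 wavelength to start */

        /* Grow by 4ms for each 10ms of erasure beyond the first 10ms */
        len += ((erasure_ms - 10)/10)*4*8;
        if (len > 10*8)
            len = 10*8;                     /* cap at the 10ms frame size */
        return len;
    }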
> 3. The nearby field of the first erasure: G.711 delays the output for
> 3.75 ms to compensate for the probable loss, but Asterisk just uses the
> symmetrical part before the loss to do the OLA. The one G.711
> Appendix I utilizes should be better, but it's not very important, as
> human ears are quite tolerant of such artifacts.
>
That 3.75ms delay is so the Appendix 1 algorithm can do a 1/4 pitch
period of OLA when an erasure commences. However, it incurs lots of
buffer copying when there are no lost packets. What my code does is time
reverse the last 1/4 pitch period and OLA with that. It sounds nasty, but
listening tests with speech showed it was very close to the sound of the
G.711 Appendix 1 algorithm, and it improves efficiency a lot in the
common case - no packets being lost.
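A rough sketch of that trick, with illustrative names (this shows the
idea, not the actual spandsp code): instead of buffering 3.75ms of future
audio, cross-fade from the time-reversed tail of the history into the
replayed pitch period:

    #include <stdint.h>

    /* Generate the first ola = pitch/4 samples of an erasure.
       hist holds at least "pitch" samples of past audio. */
    static void start_erasure_ola(const int16_t *hist, int hist_len,
                                  int pitch, int16_t *out)
    {
        int ola = pitch/4;
        int i;

        for (i = 0;  i < ola;  i++)
        {
            /* Synthetic signal: replay from one pitch period back */
            int32_t synth = hist[hist_len - pitch + i];
            /* Time-reversed history: walk backwards from the last sample */
            int32_t rev = hist[hist_len - 1 - i];
            /* Linear cross-fade from the reversed tail into the replay */
            out[i] = (int16_t) ((rev*(ola - i) + synth*i)/ola);
        }
    }

Because the reversed tail starts at the very last good sample, the splice
is continuous without holding back any output.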
Yeah, the results are similar, but the difference is just the 3.75 ms
delay; I didn't see more buffer copying than necessary. Both algorithms
save the same history (although G.711 keeps a longer one and delays for
3.75ms).
BTW: packet loss is very common, at least in China, and burst loss can
last very long. For example, as the bandwidth between the two major
carriers is very low, two users, one on each, will experience packet loss
very often if they use the public internet rather than some softswitch
network.
There is a lot more copying in the Appendix 1 algorithm. It not only
saves a copy of the audio; it also has to rearrange the output buffer to
delay it by 30 samples. When there are no erasures the difference in
compute requirements is substantial - enough to make me rework the
algorithm to optimise the common case. If you don't think no erasures is
the common case, you have real problems. :-)
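The cost being described in the no-loss case is essentially a per-frame
shuffle of a 30-sample (3.75ms at 8kHz) delay line. A sketch of that
buffering, with illustrative names and under the assumption that every
good frame passes through the delay line:

    #include <stdint.h>
    #include <string.h>

    /* Appendix I style: each good frame of len (> 30) samples is pushed
       through a 30-sample delay line, so the PLC always has "future"
       samples available for an OLA when an erasure starts. */
    static void output_delayed(int16_t *delay, const int16_t *frame,
                               int len, int16_t *out)
    {
        memcpy(out, delay, 30*sizeof(int16_t));              /* oldest 30 */
        memcpy(out + 30, frame, (len - 30)*sizeof(int16_t)); /* rest */
        memcpy(delay, frame + len - 30, 30*sizeof(int16_t)); /* new tail */
    }

The time-reversal approach skips all of this and passes good frames
straight through, only touching the history buffer.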
In Southern China my experience has been of very very low packet loss,
and the full bandwidth of ADSL connections being available most of the
time. International comms can be more congested, but there is a lot of
local overcapacity. I don't know much about Northern China.
> I prolong the pitch period to a maximum of 3 pitch periods, but
> Asterisk only uses one, which saves memory but behaves badly on burst
> loss.
>
For prolonged erasures G.711 Appendix 1 and my code act in exactly the
same way. They linearly attenuate to zero over the first 50ms. In that
period they repeat the last 1.25 pitch periods of real speech, with a
quarter pitch period of overlap. When real speech restarts they both do a
1/4 pitch period of OLA, based on the last known pitch. The algorithms
are identical beyond the initial 1/4 pitch period of OLA. Why would
anyone want to save memory here? It only uses a small amount. The
algorithmic changes were to reduce the buffer manipulation in the common
case.
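The linear fade both sides describe is simple to state in code; a sketch,
assuming 8kHz sampling and illustrative names:

    #include <stdint.h>

    #define ATTENUATION_SPAN 400    /* fade to silence over 50ms at 8kHz */

    /* Attenuate a chunk of synthetic audio; "offset" is how many samples
       of the erasure have already been generated. */
    static void fade_chunk(int16_t *buf, int len, int offset)
    {
        int i;

        for (i = 0;  i < len;  i++)
        {
            int pos = offset + i;

            if (pos >= ATTENUATION_SPAN)
                buf[i] = 0;         /* fully faded: output silence */
            else
                buf[i] = (int16_t) (((int32_t) buf[i]
                                     *(ATTENUATION_SPAN - pos))
                                    /ATTENUATION_SPAN);
        }
    }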
> 4. Whether to prolong the pitch period during burst loss: G.711 Appendix
Not the same.
////////////////////////////////////////////////////////////////////////////////////////////////
G.711 Appendix I
I.2.5 Synthetic signal generation after 10 ms
If the next frame is also erased, the erasure will be at least 20 ms long
and further action is required. While repeating a single pitch period
works well for short erasures (e.g. 10 ms), on long erasures it
introduces unnatural harmonic artifacts (beeps). This is especially
noticeable if the erasure lands in an unvoiced region of speech, or in a
region of rapid transition such as a stop. It was discovered by
experimentation that these artifacts are significantly reduced by
increasing the number of pitch periods used to synthesize the signal as
the erasure progresses. Playing more pitch periods increases the
variation in the signal. Although the pitch periods are not played in the
order they occurred in the original signal, the resulting output still
sounds natural. At 10 ms into the erasure the number of pitch periods
used to synthesize the speech is increased to two, and at 20 ms a third
pitch period is added. For erasures longer than 20 ms no additional
modifications to the pitch buffer are made.
////////////////////////////////////////////////////////////////////////////////////////////////
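That I.2.5 schedule reduces to a small lookup; a sketch with an
illustrative function name:

    /* Number of past pitch periods replayed, per G.711 Appendix I
       I.2.5, as a function of how far into the erasure we are. */
    static int replay_periods(int erasure_ms)
    {
        if (erasure_ms < 10)
            return 1;   /* first 10ms: repeat a single pitch period */
        if (erasure_ms < 20)
            return 2;   /* 10-20ms in: use two pitch periods */
        return 3;       /* beyond 20ms: three periods, no further growth */
    }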
Actually, people complain that Appendix 1 PLC implementations also beep.
You'll find that improvements in that area are one of the main claims for
the LPC based PLC algorithms. I'd have to go back and check on this. It's
a while since I wrote the code. If I diverged from the Appendix 1
algorithm I must have done so for a good reason, like it simplified
something without noticeable impact on quality.
I think the documentation for my PLC code is missing from the Asterisk
source code, but you can find it at
http://www.soft-switch.org/spandsp-doc/plc_page.html
No, it's available in plc.h under asterisk/include. :)
Regards,
Steve
As I said before, you really have to try voice, not music. It makes a
huge difference. If you try a continuous tone the PLC algorithm behaves
terribly, but that's another case nobody cares about. :-)
Regards,
Steve