I have a few radios at work (ARC-210-1851, PSC-5D, PRC-117F) that use
MELP (Mixed Excitation Linear Prediction) as a vocoder. We have found
MELP to be superior to LPC-10 (more human-like voice quality, less
Charlie Brown’s teacher), but we use far larger bandwidths than 100 Hz.
I do not know how well any of this will play out at such a narrow
bandwidth. Listening to Charlie Brown’s teacher will send you running
away quickly, and you should think of your listeners . . . they will
tire very quickly. Just because voice can be sent at such narrow
bandwidths does not necessarily mean that people will like listening
to it.


Rick – KH2DF


  _____  

From: digitalradio@yahoogroups.com [mailto:[EMAIL PROTECTED]] On
Behalf Of Vojtech Bubník
Sent: Saturday, November 17, 2007 9:11 AM
To: [EMAIL PROTECTED]; digitalradio@yahoogroups.com
Subject: [digitalradio] Re: digital voice within 100 Hz bandwidth


Hi Mike.

I studied some aspects of voice recognition about 10 years ago, when I
thought of joining a research group at the Czech Technical University
in Prague. I have a 260-page textbook on voice recognition on my
bookshelf.

A voice signal has high redundancy compared to a text transcription,
but it also carries additional information such as pitch, intonation
and speed. One could, for example, estimate the mood of the speaker
from the utterance.
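
Just to put rough numbers on that redundancy, here is a
back-of-the-envelope comparison in Python (both figures are
illustrative assumptions of mine, not measurements):

pcm_bps = 8000 * 8    # telephone-quality PCM: 8 kHz sampling, 8 bits/sample
text_bps = 15 * 8     # about 15 characters of transcribed text per second
print(f"voice: {pcm_bps} bit/s, text: {text_bps} bit/s, "
      f"ratio {pcm_bps // text_bps}:1")
# -> voice: 64000 bit/s, text: 120 bit/s, ratio 533:1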

The vocal tract can be described by a generator (a tone for vowels, a
hiss for consonants) followed by a filter. Translating voice into
generator and filter coefficients greatly decreases the redundancy of
the voice data (see the sketch below). This is roughly the technique
the common voice codecs use. GSM voice compression is a kind of
Algebraic Code Excited Linear Prediction. Another interesting codec is
AMBE (Advanced Multi-Band Excitation), used by the DSTAR system. The
GSM half-rate codec squeezes voice down to 5.6 kbit/s, AMBE to 3.6
kbit/s. Both systems use excitation tables, but AMBE is more efficient
and closed source. I think the key to the efficiency is the size and
quality of the excitation tables. Creating such an algorithm requires a
considerable amount of research and data analysis. The intelligibility
of the GSM and AMBE codecs is very good. You can buy the intellectual
property of the AMBE codec by buying the chip. There are a couple of
projects under way trying to build DSTAR into legacy transceivers.
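
Here is a minimal sketch of that generator-plus-filter model in Python
(the numbers are toy values of mine, not real codec coefficients):

import numpy as np
from scipy.signal import lfilter

fs, n = 8000, 160               # 8 kHz sample rate, one 20 ms frame

# Generator ("source"): an impulse train for a voiced sound, white
# noise for a hiss.
pitch_hz = 100
voiced = np.zeros(n)
voiced[::fs // pitch_hz] = 1.0  # one pulse every 80 samples -> 100 Hz
unvoiced = np.random.randn(n)

# Filter ("vocal tract"): an all-pole filter; this toy 2nd-order
# example stands in for the ~10 LPC coefficients a real codec
# estimates from the speech every frame.
a = np.array([1.0, -1.3, 0.8])
vowel = lfilter([1.0], a, voiced)
hiss = lfilter([1.0], a, unvoiced)

# Per frame, a codec transmits only the voiced/unvoiced flag, pitch,
# gain and the few coefficients in a -- far less than 160 raw samples.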

About 10 years ago we at the OK1KPI club experimented with an
EchoLink-like system. We modified the Speak Freely software
(http://www.speakfreely.org/) to control an FM transceiver, and we
added a web interface to control the tuning and subtone of the
transceiver. It was a lot of fun and a unique system at that time. The
best compression factor was offered by the LPC-10 codec (3460 bit/s),
but the sound is very robot-like and quite hard to understand. In the
end we reverted to GSM. I think IVOX is a variant of the LPC system
that we tried.

Your proposal is to increase the compression rate by transmitting
phonemes. I once had the same idea, but I quickly rejected it. Although
it may be a nice exercise, I find it not very useful until good
continuous-speech, multi-speaker, multi-language recognition systems
are available. I will try to explain my reasoning behind that
statement.

Let's classify voice recognition systems by implementation complexity:
1) Single-speaker, limited set of utterances recognized (control your
desktop by voice)
2) Multi-speaker, limited set of utterances recognized (automated phone
system)
3) Dictation system
4) Continuous speech transcription
5) Speech recognition and understanding

Your proposal would need to implement most of the code from 4) or 5)
to be really usable, and it would have to be reliable.

State-of-the-art voice recognition systems use hidden Markov models to
detect phonemes. A phoneme is found by traversing a state diagram while
evaluating a sequence of recorded spectra. The phoneme is soft-decoded:
the output of the classifier is a list of phonemes with their detection
probabilities attached. To cope with phoneme smearing at the
boundaries, either sub-phonemes or phoneme pairs need to be detected.
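
A toy illustration of that soft decoding in Python, with every number
invented (real systems use sub-phoneme states and observed spectra
such as MFCCs):

import numpy as np

phonemes = ["s", "t", "ow", "n"]
# P(observation | phoneme), one row per frame (made-up scores):
obs_lik = np.array([[0.60, 0.30, 0.05, 0.05],
                    [0.20, 0.50, 0.20, 0.10],
                    [0.05, 0.10, 0.70, 0.15],
                    [0.05, 0.05, 0.20, 0.70]])
trans = np.full((4, 4), 0.1) + np.eye(4) * 0.6   # sticky self-transitions

alpha = obs_lik[0] / obs_lik[0].sum()            # forward probs, frame 0
for t in range(1, len(obs_lik)):
    alpha = (alpha @ trans) * obs_lik[t]         # HMM forward recursion
    alpha /= alpha.sum()
    # Soft decision: a ranked list of phonemes with probabilities,
    # not one hard choice per frame.
    ranking = sorted(zip(phonemes, alpha), key=lambda x: -x[1])
    print(f"frame {t}:", [(p, round(float(pr), 2)) for p, pr in ranking])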

After the phonemes are classified, they are chained into words, and the
most probable words are picked depending on the dictionary. You suppose
that your system will not need this. But the trouble is the consonants:
they carry much less energy than vowels and are much more easily
confused. The dictionary is what allows a consonant that was only the
second-highest-probability detection to be picked in a word. Not only
the dictionary, but also the phoneme classifier is language dependent.
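
A sketch of that dictionary trick, with toy probabilities and spellings
of mine: the top-ranked initial consonant alone would yield a non-word,
and the dictionary promotes the second-ranked one.

from itertools import product

dictionary = {"stone", "bone", "tone"}

# Hypothetical soft classifier output, position by position; the weak
# initial consonant is easily confused, the vowel is solid.
candidates = [[("v", 0.5), ("st", 0.4)],
              [("o", 0.9)],
              [("ne", 0.9)]]

# A greedy hard decision gives a non-word:
print("".join(c[0][0] for c in candidates))       # -> vone

# Dictionary rescoring: keep the most probable combination that is a word.
best, best_p = None, 0.0
for combo in product(*candidates):
    word = "".join(sym for sym, _ in combo)
    p = 1.0
    for _, prob in combo:
        p *= prob
    if word in dictionary and p > best_p:
        best, best_p = word, p
print(best)                                       # -> stone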

I think the human brain works the same way. Imagine learning a foreign
language. Even if you are able to recognize slowly pronounced words,
you will be unable to pick them out of a quickly spoken sentence; the
words will sound different. A human needs considerable training to
understand a language. You could decrease the complexity of the decoder
by constraining the detection to slowly dictated, separate words.

If you simply pick the highest-probability phoneme, you will experience
the comprehension problems of people with hearing loss. Oh yes, I
currently work for a hearing instrument manufacturer (I have nothing to
do with merck.com).

From http://www.merck.com/mmhe/sec19/ch218/ch218a.html:
> Loss of the ability to hear high-pitched sounds often makes it more
> difficult to understand speech. Although the loudness of speech
> appears normal to the person, certain consonant sounds—such as the
> sound of letters C, D, K, P, S, and T—become hard to distinguish, so
> that many people with hearing loss think the speaker is mumbling.
> Words can be misinterpreted. For example, a person may hear “bone”
> when the speaker said “stone.”

For me, it would be very irritating to dictate slowly to a system
knowing that it will add some mumbling, without even getting feedback
about the errors the recognizer makes. From my perspective, until good
voice recognition systems are available, it is reasonable to stick to
the keyboard for extremely low bit rates. If you would like to
experiment, there are lots of open-source voice recognition packages. I
am sure you could hack one to output the most probable phoneme detected
and try for yourself whether the result is intelligible or not. You do
not need the sound-generating system for that experiment; it is quite
easy to read the written phonemes. Once you have a good phoneme
detector, the rest of your proposed software package is a piece of
cake.
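
To make that experiment concrete, here is the kind of hack I mean, in
Python; the per-frame phoneme probabilities are invented and stand in
for whatever your recognizer actually outputs:

import numpy as np

phonemes = ["sil", "s", "t", "ow", "n"]
posteriors = np.array([[0.90, 0.05, 0.02, 0.02, 0.01],   # P(phoneme|frame),
                       [0.10, 0.60, 0.20, 0.05, 0.05],   # one row per frame
                       [0.10, 0.20, 0.50, 0.10, 0.10],
                       [0.05, 0.05, 0.10, 0.70, 0.10],
                       [0.05, 0.05, 0.10, 0.10, 0.70]])

best = [phonemes[i] for i in posteriors.argmax(axis=1)]

# Collapse repeats and drop silence to get a readable phoneme string.
out, prev = [], None
for p in best:
    if p != prev and p != "sil":
        out.append(p)
    prev = p
print(" ".join(out))    # -> s t ow n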

I am afraid I will disappoint you. I do not condemn your work; I found
a couple of nice ideas in your text. I like the idea of arranging the
varicode table so that similarly sounding phonemes get neighboring
codes, and of coding phoneme length by filling gaps in the data stream
with a special code. But I would suggest reading a textbook on voice
recognition so as not to reinvent the wheel.
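
As a toy sketch of those two ideas in Python (the code table is
entirely invented): similar-sounding phonemes get codewords one bit
apart, so a single bit error turns "s" into "f" rather than into a
vowel, and a special gap-filler code stretches the previous phoneme to
convey its duration.

EXTEND = "000"   # gap filler: "previous phoneme continues"
varicode = {     # invented table; near sounds differ in one bit
    "s":  "0010",
    "f":  "0011",   # s/f are easily confused -> adjacent codewords
    "t":  "0101",
    "ow": "0110",
    "uw": "0111",   # ow/uw likewise
}

def encode(segments):
    """segments: list of (phoneme, duration in frames) pairs."""
    bits = []
    for phoneme, dur in segments:
        bits.append(varicode[phoneme])
        bits.extend(EXTEND for _ in range(dur - 1))  # length via gap filler
    return " ".join(bits)

print(encode([("s", 1), ("t", 1), ("ow", 3)]))
# -> 0010 0101 0110 000 000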

73 and GL, Vojtech OK1IAK

