(C) Copyright, David (No) - June 2012
---This is the second part of the article. Please read parts 1 and 3 as well to
get the full picture.---
Reproducing The Human Voice
- The Crew Meet Their First Challenges
The human voice is a masterpiece of creation. As air passes the vocal cords in
your throat, they start to vibrate. By controlling the airflow in your lungs and
throat, and actively using your tongue and the whole of your mouth, you can
produce quite a wide range of sounds. Not only can you produce whispers, speech
and shouts, but you can also modulate the tone, pace and intensity. Doing so,
you can talk in a soft and loving voice to your kids, or shout out warnings
across the street. Good modulation in your speech will even convey a certain
amount of authority. Correctly trained and controlled, your voice can even be
made into a wonderful instrument. This is why we can enjoy good singers. Truly
a masterpiece of creation - and definitely a hard one to copy.
But your voice, lungs, tongue, mouth and throat muscles are not the only things
in play here. Being a good narrator takes a good portion of coordination
between all of these body parts, and even includes your eyes. All of that
coordination is handled by your brain. And as if it did not already have enough
to coordinate, there is yet one more thing to include here: your emotions.
Feelings.
Narrating a text - with its quotations, punctuation, parenthesized phrases,
emotional expressions and plain 'ordinary' passages - calls for work from your
whole body. Trying to reproduce ALL of this electronically is an enormous task.
A task that gets even more complicated, because many texts need a bit of
interpretation. A certain phrase, given as part of an article or story, will
need one kind of modulation. Were you to include the exact same phrase as part
of a joke, you would all of a sudden need a rather different modulation. And if
the phrase appears on your computer screen, as part of a message box popping
up, you might want it read out in a way that totally differs from any of the
others. All of this context-dependent interpretation will NEVER be possible
electronically. Why? Simply due to the computer's lack of thought processes and
emotions. Based on this, we can establish the fact that the PERFECT synthetic
voice will simply never be released.
Still, a handful of the above challenges are taken into consideration when
building a synthetic voice. Depending on how many of the issues of intonation,
modulation, pace and basic interpretation a manufacturer has incorporated into
the production, you will get a more or less realistic-sounding voice for your
computer. Yet even when the manufacturer has endeavored to deal with most of
these challenges, you could end up with a rather poor product. Many a voice on
the market stresses one or more of the issues so eagerly that it becomes
somewhat wearisome to listen to over time. If a voice has a certain modulation,
it might sound wonderful for one sentence. But when the exact same modulation
is repeated over and over on each and every sentence in your text, no matter
the length or character of the text, it simply does not flow well in our ears.
Other voices tend to pause wrongly at punctuation, making long breaks at commas
and hardly any breaks between sentences. Still others lengthen the vowels in
the text to the extent that many words sound rather strange to the human ear.
Now that we know the many challenges in reproducing the human voice, maybe it
is time we take a look at how the manufacturers deal with the whole matter. As
you will see in the coming paragraphs, producing a basic synthetic voice might
not be all that hard - provided you have the knowledge and equipment to do so.
Yet to make your product sound realistic would take a crew of professionals. A
speech therapist who is a native speaker of the language your voice is meant to
work in might be worth hiring. A group with linguistic knowledge would likely
be another qualified addition to your team. These professional members would be
able to give good counsel on how to implement the right set of rules for good
speech reproduction. Only, such a crew costs a small fortune. If you are a
private person with limited resources, you might soon enough find yourself
stuck in the wilderness of challenges.
Electronic Voices
- A Peek Behind The Scene
Breaking an electronic voice down to its very basic parts, we find that it
consists of numerous tones. If you have ever tried playing the guitar, you will
know that by plucking a string on the instrument, then quickly sliding your
finger up and down the string, you can vary the tone quite a lot. Putting a
number of such tones together according to certain rules produces a sound that
fools our human ears into hearing a spoken vowel or consonant. This kind of
sound can be produced electronically, by your computer. Stringing numerous
pieces of 'music' of this kind together produces our first word spoken by an
electronic voice. Stringing several words together makes the first spoken
phrase; and so on. It is all done synthetically and electronically. And that is
exactly the way it sounds as well. You would typically hear the metallic robot
voice so famously known from science-fiction movies. If you happen to have such
a robotic voice installed on your computer, and you let it read a phrase to you
at a very low speed, you will hear the different small tone combinations being
played. You can even hear how it 'slides' its way through the whole range of
tone combinations.
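The 'tone combinations' described above can be sketched as a mix of sine waves. This is only a rough illustration: the frequencies below are made-up guesses, not measured values for any real vowel.

```python
import math

def synthesize_tone_mix(frequencies, duration_s=0.1, sample_rate=8000):
    """Mix several sine tones into one clip -- a crude stand-in for the
    tone combinations a basic synthetic voice plays back."""
    n = int(duration_s * sample_rate)
    samples = []
    for i in range(n):
        t = i / sample_rate
        # Sum one sine wave per frequency, then normalize by the count
        # so the mixed sample stays within -1.0 .. 1.0.
        value = sum(math.sin(2 * math.pi * f * t) for f in frequencies)
        samples.append(value / len(frequencies))
    return samples

# A low 'voicing' tone plus two higher tones -- vaguely vowel-like.
ah_sound = synthesize_tone_mix([110, 700, 1200])
```

Playing thousands of such short clips in a row, one per letter sound, is essentially what the robotic voice does.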
This basic voice can hardly be used for anything but sound effects and
illustration purposes. It is way too robotic, and has no realistic touch to it.
It might even pronounce numerous words wrongly. Before we can release the voice
as a worthy stand-alone product, we need a good portion more work done on the
project.
Each character in our alphabet has its own sound. Reproducing each of these
sounds electronically can be done with a certain amount of effort - as we just
discussed. But try to sit down and speak the word "WATER" to yourself. Try to
really slow the pronunciation down to the extreme. You will hear that we start
out with some kind of 'thick' W-sound. But do we immediately jump to the
A-sound? No, we smoothly 'slide' from the W, to a W-A, and into the A-sound.
The same goes for any of the remaining letters in our word. And this is one of
the big issues in creating a good-sounding electronic voice. The example we
just went through gave the very basics. Here we would need to reproduce the
sound of each letter, and one sound for each two-letter combination. Still,
doing such a basic job would make our voice sound rather choppy - especially in
cases where we let it read at a slow pace. A professionally designed voice
would have several 'snapshots' of the transition between two letters. Let's
break it into a five-step process, instead of the three-step one illustrated
above. You would now have one sound reproducing the pure W. The next sound
would reproduce a clear W with a tiny A attached. The third sound would be a
balanced W-A sound. Fourth comes a sound reproducing a short W, followed by a
longer A. And finally, we have a clear A-sound. The W-A combination will now be
spoken a bit more smoothly by your electronic voice. And this break-down of the
different letter combinations could go on indefinitely. Depending on how many
steps each letter combination is broken into, the voice will perform the
pronunciation more or less realistically. The more steps for each combination,
the better the voice will sound.
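The five-step W-A break-down can be illustrated as a simple linear blend between two sets of tone parameters. The frequencies here are invented purely for illustration; they are not real measurements of W or A.

```python
def transition_steps(start_params, end_params, steps=5):
    """Blend two tone-parameter lists (e.g. the frequencies for 'W' and
    for 'A') across the given number of steps, endpoints included --
    the 'snapshots' of the transition described in the text."""
    out = []
    for k in range(steps):
        weight = k / (steps - 1)   # 0.0 = pure start, 1.0 = pure end
        out.append([round(s * (1 - weight) + e * weight, 1)
                    for s, e in zip(start_params, end_params)])
    return out

# Hypothetical frequencies for 'W' and 'A' (illustrative only).
w_a = transition_steps([300, 600], [700, 1200], steps=5)
# w_a[0] is the pure W, w_a[2] the balanced W-A, w_a[4] the pure A.
```

Raising `steps` gives more intermediate snapshots and a smoother slide, at the cost of a bigger sound library.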
Such letter-combination work can become quite extensive. In the English
alphabet, we have 26 letters. If you were to make a list of all possible
two-letter combinations from that alphabet, you would end up with a list of
several hundred possibilities. Breaking each of these many possible
combinations into three-step processes would literally bring your sound library
into the thousands of entries. And if your voice is going to have a pleasant
sound, you will need far more than three steps in each process. I dare say your
library of small tones would easily run into hundreds of thousands - if not
millions - of entries.
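That arithmetic is easy to check. With 26 letters there are 26 x 26 = 676 ordered two-letter combinations, and the three- and five-step counts below are the ones used as examples in the text:

```python
def library_size(alphabet_size=26, steps_per_pair=3):
    """Count the sound-library entries needed when every ordered
    two-letter pair is broken into a fixed number of transition steps."""
    pairs = alphabet_size * alphabet_size   # 'wa' and 'aw' differ
    return pairs * steps_per_pair

print(library_size())                    # 676 * 3 = 2028 entries
print(library_size(steps_per_pair=5))    # 676 * 5 = 3380 entries
```

And since each step itself is played back as numerous individual tones, the total tone count multiplies up quickly from there.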
Depending on the technique you decide on at this point in the development, you
could choose between a couple of approaches. Either you save the tones produced
by your computer for each step into small electronic sound clips, which can be
played back at a later time. Alternatively, you can save the parameters for the
tones to be played (frequency, duration and volume) in a file. When the voice
is to speak a given letter or combination, the actual tones are then produced
on the fly, based on these parameters. The first approach ensures an
identical-sounding voice on all computers, but takes up a bit more space on the
hard disk. The second approach is far less space-demanding on the disk, but
also carries the risk that the tones produced vary slightly from what you
expected. Such tiny variations could even affect the whole sound of your
electronic voice.
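The second approach can be sketched in a few lines: store only a small parameter triple on disk, and expand it into actual samples at playback time. The (frequency, duration, volume) layout follows the text; everything else here is an illustrative guess, not a real file format.

```python
import math

def render(params, sample_rate=8000):
    """Turn a (frequency_hz, duration_s, volume) triple into raw
    audio samples, produced on the fly at playback time."""
    freq, dur, vol = params
    n = int(dur * sample_rate)
    return [vol * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n)]

stored = (440.0, 0.05, 0.8)   # a handful of bytes on disk...
clip = render(stored)          # ...expanded to 400 samples only when spoken
```

Note how the rendered result depends on `sample_rate`: render the same stored triple at a different rate and you get a different number of samples, which is exactly the kind of machine-to-machine variation the text warns about.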
Now that you have your thousands or millions of tones in place, we can start to
put them into action. We then need a bit of programming - a small piece of
software that will take care of all the playback. Your software will 'read' the
word to be spoken off your screen. It will then break the word into two-letter
combinations, and play back the predefined tones for each combination. So the
word "water" would be broken into the combinations of:
w-a
a-t
t-e
e-r
Each combination takes numerous tones to play back. But finally, you have a
product that can be used for very basic reading of any text the user might
want to listen to. Yet, there is more to be dealt with.
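The break-down of "water" shown above is a simple sliding window over the word, which can be written as:

```python
def letter_pairs(word):
    """Split a word into overlapping two-letter combinations, the way
    the playback software breaks down 'water'."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(letter_pairs("water"))   # ['wa', 'at', 'te', 'er']
```

The playback software would then look each pair up in the sound library and play its stored transition steps in order.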
Your software now needs to care for all kinds of punctuation. It needs to be
programmed to make a pause of a certain duration every time it comes across a
space character in the text; otherwise, you will have no real break between the
words in your reading. Further, the durations of pauses for things like commas,
colons, semicolons and periods will have to be preset in your software.
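A preset pause table might look like the sketch below. The millisecond values are invented for illustration; a real voice would tune them carefully.

```python
# Illustrative pause durations, in milliseconds (not real product values).
PAUSE_MS = {" ": 50, ",": 200, ";": 300, ":": 300, ".": 500}

def pause_for(char):
    """Look up the preset pause for a space or punctuation character;
    anything not in the table gets no pause at all."""
    return PAUSE_MS.get(char, 0)
```

Getting these relative durations wrong is exactly the flaw described earlier: long breaks at commas and hardly any between sentences.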
You will further need your software to turn things like numbers into
spelled-out words. The numeric character 1 has no basic pronunciation of its
own. We pronounce it as a word, made up of the letters O, N, and E. So we have
to tell our voice software that the character 1 should be replaced with the
word "one". These kinds of pronunciation rules are collected in a
'pronunciation dictionary'. This dictionary holds entries for the many
characters that need some kind of modification before they can be pronounced
correctly. Oh, but wait! If we strictly follow the rules of two-letter
combinations outlined above, sending the three letters O, N, E to the voice for
playback would not produce a correct pronunciation. Try for yourself,
pronouncing each of the three characters quickly in a row, and you will hear
how wrong it would sound. So our dictionary should actually hold the three
letters W, O, and N, to have the numeric character 1 pronounced correctly.
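Such a dictionary is, at its simplest, a character-to-spelling map applied before the two-letter rules run. The "1" -> "won" entry is the one from the text; the other entries are illustrative phonetic guesses.

```python
# A tiny pronunciation dictionary: characters are replaced with a
# *phonetic* spelling, so the two-letter combination rules produce the
# right sound ('won' rather than 'one' for the digit 1).
PRONUNCIATION = {"1": "won", "2": "too", "&": "and"}

def normalize(text):
    """Replace each dictionary character with its phonetic spelling,
    leaving everything else untouched."""
    return "".join(PRONUNCIATION.get(ch, ch) for ch in text)

print(normalize("1 & 2"))   # 'won and too'
```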
Further on, your pronunciation dictionary will need rules for handling two-,
three- and four-digit numbers, grouping them into tens, hundreds and thousands.
It needs rules for handling dates, times and commonly used abbreviations. As if
all of this isn't enough, we also need rules - or entries - in the
pronunciation dictionary for how to spell out each letter itself. You know, the
letter B, when spelling, is pronounced like a B with a following E-sound: BE.
And it gets worse when we reach the W, as it actually has to be pronounced as
two words when spelling: Double, U.
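Spelling mode, then, needs its own table of spoken letter names. Only a few entries are shown; the B and W names come from the text, the rest of a real table would cover the whole alphabet.

```python
# Spoken letter names for spelling mode. 'W' maps to two words,
# exactly the awkward case the text points out.
LETTER_NAMES = {"A": "ay", "B": "bee", "W": "double u"}

def spell(word):
    """Spell a word letter by letter using the spoken letter names,
    falling back to the raw character when no name is listed."""
    return " ".join(LETTER_NAMES.get(ch.upper(), ch.upper()) for ch in word)

print(spell("WB"))   # 'double u bee'
```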
Our trials are not yet over. The letter combination of B and E should be
pronounced with a somewhat long E-sound when we come across the word "BE". Yet
it should be pronounced with a slightly shorter E-sound in words like
"BECAUSE". And if we read about the insect BEE, we don't want to hear it
pronounced as B-E-E. So we need entries in our dictionary caring for all of
these many 'exceptional' pronunciation rules. Do you start to realize why I
told you that a crew of speech therapists and linguists would be needed on your
development team?
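One common way to handle such exceptions is whole-word entries that override the generic letter-combination rules. The phonetic respellings below are illustrative guesses, not a real notation.

```python
# Whole-word exceptions checked before the generic rules. Respellings
# are made up to suggest the different E-sounds the text describes.
EXCEPTIONS = {"be": "beee", "because": "becuz", "bee": "bee"}

def pronounce(word):
    """Use an exception entry when one exists; otherwise fall back to
    the generic two-letter combination break-down."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    return [w[i:i + 2] for i in range(len(w) - 1)]   # generic rule

print(pronounce("because"))   # 'becuz'
print(pronounce("water"))     # ['wa', 'at', 'te', 'er']
```

Every exceptional word a linguist identifies becomes one more entry here, which is why the dictionary work never really ends.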
Finally, we are ready to take care of the real challenges when it comes to
having a smooth-sounding narration of any text. Oh, come on, haven't we already
taken care of everything? Sorry, but no! So far, we have only dealt with the
basic pronunciation features of your new electronic voice. At the outset of the
article, we learned that great care needs to be taken when it comes to things
like modulation and pace. Our software needs to care for these things by
raising the volume, lengthening the duration, or increasing the speed during
the playback of any of the entries in the sound library of your electronic
voice. It sounds easy enough. But there is an extensive amount of analysis to
be performed in order to get the modulation right.
Many voices on the market are rather poorly designed when it comes to this kind
of analysis. Their designers seem to assume that all phrases start with a
higher pitch and end with a low one. No matter the length of the phrase, the
voice will start and end on predefined pitch levels. This results in a rather
monotone voice. It might be good for projects where you only need short and
concise messages spoken - such as when you want your voice to read the weather
reports and forecasts. But it will hardly be what anyone wants for reading the
daily newspaper.
Yet other voices tend to fall into the opposite ditch. They overdo the
modulation, to the extent that it sounds like the narrator is doing the reading
aboard a roller coaster. Each phrase is read out with the pitch jumping wildly
up and down several times throughout the whole length of the phrase. Such
voices can easily give a listening experience where you feel the narrator is
stressing every second or third word, whether or not it is meant to be
stressed.
A third pitfall of modulation is when the voice treats a big amount of text as
one extremely long phrase. In such cases, the voice 'happily' starts out
narrating the text to you at a normal speed and pitch. But as the narration
moves on, the pitch and speed gradually fall, until it ends up sounding like
the narrator has been overworking for three days with no cup of coffee. This
can especially be the case when reading long lists - like your grocery
shopping list - with no punctuation. The result is a narration that is hard to
follow, and a frustrated listener.
Modulation analysis does not only depend on the punctuation in the text to be
spoken. It has to consider the length of each phrase, and maybe even hold the
different phrases up against each other. Further, it has to deal with line
breaks. It might even look at the length of each word, making short words read
out a bit faster than the longer ones. And if it is really professionally done,
the analysis will look out for given word combinations that should affect the
modulation in some way.
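One of those ideas - short words read a bit faster than long ones - can be sketched as a per-word speed plan. The threshold and factors are invented for illustration; real analysis would weigh far more signals.

```python
def modulation_plan(phrase):
    """Assign each word a playback speed factor: short words are read
    slightly faster than long ones. Thresholds are illustrative."""
    plan = []
    for word in phrase.replace(",", "").split():
        speed = 1.2 if len(word) <= 3 else 1.0   # short words a bit faster
        plan.append((word, speed))
    return plan

print(modulation_plan("read the long newspaper"))
```

A professional analysis would combine plans like this one with phrase length, line breaks and known word combinations before choosing the final pitch and pace.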
There are a great many ways of performing all of the above. Your software
should break the text into letter combinations, play back the corresponding
sets of tones, keep track of a huge number of pronunciation rules, and adjust
pitch and speed so as to get good modulation. All of this in real time - that
is, right at the moment. You don't want the speech telling you what was on
your screen ten minutes ago. :-)
Depending on how well designed the software of your voice is, it will handle
all of these tasks more or less quickly. If the manufacturer had a creative
crew, who came up with a few good and general pronunciation rules, your voice
might be far more responsive. If, on the other hand, the software, dictionary
and sound library of your voice are all poorly designed, your voice will tend
to be rather slow to respond. When talking about responsiveness, and
particularly when the voice is going to be used as feedback in your computer
work, we are talking fractions of a second. A visually impaired person who uses
the speech synthesizer to keep track of his typing doesn't want the voice to
give him the info half a second after a key has been pressed, or a word has
been typed. Such a lag, adding up throughout the working day, would easily mean
several minutes of 'waiting for the speech' to keep up with a good and fast
writer. That would only lead to frustration. Unfortunately, quite a number of
voices on the market have responsiveness too low for real daily usage in
environments like this. Still, in cases where you only need narration of a
pre-written text, or of text produced by a given process on the computer, even
slow-responding voices might serve well. Again, here we could mention things
like weather reports, which are pre-processed prior to being sent to the
speech synthesizer for narration.