(C) Copyright, David (No) - June 2012
---This is the second part of the article. Please read parts 1 and 3 as well to
get the full picture.---
Reproducing The Human Voice
- The Crew Meet Their First Challenges
The human voice is a masterpiece of creation. As air passes the vocal cords in
your throat, they start to vibrate. By controlling the airflow in your lungs and
throat, and actively using your tongue and the whole of your mouth, you can
produce quite a wide range of sounds. Not only can you produce whispers, speech
and shouts, but you can also modulate the tone, pace and intensity. Doing so,
you can talk in a soft and loving voice to your kids, or shout out warnings
across the street. Good modulation in your speech will even convey a certain
amount of authority. Correctly trained and controlled, your voice can even be
made into a wonderful instrument. This is why we can enjoy good singers. Truly
a masterpiece of creation - and definitely a hard one to copy.
But your voice, lungs, tongue, mouth and throat muscles are not the only things
in play here. Being a good narrator takes a good portion of coordination
between all of these body parts, and even includes your eyes. All of that
coordination is handled by your brain. And as if it did not already have enough
to coordinate, there is yet one more thing to include here: your emotions.
Feelings.
Narrating a text - with its quotations, punctuation, parenthesized phrases,
emotional expressions and plain 'ordinary' passages - calls for work from your
whole body. Trying to reproduce ALL of this electronically is an enormous task.
A task that gets even more complicated, because many texts need a bit of
interpretation. A certain phrase, given as part of an article or story, will
need one kind of modulation. Were you to include the exact same phrase as part
of a joke, you would all of a sudden need a rather different modulation. And if
the phrase appears on your computer screen, as part of a message box popping
up, you might want it read out in a way that totally differs from any of the
others. All of this context-dependent interpretation will NEVER be possible
electronically. Why? Simply due to the computer's lack of thought processes and
emotions. Based on this, we can establish the fact that the PERFECT synthetic
voice will simply never be released.
Still, a handful of the above challenges are taken into consideration when
building a synthetic voice. Depending on how many of the issues of intonation,
modulation, pace and basic interpretation a manufacturer has incorporated into
the production, you will get a more or less realistic-sounding voice for your
computer. Yet even when the manufacturer has endeavored to deal with most of
these challenges, you could end up with a rather poor product. Many a voice on
the market stresses one or more of the issues so eagerly that it becomes
somewhat wearisome to listen to over time. If a voice has a certain modulation,
it might sound wonderful for one sentence. But when the exact same modulation
is repeated over and over on each and every sentence in your text, no matter
the length or character of the text, it simply does not flow well in our ears.
Other voices tend to pause wrongly at punctuation, making long breaks at commas
and hardly any breaks between sentences. Still others lengthen the vowels in
the text to the extent that many words sound rather strange to the human ear.
Now that we know the many challenges in reproducing the human voice, maybe it
is time we take a look at how the manufacturers deal with the whole matter. As
you will see in the coming paragraphs, producing a basic synthetic voice might
not be all that hard - provided you have the knowledge and equipment to do so.
Yet to make your product sound realistic would take a crew of professionals. A
speech therapist who is a native speaker of the language your voice is meant to
work in might be worth hiring. A group with linguistic knowledge would likely
be another qualified addition to your team. These professional members would be
able to give good counsel on how to implement the right set of rules for good
speech reproduction. Only, such a crew costs a small fortune. If you are a
private person with limited resources, you might soon enough find yourself
stuck in the wilderness of challenges.
Electronic Voices
- A Peek Behind The Scene
Breaking an electronic voice down to its very basic parts, we find that it
consists of numerous tones. If you have ever tried playing the guitar, you will
know that by plucking a string on the instrument, then quickly sliding your
finger up and down the string, you can vary the tone quite a lot. Putting a
number of such tones together according to certain rules produces a sound that
fools our human ears into hearing a spoken vowel or consonant. This kind of
sound can be produced electronically, by your computer. Stringing numerous
pieces of 'music' of this kind together produces our first word spoken by an
electronic voice. Stringing several words together makes the first spoken
phrase; and so on. It is all done synthetically and electronically. And that is
exactly the way it sounds as well. You would typically hear the metallic robot
voice so famously known from science-fiction movies. If you happen to have such
a robotic voice installed on your computer, and you let it read a phrase to you
at a very low speed, you will hear the different small tone combinations being
played. You can even hear how it 'slides' its way through the whole range of
tone combinations.
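The 'tone combinations' described above can be sketched as a mix of sine waves. This is only a rough illustration: the frequencies below are made-up guesses, not measured values for any real vowel.

```python
import math

def synthesize_tone_mix(frequencies, duration_s=0.1, sample_rate=8000):
    """Mix several sine tones into one clip -- a crude stand-in for the
    tone combinations a basic synthetic voice plays back."""
    n = int(duration_s * sample_rate)
    samples = []
    for i in range(n):
        t = i / sample_rate
        # Sum one sine wave per frequency, then normalize by the count
        # so the mixed sample stays within -1.0 .. 1.0.
        value = sum(math.sin(2 * math.pi * f * t) for f in frequencies)
        samples.append(value / len(frequencies))
    return samples

# A low 'voicing' tone plus two higher tones -- vaguely vowel-like.
ah_sound = synthesize_tone_mix([110, 700, 1200])
```

Playing thousands of such short clips in a row, one per letter sound, is essentially what the robotic voice does.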
This basic voice can hardly be used for anything but sound effects and
illustration purposes. It is way too robotic, and has no realistic touch to it.
It might even pronounce numerous words wrongly. Before we can release the voice
as a worthy stand-alone product, we need a good portion more work done on the
project.
Each character in our alphabet has its own sound. Reproducing each of these
sounds electronically can be done with a certain amount of effort - as we just
discussed. But try to sit down and speak the word "WATER" to yourself. Try to
really slow the pronunciation down to the extreme. You will hear that we start
out with some kind of 'thick' W-sound. But do we immediately jump to the
A-sound? No, we smoothly 'slide' from the W, to a W-A, and into the A-sound.
The same goes for any of the remaining letters in our word. And this is one of
the big issues in creating a good-sounding electronic voice. The example we
just went through gave the very basics. Here we would need to reproduce the
sound of each letter, and one sound for each two-letter combination. Still,
doing such a basic job would make our voice sound rather choppy - especially in
cases where we let it read at a slow pace. A professionally designed voice
would have several 'snapshots' of the transition between two letters. Let's
break it into a five-step process, instead of the three-step one illustrated
above. You would now have one sound reproducing the pure W. The next sound
would reproduce a clear W with a tiny A attached. The third sound would be a
balanced W-A sound. Fourth comes a sound reproducing a short W, followed by a
longer A. And finally, we have a clear A-sound. The W-A combination will now be
spoken a bit more smoothly by your electronic voice. And this break-down of the
different letter combinations could go on indefinitely. Depending on how many
steps each letter combination is broken into, the voice will perform the
pronunciation more or less realistically. The more steps for each combination,
the better the voice will sound.
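The five-step W-A break-down can be illustrated as a simple linear blend between two sets of tone parameters. The frequencies here are invented purely for illustration; they are not real measurements of W or A.

```python
def transition_steps(start_params, end_params, steps=5):
    """Blend two tone-parameter lists (e.g. the frequencies for 'W' and
    for 'A') across the given number of steps, endpoints included --
    the 'snapshots' of the transition described in the text."""
    out = []
    for k in range(steps):
        weight = k / (steps - 1)   # 0.0 = pure start, 1.0 = pure end
        out.append([round(s * (1 - weight) + e * weight, 1)
                    for s, e in zip(start_params, end_params)])
    return out

# Hypothetical frequencies for 'W' and 'A' (illustrative only).
w_a = transition_steps([300, 600], [700, 1200], steps=5)
# w_a[0] is the pure W, w_a[2] the balanced W-A, w_a[4] the pure A.
```

Raising `steps` gives more intermediate snapshots and a smoother slide, at the cost of a bigger sound library.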
Such letter-combination work can become quite extensive. In the English
alphabet, we have 26 letters. If you were to make a list of all possible
two-letter combinations from that alphabet, you would end up with a list of
several hundred possibilities. Breaking each of these many possible
combinations into three-step processes would literally bring your sound library
into the thousands of entries. And if your voice is going to have a pleasant
sound, you will need far more than three steps in each process. I dare say your
library of small tones would easily run into hundreds of thousands - if not
millions - of entries.
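That arithmetic is easy to check. With 26 letters there are 26 x 26 = 676 ordered two-letter combinations, and the three- and five-step counts below are the ones used as examples in the text:

```python
def library_size(alphabet_size=26, steps_per_pair=3):
    """Count the sound-library entries needed when every ordered
    two-letter pair is broken into a fixed number of transition steps."""
    pairs = alphabet_size * alphabet_size   # 'wa' and 'aw' differ
    return pairs * steps_per_pair

print(library_size())                    # 676 * 3 = 2028 entries
print(library_size(steps_per_pair=5))    # 676 * 5 = 3380 entries
```

And since each step itself is played back as numerous individual tones, the total tone count multiplies up quickly from there.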
Depending on the technique you decide on at this point in the development, you
could choose between a couple of approaches. Either you save the tones produced
by your computer for each step into small electronic sound clips, which can be
played back at a later time. Alternatively, you can save the parameters for the
tones to be played (frequency, duration and volume) in a file. When the voice
is to speak a given letter or combination, the actual tones are then produced
on the fly, based on these parameters. The first approach ensures an
identical-sounding voice on all computers, but takes up a bit more space on the
hard disk. The second approach is far less space-demanding on the disk, but
also carries the risk that the tones produced vary slightly from what you
expected. Such tiny variations could even affect the whole sound of your
electronic voice.
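The second approach can be sketched in a few lines: store only a small parameter triple on disk, and expand it into actual samples at playback time. The (frequency, duration, volume) layout follows the text; everything else here is an illustrative guess, not a real file format.

```python
import math

def render(params, sample_rate=8000):
    """Turn a (frequency_hz, duration_s, volume) triple into raw
    audio samples, produced on the fly at playback time."""
    freq, dur, vol = params
    n = int(dur * sample_rate)
    return [vol * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n)]

stored = (440.0, 0.05, 0.8)   # a handful of bytes on disk...
clip = render(stored)          # ...expanded to 400 samples only when spoken
```

Note how the rendered result depends on `sample_rate`: render the same stored triple at a different rate and you get a different number of samples, which is exactly the kind of machine-to-machine variation the text warns about.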
Now that you have your thousands or millions of tones in place, we can start to
put them into action. We then need a bit of programming - a small piece of
software that will take care of all the playback. Your software will 'read' the
word to be spoken off your screen. It will then break the word into two-letter
combinations, and play back the predefined tones for each combination. So the
word "water" would be broken into the combinations of:
w-a
a-t
t-e
e-r
Each combination takes numerous tones to play back. But finally, you have a
product that can be used for very basic reading of any text the user might
want to listen to. Yet, there is more to be dealt with.
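The break-down of "water" shown above is a simple sliding window over the word, which can be written as:

```python
def letter_pairs(word):
    """Split a word into overlapping two-letter combinations, the way
    the playback software breaks down 'water'."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(letter_pairs("water"))   # ['wa', 'at', 'te', 'er']
```

The playback software would then look each pair up in the sound library and play its stored transition steps in order.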
Your software now needs to care for all kinds of punctuation. It needs to be
programmed to make a pause of a certain duration every time it comes across a
space character in the text; otherwise, you will have no real break between the
words in your reading. Further, the durations of pauses for things like commas,
colons, semicolons and periods will have to be preset in your software.
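A preset pause table might look like the sketch below. The millisecond values are invented for illustration; a real voice would tune them carefully.

```python
# Illustrative pause durations, in milliseconds (not real product values).
PAUSE_MS = {" ": 50, ",": 200, ";": 300, ":": 300, ".": 500}

def pause_for(char):
    """Look up the preset pause for a space or punctuation character;
    anything not in the table gets no pause at all."""
    return PAUSE_MS.get(char, 0)
```

Getting these relative durations wrong is exactly the flaw described earlier: long breaks at commas and hardly any between sentences.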
You will further need your software to turn things like numbers into
spelled-out words. The numeric character 1 has no basic pronunciation of its
own. We pronounce it as a word, made up of the letters O, N, and E. So we have
to tell our voice software that the character 1 should be replaced with the
word "one". These kinds of pronunciation rules are collected in a
'pronunciation dictionary'. This dictionary holds entries for the many
characters that need some kind of modification before they can be pronounced
correctly. Oh, but wait! If we strictly follow the rules of two-letter
combinations outlined above, sending the three letters O, N, E to the voice for
playback would not produce a correct pronunciation. Try for yourself,
pronouncing each of the three characters quickly in a row, and you will hear
how wrong it would sound. So our dictionary should actually hold the three
letters W, O, and N, to have the numeric character 1 pronounced correctly.
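Such a dictionary is, at its simplest, a character-to-spelling map applied before the two-letter rules run. The "1" -> "won" entry is the one from the text; the other entries are illustrative phonetic guesses.

```python
# A tiny pronunciation dictionary: characters are replaced with a
# *phonetic* spelling, so the two-letter combination rules produce the
# right sound ('won' rather than 'one' for the digit 1).
PRONUNCIATION = {"1": "won", "2": "too", "&": "and"}

def normalize(text):
    """Replace each dictionary character with its phonetic spelling,
    leaving everything else untouched."""
    return "".join(PRONUNCIATION.get(ch, ch) for ch in text)

print(normalize("1 & 2"))   # 'won and too'
```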
Further on, your pronunciation dictionary will need rules for handling two-,
three- and four-digit numbers, grouping them into tens, hundreds and thousands.
It needs rules for handling dates, times and commonly used abbreviations. As if
all of this isn't enough, we also need rules - or entries - in the
pronunciation dictionary for how to spell out each letter itself. You know, the
letter B, when spelling, is pronounced like a B with a following E-sound: BE.
And it gets worse when we reach the W, as it actually has to be pronounced as
two words when spelling: Double, U.
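Spelling mode, then, needs its own table of spoken letter names. Only a few entries are shown; the B and W names come from the text, the rest of a real table would cover the whole alphabet.

```python
# Spoken letter names for spelling mode. 'W' maps to two words,
# exactly the awkward case the text points out.
LETTER_NAMES = {"A": "ay", "B": "bee", "W": "double u"}

def spell(word):
    """Spell a word letter by letter using the spoken letter names,
    falling back to the raw character when no name is listed."""
    return " ".join(LETTER_NAMES.get(ch.upper(), ch.upper()) for ch in word)

print(spell("WB"))   # 'double u bee'
```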
Our trials are not yet over. The letter combination of B and E should be
pronounced with a somewhat long E-sound when we come across the word "BE". Yet
it should be pronounced with a slightly shorter E-sound in words like
"BECAUSE". And if we read about the insect BEE, we don't want to hear it
pronounced as B-E-E. So we need entries in our dictionary caring for all of
these many 'exceptional' pronunciation rules. Do you start to realize why I
told you that a crew of speech therapists and linguists would be needed on your
development team?
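One common way to handle such exceptions is whole-word entries that override the generic letter-combination rules. The phonetic respellings below are illustrative guesses, not a real notation.

```python
# Whole-word exceptions checked before the generic rules. Respellings
# are made up to suggest the different E-sounds the text describes.
EXCEPTIONS = {"be": "beee", "because": "becuz", "bee": "bee"}

def pronounce(word):
    """Use an exception entry when one exists; otherwise fall back to
    the generic two-letter combination break-down."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    return [w[i:i + 2] for i in range(len(w) - 1)]   # generic rule

print(pronounce("because"))   # 'becuz'
print(pronounce("water"))     # ['wa', 'at', 'te', 'er']
```

Every exceptional word a linguist identifies becomes one more entry here, which is why the dictionary work never really ends.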
Finally, we are ready to take care of the real challenges when it comes to
having a smooth-sounding narration of any text. Oh, come on, haven't we already
taken care of everything? Sorry, but no! So far, we have only dealt with the
basic pronunciation features of your new electronic voice. At the outset of the
article, we learned that great care needs to be taken when it comes to things
like modulation and pace. Our software needs to care for these things by
raising the volume, lengthening the duration, or increasing the speed during
the playback of any of the entries in the sound library of your electronic
voice. It sounds easy enough. But there is an extensive amount of analysis to
be performed in order to get the modulation right.
Many voices on the market are rather poorly designed when it comes to this kind
of analysis. Their designers seem to assume that all phrases start with a
higher pitch and end with a low one. No matter the length of the phrase, the
voice will start and end on predefined pitch levels. This results in a rather
monotone voice. It might be good for projects where you only need short and
concise messages spoken - such as when you want your voice to read the weather
reports and forecasts. But it will hardly be what anyone wants for reading the
daily newspaper.
Yet other voices tend to fall into the opposite ditch. They overdo the
modulation, to the extent that it sounds like the narrator is doing the reading
aboard a roller coaster. Each phrase is read out with the pitch jumping wildly
up and down several times throughout the whole length of the phrase. Such
voices can easily give a listening experience where you feel the narrator is
stressing every second or third word, whether or not it is meant to be
stressed.
A third pitfall of modulation is when the voice treats a big amount of text as
one extremely long phrase. In such cases, the voice 'happily' starts out
narrating the text to you at a normal speed and pitch. But as the narration
moves on, the pitch and speed gradually fall, until it ends up sounding like
the narrator has been overworking for three days with no cup of coffee. This
can especially be the case when reading long lists - like your grocery
shopping list - with no punctuation. The result is a narration that is hard to
follow, and a frustrated listener.
Modulation analysis does not only depend on the punctuation in the text to be
spoken. It has to consider the length of each phrase, and maybe even hold the
different phrases up against each other. Further, it has to deal with line
breaks. It might even look at the length of each word, making short words read
out a bit faster than the longer ones. And if it is really professionally done,
the analysis will look out for given word combinations that should affect the
modulation in some way.
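One of those ideas - short words read a bit faster than long ones - can be sketched as a per-word speed plan. The threshold and factors are invented for illustration; real analysis would weigh far more signals.

```python
def modulation_plan(phrase):
    """Assign each word a playback speed factor: short words are read
    slightly faster than long ones. Thresholds are illustrative."""
    plan = []
    for word in phrase.replace(",", "").split():
        speed = 1.2 if len(word) <= 3 else 1.0   # short words a bit faster
        plan.append((word, speed))
    return plan

print(modulation_plan("read the long newspaper"))
```

A professional analysis would combine plans like this one with phrase length, line breaks and known word combinations before choosing the final pitch and pace.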
There are a great many ways of performing all of the above. Your software
should break the text into letter combinations, play back the corresponding
sets of tones, keep track of a huge number of pronunciation rules, and adjust
pitch and speed so as to get good modulation. All of this in real time - that
is, right at the moment. You don't want the speech telling you what was on
your screen ten minutes ago. :-)
Depending on how well designed the software of your voice is, it will handle
all of these tasks more or less quickly. If the manufacturer had a creative
crew, who came up with a few good and general pronunciation rules, your voice
might be far more responsive. If, on the other hand, the software, dictionary
and sound library of your voice are all poorly designed, your voice will tend
to be rather slow to respond. When talking about responsiveness, and
particularly when the voice is going to be used as feedback in your computer
work, we are talking fractions of a second. A visually impaired person who uses
the speech synthesizer to keep track of his typing doesn't want the voice to
give him the info half a second after a key has been pressed, or a word has
been typed. Such a lag, adding up throughout the working day, would easily mean
several minutes of 'waiting for the speech' to keep up with a good and fast
writer. That would only lead to frustration. Unfortunately, quite a number of
voices on the market have responsiveness too low for real daily usage in
environments like this. Still, in cases where you only need narration of a
pre-written text, or of text produced by a given process on the computer, even
slow-responding voices might serve well. Again, here we could mention things
like weather reports, which are pre-processed prior to being sent to the
speech synthesizer for narration.