(This is available in HTML at <http://pobox.com/~kragen/science-espeak.html>.)
So I've been playing around with speech synthesis software tonight. [eSpeak](http://espeak.sourceforge.net/) looks a lot nicer than [Festival](http://www.cstr.ed.ac.uk/projects/festival/), just in that it's much easier to adjust its speed, correct its pronunciation, and play with variations: whisper, different accents, pitch, word spacing, creaky voice.

I got to thinking, what would a logical policy for updating its lexicon look like? I thought the results I came up with were interesting. Maybe some other people will be interested too.

The problem
-----------

[eSpeak](http://espeak.sourceforge.net) gets "neuroscience" and "pseudoscience" wrong, pronouncing them with a `[[s,[EMAIL PROTECTED]` rather than a `[[s'[EMAIL PROTECTED]`. It also gets "omniscience" and "prescience" wrong, or at least pronounces them rather differently than I would:

    $ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "The science of neuroscience is not a scientific or quasiscientific pseudoscience. Conscientiously pursue omniscience and prescience."
     [EMAIL PROTECTED] s'[EMAIL PROTECTED] Vv n'3:[EMAIL PROTECTED],[EMAIL PROTECTED] I2z n,0t#@ [EMAIL PROTECTED]'IfIk _:_:O@ kw,[EMAIL PROTECTED]'IfIk sj'u:[EMAIL PROTECTED],[EMAIL PROTECTED] k,0nsI2;'[EMAIL PROTECTED] p3sj'u: '0mnIs,[EMAIL PROTECTED] _:_:and pr'i:[EMAIL PROTECTED]

I would pronounce the "science" in "omniscience" and "prescience" as [EMAIL PROTECTED] and put the accent on another syllable.

There's a special rule for "scien" beginning a word, and for "conscience":

    en_list:conscience [EMAIL PROTECTED]
    en_rules: _sc) ie (n aI@
    en_rules:?8 _sc) ie (n aIa2

However, Jonathan Duddington has said he wants to keep the eSpeak distribution small, so he "wouldn't want to include too many unusual or specialist words". (See <http://sourceforge.net/forum/forum.php?thread_id=1700280&forum_id=538920> where he talks about why he doesn't want to import the Festival lexicon.)
Already, `espeak-data/en_dict` is 80KB, which is half the size of the `speak` binary.

Replacement strategies
----------------------

There are several possible strategies that a maintainer could adopt in order to improve the coverage of their special-case word files without letting them get large. Suppose that there is a scalar metric of "goodness" that can be applied independently to each special case. Here are three plausible strategies, ordered from least to most stringent.

- C-: They could never remove items from the file, adding new items as long as they were better than the worst item in the file. This will probably cause the average quality of the entries in the file to gradually decline, because many of the most important entries were probably added early on. It will eventually result in a very large file with very low average quality per entry, but very comprehensive coverage.

- C+: They could keep the number of items in the file fixed, adding new items as long as they were better than the worst item in the file, and removing that worst item to make room. This will cause the program to gradually work better, but each new version will introduce regressions --- words that the previous version pronounced correctly, but the new one does not.

- A: They could never remove items, but add new items as long as they improved the median item quality of the file --- that is, as long as the new item improved the program's performance more than most of the items in the file. This will gradually slow down and eventually stop the addition of new items, because that median quality will gradually increase.

I am going to approximate "quality" with "frequency", on the theory that mispronouncing a rare word is always better than mispronouncing a common one.

Note the analogy between policy "A" and Google's famous hiring policy: only hiring candidates who would raise the average ability of its employees.

Evaluating word frequencies
---------------------------

Are these "science" words significant enough to include?
`en_list` only contains 2869 lines, maybe 2400 of which are words. So maybe only the top 2400 or so exceptions to the normal rules of pronunciation are currently considered for inclusion.

Some time ago, I tabulated the frequencies of words in the British National Corpus and put the results online at <http://pobox.com/~kragen/sw/wordlist>. It has 109557 lines, ordered from the most common words ("the", "of", and "and", each occurring millions of times) to the least common (with a cutoff of 5 occurrences, because most of the words with fewer were actually misspellings).

I selected 20 lines at random from `en_list` with the following results:

    [EMAIL PROTECTED]:~/pkgs/espeak-1.37-source/dictsource$ ~/bin/unsort < en_list | head -20
    this %DIs $nounf $strend $verbsf
    barbeque [EMAIL PROTECTED]@kju:
    con k0n
    ?5 thu TIR // Thursday
    _: [EMAIL PROTECTED]
    Ukraine ju:kr'eIn
    peculiar pI2kju:lI3
    unread Vnr'Ed $only
    inference [EMAIL PROTECTED]@ns
    José hoUs'eI
    unsure VnS'U@
    survey $verb
    ë $accent
    epistle [EMAIL PROTECTED]
    Munich mju:nIk
    scenic si:nIk
    synthesise [EMAIL PROTECTED]
    corps kO@ $only
    rajah rA:dZA:
    transports [EMAIL PROTECTED]|s $nounf

Where do these special cases appear in the British National Corpus tabulation? Here are some results, edited for readability. Each output line is the line number in the word list (that is, the word's frequency rank), then the number of occurrences, then the word:

    [EMAIL PROTECTED]:~/pkgs/espeak-1.37-source/dictsource$ grep -niE ' (this|barbeque|con|thu|ukraine|peculiar|unread|inference|José|unsure|survey|epistle|munich|scenic|synthesise|corps|rajah|transports)$' /home/kragen/devel/wordlist
    22:463240 this
    1178:7999 survey
    5102:1441 peculiar
    5831:1200 corps
    7165:888 ukraine
    8977:634 munich
    9045:627 unsure
    10552:494 inference
    11134:455 con
    15127:275 scenic
    29899:82 epistle
    31386:74 transports
    34270:62 synthesise
    37255:52 unread
    73679:11 thu
    74154:11 rajah
    87737:8 barbeque

The 50th percentile among the sample of 20 (of which two weren't words, and a third wasn't found) seems to be line 11134, with the word "con".
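
As a quick check on that median, here is a minimal Python sketch. It assumes the 17 ranks are copied by hand from the grep output above; the names `sample_ranks` and `word_at` are made up for illustration and are not part of eSpeak or the word list.

```python
from statistics import median

# (rank in the BNC word list, word), copied by hand from the grep output.
# Rank 1 is the most common word. These names are illustrative only.
sample_ranks = [
    (22, "this"), (1178, "survey"), (5102, "peculiar"), (5831, "corps"),
    (7165, "ukraine"), (8977, "munich"), (9045, "unsure"),
    (10552, "inference"), (11134, "con"), (15127, "scenic"),
    (29899, "epistle"), (31386, "transports"), (34270, "synthesise"),
    (37255, "unread"), (73679, "thu"), (74154, "rajah"), (87737, "barbeque"),
]

ranks = sorted(rank for rank, _ in sample_ranks)
word_at = dict(sample_ranks)

# With 17 entries the median is simply the 9th rank in sorted order.
median_rank = median(ranks)
# The largest rank belongs to the rarest word in the sample.
worst_rank = max(ranks)

print(median_rank, word_at[median_rank])
print(worst_rank, word_at[worst_rank])
```

With these 17 entries, the median comes out at rank 11134 ("con") and the rarest sampled entry at rank 87737 ("barbeque").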
That is, the exceptions in `en_list` are mostly drawn from the most frequently used eleven thousand words in the language. (Maybe words like "barbeque", "rajah", and "unread" should be dropped.)

So under the policies "C+" and "C-", any mispronounced word that is more common than "barbeque", at position 87737 in the British National Corpus tabulation (or maybe some word even a bit rarer than that), should be added to the file. (Under policy "C+", some word would be removed to compensate, raising the threshold.) Under the policy "A", the threshold would be "con", at position 11134.

Unfortunately, José is missing. I think I excluded accented characters when I tabulated the frequencies initially.

Anyway, that gives us a way to compare the "science" words:

    [EMAIL PROTECTED]:~/pkgs/espeak-1.37-source/dictsource$ grep -n 'scien[tc]' /home/kragen/devel/wordlist
    870:10597 science
    1614:5922 scientific
    2584:3547 scientists
    3865:2088 sciences
    3977:2005 scientist
    5342:1355 conscience
    13365:338 conscientious
    16976:227 scientifically
    25757:109 consciences
    26015:107 conscientiously
    27861:93 unscientific
    37040:53 omniscient
    44349:36 prescient
    49031:29 neuroscience
    49706:28 prescience
    50457:27 scientificity
    50587:27 omniscience
    53155:24 scientism
    62346:17 geoscience
    66943:14 scientia
    67285:14 neuroscientists
    68176:14 conscientiousness
    82060:9 geoscientists
    84433:8 scientology
    84434:8 scienter
    86513:8 geosciences
    90235:7 neurosciences
    93073:7 biosciences
    93074:7 bioscience
    95039:6 scientifique
    95591:6 pseudoscience
    103190:5 presciently
    103191:5 prescientific

Of these, only those at least as common as "conscience" seem to deserve a place in `en_list`. How does eSpeak do now?

    $ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "Science is scientific and done by scientists, who work in the sciences. A scientist with a conscience may be conscientious. Those with scientifically-minded consciences will conscientiously avoid unscientific claims of omniscient beings or prescient prophets."
     s'[EMAIL PROTECTED] I2z [EMAIL PROTECTED]'IfIk _:_:and d'Vn baI s'[EMAIL PROTECTED] _:_:h,u: w'3:k [EMAIL PROTECTED] s'[EMAIL PROTECTED] a2 s'[EMAIL PROTECTED] wI2D a2 k'[EMAIL PROTECTED] m'eI bi: k,0nsI2;'[EMAIL PROTECTED] DoUz wI2D [EMAIL PROTECTED]'IfIkli m'aIndI2d k'[EMAIL PROTECTED] wIl k,0nsI2;'[EMAIL PROTECTED]; a2v'OId [EMAIL PROTECTED]'IfIk kl'eImz Vv '0mnIs,[EMAIL PROTECTED] b'i:;INz _:_:O@ pr'i:[EMAIL PROTECTED] pr'0fIts

It pronounces everything correctly until it gets to "omniscient" and "prescient". Maybe its pronunciations for those are correct too, but they're not the pronunciations I would use.

Under policy "A", those words are not common enough to add to `en_list`, because they would lower the median frequency of words in `en_list` unless you removed a less common word to compensate.

Under policies "C+" and "C-", not only do "omniscient" and "prescient" qualify, but so do "neuroscience", "geoscience", and "neuroscientists", which eSpeak currently mispronounces.

(Including all the exceptions that are as rare as "prescient" might quadruple the size of `en_list`, and perhaps `en_dict` as a result, if arbitrary spellings were as common among rare words as they are among common words. Think of that as an upper bound. Including all the exceptions as rare as "neuroscientists" might multiply its size by seven. This is the downside of policy "C-", but it does not happen with policy "C+". On the other hand, under policy "C+", even "prescient" might not survive long after being added.)

Recommendation
--------------

There is a better solution than adding a bunch of one-word special cases to `en_list`. Probably in this case the solution is to change the special case for "conscience" to a special case for "conscien..." and change the "scien..." rule to a "...scien..." rule; that covers all the words except for "omniscien..." and "prescien...".
Covering those two takes only two more rules in `en_rules`, if it's considered worthwhile. For comparison, "conscience" is about ten times as common as both of those together and "con" about three times as common, while "barbeque" is about 18 times less common.

Alternatives
------------

I think there is a need for a larger `en_list` and `en_rules` to be available, even if they aren't part of the standard distribution. eSpeak's current footprint for a single language is about 160KB for the executable and 80KB for the dictionary. But it would be useful in many cases even if its dictionary were 800KB (as perhaps it would be with the Festival lexicon) or 8MB.

There is also a need for a better user interface for making changes to the dictionary, and especially to `en_rules`. Currently it's hard to know which words' pronunciations you're changing when you change `en_rules`, and you have to master a phonological orthography system to make any contribution at all. And then there's no `git`-like infrastructure for sharing your changes, and even learning `git` is a pretty big barrier to contributions.

If, instead, you could twist a knob to jog back to the last mispronounced word, then hold down a button and say its correct pronunciation, the barrier to contributions would be much lower. You would need a reasonable phonological analysis system (like in a speech-to-text system) to turn the spoken word into the string of phonemes.

Then, if you could share your accumulated corrections with all other users of the software with the push of a button, the process of coming up with the tens of thousands of special cases would be a lot quicker.
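
To make the three strategies from "Replacement strategies" a bit more concrete, here is a minimal Python sketch. It rests on some assumptions: "quality" is approximated by frequency rank in the word list (lower rank means more common, hence better), policy "A" is approximated as "better than the current median", and the function names `qualifies` and `apply_policy` are hypothetical. Nothing like this exists in eSpeak.

```python
from statistics import median

def qualifies(policy, candidate_rank, listed_ranks):
    """Would a word at candidate_rank be added under the given policy?

    Ranks are positions in a frequency-ordered word list, so a lower
    rank means a more common (better) word. Assumption: policy "A" is
    approximated as "better than the current median entry".
    """
    if policy in ("C-", "C+"):
        # Both C policies admit anything better than the worst entry.
        return candidate_rank < max(listed_ranks)
    if policy == "A":
        return candidate_rank < median(listed_ranks)
    raise ValueError(f"unknown policy: {policy!r}")

def apply_policy(policy, candidate_rank, listed_ranks):
    """Return the list of ranks after considering one candidate."""
    if not qualifies(policy, candidate_rank, listed_ranks):
        return list(listed_ranks)
    updated = list(listed_ranks)
    if policy == "C+":
        # Fixed-size file: evict the worst entry to make room.
        updated.remove(max(updated))
    updated.append(candidate_rank)
    return updated

# A few ranks from the random sample of en_list entries:
# "this", "peculiar", "con", "unread", "barbeque".
listed = [22, 5102, 11134, 37255, 87737]

for word, rank in [("conscience", 5342), ("prescient", 44349),
                   ("pseudoscience", 95591)]:
    verdicts = {p: qualifies(p, rank, listed) for p in ("C-", "C+", "A")}
    print(word, verdicts)
```

With this five-entry stand-in for `en_list`, "conscience" qualifies under all three policies, "prescient" only under "C-" and "C+", and "pseudoscience" under none, which matches the thresholds worked out above.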

