On Wed, 28 Jun 2017 10:48:47 -0500 Richard Owlett <[email protected]> wrote:
>On 06/28/2017 09:54 AM, Larry Brigman wrote:
>> Human voice frequency range tops out at 8khz. Normal speech is
>> around 2-3khz.
><chuckle> That's the theory that's been around "forever".
>I'm in possession of a factoid that prompts me to do some research
>needing high resolution at high sample rates.

There is a lot to say about acoustic phonetics, and I'm not sure where to start. It is correct that most human speech tops out at about 3 kHz, but the lower limit mentioned above is too high. In fact, an adult male with a large head and vocal tract can produce speech sounds as low as 60 Hz (e.g., a basso profundo). Most males start at about 85 Hz. For vowels, sounds above 3 kHz are actually just harmonics, which decrease in volume the higher they go; the harmonics normally contribute little to the perception of vowels. Some consonants, however, use much higher frequencies, notably the stridents, of which English has an embarrassment of riches.

Every human language has a unique phonetic inventory, that is, its own set of individual sounds. English has 41-43 phonemes (depending on your dialect), and many of them have two or more allophones. Of these, 10-11 (again, depending on dialect) are vowels, plus there are a handful of diphthongs. Each sound has a specific set of frequencies, plus there are other cues that hearers pick up, e.g., the length of the sound.

Now I address the issue of frequency, starting with the vowels. When you utter a vowel you actually produce three frequencies (called formants) simultaneously. The lower two are the critical ones; the upper one could be considered a kind of checksum. The formants for the vowel [i] (as in 'beet') average around 280, 2250, and 2900 Hz, whereas for the vowel [ɪ] (as in 'bit') they are around 400, 1900, and 2550 Hz. Now here is the crucial point: it is the distance between the two lower formants that makes our brains think 'oh, I just heard an [i]' or 'I just heard an [ɪ]'.

Why is this important? Because every human has a different 'fundamental frequency', determined mostly by the size of the vocal tract. Just as your 6th grade science teacher demonstrated by pinging the sides of glasses filled with different levels of water, the larger the volume of air, the lower the frequency produced. Men tend to have larger vocal tracts than women, and thus a lower fundamental frequency. If our perception of vowels were determined by absolute frequencies alone, we wouldn't be able to understand anything. But the system works because the distance between the lower two formants is the same whether the speaker has a high or a low fundamental frequency. The numbers I gave above for [i] are actually an average; for a man they might be 120, 2090, and 2730 Hz, whereas for a woman they might be 380, 2350, and 2990 Hz. Note that for both speakers the difference between the lower two formants is still 1970 Hz.

'Speaker normalization' is the term phoneticians use for an amazing feature of the human brain: the instantaneous, unconscious ability to perceive the fundamental frequency of a speaker the moment they open their mouth and utter the first couple of sounds, even if you have never heard the speaker before.
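To make the formant arithmetic concrete, here is a minimal Python sketch using only the ballpark numbers quoted above. The tiny 'classifier' keyed on the F2 - F1 distance is my own illustration of the idea, not a real phonetics tool, and the table holds just the two vowels discussed.

# Reference F2 - F1 distances in Hz, from the averages given above.
VOWEL_DISTANCES = {
    "i (beet)": 2250 - 280,   # 1970 Hz
    "ɪ (bit)":  1900 - 400,   # 1500 Hz
}

def identify_vowel(f1, f2):
    """Guess the vowel from the spacing of the two lower formants."""
    distance = f2 - f1
    return min(VOWEL_DISTANCES, key=lambda v: abs(VOWEL_DISTANCES[v] - distance))

# The same vowel [i] from speakers with very different fundamentals:
SPEAKERS = {
    "average": (280, 2250, 2900),
    "man":     (120, 2090, 2730),
    "woman":   (380, 2350, 2990),
}
for speaker, (f1, f2, f3) in SPEAKERS.items():
    print(f"{speaker}: F2-F1 = {f2 - f1} Hz -> {identify_vowel(f1, f2)}")
# All three report 1970 Hz and come out as [i]. That is the
# speaker-normalization point: the spacing of the lower formants,
# not their absolute values, decides which vowel you hear.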
Now I turn to consonants. Consonants also have formants, but here the upper formants are the most important ones, and they can be much higher, even higher than 3 kHz. For example, the upper formant for [s] (as in 'hiss') ranges from around 4900 to 6000 Hz, depending on the speaker's fundamental frequency and the vowel(s) that precede or follow it - which leads me to the problems with telephony.

A long time ago, when telephone systems were first being developed, the telephone companies decided, for purely economic reasons, to limit the bandwidth their equipment could carry and reproduce to 300-3400 Hz. (Those figures are the present-day standard; in the beginning they weren't even that generous.) Equipment that could handle a wider range would have been massively more expensive. Unfortunately, this produces the famous expressions 's as in Sam' and 'f as in Frank': the equipment doesn't go high enough to reproduce the upper formant of [s], making it impossible to distinguish from [f] on a telephone.
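To see what that band limit does, here is a rough Python sketch using numpy and scipy. Pure sine tones stand in for formants, and the Butterworth bandpass is my own stand-in for a telephone channel, not how any real phone codec works; only the 300-3400 Hz figures come from the discussion above.

import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000                       # sample rate, Hz
t = np.arange(fs) / fs           # one second of samples

low_formant = np.sin(2 * np.pi * 1000 * t)   # well inside the phone band
s_formant   = np.sin(2 * np.pi * 5500 * t)   # ~[s] upper formant region

# 300-3400 Hz bandpass, roughly the standard telephone channel.
sos = butter(6, [300, 3400], btype="bandpass", fs=fs, output="sos")

def rms(x):
    return np.sqrt(np.mean(x ** 2))

for name, sig in [("1000 Hz (vowel-ish)", low_formant),
                  ("5500 Hz ([s]-ish)", s_formant)]:
    out = sosfilt(sos, sig)
    print(f"{name}: RMS {rms(sig):.3f} -> {rms(out):.3f} after phone band")
# The 1000 Hz tone passes nearly unchanged; the 5500 Hz tone is almost
# entirely removed - which is why [s] and [f] collapse on the telephone.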
Having said all of that, there is a lot more to human speech recognition than equipment with adequate bandwidth. Our brains juggle so much input so rapidly that we have to use shortcuts. Let me give you just one example: if you hear an article (a, an, the), your brain knows that an article always introduces a noun phrase, so the next word absolutely must be a noun, a nominal modifier, or an intensifier. If you speak a language, every word in your lexicon is flagged for the categories it can serve as. This means that as you try to decipher the next word you are hearing, you can discard a vast amount of your lexicon as impossible.
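Here is a toy Python illustration of that flagged-lexicon shortcut. The word list and category tags are invented for the example; the point is just how much of the lexicon one article rules out before any acoustic matching has to happen.

# A tiny lexicon where each word is flagged with its possible categories.
LEXICON = {
    "dog":    {"noun"},
    "run":    {"verb", "noun"},
    "red":    {"adjective"},      # nominal modifier
    "very":   {"intensifier"},
    "slowly": {"adverb"},
    "the":    {"article"},
}

# Categories that may legally follow an article inside a noun phrase.
CAN_FOLLOW_ARTICLE = {"noun", "adjective", "intensifier"}

def candidates_after_article(lexicon):
    """Keep only words whose flags allow them right after 'a'/'an'/'the'."""
    return {w for w, cats in lexicon.items() if cats & CAN_FOLLOW_ARTICLE}

print(candidates_after_article(LEXICON))
# A set containing 'dog', 'run', 'red', and 'very'; 'slowly' and 'the'
# are discarded as impossible continuations of the noun phrase.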