David Rowe (VK5DGR) has been doing some absolutely awesome work on open-source speech codec development (for lossy compression of speech, for example for certain radio frequency bands where available bandwidth is very limited):
<http://www.rowetel.com/ucasterisk/codec2.html>
<http://www.rowetel.com/blog/?p=132>

He's shooting for good speech quality at 2400 bits per second, which is apparently close to the best proprietary codecs. And he's using an approach fairly similar in many ways to what the rest of this post describes.

But today I ran into this truly astonishing research project at Haskins Laboratories at Yale, by Robert Remez and Philip Rubin:

<http://www.haskins.yale.edu/featured/sws/sws.html>

There is some further exploration, which unfortunately I couldn't listen to successfully, at Remez's site:

<http://www.columbia.edu/~remez/Site/Musical%20Sinewave%20Speech.html>

And there's a Wikipedia page:

<http://en.wikipedia.org/wiki/Sinewave_synthesis>

They're synthesizing comprehensible --- if slow --- English speech out of nothing more than three or four formants, realized purely as sine waves. The sound recordings are truly astonishing to listen to. They don't sound like human speech; they don't sound like synthesized speech; they sound like the whistling and squealing sounds you hear when you're trying to tune an AM radio. But you can understand them.

(And apparently they started this research in the 1970s: their Fortran code is from 1980, they put up a web page about it in 1996, and the current Matlab version of the code is from 2003. The publications on their publications page are from 1980 to 1994.)

This made me wonder how low you can really push the bandwidth of a usable speech codec. Presumably you could do k-nearest-neighbors averaging on a database of recorded speech sounds in order to get back from the sine waves to something that more closely resembles human speech. But how much bandwidth would it take to transmit the sine-wave information itself? They've put online the parameters they used to synthesize their sample utterances; one of them is at <http://www.haskins.yale.edu/featured/sws/swssentences/S7pars.html>.
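To make the idea concrete, here is a minimal sketch of what sinewave synthesis amounts to. This is not the Haskins code, and the frame layout (one (frequency, amplitude) pair per formant per 10ms frame, linearly interpolated, which is how their parameter files are described above) is the only assumption:

```python
import math

def synthesize(frames, frame_ms=10, rate=8000):
    """Render formant tracks as a sum of pure sine waves.

    frames: a list of frames, one per 10ms; each frame is a list of
    (frequency_hz, amplitude) pairs, one per formant.  Frequency and
    amplitude are linearly interpolated between consecutive frames,
    and each sine wave keeps a continuous phase across frames.
    """
    samples_per_frame = rate * frame_ms // 1000
    n_formants = len(frames[0])
    phases = [0.0] * n_formants
    out = []
    for a, b in zip(frames, frames[1:]):
        for i in range(samples_per_frame):
            t = i / samples_per_frame  # interpolation position in [0, 1)
            s = 0.0
            for k in range(n_formants):
                freq = a[k][0] + t * (b[k][0] - a[k][0])
                amp = a[k][1] + t * (b[k][1] - a[k][1])
                phases[k] += 2 * math.pi * freq / rate
                s += amp * math.sin(phases[k])
            out.append(s)
    return out

# Two frames of a single steady 440Hz formant: one 10ms segment,
# i.e. 80 samples at an 8kHz sample rate.
audio = synthesize([[(440.0, 1.0)], [(440.0, 1.0)]])
```

A real decoder would track three or four formants at once; the point is just how little state there is: per formant, one frequency, one amplitude, one phase accumulator.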
It encodes the sentence "Please say what this word is," about 19 phones, in 1.68 seconds. The text file is 15045 bytes, which isn't particularly good; that's about 72kbps. But even `gzip` can compress it to 3333 bytes, which brings it down below 16 kilobits per second. That is, even without intending to, their research comprises a speech codec that can produce comprehensible speech at 16 kilobits per second, slightly more than the 13 kilobits per second that GSM full-rate uses (although GSM produces much more realistic speech).

Beyond that, though, we'd need to discard more information. The parameters file in its current form divides the utterance into 10ms frames, with pitch and amplitude information for each formant in each frame; between frames, the pitch and amplitude are linearly interpolated. The pitch information in that file ranges from 136Hz to 4597Hz, and is already quantized, apparently to 1Hz. The amplitude information is represented to six significant figures.

Suppose that, instead, we reduced the frames to a few "keyframes" and interpolated between them with cubic splines. We probably need at least one keyframe per phone, and probably more like 1.5. That would give us about 17 keyframes per second, an improvement of a factor of 6. If that didn't affect the gzip-compressibility, that alone would get us down to about 2700 bits per second.

But there's no need to spend 13 or 16 bits per formant per keyframe on the frequency; we can almost certainly quantize the frequency logarithmically to within one semitone. The range in question is almost 61 semitones, so you only need six bits. Similarly, the amplitude probably doesn't need ten bits of precision. Probably four bits (logarithmic, maybe 2dB per step, for a total dynamic range of 32dB) would do fine. And the interval between keyframes can probably be quantized to 10ms, and range up to, say, 160ms, which would require four bits of keyframe duration.
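The quantization scheme above can be sketched in a few lines. Anchoring the semitone scale at the file's 136Hz floor, and the exact ranges of the amplitude and interval codes, are my choices for illustration, not something taken from the Haskins files:

```python
import math

F_MIN = 136.0           # lowest frequency in the S7 parameter file
SEMITONE = 2 ** (1 / 12)

def quantize_freq(hz):
    """Log-quantize a frequency to the nearest semitone above F_MIN.

    The file's 136Hz..4597Hz range spans just under 61 semitones,
    so the code always fits in 6 bits (0..63)."""
    return round(12 * math.log2(hz / F_MIN))

def dequantize_freq(code):
    """Invert quantize_freq to within half a semitone (about 3%)."""
    return F_MIN * SEMITONE ** code

def quantize_amp_db(db_below_peak):
    """4-bit amplitude in 2dB steps: 0dB down to -30dB below peak."""
    return max(0, min(15, round(-db_below_peak / 2)))

def quantize_interval_ms(ms):
    """4-bit keyframe interval in 10ms steps, 10ms up to 160ms."""
    return max(1, min(16, round(ms / 10))) - 1  # stored as 0..15
```

With these codes, a formant costs 6 + 4 = 10 bits per keyframe and the timing costs 4 bits, which is where the 44-bit keyframe below comes from.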
So a keyframe consisting of four bits of timing and four formants, each with ten bits of pitch and amplitude information, would occupy a total of 44 bits, for a total of about 750 bits per second, or 210 bits per word in this case (since you'd need fewer keyframes when the speech was slower). That's about five times worse than ASCII text.

Also, in many of the frames, not all of the formants were present; out of the 169 frames in that file, there were on average 2.5 formants present. If the average number of formants were the same for keyframes, and you used two bits per keyframe to indicate the number of formants, the average keyframe size would fall from 44 bits to 4+2+25 = 31 bits, and the total bit rate would fall to about 530 bits per second. But in practice, you would probably tend to choose fewer keyframes in segments with fewer formants. (This would also reduce the bit rate for silence to 6 bits per keyframe, at a little over 6 keyframes per second: almost 38 bits per second.)

If you had some kind of entropy coding or linear-predictive-plus-quantized-residuals coding for the keyframes, you might be able to do still better; in essence, you could take advantage of the kinds of redundancies that phonotactics enforces --- consonants tend to alternate with vowels, for example.

How to choose keyframes? A simple greedy approach would be to start with 100 frames per second, and then iteratively remove the frame whose removal would produce the smallest error, until removing further frames would generate unacceptable levels of error. Another simple greedy approach would be to start with no frames, and then iteratively add keyframes at the point where the interpolated spectrogram was farthest from the real signal, until the spectrogram was close enough. Neither of these approaches to keyframe selection can work well in real time.
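The first greedy approach can be sketched like this, for a single parameter track. Measuring the error only at the removed point, against its two surviving neighbors, is a simplification of mine; the text's version would compare interpolated and real spectrograms:

```python
def thin_keyframes(track, max_error):
    """Greedy keyframe removal for one parameter track.

    track: list of (time, value) points, initially one per 10ms frame.
    Repeatedly remove the interior point whose removal (with linear
    interpolation across the resulting gap) introduces the least
    error, until any further removal would exceed max_error.
    """
    pts = list(track)

    def removal_error(i):
        (t0, v0), (t, v), (t1, v1) = pts[i - 1], pts[i], pts[i + 1]
        interp = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        return abs(interp - v)

    while len(pts) > 2:
        i = min(range(1, len(pts) - 1), key=removal_error)
        if removal_error(i) > max_error:
            break
        del pts[i]
    return pts

# A straight-line ramp collapses to its two endpoints, since every
# interior point is perfectly reconstructed by interpolation:
ramp = [(t, 2.0 * t) for t in range(10)]
```

The second greedy approach is the mirror image: start from the endpoints and repeatedly insert the point where the interpolation error is largest.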
A simple approach that would probably work well in real time would be to maintain two "possible keyframes", at the present moment and just before it, and whenever the spectrogram interpolated from the last emitted keyframes to the "possible keyframes" becomes too far from the real signal, emit a keyframe at the point in the recent past where the error is greatest. All of these approaches, of course, have to be adjusted not to exceed the maximum representable keyframe interval.

Other tweaks to try:

- Add a bit per keyframe to indicate that the spectrogram has a discontinuous break at that point, rather than interpolating. (This could avoid transmitting as many as three more closely spaced keyframes.)

- Add a bit per keyframe to indicate the presence or absence of voicing, as almost all vocoder algorithms do.

- More generally, add two or three bits per keyframe to indicate the average bandwidth of the formants.

- Transmit parameters for a voice model toward the beginning of the connection, or periodically throughout the recording, in parallel with the formant frequency data, so that the synthesized voice can sound at least vaguely like the speaker instead of like someone else. If you had a perfect model of the range of variation of human voices, 36 bits would be enough to uniquely specify the voice of any person who's ever lived, and another 26 bits would be enough to specify a minute of their life. How close to that can you get with some kind of parametric model? Can you come up with a model that describes the unique timbre of a person's vocal tract in a small number of ruthlessly quantized coefficients, say, 25 dimensions of four bits each?

- Since the formants can be constrained to be transmitted in sorted order, transmit formant frequencies as intervals (ratios) from the previous formant's frequency rather than independently. This could reduce the size of each frequency transmitted from 6 bits to 5 or 4.

- Nonuniform encoding for the per-frame formant parameters.
This would probably require some pretty heavy-duty psychoacoustic research to validate (and someone has probably already done it), but perhaps, say, there is less tolerance for error in the interval between two formants when they are close together, because the difference between a perfect fifth and a perfect fourth is more audible than the difference between a 16:3 and an 18:3 interval --- which are a perfect fourth and a perfect fifth plus two octaves, respectively. Or perhaps amplitude variation is more important at high frequencies.

- Update only the higher-frequency formants in some keyframes. The frequency and amplitude of a 200Hz formant can't change very rapidly; in a 10ms frame you only get two full cycles! So if you're looking at 10ms, unless I'm confused about the math, your first few discrete Fourier transform coefficients are DC, 100Hz, 200Hz, and 300Hz. So you can't detect even fairly large shifts in its frequency --- if it were to drop or rise by a whole fifth, seven semitones, you wouldn't even notice until you looked at a longer period of time. On the other hand, if a 4000Hz formant drops to 3900Hz --- less than half a semitone --- you could detect that in the DFT of those 10ms. Presumably similar constraints apply to your ear: you can't detect whether a 200Hz signal jumps to 216Hz over a 10ms period; you need a longer period of time. So you could emit updates for the high-frequency formants more frequently. This would add a couple of bits per keyframe (to indicate which formants were being updated), but most keyframes would then contain only one formant.

Klatt 1987 reports that the Speak & Spell stored speech at about 1000 bits per second, using linear predictive coding:

<http://americanhistory.si.edu/archives/speechsynthesis/dk_749.htm>

Dan Ellis, who wrote the current Matlab version on the Haskins Lab site, talks about the connection with LPC vocoders:

<http://labrosa.ee.columbia.edu/matlab/sws/>

--
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol

