David Rowe (VK5DGR) has been doing some absolutely awesome work on
open-source speech codec development (for lossy compression of speech,
for example for certain radio frequency bands where available bandwidth
is very limited):

<http://www.rowetel.com/ucasterisk/codec2.html>
<http://www.rowetel.com/blog/?p=132>

He's shooting for good speech quality at 2400 bits per second, which is
apparently close to the best proprietary codecs. And he's using an
approach fairly similar in many ways to what the rest of this post
describes.

But today I ran into this truly astonishing research project at Haskins
Laboratories at Yale, by Robert Remez and Philip Rubin:

<http://www.haskins.yale.edu/featured/sws/sws.html>

There is some further exploration, which unfortunately I couldn't listen
to successfully, at Remez's site:

<http://www.columbia.edu/~remez/Site/Musical%20Sinewave%20Speech.html>

And there's a Wikipedia page:

<http://en.wikipedia.org/wiki/Sinewave_synthesis>

They're synthesizing comprehensible --- if slow --- English speech out
of nothing more than three or four formants, realized purely as sine
waves. The sound recordings are truly astonishing to listen to. They
don't sound like human speech; they don't sound like synthesized speech;
they sound like the whistling and squealing sounds when you're trying to
tune an AM radio. But you can understand them.

(And apparently they started this research in the 1970s, their Fortran
code is from 1980, they put up a web page about it in 1996, and the
current Matlab version of the code is from 2003. The publications on
their publications page are from 1980 to 1994.)

This made me wonder how low you can really push the bandwidth of a
usable speech codec. Presumably you could do k-nearest-neighbors
averaging on a database of recorded speech sounds in order to get back
from the sine waves to something that more closely resembles human
speech. But how much bandwidth would it take to transmit the sine-wave
information?

They've put online the parameters they used to synthesize their sample
utterances; one of them is at
<http://www.haskins.yale.edu/featured/sws/swssentences/S7pars.html>. It
encodes the sentence "Please say what this word is," about 19 phones, in
1.68 seconds.  The text file is 15045 bytes, which isn't particularly
good; that's about 72kbps. But even `gzip` can compress it to 3333
bytes, which brings it down below 16 kilobits per second.
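The bitrate arithmetic, as a quick sanity check (sizes and duration from above):

```python
# Back-of-envelope bitrates for the S7 parameter file.
raw_bytes = 15045    # size of the text parameter file
gzip_bytes = 3333    # size after gzip compression
duration_s = 1.68    # length of the utterance

raw_bps = raw_bytes * 8 / duration_s
gzip_bps = gzip_bytes * 8 / duration_s
print(round(raw_bps), round(gzip_bps))  # 71643 15871: ~72kbps raw, ~16kbps gzipped
```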

That is, even without intending to, their research comprises a speech
codec that can produce comprehensible speech at 16 kilobits per second,
slightly more than GSM uses (although GSM produces very realistic
speech).

Beyond that, though, we'd need to discard more information. The
parameters file in its current form divides the utterance up into 10ms
frames, with frequency and amplitude information for each formant in
each frame; between frames, the frequency and amplitude are linearly
interpolated.  The formant frequencies in that file range from 136Hz to
4597Hz, and are already quantized to, apparently, 1Hz.  The amplitude
information is represented to six significant figures.

Suppose that, instead, we reduced the frames to a few "keyframes" and
interpolated between them with cubic splines. We probably need at least
one keyframe per phone, and probably 1.5 or so. That would give us about
17 keyframes per second, which is an improvement of a factor of 6. If
that didn't affect its gzip-compressibility, that alone would get us
down to 2700 bits per second.
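The keyframe-rate arithmetic behind that estimate:

```python
phones = 19                   # phones in "Please say what this word is"
duration_s = 1.68
keyframes_per_phone = 1.5

kf_per_s = phones * keyframes_per_phone / duration_s
print(round(kf_per_s))        # 17 keyframes per second
print(round(100 / kf_per_s))  # 6: roughly a sixfold reduction from 100 frames/s
```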

But there's no need to spend 13 or 16 bits per formant per keyframe on
the frequency; we can almost certainly quantize the frequency
logarithmically to within one semitone. The range in question is almost
61 semitones, so you only need six bits.
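A sketch of that logarithmic quantization; the 136Hz floor is simply the lowest frequency in the file:

```python
import math

F_MIN = 136.0    # lowest formant frequency in the S7 file, Hz
F_MAX = 4597.0   # highest

def hz_to_semitones(f):
    """Quantize a frequency to the nearest semitone step above F_MIN."""
    return round(12 * math.log2(f / F_MIN))

span = hz_to_semitones(F_MAX)
print(span)                            # 61 semitone steps
print(math.ceil(math.log2(span + 1)))  # 6 bits suffice
```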

Similarly, the amplitude probably doesn't need ten bits of precision.
Probably four bits (logarithmic; maybe 2dB each, for a total dynamic
range of 32dB) would do fine.

And the interval between keyframes can probably be quantized to 10ms,
and range up to, say, 160ms, which would require four bits of keyframe
duration.

So a keyframe consisting of four bits of timing, and four formants each
with ten bits of frequency and amplitude information, would occupy a
total of 44 bits, for a total of about 750 bits per second, or 210 bits
word in this case (since you'd need fewer keyframes when the speech was
slower).  That's about five times worse than ASCII text.
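The bit budget for that keyframe layout works out as follows:

```python
timing_bits = 4
formants = 4
freq_bits, amp_bits = 6, 4

kf_bits = timing_bits + formants * (freq_bits + amp_bits)
print(kf_bits)                         # 44 bits per keyframe
print(kf_bits * 17)                    # 748, about 750 bits/s at 17 keyframes/s
print(round(kf_bits * 17 * 1.68 / 6))  # ~210 bits per word for this six-word sentence
```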

Also, in many of the frames, not all of the formants were present; the
169 frames in that file contain an average of 2.5 formants each. If the
average number of formants were the same for keyframes,
and you used two bits per frame to indicate the number of formants, the
average frame size would fall from 44 bits to 4+2+25 = 31 bits, and the
total bit rate would fall to 530 bits per second. But in practice, you
would probably tend to choose fewer keyframes in segments with fewer
formants. (This would also reduce the bit rate for silence to 6 bits per
keyframe, a little over 6 times per second: almost 38 bits per second.)
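And the variable-formant-count version of the budget:

```python
avg_formants = 2.5
kf_bits = 4 + 2 + avg_formants * 10  # timing + formant count + formant data
print(kf_bits)                       # 31.0 bits per average keyframe
print(round(kf_bits * 17))           # 527, about 530 bits per second
print((4 + 2) / 0.160)               # 37.5 bits per second of pure silence
```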

If you had some kind of entropy or linear-predictive-plus-quantized-
residuals coding for the keyframes, you might be able to do still
better; in essence, you could take advantage of the kinds of
redundancies that phonotactics enforces --- consonants tend to alternate
with vowels, for example.

How to choose keyframes? A simple greedy approach would be to start with
100 frames per second, and then iteratively remove the frame that would
produce the smallest error, until removing further frames would generate
unacceptable levels of error. Another simple greedy approach would be to
start with no frames, and then iteratively add keyframes at the point
where the interpolated spectrograms were the farthest from the real
signal, until the spectrogram was close enough.
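The first greedy approach might look like this sketch; for brevity it uses linear interpolation in place of cubic splines, a single made-up formant track in place of a spectrogram, and assumes numpy:

```python
import numpy as np

def greedy_thin(t, x, max_err):
    """Greedy frame removal: repeatedly drop the frame whose removal
    increases the interpolation error the least, until no frame can be
    dropped without exceeding max_err."""
    keep = list(range(len(t)))
    while len(keep) > 2:
        best_j, best_err = None, None
        for j in range(1, len(keep) - 1):
            trial = keep[:j] + keep[j + 1:]          # try dropping frame keep[j]
            approx = np.interp(t, t[trial], x[trial])
            err = np.abs(approx - x).max()
            if best_err is None or err < best_err:
                best_j, best_err = j, err
        if best_err > max_err:
            break
        keep.pop(best_j)
    return keep

t = np.arange(0.0, 1.0, 0.01)   # 100 frames at 10ms
x = np.sin(2 * np.pi * t)       # a made-up, smoothly varying formant track
keys = greedy_thin(t, x, max_err=0.05)
print(len(keys))                # far fewer than 100 keyframes survive
```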

Neither of those approaches to keyframe selection can work well in
real-time. A simple approach that would probably work well in real-time
would be to maintain two "possible keyframes" at the present moment and
just before, and whenever the spectrogram interpolated from the last
emitted keyframes to the "possible keyframes" becomes too far from the
real signal, emit a keyframe at the point in the recent past where the
error is greatest.
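The streaming variant might look something like this; again linear interpolation stands in for splines, and the buffering scheme is my reading of the heuristic:

```python
import math

def online_keyframes(frames, max_err):
    """Streaming sketch: keep extending the current segment; when linear
    interpolation from the last emitted keyframe to the newest frame no
    longer fits the buffered frames, emit a keyframe at the worst-fitting
    point in the recent past."""
    out, start = [0], 0
    for i in range(2, len(frames)):
        worst_j, worst_e = None, 0.0
        for j in range(start + 1, i):
            frac = (j - start) / (i - start)
            pred = frames[start] + frac * (frames[i] - frames[start])
            e = abs(pred - frames[j])
            if e > worst_e:
                worst_j, worst_e = j, e
        if worst_j is not None and worst_e > max_err:
            out.append(worst_j)   # emit a keyframe at the worst recent point
            start = worst_j
    out.append(len(frames) - 1)
    return out

frames = [math.sin(2 * math.pi * k / 100) for k in range(100)]
print(len(online_keyframes(frames, 0.05)))  # a handful of keyframes, not 100
```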

All of these approaches, of course, have to be adjusted to not exceed
the maximum representable keyframe interval.

Other tweaks to try:

- Add a bit per keyframe to indicate that the spectrogram has a
  discontinuous break at that point, rather than interpolating. (This
  could avoid transmitting as many as three more closely spaced
  keyframes.)
- Add a bit per keyframe to indicate the presence or absence of voicing,
  as almost all vocoder algorithms do.
- More generally, add two or three bits per keyframe to indicate the
  average bandwidth of the formants.
- Transmit parameters for a voice model toward the beginning of the
  connection or periodically throughout the recording, in parallel with
  the formant frequency data, so that the synthesized voice can sound at
  least vaguely like the speaker instead of like someone else. If you
  had a perfect model of the range of variation of human voices, 36 bits
  would be enough to uniquely specify the voice of any person who's ever
  lived, and another 26 bits would be enough to specify a minute of
  their life. How close to that can you get with some kind of
  parametric model? Can you come up with a model that describes the
  unique timbre of a person's vocal tract in a small number of
  ruthlessly quantized coefficients, say, 25 dimensions of four bits
  each?
- Since the formants can be constrained to be transmitted in sorted
  order, transmit formant frequencies as intervals (ratios) from the
  previous formant's frequency rather than independently. This could
  reduce the size of each frequency transmitted from 6 bits to 5 or 4.
- Nonuniform encoding for the per-frame formant parameters. This would
  probably require some pretty heavy-duty psychoacoustic research to
  validate (and someone has probably already done it), but perhaps, say,
  there is less tolerance for error in the interval between two formants
  when they are close together, because the difference between a perfect
  fifth and a perfect fourth is more audible than the difference between
  a 16:3 and an 18:3 interval --- which are a perfect fourth and fifth
  plus two octaves. Or perhaps amplitude variation is more important at
  high frequencies.
- Update only the higher-frequency formants in some keyframes. The
  frequency and amplitude of a 200Hz formant can't change very rapidly.
  In a 10ms frame you only get two full cycles! So if you're looking at
  10ms, unless I'm confused about the math, your first few discrete
  Fourier transform coefficients are DC, 100Hz, 200Hz, and 300Hz. So you
  can't detect even fairly large shifts in its frequency --- if it were
  to drop or rise by a whole fifth, seven semitones, you wouldn't even
  notice until you're looking at a longer period of time.  On the other
  hand, if a 4000Hz formant drops to 3900Hz --- less than half a
  semitone --- you could detect that in the DFT of those 10ms.
  Presumably similar constraints apply to your ear: you can't detect if
  a 200Hz signal jumps to 216Hz over a 10ms period; you need a longer
  period of time. So you could emit updates for the high-frequency
  formants more frequently.  This would add a couple of bits per
  keyframe (to indicate which formants were being updated), but most
  keyframes would only contain one formant.
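The frequency-resolution argument in the last point can be checked directly; numpy and an assumed 8kHz sample rate:

```python
import numpy as np

fs = 8000               # assumed sample rate, Hz
n = int(0.010 * fs)     # a 10ms window is only 80 samples
t = np.arange(n) / fs

# DFT bins of a 10ms window are 100Hz apart: DC, 100, 200, 300Hz, ...
for f in (200.0, 216.0):  # the ~1.3-semitone jump mentioned above
    spec = np.abs(np.fft.rfft(np.sin(2 * np.pi * f * t)))
    print(spec.argmax() * fs / n)  # 200.0 both times: the jump lands in the same bin
```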

Klatt 1987 reports that the Speak & Spell stored about 1000 bits per
second of speech, using linear predictive coding:

<http://americanhistory.si.edu/archives/speechsynthesis/dk_749.htm>

Dan Ellis, who wrote the current Matlab version on the Haskins Lab site,
talks about the connection with LPC vocoders:

<http://labrosa.ee.columbia.edu/matlab/sws/>
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol