Wow! Thanks for all of the well-reasoned responses. I completely agree
with Chris Long that there is much about human communication that is lost
once it is stripped down to a text transcription or robophone equivalent
(just as there is when you can't see the speaker and have to rely on only
their voice). Nonetheless, there are times when the message content itself
is valuable; for example, an SMS message often tells me all I need to know
(and might be all I can get at the time).
In the application I am looking at, doing speech-to-text at one end and
text-to-speech at the other would be fine in principle. A practical
limitation is that continuous, large-vocabulary, free-form speech
recognition seems to require a lot of memory and processor time, more than
most embedded systems can handle. The most promising option I have found
is PocketSphinx, which reportedly works very well with a small vocabulary
(20 to 40 words) and reasonably well with a 1000-word vocabulary (using
maybe 20 MB of memory). I can think of a lot of applications where it
would be useful, but a 1000-word vocabulary is far too limiting for
general communication.
Using phonetics, as Rick van Rein suggested, is a really appealing idea. Not
only does it avoid the "pain" of dealing with English spelling, but it also
avoids the resource-hungry dictionary search to find matching English words
and should also work for some other (non-tonal) languages without extra
effort. I think the general idea is to find a compact representation for
the essential sounds of speech. On the other hand, I don't think that is
fundamentally different from what codec2 does with LSPs (line spectral
pairs). It boils
down to deciding what the "essential sounds of speech" are and how they
should be represented (phonemes, LSPs or something else). I think we could
all agree that it is essential that the speech be intelligible. Then there
is a tradeoff between making it sound more natural and using more
bandwidth, reducing latency, etc. Many codecs support a range of quality
options (as with the 1400 and 2400 bps options for codec2), but different
codecs work well over different quality ranges. Codec2 and MP3 both do
their intended jobs well, but their applications barely overlap.
My thought is to investigate how far codec2 can be stretched into the
extremely-low bandwidth realm. We may find that phonetics or something
else works better there, just as codec2 works better than MP3 for
low-bitrate speech.
My goal is to encode as much speech as possible into a single "packet"
(think SMS) in such a way that it can be understood by a listener at the
other end. Latency is not a primary concern, so delta coding (as suggested
by Bruce Perens) is certainly an option. FEC is not relevant for me; the
transport will take care of that. I suppose my next step might be to
experiment with making the codec2 playback routines ignore some of the
encoded data to determine how important it is to intelligibility. Then the
data encoding could be modified to omit that data, providing another rate
option in addition to the existing 2400 and 1400 bps options. Then we could
consider the potential benefits of delta coding, VBR, div/mod, etc. to
compact the data even more.
Steve
_______________________________________________
Freetel-codec2 mailing list
Freetel-codec2@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freetel-codec2