Re: [Asterisk-Users] Poll - Would you pay $30-$50 for high quality speech synthesis?

Steve Underwood Tue, 15 Jul 2003 22:12:39 -0700

Jeff Noxon wrote:

Many of you are familiar with how lousy Festival sounds.

AT&T has a product, NaturalVoices, that sounds much better.  There are
male & female voice fonts for US/UK/Indian English, French, Spanish,
and German.

I am considering offering a linux-based text-to-speech engine based on
the NaturalVoices runtime.  An asterisk module would also be provided,
making it easy to add natural sounding synthesis to Asterisk applications.
You could also use it for other purposes, such as home automation.

After discussing royalties with AT&T, I have concluded that I can probably
offer such a product at the following prices:

Runtime - $30 intro price with one voice font & one processor
Extra voices/languages - $15 each
Extra processors - $15 each

Depending on demand, the price may rise to $50 at some point.  The lower
the demand, the higher the price, due to AT&T's royalty structure.

Maybe you are right, but take great care with this.

You can get packaged versions of Natural Voices cheaply for desktop applications. However, when you want to use it for telephony systems it usually costs more like $600-$700 per port. There are also big differences in the way ports are counted by different vendors. For example, the per port pricing for RealSpeak (which is not related to Naturally Speaking) and Speechify (SpeechWorks' derivative of Naturally Speaking) is not too different, but the final bill may be. With RealSpeak, if you have 1000 ports, and only use TTS a little, you still pay for 1000 ports. With Speechify you pay for the maximum number of concurrent channels you will have speaking at any instant. Unless your system is very TTS heavy, this makes a huge difference.
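
To make the port-counting difference concrete, here is a rough sketch of the arithmetic. The $650 per-port price is just the midpoint of the $600-$700 range above, and the peak of 50 channels speaking at once is an invented figure for illustration; neither is a vendor quote, and real price lists will differ.

```python
# Illustrative only: the price and channel counts below are assumptions,
# not figures from any vendor's price list.
PRICE_PER_PORT = 650  # assumed midpoint of the $600-$700 per-port range


def cost_per_installed_port(installed_ports, price=PRICE_PER_PORT):
    """Licensing model where every installed port is billed, used or not."""
    return installed_ports * price


def cost_per_peak_channel(peak_speaking_channels, price=PRICE_PER_PORT):
    """Licensing model where only the peak number of channels
    speaking at the same instant is billed."""
    return peak_speaking_channels * price


installed = 1000   # ports in the system (the example size used above)
peak_tts = 50      # hypothetical peak of simultaneous TTS channels

all_ports = cost_per_installed_port(installed)
peak_only = cost_per_peak_channel(peak_tts)

print(f"Billed on installed ports: ${all_ports:,}")
print(f"Billed on peak channels:   ${peak_only:,}")
print(f"Ratio: {all_ports / peak_only:.0f}x")
```

With similar per-port prices, the bill scales with whichever count the vendor uses, so a lightly used TTS feature on a large system is where the two models diverge most.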

I last worked heavily with these TTS engines about two years ago. They have improved, but I don't think by that much. Speechify was a lot more functional than Naturally Speaking, as its front end language processing was more complete. Naturally Speaking read too many things in the wrong way (a lot of other TTSs did too). The various Naturally Speaking derivatives are not all equal. Naturally Speaking is itself a derivative of Festival. Look in the directories, and you still see lots of Festival files. Cepstral and Rhetorical Systems both have impressive sounding TTS based on Festival. Festival seems to be the root of most things other than RealSpeak and Eloquence. Eloquence seems pretty much the only mainstream package which does things differently, and actually synthesizes voice from basic principles.

Two years ago we deployed systems using RealSpeak, Speechify and Eloquence. People hated the robotic quality of Eloquence, but could understand it clearly (at least the English one - the Mandarin version sounded terrible). People liked the natural sound of RealSpeak, but couldn't understand it very well - they could follow paragraphs of text OK, but ask them about a specific thing that was said, like a street name, and their accuracy was very poor. Speechify was somewhere in between, but tending towards RealSpeak. In the end, adverse user reaction made us rip out all the TTS and abandon attempts to use it.

Some pointers from working with this stuff:

- First impressions are a bad indicator of true quality, due to the next point. You need to play with these things for a while, and see how they behave in real world use, before you can really evaluate their usefulness.

- In current TTS systems (all of them), natural sounding tends to equate with hard to understand. Most TTS systems basically use a database of recorded snippets, and blend them to form speech. The longer the snippets, the smoother and more natural the sound, but the worse its accuracy. Short snippets allow more flexibility in sculpting the result, giving better intelligibility, but making the sound more robotic.

- If your application is reading long tracts of text, the natural sounding TTSs do fairly well. The words you don't hear clearly are naturally filled in by your brain from the context. If your application is reading out addresses, the more robotic systems do better - I found Eloquence does the best for this.

- Don't underestimate the importance of the front end language processor. Most offerings deal with this part poorly. They all have demos that show how well the system will read things like currency and dates. Try feeding those texts to other vendors' TTS engines. The results can be quite interesting. The demos only contain examples of things the particular engine does well, and they have all focussed on getting different things right.

- You put together a system you think is really neat. Users initially think it is pretty neat too. Then those same users gradually abandon the system as they find its limitations.

- Watch out for resource usage. You might expect these things to hog the CPU. They don't. However, they take hundreds of megs of disk (OK), and some (like Naturally Speaking and Speechify) needed it all in RAM at once to work well (not so OK). So, you had to allow more than 200MB of RAM per voice. This may have been improved in newer versions of Speechify, but I don't think Naturally Speaking has changed much in that time.
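
As a back-of-envelope check on the RAM point, the sizing can be sketched as below. It takes the 200MB per voice figure as a lower bound and assumes each loaded voice needs its own copy of its data; the voice count and the allowance for the OS and application are hypothetical inputs, not measurements.

```python
# Rough sizing sketch. MB_PER_VOICE is the "more than 200MB per voice"
# figure from the text, treated as a lower bound. Everything else is a
# hypothetical input chosen for illustration.
MB_PER_VOICE = 200


def min_ram_mb(num_voices, other_mb=0):
    """Lower bound on RAM (in MB) for holding every voice database
    in memory at once, plus an allowance for everything else."""
    return num_voices * MB_PER_VOICE + other_mb


# Example: four voices (say, male and female in two languages) with a
# guessed 512MB allowance for the OS and the telephony application.
needed = min_ram_mb(4, other_mb=512)
print(f"Allow at least {needed}MB of RAM")
```

Since the per-voice figure is a floor rather than a measured value, treat the result as the minimum to plan for, not a target.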

Regards,
Steve

