Re: [Asterisk-Users] Text to Speech - Someone needs to do this
People working on this have found that context influences the pronunciation of words. I think the root cause is that the human vocal tract cannot reshape itself instantly for different sounds; it must move from the previous sound to the next, and we hear that movement. If the change is instantaneous, we hear it as unnatural, robot-like speech. Your proposed system would sound just like what it is: a sequence of words. Good systems look not only at phonetic context but also at inflection: tone, volume, pitch range, and speed.

Cursive handwriting is this way too. Cursive fonts don't look like real handwriting because each letter is always the same.

--- Matthew John Darnell [EMAIL PROTECTED] wrote:
> Why hasn't someone found 50 people who sound alike, put them in sound studios, and recorded the 10,000 most commonly used words? You would record all the different forms of the 1,000 most common words, i.e. leading, trailing, question, etc. You can synthesize the other 0.05% when you run into them. With hard drives so big, processors so fast, and EXT3 able to handle 30,000+ files in a single directory, that seems like the way to do it. You could sell it for BIG bucks.
> -Matt

=
Chris Albertson
Home: 310-376-1029 [EMAIL PROTECTED]
Cell: 310-990-7550
Office: 310-336-5189 [EMAIL PROTECTED]
KG6OMK

___
Asterisk-Users mailing list [EMAIL PROTECTED]
http://lists.digium.com/mailman/listinfo/asterisk-users
Re: [Asterisk-Users] Text to Speech - Someone needs to do this
I must say this is basically correct, BUT remember that Festival is actually phonetically based. Keep that in mind and modify your text accordingly, and you might be surprised at the results. Yes, the standard voices do suck!

On Tue, 15 Jul 2003 23:04:24 -0700 (PDT), Chris Albertson wrote:
> People working on this have found that context influences the pronunciation of words. I think the root cause is that the human vocal tract cannot reshape itself instantly for different sounds; it must move from the previous sound to the next, and we hear that movement. If the change is instantaneous, we hear it as unnatural, robot-like speech. Your proposed system would sound just like what it is: a sequence of words. Good systems look not only at phonetic context but also at inflection: tone, volume, pitch range, and speed.
>
> Cursive handwriting is this way too. Cursive fonts don't look like real handwriting because each letter is always the same.
>
> --- Matthew John Darnell [EMAIL PROTECTED] wrote:
>> Why hasn't someone found 50 people who sound alike, put them in sound studios, and recorded the 10,000 most commonly used words? You would record all the different forms of the 1,000 most common words, i.e. leading, trailing, question, etc. You can synthesize the other 0.05% when you run into them. With hard drives so big, processors so fast, and EXT3 able to handle 30,000+ files in a single directory, that seems like the way to do it. You could sell it for BIG bucks.
>> -Matt
Re: [Asterisk-Users] Text to Speech - Someone needs to do this
At 15:41 2003-07-15 -1000, Matthew John Darnell wrote:
> Why hasn't someone found 50 people who sound alike, put them in sound studios, and recorded the 10,000 most commonly used words? You would record all the different forms of the 1,000 most common words, i.e. leading, trailing, question, etc. You can synthesize the other 0.05% when you run into them. With hard drives so big, processors so fast, and EXT3 able to handle 30,000+ files in a single directory, that seems like the way to do it. You could sell it for BIG bucks.

Text-to-Speech (TTS) is usually either formant-based, created by synthesizing sounds, or concatenative, created by concatenating samples of actual speech. However, concatenative TTS usually works with small fragments of speech, not entire words. The storage requirements are much smaller, and it gives the system an opportunity to pick units of speech that match the units that precede and follow them.

The real trick is to get the correct prosody. Here are three sentences with the same words, each with different prosody:

I said 'yes.'
I said yes?
_I_ said '_yes_'???!!

Both formant and concatenative systems add prosody. Adding prosody to whole-word concatenative systems is difficult.

If you're in a buying mood, there are some excellent TTS systems available. For example, Rhetorical (http://www.rhetorical.com) has some excellent voices. The funniest TTS voice currently available is their Southern California female voice; I use it for non-serious demos ("That's so totally awesome.").

Commercial TTS is actually very intelligible and perfectly adequate for many tasks.

--
Moshe Yudkowsky
Disaggregate
2952 W Fargo
Chicago, IL 60645 USA
www.Disaggregate.com
[EMAIL PROTECTED]
+1 773 764 8727
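To make the point about picking units that match their neighbours concrete, here is a minimal sketch of unit selection. It is not how any particular product works: the `Unit` fields, the pitch-only join cost, and the function names are all assumptions made for illustration. A real system would compare many more acoustic features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded speech fragment. All fields are illustrative assumptions."""
    name: str           # which sound the fragment contains
    start_pitch: float  # pitch in Hz at the fragment's left edge
    end_pitch: float    # pitch in Hz at the fragment's right edge


def join_cost(left, right):
    """Penalty for the pitch jump at the boundary between two fragments."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, inventory):
    """Pick one recorded unit per target name, minimizing the total join cost.

    Uses dynamic programming over the candidates (a Viterbi-style search),
    so a choice that looks good locally can lose to a smoother overall path.
    Returns (total_cost, list_of_units).
    """
    if not targets:
        return 0.0, []
    candidates = [[u for u in inventory if u.name == t] for t in targets]
    if any(not column for column in candidates):
        raise ValueError("no recorded unit for at least one target")
    # best holds, for each candidate in the current column, the cheapest
    # (cost, path) that ends in that candidate.
    best = [(0.0, [u]) for u in candidates[0]]
    for column in candidates[1:]:
        new_best = []
        for unit in column:
            cost, path = min(
                ((c + join_cost(p[-1], unit), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost, path + [unit]))
        best = new_best
    return min(best, key=lambda item: item[0])
```

With whole words as the units there are very few candidates per slot, so the search has little room to smooth the joins, which is one way to see why whole-word systems struggle.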
Re: [Asterisk-Users] Text to Speech - Someone needs to do this
--- Moshe Yudkowsky [EMAIL PROTECTED] wrote:
> SNIP
> The real trick is to get the correct prosody. Here are three sentences with the same words, each with different prosody:
>
> I said 'yes.'
> I said yes?
> _I_ said '_yes_'???!!
>
> Both formant and concatenative systems add prosody. Adding prosody to whole-word concatenative systems is difficult.

The thing is that _people_ don't do text to speech. If you were to simply read one word at a time, you'd sound bad too. Try it: if ... you ... were ... to ... simply ... read ... You sound like a robot.

No, we people know what we are trying to communicate. If you want a synthetic voice to sound natural, you will have to tell the software the _intent_ of the words, not just the words. You would need a markup language for that:

<emph>I</emph> said <quote><questionword>yes</questionword></quote>

Now the system can apply some transformations to the pitch, speed, and loudness. For interactive systems markup works, because the software generating the text knows _why_ it is generating the text.

Reading a book for the blind is a much harder problem. The TTS system has to do the same job as a voice actor, which even includes understanding the emotions of characters in a novel. That is very hard for a computer to do. But interactive systems can use markup to get the expression right.

And don't put down Festival. Many (most?) of the commercial systems _are_ Festival.

=
Chris Albertson
Home: 310-376-1029 [EMAIL PROTECTED]
Cell: 310-990-7550
Office: 310-336-5189 [EMAIL PROTECTED]
KG6OMK
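The idea of turning intent markup into pitch, speed, and loudness changes can be sketched roughly as follows. The tag names follow the example in the message above, but the scale factors, the `utterance` wrapper element, and the function names are invented purely for illustration; no real TTS engine is being described.

```python
import xml.etree.ElementTree as ET

# Hypothetical scale factors per intent tag (1.0 means "leave unchanged").
ADJUSTMENTS = {
    "emph": {"pitch": 1.2, "volume": 1.3, "rate": 0.9},
    "questionword": {"pitch": 1.4, "volume": 1.0, "rate": 1.0},
    "quote": {"pitch": 1.0, "volume": 1.0, "rate": 0.95},
}
NEUTRAL = {"pitch": 1.0, "volume": 1.0, "rate": 1.0}


def combine(base, tag):
    """Multiply the inherited settings by the factors for one tag."""
    factors = ADJUSTMENTS.get(tag, NEUTRAL)
    return {key: base[key] * factors[key] for key in base}


def annotate(markup):
    """Return [(word, settings)] for well-formed intent markup.

    Nested tags compound, so a question word inside a quote gets both
    adjustments.
    """
    root = ET.fromstring(f"<utterance>{markup}</utterance>")
    words = []

    def walk(node, settings):
        # Text directly inside this element uses this element's settings.
        for word in (node.text or "").split():
            words.append((word, settings))
        for child in node:
            walk(child, combine(settings, child.tag))
            # Text after a child's closing tag belongs to the parent again.
            for word in (child.tail or "").split():
                words.append((word, settings))

    walk(root, dict(NEUTRAL))
    return words
```

Calling `annotate("<emph>I</emph> said <quote><questionword>yes</questionword></quote>")` yields one entry per word, which a synthesizer back end could then use to scale pitch, rate, and volume.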
Re: [Asterisk-Users] Text to Speech - Someone needs to do this
At 10:11 2003-07-16 -0700, Chris Albertson wrote:
> SNIP
> If you want a synthetic voice to sound natural, you will have to tell the software the _intent_ of the words, not just the words. You would need a markup language for that:
>
> <emph>I</emph> said <quote><questionword>yes</questionword></quote>

The W3C has a TTS markup language, SSML: http://www.w3.org/TR/speech-synthesis/. However, SSML is not a _semantic_ markup language. SSML gives directives about prosody and pronunciation.

> And don't put down Festival. Many (most?) of the commercial systems _are_ Festival.

I am not putting down Festival. However, I don't believe that many or most commercial systems are based on Festival.

I think we should take any further discussion off-list.

Regards,
Moshe

--
Moshe Yudkowsky
Disaggregate
2952 W Fargo
Chicago, IL 60645 USA
http://www.Disaggregate.com
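For comparison with the intent markup quoted above, here is a small sketch that builds an SSML document for the same sentence. The `speak`, `emphasis`, and `prosody` elements and their attributes are taken from the W3C SSML specification; the choice of which words to mark and how strongly is an assumption made for the example, and whether a given engine honors these directives is a separate question. Note that the markup says "raise the pitch here", not "this is a question".

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml():
    """Build SSML for "I said yes?" using prosody directives, not intent."""
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": "en-US"},
    )
    # Stress the first word.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "I"
    emphasis.tail = " said "
    # Raise the pitch on the last word to suggest a question.
    prosody = ET.SubElement(speak, "prosody", {"pitch": "high"})
    prosody.text = "yes"
    prosody.tail = "?"
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml())
```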
Re: [Asterisk-Users] Text to Speech - Someone needs to do this
Moshe Yudkowsky wrote:
> At 10:11 2003-07-16 -0700, Chris Albertson wrote:
>> SNIP
>> If you want a synthetic voice to sound natural, you will have to tell the software the _intent_ of the words, not just the words. You would need a markup language for that:
>>
>> <emph>I</emph> said <quote><questionword>yes</questionword></quote>
>
> The W3C has a TTS markup language, SSML: http://www.w3.org/TR/speech-synthesis/. However, SSML is not a _semantic_ markup language. SSML gives directives about prosody and pronunciation.

Two interesting things about SSML (which drew on the earlier Sable markup). One: there is almost no support for it among the commercial TTS packages. Two: even the people who wrote the SSML spec don't seem to have fully implemented it. The markup in most commercial TTS software is both proprietary and cranky.

>> And don't put down Festival. Many (most?) of the commercial systems _are_ Festival.
>
> I am not putting down Festival. However, I don't believe that many or most commercial systems are based on Festival.

You are wrong. All the packages I know, except Eloquence and maybe RealSpeak, are based at some level on Festival. The ones derived from Naturally Speaking have most of the Festival directories still in place. Strange, but true.

Regards,
Steve
[Asterisk-Users] Text to Speech - Someone needs to do this
Why hasn't someone found 50 people who sound alike, put them in sound studios, and recorded the 10,000 most commonly used words? You would record all the different forms of the 1,000 most common words, i.e. leading, trailing, question, etc. You can synthesize the other 0.05% when you run into them. With hard drives so big, processors so fast, and EXT3 able to handle 30,000+ files in a single directory, that seems like the way to do it. You could sell it for BIG bucks.

-Matt
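The proposal above amounts to a lookup table with a synthesis fallback. A minimal sketch, assuming a hypothetical file naming scheme of `<word>_<form>.wav` and a `recorded` set of (word, form) pairs; none of these names come from the thread, and the positional rules are deliberately crude.

```python
def pick_form(index, count, is_question):
    """Choose which recorded form of a word fits its place in the sentence."""
    if index == count - 1:
        return "question" if is_question else "trailing"
    if index == 0:
        return "leading"
    return "plain"


def plan_utterance(sentence, recorded):
    """Return a list of (action, value) steps for speaking a sentence.

    action is "play" with a recording's file name, or "synthesize" with the
    bare word when no recording of it exists.
    """
    is_question = sentence.strip().endswith("?")
    words = [w.strip(".,?!'\"").lower() for w in sentence.split()]
    words = [w for w in words if w]
    plan = []
    for i, word in enumerate(words):
        form = pick_form(i, len(words), is_question)
        if (word, form) in recorded:
            plan.append(("play", f"{word}_{form}.wav"))
        elif (word, "plain") in recorded:
            # Fall back to the neutral recording if this form is missing.
            plan.append(("play", f"{word}_plain.wav"))
        else:
            plan.append(("synthesize", word))
    return plan
```

The sketch only chooses files; it does nothing about how one word's recording joins onto the next.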
Re: [Asterisk-Users] Text to Speech - Someone needs to do this
Matthew John Darnell wrote:
> Why hasn't someone found 50 people who sound alike, put them in sound studios, and recorded the 10,000 most commonly used words? You would record all the different forms of the 1,000 most common words, i.e. leading, trailing, question, etc. You can synthesize the other 0.05% when you run into them. With hard drives so big, processors so fast, and EXT3 able to handle 30,000+ files in a single directory, that seems like the way to do it. You could sell it for BIG bucks.

People have done this. The results are terrible. You couldn't charge big bucks. You'd have trouble giving it away.

Regards,
Steve