Re: [Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-16 Thread Chris Albertson

People working on this have found that context influences the
pronunciation of words.  I think the root cause of this is
that the human vocal tract cannot re-shape itself for different
sounds instantly and must move from the previous sound to the next
sound; we hear the movement.  If the sound does change instantly then
we hear it as unnatural, robot-like speech.  Your proposed system
would sound just like what it is, a sequence of words.
Good systems look not only at phonetic context but also at
inflection: tone, volume, pitch range and speed.

Cursive handwriting is this way too.  Cursive fonts don't
look like real handwriting because each letter is always
the same.

--- Matthew John Darnell [EMAIL PROTECTED] wrote:
 Why hasn't someone found 50 people who sound alike, put them in sound
 studios and recorded the 10,000 most commonly used words?  You would
 record all the different forms of the 1,000 most common words, i.e.
 leading, trailing, question, etc.
 
 You can synthesize the other 0.05% when you run into them.  With hard
 drives so big, processors so fast and EXT3 that can handle 30,000+
 files in a single directory, that seems like the way to do it.
 
 You could sell it for BIG bucks.
 
 -Matt
 
 ___
 Asterisk-Users mailing list
 [EMAIL PROTECTED]
 http://lists.digium.com/mailman/listinfo/asterisk-users


=
Chris Albertson
  Home:   310-376-1029  [EMAIL PROTECTED]
  Cell:   310-990-7550
  Office: 310-336-5189  [EMAIL PROTECTED]
  KG6OMK



Re: [Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-16 Thread Gary

I must say this is basically correct

BUT

remember that Festival is actually phonetically based.  Keep that in
mind and modify your text accordingly, and you might be surprised at the
results.

Yes, the standard voices do suck!
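
To make "modify your text accordingly" concrete, here is a minimal sketch of respelling words before the text is handed to the synthesizer. The table entries and the function name are made up for illustration; which respellings actually help depends on the voice you are using, so treat them as placeholders to tune by ear.

```python
import re

# Hypothetical respellings: these entries are assumptions, not tested
# against any particular Festival voice.
RESPELLINGS = {
    "asterisk": "aster isk",
    "voicemail": "voice mail",
}

def respell(text, table=RESPELLINGS):
    """Swap whole words found in the table; leave everything else untouched."""
    def swap(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    # Match runs of letters so punctuation and spacing pass through unchanged.
    return re.sub(r"[A-Za-z]+", swap, text)
```

The output of respell() is what you would pass to the TTS engine in place of the original prompt text.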

On Tue, 15 Jul 2003 23:04:24 -0700 (PDT), Chris Albertson wrote:


People working on this have found that context influences the
pronunciation of words.  I think the root cause of this is
that the human vocal tract cannot re-shape itself for different
sounds instantly and must move from the previous sound to the next
sound; we hear the movement.  If the sound does change instantly then
we hear it as unnatural, robot-like speech.  Your proposed system
would sound just like what it is, a sequence of words.
Good systems look not only at phonetic context but also at
inflection: tone, volume, pitch range and speed.

Cursive handwriting is this way too.  Cursive fonts don't
look like real handwriting because each letter is always
the same.



Re: [Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-16 Thread Moshe Yudkowsky
At 15:41 2003-07-15 -1000, Matthew John Darnell wrote:
 Why hasn't someone found 50 people who sound alike, put them in sound
 studios and recorded the 10,000 most commonly used words?  You would
 record all the different forms of the 1,000 most common words, i.e.
 leading, trailing, question, etc.
 
 You can synthesize the other 0.05% when you run into them.  With hard
 drives so big, processors so fast and EXT3 that can handle 30,000+
 files in a single directory, that seems like the way to do it.
 
 You could sell it for BIG bucks.

Text-to-Speech (TTS) is usually either formant-based, created by 
synthesizing sounds, or concatenative, created by concatenating samples 
of actual speech.

However, concatenative TTS usually works by using small fragments of 
speech, not entire words. The storage requirements are much smaller, and it 
gives the system an opportunity to pick units of speech that match the 
units of speech that precede and follow them.
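
As a rough illustration of picking units that match their neighbours, here is a small sketch that chooses one recorded variant per position so that the joins are as smooth as possible. The Unit fields and the pitch-jump join cost are assumptions made for the example, not how any particular product works; real systems weigh many more features. It assumes every position has at least one candidate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    name: str           # label of the speech fragment (hypothetical)
    start_pitch: float  # pitch in Hz at the start of the recording
    end_pitch: float    # pitch in Hz at the end of the recording

def join_cost(left, right):
    """Assumed cost: size of the pitch jump where two recordings meet."""
    return abs(left.end_pitch - right.start_pitch)

def select_units(candidates):
    """candidates[i] lists the recorded variants available for position i.
    Returns one variant per position with the lowest total join cost."""
    # best holds, for each variant at the current position, the cheapest
    # (total cost, path) that ends in that variant.
    best = [(0.0, [unit]) for unit in candidates[0]]
    for options in candidates[1:]:
        new_best = []
        for unit in options:
            cost, path = min(
                ((c + join_cost(p[-1], unit), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost, path + [unit]))
        best = new_best
    return min(best, key=lambda item: item[0])[1]
```

Because each step only keeps the cheapest path into every variant, the search stays linear in the number of positions instead of trying every combination.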

The real trick is to get the correct prosody. Here are three sentences with 
the same words but each with different prosody:

I said 'yes.'

I said yes?

_I_ said '_yes_'???!!

Both formant and concatenative systems add prosody. Adding prosody to 
whole-word concatenative systems is difficult.

If you're in a buying mood, there are some excellent TTS systems available. 
For example, Rhetorical (http://www.rhetorical.com) has some excellent 
voices. The funniest TTS voice currently available is their Southern 
California female voice; I use it for non-serious demos ("That's so 
totally awesome.")

Commercial TTS is actually very intelligible and perfectly adequate for many 
tasks.



--
 Moshe Yudkowsky
 Disaggregate
 2952 W Fargo
 Chicago, IL 60645 USA
 www.Disaggregate.com
 [EMAIL PROTECTED]
 +1 773 764 8727


Re: [Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-16 Thread Chris Albertson

--- Moshe Yudkowsky [EMAIL PROTECTED] wrote:
SNIP
 
 The real trick is to get the correct prosody. Here are three sentences
 with the same words but each with different prosody:
 
 I said 'yes.'
 
 I said yes?
 
 _I_ said '_yes_'???!!
 
 Both formant and concatenative systems add prosody. Adding prosody
 to whole-word concatenative systems is difficult.

The thing is that _people_ don't do text to speech.  If you were to 
simply read one word at a time you'd sound bad too.

Try it:  if, ... you, ... were, ... to, ... simply, ... read, ...
You sound like a robot.  No, we people know what it is we are
trying to communicate.  If you want a synthetic voice to sound
natural you will have to tell the software the _intent_ of the words,
not just the words.  You would need a markup language for that:

<emph> I </emph> said <quote><questionword> yes </questionword></quote>

Now the system can apply some transformations to the pitch, speed
and loudness.  For interactive systems markup works because the
software generating the text knows _why_ it is generating the text.
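
A rough sketch of what "apply some transformations" could look like: each intent tag maps to multipliers on pitch, speed and loudness, and a word carrying several tags gets them combined. The tag names follow the emph and questionword example above; the Prosody type, the rule table and every number in it are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prosody:
    pitch: float = 1.0   # multiplier on the voice's baseline pitch
    speed: float = 1.0   # multiplier on speaking rate
    volume: float = 1.0  # multiplier on loudness

# Hypothetical intent-to-adjustment table; the values are placeholders.
INTENT_RULES = {
    "emph": Prosody(pitch=1.1, speed=0.9, volume=1.3),
    "questionword": Prosody(pitch=1.3),
}

def annotate(tagged_words):
    """tagged_words is a list of (word, [intent, ...]) pairs.
    Returns (word, Prosody) pairs with each word's intents multiplied together."""
    result = []
    for word, intents in tagged_words:
        pitch = speed = volume = 1.0
        for intent in intents:
            # Unknown intents (such as "quote" here) leave the word unchanged.
            rule = INTENT_RULES.get(intent, Prosody())
            pitch *= rule.pitch
            speed *= rule.speed
            volume *= rule.volume
        result.append((word, Prosody(pitch, speed, volume)))
    return result
```

The point is that the program producing the prompt supplies the tags, so nothing has to guess the intent from the bare words.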

Reading a book for the blind is a much harder problem.  The
TTS system has to do the same job as a voice actor, which even
includes understanding the emotions of characters in a novel.  Very
hard to do for a computer.

But interactive systems can use markup to get the expression
right.

And don't put down festival.  Many (most?) of the commercial systems
_are_ festival.





Re: [Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-16 Thread Moshe Yudkowsky
At 10:11 2003-07-16 -0700, Chris Albertson wrote:

SNIP
 if you want a synthetic voice to sound
 natural you will have to tell the software the _intent_ of the words,
 not just the words.  You would need a markup language for that:
 <emph> I </emph> said <quote><questionword> yes </questionword></quote>

The W3C has a TTS markup language, SSML, 
http://www.w3.org/TR/speech-synthesis/. However, SSML is not a _semantic_ 
markup language. SSML gives directives about prosody and pronunciation.
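
For a feel of what such directives look like, here is a small sketch that builds an SSML fragment with Python's standard library. The speak, emphasis and prosody elements and the namespace are taken from the W3C specification; the sentence, the attribute values chosen and the function name are only an example, and whether a given engine honours them is a separate question.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_ssml():
    """Return SSML for: I said yes? with stress on "I" and raised pitch on "yes"."""
    ET.register_namespace("", SSML_NS)  # write SSML as the default namespace
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0", XML_LANG: "en-US"})

    # A directive about delivery, not about meaning: stress this word.
    emphasis = ET.SubElement(speak, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = "I"
    emphasis.tail = " said "

    # Another delivery directive: raise the pitch for this span.
    prosody = ET.SubElement(speak, f"{{{SSML_NS}}}prosody", {"pitch": "high"})
    prosody.text = "yes?"

    return ET.tostring(speak, encoding="unicode")

print(build_ssml())
```

Note that the markup says "stress this" and "raise the pitch here"; it never says "this is a question", which is the sense in which it is not semantic.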

 And don't put down festival.  Many (most?) of the commercial systems
 _are_ festival.

I am not putting down Festival. However, I don't believe that many or most 
commercial systems are based on Festival.

I think we should take any further discussion off-list.

Regards,
 Moshe


Re: [Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-16 Thread Steve Underwood
Moshe Yudkowsky wrote:

At 10:11 2003-07-16 -0700, Chris Albertson wrote:

SNIP

  if you want a synthetic voice to sound
  natural you will have to tell the software the _intent_ of the words,
  not just the words.  You would need a markup language for that:
  <emph> I </emph> said <quote><questionword> yes </questionword></quote>

 The W3C has a TTS markup language, SSML, 
 http://www.w3.org/TR/speech-synthesis/. However, SSML is not a 
 _semantic_ markup language. SSML gives directives about prosody and 
 pronunciation. 

Two interesting things about SSML (which used to be called Sable). One - 
there is almost no support for it amongst the commercial TTS packages. 
Two - even the people who wrote the SSML spec don't seem to have fully 
implemented it. The markup in most commercial TTS software is both 
proprietary and cranky.

  And don't put down festival.  Many (most?) of the commercial systems
  _are_ festival.

 I am not putting down Festival. However, I don't believe that many or 
 most commercial systems are based on Festival.

You are wrong. All the packages I know, except Eloquence and maybe 
RealSpeak, are based at some level on Festival. The ones derived from 
Naturally Speaking have most of the Festival directories still in place. 
Strange, but true.

Regards,
Steve


[Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-15 Thread Matthew John Darnell
Why hasn't someone found 50 people who sound alike, put them in sound
studios and recorded the 10,000 most commonly used words?  You would
record all the different forms of the 1,000 most common words, i.e.
leading, trailing, question, etc.

You can synthesize the other 0.05% when you run into them.  With hard drives
so big, processors so fast and EXT3 that can handle 30,000+ files in a
single directory, that seems like the way to do it.

You could sell it for BIG bucks.

-Matt



Re: [Asterisk-Users] Text to Speech - Someone needs to do this

2003-07-15 Thread Steve Underwood
Matthew John Darnell wrote:

 Why hasn't someone found 50 people who sound alike, put them in sound
 studios and recorded the 10,000 most commonly used words?  You would
 record all the different forms of the 1,000 most common words, i.e.
 leading, trailing, question, etc.
 
 You can synthesize the other 0.05% when you run into them.  With hard drives
 so big, processors so fast and EXT3 that can handle 30,000+ files in a
 single directory, that seems like the way to do it.
 
 You could sell it for BIG bucks.
 

People have done this. The results are terrible. You couldn't charge big 
bucks. You'd have trouble giving it away.

Regards,
Steve