-------- Forwarded Message --------
Subject:        We may see some really great-sounding TTS in the near future
Date:   Sun, 11 Sep 2016 22:01:48 -0700
From:   Warren Carr <warc...@gmail.com>
Reply-To:       eyes-f...@googlegroups.com
To:     eyes-f...@googlegroups.com



I was reading a blog post from DeepMind about WaveNet, and I was blown away by some of the stuff they are doing.

I can’t wait to have those voices on our devices!

Here is the extract, followed by the URL of the page. Be sure to head over to that page and take a listen to some of those voices.

If you don’t want to read while you are on the page, you can simply press the letter B to jump straight to the “play” buttons.

The first samples demonstrate how the current Google TTS systems sound, and the later ones demonstrate the more natural, modern-sounding WaveNet voices.

There are a couple of other languages in there besides U.S. English and Chinese.

Quote:

This post presents WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.

We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces.

Talking Machines

Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks (e.g., Google Voice Search). However, generating speech with computers — a process usually referred to as speech synthesis or text-to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.
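
To make the concatenative idea concrete, here is a toy sketch in Python. This is my own illustration, not anything from the blog post: the fragment names and the whole "database" are invented, and real systems splice thousands of diphone-sized units per speaker, not three sine bursts.

import numpy as np

sr = 16000  # samples per second

def tone(freq, dur=0.2):
    # Stand-in for a recorded speech fragment: a short sine burst.
    t = np.arange(int(sr * dur)) / sr
    return 0.3 * np.sin(2 * np.pi * freq * t)

# Toy "database" of pre-recorded units (hypothetical names).
fragments = {"hel": tone(220.0), "lo": tone(330.0), "world": tone(440.0)}

def synthesize(units):
    # Concatenative TTS in miniature: look up recorded fragments and
    # splice them end to end, crossfading 5 ms to hide the seams.
    fade = int(0.005 * sr)
    out = fragments[units[0]].copy()
    for u in units[1:]:
        nxt = fragments[u]
        ramp = np.linspace(0.0, 1.0, fade)
        out[-fade:] = out[-fade:] * (1.0 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out

audio = synthesize(["hel", "lo", "world"])  # one utterance, one fixed voice

The point to notice is that every sound that can ever come out is something that went in as a recording, which is why changing the voice means recording a whole new database.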

This has led to a great demand for parametric TTS, where all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. So far, however, parametric TTS has tended to sound less natural than concatenative, at least for syllabic languages such as English. Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known as vocoders.
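
For contrast, here is an equally toy-sized parametric pipeline, again my own sketch rather than anything from the post: a stand-in "model" emits per-frame pitch and amplitude, and a minimal sinusoidal "vocoder" renders them as audio.

import numpy as np

sr = 16000
frame = int(0.010 * sr)  # 10 ms frames

def acoustic_model(text):
    # Stand-in for the learned model: maps text to per-frame parameters.
    # Real systems predict spectral envelopes, voicing, and more; here it
    # is just a pitch contour (Hz) and an amplitude per frame.
    n = 20 * max(len(text), 1)
    f0 = 120.0 + 30.0 * np.sin(np.linspace(0.0, np.pi, n))
    amp = np.full(n, 0.3)
    return f0, amp

def vocoder(f0, amp):
    # Minimal sinusoidal "vocoder": renders each frame's parameters as
    # audio, keeping phase continuous across frame boundaries.
    phase, chunks = 0.0, []
    t = np.arange(frame)
    for hz, a in zip(f0, amp):
        chunks.append(a * np.sin(phase + 2.0 * np.pi * hz * t / sr))
        phase += 2.0 * np.pi * hz * frame / sr
    return np.concatenate(chunks)

audio = vocoder(*acoustic_model("hello world"))

Here the voice lives entirely in the model's parameters, which is what makes it flexible; the vocoder stage is also a large part of why parametric systems have tended to sound less natural.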

WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time. As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music.

WaveNets

[Wave animation]

Researchers usually avoid modelling raw audio because it ticks so quickly: typically 16,000 samples per second or more, with important structure at many time-scales. Building a completely autoregressive model, in which the prediction for every one of those samples is influenced by all previous ones (in statistics-speak, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.
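
In symbols (my notation, not the post's): for a waveform x = (x_1, ..., x_T), a fully autoregressive model factorises the joint distribution as

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

so one second of 16 kHz audio already involves 16,000 such conditional predictions, each conditioned on everything generated before it.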

However, our PixelRNN and PixelCNN models, published earlier this year, showed that it was possible to generate complex natural images not only one pixel at a time, but one colour-channel at a time, requiring thousands of predictions per image. This inspired us to adapt our two-dimensional PixelNets to a one-dimensional WaveNet.

[Architecture animation]

The above animation shows how a WaveNet is structured. It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of timesteps.
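
The exponential growth is easy to check with back-of-the-envelope arithmetic. The doubling schedule below is the example given in the WaveNet paper; treat the exact block count as illustrative:

# Receptive field of stacked dilated causal convolutions (filter width 2):
# a layer with dilation d reaches d samples further back, so the total
# field is 1 + the sum of the dilations.
dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
print(1 + sum(dilations))                 # 1024 samples from just 10 layers

# Repeating the 10-layer block three times reaches 1 + 3 * 1023 = 3070
# samples, i.e. thousands of timesteps from 30 layers, versus only 31
# samples for the same 30 layers without dilation.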

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
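
That sampling loop looks roughly like this in Python. A minimal sketch, assuming a trained network that returns a distribution over 256 quantised amplitude levels (the 8-bit mu-law setup the WaveNet paper describes); the net below is a random stand-in, not the real model.

import numpy as np

rng = np.random.default_rng(0)

def net(history):
    # Stand-in for the trained WaveNet: given all samples so far, return
    # a probability distribution over 256 quantised amplitude levels for
    # the next one (random here; learned in the real model).
    logits = rng.normal(size=256)
    e = np.exp(logits - logits.max())
    return e / e.sum()

samples = [128]                      # mid-scale in 8-bit quantisation
for _ in range(16000):               # one second of audio at 16 kHz
    probs = net(np.array(samples))
    nxt = rng.choice(256, p=probs)   # draw a value from the distribution
    samples.append(int(nxt))         # feed it back in for the next step

One full network evaluation per output sample is exactly why generating audio this way is expensive.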

Improving the State of the Art

We trained WaveNet using some of Google’s TTS datasets so we could evaluate its performance. The following figure shows the quality of WaveNets on a scale from 1 to 5, compared with Google’s current best TTS systems (parametric and concatenative), and with human speech, using Mean Opinion Scores (MOS). MOS are a standard measure for subjective sound quality tests, and were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese.
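
The "over 50%" figure is just gap arithmetic. With made-up numbers (the real MOS values are in the figure on the blog page):

# Fraction of the human-vs-best-baseline MOS gap that the new system
# closes. Placeholder values, NOT the published results.
mos_baseline = 3.9   # best of parametric/concatenative (hypothetical)
mos_wavenet  = 4.3   # hypothetical
mos_human    = 4.6   # hypothetical

gap_closed = (mos_wavenet - mos_baseline) / (mos_human - mos_baseline)
print(f"{gap_closed:.0%}")   # 57% of the remaining gap, on these numbers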

For both Chinese and English, Google’s current TTS systems are considered among the best worldwide, so improving on both with a single model is a major achievement.

Here are some samples from all three systems so you can listen and compare yourself:

End of quote from:

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

What do you think?

Warren
