I was unable to hear any of the samples. Everything said unavailable.
Maria Campbell
luc...@gmail.com
Love and compassion are necessities, not luxuries.
Without them, humanity cannot survive.
--Dalai Lama
On 9/12/2016 7:57 AM, Josh Kennedy wrote:
-------- Forwarded Message --------
Subject: We may see some really great-sounding TTS in the near future
Date: Sun, 11 Sep 2016 22:01:48 -0700
From: Warren Carr <warc...@gmail.com>
Reply-To: eyes-f...@googlegroups.com
To: eyes-f...@googlegroups.com
I was reading a blog post about WaveNet and was blown away by some of the stuff they are doing.
I can’t wait to have those voices on our devices!
Here’s the extract, followed by the URL to the page. Be sure to head over to that page and take a listen to some of those voices.
If you don’t want to read while you are on the page, you can simply press the letter B to jump to the “play button.”
The first samples demonstrate how the current Google TTS sounds, and the later ones demonstrate the more natural-sounding voices.
There are a couple of other languages in there besides U.S. English and Chinese.
Quote:
This post presents WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.
We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces.
Talking Machines
Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks (e.g., Google Voice Search). However, generating speech with computers — a process usually referred to as speech synthesis or text-to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.
This has led to a great demand for parametric TTS, where all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. So far, however, parametric TTS has tended to sound less natural than concatenative, at least for syllabic languages such as English. Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known as vocoders.
WaveNet changes this paradigm by directly modelling the raw waveform
of the audio signal, one sample at a time. As well as yielding more
natural-sounding
speech, using raw waveforms means that WaveNet can model any kind of
audio, including music.
WaveNets
[Wave animation]
Researchers usually avoid modelling raw audio because it ticks so
quickly: typically 16,000 samples per second or more, with important
structure at many
time-scales. Building a completely autoregressive model, in which the
prediction for every one of those samples is influenced by all
previous ones (in
statistics-speak, each predictive distribution is conditioned on all
previous observations), is clearly a challenging task.
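Spelled out, that is the standard autoregressive factorization (the notation here is a gloss on the post's description, not taken from it): for a waveform x = (x_1, ..., x_T),

p(x) = \prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})

so generating one second of 16 kHz audio means evaluating 16,000 such conditionals in sequence.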
However, our PixelRNN and PixelCNN
models, published earlier this year, showed that it was possible to
generate complex natural images not only one pixel at a time, but one
colour-channel
at a time, requiring thousands of predictions per image. This inspired
us to adapt our two-dimensional PixelNets to a one-dimensional WaveNet.
[Architecture animation]
The above animation shows how a WaveNet is structured. It is a fully
convolutional neural network, where the convolutional layers have
various dilation
factors that allow its receptive field to grow exponentially with
depth and cover thousands of timesteps.
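As a rough back-of-the-envelope illustration (this sketch is mine, not from the post; the kernel size of 2 and the doubling dilation schedule are assumptions), you can see how dilation makes the receptive field grow exponentially with depth:

# Illustrative only: receptive field of a stack of dilated convolutions
# with kernel size 2 and dilations 1, 2, 4, ..., 512 (assumed values).
kernel_size = 2
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512

receptive_field = 1
for d in dilations:
    # each layer reaches back an extra (kernel_size - 1) * dilation samples
    receptive_field += (kernel_size - 1) * d

print(receptive_field)  # 1024 samples, i.e. 64 ms of audio at 16 kHz

Ten layers already cover about a thousand timesteps, which is why depth plus dilation substitutes for an impractically wide ordinary convolution.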
At training time, the input sequences are real waveforms recorded from
human speakers. After training, we can sample the network to generate
synthetic
utterances. At each step during sampling a value is drawn from the
probability distribution computed by the network. This value is then
fed back into the
input and a new prediction for the next step is made. Building up
samples one step at a time like this is computationally expensive, but
we have found
it essential for generating complex, realistic-sounding audio.
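That sampling loop looks roughly like the Python sketch below (illustrative only; predict_distribution is a hypothetical stand-in for the trained network, not WaveNet's real interface):

import numpy as np

# Hypothetical sketch of autoregressive sampling: at each step, the
# network maps the samples generated so far to a probability
# distribution over the next quantized sample value.
def generate(predict_distribution, num_steps, num_levels=256):
    samples = []
    for _ in range(num_steps):
        probs = predict_distribution(samples)               # distribution over num_levels values
        value = int(np.random.choice(num_levels, p=probs))  # draw one sample
        samples.append(value)                               # feed it back as the next input
    return samples

The cost comes from that serial dependency: every sample has to wait for the previous one before it can be drawn.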
Improving the State of the Art
We trained WaveNet using some of Google’s TTS datasets so we could evaluate its performance. The following figure shows the quality of WaveNets on a scale from 1 to 5, compared with Google’s current best TTS systems (parametric and concatenative), and with human speech, using Mean Opinion Scores (MOS). MOS are a standard measure for subjective sound quality tests, and were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese.
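For anyone unfamiliar with MOS: it is simply the arithmetic mean of listeners' 1-to-5 ratings. A toy example in Python, with made-up numbers rather than the study's data:

ratings = [4, 5, 4, 3, 5, 4]        # hypothetical 1-5 opinion scores
mos = sum(ratings) / len(ratings)   # Mean Opinion Score
print(round(mos, 2))                # 4.17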
For both Chinese and English, Google’s current TTS systems are
considered among the best worldwide, so improving on both with a
single model is a major
achievement.
Here are some samples from all three systems so you can listen and compare yourself: [audio players on the original page]
End of quote from:
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
What do you think?
Warren