David Luff wrote:

> This all sounds very exciting, especially the encouraging results from the
> voice recognition stage, and the fact that Jon thinks that Festival 2 is
> sounding pretty good.

As I mentioned earlier, you can make up your own mind; there are numerous examples available at:


- including the possibility of "remote-controlling" festival via a
simple CGI:


... so that the synthesized speech is returned as a wav/mp3 file.
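For illustration, here is a minimal sketch of what such a "remote-control" client could look like. The endpoint URL and the parameter names ("text", "format") are placeholders of mine, not festival's actual CGI interface:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint - substitute the real CGI URL from the page above.
FESTIVAL_CGI = "http://example.org/cgi-bin/festival"

def synthesis_url(text, fmt="wav"):
    """Build the request URL asking the CGI to synthesize `text`."""
    return FESTIVAL_CGI + "?" + urlencode({"text": text, "format": fmt})

def fetch_speech(text, fmt="wav"):
    """Request synthesized speech and return the raw audio bytes."""
    with urlopen(synthesis_url(text, fmt)) as response:
        return response.read()
```

FG (or any other client) could then hand the returned bytes to its audio layer, or simply dump them to a .wav file.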

Some of these examples sound pretty unspectacular, actually exactly
like what you'd normally expect from a TTS, while others are
really impressive. The impressive ones are created using an approach
based on mathematically enhanced fragments of real audio data, which is
specifically meant for areas where only a certain
subset of vocabulary is expected (that sounds suitable for our task).

I think they call it "Language (specific) Domain Modelling" or something
like that; I will have to check their homepage for the exact term, though - here:


> Could you send me the code you've got so far for
> sending strings across to FG?

If I didn't get him wrong, he was only just about to consider making a
stab at it, not actually writing anything specific for FG yet. Also, my
impression was that he was first going to send strings *from* FG to the TTS?
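For the FG -> TTS direction, the obvious route would be festival's server mode, which accepts Scheme commands such as (SayText "..."). The connection details below (host, port, one command per line) are only a sketch of mine, not working FG code:

```python
import socket

def festival_say(text):
    """Wrap an utterance in festival's (SayText "...") command,
    escaping backslashes and double quotes in the text."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return ('(SayText "%s")\n' % escaped).encode("utf-8")

def send_to_festival(text, host="localhost", port=1314):
    """Send one utterance to a running festival server
    (1314 is festival's default server-mode port)."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(festival_say(text))
```

The nice part is that FG would only need to ship plain text strings; all voice-related decisions stay on the festival side.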

> I'm a bit unclear which parts you are actually working on.  Are you working
> on the decoding of the speech to text-strings only so far, or have you
> actually started on logically decoding the text strings for ATC-AI?  This
> is the part I'm currently in deep thought about.

Would you mind sharing more of your 'deep thoughts'? :-) I would really love to hear about other suggested approaches to the AI part.

> A few random thoughts.  Speech recognition for ATC ought to be easier than
> the general case, since the smaller vocabulary ought to mean that better
> guesses can be made, if this sort of thing can be specified to the ASR

That's entirely correct; the latter is indeed possible, with festival too. This is what I referred to as 'LDM', and it is also where the results are really REMARKABLE (there are examples of that, too). It is all based on building a domain-specific phonetic database, so the overall quality will depend on whether you have a suitable database available or not:


For the LDM part you need to have REAL AUDIO data of the relevant
domain available and have another application process it; this
creates a mathematical model of the most frequently encountered
phonetic patterns, which are then used to build a
"language domain specific" database for speech synthesis.
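The real limited-domain tooling works on phone-level audio units, but the "most frequently encountered patterns" idea can be illustrated at word level with a simple n-gram count over transcribed ATC phrases (a simplification of mine, not the festival tooling):

```python
from collections import Counter

def frequent_patterns(transcripts, n=2):
    """Count n-word sequences across transcribed utterances; the most
    common ones are the patterns worth covering with real audio."""
    counts = Counter()
    for line in transcripts:
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

transcripts = [
    "cleared for takeoff runway two seven",
    "cleared to land runway two seven",
    "taxi to holding point runway two seven",
]
top = frequent_patterns(transcripts).most_common(2)
```

In a domain this narrow, phrases like "runway two seven" dominate very quickly, which is exactly why a small but well-chosen set of recordings can go a long way.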

So this is the drawback: you need to have plenty of audio
data available in order to get good results. On the other hand,
it shouldn't be all that difficult to get our hands on a couple of
hours of ATC recordings: there are numerous free ATC streaming
services on the web, one would only need to record these and
feed them into an LDM creation application.

But on the other hand, quality is a pretty determining factor,
so you cannot simply use any ATC stream, such as those that
were recorded using a simple scanner; one would rather need
to use high-quality recordings of ATC <-> pilot conversations
in order to create a usable domain model. So one would prefer
those streams that are made available by airports/centers.

What I found particularly interesting when I browsed the festival
pages a couple of days ago, though, was that festival already seems
to have components available that would take care of what I
referred to as the "dialecting part"; take a look at:


So, in that regard there would be no need to make up anything :-)


Flightgear-devel mailing list
