Re: keynote gold festival sapi5 voice

AudioGames.net Forum, Off-topic room: raygrote via Audiogames-reflector, Wed, 31 Aug 2016 17:17:22 -0700

Hi all,
Haven't officially tried the Keynote Gold recreation yet but from what I've heard it is interesting. Not quite sure why it's so nostalgic to have this synth, but it definitely is sounding a lot like Keynote Gold from the samples I'm hearing, and I applaud the effort.
Now, there are people who are asking about the program being used to create the voices. Being an audio and language geek of sorts, I decided to try to create my own voice without the aid of a tutorial, and I will describe below what I've found for those interested.
This program, called MNLP, seems to allow for not only tts voice creation but lip sync animation as well. I'm not sure what it's aiming to do but it seems to be for a sort of niche audience, which I am not criticizing. I like that sort of stuff.
If you want to create a tts voice, you must first know that you can't just take random recordings and plop them in and press a create button. This is because the program comes with a basic voice creator that will instruct you on sentences that you must speak. You then can optionally check the sentences to make sure the word boundaries are identified properly.
As Jake pointed out, the interface does not use standard Windows GUI controls that screen readers will recognise. I ended up using Jaws since I am more familiar with its Jaws cursor than with NVDA's object nav, and to the credit of both MNLP and Jaws, it is quite usable this way. Nevertheless, after about 4 hours of trying to use it and explore it, I found myself becoming frustrated and overwhelmed.
When recording a sentence, the program analyzes the recording to identify word boundaries, which you can verify by just clicking on the words. This is important, because if you end up with word boundaries which are too far off, the resulting voice will start to babble and not make a lot of sense with certain words and phrases. You can turn this word review off, and by doing so you eliminate the long analyses for each recording and the subsequent word review process. However, the trade-off is that you can't find mistakes and redo the recording more clearly to eliminate them. It's a good thing I checked every single recording too, as some word boundaries were extremely far off, and I never would've caught them without review. One was so bad that the program couldn't even identify all the words in the sentence. I'm not sure why it had so much difficulty; my speech is not at all hard to understand and I speak a natural American English dialect, which the program is supposedly most trained for. Needless to say, I have lost faith in its automatic detection and I always check it, which is not the quickest of processes.
I spent 2 hours analyzing and then verifying each word boundary on 35 sentences. Then I wanted to see how the voice was sounding. So I compiled it. And that's where frustration set in. The voice took over half an hour to compile on my machine. And that's nowhere near what you'd need to produce a good voice... it's recommended you have a minimum of 1000 sentences recorded to really make a good voice! A voice with that many recordings, as Jake stated above in this topic, would take hours to compile! And that's not even the limit, as the included sentence list has over 3500 sentences you could record.
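For a rough sense of scale, the review time above can be extrapolated with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes review time grows linearly with the number of sentences, which is an assumption and not something the program documents, and the function name is made up for illustration. It deliberately leaves compile time out, since there is only one data point and no way to tell how it scales.

```python
# Rough estimate only. Linear scaling is an assumption, not a measured fact.
REVIEW_HOURS = 2.0        # time spent analyzing and verifying the first batch
REVIEWED_SENTENCES = 35   # number of sentences in that batch


def estimated_review_hours(sentence_count):
    """Linearly extrapolate review time from the 35-sentence batch."""
    per_sentence = REVIEW_HOURS / REVIEWED_SENTENCES
    return per_sentence * sentence_count


if __name__ == "__main__":
    # 1000 is the recommended minimum; 3539 is the full included sentence list.
    for count in (1000, 3539):
        hours = estimated_review_hours(count)
        print(f"{count} sentences: about {hours:.0f} hours of review")
```

Under that assumption, the recommended minimum of 1000 sentences works out to roughly 57 hours of boundary review, and the full list of 3539 sentences to roughly 202 hours.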
I also am not a very big fan of how the program pre-processes the audio before it is recorded. I went through the trouble of cleaning my mic up with an fx chain I set up in Reaper with compression, noise gates, and other denoisers I like. This produces a sound which I feel is not bad considering my cheap setup. However, MNLP still seems to want to apply noise reduction, since there are warbly artifacts in its recorded speech which do not exist anywhere else. I even took off all the effects and left in my background noise, and I could tell the program was reducing it, and the warbly watery artifacts had become worse.
After compilation, I was elated to hear speech produced using a synthetic copy of my voice. However, I immediately noticed problems which could potentially add another long, laborious layer to this. The synthetic copy of my voice was having problems, even with the sentences I recorded. Syllable boundaries were slurred, and the ends of words trailed off or were slurred. Two sentences that I had recorded but that still had problems were:
"Not at this particular case, Tom, apologized Whittemore."
In the synthetic voice, the word apologized didn't have a proper D sound in it, and the W sounded strangled in the name Whittemore. I had tried to articulate them well during recording, but had noticed that when I was checking word boundaries, the D in apologized had been overlooked for some reason and the W in Whittemore had been largely cut from the word boundary as well. I overlooked it as I had been wanting to move on, but it seems to have become a problem. One more example:
"He was a head shorter than his companion, of almost delicate physique."
In this case, the synthetic voice almost completely mocks the strange prosody I had used while recording, which I neglected to take much notice of until after the fact, and I found this amusing. However, the K sound at the end of physique was for some reason cut. So when the synth says the word physique, it almost sounds like fizzy. During the recording process, I again noticed that the cut was made in the word boundary but did not want to draw too much attention to a little thing like that. In all of these cases, if I could've gone in and manually edited the recorded data boundaries, I think I would already have a better sounding voice than I do now, even though I've only finished mere dozens of what could become over a thousand recordings. Some other consonants were cut both from my recordings and from the final speech itself. Even in the Keynote Gold samples, you can hear this, as the H from hello was cut by the program. The synthesizer does try to sort of interpolate data across phoneme boundaries where needed, but it can only do so much, and it can't really create things which aren't there without a load of artifacts.
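To illustrate the kind of cut described above, here is a minimal Python sketch of one way clipped consonants could be spotted outside the program. Everything in it is hypothetical: it assumes you already have the recording as a list of samples between -1.0 and 1.0 and the word boundaries as sample indices, and it has nothing to do with how MNLP stores or detects boundaries. The idea is simply to look at audio that belongs to no word and flag it when it is louder than silence, as a trailing K or a leading W would be.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_unassigned_speech(samples, boundaries, threshold=0.05):
    """Return (start, end) regions outside every word that are louder than threshold.

    boundaries is a list of (word, start_index, end_index), end exclusive.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    gaps = []
    cursor = 0
    # Walk the words in time order and collect the stretches between them.
    for _word, start, end in sorted(boundaries, key=lambda b: b[1]):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Anything after the last word is also unassigned.
    if cursor < len(samples):
        gaps.append((cursor, len(samples)))
    return [(a, b) for a, b in gaps if rms(samples[a:b]) > threshold]


if __name__ == "__main__":
    # Synthetic data: two "words", with a short burst left over after the second.
    audio = [0.0] * 100 + [0.5] * 100 + [0.0] * 50 + [0.5] * 100 + [0.3] * 20 + [0.0] * 30
    words = [("delicate", 100, 200), ("physique", 250, 350)]
    print(find_unassigned_speech(audio, words))
```

With the synthetic data above, only the region after the second word, samples 350 to 400, is reported, since that is where the leftover burst sits. This is a crude check: a very short consonant in a long stretch of silence would be diluted by the averaging and could slip under the threshold.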
Before you ask, I have deleted my early remnants of a TTS voice as I want to start over when I know a little more about the advanced features of the program. Thus, I can't show you what a voice in this early stage would sound like.
Speaking of advanced features of the program, I thought I would try my hand at using them. The developer recommends that for optimum results when creating a voice, you should perform its speech recognition training so the program can determine your way of speaking and how you articulate phonemes. Unfortunately, satisfying the program that you have a good word list is tricky, partly because of its inaccessible GUI. It also asks for fairly large text files of passages and words, which I suspect should be specially designed to capture all phoneme combinations for the English lexicon. I've not delved into that, but I am getting the impression that if you are going that deep into the process, you should already have a list of appropriate phrases to record. I started off by giving it the default sentences that come with it, and letting it pick appropriate phrases from that. According to the program, you can even create your own lexicons if you are so inclined! So yeah, this program is definitely meant for a tweak head!
Despite my primitive understanding of linguistics and my being ill prepared to tackle a challenge like this, I decided to attempt the speech recognition training anyway and make the best voice I possibly could with this thing. I recorded a few phrases which it recommended. However, to feed the training with potentially useful information, you have to verify phoneme boundaries and detection, and this part of the process induces a massive headache. For a start, once you've recorded your voice, you then have to open your recordings one by one in its built-in audio editor, which is completely inaccessible. The editor is full of unlabeled graphics, and while clicking randomly throughout the interface does bring up dialogs which you can navigate through a little more easily with the mouse cursor, you still have no clue what you're doing in its main window. Without labels, it's impossible to verify, split, and edit phonetic representations to your heart's content. The program also mentions alternatively importing wav files with phoneme markers, but I've yet to find out how that works. If there's some laborious way of specifying information in a text file or similar instead of fighting with the inaccessible interface, I'm willing to undertake that process, but for now I have given up on the training. I will maybe look up some other options later for how I can verify phoneme detection, as that is obviously pretty important to producing a voice that can speak fluidly and with as little awkwardness as possible.
There is also an advanced tts editor that goes beyond the basic one I described above. This allows you to use your speech recognition training data to record a voice, and then later to edit its phonetic splits, prosody information, and pronunciation in much finer detail than you could with the simple creator. Because I am unable to create a training file, I've not yet played with this, but if I could get the training stage down, I would attempt to create a voice using the advanced method. Unfortunately, my doubts that this is practical with a screen reader are quickly mounting.
What I would probably do is attempt to make recordings with another audio editor and import them into the program, as this apparently is supported. I could then use an editor of choice to edit any data the program needs, and I would also get rid of its noise reduction and other pre-processing it likes to do. As an audiophile and perfectionist, the differences the pre-processing makes really bug me. But at this point I think I will have to stick to basics. The best I can do is record up to 3539 sentences, which is what the program comes with, and hope for the best. I would be relying on the program's default algorithm setup to do all the work, without being able to provide my own corrections to the occasional error it makes. This would be fine for someone who wants the process to be as painless as possible, but for someone like me who would be willing to do many hours of tedious but beneficial labor to improve the voice, the fact that I cannot do so because of an annoying thing such as blindness is a bit discouraging. Not unexpected, just discouraging.
I'm not trying to knock this program or anything, just warning those who want to get really geeky with it that you will have hurdles to cross. The demo voices which come with the program are quite impressive I think, and I strive to match them in any voices I try to create. I am still looking into ways to access advanced features of the program for those interested, though my hopes for a full solution are not exceedingly high.
I hope this rant of sorts has brought a little helpful information to the table. If you seek to create your own voice, no matter how geeky or casually you want to go about it, I wish you the best of luck. If you need advice or have a question, let me know and I will try to answer it.

_______________________________________________
Audiogames-reflector mailing list
Audiogames-reflector@sabahattin-gucukoglu.com
https://sabahattin-gucukoglu.com/cgi-bin/mailman/listinfo/audiogames-reflector
