Hi All,

I'm doing a hobby project to try building a voice banking system for people who are losing their voices due to diseases like ALS or age-related voice loss. Existing voice banks require extensive voice samples from a person, and sadly, many people are not able to record enough samples to build a voice bank the traditional way. This is especially common in the elderly, who lose their voices gradually, then suddenly. Traditional voice banking requires a month of reading text, in the hope of capturing every phonetic transition along with every intonation of every phone. So, while there are only 44 English phones, voice banking needs thousands of variations of each phone and of the transitions between phones.
My hope is to be able to synthesize a "perfect" version of a person's voice with very few parameters for text-to-speech systems: few enough that I might be able to figure out the parameters from a short audio clip a person may have stored in the past. For example, I might take a base phone, like "aaaa", and modify it to match a person's voice by altering very few parameters.

So, I did some research into sound and synthesis, and I discovered codec2 by accident. It seems to meet the need of being small enough to understand, and perhaps to generate frames from. I'm a C# programmer, so I'm using a .NET port done by Mikhail Nasyrov from the 2010 codec2 codebase, and I built a "sound munger" to try to understand the synthesis process better. The munger takes pre-recorded frames of me saying "aaaaaa" and randomly distorts them in some way.

Here is what I've seen so far:

LSPs: One distortion that worked surprisingly well was bit-shifting the first 8 bits of the 36-bit LSPs field. This made the voice sound higher without introducing distortion. Single-bit changes to the other LSPs rarely did anything other than insert noise.

Energy: This field didn't seem to contribute much, if anything, to the sound quality. I could set any or all of its bits to 0 or 1 and it seemed to have no effect.

Wo: This was a different matter. I could set all of the Wo bits to 1 and the original voice would be preserved, but changing any 1 to 0 lowered the volume significantly. With too many 1-to-0 bit flips, the sound became inaudible. In effect it behaved like "volume". I had expected this field to be some sort of frequency, but it did not work out that way.

Voicing: Removing the voicing bits had a significant effect, but not one that I understood. The voice sounded somewhat like a whisper, but not quite.

I've learned enough from this munger to possibly create "autotune", which is a step in the right direction and may help some people who have distorted voices, but the grail of voice synthesis from few parameters escapes me.
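One way to make the munging less random is to unpack each frame into named fields, change exactly one field, and repack. Below is a minimal Python sketch of that idea (Python rather than C# only to keep it short; it is not the .NET port's API). Only the 36-bit LSPs field and the 51-bit total come from what I described above. The other widths, the field order, and the MSB-first bit packing are assumptions that would need to be checked against the codec2 source, and the names `FRAME_LAYOUT`, `unpack_frame`, `pack_frame`, and `munge` are made up for illustration.

```python
# Hypothetical frame layout: (field name, width in bits), in ASSUMED bit order.
# Known from the post: LSPs = 36 bits, whole frame = 51 bits.
# Assumed: the remaining widths, the ordering, and MSB-first packing.
FRAME_LAYOUT = [
    ("wo", 7),
    ("lsps", 36),
    ("energy", 5),
    ("voicing", 2),
    ("spare", 1),
]
FRAME_BITS = sum(width for _, width in FRAME_LAYOUT)


def unpack_frame(frame_bytes, layout=FRAME_LAYOUT):
    """Split a packed frame (assumed MSB first) into a dict of integers."""
    total = sum(width for _, width in layout)
    nbytes = (total + 7) // 8
    if len(frame_bytes) < nbytes:
        raise ValueError("frame too short: need %d bytes" % nbytes)
    value = int.from_bytes(bytes(frame_bytes[:nbytes]), "big")
    value >>= nbytes * 8 - total  # discard padding bits after the last field
    fields = {}
    remaining = total
    for name, width in layout:
        remaining -= width
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    return fields


def pack_frame(fields, layout=FRAME_LAYOUT):
    """Inverse of unpack_frame; padding bits are written as zeros."""
    total = sum(width for _, width in layout)
    nbytes = (total + 7) // 8
    value = 0
    for name, width in layout:
        field = fields[name]
        if not 0 <= field < (1 << width):
            raise ValueError("%s does not fit in %d bits" % (name, width))
        value = (value << width) | field
    value <<= nbytes * 8 - total
    return value.to_bytes(nbytes, "big")


def munge(frame_bytes, name, new_value, layout=FRAME_LAYOUT):
    """Return a copy of the frame with one field replaced, others untouched."""
    fields = unpack_frame(frame_bytes, layout)
    fields[name] = new_value
    return pack_frame(fields, layout)
```

Changing one field at a time while holding the others constant should make it easier to attribute an audible change to a specific field. If a field behaves nothing like its name suggests (for example, a pitch field acting like volume), the assumed order or bit packing in `FRAME_LAYOUT` is the first thing I would suspect.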
I don't understand the fields of the codec frame well enough. I see the docs, and know it's 51 bits. I can translate the frames to a C# structure and bit-bang the bytes, but I can't find a predictable pattern.

Questions:

1. How do I better understand what the LSPs, energy, Wo, etc. fields are doing?
2. Does anyone out there have any thoughts on how I can achieve the goal of few-parameter voice construction?

Thanks,
Imran

_______________________________________________
Freetel-codec2 mailing list
https://lists.sourceforge.net/lists/listinfo/freetel-codec2
