Hi All,

I'm doing a hobby project to try building a voice banking system for
people who are losing their voices due to diseases like ALS or
age-related voice loss.  Existing voice banks require extensive voice
samples from a person, and sadly, many people do not have the ability to
record enough samples to build a voice bank the traditional way.  This
is especially common in the elderly, who lose their voices gradually,
then suddenly.  Traditional voice banking requires a month of reading
text with the hope of finding every phonetic transition along with every
intonation of every phone.  So, while there are only about 44 English
phonemes, there are thousands of variations of each phone, and of the
transitions between phones, needed for voice banking.

My hope is to be able to synthesize a "perfect" version of a person's
voice with very few parameters for text to speech systems.  Enough that
I might be able to figure out the parameters from a short audio clip a
person may have stored in the past.  For example, I might take a base
phone, like "aaaa", and modify it to match a person's voice by altering
very few parameters.

So, I did some research into sound and synthesis -- and I discovered
codec2 by accident.  It seems to meet the needs of being small enough to
understand, and perhaps generate frames from.  I'm a C# programmer, so
I'm using a .NET port done by Mikhail Nasyrov from the 2010 codec2
codebase, and built a "sound munger" to try and understand the synthesis
process better.

The munger takes pre-recorded frames of me saying "aaaaaa", and randomly
distorts them in some way.  For example, one distortion that worked
surprisingly well was bit-shifting the first 8 bits of the 36-bit LSPs
field.  This made the voice sound higher, without introducing
distortion.  Single-bit changes to the other LSP bits rarely did
anything other than insert noise.
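
The kind of munging described above can be sketched roughly as follows.
This is only an illustration in Python, not the code of the .NET port:
the frame is held as a 51-bit integer with bit 0 as the most significant
bit, and the helper names (get_field, set_field, flip_bit, shift_field)
are hypothetical.  Where the LSPs field starts inside the frame is not
assumed here; the caller passes the offset.  The shift direction is also
left to the caller, since "bit-shifting" could mean either.

```python
FRAME_BITS = 51  # frame size as stated in the codec2 docs


def _check(offset: int, width: int) -> None:
    if offset < 0 or width <= 0 or offset + width > FRAME_BITS:
        raise ValueError("field does not fit inside the frame")


def get_field(frame: int, offset: int, width: int) -> int:
    """Read `width` bits starting `offset` bits from the MSB of the frame."""
    _check(offset, width)
    shift = FRAME_BITS - offset - width
    return (frame >> shift) & ((1 << width) - 1)


def set_field(frame: int, offset: int, width: int, value: int) -> int:
    """Return a copy of the frame with the field replaced by `value`."""
    _check(offset, width)
    shift = FRAME_BITS - offset - width
    mask = ((1 << width) - 1) << shift
    return (frame & ~mask) | ((value << shift) & mask)


def flip_bit(frame: int, position: int) -> int:
    """Flip one bit; position 0 is the most significant bit of the frame."""
    if not 0 <= position < FRAME_BITS:
        raise ValueError("bit position outside the frame")
    return frame ^ (1 << (FRAME_BITS - 1 - position))


def shift_field(frame: int, offset: int, width: int, amount: int) -> int:
    """Shift a field inside its own width: amount > 0 left, amount < 0 right.

    Bits shifted out of the field are dropped; bits outside it are untouched.
    """
    value = get_field(frame, offset, width)
    if amount >= 0:
        shifted = (value << amount) & ((1 << width) - 1)
    else:
        shifted = value >> -amount
    return set_field(frame, offset, width, shifted)
```

With a known LSPs offset, the experiment above would be something like
shift_field(frame, lsp_offset, 8, 1), and a single-bit distortion would
be flip_bit(frame, n).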

The munger showed that energy didn't seem to contribute much, if
anything, to the sound quality.  I could set any or all of the bits in
that field to 0 or 1, and it seemed to have no effect.  Wo was a
different matter.  I could set all bits of Wo to 1 and the original
voice would be preserved, but changing any 1 to 0 lowered the volume
significantly.  Too many 1-->0 bit flips, and the sound would be
inaudible.  In effect, it behaved like "volume".  I had expected this to
be some sort of frequency, but it did not work out that way.  Clearing
the voicing bits had a significant effect, but not one that I
understood.  The voice sounded somewhat like a whisper, but not quite.
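
For comparison, here is a minimal sketch of the "Wo as frequency"
expectation.  It assumes that the Wo index is 7 bits wide and is mapped
linearly onto a fundamental frequency in radians per sample, between
pitch periods of 160 and 20 samples at an 8 kHz sample rate.  Those
constants and the function name wo_index_to_hz are assumptions for
illustration, and they may not match the 2010 codebase or the .NET port,
so they should be checked against the quantiser source.

```python
import math

# All four constants are assumptions, not values taken from the port.
SAMPLE_RATE_HZ = 8000
P_MIN = 20    # assumed shortest pitch period, in samples
P_MAX = 160   # assumed longest pitch period, in samples
WO_BITS = 7   # assumed width of the Wo field


def wo_index_to_hz(index: int) -> float:
    """Map a quantiser index to a pitch in Hz, assuming a linear quantiser."""
    levels = 1 << WO_BITS
    if not 0 <= index < levels:
        raise ValueError("Wo index out of range")
    wo_min = 2 * math.pi / P_MAX          # radians per sample
    wo_max = 2 * math.pi / P_MIN
    step = (wo_max - wo_min) / levels
    wo = wo_min + step * index
    return wo * SAMPLE_RATE_HZ / (2 * math.pi)
```

Under these assumptions, index 0 would be a 50 Hz pitch and the all-ones
index would sit just under 400 Hz, so clearing bits in the field would
lower the pitch rather than the volume.  If a decoder behaves differently,
the first thing to check is whether the bits being edited really are the
Wo field.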

I've learned enough from this munger to possibly create "autotune" --
which is a step in the right direction and may help some people who have
distorted voices -- but the grail of voice synthesis from few parameters
escapes me.

I don't understand the fields of the codec frame well enough.  I see the
docs, and know it's 51 bits.  I can translate the frames to a C#
structure and bit-bang the bytes, but I can't find a predictable
pattern.
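
One way to make the bit-banging more predictable is to unpack each frame
into named fields before touching it.  The sketch below is in Python and
is only an illustration.  Only the 51-bit total and the 36-bit LSPs
width come from the description above.  The field order, the other
widths (7 for Wo, 5 for energy, 2 for voicing), the 1-bit "other"
placeholder, and the MSB-first packing into 7 bytes are all assumptions
that need to be checked against the pack/unpack code of the port.

```python
from typing import Dict, List, Tuple

FRAME_BITS = 51  # from the codec2 docs

# ASSUMED layout, first field at the most significant end of the frame.
# Only the 36-bit LSPs width is taken from the text; the rest is a guess.
LAYOUT: List[Tuple[str, int]] = [
    ("wo", 7),
    ("lsps", 36),
    ("energy", 5),
    ("voicing", 2),
    ("other", 1),  # placeholder for the remaining bit
]
assert sum(width for _, width in LAYOUT) == FRAME_BITS

FRAME_BYTES = (FRAME_BITS + 7) // 8  # 7 bytes, 5 padding bits at the end


def unpack_frame(data: bytes) -> Dict[str, int]:
    """Split an MSB-first packed frame into a dict of integer fields."""
    if len(data) != FRAME_BYTES:
        raise ValueError("expected a 7-byte frame")
    as_int = int.from_bytes(data, "big") >> (FRAME_BYTES * 8 - FRAME_BITS)
    fields: Dict[str, int] = {}
    remaining = FRAME_BITS
    for name, width in LAYOUT:
        remaining -= width
        fields[name] = (as_int >> remaining) & ((1 << width) - 1)
    return fields


def pack_frame(fields: Dict[str, int]) -> bytes:
    """Inverse of unpack_frame; padding bits are written as zero."""
    as_int = 0
    for name, width in LAYOUT:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        as_int = (as_int << width) | value
    as_int <<= FRAME_BYTES * 8 - FRAME_BITS
    return as_int.to_bytes(FRAME_BYTES, "big")
```

With something like this, an experiment becomes "unpack, change one
named field, pack, decode", which makes it easier to tell whether a
change in the sound comes from the field that was meant to be edited.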

Questions:
1. How do I better understand what the LSPs, energy, Wo, etc. fields are
doing?
2. Does anyone out there have any thoughts on how I can achieve the goal
of few-parameter voice construction?

Thanks,
Imran


_______________________________________________
Freetel-codec2 mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/freetel-codec2