I did research for a year on how to do this. I came to write externals for PD 
because of that project, but I never quite got to the point where I could do 
it. It's on my long to-do list, which means it probably never will be finished. 
Here are some ideas:

1. Calculate a Chebyshev polynomial from a Linear Predictive Coding (LPC) 
filter response. Track the peaks of the response (the formant peaks) and 
(maybe) find approximate matches in a database of material. In a training 
mode, a model of formant patterns can be built on the fly, giving you a 
database of formant-peak line sections against which subsequent analyses can 
be checked. For example, a training session per speaker can build a model of 
that speaker's formant patterns, and the live input can then be compared to 
each model.

I was trying to port the formant modelling tools from the Speech Filing System 
from UCL: http://www.phon.ucl.ac.uk/resource/sfs/ to PD in 2005-06, but didn't 
get much support from my superiors who were running this project. I never got 
it to work, but I'd only just begun proper C programming then. I'm sure I 
wasn't far off... I'd love to try again if I get time in my schedule (I now 
have 2 kids and 5 jobs). The advantage of this method is that, with careful 
measurement of the residual spectrum, it is possible to re-create the sound of 
a voice from a good formant/residual model. Thus, we can make a person's voice 
"speak" the words we want them to, or get a hundred people to sing in tune! 
It is a reversible algorithm, so the original sound can be re-created from the 
analysis.
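
To make the LPC idea a bit more concrete, here is a minimal sketch in Python 
(not the SFS code, and not a PD external). It assumes the plain autocorrelation 
method with the Levinson-Durbin recursion, picks formant candidates as local 
maxima of the LPC envelope, and shows the residual round trip that makes the 
method reversible. All function names are my own, chosen for illustration.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err): a[0] == 1 and A(z) = sum(a[k] * z**-k) is the
    inverse (whitening) filter; err is the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def envelope_db(a, err, freq_hz, sample_rate):
    """LPC spectral envelope err / |A(e^jw)|^2 at one frequency, in dB."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(a[k] * math.cos(k * w) for k in range(len(a)))
    im = -sum(a[k] * math.sin(k * w) for k in range(len(a)))
    return 10.0 * math.log10(err / (re * re + im * im))

def formant_peaks(a, err, sample_rate, step_hz=10.0):
    """Frequencies (Hz) where the envelope has a local maximum."""
    n = int((sample_rate / 2.0) / step_hz)
    env = [envelope_db(a, err, i * step_hz, sample_rate)
           for i in range(n + 1)]
    return [i * step_hz for i in range(1, n)
            if env[i] > env[i - 1] and env[i] > env[i + 1]]

def residual(frame, a):
    """Inverse-filter the frame with A(z); earlier samples count as 0."""
    return [sum(a[k] * frame[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(frame))]

def resynthesise(res, a):
    """Run the residual back through 1/A(z) to rebuild the frame."""
    out = []
    for n, e in enumerate(res):
        out.append(e - sum(a[k] * out[n - k]
                           for k in range(1, len(a)) if n - k >= 0))
    return out
```

For real speech you would window each frame and use a higher order (something 
like 10 to 16 at low sample rates is a common starting point), then store the 
peak tracks per speaker. Swapping in another speaker's residual, or a 
pitch-corrected one, before resynthesise() is where the "make them say 
something else" trick would come in.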

2. The Mel-Frequency Cepstral Coefficients (MFCCs) computed from the FFT (Fast 
Fourier Transform) of a waveform are a good timbral identifier. William Brent's 
timbreID objects are good instantaneous timbre identifiers using this 
principle, but to build up a sophisticated model of a human voice (robust 
enough for speaker ID) you need to work out how to build a database. For an 
instantaneous MFCC identifier using an internal database, check out Michael 
Casey's "soundspotter" PD external. This is even more efficient, since each 
frame of MFCC analysis is simplified to a string of 40 ASCII characters. This 
means that standard MySQL search techniques can be used to search the 
database, which is a lot faster than comparing the frames number by number. 
The MFCC algorithm is non-reversible, meaning that the original waveform 
cannot be reconstructed from the analysis data.
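
As a rough illustration of the MFCC route, here is a small Python sketch. It is 
not how timbreID or soundspotter are implemented; it assumes the common 
textbook recipe (power spectrum, triangular mel filters, log, DCT-II, dropping 
the zeroth coefficient) and a naive nearest-neighbour lookup in a tiny 
in-memory database. The function names and the database layout are 
hypothetical, and the slow DFT is only there to keep the example 
self-contained.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum, bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        spec.append(abs(s) ** 2)
    return spec

def mel_energies(power, sample_rate, n_filters):
    """Energy under each triangular filter, spaced evenly in mel."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bin_hz = nyquist / (len(power) - 1)
    energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        e = 0.0
        for k, p in enumerate(power):
            f = k * bin_hz
            if lo < f <= mid:
                e += p * (f - lo) / (mid - lo)
            elif mid < f < hi:
                e += p * (hi - f) / (hi - mid)
        energies.append(e)
    return energies

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    """Coefficients 1..n_coeffs of the DCT-II of the log mel energies."""
    energies = mel_energies(power_spectrum(frame), sample_rate, n_filters)
    logs = [math.log(e + 1e-10) for e in energies]   # floor avoids log(0)
    return [sum(logs[m] * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for c in range(1, n_coeffs + 1)]

def nearest(db, vec):
    """Label in db (label -> MFCC vector) closest to vec (Euclidean)."""
    def dist(label):
        return sum((x - y) ** 2 for x, y in zip(db[label], vec))
    return min(db, key=dist)
```

A single frame per label is obviously far too little for speaker ID; the point 
is only that each analysis frame becomes a short vector you can store and 
search. Because the zeroth coefficient is dropped, a louder or quieter copy of 
the same frame lands on (almost) the same vector.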


The biggest problem with all of this is that speech is identified not just by 
its instantaneous timbre, but also by the way the timbre and pitch change over 
time. So speech recognition technology uses a Markov model to map the 
likelihood of one timbre changing to another. For example, the likelihood of a 
"k" sound being followed by an "r" is quite high, since there are many words 
like "cracker" and "croak" that have this morphology. Whereas "k" followed by 
"s" at the start of a word is much rarer in English, so its likelihood is much 
lower.
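
Here is a toy version of that idea in Python, just to show the bookkeeping. It 
is a plain first-order Markov chain over symbols, which is simpler than the 
hidden Markov models real recognisers use, and the "phoneme" spellings in the 
usage note are made up for the example rather than taken from any real 
transcription.

```python
from collections import defaultdict

def train_transitions(sequences):
    """Estimate P(next | previous) by counting adjacent symbol pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        probs[prev] = {nxt: c / total for nxt, c in followers.items()}
    return probs

def sequence_likelihood(probs, seq, floor=1e-6):
    """Product of transition probabilities; unseen pairs get a small floor."""
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        p *= probs.get(prev, {}).get(nxt, floor)
    return p
```

Trained on something like [["k","r","a","k","er"], ["k","r","o","k"], 
["k","a","t"], ["s","k","i"]], the chain gives "k" then "r" a far higher 
likelihood than "k" then "s", which never occurs in that data. In a speaker or 
speech recogniser the symbols would be timbre classes (e.g. quantised MFCC or 
formant frames) rather than letters.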

I...well there it is,
Ed

>> The task would be to identify from a live-talk the voice of the current
>> speaker amongst several. Training before is also possible .. i guess this
>> could be done for sure by utilizing a simple neural network trained on a
>> FFT decomposition of the voices..  so there must be some software out for
>> sure...
> 
> Something tells me an FFT + neural network would be really bad at this.
> Seriously, that sounds like a doomed project if you tried.  These
> things would be huge:
> 1.  fft size (for resolution)
> 2.  network size (based on the fft size)
> 3.  training set (lots of variance in the speaker is possible)
> 
> How about autocovariance and dot-product?
> 
> Ahead of time, create an array containing normalized autocovariance
> (an autocorrelation) of the speaker's voice.
> 
> Compute a running autocovariance of the sound.  Decompose it into the
> portion of the sound matching the autocovariance of the speaker and
> compare it with the part not matching the speaker (via dot-product, or
> projection operators)
> 
> That would be ~less~ expensive and time consuming than neural
> networks, but I'd give it not much chance of success either.  Probably
> it would match quite a few different people all the same.
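
For reference, the autocovariance and dot-product idea quoted above could be 
sketched roughly like this in Python. This is one reading of the suggestion, 
not tested on real voices: it assumes the stored template and the live frame 
are both reduced to unit-length autocovariance vectors, so their dot product 
is a cosine similarity and the "non-matching part" is whatever is left after 
projecting onto the template. The function names are invented for the example.

```python
import math

def normalised_autocovariance(frame, max_lag):
    """Mean-removed autocovariance for lags 0..max_lag, scaled to unit length."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    acov = [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]
    norm = math.sqrt(sum(v * v for v in acov))
    return [v / norm for v in acov] if norm > 0.0 else acov

def match_score(template, frame):
    """Dot product of the template with the frame's autocovariance (-1..1)."""
    feat = normalised_autocovariance(frame, len(template) - 1)
    return sum(t * f for t, f in zip(template, feat))

def unmatched_part(score):
    """Length of the component orthogonal to the template (both unit vectors)."""
    return math.sqrt(max(0.0, 1.0 - score * score))
```

You would build the template ahead of time from a recording of the speaker, 
then compare match_score() against unmatched_part() for each incoming frame. 
As the quoted mail says, this mostly captures pitch and broad spectral shape, 
so different people with similar voices would likely score about the same.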


I think that getting some kind of basic recognition of who is speaking would 
not be super difficult, if you have a clean recording of the voices. You need 
to get the formants of the voice, then use them as the base comparison.  You 
could start with something like William Brent's timbreID library to isolate the 
different vowel sounds, then get the formants for each of the vowels, then use 
that data for the pattern matching.  It'll definitely take some research and a 
solid chunk of work to get it going.

.hc

----------------------------------------------------------------------------

Access to computers should be unlimited and total.  - the hacker ethic



_______________________________________________
[email protected] mailing list
UNSUBSCRIBE and account-management -> 
http://lists.puredata.info/listinfo/pd-list