If you are really interested in Hidden Markov Models, I recommend some of Marie Roch's courses at SDSU when she comes back from sabbatical. She's quite good. Her work on interpreting dolphin sounds is fascinating.

Tracy R Reed wrote:

Now, we have to analyze that data. An FFT can be done in O(n log n). So we need 10^9 * log(10^9) operations per second, or roughly 10^10 FLOPS, continuously producing frequency bins.

I read that the latest FPU in a 3 GHz x86 machine can do 24 GFLOPS. So nearly half a machine would be required, right?

Somewhere in that range. Remember, we are just getting orders of magnitude. For that purpose I would consider anything from 0.5 to 50 GFLOPS to be 10^9 FLOPS.
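For what it's worth, the back-of-envelope arithmetic fits in a few lines of Python. The 10^9-samples-per-second and 24 GFLOPS figures are just the numbers quoted above, and whether you count log base 2 or base 10 you stay in the "roughly one machine, give or take" range:

import math

# Order-of-magnitude estimate only, reusing the numbers from this thread.
n = 1e9        # samples per second to push through the FFT
peak = 24e9    # quoted peak FLOPS of one fast x86 FPU

lo = n * math.log10(n)   # ~9e9 ops/s, the "roughly 10^10" figure above
hi = n * math.log2(n)    # ~3e10 ops/s, counting radix-2 FFT stages

print(f"FFT work: {lo:.0e} to {hi:.0e} ops/s")
print(f"Machines needed: {lo/peak:.1f} to {hi/peak:.1f}")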

Now, we have to analyze the DFTs and convert them to something useful.

What is a DFT? Discrete Fourier transform?

Yes.
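For anyone following along: a DFT maps N time-domain samples onto N frequency bins, and the FFT is just the O(n log n) way of computing it. A throwaway numpy sketch, with a made-up 440 Hz tone standing in for real speech:

import numpy as np

fs = 8000                           # telephone-ish sample rate (example value)
t = np.arange(fs) / fs              # one second of samples
x = np.sin(2 * np.pi * 440 * t)     # 440 Hz tone, a stand-in for real audio

bins = np.fft.rfft(x)               # the frequency bins mentioned above
freqs = np.fft.rfftfreq(len(x), d=1/fs)
print(freqs[np.argmax(np.abs(bins))])   # ~440.0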

But the problem you describe may not really be the problem they need to solve. They don't need to do general-case speech recognition. If someone buys the adword "hamburger" for my geographic area, then all they have to do is search for when I say "hamburger". It seems like most big companies I call now have basic keyword-recognition systems built into their PBXes, which presumably run on relatively modest hardware. So the problem may not be nearly as large as you describe.

It depends. If all I need to discriminate between is "one", "two", "three", ... "nine", "yes", "no", and "operator" when I am specifically accessing a voicemail/directory system, then that requires quite a bit less power.

Of course, if I say "hamburger", that's pretty close to "operator" and I'm likely to get that choice. Or, if I say "fine", I'm likely to enter "nine" rather than "yes".

Discriminating between 20 specific words in a specific context is doable.

Discriminating between thousands of words in casual conversation is not.
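To make that concrete, here is a toy sketch of why the small-vocabulary case is cheap, and why "fine" lands on "nine". The feature vectors are random stand-ins, not real speech features (a real system would use something like MFCCs fed into the HMMs mentioned at the top of the thread):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "zero", "yes", "no", "operator"]
# One made-up 39-dimensional "template" per word; a ~20-word menu means
# the recognizer only has to pick the nearest of ~20 templates.
templates = {w: rng.normal(size=39) for w in vocab}

def recognize(utterance):
    # Out-of-vocabulary input still gets mapped onto *something* in the menu.
    return min(vocab, key=lambda w: np.linalg.norm(utterance - templates[w]))

# A made-up "fine" that happens to sit close to "nine" in feature space:
fine = templates["nine"] + rng.normal(scale=0.1, size=39)
print(recognize(fine))   # -> "nine", the failure mode described above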

-a

