Hi

This is a long post but I think that context may be important. Anyone
who wants to cut to the chase should skip right to the bottom...

<context>
After just over a decade contributing to FOSS[1] in my copious spare
time, I took a break from commercial coding to study for my second
Master's[2]. I focused on Machine Learning, Logic and Concurrency
(scoring a shade over 88% for Semester One[3]) but at an increasing
cost to my health. I dropped out when my computer access time reached
zero towards the end of Semester Two. With physiotherapy and
discipline over the last couple of years, I now have enough typing
time to start thinking about working again. I promised myself that -
once this happened - I'd start looking at the open source aural
interface problem.

I spent that first summer reading academic machine learning papers
trying to work out why open source solutions don't exist, and
recording my voice. Like most developers I've talked to, I originally
assumed that data volume was the primary issue - and that given the
fantastic quantity of stuff out there on the internet now, it should
be possible to reverse engineer most of the stuff that's needed. But
no: my reading led me to believe that the key issue preventing this
promising approach is that the quality of basic nuts-and-bolts parts-
of-speech recognition isn't high enough to allow modern machine
learning techniques to be applied to this data, leaving the state of
the art aiming to perfect techniques known in the 70's (now known to
be flawed in theory).

I think the root of the problem lies in the almost-universal initial
use of the Fast Fourier Transform. Given that much greater human
expertise is needed to recognize parts of speech after transformation
than from the simple wave-form, my guess is that dropping the
transform and applying modern machine learning approaches directly to
the wave-form - the way humans recognize speech - is the way to go. A
secondary consideration is that the Fast Fourier Transform scales
poorly across processors, so any user interface that performs it will
be unresponsive.
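To make the contrast concrete, here is a minimal sketch (a
hypothetical synthetic signal, an assumed 16 kHz sample rate, numpy
only - none of this comes from a real recogniser) of the two front
ends: raw wave-form samples handed straight to a learner versus
conventional FFT magnitudes.

```python
import numpy as np

# Hypothetical illustration: the same 25 ms frame of a crude synthetic
# voiced sound, presented two ways -- raw samples (the representation
# argued for above) and FFT magnitudes (the conventional front end).
# All names and numbers here are assumptions chosen for the sketch.

RATE = 16000                             # assumed sample rate, Hz
t = np.arange(int(0.025 * RATE)) / RATE  # one 25 ms analysis frame

# Stand-in for a voiced sound: a 120 Hz fundamental plus two harmonics.
frame = (np.sin(2 * np.pi * 120 * t)
         + 0.5 * np.sin(2 * np.pi * 240 * t)
         + 0.25 * np.sin(2 * np.pi * 360 * t))

raw_features = frame                       # feed the learner the wave-form
fft_features = np.abs(np.fft.rfft(frame))  # conventional spectral features

print(len(raw_features), len(fft_features))  # 400 raw samples, 201 FFT bins
```

Either vector could be fed to a learner; the argument above is that
the second step adds cost and discards nothing a modern model could
not learn from the first.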

So, I'd like to start working on the open source aural interface
problem from the ground up, applying modern machine learning
techniques to the basics. But this means storing large quantities of
high quality speech data. For an open source project, finding a host
for this data is a fundamental step in establishing the provenance of
any future integrated solution. The FOSS projects known to me are not
good matches for my aims:

1. My interest focuses on aural interfaces. So promotion of Free
Software is less important than ensuring that the license adopted is
compatible with a wide range of downstream FOSS projects. The existing
projects I know about use the GPL and are focused on promoting Free
Software, rather than engineering.

2. I propose the use of alternative feature extraction and machine
learning techniques. Existing projects concern themselves with
compatibility with existing tools and aim for wide participation,
which means throwing away high frequencies audible to the human ear
through heavy down-sampling. Recognizing some parts of speech then
depends more heavily on context, which makes reverse-engineering of
good dictionaries and oral speech models much more difficult. Good
sound cards ship with many modern motherboards, and good speech
recognition microphones are available for around £100. My primary use
case is people dependent on a serious aural interface (rather than
occasional users), so preserving the original fidelity makes sense.

3. In humankind, the wetware used to create sounds varies along
several dimensions but shares a common engineering design. This
suggests that - using modern statistical learning algorithms - a
suitably parameterised vocal model could be efficiently tuned to a
particular voice. This is my preferred approach. Existing projects aim
to crowdsource an average voice. This has statistical disadvantages
for my preferred approach. Instead, the first step for me is to prove
a high quality statistical model of one voice.
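The down-sampling concern in point 2 can be made concrete with a tiny
sketch. The Nyquist limit says a sampled signal can only represent
frequencies below half the sample rate, so (for example) a 6 kHz
component - well within human hearing, and present in fricatives like
/s/ - survives at an assumed 16 kHz rate but cannot be represented at
all after decimation to a telephone-style 8 kHz. The rates and the
helper name here are illustrative assumptions, not taken from any
particular project.

```python
# Hypothetical illustration of the down-sampling concern: which pure
# tones can a given sample rate represent at all?

HIGH_RATE = 16000   # assumed "high fidelity" rate for this sketch
LOW_RATE = 8000     # assumed telephone-style rate used after decimation

def representable(freq_hz: float, sample_rate: int) -> bool:
    """A sinusoid is representable only below the Nyquist frequency
    (half the sample rate); above it, the tone aliases or vanishes."""
    return freq_hz < sample_rate / 2

print(representable(6000, HIGH_RATE))  # True
print(representable(6000, LOW_RATE))   # False
```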
</context>

In short: in order to take the first step towards open source aural
interfaces, I need to find a host for large quantities of my speech
(already recorded), at high fidelity (to preserve high frequencies)
under an MIT license. Would this be something that Google Code might
be interested in supporting?

Robert
[1] http://robertburrelldonkin.name (linked in profile) describes most
of my ASF stuff
(Member, ASF) http://apache.org/foundation/members.html
(email rdonkin-at-apache.org for confirmation)
http://people.apache.org/committer-index.html#rdonkin
http://www.ohloh.net/accounts/robertburrelldonkin lists most of my
other FOSS stuff
or just use the world's favourite engine ;-)
www.google.co.uk/search?q="robert+burrell+donkin"
[2] Advanced Computer Science @ Manchester
http://www.cs.manchester.ac.uk/postgraduate/taught/programmes/acs/
my previous degrees were in Mathematics:
http://www2.warwick.ac.uk/fac/sci/maths/
[3] scroll down http://robertburrelldonkin.name for more details.
