|
I am part of a group of students that is working on
a project designed to allow voice control of a set-top box. This is being
done as a senior design project for a computer
engineering degree. We are using Freevo as our set-top box
platform in this project. I'd like to address some ideas and suggestions I
have, but first let me give a little of the background behind the
project.
There are three major components to our
system. The primary component is our voice analysis server. We have
defined a lightweight protocol which encapsulates data into transactions.
These transactions contain all the necessary data to perform voice
recognition. Among the things contained in a transaction are grammars,
phrases, and the audio data to be recognized. Grammars include words that
need to be recognized for all specified occurrences. For example, when
playing a movie, a person would always want words such as "play", "pause", etc.
to be included. These are defined in a main grammar. However,
there are occasions when words need to be added dynamically as recognition
possibilities; these words are called phrases. Phrases would be useful
when displaying a menu of movie titles, for instance. The movie titles
could be loaded dynamically into the grammar, and then could be recognized by
the server. The server returns a control code indicating which command was
recognized. This is a basic overview of the server; other capabilities
exist but are not relevant to this discussion.
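To make the transaction idea a bit more concrete, here is a rough Python
sketch of what one might hold. The class name, field names, and the
GRAMMARS table are made up for illustration; this is not our actual wire
format, just the three pieces described above (a main grammar, dynamic
phrases, and the audio).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    """One recognition request (hypothetical layout, not the real protocol)."""
    grammar: str                                       # name of the main grammar to use
    phrases: List[str] = field(default_factory=list)   # words added dynamically
    audio: bytes = b""                                 # audio data to be recognized

    def vocabulary(self, grammars):
        """Return every word the server should listen for in this request."""
        return list(grammars[self.grammar]) + list(self.phrases)


# Hypothetical main grammar holding the always-available movie commands.
GRAMMARS = {"movie": ["play", "pause"]}

# Movie titles from the current menu are loaded as phrases.
tx = Transaction(grammar="movie", phrases=["American Pie"])
words = tx.vocabulary(GRAMMARS)
```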
The second major component of our system is the
set-top box. We are using Freevo running on a custom Linux distribution we
have created, which is intended to be minimal, as we boot directly into Freevo,
in an attempt to mimic a "commercial" set-top box as much as possible. If
you are reading this, you are probably familiar with Freevo and its code, so I
won't discuss it : ).
The third component of our system is a voice remote
control. For this we are using a Palm Tungsten T, which has a built-in
microphone and integrated Bluetooth. We simply transfer any audio commands
pulled off the microphone over Bluetooth to the set-top box. The set-top
box then relays this to the voice analysis server, and waits for the control
code to come back. You can also use a microphone hooked up to the sound card,
but this isn't as cool : )
Using a server to provide voice recognition
capabilities is done for the purpose of decoupling the two systems, as set-tops
are generally embedded devices with just enough processing power to get
by. The set-top and the remote are tied together and the user interacts
directly with these two items. Ideally the voice server is entirely
transparent.
Now, with the background out of the way, I'd like
to discuss Freevo-related stuff...
First of all, one code feature I'd like to see in
Freevo is the ability to directly select any item that is shown on the current
menu. Obviously, Freevo is currently built with the assumption of a
"normal" remote control, where navigation is done with a series of up, down,
etc. commands. This makes sense for a normal remote control, but not much
sense for something like a voice remote control.
Right now, our plan for selecting a
specific item is to simply create a mapping from the currently selected item to
the one the user requested and issue a series of "fake-IR" commands to get
to that item and select it. These commands would be the same thing LIRC
sends if the user had pressed that sequence on the remote. This is kind of
an ugly hack. In the ideal scenario, if the user said "Watch American Pie"
it should simply start playing American Pie. As far as I can tell, it is
simply not possible with the current architecture to select an arbitrary menu
position (but I may be overlooking something).
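For what it's worth, the hack boils down to something like the sketch
below. The function name and the "UP"/"DOWN"/"SELECT" strings are
placeholders I made up for illustration, not the actual LIRC or Freevo
event names, and the menu contents are just an example.

```python
def fake_ir_sequence(current_index, target_index):
    """Key presses needed to move from the highlighted item to the target,
    followed by a final press that selects it."""
    distance = target_index - current_index
    key = "DOWN" if distance > 0 else "UP"
    # abs(distance) navigation presses, then one select press.
    return [key] * abs(distance) + ["SELECT"]


# Example menu; the highlight starts on the first entry.
menu = ["Alien", "American Pie", "Brazil"]
keys = fake_ir_sequence(0, menu.index("Brazil"))
```

The number of fake key presses grows with the distance to the item, which
is part of why this feels like a hack.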
What I would like to know is if there is enough
interest in designing a more advanced command system to allow for future control
options such as voice. Obviously, simpler control systems, such as your
average remote, could be implemented using the more advanced architecture.
It just seems a better idea to design for greater capabilities, rather than more
restrictive ones. I am currently brainstorming the requirements for such
an architecture as I wrap up the other components, and would like input from
Freevo's developers, if you see this as a direction you want to
take.
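As a very rough illustration of what I mean (all names here are
hypothetical and none of this is existing Freevo code), a menu could accept
both relative commands, which is all a normal remote needs, and a direct
selection command for something like voice:

```python
from dataclasses import dataclass


@dataclass
class Move:
    offset: int   # relative step: +1 for "down", -1 for "up"


@dataclass
class SelectItem:
    name: str     # direct selection by item name


class Menu:
    def __init__(self, items):
        self.items = list(items)
        self.selected = 0

    def handle(self, command):
        """Apply a command and return the item that ends up highlighted."""
        if isinstance(command, Move):
            # Clamp so the highlight never leaves the list.
            last = len(self.items) - 1
            self.selected = max(0, min(last, self.selected + command.offset))
        elif isinstance(command, SelectItem):
            # list.index raises ValueError if the name is not in the menu.
            self.selected = self.items.index(command.name)
        return self.items[self.selected]


menu = Menu(["Beck", "Pixies", "Radiohead"])
after_down = menu.handle(Move(+1))                   # what a remote would send
after_voice = menu.handle(SelectItem("Radiohead"))   # what voice would send
```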
Other situations could be interesting as
well. For instance, if I am in the TV Guide menu and I say "The Simpsons"
and there are multiple showings on different channels, a filtered list could be
displayed with all airings of The Simpsons. Also, voice commands should
not necessarily be restricted to just what is on the screen. For instance,
in my music directory I have hundreds of artist folders. When I first
enter it, it starts displaying at artists that begin with "A". In a
perfect world, I could say "Radiohead" and then the menu would go to that
folder, even though Radiohead was currently not displayed on the screen.
These are just usage scenarios; I am sure other people have wishes that they
could voice too.
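The filtered-list case could be as simple as the sketch below. The guide
entries, the dictionary keys, and the function name are invented for the
example; I am not assuming anything about how Freevo stores its TV guide
data.

```python
def matching_items(items, spoken):
    """Return every item whose title matches the spoken phrase, ignoring case."""
    wanted = spoken.strip().lower()
    return [item for item in items if item["title"].lower() == wanted]


# Made-up guide data: the same show airs on two channels.
guide = [
    {"title": "The Simpsons", "channel": 5},
    {"title": "News", "channel": 7},
    {"title": "The Simpsons", "channel": 11},
]
airings = matching_items(guide, "the simpsons")
channels = [item["channel"] for item in airings]
```

With one match the item could be selected right away; with several, the
matches become the filtered menu.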
We'd like to contribute as much of these components
back to the open source community as possible. Obviously any code changes
to Freevo would be. The server currently uses an API into the Nuance (www.nuance.com) speech engine, as there simply
isn't a good open source alternative for speech recognition that has all these
capabilities. They used to have a developer program where you could obtain
their software for free along with free development license keys. This is
where I got the software. They have since discontinued the program,
though. The server code isn't all that useful without the Nuance
libraries to link against. I'm currently evaluating possibilities in this
area. The Palm Voice Remote is fairly simple. This will likely get
open sourced so that other people can use it as they see fit. Could be
useful for other projects that want to wirelessly capture voice.
Hopefully a good discussion can get started on
this. Also, if anyone has any questions, feel free to ask. Oh, and
thanks for the great software that you guys are developing.
Anyway, it is kinda late, so sorry for any huge
grammar errors. Also pretty longwinded, but hopefully it's interesting
enough to justify that.
Goodnight,
Jared Hanson
|
