I am part of a group of students working on a project to allow voice control of a set-top box.  This is being done as a senior design project for a computer engineering degree.  We are using Freevo as our set-top box platform.  I'd like to raise some ideas and suggestions, but first let me give a little background on the project.
 
There are three major components to our system.  The primary component is our voice analysis server.  We have defined a lightweight protocol which encapsulates data into transactions.  Each transaction contains all the data necessary to perform voice recognition: grammars, phrases, and the audio data to be recognized.  Grammars include the words that need to be recognized every time in a given situation.  For example, when playing a movie, a person would always want words such as "play", "pause", etc. to be included.  These are defined in a main grammar.  However, there are occasions when words need to be added dynamically as recognition possibilities; these words are called phrases.  Phrases would be useful when displaying a menu of movie titles, for instance.  The movie titles could be loaded dynamically into the grammar and then recognized by the server.  The server returns a control code indicating which command was recognized.  This is a basic overview of the server; other capabilities exist but are not relevant to this discussion.
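
To make the idea concrete, here is a rough Python sketch of what a transaction carries.  It is illustrative only and is not our actual wire format; the names Transaction, MAIN_GRAMMAR, and control_code are made up for this example, and the "control code" here is simply an index into the vocabulary, which is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical main grammar: words that are always recognizable.
MAIN_GRAMMAR = ["play", "pause"]


@dataclass
class Transaction:
    """Illustrative container for one recognition request (not the real protocol)."""
    audio: bytes
    grammar: List[str] = field(default_factory=lambda: list(MAIN_GRAMMAR))
    phrases: List[str] = field(default_factory=list)  # added dynamically, e.g. movie titles

    def vocabulary(self) -> List[str]:
        """Everything the server may recognize for this request."""
        return self.grammar + self.phrases


def control_code(transaction: Transaction, recognized: str) -> Optional[int]:
    """Assumed mapping: the control code is the index of the recognized text."""
    vocab = [word.lower() for word in transaction.vocabulary()]
    try:
        return vocab.index(recognized.lower())
    except ValueError:
        return None  # nothing in the vocabulary matched


# A movie menu would add its titles as phrases on top of the main grammar.
request = Transaction(audio=b"", phrases=["American Pie"])
print(request.vocabulary())
print(control_code(request, "american pie"))
```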
 
The second major component of our system is the set-top box.  We are using Freevo running on a custom Linux distribution we have created, which is intended to be minimal, as we boot directly into Freevo, in an attempt to mimic a "commercial" set-top box as much as possible.  If you are reading this, you are probably familiar with Freevo and its code, so I won't discuss it : ).
 
The third component of our system is a voice remote control.  For this we are using a Palm Tungsten T, which has a built-in microphone and integrated Bluetooth.  We simply transfer the audio commands pulled off the microphone over Bluetooth to the set-top box.  The set-top box then relays this to the voice analysis server and waits for the control code to come back.  You can also use a microphone hooked up to the sound card, but this isn't as cool : )
 
Using a server to provide voice recognition capabilities is done for the purpose of decoupling the two systems, as set-tops are generally embedded devices with just enough processing power to get by.  The set-top and the remote are tied together and the user interacts directly with these two items.  Ideally the voice server is entirely transparent.
 
Now, with the background out of the way, I'd like to discuss Freevo-related stuff...
 
First of all, one code feature I'd like to see in Freevo is the ability to directly select any item that is shown on the current menu.  Obviously, Freevo is currently built with the assumption of a "normal" remote control, where navigation is done with a series of up, down, etc. commands.  This makes sense for a normal remote control, but not much sense for something like a voice remote control.
 
Our current plan for selecting a specific item is to simply create a mapping from the currently selected item to the one the user requested and issue a series of "fake-IR" commands to get to that item and select it.  These commands would be the same thing LIRC sends if the user had pressed that sequence on the remote.  This is kind of an ugly hack.  In the ideal scenario, if the user said "Watch American Pie" it should simply start playing American Pie.  As far as I can tell, it is simply not possible with the current architecture to select an arbitrary menu position (but I may be overlooking something).
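
For what it's worth, the hack amounts to something like the following Python sketch.  It assumes a flat menu with no wrap-around, and the key names "UP", "DOWN", and "SELECT" are placeholders for whatever the remote configuration actually maps; fake_ir_sequence is a made-up name, not anything in Freevo.

```python
from typing import List


def fake_ir_sequence(current: int, target: int, item_count: int) -> List[str]:
    """Return the key presses needed to move from `current` to `target` and select it.

    Assumes a flat, non-wrapping menu; key names are placeholders.
    """
    if not (0 <= current < item_count and 0 <= target < item_count):
        raise ValueError("menu index out of range")
    delta = target - current
    key = "DOWN" if delta > 0 else "UP"
    # abs(delta) moves, then one press to activate the item.
    return [key] * abs(delta) + ["SELECT"]


# Cursor on the first of five items, user asks for the fourth.
print(fake_ir_sequence(0, 3, 5))
```

Every one of those key presses has to travel through the normal input path, which is why a direct "select item N" command would be so much cleaner.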
 
What I would like to know is whether there is enough interest in designing a more advanced command system to allow for future control options such as voice.  Obviously, simpler control systems, such as your average remote, could be implemented using the more advanced architecture.  It just seems a better idea to design for greater capabilities, rather than more restrictive ones.  I am currently brainstorming the requirements for such an architecture as I wrap up the other components, and would like input from Freevo's developers, if you see this as a direction you want to take.
 
Other situations could be interesting as well.  For instance, if I am in the TV Guide menu and I say "The Simpsons" and there are multiple showings on different channels, a filtered list could be displayed with all airings of The Simpsons.  Also, voice commands should not necessarily be restricted to just what is on the screen.  For instance, in my music directory I have hundreds of artist folders.  When I first enter it, it starts displaying artists that begin with "A".  In a perfect world, I could say "Radiohead" and the menu would go to that folder, even though Radiohead was not currently displayed on the screen.  These are just usage scenarios; I am sure other people have wishes that they could voice too.
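
Both scenarios boil down to matching the recognized phrase against every item in the menu, not just the visible ones, and then either jumping or filtering.  A rough Python sketch of that decision follows; the function names and the "jump"/"filter"/"none" labels are invented for illustration, and exact case-insensitive matching is an assumption.

```python
from typing import List, Tuple

Match = Tuple[int, str]  # (position in the full menu, item title)


def find_matches(spoken: str, items: List[str]) -> List[Match]:
    """Find every menu item whose title equals the spoken phrase, ignoring case."""
    wanted = spoken.strip().lower()
    return [(i, title) for i, title in enumerate(items) if title.lower() == wanted]


def resolve(spoken: str, items: List[str]) -> Tuple[str, List[Match]]:
    """Decide what the menu should do with a recognized phrase."""
    matches = find_matches(spoken, items)
    if not matches:
        return ("none", [])
    if len(matches) == 1:
        return ("jump", matches)    # go straight to the item, even if off-screen
    return ("filter", matches)      # several hits: show a filtered list


artists = ["Aerosmith", "Beck", "Radiohead"]
print(resolve("Radiohead", artists))

airings = ["The Simpsons", "News", "The Simpsons"]
print(resolve("The Simpsons", airings))
```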
 
We'd like to contribute as many of these components back to the open source community as possible.  Obviously any code changes to Freevo would be.  The server currently uses an API into the Nuance (www.nuance.com) speech engine, as there simply isn't a good open source speech recognition alternative that has all these capabilities.  They used to have a developer program where you could obtain their software for free along with free development license keys.  This is where I got the software.  They have since discontinued the program though.  The server code isn't all that useful without the Nuance libraries to link against.  I'm currently evaluating possibilities in this area.  The Palm Voice Remote is fairly simple.  This will likely get open sourced so that other people can use it as they see fit.  It could be useful for other projects that want to wirelessly capture voice.
 
Hopefully a good discussion can get started on this.  Also, if anyone has any questions, feel free to ask.  Oh, and thanks for the great software that you guys are developing.
 
Anyway, it is kinda late, so sorry for any huge grammar errors.  This was also pretty long-winded, but hopefully it's interesting enough to justify that.
 
Goodnight,
Jared Hanson
