We here at Randa had a little session about voice recognition and control of applications. We tried to roughly define what we mean by that - a way of talking to the computer, as Siri/Cortana/Alexa/Google Now and other projects demonstrate: conversational interfaces. We agreed that we want this and that people expect it more and more. Striking a balance between privacy and getting enough data to enable this is a big concern, see later. While there is general interest (almost everyone here went out of their way to join the discussion), it didn't seem like anyone here at the moment wanted to drive this forward themselves, so it may just not go anywhere due to a lack of people willing to put in time. Otherwise it may be something worth considering as a community goal.
The term "intent" seems to be OK for the event that arrives at the application; more on that later. We tried to break down the problem and arrived at two possible scenarios:

1) voice recognition -> string representation in the user's language
1.1) translation to English -> string representation in English
2) English sentence -> English string to intent

or alternatively:

1) voice recognition -> string representation in the user's language
2) user-language sentence -> user-language string to intent

In both cases:

3) applications get "intents" and react to them

So basically one open question is whether we need a translation step or whether we can go directly from a string in any language to an intent.

We do not think it feasible nor desirable to let every app do its own magic. Thus a central daemon process does step 1, listening to audio and translating it to a string representation. Then, assuming we want the translation step 1.1, we need to find a way to do that translation. For step 1, Mozilla's DeepSpeech seems like a candidate; it appears to be progressing quickly.

We assume that mid-term we need machine learning for step 2: gathering sample sentences (somewhere between thousands and millions) to enable the step of going from sentence to intent. We might get away with a set of simple heuristics to get this kick-started, but over time we would want to use machine learning for this step. Here it is important to gather enough sample sentences to be able to train a model. We basically assume we need to encourage people to participate and send us the recognized sentences to get enough raw material to work with.

One interesting point is that ideally we can keep context, so that users can do follow-up queries/commands. Some of the context may be expressed with state machines (talk to Emanuelle about that).

Clearly the whole topic needs research; we want to build on other people's work and cooperate as much as possible. Hopefully we can find some centralized daemon to run on Linux that does a lot of the work in steps 1 and 2 for us. Step 3 requires work on our side (in Qt?) for sure.

What should intents look like? Lists of property bags? Should apps have a way of saying which intents they support? A starting point could be to use the common media player interface (MPRIS) to control the media player by voice. Should exposing intents be a D-Bus thing to start with? For querying data we may want to interface with Wikipedia, MusicBrainz, etc., but is that more part of the central daemon or should there be an app for it?

We probably want to be able to start applications when the appropriate command arrives: "write a new email to Volker" launches Kube with the composer open and ideally the recipient filled out, or it may ask the user "I don't know who that is, please help me...". So how do applications define which intents they process? And how can applications ask for details? After receiving an intent they may need to ask for more data. There is also the KDE Purpose framework; I have no idea what it does, I should read up on it.

An intent is likely to be completely new input arriving while the app is in some state; it may have an open modal dialog and crash because we're not prepared. Are there patterns/building blocks to make it easier when an app is in a certain state? Maybe we should look at transactional computing and finite state machines. We could look at network protocols as an example; they have error recovery etc.

Some rough sketches of these ideas follow.
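To make the heuristics idea a bit more concrete, here is a toy Qt/C++ sketch of going from a recognized sentence to an intent with a regular expression. Everything in it (the Intent struct, matchIntent, the action name "org.kde.email.compose") is made up for illustration, not an existing API:

// Toy sketch: map a recognized sentence to an "intent" using a simple
// regular-expression heuristic. All names here are hypothetical.
#include <QDebug>
#include <QRegularExpression>
#include <QString>
#include <QVariantMap>

struct Intent {
    QString action;       // e.g. "org.kde.email.compose"
    QVariantMap payload;  // free-form "property bag" with the details
};

// Returns true and fills 'out' if the sentence matches a known pattern.
bool matchIntent(const QString &sentence, Intent *out)
{
    // "write a new email to Volker" -> compose intent, recipient "Volker"
    static const QRegularExpression composeRe(
        QStringLiteral("^write (?:a new )?email to (?<recipient>.+)$"),
        QRegularExpression::CaseInsensitiveOption);

    const QRegularExpressionMatch m = composeRe.match(sentence.trimmed());
    if (m.hasMatch()) {
        out->action = QStringLiteral("org.kde.email.compose");
        out->payload.insert(QStringLiteral("recipient"),
                            m.captured(QStringLiteral("recipient")));
        return true;
    }
    return false; // unknown sentence; a real system would fall back to ML here
}

int main()
{
    Intent intent;
    if (matchIntent(QStringLiteral("write a new email to Volker"), &intent))
        qDebug() << intent.action << intent.payload;
    return 0;
}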
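If exposing intents starts as a D-Bus thing, one shape it could take: an application registers an object whose slots the central daemon calls, including a method to announce which intents it supports (answering the "property bags" and "which intents do you handle" questions in one go). Again a sketch with invented service and method names (org.kde.voice.demo, HandleIntent, SupportedIntents):

// Sketch: an application receiving intents over D-Bus. The payload is a
// QVariantMap, i.e. a property bag, which maps to the D-Bus type a{sv}.
#include <QCoreApplication>
#include <QDBusConnection>
#include <QDebug>
#include <QObject>
#include <QStringList>
#include <QVariantMap>

class IntentReceiver : public QObject
{
    Q_OBJECT
public slots:
    // Called by the central daemon when a matching intent arrives.
    void HandleIntent(const QString &action, const QVariantMap &payload)
    {
        qDebug() << "received intent" << action << payload;
    }

    // Lets the daemon discover which intents this application handles.
    QStringList SupportedIntents() const
    {
        return { QStringLiteral("org.kde.email.compose") };
    }
};

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);

    IntentReceiver receiver;
    QDBusConnection bus = QDBusConnection::sessionBus();
    bus.registerService(QStringLiteral("org.kde.voice.demo"));
    bus.registerObject(QStringLiteral("/Intents"), &receiver,
                       QDBusConnection::ExportAllSlots);

    return app.exec();
}

#include "main.moc"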
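The media player starting point is attractive because the MPRIS D-Bus interface already exists: org.mpris.MediaPlayer2.Player offers PlayPause, Next and friends, so "pause the music" could be wired up today. For example (targeting VLC's service name here; other players register their own org.mpris.MediaPlayer2.* name):

// Toggle playback in a running MPRIS-capable player via D-Bus.
#include <QCoreApplication>
#include <QDBusInterface>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);

    // Standard MPRIS object path and player interface.
    QDBusInterface player(QStringLiteral("org.mpris.MediaPlayer2.vlc"),
                          QStringLiteral("/org/mpris/MediaPlayer2"),
                          QStringLiteral("org.mpris.MediaPlayer2.Player"));
    player.call(QStringLiteral("PlayPause"));

    return 0;
}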
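And for the context idea, Qt 5 already ships a state machine framework in QtCore (QStateMachine; in Qt 6 it moved to a separate module) that could model a dialog waiting for a follow-up answer. A sketch with invented signal names, just to show the shape:

// Sketch: after an incomplete compose intent, the next utterance is
// interpreted as the missing recipient rather than as a new command.
#include <QCoreApplication>
#include <QDebug>
#include <QObject>
#include <QState>
#include <QStateMachine>

class Dialog : public QObject
{
    Q_OBJECT
signals:
    void recipientMissing();  // "write an email" arrived without a name
    void recipientProvided(); // the follow-up answer supplied one
};

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);

    Dialog dialog;
    QStateMachine machine;
    QState *idle = new QState(&machine);
    QState *awaitingRecipient = new QState(&machine);

    idle->addTransition(&dialog, &Dialog::recipientMissing, awaitingRecipient);
    awaitingRecipient->addTransition(&dialog, &Dialog::recipientProvided, idle);

    QObject::connect(awaitingRecipient, &QState::entered, []() {
        qDebug() << "I don't know who that is, please help me...";
    });

    machine.setInitialState(idle);
    machine.start();

    // In a real daemon the recognizer would emit these signals.
    return app.exec();
}

#include "main.moc"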
How would integration with online services look? A lot of this is about querying information. Should it be offline by default, delegating to online services only when the user asks for it? We need to build integration with applications, public transport apps for example. For the centralized AI we should join other projects. Maybe Qt will provide the connection to third-party engines on Windows and macOS; that would be a good testing ground.

And to end with a less serious idea, we need a big bike-shed discussion about wake-up words. We already came up with: "OK KDE" (try saying that out loud), "OK Konqui" or "Oh Kate!"

I hope some of this makes sense. I'd love to see more people stepping up to start figuring out what is needed and move it forward :)

Cheers,
Frederik
