Hey! Interesting discussion. Did you guys factor in the work done by Mycroft on that front? I think there's a great deal of overlap, and there are already some really interesting results, shown for example in the Mycroft Plasmoid:
https://www.youtube.com/watch?v=sUhvKTq6c40 (somewhat dated, but gives a
decent impression)

Cheers,
--
sebas

On Friday, September 15, 2017 9:39:13 AM CEST Frederik Gladhorn wrote:
> We here at Randa had a little session about voice recognition and
> control of applications.
> We tried to roughly define what we mean by that - a way of talking to
> the computer as Siri/Cortana/Alexa/Google Now and other projects
> demonstrate, conversational interfaces. We agreed that we want this
> and people expect it more and more.
> Striking a balance between privacy and getting some data to enable
> this is a big concern, see later.
> While there is general interest (almost everyone here went out of
> their way to join the discussion), it didn't seem like anyone here at
> the moment wanted to drive this forward themselves, so it may just
> not go anywhere due to lack of people willing to put in time.
> Otherwise it may be something worth considering as a community goal.
>
> The term "intent" seems to be OK for the event that arrives at the
> application. More on that later.
>
> We tried to break down the problem and arrived at two possible
> scenarios:
> 1) voice recognition -> string representation in user's language
> 1.1) translation to English -> string representation in English
> 2) English sentence -> English string to intent
>
> or alternatively:
> 1) voice recognition -> string representation in user's language
> 2) user language sentence -> user language string to intent
>
> 3) applications get "intents" and react to them.
>
> So basically one open question is whether we need a translation step
> or whether we can go directly from a string in any language to an
> intent.
>
> We do not think it feasible nor desirable to let every app do its own
> magic. Thus a central "daemon" process does step 1, listening to
> audio and translating it to a string representation.
> Then, assuming we want to do a translation step 1.1, we need to find
> a way to do the translation.
>
> For step 1, Mozilla Deep Speech seems like a candidate; it seems to
> be progressing quickly.
>
> We assume that mid-term we need machine learning for step 2 - gather
> sample sentences (somewhere between thousands and millions) to enable
> the step of going from sentence to intent.
> We might get away with a set of simple heuristics to get this
> kick-started, but over time we would want to use machine learning for
> this step. Here it's important to gather enough sample sentences to
> be able to train a model. We basically assume we need to encourage
> people to participate and send us the recognized sentences to get
> enough raw material to work with.
>
> One interesting point is that ideally we can keep context, so that
> users can do follow-up queries/commands.
> Some of the context may be expressed with state machines (talk to
> Emanuelle about that).
> Clearly the whole topic needs research; we want to build on other
> people's work and cooperate as much as possible.
>
> Hopefully we can find some centralized daemon to run on Linux and do
> a lot of the work in steps 1 and 2 for us.
> Step 3 requires work on our side (in Qt?) for sure.
> What should intents look like? Lists of property bags?
> Should apps have a way of saying which intents they support?
>
> A starting point could be to use the common media player interface to
> control the media player using voice.
> Should exposing intents be a D-Bus thing to start with?
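To make the "lists of property bags", heuristics and D-Bus questions above a
little more concrete, here is a purely illustrative Qt/C++ sketch: an intent
as a name plus a QVariantMap, a trivial keyword heuristic standing in for the
sentence-to-intent step, and delivery of a media intent to a player over the
standard MPRIS2 D-Bus interface. The intent names, the rules and the example
player service are invented for the sketch; nothing here is an existing KDE
API.

// Purely illustrative sketch (not an existing KDE API): an "intent" as a
// name plus a property bag (QVariantMap), a trivial keyword heuristic
// standing in for the sentence-to-intent step, and delivery of a media
// intent to a player over the standard MPRIS2 D-Bus interface.
// Build requirement: QtCore and QtDBus (e.g. "QT += dbus" in qmake).

#include <QCoreApplication>
#include <QDBusInterface>
#include <QString>
#include <QVariant>

struct Intent {
    QString name;        // e.g. "media.play" -- hypothetical intent id
    QVariantMap params;  // free-form property bag
};

// Kick-start for step 2: simple heuristics instead of a trained model.
static Intent sentenceToIntent(const QString &sentence)
{
    const QString s = sentence.toLower();
    if (s.contains(QStringLiteral("play")) && s.contains(QStringLiteral("music")))
        return { QStringLiteral("media.play"), {} };
    if (s.contains(QStringLiteral("pause")))
        return { QStringLiteral("media.pause"), {} };
    Intent unknown;
    unknown.name = QStringLiteral("unknown");
    unknown.params.insert(QStringLiteral("sentence"), sentence);
    return unknown;
}

// Starting point for step 3: forward media intents to any player that
// implements the MPRIS2 D-Bus interface (VLC's well-known bus name is
// used purely as an example).
static void dispatchMediaIntent(const Intent &intent)
{
    QDBusInterface player(QStringLiteral("org.mpris.MediaPlayer2.vlc"),
                          QStringLiteral("/org/mpris/MediaPlayer2"),
                          QStringLiteral("org.mpris.MediaPlayer2.Player"));
    if (!player.isValid())
        return;
    if (intent.name == QLatin1String("media.play"))
        player.call(QStringLiteral("Play"));
    else if (intent.name == QLatin1String("media.pause"))
        player.call(QStringLiteral("Pause"));
}

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    // Pretend the central speech daemon (step 1) produced this string:
    dispatchMediaIntent(sentenceToIntent(QStringLiteral("please play some music")));
    return 0;
}

In a real setup the daemon would presumably publish such intents on D-Bus
itself (as signals or method calls) and applications would register for the
intents they support; the direct MPRIS call above is only meant to show how
little is needed for the media-player starting point.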
> For querying data, we may want to interface with Wikipedia,
> MusicBrainz, etc., but is that more part of the central daemon or
> should there be an app for it?
>
> We probably want to be able to start applications when the
> appropriate command arrives: "write a new email to Volker" launches
> Kube with the composer open and ideally the recipient filled out, or
> it may ask the user "I don't know who that is, please help me...".
> So how do applications define which intents they process?
> How can applications ask for details? After receiving an intent they
> may need to ask for more data.
>
> There is also the kpurpose framework; I have no idea what it does and
> should read up on it.
>
> This is likely to be completely new input arriving while the app is
> in some state, maybe with an open modal dialog - new crashes because
> we're not prepared? Are there patterns/building blocks to make it
> easier when an app is in a certain state?
> Maybe we should look at transactional computing and finite state
> machines? We could look at network protocols as an example; they have
> error recovery etc.
>
> What would integration with online services look like? A lot of this
> is about querying information.
> Should it be offline by default and delegate to online services when
> the user asks for it?
>
> We need to build, for example, public transport app integration.
> For a centralized AI, we should join other projects.
> Maybe Qt will provide the connection to 3rd-party engines on Windows
> and macOS - a good testing ground.
>
> And to end with a less serious idea, we need a big bike-shed
> discussion about wake-up words.
> We already came up with: OK KDE (try saying that out loud), OK Konqui
> or Oh Kate!
>
> I hope some of this makes sense, I'd love to see more people stepping
> up, figuring out what is needed and moving it forward :)
>
> Cheers,
> Frederik

--
sebas

http://www.kde.org | http://vizZzion.org
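As a footnote to the state-machine idea for keeping context mentioned in the
mail ("write a new email to Volker" followed by a clarifying question), here
is a minimal hand-rolled sketch of such a conversational state machine. The
states, the contact list and the intent strings are invented for
illustration; a real Qt implementation would more likely build on
QStateMachine or the SCXML-based QScxmlStateMachine, but the shape of the
problem is the same.

// Hand-rolled sketch of keeping conversational context with a finite
// state machine, modelling the "write a new email to ..." follow-up:
// if the contact is unknown, the assistant asks for clarification and
// interprets the next utterance as the missing recipient. All names
// here (ConversationState, handleUtterance, the contact list, the
// intent strings) are invented for illustration.

#include <QSet>
#include <QString>
#include <QTextStream>
#include <cstdio>

enum class ConversationState { Idle, AwaitingRecipient };

struct Conversation {
    ConversationState state = ConversationState::Idle;
    QSet<QString> knownContacts { QStringLiteral("volker") };

    QString handleUtterance(const QString &utterance)
    {
        const QString text = utterance.toLower().trimmed();
        const QString prefix = QStringLiteral("write a new email to ");
        switch (state) {
        case ConversationState::Idle:
            if (text.startsWith(prefix)) {
                const QString who = text.mid(prefix.size());
                if (knownContacts.contains(who))
                    return QStringLiteral("intent: email.compose, recipient=") + who;
                state = ConversationState::AwaitingRecipient;
                return QStringLiteral("I don't know who that is, please help me...");
            }
            return QStringLiteral("intent: unknown");
        case ConversationState::AwaitingRecipient:
            // The follow-up answer is interpreted in the context of the
            // previous command, then the context is reset.
            state = ConversationState::Idle;
            return QStringLiteral("intent: email.compose, recipient=") + text;
        }
        return {};
    }
};

int main()
{
    QTextStream out(stdout);
    Conversation c;
    out << c.handleUtterance(QStringLiteral("write a new email to Volker")) << '\n';
    out << c.handleUtterance(QStringLiteral("write a new email to someone")) << '\n';
    out << c.handleUtterance(QStringLiteral("someone@example.org")) << '\n';
    return 0;
}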
