On 3/18/2019 2:46 PM, Samuel Thibault wrote:
>
>> Are there any capabilities in at-spi that allow a speech recognition
>> environment to query the application and find out enough context to be
>> able to generate appropriate grammars? For example, using a
>> multipurpose editor, I may want to have different grammars for different
>> types of data. I need to know which tab has focus and something about
>> that tab (TBD) so that I can generate the right grammar to operate on
>> data within that tab.
>
> At-spi provides information on which widget has focus, and then with
> at-spi you can inspect the list of actions.
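Right. For reference, here is roughly what that query looks like from Python with the pyatspi bindings. This is a minimal sketch, not a finished tool: the focus event name and the Action interface calls are standard AT-SPI, but everything a speech environment would actually do with them (building and activating a grammar) is left out.

    import pyatspi

    # Sketch: print the focused widget and the actions it exposes, which a
    # speech environment could turn into spoken commands for that context.
    def on_focus(event):
        if not event.detail1:           # 0 = focus lost, 1 = focus gained
            return
        acc = event.source              # the newly focused accessible object
        print("focus:", acc.getRoleName(), "-", acc.name)
        try:
            action = acc.queryAction()  # Action interface, if the widget has one
        except NotImplementedError:
            return
        for i in range(action.nActions):
            print("  action:", action.getName(i))

    pyatspi.Registry.registerEventListener(on_focus,
                                           "object:state-changed:focused")
    pyatspi.Registry.start()            # blocks and pumps AT-SPI events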
A speech-driven environment only partially cares about focus. You care when you're dictating text into a specific target, because dictating text at random is dangerous, especially if stray words can trigger shortcut key commands. However, speech command context is not limited to the GUI context. For example, let's say focus is on a dialog box in a word processor, doing something with a table. But you remember you need to send an email message, so you say "take a message". This command has nothing to do with the current focus and would not be revealed as part of the widget actions. It's a global command, because the environment knows it has an email program and the top-level commands for email are visible globally. Once you are in an email client like Thunderbird, your focus could be in the list of mailboxes, the list of messages, or within the message itself. It doesn't matter, because saying the command "next message" always moves to the next unread message no matter where you have focus within the email client.

I guess one way to think of it is: if you look at the screen and see something you want to do, you should be able to say it and not have to worry about where the focus is, because your eyes see the focus and your brain determines the context and understands the commands for the operation. And screw the mouse; I'm not going to touch that damn thing because it hurts.

One way to understand how the user interface changes with speech is to sit on your hands and tell somebody else what to click on. We need to instruct the somebody else to be really stupid and only do exactly what you say. That will give you a feeling for the state of speech interfaces today. Then, if you and the somebody agree on the names of things, do the same exercise and you will find how much easier it is to work.

>> Each cell of the database has a type and a name. For text fields,
>> saying "Change <name>" should put me in the cell of that name. But if
>> the cell is a multi-selection list, the grammar should be all of the
>> options for the multi-selection list. If the cell is a number, I
>> want to limit the grammar to just numbers.
>
> AFAICT, the type of content is not exposed in at-spi, but that could be
> added.

Exactly. Every content field in a GUI should be exposed so the speech environment can read that field and edit it programmatically.

>> There are a bunch of other types such as email address, phone number, URL,
>> people, and other limited board non-English language elements that could
>> be spoken, each of which needs its own grammar.
>
> The exact allowed grammar could be passed through at-spi.

The grammar wouldn't be passed through at-spi. The type of information and any other context information would be passed through to the speech interface layer, which would in turn build the grammar that the user can speak; there is a rough sketch of that mapping below.

Side note: people do not understand the speech interface until they try to use a computer without hands. And even then, some people can't think beyond the keyboard and mouse. If I get frustrated when explaining some of these concepts, I apologize in advance, because I've spent over 20 years with this adult-acquired disability and all the societal shit that comes along with it. My frustration comes from seeing people reinventing the same failed techniques for the past 20+ years: specifically, solutions that try to leverage a keyboard interface, or that try to handle complex operations through vocalizations rather than speech. The latter is more common when people try to figure out how to program using speech recognition.
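To make the type-to-grammar idea concrete, here is a rough sketch of what that speech interface layer could do today with what at-spi already exposes (the widget role), never mind richer content types. The pyatspi role constants and child queries are real; NumberGrammar, ChoiceGrammar and DictationGrammar are made-up placeholders for whatever the recognizer actually consumes.

    import pyatspi

    # Hypothetical placeholders for recognizer-specific grammar objects.
    class NumberGrammar: pass                  # digits / numbers only
    class DictationGrammar: pass               # open dictation
    class ChoiceGrammar:                       # one phrase per visible option
        def __init__(self, options): self.options = options

    def grammar_for(acc):
        """Pick a grammar for the focused accessible based on its role."""
        role = acc.getRole()
        if role == pyatspi.ROLE_SPIN_BUTTON:
            return NumberGrammar()             # numeric cell: numbers only
        if role in (pyatspi.ROLE_COMBO_BOX, pyatspi.ROLE_LIST,
                    pyatspi.ROLE_LIST_BOX):
            # selection list: the grammar is exactly the visible options
            options = [acc.getChildAtIndex(i).name
                       for i in range(acc.childCount)]
            return ChoiceGrammar(options)
        if role in (pyatspi.ROLE_ENTRY, pyatspi.ROLE_TEXT):
            return DictationGrammar()          # free text field
        return None                            # otherwise: global commands only

If at-spi ever grows a real content-type attribute (email address, phone number, date, and so on), the dispatch above just gains more branches. The point is that the grammar gets built on the speech side; only the type and context information crosses at-spi.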
One thing I've learned in building speech interfaces is that the grammar is incredibly personal. It is based on how we speak, how we use language, and what idioms we grew up with. Because speech interfaces give you no clues about what you can say, it's very hard to train people to a standard language. To me, the answer is giving the end user the ability to modify the grammar to fit how they speak. I'm twitching now because I see the start of a conversation that heads down the path of yet again reinventing or advocating for stuff that has failed repeatedly. I apologize for my twitching and for anticipating what you haven't said yet.

>> One of the problems though with the Notion database is that there are no
>> row names except by convention. Therefore, whenever you use a name to
>> select a cell, somebody needs to keep track of the row you are on, and no
>> command should take you off of that row.
>
> I'm not sure I understand.

Picture a spreadsheet. It has columns numbered across the top, one through infinity, but on the side, instead of the usual row labels, there's nothing. So when you operate on a row, you can only operate on horizontally adjacent cells and not refer to anything above or below.

>> The last issue is a method of bypassing JavaScript-backed editors. I
>> cannot dictate into Google Docs and have difficulty dictating into
>> LinkedIn. In the browser context, only naked text areas seem to work
>> well with NaturallySpeaking.
>
> That is the kind of example where plugging only through at-spi can fail,
> when the application is not actually exposing an at-spi interface, and
> thus plugging at the OS level can be a useful option.

But we still have the same problem of the speech environment needing to reach into the application to understand the contents of buffers, what commands are accessible in the application context, etc. It may not be a problem we can solve in a way that's acceptable to others, but basically an API that roots around in the bowels of the application is where we need to go. Think of it as a software endoscopy or colonoscopy.

One possibility for dealing with these kinds of crappy interfaces is to have the ability to tell the application to export what's visible in a text area, plus a little bit more, into a plain-text form which could then be edited or added to using speech in a speech-enabled editor. Maybe something like converting rich text to Markdown+ so it's all speech-editable.
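For what it's worth, when a widget does expose the AT-SPI Text and EditableText interfaces, that export/re-import round trip already looks roughly like the sketch below. The queryText/queryEditableText calls are standard pyatspi; edit_with_speech() is a hypothetical stand-in for the speech-enabled editor, and the Markdown conversion is hand-waved. The JavaScript-backed editors are exactly the ones that tend not to expose EditableText, which is the gap.

    import pyatspi

    def export_text(acc):
        """Pull the widget's whole text buffer out as plain text."""
        text = acc.queryText()                  # Text interface
        return text.getText(0, text.characterCount)

    def import_text(acc, new_contents):
        """Push edited text back, replacing the widget's buffer."""
        editable = acc.queryEditableText()      # EditableText interface
        editable.setTextContents(new_contents)

    # Hypothetical round trip:
    #   contents = export_text(focused_widget)
    #   contents = edit_with_speech(contents)   # speech-enabled editor, e.g.
    #                                           # after converting to Markdown
    #   import_text(focused_widget, contents)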