[whatwg] Web API for speech recognition and synthesis
We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson

I am interested in a JavaScript API for text-to-speech synthesis at least. It would be a great help in creating more usable web applications for people with visual impairments (i.e., self-voicing web applications instead of screen reading). It could also enable a slew of new web apps in mobile, eyes-busy situations (e.g., my smart phone reads me my web mail, twitter feed, what-have-you, while I'm driving).

Some folks working on enabling technologies at the Univ. of North Carolina built Outfox (http://code.google.com/p/outfox/) as a proof-of-concept JS interface to text-to-speech engines on Mac, Windows, and Linux. It's Firefox-only, but might be worth a look.

Pete
[whatwg] Web API for speech recognition and synthesis
(resending to include the whatwg list, sorry for multiple postings)

Hi Olli,

Thank you for bringing this interesting thread to the Multimodal Interaction Working Group's attention. The working group is in fact very active. Although it is chartered as W3C Member-only, we do have a public mailing list, www-multimo...@w3.org, available for public discussions.

In general, we would be very interested in hearing about the kinds of use cases for speech recognition and TTS in a browser context that you would like to handle. The Multimodal Architecture is our primary draft spec that addresses using speech in web pages (although it also addresses other modes of input, such as handwriting). A new Working Draft has just been published and we would be very interested in getting feedback on it. In my opinion, it's probably focused more on distributed architectures than on the use cases you might be interested in, but we would like our specs to be comprehensive enough to be able to address both server-based and client-based speech processing. We would also be interested in general discussions of questions about multimodality.

Here are some pointers that may be useful.
MMI page: http://www.w3.org/2002/mmi/
MMI Architecture spec: http://www.w3.org/TR/2009/WD-mmi-arch-20091201/

best regards,
Debbie Dahl, MMI Working Group Chair

-----Original Message-----
From: Olli Pettay [mailto:olli.pet...@helsinki.fi]
Sent: Friday, December 11, 2009 4:14 PM
To: Bjorn Bringert
Cc: o...@pettay.fi; Dave Burke; João Eiras; whatwg; David Singleton; Gudmundur Hafsteinsson; westonru...@gmail.com; www-multimo...@w3.org; Deborah Dahl
Subject: Re: [whatwg] Web API for speech recognition and synthesis

On 12/11/09 6:05 AM, Bjorn Bringert wrote:

Thanks for the discussion - cool to see more interest today also (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)

I've hacked up a proof-of-concept JavaScript API for speech recognition and synthesis.
It adds a navigator.speech object with these functions:

  void listen(ListenCallback callback, ListenOptions options);
  void speak(DOMString text, SpeakCallback callback, SpeakOptions options);

So if I read the examples correctly, you're not using grammars anywhere. I wonder how well that works in real-world cases. Of course, if the speech recognizer can handle everything well without grammars, the result validation could be done in JS after the result is received from the recognizer. But I think having support for grammars simplifies coding and can make speech dialogs somewhat more manageable. W3C has already standardized things like http://www.w3.org/TR/speech-grammar/ and http://www.w3.org/TR/semantic-interpretation/ and the latter works quite nicely with JS.

Again, I think this kind of discussion should happen in the W3C multimodal WG. Though, I'm not sure how actively or how openly that working group works atm.

-Olli

The implementation uses an NPAPI plugin for the Android browser that wraps the existing Android speech APIs. The code is available at http://code.google.com/p/speech-api-browser-plugin/

There are some simple demo apps in http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/ including:
- English to Spanish speech-to-speech translation
- Google search by speaking a query
- The obligatory pizza ordering system
- A phone number dialer

Comments appreciated!

/Bjorn

On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay olli.pet...@helsinki.fi wrote:

Indeed the API should be something significantly simpler than X+V. Microsoft has (had?) support for SALT. That API is pretty simple and provides speech recognition and TTS. The API could probably be even simpler than SALT. IIRC, there was an extension for Firefox to support SALT (well, there was also an extension to support X+V). If the platform/OS provides ASR and TTS, adding a JS API for it should be pretty simple.

X+V tries to handle some logic using VoiceXML FIA, but I think it would be more web-like to give a pure JS API (similar to SALT). Integrating visual and voice input could be done in scripts. I'd assume there would be some script libraries to handle multimodal input integration - especially if there will be touch and gesture events too etc. (Classic multimodal map applications will become possible on the web.) But this all is something which should possibly be designed in or with the W3C multimodal working group. I know their current architecture is way more complex, but X+V, SALT and even Multimodal-CSS have been discussed in that working group.

-Olli

On 12/3/09 2:50 AM, Dave Burke wrote:

We're envisaging a simpler programmatic API that looks familiar to the modern Web developer but one which avoids the legacy of dialog system languages.

Dave

On Wed, Dec 2, 2009 at 7:25
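Olli's point that, without grammar support, result validation could be done in JS after the recognizer returns can be sketched concretely. Everything below is illustrative: the recognition result is assumed to arrive as a plain transcript string, and the matching strategy is just one obvious choice, not part of any proposal in this thread.

```javascript
// Post-recognition validation in plain JS: check a free-form transcript
// against the phrases the app actually understands (a crude stand-in
// for a real SRGS grammar).
function matchCommand(transcript, commands) {
  const normalized = transcript.trim().toLowerCase();
  if (commands.includes(normalized)) return normalized; // exact match
  // Fallback: accept a transcript that contains a known phrase.
  return commands.find(function (c) { return normalized.includes(c); }) || null;
}

// A tiny vocabulary for the canonical pizza-ordering example.
const commands = ["order pizza", "cancel", "repeat"];
console.log(matchCommand("Order pizza please", commands)); // "order pizza"
console.log(matchCommand("play music", commands));         // null
```

With a real grammar (http://www.w3.org/TR/speech-grammar/) plus semantic interpretation, this matching would happen inside the recognizer and the app would receive structured results instead.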
Re: [whatwg] Web API for speech recognition and synthesis
It seems like there is enough interest in speech to start developing experimental implementations. There appear to be two general directions that we could take:

- A general microphone API + streaming API + audio tag
  - Pro: Useful for non-speech recognition / synthesis applications. E.g. audio chat, sound recording.
  - Pro: Allows JavaScript libraries for third-party network speech services. E.g. an AJAX API for Google's speech services. Web app developers that don't have their own speech servers could use that.
  - Pro: Consistent recognition / synthesis user experience across user agents in the same web app.
  - Con: No support for on-device recognition / synthesis, only network services.
  - Con: Varying recognition / synthesis user experience across different web apps in a single user agent.
  - Con: Possibly higher overhead because the audio data needs to pass through JavaScript.
  - Con: Requires dealing with audio encodings, endpointing, buffer sizes etc in the microphone API.
- A speech-specific, back-end-neutral API
  - Pro: Simple API, basically just two methods: listen() and speak().
  - Pro: Can use local recognition / synthesis.
  - Pro: Consistent recognition / synthesis user experience across different web apps in a single user agent.
  - Con: Varying recognition / synthesis user experience across user agents in the same web app.
  - Con: Only works for speech, not general audio.

/Bjorn

On Sun, Dec 13, 2009 at 6:46 PM, Ian McGraw imcg...@mit.edu wrote:

I'm new to this list, but as a speech scientist and web developer, I wanted to add my 2 cents. Personally, I believe the future of speech recognition is in the cloud. Here are two services which provide Javascript APIs for speech recognition (and TTS) today:
http://wami.csail.mit.edu/
http://www.research.att.com/projects/SpeechMashup/index.html

Both of these are research systems, and as such they are really just proof-of-concepts. That said, Wami's JSONP-like implementation allows Quizlet.com to use speech recognition today on a relatively large scale, with just a few lines of Javascript code: http://quizlet.com/voicetest/415/?scatter

Since there are a lot of Google folks on this list, I recommend you talk to Alex Gruenstein (in your speech group) who was one of the lead developers of WAMI while at MIT. The major limitation we found when building the system was that we had to develop a new audio controller for every client (Java for the desktop, custom browsers for iPhone and Android). It would have been much simpler if browsers came with standard microphone capture and audio streaming capabilities.

-Ian

On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter westonru...@gmail.com wrote:

I blogged yesterday about this topic (including a text-to-speech demo using HTML5 Audio and Google Translate's TTS service); the more relevant part for this thread:

I am really excited at the prospect of text-to-speech being made available on the Web! It's just too bad that fetching MP3s from a remote web service is the only standard way of doing so currently; modern operating systems all have TTS capabilities, so it's a shame that web apps can't utilize them via client-side scripting. I posted to the WHATWG mailing list about such a Text-To-Speech (TTS) Web API for JavaScript, and I was directed to a recent thread about a Web API for speech recognition and synthesis. Perhaps there is some momentum building here?

Having TTS available in the browser would boost accessibility for the seeing-impaired and improve usability for people on-the-go. TTS is just another technology that has traditionally been relegated to desktop applications, but as the open Web advances as the preferred platform for application development, it is an essential service to make available (as with the Geolocation API, Device API, etc.). And besides, I want to build TTS applications and my motto is: If it can't be done on the open web, it's not worth doing at all!

http://weston.ruter.net/projects/google-tts/

Weston

On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter westonru...@gmail.com wrote:

I was just alerted about this thread from my post "Text-To-Speech (TTS) Web API for JavaScript" at http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html. Amazing how shared ideas like these seem to arise independently at the same time.

I have a use-case and an additional requirement, that the time indices be made available for when each word is spoken in the TTS-generated audio: I've been working on a web app which reads text in a web page, highlighting each word as it is read. For this to be possible, a Text-To-Speech API is needed which is able to: (1) generate the speech audio from some text, and (2) include the time indices for when each of the words in the text is spoken. I foresee that
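Bjorn's second direction ("basically just two methods: listen() and speak()") can be illustrated with a stub. The navigator.speech shape comes from his proof-of-concept plugin; the stub implementation and the result fields (text, confidence) are assumptions made so the call pattern can run outside a browser.

```javascript
// Stub standing in for the proposed navigator.speech object; a real
// implementation would drive the platform recognizer and synthesizer.
const speech = {
  listen: function (callback, options) {
    // Pretend the user spoke; return a canned result in the assumed shape.
    callback({ text: "two large pizzas", confidence: 0.87 });
  },
  speak: function (text, callback, options) {
    callback({ spoken: text });
  }
};

// Usage mirroring the proposed signatures:
//   void listen(ListenCallback callback, ListenOptions options);
//   void speak(DOMString text, SpeakCallback callback, SpeakOptions options);
let lastHeard = null;
let lastSpoken = null;
speech.listen(function (result) {
  lastHeard = result.text;
  speech.speak("You said: " + result.text, function (r) {
    lastSpoken = r.spoken;
  }, { lang: "en-US" });
}, { lang: "en-US" });
console.log(lastHeard);  // "two large pizzas"
console.log(lastSpoken); // "You said: two large pizzas"
```

The point of the sketch is how little surface area the speech-specific direction needs, which is exactly its pro and its con: simple for speech, useless for general audio.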
Re: [whatwg] Web API for speech recognition and synthesis
On Tue, 15 Dec 2009, Bjorn Bringert wrote:

- A general microphone API + streaming API + audio tag
  - Pro: Useful for non-speech recognition / synthesis applications. E.g. audio chat, sound recording.
  - Pro: Allows JavaScript libraries for third-party network speech services. E.g. an AJAX API for Google's speech services. Web app developers that don't have their own speech servers could use that.
  - Pro: Consistent recognition / synthesis user experience across user agents in the same web app.
  - Con: No support for on-device recognition / synthesis, only network services.
  - Con: Varying recognition / synthesis user experience across different web apps in a single user agent.
  - Con: Possibly higher overhead because the audio data needs to pass through JavaScript.
  - Con: Requires dealing with audio encodings, endpointing, buffer sizes etc in the microphone API.

FWIW I've started looking at this kind of thing in general (for audio and video -- see <device> in the spec for the first draft ideas), since it'll be required for other things as well. However, that shouldn't be taken as a sign that the other approach shouldn't also be examined.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Web API for speech recognition and synthesis
Great! As I've said, I'm definitely biased towards this approach. As Bjorn hinted, AJAX APIs could be developed with all sorts of interesting features that will never make it down into the browser, e.g. pronunciation assessment, speech therapy, all those lie-detector apps for your phone :-).

Still, I think that we're missing the biggest pro:
- Pro: Speech recognition technology is data-driven. Improvements in the underlying technology are far more likely to occur with a network-driven approach.

To be fair, with that, you have to add a con:
- Con: Less privacy.

-Ian

On Tue, Dec 15, 2009 at 3:37 PM, Ian Hickson i...@hixie.ch wrote:

On Tue, 15 Dec 2009, Bjorn Bringert wrote:

- A general microphone API + streaming API + audio tag
  - Pro: Useful for non-speech recognition / synthesis applications. E.g. audio chat, sound recording.
  - Pro: Allows JavaScript libraries for third-party network speech services. E.g. an AJAX API for Google's speech services. Web app developers that don't have their own speech servers could use that.
  - Pro: Consistent recognition / synthesis user experience across user agents in the same web app.
  - Con: No support for on-device recognition / synthesis, only network services.
  - Con: Varying recognition / synthesis user experience across different web apps in a single user agent.
  - Con: Possibly higher overhead because the audio data needs to pass through JavaScript.
  - Con: Requires dealing with audio encodings, endpointing, buffer sizes etc in the microphone API.

FWIW I've started looking at this kind of thing in general (for audio and video -- see <device> in the spec for the first draft ideas), since it'll be required for other things as well. However, that shouldn't be taken as a sign that the other approach shouldn't also be examined.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Web API for speech recognition and synthesis
Currently the W3C Device API WG is working on a Capture API which will include microphone capture and audio streaming capabilities. The current draft is at: http://dev.w3.org/2009/dap/camera/

It is pretty rough and still a work in progress, so for instance streaming is not there.

Thanks
Dzung Tran

On Sun, Dec 13, 2009 at 6:46 PM, Ian McGraw imcg...@mit.edu wrote:

I'm new to this list, but as a speech scientist and web developer, I wanted to add my 2 cents. Personally, I believe the future of speech recognition is in the cloud. Here are two services which provide Javascript APIs for speech recognition (and TTS) today:
http://wami.csail.mit.edu/
http://www.research.att.com/projects/SpeechMashup/index.html

Both of these are research systems, and as such they are really just proof-of-concepts. That said, Wami's JSONP-like implementation allows Quizlet.com to use speech recognition today on a relatively large scale, with just a few lines of Javascript code: http://quizlet.com/voicetest/415/?scatter

Since there are a lot of Google folks on this list, I recommend you talk to Alex Gruenstein (in your speech group) who was one of the lead developers of WAMI while at MIT. The major limitation we found when building the system was that we had to develop a new audio controller for every client (Java for the desktop, custom browsers for iPhone and Android). It would have been much simpler if browsers came with standard microphone capture and audio streaming capabilities.

-Ian
Re: [whatwg] Web API for speech recognition and synthesis
I'm new to this list, but as a speech scientist and web developer, I wanted to add my 2 cents. Personally, I believe the future of speech recognition is in the cloud. Here are two services which provide Javascript APIs for speech recognition (and TTS) today:
http://wami.csail.mit.edu/
http://www.research.att.com/projects/SpeechMashup/index.html

Both of these are research systems, and as such they are really just proof-of-concepts. That said, Wami's JSONP-like implementation allows Quizlet.com to use speech recognition today on a relatively large scale, with just a few lines of Javascript code: http://quizlet.com/voicetest/415/?scatter

Since there are a lot of Google folks on this list, I recommend you talk to Alex Gruenstein (in your speech group) who was one of the lead developers of WAMI while at MIT. The major limitation we found when building the system was that we had to develop a new audio controller for every client (Java for the desktop, custom browsers for iPhone and Android). It would have been much simpler if browsers came with standard microphone capture and audio streaming capabilities.

-Ian

On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter westonru...@gmail.com wrote:

I blogged yesterday about this topic (including a text-to-speech demo using HTML5 Audio and Google Translate's TTS service); the more relevant part for this thread: http://weston.ruter.net/projects/google-tts/

I am really excited at the prospect of text-to-speech being made available on the Web! It's just too bad that fetching MP3s from a remote web service is the only standard way of doing so currently; modern operating systems all have TTS capabilities, so it's a shame that web apps can't utilize them via client-side scripting. I posted to the WHATWG mailing list about such a Text-To-Speech (TTS) Web API for JavaScript, and I was directed to a recent thread about a Web API for speech recognition and synthesis. Perhaps there is some momentum building here?

Having TTS available in the browser would boost accessibility for the seeing-impaired and improve usability for people on-the-go. TTS is just another technology that has traditionally been relegated to desktop applications, but as the open Web advances as the preferred platform for application development, it is an essential service to make available (as with the Geolocation API, Device API, etc.). And besides, I want to build TTS applications and my motto is: If it can't be done on the open web, it's not worth doing at all!

http://weston.ruter.net/projects/google-tts/

Weston

On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter westonru...@gmail.com wrote:

I was just alerted about this thread from my post "Text-To-Speech (TTS) Web API for JavaScript" at http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html. Amazing how shared ideas like these seem to arise independently at the same time.

I have a use-case and an additional requirement, that the time indices be made available for when each word is spoken in the TTS-generated audio: I've been working on a web app which reads text in a web page, highlighting each word as it is read. For this to be possible, a Text-To-Speech API is needed which is able to: (1) generate the speech audio from some text, and (2) include the time indices for when each of the words in the text is spoken.

I foresee that a TTS API should integrate closely with the HTML5 Audio API. For example, invoking a call to the API could return a TTS object which has an instance of Audio, whose interface could be used to navigate through the TTS output. For example:

  var tts = new TextToSpeech("Hello, World!");
  tts.audio.addEventListener("canplaythrough", function(e){
    // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
    //                 {startTime:500, endTime:1000, text:"World"}]
  }, false);
  tts.read(); // invokes tts.audio.play()

What would be even cooler is if the parameter passed to the TextToSpeech constructor could be an Element or TextNode, and the indices would then include a DOM Range in addition to the text property. A flag could also be set which would result in each of these DOM ranges being selected when it is read. For example:

  var tts = new TextToSpeech(document.querySelector("article"));
  tts.selectRangesOnRead = true;
  tts.audio.addEventListener("canplaythrough", function(e){
    /* tts.indices == [
         {startTime:0, endTime:500, text:"Hello", range:Range},
         {startTime:500, endTime:1000, text:"World", range:Range}
       ] */
  }, false);
  tts.read();

In addition to the events fired by the Audio API, more events could be fired when reading TTS, such as a "readrange" event whose event object would include the index (startTime, endTime, text, range) for the range currently being spoken. Such functionality would make the ability to read along with the text trivial. What do you think?

Weston

On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert
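Weston's read-along use case reduces to mapping the audio playback clock to an entry in the proposed tts.indices array. A minimal sketch, assuming only the {startTime, endTime, text} shape from his examples (times in milliseconds); wiring this to a timeupdate listener on the audio element is left out:

```javascript
// Find the index entry being spoken at a given playback time, or null
// if the time falls between words or outside the utterance.
function wordAtTime(indices, timeMs) {
  for (let i = 0; i < indices.length; i++) {
    const entry = indices[i];
    if (timeMs >= entry.startTime && timeMs < entry.endTime) return entry;
  }
  return null;
}

// The indices array from Weston's "Hello, World!" example.
const indices = [
  { startTime: 0, endTime: 500, text: "Hello" },
  { startTime: 500, endTime: 1000, text: "World" }
];
console.log(wordAtTime(indices, 250).text); // "Hello"
console.log(wordAtTime(indices, 750).text); // "World"
console.log(wordAtTime(indices, 1500));     // null
```

A highlight-as-you-read implementation would call this from the audio element's time-update handler and move the highlight (or selection, with selectRangesOnRead) to the returned entry's range.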
Re: [whatwg] Web API for speech recognition and synthesis
Thanks for the discussion - cool to see more interest today also (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)

I've hacked up a proof-of-concept JavaScript API for speech recognition and synthesis. It adds a navigator.speech object with these functions:

  void listen(ListenCallback callback, ListenOptions options);
  void speak(DOMString text, SpeakCallback callback, SpeakOptions options);

The implementation uses an NPAPI plugin for the Android browser that wraps the existing Android speech APIs. The code is available at http://code.google.com/p/speech-api-browser-plugin/

There are some simple demo apps in http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/ including:
- English to Spanish speech-to-speech translation
- Google search by speaking a query
- The obligatory pizza ordering system
- A phone number dialer

Comments appreciated!

/Bjorn

On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay olli.pet...@helsinki.fi wrote:

Indeed the API should be something significantly simpler than X+V. Microsoft has (had?) support for SALT. That API is pretty simple and provides speech recognition and TTS. The API could probably be even simpler than SALT. IIRC, there was an extension for Firefox to support SALT (well, there was also an extension to support X+V). If the platform/OS provides ASR and TTS, adding a JS API for it should be pretty simple.

X+V tries to handle some logic using VoiceXML FIA, but I think it would be more web-like to give a pure JS API (similar to SALT). Integrating visual and voice input could be done in scripts. I'd assume there would be some script libraries to handle multimodal input integration - especially if there will be touch and gesture events too etc. (Classic multimodal map applications will become possible on the web.) But this all is something which should possibly be designed in or with the W3C multimodal working group. I know their current architecture is way more complex, but X+V, SALT and even Multimodal-CSS have been discussed in that working group.

-Olli

On 12/3/09 2:50 AM, Dave Burke wrote:

We're envisaging a simpler programmatic API that looks familiar to the modern Web developer but one which avoids the legacy of dialog system languages.

Dave

On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote:

On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert bring...@google.com wrote:

We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson

This exists already, but only Opera supports it, although there are problems with the library we use for speech recognition.
http://www.w3.org/TR/xhtml+voice/
http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/

Would be nice to revive that specification and get vendor buy-in.

--
João Eiras
Core Developer, Opera Software ASA, http://www.opera.com/

--
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Re: [whatwg] Web API for speech recognition and synthesis
I was just alerted about this thread from my post "Text-To-Speech (TTS) Web API for JavaScript" at http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html. Amazing how shared ideas like these seem to arise independently at the same time.

I have a use-case and an additional requirement, that the time indices be made available for when each word is spoken in the TTS-generated audio: I've been working on a web app which reads text in a web page, highlighting each word as it is read. For this to be possible, a Text-To-Speech API is needed which is able to: (1) generate the speech audio from some text, and (2) include the time indices for when each of the words in the text is spoken.

I foresee that a TTS API should integrate closely with the HTML5 Audio API. For example, invoking a call to the API could return a TTS object which has an instance of Audio, whose interface could be used to navigate through the TTS output. For example:

  var tts = new TextToSpeech("Hello, World!");
  tts.audio.addEventListener("canplaythrough", function(e){
    // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
    //                 {startTime:500, endTime:1000, text:"World"}]
  }, false);
  tts.read(); // invokes tts.audio.play()

What would be even cooler is if the parameter passed to the TextToSpeech constructor could be an Element or TextNode, and the indices would then include a DOM Range in addition to the text property. A flag could also be set which would result in each of these DOM ranges being selected when it is read. For example:

  var tts = new TextToSpeech(document.querySelector("article"));
  tts.selectRangesOnRead = true;
  tts.audio.addEventListener("canplaythrough", function(e){
    /* tts.indices == [
         {startTime:0, endTime:500, text:"Hello", range:Range},
         {startTime:500, endTime:1000, text:"World", range:Range}
       ] */
  }, false);
  tts.read();

In addition to the events fired by the Audio API, more events could be fired when reading TTS, such as a "readrange" event whose event object would include the index (startTime, endTime, text, range) for the range currently being spoken. Such functionality would make the ability to read along with the text trivial. What do you think?

Weston

On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert bring...@google.com wrote:

On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote:

On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote:

I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition.

Again, it would help a lot if you could provide use cases and requirements. This helps both with designing an API, as well as evaluating if the use cases are common enough that a dedicated API is the best solution.

/ Jonas

I'm mostly thinking about speech web apps for mobile devices. I think that's where speech makes most sense as an input and output method, because of the poor keyboards, small screens, and frequent hands/eyes-busy situations (e.g. while driving). Accessibility is the other big reason for using speech.

Some ideas for use cases:
- Search by speaking a query
- Speech-to-speech translation
- Voice Dialing (could open a tel: URI to actually make the call)
- Dialog systems (e.g. the canonical pizza ordering system)
- Lightweight JavaScript browser extensions (e.g. Greasemonkey / Chrome extensions) for using speech with any web site, e.g. for accessibility.

Requirements:
- Web app developer side:
  - Allows both speech recognition and synthesis.
  - Easy to use API. Makes simple things easy and advanced things possible.
  - Doesn't require the web app developer to develop / run his own speech recognition / synthesis servers.
  - (Natural) language-neutral API.
  - Allows developer-defined application-specific grammars / language models.
  - Allows multilingual applications.
  - Allows easy localization of speech apps.
- Implementor side:
  - Easy enough to implement that it can get wide adoption in browsers.
  - Allows implementor to use either client-side or server-side recognition and synthesis.

--
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
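Several of the developer-side requirements above (language-neutral API, developer-defined grammars, multilingual apps, easy localization) would land on the ListenOptions argument of the proposed listen() call. The sketch below shows one shape such an options object might take; every field name here is an assumption for illustration, and the grammar URL is a made-up example, not a real resource.

```javascript
// Hypothetical ListenOptions builder; none of these fields are
// standardized, they just map the stated requirements onto options.
function makeListenOptions(lang, grammarUrl) {
  return {
    lang: lang,          // language tag: enables multilingual apps and localization
    grammar: grammarUrl, // optional developer-defined SRGS grammar
    maxResults: 3        // n-best list for app-side disambiguation
  };
}

const opts = makeListenOptions("es-ES", "http://example.com/pizza.grxml");
console.log(opts.lang);    // "es-ES"
console.log(opts.grammar); // "http://example.com/pizza.grxml"
```

An implementation could honor the grammar field with a local recognizer or forward it to a server-side one, which keeps the API neutral between the client-side and server-side back-end choices, as the implementor-side requirements ask.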
Re: [whatwg] Web API for speech recognition and synthesis
(Sending this 2nd time. Hopefully the whatwg list doesn't bounce it back.)

On 12/11/09 6:05 AM, Bjorn Bringert wrote:

Thanks for the discussion - cool to see more interest today also (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)

I've hacked up a proof-of-concept JavaScript API for speech recognition and synthesis. It adds a navigator.speech object with these functions:

  void listen(ListenCallback callback, ListenOptions options);
  void speak(DOMString text, SpeakCallback callback, SpeakOptions options);

So if I read the examples correctly, you're not using grammars anywhere. I wonder how well that works in real-world cases. Of course, if the speech recognizer can handle everything well without grammars, the result validation could be done in JS after the result is received from the recognizer. But I think having support for grammars simplifies coding and can make speech dialogs somewhat more manageable. W3C has already standardized things like http://www.w3.org/TR/speech-grammar/ and http://www.w3.org/TR/semantic-interpretation/ and the latter works quite nicely with JS.

Again, I think this kind of discussion should happen in the W3C multimodal WG. Though, I'm not sure how actively or how openly that working group works atm.

-Olli

The implementation uses an NPAPI plugin for the Android browser that wraps the existing Android speech APIs. The code is available at http://code.google.com/p/speech-api-browser-plugin/

There are some simple demo apps in http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/ including:
- English to Spanish speech-to-speech translation
- Google search by speaking a query
- The obligatory pizza ordering system
- A phone number dialer

Comments appreciated!

/Bjorn

On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay olli.pet...@helsinki.fi wrote:

Indeed the API should be something significantly simpler than X+V. Microsoft has (had?) support for SALT. That API is pretty simple and provides speech recognition and TTS. The API could probably be even simpler than SALT. IIRC, there was an extension for Firefox to support SALT (well, there was also an extension to support X+V). If the platform/OS provides ASR and TTS, adding a JS API for it should be pretty simple.

X+V tries to handle some logic using VoiceXML FIA, but I think it would be more web-like to give a pure JS API (similar to SALT). Integrating visual and voice input could be done in scripts. I'd assume there would be some script libraries to handle multimodal input integration - especially if there will be touch and gesture events too etc. (Classic multimodal map applications will become possible on the web.) But this all is something which should possibly be designed in or with the W3C multimodal working group. I know their current architecture is way more complex, but X+V, SALT and even Multimodal-CSS have been discussed in that working group.

-Olli

On 12/3/09 2:50 AM, Dave Burke wrote:

We're envisaging a simpler programmatic API that looks familiar to the modern Web developer but one which avoids the legacy of dialog system languages.

Dave

On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote:

On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert bring...@google.com wrote:

We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson

This exists already, but only Opera supports it, although there are problems with the library we use for speech recognition.
http://www.w3.org/TR/xhtml+voice/ http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/ Would be nice to revive that specification and get vendor buy-in. -- João Eiras Core Developer, Opera Software ASA, http://www.opera.com/
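To make the proof-of-concept API above easier to picture, here is a minimal sketch of how the two calls might be used together. Only the listen/speak signatures come from the thread; the option names ("language", "maxResults"), the shape of the result object, and the stub recognizer standing in for a real navigator.speech implementation are all invented for illustration.

```javascript
// Hypothetical usage of the proposed navigator.speech API.
// A stub stands in for the browser-provided object so the calls run end to end.
const navigator = {
  speech: {
    // void listen(ListenCallback callback, ListenOptions options);
    listen(callback, options) {
      // A real implementation would capture audio and run recognition;
      // the stub invokes the callback with a canned result.
      callback({ results: [{ text: "pizza margherita", confidence: 0.92 }] });
    },
    // void speak(DOMString text, SpeakCallback callback, SpeakOptions options);
    speak(text, callback, options) {
      // A real implementation would synthesize audio; the stub only
      // reports what would have been spoken.
      callback({ spokenText: text });
    },
  },
};

let transcript = null;
let spoken = null;
navigator.speech.listen(
  (result) => {
    transcript = result.results[0].text;
    navigator.speech.speak(
      "You said: " + transcript,
      (r) => { spoken = r.spokenText; },
      { language: "en-US" }
    );
  },
  { language: "en-US", maxResults: 1 }
);
```

Note that nothing in this shape carries a grammar, which is exactly Olli's concern above: any validation of the recognized text would have to happen in JS after the callback fires.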
Re: [whatwg] Web API for speech recognition and synthesis
Indeed the API should be something significantly simpler than X+V. Microsoft has (had?) support for SALT. That API is pretty simple and provides speech recognition and TTS. The API could probably be even simpler than SALT. IIRC, there was an extension for Firefox to support SALT (well, there was also an extension to support X+V). If the platform/OS provides ASR and TTS, adding a JS API for it should be pretty simple. X+V tries to handle some logic using the VoiceXML FIA, but I think it would be more web-like to give a pure JS API (similar to SALT). Integrating visual and voice input could be done in scripts. I'd assume there would be some script libraries to handle multimodal input integration - especially if there will be touch and gesture events too etc. (Classic multimodal map applications will become possible on the web.) But all of this is something that should probably be designed in or with the W3C multimodal working group. I know their current architecture is way more complex, but X+V, SALT and even Multimodal-CSS have been discussed in that working group. -Olli On 12/3/09 2:50 AM, Dave Burke wrote: We're envisaging a simpler programmatic API that looks familiar to the modern Web developer but one which avoids the legacy of dialog system languages. Dave On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com mailto:jo...@opera.com wrote: On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert bring...@google.com mailto:bring...@google.com wrote: We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson This exists already, but only Opera supports it, although there are problems with the library we use for speech recognition.
http://www.w3.org/TR/xhtml+voice/ http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/ Would be nice to revive that specification and get vendor buy-in. -- João Eiras Core Developer, Opera Software ASA, http://www.opera.com/
Re: [whatwg] Web API for speech recognition and synthesis
On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote: I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. Again, it would help a lot if you could provide use cases and requirements. This helps both with designing an API, as well as evaluating if the use cases are common enough that a dedicated API is the best solution. / Jonas I'm mostly thinking about speech web apps for mobile devices. I think that's where speech makes most sense as an input and output method, because of the poor keyboards, small screens, and frequent hands/eyes-busy situations (e.g. while driving). Accessibility is the other big reason for using speech. Some ideas for use cases: - Search by speaking a query - Speech-to-speech translation - Voice Dialing (could open a tel: URI to actually make the call) - Dialog systems (e.g. the canonical pizza ordering system) - Lightweight JavaScript browser extensions (e.g. Greasemonkey / Chrome extensions) for using speech with any web site, e.g., for accessibility. Requirements: - Web app developer side: - Allows both speech recognition and synthesis. - Easy to use API. Makes simple things easy and advanced things possible. - Doesn't require web app developer to develop / run his own speech recognition / synthesis servers. - (Natural) language-neutral API. - Allows developer-defined application-specific grammars / language models. - Allows multilingual applications. 
- Allows easy localization of speech apps. - Implementor side: - Easy enough to implement that it can get wide adoption in browsers. - Allows implementor to use either client-side or server-side recognition and synthesis. -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
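One of the requirements above is support for developer-defined grammars. Until an API supports something like SRGS directly, a workaround suggested earlier in the thread is to recognize freely and validate the result in JS. The sketch below is a toy, hand-rolled validator for the pizza-ordering use case; the vocabulary lists and the parseOrder helper are invented for illustration and are not part of any proposed API.

```javascript
// App-specific "grammar" expressed as plain JS data: the only sizes and
// toppings this dialog accepts.
const SIZES = ["small", "medium", "large"];
const TOPPINGS = ["cheese", "pepperoni", "mushroom", "ham"];

// Accepts utterances like "large pepperoni pizza"; returns a parsed order,
// or null when the recognizer's free-form result falls outside the grammar.
function parseOrder(utterance) {
  const words = utterance.toLowerCase().split(/\s+/);
  const size = words.find((w) => SIZES.includes(w));
  const topping = words.find((w) => TOPPINGS.includes(w));
  return size && topping ? { size, topping } : null;
}

parseOrder("Large Pepperoni pizza"); // -> { size: "large", topping: "pepperoni" }
parseOrder("anchovy calzone");       // -> null (out of grammar)
```

This is the trade-off Olli raises: without grammar support in the API itself, every app carries this kind of validation code, whereas an SRGS-style grammar could both constrain recognition and do the semantic mapping.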
Re: [whatwg] Web API for speech recognition and synthesis
I agree 100%. Still, I think the access to the mic and the speech recognition could be separated. -- Diogo Resende drese...@thinkdigital.pt ThinkDigital On Thu, 2009-12-03 at 12:06 +, Bjorn Bringert wrote: On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote: I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. Again, it would help a lot of you could provide use cases and requirements. This helps both with designing an API, as well as evaluating if the use cases are common enough that a dedicated API is the best solution. / Jonas I'm mostly thinking about speech web apps for mobile devices. I think that's where speech makes most sense as an input and output method, because of the poor keyboards, small screens, and frequent hands/eyes busy situations (e.g. while driving). Accessibility is the other big reason for using speech. Some ideas for use cases: - Search by speaking a query - Speech-to-speech translation - Voice Dialing (could open a tel: URI to actually make the call) - Dialog systems (e.g. the canonical pizza ordering system) - Lightweight JavaScript browser extensions (e.g. Greasemonkey / Chrome extensions) for using speech with any web site, e.g, for accessibility. Requirements: - Web app developer side: - Allows both speech recognition and synthesis. - Easy to use API. Makes simple things easy and advanced things possible. 
- Doesn't require web app developer to develop / run his own speech recognition / synthesis servers. - (Natural) language-neutral API. - Allows developer-defined application specific grammars / language models. - Allows multilingual applications. - Allows easy localization of speech apps. - Implementor side: - Easy enough to implement that it can get wide adoption in browsers. - Allows implementor to use either client-side or server-side recognition and synthesis.
Re: [whatwg] Web API for speech recognition and synthesis
I agree. The application should be able to choose a source for speech commands, or give the user a choice of options for a speech source. It also provides a much better separation of APIs, allowing the development of a speech API that doesn't depend on or interfere in any way with the development of a microphone/audio input device API. 2009/12/3 Diogo Resende drese...@thinkdigital.pt I agree 100%. Still, I think the access to the mic and the speech recognition could be separated. -- Diogo Resende drese...@thinkdigital.pt ThinkDigital On Thu, 2009-12-03 at 12:06 +, Bjorn Bringert wrote: On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote: I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. Again, it would help a lot of you could provide use cases and requirements. This helps both with designing an API, as well as evaluating if the use cases are common enough that a dedicated API is the best solution. / Jonas I'm mostly thinking about speech web apps for mobile devices. I think that's where speech makes most sense as an input and output method, because of the poor keyboards, small screens, and frequent hands/eyes busy situations (e.g. while driving). Accessibility is the other big reason for using speech. Some ideas for use cases: - Search by speaking a query - Speech-to-speech translation - Voice Dialing (could open a tel: URI to actually make the call) - Dialog systems (e.g. 
the canonical pizza ordering system) - Lightweight JavaScript browser extensions (e.g. Greasemonkey / Chrome extensions) for using speech with any web site, e.g, for accessibility. Requirements: - Web app developer side: - Allows both speech recognition and synthesis. - Easy to use API. Makes simple things easy and advanced things possible. - Doesn't require web app developer to develop / run his own speech recognition / synthesis servers. - (Natural) language-neutral API. - Allows developer-defined application specific grammars / language models. - Allows multilingual applications. - Allows easy localization of speech apps. - Implementor side: - Easy enough to implement that it can get wide adoption in browsers. - Allows implementor to use either client-side or server-side recognition and synthesis.
Re: [whatwg] Web API for speech recognition and synthesis
On Dec 3, 2009, at 4:06 AM, Bjorn Bringert wrote: On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote: I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. Again, it would help a lot if you could provide use cases and requirements. This helps both with designing an API, as well as evaluating if the use cases are common enough that a dedicated API is the best solution. / Jonas I'm mostly thinking about speech web apps for mobile devices. I think that's where speech makes most sense as an input and output method, because of the poor keyboards, small screens, and frequent hands/eyes-busy situations (e.g. while driving). Accessibility is the other big reason for using speech. Accessibility is already handled through ARIA and the host platform's accessibility features. Some ideas for use cases: - Search by speaking a query - Speech-to-speech translation - Voice Dialing (could open a tel: URI to actually make the call) - Dialog systems (e.g. the canonical pizza ordering system) - Lightweight JavaScript browser extensions (e.g. Greasemonkey / Chrome extensions) for using speech with any web site, e.g., for accessibility. I am unsure why the site should be directly responsible for things like audio-based accessibility. What do you believe a site should be doing itself manually vs. the accessibility services provided by the host OS? 
Requirements: - Web app developer side: - Allows both speech recognition and synthesis. ARIA (in conjunction with the OS accessibility services) already provides the accessibility-focused text-to-speech (unsure about the recognition side) - Doesn't require web app developer to develop / run his own speech recognition / synthesis servers. This would seem to be a case for using the OS services - Implementor side: - Easy enough to implement that it can get wide adoption in browsers. These services are not simple -- any implementation would seem to be a significant amount of work, especially if you want to a) actually be good at it and b) interact with the host OS's native accessibility features. - Allows implementor to use either client-side or server-side recognition and synthesis. I honestly have no idea what you mean by this. --Oliver
Re: [whatwg] Web API for speech recognition and synthesis
On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert bring...@google.com wrote: On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote: I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. Again, it would help a lot of you could provide use cases and requirements. This helps both with designing an API, as well as evaluating if the use cases are common enough that a dedicated API is the best solution. / Jonas I'm mostly thinking about speech web apps for mobile devices. I think that's where speech makes most sense as an input and output method, because of the poor keyboards, small screens, and frequent hands/eyes busy situations (e.g. while driving). Accessibility is the other big reason for using speech. Some ideas for use cases: - Search by speaking a query - Speech-to-speech translation - Voice Dialing (could open a tel: URI to actually make the call) input type=search, input type=text and input type=tel seems like the correct solution for these. Nothing prevents UAs for allowing speech rather than keyboard input into these (and I believe that most do if you have AT tools installed). - Dialog systems (e.g. the canonical pizza ordering system) I saw some pretty cool XHTML+Voice demos a few years ago that did this. They didn't use speech-to-text scripting APIs though. - Lightweight JavaScript browser extensions (e.g. 
Greasemonkey / Chrome extensions) for using speech with any web site, e.g, for accessibility. These would seem like APIs not exposed to webpages, but rather to extensions. So WHATWG would be the wrong place to standardize them. And I'm not convinced that this needs speech-to-text scripting APIs either, but rather simply support for speech rather than keyboard as text input. / Jonas
Re: [whatwg] Web API for speech recognition and synthesis
On Thu, Dec 3, 2009 at 7:32 AM, Diogo Resende drese...@thinkdigital.ptwrote: I agree 100%. Still, I think the access to the mic and the speech recognition could be separated. While it would be possible to separate access to the microphone and speech recognition, combining them allows the API to abstract away details of the implementation that would otherwise have to be exposed, in particular the audio encoding(s) used, and whether the audio is streamed to the recognizer or sent in a single chunk. If we don't provide general access to the microphone, the speech recognition API can be simpler, implementors will have more flexibility, and implementations can be simpler and smaller because they won't have to deal with conversions between different audio encodings. So I'm in favour of not separating out access to the microphone, at least in v1 of the API. -- Fergus Henderson fer...@google.com
Re: [whatwg] Web API for speech recognition and synthesis
I was not thinking of raw access to the mic. I was just thinking of a 2-step method to do it so you could just do 1 step :) I was thinking of something like: 1. Call a Sound API and ask to record (maybe something like the geolocation on Firefox [1]). 2. Pass it to speech2text or save or stream or whatever... This way one could record audio and do something else like save/stream. If others want to translate into text, just do the next step. [1]: http://www.mozilla.com/en-US/firefox/geolocation/ -- Diogo Resende drese...@thinkdigital.pt ThinkDigital On Thu, 2009-12-03 at 12:30 -0500, Fergus Henderson wrote: On Thu, Dec 3, 2009 at 7:32 AM, Diogo Resende drese...@thinkdigital.pt wrote: I agree 100%. Still, I think the access to the mic and the speech recognition could be separated. While it would be possible to separate access to the microphone and speech recognition, combining them allows the API to abstract away details of the implementation that would otherwise have to be exposed, in particular the audio encoding(s) used, and whether the audio is streamed to the recognizer or sent in a single chunk. If we don't provide general access to the microphone, the speech recognition API can be simpler, implementors will have more flexibility, and implementations can be simpler and smaller because they won't have to deal with conversions between different audio encodings. So I'm in favour of not separating out access to the microphone, at least in v1 of the API. -- Fergus Henderson fer...@google.com
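The two-step flow argued for in this exchange can be sketched concretely. Everything below is hypothetical: the function names (recordAudio, speechToText, saveClip) and the clip object's shape are stand-ins for APIs that did not exist when this thread was written, stubbed so the pipeline runs.

```javascript
// Step 1: stand-in for a permission-gated recording API that yields an
// opaque audio clip (the encoding details Fergus mentions stay hidden).
function recordAudio() {
  return { mimeType: "audio/wav", durationMs: 1500 };
}

// Step 2a: stand-in recognizer that accepts any recorded clip.
function speechToText(clip) {
  return clip.durationMs > 0 ? "note to self" : "";
}

// Step 2b: because the steps are separate, the same clip could instead be
// saved locally or uploaded/streamed to a server.
function saveClip(clip, store) {
  store.push(clip);
  return store.length;
}

const clip = recordAudio();
const transcript = speechToText(clip); // one app recognizes the clip...
const saved = [];
saveClip(clip, saved);                 // ...another just keeps the voice memo
```

The design tension in the thread shows up directly here: the separated version forces a concrete clip representation (encoding, chunking) into the public API, which is exactly what Fergus's combined API avoids.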
[whatwg] Web API for speech recognition and synthesis
We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
Re: [whatwg] Web API for speech recognition and synthesis
On Wed, Dec 2, 2009 at 3:32 AM, Bjorn Bringert bring...@google.com wrote: We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson Short answer: Yes, very :) Longer answer: APIs for accessing the microphone and camera are something that I think is very needed. There are several aspects to this, ranging from simply uploading video/audio clips using an input type=file element, to streaming APIs that allow video/audio conferencing using a browser, to being able to do video/audio processing/playback inside the browser. There's a ton of work here to be done; any help you're willing to give would be hugely appreciated. / Jonas
Re: [whatwg] Web API for speech recognition and synthesis
Is speech support a feature of the web page, or the web browser? On Wed, Dec 2, 2009 at 12:32 PM, Bjorn Bringert bring...@google.com wrote: We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
Re: [whatwg] Web API for speech recognition and synthesis
I think that it would be best to extend the browser with a JavaScript speech API intended for use by web apps. That is, only web apps that use the speech API would have speech support. But it should be possible to use such an API to write browser extensions (using Greasemonkey, Chrome extensions etc) that allow speech control of the browser and speech synthesis of web page contents. Doing it the other way around seems like it would reduce the flexibility for web app developers. /Bjorn On Wed, Dec 2, 2009 at 4:55 PM, Mike Hearn m...@plan99.net wrote: Is speech support a feature of the web page, or the web browser? On Wed, Dec 2, 2009 at 12:32 PM, Bjorn Bringert bring...@google.com wrote: We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902 -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
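The extension use case Bjorn describes can be sketched as a tiny user script that speaks page content aloud. Both the navigator.speech API and the stub page below are invented for illustration; only the idea (a Greasemonkey/Chrome-extension-style script built on top of a web-app speech API) comes from the message above.

```javascript
// Stub for the hypothetical browser-provided speech object.
const navigator = {
  speech: {
    speak(text, callback, options) {
      // A real browser would synthesize audio; the stub records the text.
      callback({ spokenText: text });
    },
  },
};

// Stub DOM fragment standing in for the page the "extension" runs on.
const document = {
  title: "Example News",
  querySelector: (sel) => (sel === "h1" ? { textContent: "Top story" } : null),
};

// The user script: read the page title and main heading aloud.
let spoken = null;
const heading = document.querySelector("h1");
navigator.speech.speak(
  document.title + ". " + heading.textContent,
  (r) => { spoken = r.spokenText; },
  { language: "en-US" }
);
```

This is the direction Bjorn argues for: the speech API is exposed to pages, and browser-level speech control falls out of it via extensions, rather than the other way around.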
Re: [whatwg] Web API for speech recognition and synthesis
On Wed, Dec 2, 2009 at 9:17 AM, Bjorn Bringert bring...@google.com wrote: I think that it would be best to extend the browser with a JavaScript speech API intended for use by web apps. That is, only web apps that use the speech API would have speech support. But it should be possible to use such an API to write browser extensions (using Greasemonkey, Chrome extensions etc) that allow speech control of the browser and speech synthesis of web page contents. Doing it the other way around seems like it would reduce the flexibility for web app developers. Hmm.. I guess I misunderstood your original proposal. Do you want the browser to expose an API that converts speech to text? Or do you want the browser to expose access to the microphone so that you can do speech to text conversion in JavaScript? If the former, could you describe your use cases in more detail? / Jonas
Re: [whatwg] Web API for speech recognition and synthesis
I misunderstood too. It would be great to have the ability to access the microphone and record+upload or stream sound to the web server. -- D. On Wed, 2009-12-02 at 10:04 -0800, Jonas Sicking wrote: On Wed, Dec 2, 2009 at 9:17 AM, Bjorn Bringert bring...@google.com wrote: I think that it would be best to extend the browser with a JavaScript speech API intended for use by web apps. That is, only web apps that use the speech API would have speech support. But it should be possible to use such an API to write browser extensions (using Greasemonkey, Chrome extensions etc) that allow speech control of the browser and speech synthesis of web page contents. Doing it the other way around seems like it would reduce the flexibility for web app developers. Hmm.. I guess I misunderstood your original proposal. Do you want the browser to expose an API that converts speech to text? Or do you want the browser to expose access to the microphone so that you can do speech to text conversion in JavaScript? If the former, could you describe your use cases in more detail? / Jonas
Re: [whatwg] Web API for speech recognition and synthesis
I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. /Bjorn On Wed, Dec 2, 2009 at 6:23 PM, Diogo Resende drese...@thinkdigital.pt wrote: I missunderstood too. It would be great to have the ability to access the microphone and record+upload or stream sound to the web server. -- D. On Wed, 2009-12-02 at 10:04 -0800, Jonas Sicking wrote: On Wed, Dec 2, 2009 at 9:17 AM, Bjorn Bringert bring...@google.com wrote: I think that it would be best to extend the browser with a JavaScript speech API intended for use by web apps. That is, only web apps that use the speech API would have speech support. But it should be possible to use such an API to write browser extensions (using Greasemonkey, Chrome extensions etc) that allow speech control of the browser and speech synthesis of web page contents. Doing it the other way around seems like it would reduce the flexibility for web app developers. Hmm.. I guess I misunderstood your original proposal. Do you want the browser to expose an API that converts speech to text? Or do you want the browser to expose access to the microphone so that you can do speech to text convertion in javascript? If the former, could you describe your use cases in more detail? / Jonas -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
Re: [whatwg] Web API for speech recognition and synthesis
On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert bring...@google.com wrote: We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson This exists already, but only Opera supports it, although there are problems with the library we use for speech recognition. http://www.w3.org/TR/xhtml+voice/ http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/ Would be nice to revive that specification and get vendor buy-in. -- João Eiras Core Developer, Opera Software ASA, http://www.opera.com/
Re: [whatwg] Web API for speech recognition and synthesis
If you're able to read from the mic, you don't need to upload. You could save it locally (for example for voice memos). The read+upload was just 2 steps I suggested instead of direct streaming. Speech recognition could be done separately. One could use the mic to capture a voice note. Another could use the speech recognition without the mic (saved file?). Divide and conquer :) -- Diogo Resende drese...@thinkdigital.pt ThinkDigital On Wed, 2009-12-02 at 19:17 +, Bjorn Bringert wrote: I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. /Bjorn On Wed, Dec 2, 2009 at 6:23 PM, Diogo Resende drese...@thinkdigital.pt wrote: I misunderstood too. It would be great to have the ability to access the microphone and record+upload or stream sound to the web server. -- D. On Wed, 2009-12-02 at 10:04 -0800, Jonas Sicking wrote: On Wed, Dec 2, 2009 at 9:17 AM, Bjorn Bringert bring...@google.com wrote: I think that it would be best to extend the browser with a JavaScript speech API intended for use by web apps. That is, only web apps that use the speech API would have speech support. But it should be possible to use such an API to write browser extensions (using Greasemonkey, Chrome extensions etc) that allow speech control of the browser and speech synthesis of web page contents. Doing it the other way around seems like it would reduce the flexibility for web app developers. Hmm.. I guess I misunderstood your original proposal. 
Do you want the browser to expose an API that converts speech to text? Or do you want the browser to expose access to the microphone so that you can do speech to text conversion in JavaScript? If the former, could you describe your use cases in more detail? / Jonas
Re: [whatwg] Web API for speech recognition and synthesis
On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote: I agree that being able to capture and upload audio to a server would be useful for a lot of applications, and it could be used to do speech recognition. However, for a web app developer who just wants to develop an application that uses speech input and/or output, it doesn't seem very convenient, since it requires server-side infrastructure that is very costly to develop and run. A speech-specific API in the browser gives browser implementors the option to use on-device speech services provided by the OS, or server-side speech synthesis/recognition. Again, it would help a lot if you could provide use cases and requirements. This helps both with designing an API, as well as evaluating if the use cases are common enough that a dedicated API is the best solution. / Jonas
Re: [whatwg] Web API for speech recognition and synthesis
We're envisaging a simpler programmatic API that looks familiar to the modern Web developer but one which avoids the legacy of dialog system languages. Dave On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote: On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert bring...@google.com wrote: We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson This exists already, but only Opera supports it, although there are problems with the library we use for speech recognition. http://www.w3.org/TR/xhtml+voice/ http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/ Would be nice to revive that specification and get vendor buy-in. -- João Eiras Core Developer, Opera Software ASA, http://www.opera.com/
Re: [whatwg] Web API for speech recognition and synthesis
On Thu, 03 Dec 2009 01:50:20 +0100, Dave Burke davebu...@google.com wrote: We're envisaging a simpler programmatic API that looks familiar to the modern Web developer but one which avoids the legacy of dialog system languages. OK. I referenced XHTML+Voice because there is already a specification with markup, CSS 2 aural stylesheets and JavaScript APIs, and one implementation. I'm quite sure someone can revisit this whole issue, and refactor the XHTML+Voice specification into something more acceptable and implementable. I don't think anyone would implement it the way it is. Dave On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote: On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert bring...@google.com wrote: We've been watching our colleagues build native apps that use speech recognition and speech synthesis, and would like to have JavaScript APIs that let us do the same in web apps. We are thinking about creating a lightweight and implementation-independent API that lets web apps use speech services. Is anyone else interested in that? Bjorn Bringert, David Singleton, Gummi Hafsteinsson This exists already, but only Opera supports it, although there are problems with the library we use for speech recognition. http://www.w3.org/TR/xhtml+voice/ http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/ Would be nice to revive that specification and get vendor buy-in. -- João Eiras Core Developer, Opera Software ASA, http://www.opera.com/