[whatwg] Web API for speech recognition and synthesis

2009-12-26 Thread Peter Parente
 We've been watching our colleagues build native apps that use speech
 recognition and speech synthesis, and would like to have JavaScript
 APIs that let us do the same in web apps. We are thinking about
 creating a lightweight and implementation-independent API that lets
 web apps use speech services. Is anyone else interested in that?

 Bjorn Bringert, David Singleton, Gummi Hafsteinsson

I am interested in a JavaScript API for text-to-speech synthesis at
least. It would be a great help in creating more usable web
applications for people with visual impairments (i.e., self-voicing
web applications instead of screen reading). It could also enable a
slew of new web apps in mobile, eyes-busy situations (e.g., my smart
phone reads me my web mail, twitter feed, what-have-you, while I'm
driving).

Some folks working on enabling technologies at the Univ. of North
Carolina built Outfox (http://code.google.com/p/outfox/) as a
proof-of-concept JS interface to text-to-speech engines on Mac, Windows, and
Linux. It's Firefox-only, but might be worth a look.

Pete


[whatwg] Web API for speech recognition and synthesis

2009-12-16 Thread Deborah Dahl
(resending to include the whatwg list, sorry for multiple postings)
Hi Olli,
Thank you for bringing this interesting thread to the Multimodal
Interaction Working Group's attention.
The working group is in fact very active. Although it is chartered as 
W3C Member-only, we do have a public mailing list, www-multimo...@w3.org, 
available for public discussions. 

In general, we would be very interested in hearing about the kinds of use 
cases for speech recognition and TTS in a browser context that you would 
like to handle. The Multimodal Architecture is our primary draft spec 
that addresses using speech in web pages (although it also addresses 
other modes of input, such as handwriting). A new Working Draft has just 
been published and we would be very interested 
in getting feedback on it. In my opinion, it's probably focused more on 
distributed architectures than on the use cases you might be interested 
in, but we would like our specs to be comprehensive enough to be able to 
address both server-based and client-based speech processing. 

We would also be interested in general discussions of questions about
multimodality. 

Here are some pointers that may be useful.
MMI page: http://www.w3.org/2002/mmi/
MMI Architecture spec: http://www.w3.org/TR/2009/WD-mmi-arch-20091201/

best regards,

Debbie Dahl, MMI Working Group Chair


 -Original Message-
 From: Olli Pettay [mailto:olli.pet...@helsinki.fi] 
 Sent: Friday, December 11, 2009 4:14 PM
 To: Bjorn Bringert
 Cc: o...@pettay.fi; Dave Burke; João Eiras; whatwg; David 
 Singleton; Gudmundur Hafsteinsson; westonru...@gmail.com; 
 www-multimo...@w3.org; Deborah Dahl
 Subject: Re: [whatwg] Web API for speech recognition and synthesis
 
 On 12/11/09 6:05 AM, Bjorn Bringert wrote:
  Thanks for the discussion - cool to see more interest today also
  (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)
 
  I've hacked up a proof-of-concept JavaScript API for speech
  recognition and synthesis. It adds a navigator.speech object with
  these functions:
 
  void listen(ListenCallback callback, ListenOptions options);
  void speak(DOMString text, SpeakCallback callback, SpeakOptions options);
 
 
 So if I read the examples correctly, you're not using grammars 
 anywhere. I wonder how well that works in real-world cases. Of course if
 the speech recognizer can handle everything well without grammars, the
 result validation could be done in JS after the result is received from the
 recognizer. But I think having support for grammars simplifies coding
 and can make speech dialogs somewhat more manageable.
 
 W3C has already standardized things like
 http://www.w3.org/TR/speech-grammar/ and
 http://www.w3.org/TR/semantic-interpretation/
 and the latter one works quite nicely with JS.
 
 Again, I think this kind of discussion should happen in the W3C 
 multimodal WG. Though I'm not sure how actively or how openly that 
 working group works atm.
 
 -Olli
 
 
 
  The implementation uses an NPAPI plugin for the Android browser that
  wraps the existing Android speech APIs. The code is available at
  http://code.google.com/p/speech-api-browser-plugin/
 
  There are some simple demo apps in
  
  http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/
  including:
 
  - English to Spanish speech-to-speech translation
  - Google search by speaking a query
  - The obligatory pizza ordering system
  - A phone number dialer
 
  Comments appreciated!
 
  /Bjorn
 
  On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay olli.pet...@helsinki.fi wrote:
  Indeed the API should be something significantly simpler than X+V.
  Microsoft has (had?) support for SALT. That API is pretty 
 simple and
  provides speech recognition and TTS.
  The API could be probably even simpler than SALT.
  IIRC, there was an extension for Firefox to support SALT 
 (well, there was
  also an extension to support X+V).
 
  If the platform/OS provides ASR and TTS, adding a JS API 
 for it should
  be pretty simple. X+V tries to handle some logic using 
 VoiceXML FIA, but
  I think it would be more web-like to give pure JS API 
 (similar to SALT).
  Integrating visual and voice input could be done in 
 scripts. I'd assume
  there would be some script libraries to handle multimodal 
 input integration
  - especially if there will be touch and gestures events 
 too etc. (Classic
  multimodal map applications will become possible in web.)
 
  But this all is something which should be possibly 
 designed in or with W3C
  multimodal working group. I know their current 
 architecture is way more
  complex, but X+V, SALT and even Multimodal-CSS have been 
 discussed in that
  working group.
 
 
  -Olli
 
 
 
  On 12/3/09 2:50 AM, Dave Burke wrote:
 
  We're envisaging a simpler programmatic API that looks 
 familiar to the
  modern Web developer but one which avoids the legacy of 
 dialog system
  languages.
 
  Dave
 
  On Wed, Dec 2, 2009 at 7:25

Re: [whatwg] Web API for speech recognition and synthesis

2009-12-15 Thread Bjorn Bringert
It seems like there is enough interest in speech to start developing
experimental implementations. There appear to be two general
directions that we could take:

- A general microphone API + streaming API + <audio> tag
  - Pro: Useful for non-speech recognition / synthesis applications,
    e.g. audio chat, sound recording.
  - Pro: Allows JavaScript libraries for third-party network speech services,
    e.g. an AJAX API for Google's speech services. Web app developers
    that don't have their own speech servers could use that.
  - Pro: Consistent recognition / synthesis user experience across
    user agents in the same web app.
  - Con: No support for on-device recognition / synthesis, only
    network services.
  - Con: Varying recognition / synthesis user experience across
    different web apps in a single user agent.
  - Con: Possibly higher overhead because the audio data needs to
    pass through JavaScript.
  - Con: Requires dealing with audio encodings, endpointing, buffer
    sizes, etc. in the microphone API.

- A speech-specific, back-end-neutral API
  - Pro: Simple API, basically just two methods: listen() and speak().
  - Pro: Can use local recognition / synthesis.
  - Pro: Consistent recognition / synthesis user experience across
    different web apps in a single user agent.
  - Con: Varying recognition / synthesis user experience across user
    agents in the same web app.
  - Con: Only works for speech, not general audio.
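
For concreteness, here is a rough sketch of what the first option might
look like from a web app's point of view. Everything in it is invented
for illustration (getMicrophone, onaudio, the recognizer URL); no
concrete microphone API has actually been proposed yet:

// Hypothetical sketch only; none of these names are real or proposed APIs.
navigator.getMicrophone(function (mic) {
  var chunks = [];                          // raw audio buffers
  mic.onaudio = function (buffer) { chunks.push(buffer); };
  mic.start();
  setTimeout(function () {                  // record for ~3 seconds
    mic.stop();
    var xhr = new XMLHttpRequest();         // post audio to a speech service
    xhr.open("POST", "http://speech.example.com/recognize", true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState == 4) alert("You said: " + xhr.responseText);
    };
    xhr.send(chunks.join(""));              // audio encoding glossed over
  }, 3000);
});

Note how all the encoding, endpointing and buffering issues listed above
end up in the page's lap with this approach.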

/Bjorn

On Sun, Dec 13, 2009 at 6:46 PM, Ian McGraw imcg...@mit.edu wrote:
 I'm new to this list, but as a speech scientist and web developer, I wanted
 to add my 2 cents.  Personally, I believe the future of speech recognition
 is in the cloud.
 Here are two services which provide Javascript APIs for speech recognition
 (and TTS) today:
 http://wami.csail.mit.edu/
 http://www.research.att.com/projects/SpeechMashup/index.html
 Both of these are research systems, and as such they are really just
 proof-of-concepts.
 That said, Wami's JSONP-like implementation allows Quizlet.com to use speech
 recognition today on a relatively large scale, with just a few lines of
 Javascript code:
 http://quizlet.com/voicetest/415/?scatter
 Since there are a lot of Google folks on this list, I recommend you talk to
 Alex Gruenstein (in your speech group) who was one of the lead developers of
 WAMI while at MIT.
 The major limitation we found when building the system was that we had to
 develop a new audio controller for every client (Java for the desktop,
 custom browsers for iPhone and Android).  It would have been much simpler if
 browsers came with standard microphone capture and audio streaming
 capabilities.
 -Ian


 On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter westonru...@gmail.com
 wrote:

 I blogged yesterday about this topic (including a text-to-speech demo
 using HTML5 Audio and Google Translate's TTS service); the more relevant
 part for this thread:

 I am really excited at the prospect of text-to-speech being made
 available on
 the Web! It's just too bad that fetching MP3s from a remote web service is
 the
 only standard way of doing so currently; modern operating systems all
 have TTS
 capabilities, so it's a shame that web apps can't utilize them via
 client-side scripting. I posted to the WHATWG mailing list about such a
 Text-To-Speech (TTS) Web API for JavaScript, and I was directed to a
 recent
 thread about a Web API for speech recognition and synthesis.

 Perhaps there is some momentum building here? Having TTS available in the
 browser would boost accessibility for the seeing-impaired and improve
 usability
 for people on-the-go. TTS is just another technology that has
 traditionally been
 relegated to desktop applications, but as the open Web advances as the
 preferred
 platform for application development, it is an essential service to make
 available (as with Geolocation API, Device API, etc.). And besides, I
 want to
 build TTS applications and my motto is: "If it can't be done on the open
 web,
 it's not worth doing at all!"

 http://weston.ruter.net/projects/google-tts/

 Weston

 On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter westonru...@gmail.com
 wrote:

 I was just alerted about this thread from my post "Text-To-Speech (TTS)
 Web API for JavaScript" at
 http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html.
 Amazing how shared ideas like these seem to arise independently at the same
 time.

 I have a use-case and an additional requirement, that the time indices be
 made available for when each word is spoken in the TTS-generated audio:

 I've been working on a web app which reads text in a web page,
 highlighting each word as it is read. For this to be possible, a
 Text-To-Speech API is needed which is able to:
 (1) generate the speech audio from some text, and
 (2) include the time indices for when each of the words in the text is
 spoken.

 I foresee that 

Re: [whatwg] Web API for speech recognition and synthesis

2009-12-15 Thread Ian Hickson
On Tue, 15 Dec 2009, Bjorn Bringert wrote:
 
  - A general microphone API + streaming API + <audio> tag
   - Pro: Useful for non-speech recognition / synthesis applications.
E.g. audio chat, sound recording.
   - Pro: Allows JavaScript libraries for third-party network speech services.
E.g. an AJAX API for Google's speech services. Web app developers
that don't have their own speech servers could use that.
   - Pro: Consistent recognition / synthesis user experience across
 user agents in the same web app.
   - Con: No support for on-device recognition / synthesis, only
 network services.
   - Con: Varying recognition / synthesis user experience across
 different web apps in a single user agent.
   - Con: Possibly higher overhead because the audio data needs to
 pass through JavaScript.
   - Con: Requires dealing with audio encodings, endpointing, buffer
 sizes etc in the microphone API.

FWIW I've started looking at this kind of thing in general (for audio and 
video -- see <device> in the spec for the first draft ideas), since it'll 
be required for other things as well. However, that shouldn't be taken as 
a sign that the other approach shouldn't also be examined.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-15 Thread Ian McGraw
Great!  As I've said, I'm definitely biased towards this approach.  As Bjorn
hinted, AJAX APIs could be developed with all sorts of interesting features
that will never make it down into the browser, e.g. pronunciation
assessment, speech therapy, all those lie-detector apps for your phone :-).
Still, I think that we're missing the biggest pro:

- Pro:  Speech recognition technology is data-driven.  Improvements in the
underlying technology are far more likely to occur with a network-driven
approach.

To be fair, with that, you have to add a con:

- Con:  Less privacy.

-Ian

On Tue, Dec 15, 2009 at 3:37 PM, Ian Hickson i...@hixie.ch wrote:

 On Tue, 15 Dec 2009, Bjorn Bringert wrote:
 
   - A general microphone API + streaming API + <audio> tag
- Pro: Useful for non-speech recognition / synthesis applications.
 E.g. audio chat, sound recording.
- Pro: Allows JavaScript libraries for third-party network speech
 services.
 E.g. an AJAX API for Google's speech services. Web app
 developers
 that don't have their own speech servers could use that.
- Pro: Consistent recognition / synthesis user experience across
  user agents in the same web app.
- Con: No support for on-device recognition / synthesis, only
  network services.
- Con: Varying recognition / synthesis user experience across
  different web apps in a single user agent.
- Con: Possibly higher overhead because the audio data needs to
  pass through JavaScript.
- Con: Requires dealing with audio encodings, endpointing, buffer
  sizes etc in the microphone API.

 FWIW I've started looking at this kind of thing in general (for audio and
 video -- see <device> in the spec for the first draft ideas), since it'll
 be required for other things as well. However, that shouldn't be taken as
 a sign that the other approach shouldn't also be examined.

 --
 Ian Hickson   U+1047E)\._.,--,'``.fL
 http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
 Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'



Re: [whatwg] Web API for speech recognition and synthesis

2009-12-15 Thread Tran, Dzung D
Currently the W3C Device API WG is working on a Capture API which will include 
microphone capture and audio streaming capabilities. The current draft is at: 
http://dev.w3.org/2009/dap/camera/

It is pretty rough and still a work in progress; for instance, streaming is 
not there yet.

Thanks
Dzung Tran

On Sun, Dec 13, 2009 at 6:46 PM, Ian McGraw imcg...@mit.edu wrote:
 I'm new to this list, but as a speech scientist and web developer, I wanted
 to add my 2 cents. Personally, I believe the future of speech recognition
 is in the cloud.
 Here are two services which provide Javascript APIs for speech recognition
 (and TTS) today:
 http://wami.csail.mit.edu/
 http://www.research.att.com/projects/SpeechMashup/index.html
 Both of these are research systems, and as such they are really just
 proof-of-concepts.
 That said, Wami's JSONP-like implementation allows Quizlet.com to use speech
 recognition today on a relatively large scale, with just a few lines of
 Javascript code:
 http://quizlet.com/voicetest/415/?scatter
 Since there are a lot of Google folks on this list, I recommend you talk to
 Alex Gruenstein (in your speech group) who was one of the lead developers of
 WAMI while at MIT.
 The major limitation we found when building the system was that we had to
 develop a new audio controller for every client (Java for the desktop,
 custom browsers for iPhone and Android). It would have been much simpler if
 browsers came with standard microphone capture and audio streaming
 capabilities.
 -Ian


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-13 Thread Ian McGraw
I'm new to this list, but as a speech scientist and web developer, I wanted
to add my 2 cents.  Personally, I believe the future of speech recognition
is in the cloud.

Here are two services which provide Javascript APIs for speech recognition
(and TTS) today:

http://wami.csail.mit.edu/
http://www.research.att.com/projects/SpeechMashup/index.html

Both of these are research systems, and as such they are really just
proof-of-concepts.
That said, Wami's JSONP-like implementation allows Quizlet.com to use speech
recognition today on a relatively large scale, with just a few lines of
Javascript code:

http://quizlet.com/voicetest/415/?scatter

Since there are a lot of Google folks on this list, I recommend you talk to
Alex Gruenstein (in your speech group) who was one of the lead developers of
WAMI while at MIT.

The major limitation we found when building the system was that we had to
develop a new audio controller for every client (Java for the desktop,
custom browsers for iPhone and Android).  It would have been much simpler if
browsers came with standard microphone capture and audio streaming
capabilities.

-Ian


On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter westonru...@gmail.com wrote:

 I blogged yesterday about this topic (including a text-to-speech demo using
 HTML5 Audio and Google Translate's TTS service); the more relevant part for
 this thread: http://weston.ruter.net/projects/google-tts/

 I am really excited at the prospect of text-to-speech being made available
 on
 the Web! It's just too bad that fetching MP3s from a remote web service is
 the
 only standard way of doing so currently; modern operating systems all have
 TTS
 capabilities, so it's a shame that web apps can't utilize them via
 client-side scripting. I posted to the WHATWG mailing list about such a
 Text-To-Speech (TTS) Web API for JavaScript, and I was directed to a
 recent
 thread about a Web API for speech recognition and synthesis.

 Perhaps there is some momentum building here? Having TTS available in the
 browser would boost accessibility for the seeing-impaired and improve
 usability
 for people on-the-go. TTS is just another technology that has
 traditionally been
 relegated to desktop applications, but as the open Web advances as the
 preferred
 platform for application development, it is an essential service to make
 available (as with Geolocation API, Device API, etc.). And besides, I want
 to
 build TTS applications and my motto is: "If it can't be done on the open
 web,
 it's not worth doing at all!"


 http://weston.ruter.net/projects/google-tts/

 Weston

 On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter westonru...@gmail.com wrote:

 I was just alerted about this thread from my post "Text-To-Speech (TTS)
 Web API for JavaScript" at 
 http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html.
 Amazing how shared ideas like these seem to arise independently at the same
 time.

 I have a use-case and an additional requirement, that the time indices be
 made available for when each word is spoken in the TTS-generated audio:

 I've been working on a web app which reads text in a web page,
 highlighting each word as it is read. For this to be possible, a
 Text-To-Speech API is needed which is able to:
 (1) generate the speech audio from some text, and
 (2) include the time indices for when each of the words in the text is
 spoken.


 I foresee that a TTS API should integrate closely with the HTML5 Audio
 API. For example, invoking a call to the API could return a TTS object
 which has an instance of Audio, whose interface could be used to navigate
 through the TTS output. For example:

 var tts = new TextToSpeech("Hello, World!");
 tts.audio.addEventListener("canplaythrough", function(e){
   // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
   //                 {startTime:500, endTime:1000, text:"World"}]
 }, false);
 tts.read(); // invokes tts.audio.play()

 What would be even cooler, is if the parameter passed to the TextToSpeech
 constructor could be an Element or TextNode, and the indices would then
 include a DOM Range in addition to the text property. A flag could also be
 set which would result in each of these DOM ranges being selected when it is
 read. For example:

 var tts = new TextToSpeech(document.querySelector("article"));
 tts.selectRangesOnRead = true;
 tts.audio.addEventListener("canplaythrough", function(e){
   /*
   tts.indices == [
     {startTime:0, endTime:500, text:"Hello", range:Range},
     {startTime:500, endTime:1000, text:"World", range:Range}
   ]
   */
 }, false);
 tts.read();

 In addition to the events fired by the Audio API, more events could be
 fired when reading TTS, such as a "readrange" event whose event object would
 include the index (startTime, endTime, text, range) for the range currently
 being spoken. Such functionality would make the ability to read along with
 the text trivial.

 What do you think?
 Weston


 On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert 

Re: [whatwg] Web API for speech recognition and synthesis

2009-12-11 Thread Bjorn Bringert
Thanks for the discussion - cool to see more interest today also
(http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)

I've hacked up a proof-of-concept JavaScript API for speech
recognition and synthesis. It adds a navigator.speech object with
these functions:

void listen(ListenCallback callback, ListenOptions options);
void speak(DOMString text, SpeakCallback callback, SpeakOptions options);
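
For example, a page might implement search by voice roughly like this
(the shape of the result object and the empty options are guesses for
illustration; only the two signatures above are actually in the plugin):

navigator.speech.listen(function (result) {
  var query = result.text;                  // assumed result field
  navigator.speech.speak("Searching for " + query, function () {
    window.location = "http://www.google.com/search?q=" +
                      encodeURIComponent(query);
  }, {});
}, {});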

The implementation uses an NPAPI plugin for the Android browser that
wraps the existing Android speech APIs. The code is available at
http://code.google.com/p/speech-api-browser-plugin/

There are some simple demo apps in
http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/
including:

- English to Spanish speech-to-speech translation
- Google search by speaking a query
- The obligatory pizza ordering system
- A phone number dialer

Comments appreciated!

/Bjorn

On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay olli.pet...@helsinki.fi wrote:
 Indeed the API should be something significantly simpler than X+V.
 Microsoft has (had?) support for SALT. That API is pretty simple and
 provides speech recognition and TTS.
 The API could be probably even simpler than SALT.
 IIRC, there was an extension for Firefox to support SALT (well, there was
 also an extension to support X+V).

 If the platform/OS provides ASR and TTS, adding a JS API for it should
 be pretty simple. X+V tries to handle some logic using VoiceXML FIA, but
 I think it would be more web-like to give pure JS API (similar to SALT).
 Integrating visual and voice input could be done in scripts. I'd assume
 there would be some script libraries to handle multimodal input integration
 - especially if there will be touch and gestures events too etc. (Classic
 multimodal map applications will become possible in web.)

 But this all is something which should be possibly designed in or with W3C
 multimodal working group. I know their current architecture is way more
 complex, but X+V, SALT and even Multimodal-CSS have been discussed in that
 working group.


 -Olli



 On 12/3/09 2:50 AM, Dave Burke wrote:

 We're envisaging a simpler programmatic API that looks familiar to the
 modern Web developer but one which avoids the legacy of dialog system
 languages.

 Dave

 On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote:

    On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert
     bring...@google.com wrote:

        We've been watching our colleagues build native apps that use
 speech
        recognition and speech synthesis, and would like to have JavaScript
        APIs that let us do the same in web apps. We are thinking about
        creating a lightweight and implementation-independent API that lets
        web apps use speech services. Is anyone else interested in that?

        Bjorn Bringert, David Singleton, Gummi Hafsteinsson


    This exists already, but only Opera supports it, although there are
    problems with the library we use for speech recognition.

    http://www.w3.org/TR/xhtml+voice/

  http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/

    Would be nice to revive that specification and get vendor buy-in.



    --

    João Eiras
    Core Developer, Opera Software ASA, http://www.opera.com/







-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-11 Thread Weston Ruter
I was just alerted about this thread from my post "Text-To-Speech (TTS) Web
API for JavaScript" at 
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html.
Amazing how shared ideas like these seem to arise independently at the same
time.

I have a use-case and an additional requirement, that the time indices be
made available for when each word is spoken in the TTS-generated audio:

I've been working on a web app which reads text in a web page, highlighting
 each word as it is read. For this to be possible, a Text-To-Speech API is
 needed which is able to:
 (1) generate the speech audio from some text, and
 (2) include the time indices for when each of the words in the text is
 spoken.


I foresee that a TTS API should integrate closely with the HTML5 Audio API.
For example, invoking a call to the API could return a TTS object which
has an instance of Audio, whose interface could be used to navigate through
the TTS output. For example:

var tts = new TextToSpeech("Hello, World!");
tts.audio.addEventListener("canplaythrough", function(e){
  // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
  //                 {startTime:500, endTime:1000, text:"World"}]
}, false);
tts.read(); // invokes tts.audio.play()

What would be even cooler, is if the parameter passed to the TextToSpeech
constructor could be an Element or TextNode, and the indices would then
include a DOM Range in addition to the text property. A flag could also be
set which would result in each of these DOM ranges being selected when it is
read. For example:

var tts = new TextToSpeech(document.querySelector("article"));
tts.selectRangesOnRead = true;
tts.audio.addEventListener("canplaythrough", function(e){
  /*
  tts.indices == [
    {startTime:0, endTime:500, text:"Hello", range:Range},
    {startTime:500, endTime:1000, text:"World", range:Range}
  ]
  */
}, false);
tts.read();

In addition to the events fired by the Audio API, more events could be fired
when reading TTS, such as a "readrange" event whose event object would
include the index (startTime, endTime, text, range) for the range currently
being spoken. Such functionality would make the ability to read along with
the text trivial.
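
For example, a read-along handler might look like this, reusing the tts
object from the example above (the readrange event and its index
property are of course only proposed here, not implemented anywhere):

tts.addEventListener("readrange", function (e) {
  var sel = window.getSelection();
  sel.removeAllRanges();
  sel.addRange(e.index.range);   // select the word currently being spoken
}, false);
tts.read();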

What do you think?
Weston


On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert bring...@google.com wrote:

 On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote:
  On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com
 wrote:
  I agree that being able to capture and upload audio to a server would
  be useful for a lot of applications, and it could be used to do speech
  recognition. However, for a web app developer who just wants to
  develop an application that uses speech input and/or output, it
  doesn't seem very convenient, since it requires server-side
  infrastructure that is very costly to develop and run. A
  speech-specific API in the browser gives browser implementors the
  option to use on-device speech services provided by the OS, or
  server-side speech synthesis/recognition.
 
   Again, it would help a lot if you could provide use cases and
  requirements. This helps both with designing an API, as well as
  evaluating if the use cases are common enough that a dedicated API is
  the best solution.
 
  / Jonas

 I'm mostly thinking about speech web apps for mobile devices. I think
 that's where speech makes most sense as an input and output method,
 because of the poor keyboards, small screens, and frequent hands/eyes
 busy situations (e.g. while driving). Accessibility is the other big
 reason for using speech.

 Some ideas for use cases:

 - Search by speaking a query
 - Speech-to-speech translation
 - Voice Dialing (could open a tel: URI to actually make the call)
 - Dialog systems (e.g. the canonical pizza ordering system)
 - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
 Chrome extensions) for using speech with any web site, e.g, for
 accessibility.

 Requirements:

 - Web app developer side:
   - Allows both speech recognition and synthesis.
   - Easy to use API. Makes simple things easy and advanced things possible.
   - Doesn't require web app developer to develop / run his own speech
 recognition / synthesis servers.
   - (Natural) language-neutral API.
   - Allows developer-defined application specific grammars / language
 models.
   - Allows multilingual applications.
   - Allows easy localization of speech apps.

 - Implementor side:
   - Easy enough to implement that it can get wide adoption in browsers.
   - Allows implementor to use either client-side or server-side
 recognition and synthesis.

 --
 Bjorn Bringert
 Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
 Palace Road, London, SW1W 9TQ
 Registered in England Number: 3977902



Re: [whatwg] Web API for speech recognition and synthesis

2009-12-11 Thread Olli Pettay

(Sending this 2nd time. Hopefully whatwg list doesn't bounce it back.)

On 12/11/09 6:05 AM, Bjorn Bringert wrote:

Thanks for the discussion - cool to see more interest today also
(http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)

I've hacked up a proof-of-concept JavaScript API for speech
recognition and synthesis. It adds a navigator.speech object with
these functions:

void listen(ListenCallback callback, ListenOptions options);
void speak(DOMString text, SpeakCallback callback, SpeakOptions options);



So if I read the examples correctly, you're not using grammars anywhere.
I wonder how well that works in real-world cases. Of course if
the speech recognizer can handle everything well without grammars, the
result validation could be done in JS after the result is received from the
recognizer. But I think having support for grammars simplifies coding
and can make speech dialogs somewhat more manageable.

W3C has already standardized things like
http://www.w3.org/TR/speech-grammar/ and
http://www.w3.org/TR/semantic-interpretation/
and the latter one works quite nicely with JS.
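
For example, grammar support could look something like this (purely a
sketch; the grammar option and the structured result are invented here,
only listen() itself is from Bjorn's proposal):

navigator.speech.listen(function (result) {
  // With an SRGS grammar plus semantic interpretation, the recognizer
  // could return a structured value instead of a raw transcript, e.g.
  // result.interpretation == {size: "large", topping: "pepperoni"}
  handleOrder(result.interpretation);      // hypothetical app function
}, { grammar: "http://example.com/pizza-order.grxml" });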

Again, I think this kind of discussion should happen in the W3C multimodal 
WG. Though I'm not sure how actively or how openly that working group 
works atm.


-Olli




The implementation uses an NPAPI plugin for the Android browser that
wraps the existing Android speech APIs. The code is available at
http://code.google.com/p/speech-api-browser-plugin/

There are some simple demo apps in
http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/
including:

- English to Spanish speech-to-speech translation
- Google search by speaking a query
- The obligatory pizza ordering system
- A phone number dialer

Comments appreciated!

/Bjorn

On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay olli.pet...@helsinki.fi wrote:

Indeed the API should be something significantly simpler than X+V.
Microsoft has (had?) support for SALT. That API is pretty simple and
provides speech recognition and TTS.
The API could be probably even simpler than SALT.
IIRC, there was an extension for Firefox to support SALT (well, there was
also an extension to support X+V).

If the platform/OS provides ASR and TTS, adding a JS API for it should
be pretty simple. X+V tries to handle some logic using VoiceXML FIA, but
I think it would be more web-like to give pure JS API (similar to SALT).
Integrating visual and voice input could be done in scripts. I'd assume
there would be some script libraries to handle multimodal input integration
- especially if there will be touch and gestures events too etc. (Classic
multimodal map applications will become possible in web.)

But this all is something which should be possibly designed in or with W3C
multimodal working group. I know their current architecture is way more
complex, but X+V, SALT and even Multimodal-CSS have been discussed in that
working group.


-Olli



On 12/3/09 2:50 AM, Dave Burke wrote:


We're envisaging a simpler programmatic API that looks familiar to the
modern Web developer but one which avoids the legacy of dialog system
languages.

Dave

On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote:

On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert
bring...@google.com wrote:

We've been watching our colleagues build native apps that use
speech
recognition and speech synthesis, and would like to have JavaScript
APIs that let us do the same in web apps. We are thinking about
creating a lightweight and implementation-independent API that lets
web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson


This exists already, but only Opera supports it, although there are
problems with the library we use for speech recognition.

http://www.w3.org/TR/xhtml+voice/

  http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/

Would be nice to revive that specification and get vendor buy-in.



--

João Eiras
Core Developer, Opera Software ASA, http://www.opera.com/



Re: [whatwg] Web API for speech recognition and synthesis

2009-12-04 Thread Olli Pettay

Indeed the API should be something significantly simpler than X+V.
Microsoft has (had?) support for SALT. That API is pretty simple and
provides speech recognition and TTS.
The API could be probably even simpler than SALT.
IIRC, there was an extension for Firefox to support SALT (well, there 
was also an extension to support X+V).


If the platform/OS provides ASR and TTS, adding a JS API for it should
be pretty simple. X+V tries to handle some logic using VoiceXML FIA, but
I think it would be more web-like to give pure JS API (similar to SALT).
Integrating visual and voice input could be done in scripts. I'd assume
there would be some script libraries to handle multimodal input 
integration - especially if there will be touch and gestures events too 
etc. (Classic multimodal map applications will become possible in web.)


But this all is something which should be possibly designed in or with 
W3C multimodal working group. I know their current architecture is way 
more complex, but X+V, SALT and even Multimodal-CSS have been discussed 
in that working group.



-Olli



On 12/3/09 2:50 AM, Dave Burke wrote:

We're envisaging a simpler programmatic API that looks familiar to the
modern Web developer but one which avoids the legacy of dialog system
languages.

Dave

On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote:

On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert
bring...@google.com wrote:

We've been watching our colleagues build native apps that use speech
recognition and speech synthesis, and would like to have JavaScript
APIs that let us do the same in web apps. We are thinking about
creating a lightweight and implementation-independent API that lets
web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson


This exists already, but only Opera supports it, although there are
problems with the library we use for speech recognition.

http://www.w3.org/TR/xhtml+voice/
http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/

Would be nice to revive that specification and get vendor buy-in.



--

João Eiras
Core Developer, Opera Software ASA, http://www.opera.com/



Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread Bjorn Bringert
On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote:
 I agree that being able to capture and upload audio to a server would
 be useful for a lot of applications, and it could be used to do speech
 recognition. However, for a web app developer who just wants to
 develop an application that uses speech input and/or output, it
 doesn't seem very convenient, since it requires server-side
 infrastructure that is very costly to develop and run. A
 speech-specific API in the browser gives browser implementors the
 option to use on-device speech services provided by the OS, or
 server-side speech synthesis/recognition.

 Again, it would help a lot if you could provide use cases and
 requirements. This helps both with designing an API, as well as
 evaluating if the use cases are common enough that a dedicated API is
 the best solution.

 / Jonas

I'm mostly thinking about speech web apps for mobile devices. I think
that's where speech makes most sense as an input and output method,
because of the poor keyboards, small screens, and frequent hands/eyes
busy situations (e.g. while driving). Accessibility is the other big
reason for using speech.

Some ideas for use cases:

- Search by speaking a query
- Speech-to-speech translation
- Voice Dialing (could open a tel: URI to actually make the call)
- Dialog systems (e.g. the canonical pizza ordering system)
- Lightweight JavaScript browser extensions (e.g. Greasemonkey /
Chrome extensions) for using speech with any web site, e.g, for
accessibility.

Requirements:

- Web app developer side:
   - Allows both speech recognition and synthesis.
   - Easy to use API. Makes simple things easy and advanced things possible.
   - Doesn't require web app developer to develop / run his own speech
recognition / synthesis servers.
   - (Natural) language-neutral API.
   - Allows developer-defined application specific grammars / language models.
   - Allows multilingual applications.
   - Allows easy localization of speech apps.

- Implementor side:
   - Easy enough to implement that it can get wide adoption in browsers.
   - Allows implementor to use either client-side or server-side
recognition and synthesis.

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread Diogo Resende
I agree 100%. Still, I think the access to the mic and the speech
recognition could be separated.

-- 
Diogo Resende drese...@thinkdigital.pt
ThinkDigital

On Thu, 2009-12-03 at 12:06 +, Bjorn Bringert wrote:
 On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote:
  On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote:
  I agree that being able to capture and upload audio to a server would
  be useful for a lot of applications, and it could be used to do speech
  recognition. However, for a web app developer who just wants to
  develop an application that uses speech input and/or output, it
  doesn't seem very convenient, since it requires server-side
  infrastructure that is very costly to develop and run. A
  speech-specific API in the browser gives browser implementors the
  option to use on-device speech services provided by the OS, or
  server-side speech synthesis/recognition.
 
  Again, it would help a lot if you could provide use cases and
  requirements. This helps both with designing an API, as well as
  evaluating if the use cases are common enough that a dedicated API is
  the best solution.
 
  / Jonas
 
 I'm mostly thinking about speech web apps for mobile devices. I think
 that's where speech makes most sense as an input and output method,
 because of the poor keyboards, small screens, and frequent hands/eyes
 busy situations (e.g. while driving). Accessibility is the other big
 reason for using speech.
 
 Some ideas for use cases:
 
 - Search by speaking a query
 - Speech-to-speech translation
 - Voice Dialing (could open a tel: URI to actually make the call)
 - Dialog systems (e.g. the canonical pizza ordering system)
 - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
 Chrome extensions) for using speech with any web site, e.g, for
 accessibility.
 
 Requirements:
 
 - Web app developer side:
- Allows both speech recognition and synthesis.
- Easy to use API. Makes simple things easy and advanced things possible.
- Doesn't require web app developer to develop / run his own speech
 recognition / synthesis servers.
- (Natural) language-neutral API.
- Allows developer-defined application specific grammars / language models.
- Allows multilingual applications.
- Allows easy localization of speech apps.
 
 - Implementor side:
- Easy enough to implement that it can get wide adoption in browsers.
- Allows implementor to use either client-side or server-side
 recognition and synthesis.
 




Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread David Workman
I agree. The application should be able to choose a source for speech
commands, or give the user a choice of options for a speech source. It also
provides a much better separation of APIs, allowing the development of a
speech API that doesn't depend on or interfere in any way with the
development of a microphone/audio input device API.

2009/12/3 Diogo Resende drese...@thinkdigital.pt

 I agree 100%. Still, I think the access to the mic and the speech
 recognition could be separated.

 --
 Diogo Resende drese...@thinkdigital.pt
 ThinkDigital

 On Thu, 2009-12-03 at 12:06 +, Bjorn Bringert wrote:
  On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote:
   On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com
 wrote:
   I agree that being able to capture and upload audio to a server would
   be useful for a lot of applications, and it could be used to do speech
   recognition. However, for a web app developer who just wants to
   develop an application that uses speech input and/or output, it
   doesn't seem very convenient, since it requires server-side
   infrastructure that is very costly to develop and run. A
   speech-specific API in the browser gives browser implementors the
   option to use on-device speech services provided by the OS, or
   server-side speech synthesis/recognition.
  
   Again, it would help a lot if you could provide use cases and
   requirements. This helps both with designing an API, as well as
   evaluating if the use cases are common enough that a dedicated API is
   the best solution.
  
   / Jonas
 
  I'm mostly thinking about speech web apps for mobile devices. I think
  that's where speech makes most sense as an input and output method,
  because of the poor keyboards, small screens, and frequent hands/eyes
  busy situations (e.g. while driving). Accessibility is the other big
  reason for using speech.
 
  Some ideas for use cases:
 
  - Search by speaking a query
  - Speech-to-speech translation
  - Voice Dialing (could open a tel: URI to actually make the call)
  - Dialog systems (e.g. the canonical pizza ordering system)
  - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
  Chrome extensions) for using speech with any web site, e.g, for
  accessibility.
 
  Requirements:
 
  - Web app developer side:
 - Allows both speech recognition and synthesis.
 - Easy to use API. Makes simple things easy and advanced things
 possible.
 - Doesn't require web app developer to develop / run his own speech
  recognition / synthesis servers.
 - (Natural) language-neutral API.
 - Allows developer-defined application specific grammars / language
 models.
 - Allows multilingual applications.
 - Allows easy localization of speech apps.
 
  - Implementor side:
 - Easy enough to implement that it can get wide adoption in browsers.
 - Allows implementor to use either client-side or server-side
  recognition and synthesis.
 



Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread Oliver Hunt

On Dec 3, 2009, at 4:06 AM, Bjorn Bringert wrote:

 On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote:
 I agree that being able to capture and upload audio to a server would
 be useful for a lot of applications, and it could be used to do speech
 recognition. However, for a web app developer who just wants to
 develop an application that uses speech input and/or output, it
 doesn't seem very convenient, since it requires server-side
 infrastructure that is very costly to develop and run. A
 speech-specific API in the browser gives browser implementors the
 option to use on-device speech services provided by the OS, or
 server-side speech synthesis/recognition.
 
  Again, it would help a lot if you could provide use cases and
 requirements. This helps both with designing an API, as well as
 evaluating if the use cases are common enough that a dedicated API is
 the best solution.
 
 / Jonas
 
 I'm mostly thinking about speech web apps for mobile devices. I think
 that's where speech makes most sense as an input and output method,
 because of the poor keyboards, small screens, and frequent hands/eyes
 busy situations (e.g. while driving). Accessibility is the other big
 reason for using speech.
Accessibility is already handled through ARIA and the host platform's 
accessibility features.

 
 Some ideas for use cases:
 
 - Search by speaking a query
 - Speech-to-speech translation
 - Voice Dialing (could open a tel: URI to actually make the call)
 - Dialog systems (e.g. the canonical pizza ordering system)
 - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
 Chrome extensions) for using speech with any web site, e.g, for
 accessibility.

I am unsure why the site should be directly responsible for things like 
audio-based accessibility.  What do you believe a site should be doing 
itself manually vs. the accessibility services provided by the host OS?

 
 Requirements:
 
 - Web app developer side:
   - Allows both speech recognition and synthesis.
ARIA (in conjunction with the OS accessibility services) already provides the 
accessibility-focused text-to-speech (unsure about the recognition side)
 
   - Doesn't require web app developer to develop / run his own speech
 recognition / synthesis servers.
This would seem to be a case for using the OS services
 
 - Implementor side:
   - Easy enough to implement that it can get wide adoption in browsers.
These services are not simple -- any implementation would seem to be a 
significant amount of work, especially if you want to a) actually be good at it 
and b) interact with the host OS's native accessibility features.

   - Allows implementor to use either client-side or server-side
 recognition and synthesis.
I honestly have no idea what you mean by this.

--Oliver



Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread Jonas Sicking
On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert bring...@google.com wrote:
 On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote:
 I agree that being able to capture and upload audio to a server would
 be useful for a lot of applications, and it could be used to do speech
 recognition. However, for a web app developer who just wants to
 develop an application that uses speech input and/or output, it
 doesn't seem very convenient, since it requires server-side
 infrastructure that is very costly to develop and run. A
 speech-specific API in the browser gives browser implementors the
 option to use on-device speech services provided by the OS, or
 server-side speech synthesis/recognition.

  Again, it would help a lot if you could provide use cases and
 requirements. This helps both with designing an API, as well as
 evaluating if the use cases are common enough that a dedicated API is
 the best solution.

 / Jonas

 I'm mostly thinking about speech web apps for mobile devices. I think
 that's where speech makes most sense as an input and output method,
 because of the poor keyboards, small screens, and frequent hands/eyes
 busy situations (e.g. while driving). Accessibility is the other big
 reason for using speech.

 Some ideas for use cases:

 - Search by speaking a query
 - Speech-to-speech translation
 - Voice Dialing (could open a tel: URI to actually make the call)

<input type="search">, <input type="text"> and <input type="tel"> seem like
the correct solution for these. Nothing prevents UAs from allowing
speech rather than keyboard input into these (and I believe that most
do if you have AT tools installed).

 - Dialog systems (e.g. the canonical pizza ordering system)

I saw some pretty cool XHTML+Voice demos a few years ago that did
this. They didn't use speech-to-text scripting APIs though.

 - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
 Chrome extensions) for using speech with any web site, e.g, for
 accessibility.

These would seem like APIs not exposed to webpages, but rather to
extensions. So WHATWG would be the wrong place to standardize them.
And I'm not convinced that this needs speech-to-text scripting APIs
either, but rather simply support for speech rather than keyboard as
text input.

/ Jonas


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread Fergus Henderson
On Thu, Dec 3, 2009 at 7:32 AM, Diogo Resende drese...@thinkdigital.pt wrote:

 I agree 100%. Still, I think the access to the mic and the speech
 recognition could be separated.


While it would be possible to separate access to the microphone and speech
recognition, combining them allows the API to abstract away details of the
implementation that would otherwise have to be exposed, in particular the
audio encoding(s) used, and whether the audio is streamed to the recognizer
or sent in a single chunk.  If we don't provide general access to the
microphone, the speech recognition API can be simpler, implementors will
have more flexibility, and implementations can be simpler and smaller
because they won't have to deal with conversions between different audio
encodings.

So I'm in favour of not separating out access to the microphone, at least in
v1 of the API.

-- 
Fergus Henderson fer...@google.com


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread Diogo Resende
I was not thinking of raw access to the mic. I was just thinking of a
2-step method, so that you could also do just step 1 :)

I was thinking of something like:

1. Call a Sound API and ask to record (maybe something like the
geolocation prompt in Firefox [1]).

2. Pass the recording to speech-to-text, or save or stream it, or whatever.

This way one could record audio and do something else like save/stream.
If others want to translate it into text, they just do the next step.

[1]: http://www.mozilla.com/en-US/firefox/geolocation/

-- 
Diogo Resende drese...@thinkdigital.pt
ThinkDigital

On Thu, 2009-12-03 at 12:30 -0500, Fergus Henderson wrote:
 On Thu, Dec 3, 2009 at 7:32 AM, Diogo Resende
 drese...@thinkdigital.pt wrote:
 I agree 100%. Still, I think the access to the mic and the
 speech
 recognition could be separated.
 
 While it would be possible to separate access to the microphone and
 speech recognition, combining them allows the API to abstract away
 details of the implementation that would otherwise have to be exposed,
 in particular the audio encoding(s) used, and whether the audio is
 streamed to the recognizer or sent in a single chunk.  If we don't
 provide general access to the microphone, the speech recognition API
 can be simpler, implementors will have more flexibility, and
 implementations can be simpler and smaller because they won't have to
 deal with conversions between different audio encodings.
 
 So I'm in favour of not separating out access to the microphone, at
 least in v1 of the API.
 
 -- 
 Fergus Henderson fer...@google.com




[whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Bjorn Bringert
We've been watching our colleagues build native apps that use speech
recognition and speech synthesis, and would like to have JavaScript
APIs that let us do the same in web apps. We are thinking about
creating a lightweight and implementation-independent API that lets
web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Jonas Sicking
On Wed, Dec 2, 2009 at 3:32 AM, Bjorn Bringert bring...@google.com wrote:
 We've been watching our colleagues build native apps that use speech
 recognition and speech synthesis, and would like to have JavaScript
 APIs that let us do the same in web apps. We are thinking about
 creating a lightweight and implementation-independent API that lets
 web apps use speech services. Is anyone else interested in that?

 Bjorn Bringert, David Singleton, Gummi Hafsteinsson

Short answer: Yes, very :)

Longer answer: APIs for accessing the microphone and camera are something
that I think is very needed. There are several aspects to this, ranging
from simply uploading video/audio clips using an <input type="file">
element, to streaming APIs that allow video/audio conferencing using a
browser, to being able to do video/audio processing/playback inside
the browser.

There's a ton of work here to be done, and help anywhere you are
willing to pitch in would be hugely appreciated.

/ Jonas


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Mike Hearn
Is speech support a feature of the web page, or the web browser?

On Wed, Dec 2, 2009 at 12:32 PM, Bjorn Bringert bring...@google.com wrote:
 We've been watching our colleagues build native apps that use speech
 recognition and speech synthesis, and would like to have JavaScript
 APIs that let us do the same in web apps. We are thinking about
 creating a lightweight and implementation-independent API that lets
 web apps use speech services. Is anyone else interested in that?

 Bjorn Bringert, David Singleton, Gummi Hafsteinsson

 --
 Bjorn Bringert
 Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
 Palace Road, London, SW1W 9TQ
 Registered in England Number: 3977902



Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Bjorn Bringert
I think that it would be best to extend the browser with a JavaScript
speech API intended for use by web apps. That is, only web apps that
use the speech API would have speech support. But it should be
possible to use such an API to write browser extensions (using
Greasemonkey, Chrome extensions etc) that allow speech control of the
browser and speech synthesis of web page contents. Doing it the other
way around seems like it would reduce the flexibility for web app
developers.

/Bjorn

On Wed, Dec 2, 2009 at 4:55 PM, Mike Hearn m...@plan99.net wrote:
 Is speech support a feature of the web page, or the web browser?

 On Wed, Dec 2, 2009 at 12:32 PM, Bjorn Bringert bring...@google.com wrote:
 We've been watching our colleagues build native apps that use speech
 recognition and speech synthesis, and would like to have JavaScript
 APIs that let us do the same in web apps. We are thinking about
 creating a lightweight and implementation-independent API that lets
 web apps use speech services. Is anyone else interested in that?

 Bjorn Bringert, David Singleton, Gummi Hafsteinsson

 --
 Bjorn Bringert
 Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
 Palace Road, London, SW1W 9TQ
 Registered in England Number: 3977902





-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Jonas Sicking
On Wed, Dec 2, 2009 at 9:17 AM, Bjorn Bringert bring...@google.com wrote:
 I think that it would be best to extend the browser with a JavaScript
 speech API intended for use by web apps. That is, only web apps that
 use the speech API would have speech support. But it should be
 possible to use such an API to write browser extensions (using
 Greasemonkey, Chrome extensions etc) that allow speech control of the
 browser and speech synthesis of web page contents. Doing it the other
 way around seems like it would reduce the flexibility for web app
 developers.

Hmm.. I guess I misunderstood your original proposal.

Do you want the browser to expose an API that converts speech to text?
Or do you want the browser to expose access to the microphone so that
you can do speech-to-text conversion in JavaScript?

If the former, could you describe your use cases in more detail?

/ Jonas


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Diogo Resende
I misunderstood too. It would be great to have the ability to access
the microphone and record+upload or stream sound to the web server.
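
Roughly like this, say (the microphone object is invented for
illustration; only the XMLHttpRequest part corresponds to a real API):

// Record a short clip and POST it to a made-up server endpoint.
microphone.record({ maxSeconds: 10 }, function (audioClip) {
  var xhr = new XMLHttpRequest();
  xhr.open("POST", "/voice-memos");
  xhr.send(audioClip);
});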

-- 
D.


On Wed, 2009-12-02 at 10:04 -0800, Jonas Sicking wrote:
 
 Hmm.. I guess I misunderstood your original proposal.
 
 Do you want the browser to expose an API that converts speech to text?
 Or do you want the browser to expose access to the microphone so that
 you can do speech-to-text conversion in JavaScript?
 
 If the former, could you describe your use cases in more detail?
 
 / Jonas




Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Bjorn Bringert
I agree that being able to capture and upload audio to a server would
be useful for a lot of applications, and it could be used to do speech
recognition. However, for a web app developer who just wants to
develop an application that uses speech input and/or output, it
doesn't seem very convenient, since it requires server-side
infrastructure that is very costly to develop and run. A
speech-specific API in the browser gives browser implementors the
option to use on-device speech services provided by the OS, or
server-side speech synthesis/recognition.
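
The page author's code would stay the same either way, e.g. (again a
purely hypothetical API, with a made-up application function):

// Whether recognition runs on an OS-provided engine or on a server
// operated by the browser vendor is invisible to this code.
speech.listen({}, function (result) {
  showSearchResultsFor(result.text);
});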

/Bjorn

On Wed, Dec 2, 2009 at 6:23 PM, Diogo Resende drese...@thinkdigital.pt wrote:
 I misunderstood too. It would be great to have the ability to access
 the microphone and record+upload or stream sound to the web server.


-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread João Eiras
On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert bring...@google.com wrote:

We've been watching our colleagues build native apps that use speech
recognition and speech synthesis, and would like to have JavaScript
APIs that let us do the same in web apps. We are thinking about
creating a lightweight and implementation-independent API that lets
web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson



This exists already, but only Opera supports it, although there are
problems with the library we use for speech recognition.

http://www.w3.org/TR/xhtml+voice/
http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/

It would be nice to revive that specification and get vendor buy-in.



--

João Eiras
Core Developer, Opera Software ASA, http://www.opera.com/


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Diogo Resende
If you're able to read from the mic, you don't need to upload: you
could save the audio locally (for example, for voice memos).
Record+upload was just the two-step alternative I suggested to direct
streaming. Speech recognition could be done separately: one app could
use the mic to capture a voice note, while another could run speech
recognition without the mic (on a saved file, say). Divide and
conquer :)
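
Something like this, say (all names invented for illustration):

// Capture and recognition as independent building blocks that can
// be used separately or composed.
microphone.record({ maxSeconds: 5 }, function (audioClip) {
  saveVoiceMemo(audioClip);  // use it as a plain voice note...
  recognizer.recognize(audioClip, function (result) {
    showTranscript(result.text);  // ...and/or transcribe the same clip
  });
});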

-- 
Diogo Resende drese...@thinkdigital.pt
ThinkDigital

On Wed, 2009-12-02 at 19:17 +, Bjorn Bringert wrote:
 I agree that being able to capture and upload audio to a server would
 be useful for a lot of applications, and it could be used to do speech
 recognition. However, for a web app developer who just wants to
 develop an application that uses speech input and/or output, it
 doesn't seem very convenient, since it requires server-side
 infrastructure that is very costly to develop and run. A
 speech-specific API in the browser gives browser implementors the
 option to use on-device speech services provided by the OS, or
 server-side speech synthesis/recognition.
 
 /Bjorn
 




Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Jonas Sicking
On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert bring...@google.com wrote:
 I agree that being able to capture and upload audio to a server would
 be useful for a lot of applications, and it could be used to do speech
 recognition. However, for a web app developer who just wants to
 develop an application that uses speech input and/or output, it
 doesn't seem very convenient, since it requires server-side
 infrastructure that is very costly to develop and run. A
 speech-specific API in the browser gives browser implementors the
 option to use on-device speech services provided by the OS, or
 server-side speech synthesis/recognition.

Again, it would help a lot if you could provide use cases and
requirements. That helps both with designing an API and with
evaluating whether the use cases are common enough that a dedicated
API is the best solution.

/ Jonas


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Dave Burke
We're envisaging a simpler programmatic API that looks familiar to
the modern Web developer but avoids the legacy of dialog-system
languages.
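
So a prompt-and-answer turn would be an ordinary pair of asynchronous
calls rather than a dialog described in separate markup, e.g. (purely
illustrative, invented names):

// One dialog turn as plain JavaScript callbacks instead of
// VoiceXML-style markup; dispatchCommand is a made-up function.
speech.speak("What would you like to do?", function () {
  speech.listen({}, function (result) {
    dispatchCommand(result.text);
  });
});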

Dave

On Wed, Dec 2, 2009 at 7:25 PM, João Eiras jo...@opera.com wrote:



 This exists already, but only Opera supports it, although there are
 problems with the library we use for speech recognition.

 http://www.w3.org/TR/xhtml+voice/
 http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/

 It would be nice to revive that specification and get vendor buy-in.






Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread João Eiras
On Thu, 03 Dec 2009 01:50:20 +0100, Dave Burke davebu...@google.com wrote:

We're envisaging a simpler programmatic API that looks familiar to the
modern Web developer but one which avoids the legacy of dialog system
languages.



OK, I referenced XHTML+Voice because there is already a specification
with markup, CSS 2 aural style sheets, and JavaScript APIs, plus one
implementation. I'm quite sure someone could revisit this whole issue
and refactor the XHTML+Voice specification into something more
acceptable and implementable.

I don't think anyone would implement it the way it is now.


