On 3/18/2013 8:29 PM, Eric Rescorla wrote:
On Mon, Mar 18, 2013 at 4:54 PM, Robert O'Callahan <[email protected]> wrote:

As far as I know there are two major problems with the way MSG video works
right now:

1) In WebRTC we don't want to hold up playing audio for a time interval
[T1, T2] until all video frames up to and including T2 have been decoded
(MSG currently requires this). We'd rather just go ahead and play the audio
and if video decoding has fallen behind audio, render the latest video
frame as soon as it's available (preferably without waiting for an MSG
iteration). Of course if video decoding is far enough ahead we should sync
video frames to the audio instead (and then MSG needs to be involved since
it has the audio track(s)).

It's probably worth mentioning at this point that the current WebRTC video
implementation (like the gUM one) just returns the latest video frame
upon request. So if (say) two video frames come in during the time period
between NotifyPull()s, we just deliver the most recent one. Obviously,
we could buffer them and deliver as two segments, but if we went to
a model where we pushed video onto the MSG (which is what GIPS
expects), then we wouldn't bother.
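
To make that concrete, here's a rough sketch of the "latest frame wins" delivery model Ekr describes (invented names, not the actual GIPS/MSG code): frames that arrive between pulls simply overwrite each other, and a pull hands back whatever is newest.

// Illustrative sketch only - invented names, not real GIPS/MSG code.
// Frames arriving between NotifyPull()s overwrite each other; a pull
// just hands back the most recent one it has.
#include <mutex>
#include <cstdint>

struct VideoFrame {
  int64_t captureTimeUs = 0;   // capture timestamp
  // ... image data / buffer handle would live here ...
  bool valid = false;
};

class LatestFrameHolder {
public:
  // Capture thread: any frame not yet pulled is simply replaced.
  void OnFrame(const VideoFrame& aFrame) {
    std::lock_guard<std::mutex> lock(mMutex);
    mLatest = aFrame;
    mLatest.valid = true;
  }

  // Pull side (e.g. from NotifyPull()): returns the newest frame seen so
  // far; if nothing new arrived since last time, the previous frame is
  // returned again (its duration effectively gets extended).
  VideoFrame Pull() {
    std::lock_guard<std::mutex> lock(mMutex);
    return mLatest;
  }

private:
  std::mutex mMutex;
  VideoFrame mLatest;
};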

In theory, GetUserMedia *should* be pushing in video frames as they occur and letting the sink decide what to do with them. In my mind, playout sinks of a MediaStream should pull on Audio and Video (typically based on the hardware playout clock), while intermediate sinks (like a PeerConnection) would like to get data from the stream - and preferably would accept it - as soon as it's available, with the capture timestamp fed in to let the far end handle sync.

If a source is blocked, the sink should decide whether to block itself or not (either dynamically, or statically when it becomes a sink). For realtime use, you typically want to use the latest video frame that matches (or is slightly newer than) the audio, but never block audio output on video frames being late or unavailable.
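
As a sketch of that policy (illustrative names only, not MSG code): given the current audio playout time, pick the newest decoded frame that isn't ahead of the audio by more than some slack, and if nothing qualifies just keep showing what you have - audio never waits.

// Sketch only - invented names. aFrames is assumed to be sorted
// oldest-to-newest by timestamp.
#include <vector>
#include <cstdint>

struct DecodedFrame {
  int64_t timeUs;   // presentation/capture time
  // ... image handle ...
};

// Returns an index into aFrames, or -1 meaning "keep whatever is currently
// displayed". Audio output is never delayed by this choice: if video has
// fallen behind, we just show the newest frame we have and catch up later.
int ChooseFrameForAudioTime(const std::vector<DecodedFrame>& aFrames,
                            int64_t aAudioTimeUs,
                            int64_t aSlackUs /* e.g. one frame period */) {
  int best = -1;
  for (size_t i = 0; i < aFrames.size(); ++i) {
    if (aFrames[i].timeUs <= aAudioTimeUs + aSlackUs) {
      best = static_cast<int>(i);   // newest frame not (much) ahead of audio
    }
  }
  // best == -1 here means every decoded frame is still ahead of the audio
  // clock (or nothing is decoded yet); hold the current frame and never
  // touch the audio path.
  return best;
}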

If I had my druthers, I'd want an interface where on Pull you can specify whether or not to block on missing data, and where, when pushing video in, you could specify a start time and duration, or a start time and no/infinite duration (which would work out to "until another frame is pushed"). Non-pulling sinks (such as PeerConnection) generally want any track data as soon as possible, even if other tracks don't have data yet, and they want to get the data with a timestamp. GetUserMedia would push video frames with infinite/no durations, as would PeerConnection sources. Streaming/recorded media may well push video with defined durations.

The last question is about audio. Pulled sources should generally block on missing audio (though PeerConnection sources should adapt off the pulls and never cause blocking, while realtime sampled sources should probably resample or timebase-correct to adapt to the Puller's frequency). I.e. the graph frequency is driven by the output sink, and inputs need to adapt to it, at least if they're hooked together.
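
Roughly the shape of interface I have in mind - a sketch only; every name below is invented and none of this is an existing or proposed MSG API:

// Rough sketch only - invented names, not an actual MSG API.
#include <cstdint>

struct AudioChunk { /* samples, channel count, etc. */ };
struct VideoFrame { /* image handle, capture time */ };

// "No duration": the frame is valid until the next one is pushed.
static const int64_t kInfiniteDurationUs = -1;

class MediaTrackSink {
public:
  virtual ~MediaTrackSink() {}

  // Pull model (playout sinks, clocked by the output device).
  // aBlockOnMissingData = true  -> wait for data up to aEndTimeUs
  //                                (what you want for audio);
  // aBlockOnMissingData = false -> return immediately with whatever is
  //                                there (what you want for video).
  virtual bool PullAudio(int64_t aEndTimeUs, bool aBlockOnMissingData,
                         AudioChunk* aOut) = 0;
  virtual bool PullVideo(int64_t aEndTimeUs, bool aBlockOnMissingData,
                         VideoFrame* aOut) = 0;
};

class MediaTrackSource {
public:
  virtual ~MediaTrackSource() {}

  // Push model (gUM, PeerConnection sources): a frame comes with a start
  // time and either a real duration (streaming/recorded media) or
  // kInfiniteDurationUs ("until another frame is pushed").
  virtual void PushVideoFrame(const VideoFrame& aFrame,
                              int64_t aStartTimeUs,
                              int64_t aDurationUs = kInfiniteDurationUs) = 0;
};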

There's an assumption here that media elements Pull the data and are themselves driven by the output clocking - or the equivalent (it doesn't actually have to be a Pull from the media element; the MediaStream could clock the data out into the media element using the main output device clock).

One side note: a MediaStream that goes from audio capture -> MediaStream -> PeerConnection doesn't have to be resampled to match the output clock, though it may be simpler (albeit more expensive) to do so. Of course, if the track or stream is later cloned and does go to output, it may need to start resampling/etc. (though it could do so at the cloning point, I think). This may be a common usage with a self-image (a muted <video> element playing the same MediaStream that's attached to the PeerConnection).
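
A sketch of that decision (invented names again, nothing real): resampling/timebase correction to the output clock only needs to be switched on once some attached sink is actually clocked by the output device.

// Sketch of "only resample once something actually plays out" -
// illustrative names, not Gecko code.
#include <vector>

enum class SinkKind {
  PeerConnection,    // consumes at the capture clock; timestamps carry sync
  AudioOutput,       // clocked by the output device
  MutedVideoElement  // renders video only; no audio clock involved
};

// A capture -> PeerConnection-only path can skip resampling to the output
// clock; once the track is cloned to a real output (e.g. an unmuted
// element), resampling gets turned on, possibly at the cloning point.
bool NeedsResamplingToOutputClock(const std::vector<SinkKind>& aSinks) {
  for (SinkKind kind : aSinks) {
    if (kind == SinkKind::AudioOutput) {
      return true;
    }
  }
  return false;
}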

Those are my off-the-cuff thoughts (nothing seriously new here); I may have missed a point somewhere - please feel free to critique. (derf, jmspeex especially)

Side note:
In theory you could switch on resampling when the <video> element un-muted (and only pull video frames until then), but that's getting pretty complex. If you want to be really accurate, elements should know if they're visible and only be synced to output if they're somehow routed to audio outputs. (For example, a hidden video element being used to capture a MediaStream for sending to a PeerConnection wouldn't have to be synced to audio output - but one that's connected to a visible <video> element would have to be. It can get complex if you want to do the maximal version of this, which I wouldn't advise - certainly not now.)

Note that in GIPS, video frames have times of arrival but no duration,
so there is a difficult match there as well.

Correct.


  -Ekr


2) Various devices implement stream capture using ring buffers and
therefore don't really want to give away references to image buffers that
can live indefinitely ... so these image buffers aren't a good fit for the
Image object, which allows Gecko code to keep an Image alive indefinitely
... unless we make copies of images, which of course we want to avoid. So
we'd really like a SourceMediaStream to be able to manage the lifetimes of
its frames, most of the time, and make frame copies (if necessary) only in
exceptional cases.

Let me know if there are important issues I've overlooked. And share your
ideas if you have a solution. I'm still thinking :-).
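
For (2), one way I could picture it (purely a sketch with invented names, not a worked-out design): have the SourceMediaStream hand out short-lived references into the capture ring buffer, and only copy a frame out in the exceptional case where something still holds a reference when the ring buffer needs that slot back.

// Sketch only - invented names. The capture side recycles a fixed set of
// slots; handing a frame out normally means a short-lived reference into a
// slot, and a copy is made only in the exceptional case where a holder
// outlives the slot's recycling.
#include <cstdint>
#include <memory>
#include <vector>

struct FrameSlot {
  std::vector<uint8_t> pixels;   // stands in for the ring-buffer memory
  int64_t timeUs = 0;
  int refCount = 0;              // outstanding short-lived references
};

class CaptureFramePool {
public:
  explicit CaptureFramePool(size_t aSlots) : mSlots(aSlots) {}

  // Capture thread: write the newest frame into the next slot. If someone
  // still holds a reference to that slot, detach its contents into a heap
  // copy first (the exceptional, expensive path) so the slot can be reused.
  FrameSlot* WriteFrame(const uint8_t* aData, size_t aLen, int64_t aTimeUs) {
    FrameSlot& slot = mSlots[mNext];
    mNext = (mNext + 1) % mSlots.size();
    if (slot.refCount > 0) {
      // In a real design this copy would be handed to the long-lived
      // holder (e.g. wrapped in an Image); here it's just a placeholder.
      std::shared_ptr<std::vector<uint8_t>> copy =
          std::make_shared<std::vector<uint8_t>>(slot.pixels);
      (void)copy;
      slot.refCount = 0;
    }
    slot.pixels.assign(aData, aData + aLen);
    slot.timeUs = aTimeUs;
    return &slot;   // consumers take a short-lived reference (refCount++)
  }

private:
  std::vector<FrameSlot> mSlots;
  size_t mNext = 0;
};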

--
   Randell Jesup
