On 3/18/2013 8:29 PM, Eric Rescorla wrote:
On Mon, Mar 18, 2013 at 4:54 PM, Robert O'Callahan <[email protected]> wrote:
As far as I know there are two major problems with the way MSG video works
right now:
1) In WebRTC we don't want to hold up playing audio for a time interval
[T1, T2] until all video frames up to and including T2 have been decoded
(MSG currently requires this). We'd rather just go ahead and play the audio
and if video decoding has fallen behind audio, render the latest video
frame as soon as it's available (preferably without waiting for an MSG
iteration). Of course if video decoding is far enough ahead we should sync
video frames to the audio instead (and then MSG needs to be involved since
it has the audio track(s).
It's probably worth mentioning at this point that the current WebRTC video
implementation (like the gUM one) just returns the latest video frame
upon request. So if, say, two video frames come in during the period
between NotifyPull()s, we just deliver the most recent one. Obviously,
we could buffer them and deliver as two segments, but if we went to
a model where we pushed video onto the MSG (which is what GIPS
expects), then we wouldn't bother.
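In code terms, that "latest frame wins" delivery is roughly the following
(a sketch only; the names are invented for illustration, not the actual
gUM/WebRTC classes):

// Hypothetical sketch (not the real gUM/MSG classes): a source that keeps
// only the most recent captured frame and hands it out when the graph
// pulls, silently dropping anything older.
#include <memory>
#include <mutex>

struct VideoFrame {
  std::shared_ptr<const unsigned char[]> mData;  // shared ref to pixel data
  long long mCaptureTimeUs = 0;                  // capture timestamp
};

class LatestFrameSource {
public:
  // Called from the capture thread whenever a new frame arrives.
  void OnFrameCaptured(VideoFrame aFrame) {
    std::lock_guard<std::mutex> lock(mMutex);
    mLatest = std::move(aFrame);  // any undelivered older frame is dropped
    mHaveFrame = true;
  }

  // Called from the graph thread on NotifyPull(); returns false if
  // nothing new has arrived since the last pull.
  bool PullLatest(VideoFrame* aOut) {
    std::lock_guard<std::mutex> lock(mMutex);
    if (!mHaveFrame) {
      return false;
    }
    *aOut = mLatest;
    mHaveFrame = false;  // each frame is delivered at most once
    return true;
  }

private:
  std::mutex mMutex;
  VideoFrame mLatest;
  bool mHaveFrame = false;
};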
In theory, GetUserMedia *should* be pushing in video frames as they
occur and letting the sink decide what to do with them. In my mind,
playout sinks of a MediaStream should pull on Audio and Video (typically
based on the hardware playout clock), while intermediate sinks (like a
PeerConnection) would like to get data from the stream as soon as it's
available - preferably accepting it as it arrives - with the capture
timestamp fed in to let the far end handle sync.
If a source is blocked, the sink should decide whether or not to block
itself (either dynamically, or statically when it becomes a sink). For
realtime use, you typically want to use the latest video frame that
matches (or is slightly newer than) the audio, but never block audio
output on video frames being late or unavailable.
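Concretely, the selection rule for a realtime playout sink might look
something like this (a sketch assuming a small queue of timestamped
frames; the types and the 30fps slack value are made up):

// Sketch of that selection rule (hypothetical types): pick the newest
// frame at or before the audio clock; if video is late, fall back to a
// frame that is only slightly newer; never wait on one that hasn't
// arrived.
#include <deque>
#include <optional>

struct TimedFrame {
  long long mTimeUs;  // capture timestamp in microseconds
  int mFrameId;       // stand-in for the real image handle
};

std::optional<TimedFrame>
PickFrameForAudioTime(std::deque<TimedFrame>& aFrames, long long aAudioTimeUs)
{
  std::optional<TimedFrame> chosen;
  // Newest frame whose timestamp is at or before the audio clock wins.
  while (!aFrames.empty() && aFrames.front().mTimeUs <= aAudioTimeUs) {
    chosen = aFrames.front();
    aFrames.pop_front();
  }
  // Otherwise accept a frame that's only slightly newer than the audio
  // (assumption: ~one frame interval at 30fps) rather than blocking.
  const long long kSlackUs = 33000;
  if (!chosen && !aFrames.empty() &&
      aFrames.front().mTimeUs - aAudioTimeUs <= kSlackUs) {
    chosen = aFrames.front();
    aFrames.pop_front();
  }
  return chosen;  // nullopt == keep showing the previously rendered frame
}

Anything left in the queue that's much newer than the audio just waits
for a later pull; nothing ever blocks the audio path.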
If I had my druthers, I'd want an interface where on Pull you can
specify if you want to block on any missing data or not, and when
pushing video in you could specify a start time and duration, or specify
a start time and no/infinite duration (which would work out to "until
another frame is pushed"). Non-pulling sinks (such as PeerConnection)
generally want any track data as soon as possible, even if other tracks
don't have data yet, and they want to get the data with a timestamp.
GetUserMedia would push video frames with infinite/no durations, as
would PeerConnection sources. Streaming/recorded media may well push
video with defined durations. The last question is about audio.
Pulling sources should generally block on missing audio (though
PeerConnection sources should adapt off the pulls and never cause
blocking, while realtime sampled sources should probably resample or
timebase-correct to adapt to the Puller's frequency). I.e. the Graph
frequency is driven by the output sink, and inputs need to adapt to it,
at least if they're hooked together.
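To make that concrete, here's a very rough sketch of the shape of
interface I'm describing; every name below is invented for illustration
and none of it is an existing MSG API:

// All names here are invented for illustration; nothing is existing MSG API.
#include <cstdint>

using TrackTicks = int64_t;

class ProposedSourceStream {
public:
  // Push a video frame starting at aStart. aDuration == 0 means
  // "no/infinite duration": the frame stays current until the next one
  // is pushed (what gUM and PeerConnection sources would do); recorded
  // or streamed media would pass a real duration.
  virtual void PushVideoFrame(const void* aImage, TrackTicks aStart,
                              TrackTicks aDuration /* 0 = until next */) = 0;

  // Audio always comes with a definite amount of data.
  virtual void PushAudio(const int16_t* aSamples, uint32_t aSampleCount,
                         TrackTicks aStart) = 0;

  virtual ~ProposedSourceStream() = default;
};

class ProposedSink {
public:
  // A playout sink pulling off the hardware clock would pass
  // aBlockOnMissingData = true for audio; a non-pulling sink like
  // PeerConnection would instead be handed per-track data (with capture
  // timestamps) as soon as it exists, even if other tracks have none.
  virtual bool Pull(TrackTicks aFrom, TrackTicks aTo,
                    bool aBlockOnMissingData) = 0;

  virtual ~ProposedSink() = default;
};

The duration-less push is what makes gUM/PeerConnection video cheap to
feed in; streaming/recorded media would just pass real durations through
the same call.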
There's an assumption here that media elements Pull the data and
themselves are driven by the output clocking - or equivalent (it doesn't
actually have to be a Pull from the media; it could be clocked out
(using the main output device clock) by the MediaStream into the media
element).
One side note: a MediaStream that goes from audio capture -> MediaStream
-> PeerConnection doesn't have to be resampled to match the output
clock, though it may be simpler (albeit more expensive) to do so. Of
course, if it's later cloned (the track or stream) and does go to
output, it may need to start resampling/etc (though it could do so at
the cloning point I think). This may be a common usage with a
self-image (a muted <video> element playing the same MediaStream
attached to the PeerConnection).
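As a sketch of that (helper names are hypothetical stand-ins, not real
Gecko classes), the PeerConnection branch just passes the captured audio
through at the capture rate, and only the cloned branch headed for local
output inserts a resampler:

// Hypothetical stand-ins: the PeerConnection branch keeps the capture
// rate; only the cloned branch feeding local output is resampled.
struct AudioChunkStub {
  int mRateHz;  // nominal sample rate of this chunk
  // samples omitted for brevity
};

class ResamplerStub {
public:
  ResamplerStub(int aInRate, int aOutRate) : mInRate(aInRate), mOutRate(aOutRate) {}
  AudioChunkStub Process(const AudioChunkStub& aIn) {
    AudioChunkStub out = aIn;
    out.mRateHz = mOutRate;  // real code would actually resample here
    return out;
  }
private:
  int mInRate;
  int mOutRate;
};

// Capture -> PeerConnection: stays on the capture clock; the capture
// timestamps let the far end handle sync.
AudioChunkStub ForPeerConnection(const AudioChunkStub& aCaptured) {
  return aCaptured;  // no resampling on this path
}

// Cloned branch -> local output: resample to the output device clock at
// (or after) the clone point.
AudioChunkStub ForLocalOutput(const AudioChunkStub& aCaptured, int aOutputRateHz) {
  ResamplerStub resampler(aCaptured.mRateHz, aOutputRateHz);
  return resampler.Process(aCaptured);
}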
Those are my off-the-cuff thoughts (nothing seriously new here); I may
have missed a point somewhere - please feel free to critique. (derf,
jmspeex especially)
Side note:
In theory you could switch on resampling when the <video> element
un-muted (and only pull video frames until then), but that's getting
pretty complex. If you want to be really accurate, elements should
know if they're visible and only be synced to output if they're somehow
routed to audio outputs. (For example, a hidden video element being
used to capture a MediaStream for playback to a PeerConnection wouldn't
have to be synced to audio output - but one that's connected to a
visible <video> element would have to be. It can get complex if you
want to do the maximal version of this, which I wouldn't advise -
certainly not now.)
Note that in GIPS, video frames have times of arrival but no duration,
so there is a difficult match there as well.
Correct.
-Ekr
2) Various devices implement stream capture using ring buffers and
therefore don't really want to give away references to image buffers that
can live indefinitely ... so these image buffers aren't a good fit for the
Image object, which allows Gecko code to keep an Image alive indefinitely
... unless we make copies of images, which of course we want to avoid. So
we'd really like a SourceMediaStream to be able to manage the lifetimes of
its frames, most of the time, and make frame copies (if necessary) only in
exceptional cases.
Let me know if there are important issues I've overlooked. And share your
ideas if you have a solution. I'm still thinking :-).
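For (2), here's the kind of ownership model I'd imagine - all of this is
a hypothetical sketch, not existing Gecko code: a fixed ring of capture
slots owned by the source, with consumers holding a generation-checked
handle and copying out only when they actually need the image to outlive
the slot.

// Hypothetical sketch for (2): capture buffers live in a fixed ring owned
// by the source; consumers normally just borrow the slot, and only a
// consumer that must keep the image alive indefinitely (the Image-object
// case) pays for a copy.
#include <cstdint>
#include <vector>

struct CaptureSlot {
  std::vector<uint8_t> mPixels;  // backing storage, reused in place
  uint32_t mGeneration = 0;      // bumped each time the ring recycles the slot
};

class FrameHandle {
public:
  FrameHandle(CaptureSlot* aSlot, uint32_t aGeneration)
    : mSlot(aSlot), mGeneration(aGeneration) {}

  // Cheap path: borrow the pixels, valid only until the ring reuses this
  // slot for a newer frame.
  const uint8_t* PeekIfStillValid() const {
    return (mSlot->mGeneration == mGeneration) ? mSlot->mPixels.data()
                                               : nullptr;
  }

  // Exceptional path: deep-copy the pixels so they can outlive the slot;
  // returns an empty buffer if the slot was already recycled.
  std::vector<uint8_t> CopyOut() const {
    if (mSlot->mGeneration != mGeneration) {
      return {};
    }
    return mSlot->mPixels;
  }

private:
  CaptureSlot* mSlot;
  uint32_t mGeneration;
};

That keeps the common case copy-free while still letting something like
the Image object hang onto pixels when it really must.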
--
Randell Jesup