> Looking at this plot it seems like the latency times are varying according
> to a pattern. My guess is that the screen runs asynchronously with the GPU.

That it certainly does. Modern LCDs are no longer tied to the VSYNC/HSYNC
and pixel clock signals the way the old CRTs were, because there is always
image processing logic in between, if for nothing else then for scaling
the image to the full resolution of the LCD panel. With packet-oriented
digital connections like DisplayPort it gets decoupled even further
(DVI/HDMI is still essentially emulating the old VGA, which in turn
emulates the old TV conventions for video, even though the way the signals
are encoded electrically is very different). Oh, and it gets better - in
fact, some monitors completely ignore VSYNC/HSYNC signals when connected by
a digital connection (HDMI/DisplayPort), because they know how many pixels
they need to draw, so they simply count the incoming bytes, something that
was not possible with the purely analog VGA signals and analog monitors.

There are basically 4 completely asynchronous systems here:

- your code
- the GPU
- the monitor processor
- the LCD panel itself

The first two can be synchronized using VSYNC and/or fences. The GPU and
the monitor processor are synchronized by the HDMI/DVI/VGA electrical
signals, and the monitor processor is synchronized with the LCD panel.
However, there is no common synchronisation across these pairs; in fact,
there are buffers between them to account for the differing processing
speeds. And you can typically only observe the first and the last one ...

> So the screen polls the GPU at 60Hz and sometimes the GPU just happens to
> have a frame ready when the screen starts a scanout.

> The lower limit is just the pure scanout time of the display. But this is
> just my theory right now. I do not have the detailed knowledge of the inner
> workings of a LCD display.

It works the other way around - the GPU generates the data at whatever
resolution and refresh rate the screen declares that it supports
(determined via EDID, unless manually overridden in the driver settings)
and sends it to the processor in the display. The display then does any
processing it needs and only actually flips the pixels on the panel when
ready, independently of the GPU. That is always delayed a bit, depending
on how much processing the display does. There is no polling; the
video connection is essentially one way only (not counting service data
like EDID).

In comparison, the old CRTs had low latency, because the analog signal from
the GPU was driving the deflection coils steering the electron beam
practically directly, with no buffering or processing. When the GPU started
a new frame by sending VSYNC, the monitor really made the beam jump to the
upper left corner at that moment. That also explains why you could get
"weird", unsupported resolutions out of an analog CRT screen and why you
could potentially fry it with a resolution or refresh rate too high for it
to handle - the deflection coil electronics would typically overheat,
drawing too much current.

I think the "jitter" you are seeing, where sometimes you have very low
latency and sometimes it is over a frame, comes from a different source -
namely your program and the VSYNC handling by the GPU.
The GPU will always generate the output signal the same way, VSYNC or no
VSYNC, otherwise the monitor may not be able to handle it and sync to it.
What happens is that sometimes your program "gets lucky" and tells the GPU
to swap buffers "just in time" before the start of the next frame - then
you have very little latency, because the change becomes visible almost
immediately (modulo the input latency of the monitor above). On the other
hand, sometimes you get unlucky, you swap buffers right after the scanout
of the framebuffer has started and then the GPU will hold your image until
the next frame cycle - poof, one frame of latency extra ... And you can
have everything in between these two extreme cases.
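
To put numbers on the lucky/unlucky cases above, here is a minimal Python
sketch. It is only a toy model, not how any driver actually works: it
assumes a fixed 60 Hz refresh, scanouts starting at exact multiples of the
frame period, and a swap that is always picked up at the next scanout
start. The function name and the constants are made up for illustration.

```python
import math

REFRESH_HZ = 60.0                # assumed refresh rate
FRAME_MS = 1000.0 / REFRESH_HZ   # roughly 16.67 ms per scanout cycle


def swap_to_scanout_delay(swap_time_ms, frame_ms=FRAME_MS):
    """Time from a buffer swap request until the next scanout starts.

    Toy model: scanouts start at t = 0, frame_ms, 2 * frame_ms, ...
    A swap landing exactly on a boundary is picked up with zero delay.
    """
    next_start = math.ceil(swap_time_ms / frame_ms) * frame_ms
    return next_start - swap_time_ms


# Swap requests at a few arbitrary moments within the first frame period
# (times in ms). The later in the cycle the swap happens, the shorter
# the wait for the next scanout.
for swap_time in (0.5, 5.0, 10.0, 16.0):
    delay = swap_to_scanout_delay(swap_time)
    print(f"swap at {swap_time:5.1f} ms -> waits {delay:5.2f} ms")
```

The delay this adds is anywhere between zero and almost one full frame
period, on top of whatever the monitor itself adds, which would be
consistent with the spread you see in your plot.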

When VSYNC is on, it gets even more complicated, because then you are
telling the GPU to synchronize the userspace code with the frame scanout
start (not the start of the physical frame on the monitor when your sensor
reacts - remember, the GPU has no control at all over the image processor
in the monitor!). This is typically an extremely inefficient thing to do
from the driver's point of view because you are stalling the GPU until the
new frame is due, so the drivers often "play games" here - like not really
blocking your program until the frame start but returning right away and
buffering your frame internally. The frame then gets sent out later when
convenient (i.e. on the next scanout cycle). They can even hold several
frames back like that and block only when this frame queue runs out of
space (you were really rendering too fast). Nvidia in particular is known
for these VSYNC "shenanigans" in their driver. This is what Robert was talking
about with the fences - that is a relatively recent feature that allows you
to force the driver to wait until a certain event occurs - e.g. a new frame
start or sync event from an external source (e.g. genlock). The regular
VSYNC ON/OFF will not guarantee this anymore on today's hardware, it only
guarantees that you will not change the framebuffer in the middle of it
being drawn (tearing).
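
The effect of such a frame queue on latency can be illustrated with another
small Python sketch. Again, this is a toy model under my own assumptions
and not the behaviour of any particular driver: the display takes exactly
one queued frame per refresh in FIFO order, nothing is ever dropped, and
the swap call blocks only while more than max_queued frames are waiting.
The function and parameter names are hypothetical.

```python
import math


def simulate_frame_queue(render_ms, frame_ms, max_queued, n_frames):
    """Return, per frame, the delay (ms) from swap call to scanout start.

    max_queued = 0 models a swap that blocks until the frame is flipped;
    larger values let the application run ahead of the display.
    """
    delays = []
    t = 0.0          # application clock in ms
    prev_slot = 0    # scanout cycle used by the previous frame
    for _ in range(n_frames):
        t += render_ms  # the application renders, then calls swap at t
        # Next free scanout cycle at or after t, one frame per cycle.
        slot = max(math.ceil(t / frame_ms), prev_slot + 1)
        delays.append(slot * frame_ms - t)
        # The swap returns once at most max_queued frames are waiting.
        t = max(t, (slot - max_queued) * frame_ms)
        prev_slot = slot
    return delays


# Rendering takes 4 ms, the refresh period is 10 ms (round numbers chosen
# only to keep the arithmetic easy to follow).
print(simulate_frame_queue(4.0, 10.0, max_queued=0, n_frames=7))
print(simulate_frame_queue(4.0, 10.0, max_queued=2, n_frames=7))
```

In this model, once the application renders faster than the refresh rate,
the delay settles at (max_queued + 1) frame periods minus the render time.
So every frame the driver is allowed to hold back costs one more refresh
period of latency, even though the application itself appears to run more
smoothly.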

This has a lot of consequences for "pro" applications requiring active
stereo or synchronization to external sources (CAVE, TV studios, etc.).
However, an average desktop application (e.g. a 3D game) benefits, because
it is typically going to be able to run faster and smoother when not having
to block on VSYNC.

