The gem_exec_nop test generally works by submitting batches to an engine as fast as possible for a fixed time, then finally calling gem_sync() to wait for the last submitted batch to complete. The time-per-batch is then calculated as the total elapsed time, divided by the total number of batches submitted.
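
To make the measurement concrete, here is a minimal sketch of that loop in Python. It is not the IGT code: `submit` and `sync` are hypothetical callables standing in for the execbuf call and gem_sync(), and the injectable `clock` exists only so the arithmetic can be checked without hardware.

```python
import time


def time_per_batch(submit, sync, duration_s=2.0, clock=time.monotonic):
    """Submit as fast as possible for duration_s, sync once, return time per batch.

    submit and sync are hypothetical stand-ins for the execbuf call and
    gem_sync(); they are assumptions of this sketch, not IGT functions.
    """
    start = clock()
    count = 0
    while clock() - start < duration_s:
        submit()  # queue one no-op batch
        count += 1
    sync()  # wait for the last submitted batch to complete
    elapsed = clock() - start
    # One number for the whole run: per-batch variation is averaged away.
    return elapsed / count
```

Note that the result is a single average over the whole run, so it cannot distinguish a cheap submission from an expensive one.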

The problem with this approach as a measurement of driver overhead or latency (or anything else) is that the amount of work involved in submitting a batch is not a simple constant; in particular, it depends on the state of the various queues in the execution path. It also has the rather strange characteristic that if the GPU runs slightly faster, the driver may go much slower!

The main reason here is the lite-restore mechanism, although it interacts with dual-submission and the details of handling the completion interrupt. In particular, lite-restore means that it can be much cheaper to add a request to an engine that's already (or still) busy with a previous request than to send a new request to an idle engine.

For example, imagine that it takes the test/CPU/driver 2us to prepare a request up to the point of submission, but another 4us to push it into the submission port. Also assume that, once started, this batch takes 3us to execute on the GPU, and that handling the completion takes the driver another 2us of CPU time. Then the stream of requests will produce a pattern like this:

t0:      batch 1: 6us from user to h/w (idle->busy)
t0+6us:  GPU now running batch 1
t0+8us:  batch 2: 2us from user to queue (not submitted)
t0+9us:  GPU finished; IRQ handler samples queue (batch 2)
t0+10us: batch 3: 2us from user to queue (not submitted)
t0+11us: IRQ handler submits tail of batch 2
t0+12us: batch 4: 2us from user to queue (not submitted)
t0+14us: batch 5: 2us from user to queue (not submitted)
t0+15us: GPU now running batch 2
t0+16us: batch 6: 2us from user to queue (not submitted)
t0+18us: GPU finished; IRQ handler samples queue (batch 6)
t0+18us: batch 7: 2us from user to queue (not submitted)
t0+20us: batch 8: 2us from user to queue (not submitted)
t0+20us: IRQ handler coalesces requests, submits tail of batch 6
t0+20us: batch 9: 2us from user to queue (not submitted)
t0+22us: batch 10: 2us from user to queue (not submitted)
t0+24us: GPU now running batches 3-6
t0+24us: batch 11: 2us from user to queue (not submitted)
t0+26us: batch 12: 2us from user to queue (not submitted)
t0+28us: batch 13: 2us from user to queue (not submitted)
t0+30us: batch 14: 2us from user to queue (not submitted)
t0+32us: batch 15: 2us from user to queue (not submitted)
t0+34us: batch 16: 2us from user to queue (not submitted)
t0+36us: GPU finished; IRQ handler samples queue (batch 16)
t0+36us: batch 17: 2us from user to queue (not submitted)
t0+38us: batch 18: 2us from user to queue (not submitted)
t0+38us: IRQ handler coalesces requests, submits tail of batch 16
t0+40us: batch 19: 2us from user to queue (not submitted)
t0+42us: batch 20: 2us from user to queue (not submitted)
t0+42us: GPU now running batches 7-16

Thus, after the first few, *all* requests will be coalesced, and only a few of them will incur the overhead of writing to the ELSP or handling a context-complete interrupt. With the CPU generating a new batch every 2us and the GPU taking 3us/batch to execute them, the queue of outstanding requests will get longer and longer until the ringbuffer is nearly full, but the write to the ELSP will happen ever more rarely.

When we measure the overall time for the process, we will find the result is 3us/batch, i.e. the GPU batch execution time. The coalescing means that all the driver *and hardware* overheads are *completely* hidden.
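
The convergence towards the GPU execution time can be checked with a small simulation. This is a simplified model under stated assumptions, not the driver's logic: the submitting thread never stalls (no ringbuffer limit), and each completion interrupt resubmits everything queued so far as one coalesced write costing `irq_us + push_us`. It is not meant to reproduce every timestamp in the trace above; the function and parameter names are made up for illustration.

```python
def simulate_coalescing(n_batches, queue_us=2, push_us=4, exec_us=3, irq_us=2):
    """Return (average us per batch, number of ELSP writes) for n_batches.

    Assumptions of this sketch: the CPU queues one batch every queue_us
    without ever stalling, and each completion interrupt submits all
    batches queued strictly before it as a single coalesced write.
    """
    # Batch 1 takes the idle->busy path: prepare, then push to the hardware.
    first_start = queue_us + push_us
    # queued_at[0] is when batch 1 reaches the hardware;
    # queued_at[k] is when batch k+1 has been added to the queue.
    queued_at = [first_start + queue_us * k for k in range(n_batches)]

    gpu_done = first_start + exec_us
    completed = 1
    elsp_writes = 1
    while completed < n_batches:
        # Batches already queued when the completion interrupt fires.
        ready = sum(1 for t in queued_at[completed:] if t < gpu_done)
        if ready == 0:
            # Engine went idle (not the regime described here): the next
            # batch pays for the push itself.
            start = queued_at[completed] + push_us
            ready = 1
        else:
            start = gpu_done + irq_us + push_us
        gpu_done = start + ready * exec_us
        completed += ready
        elsp_writes += 1
    return gpu_done / n_batches, elsp_writes
```

With the figures from the example, 20 batches still average 4.5us each because the few submissions are spread over few batches, while a run of 100000 batches averages just over 3us with only a handful of ELSP writes.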

Now consider what happens if the batches are generated and submitted slightly slower, only one every 4us:

t1:      batch 1: 6us from user to h/w (idle->busy)
t1+6us:  GPU now running batch 1
t1+9us:  GPU finished; IRQ handler samples queue (empty)
t1+10us: batch 2: 6us from user to h/w (idle->busy)
t1+16us: GPU now running batch 2
t1+19us: GPU finished; IRQ handler samples queue (empty)
t1+20us: batch 3: 6us from user to h/w (idle->busy)
etc

This hits the worst case, where *every* batch submission needs to go through the most expensive path (and in doing so, delays the creation of the next workload, so we will never get out of this pattern). Our measurement will therefore show 10us/batch.
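
Both results can be read as one formula: GPU time per batch plus the per-submission overhead divided by the number of batches that share it. The sketch below assumes a fixed overhead per submission, which is a simplification; the 7us figure is simply the 10us cycle in the trace above minus the 3us of execution.

```python
def per_batch_us(exec_us, overhead_us, batches_per_submission):
    """Measured time per batch when one submission's overhead is shared."""
    return exec_us + overhead_us / batches_per_submission


EXEC_US = 3                 # GPU time per batch, from the example
OVERHEAD_US = 10 - EXEC_US  # assumption: everything in the 10us cycle that is not execution

# Every batch submitted alone to an idle engine: the overhead is paid in full.
worst = per_batch_us(EXEC_US, OVERHEAD_US, 1)
# A thousand batches coalesced behind one submission: the overhead all but vanishes.
amortised = per_batch_us(EXEC_US, OVERHEAD_US, 1000)
```

Here `worst` is 10us and `amortised` is barely above 3us, so the same hardware and driver produce either number depending only on how many batches each submission carries.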

*IF* we didn't have a BKL, it would be reasonable to expect that a suitable multi-threaded program on a CPU with more h/w threads than GPU engines could submit batches on any set of engines in parallel, and for each thread and engine, the execution time would be essentially independent of which engines were running concurrently.

Unfortunately, though, that lock-free scenario is not what we have today. The BKL means that only one thread can submit at a time (and in any case, the test program isn't multi-threaded). Therefore, if the test can generate and submit batches at a rate of one every 2us (as in the first "GOOD" scenario above), but those batches are being split across two different engines, it results in an effective submission rate of one per 4us, and flips into the second "BAD" scenario as a result.
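
The halving of the per-engine rate is just round-robin arithmetic, sketched below. The helper names are hypothetical, and the model assumes a single serialised submitter that alternates engines strictly, one batch at a time.

```python
def round_robin_schedule(n_batches, submit_period_us, engines):
    """Return {engine: [submission times in us]} for one serialised submitter."""
    times = {engine: [] for engine in engines}
    for i in range(n_batches):
        # Only one submission can be in progress at a time, so batch i
        # goes out at i * period, whichever engine it is destined for.
        times[engines[i % len(engines)]].append(i * submit_period_us)
    return times


def per_engine_interval_us(submit_period_us, n_engines):
    """Gap between consecutive batches as seen by any one engine."""
    return submit_period_us * n_engines
```

For example, `round_robin_schedule(6, 2, ["render", "blt"])` gives render the batches at 0, 4 and 8us and blt those at 2, 6 and 10us: each engine sees one batch every 4us even though the submitter is running at one every 2us.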

The conclusion, then, is that the parallel execution part of this test as written today isn't really measuring a meaningful quantity, and the pass-fail criterion in particular isn't telling us anything useful about the overhead (or latency) of various parts of the submission path.

I've written another test variant, which explores the NO-OP execution time as a function of both batch buffer size and the number of consecutive submissions to the same engine before switching to the next (burst size). Typical results look something like this:

IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc4-dsg-00786-g9a8bc43-dsg-test-32 x86_64)
Time to exec 8-byte batch:        3.136µs (ring=render)
Time to exec 8-byte batch:        1.294µs (ring=bsd)
Time to exec 8-byte batch:        1.263µs (ring=blt)
Time to exec 8-byte batch:        1.276µs (ring=vebox)
Time to exec 8-byte batch:        1.745µs (ring=all, sequential)
Time to exec 8-byte batch:        5.605µs (ring=all, parallel/1)
Time to exec 8-byte batch:        5.583µs (ring=all, parallel/2)
Time to exec 8-byte batch:        4.780µs (ring=all, parallel/4)
Time to exec 8-byte batch:        3.870µs (ring=all, parallel/8)
Time to exec 8-byte batch:        2.883µs (ring=all, parallel/16)
Time to exec 8-byte batch:        2.155µs (ring=all, parallel/32)
Time to exec 8-byte batch:        1.560µs (ring=all, parallel/64)
Time to exec 8-byte batch:        1.221µs (ring=all, parallel/128)
Time to exec 8-byte batch:        1.302µs (ring=all, parallel/256)
Time to exec 8-byte batch:        1.417µs (ring=all, parallel/512)
Time to exec 8-byte batch:        1.624µs (ring=all, parallel/1024)
Time to exec 8-byte batch:        1.680µs (ring=all, parallel/2048)

Time to exec 4Kbyte batch:       12.588µs (ring=render)
Time to exec 4Kbyte batch:       11.291µs (ring=bsd)
Time to exec 4Kbyte batch:       11.837µs (ring=blt)
Time to exec 4Kbyte batch:       11.355µs (ring=vebox)
Time to exec 4Kbyte batch:       11.770µs (ring=all, sequential)
Time to exec 4Kbyte batch:       11.109µs (ring=all, parallel/1)
Time to exec 4Kbyte batch:       11.094µs (ring=all, parallel/2)
Time to exec 4Kbyte batch:       11.087µs (ring=all, parallel/4)
Time to exec 4Kbyte batch:       11.046µs (ring=all, parallel/8)
Time to exec 4Kbyte batch:       10.984µs (ring=all, parallel/16)
Time to exec 4Kbyte batch:       10.957µs (ring=all, parallel/32)
Time to exec 4Kbyte batch:       10.942µs (ring=all, parallel/64)
Time to exec 4Kbyte batch:       10.928µs (ring=all, parallel/128)
Time to exec 4Kbyte batch:       11.118µs (ring=all, parallel/256)
Time to exec 4Kbyte batch:       11.359µs (ring=all, parallel/512)
Time to exec 4Kbyte batch:       11.562µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch:       11.663µs (ring=all, parallel/2048)

which clearly shows the effect of failing to coalesce (small) requests. But even this doesn't really reveal the numbers that would be of most interest, i.e. minimum/typical/maximum values for:
1. overhead from execbuf call to submission queue
2. latency from execbuf to h/w execution start (if queue empty)
3. latency from h/w completion to ELSP update
4. overhead of completion processing
5. etc
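
As a sketch of what reporting those quantities might look like, the helper below reduces a list of per-batch samples to minimum/typical/maximum. How the start and end timestamps would actually be collected is not covered here; the function names and the choice of the median as "typical" are assumptions.

```python
from statistics import median


def latencies(start_us, end_us):
    """Per-batch latency from paired start/end timestamps (hypothetical inputs)."""
    return [end - start for start, end in zip(start_us, end_us)]


def summarise(samples_us):
    """Return (minimum, typical, maximum); 'typical' is taken to be the median."""
    if not samples_us:
        raise ValueError("no samples")
    return min(samples_us), median(samples_us), max(samples_us)
```

Unlike a single average over the run, a summary of this kind keeps the cheap coalesced case and the expensive idle->busy case visible as separate ends of the range.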

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
