[Intel-gfx] igt/gem_exec_nop parallel test: why it isn't useful

Dave Gordon Thu, 01 Sep 2016 09:58:26 -0700

The gem_exec_nop test generally works by submitting batches to an engine as fast as possible for a fixed time, then finally calling gem_sync() to wait for the last submitted batch to complete. The time-per-batch is then calculated as the total elapsed time, divided by the total number of batches submitted.
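To make the measurement concrete, here is a minimal sketch of that loop in Python rather than the test's own C. The `submit` and `sync` callables are hypothetical stand-ins for the execbuf call and gem_sync(); nothing here is taken from the IGT source.

```python
import time

def time_per_batch(submit, sync, duration_s):
    """Submit as fast as possible for duration_s, sync once, then average.

    submit and sync are hypothetical callables standing in for the
    execbuf submission and the final gem_sync(); duration_s must be > 0.
    Returns (seconds per batch, number of batches submitted).
    """
    count = 0
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        submit()          # queue one more batch, without waiting for it
        count += 1
    sync()                # wait only for the last submitted batch
    elapsed = time.monotonic() - start
    return elapsed / count, count
```

Note that the result is a single average over the whole run, so it cannot distinguish cheap submissions from expensive ones.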

The problem with this approach as a measurement of driver overhead, or latency (or anything else) is that the amount of work involved in submitting a batch is not a simple constant; in particular, it depends on the state of the various queues in the execution path. And it has the rather strange characteristic that if the GPU runs slightly faster, the driver may go much slower!

The main reason here is the lite-restore mechanism, although it interacts with dual-submission and the details of handling the completion interrupt. In particular, lite-restore means that it can be much cheaper to add a request to an engine that's already (or still) busy with a previous request than to send a new request to an idle engine.

For example, imagine that it takes the (test/CPU/driver) 2us to prepare a request up to the point of submission, but another 4us to push it into the submission port. Also assume that once started, this batch takes 3us to execute on the GPU, and handling the completion takes the driver another 2us of CPU time. Then the stream of requests will produce a pattern like this:


t0:      batch 1: 6us from user to h/w (idle->busy)
t0+6us:  GPU now running batch 1
t0+8us:  batch 2: 2us from user to queue (not submitted)
t0+9us:  GPU finished; IRQ handler samples queue (batch 2)
t0+10us: batch 3: 2us from user to queue (not submitted)
t0+11us: IRQ handler submits tail of batch 2
t0+12us: batch 4: 2us from user to queue (not submitted)
t0+14us: batch 5: 2us from user to queue (not submitted)
t0+15us: GPU now running batch 2
t0+16us: batch 6: 2us from user to queue (not submitted)
t0+18us: GPU finished; IRQ handler samples queue (batch 6)
t0+18us: batch 7: 2us from user to queue (not submitted)
t0+20us: batch 8: 2us from user to queue (not submitted)
t0+20us: IRQ handler coalesces requests, submits tail of batch 6
t0+20us: batch 9: 2us from user to queue (not submitted)
t0+22us: batch 10: 2us from user to queue (not submitted)
t0+24us: GPU now running batches 3-6
t0+24us: batch 11: 2us from user to queue (not submitted)
t0+26us: batch 12: 2us from user to queue (not submitted)
t0+28us: batch 13: 2us from user to queue (not submitted)
t0+30us: batch 14: 2us from user to queue (not submitted)
t0+32us: batch 15: 2us from user to queue (not submitted)
t0+34us: batch 16: 2us from user to queue (not submitted)
t0+36us: GPU finished; IRQ handler samples queue (batch 16)
t0+36us: batch 17: 2us from user to queue (not submitted)
t0+38us: batch 18: 2us from user to queue (not submitted)
t0+38us: IRQ handler coalesces requests, submits tail of batch 16
t0+40us: batch 19: 2us from user to queue (not submitted)
t0+42us: batch 20: 2us from user to queue (not submitted)
t0+42us: GPU now running batches 7-16

Thus, after the first few, *all* requests will be coalesced, and only a few of them will incur the overhead of writing to the ELSP or handling a context-complete interrupt. With the CPU generating a new batch every 2us and the GPU taking 3us/batch to execute them, the queue of outstanding requests will get longer and longer until the ringbuffer is nearly full, but the write to the ELSP will happen ever more rarely.

When we measure the overall time for the process, we will find the result is 3us/batch, i.e. the GPU batch execution time. The coalescing means that all the driver *and hardware* overheads are *completely* hidden.

Now consider what happens if the batches are generated and submitted slightly slower, only one every 4us:


t1:      batch 1: 6us from user to h/w (idle->busy)
t1+6us:  GPU now running batch 1
t1+9us:  GPU finished; IRQ handler samples queue (empty)
t1+10us: batch 2: 6us from user to h/w (idle->busy)
t1+16us: GPU now running batch 2
t1+19us: GPU finished; IRQ handler samples queue (empty)
t1+20us: batch 3: 6us from user to h/w (idle->busy)
etc

This hits the worst case, where *every* batch submission needs to go through the most expensive path (and in doing so, delays the creation of the next workload, so we will never get out of this pattern). Our measurement will therefore show 10us/batch.
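The flip between the two regimes can be reproduced with a toy model. The sketch below is a deliberate simplification, not the driver's logic: it assumes one submitting thread, one engine, the example costs above (4us push, 3us execution, 2us completion handling), a completion handler that does not slow the submitter, and no ringbuffer limit. The function and parameter names are made up for illustration.

```python
def simulate(n_batches, prep_us, push_us=4.0, exec_us=3.0, irq_us=2.0):
    """Toy single-engine model; returns (average us per batch, idle-path pushes)."""
    cpu = 0.0        # clock of the submitting thread
    gpu_done = 0.0   # time at which the engine's current submission finishes
    queued = 0       # requests waiting in the software queue
    idle_pushes = 0  # submissions that took the expensive idle->busy path
    for _ in range(n_batches):
        cpu += prep_us                       # prepare the request
        if queued and gpu_done <= cpu:
            # The GPU finished earlier with work queued: the IRQ handler
            # coalesced the whole queue into one submission (irq_us to
            # sample the queue, push_us to write the port).
            gpu_done += irq_us + push_us + queued * exec_us
            queued = 0
        if cpu < gpu_done:
            queued += 1                      # engine busy: just queue it
        else:
            cpu += push_us                   # engine idle: submitter pays the push
            gpu_done = cpu + exec_us
            idle_pushes += 1
    if queued:                               # drain whatever is still queued
        gpu_done += irq_us + push_us + queued * exec_us
    return gpu_done / n_batches, idle_pushes

fast_avg, fast_pushes = simulate(10_000, prep_us=2.0)
slow_avg, slow_pushes = simulate(10_000, prep_us=4.0)
print(f"2us/request: {fast_avg:.2f}us per batch, {fast_pushes} idle-path pushes")
print(f"4us/request: {slow_avg:.2f}us per batch, {slow_pushes} idle-path pushes")
```

Under these assumptions the first case settles close to 3us/batch with a single idle-path push, while in the second every request takes the idle path. The absolute slow-path figure depends on which costs are charged to the submitting thread, so it need not match the 10us of the hand-worked timeline.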

*IF* we didn't have a BKL, it would be reasonable to expect that a suitable multi-threaded program on a CPU with more h/w threads than GPU engines could submit batches on any set of engines in parallel, and for each thread and engine, the execution time would be essentially independent of which engines were running concurrently.

Unfortunately, though, that lock-free scenario is not what we have today. The BKL means that only one thread can submit at a time (and in any case, the test program isn't multi-threaded). Therefore, if the test can generate and submit batches at a rate of one every 2us (as in the first "GOOD" scenario above), but those batches are being split across two different engines, it results in an effective submission rate of one per 4us, and flips into the second "BAD" scenario as a result.
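A rough way to see the effect of splitting one submission stream across engines is a toy model with a single submitter feeding several engines round-robin. Everything here is an assumption made for illustration: the costs are the example figures used earlier (2us prepare, 4us push, 3us execute, 2us completion handling), the completion handler does not slow the submitter, and `burst` is the number of consecutive requests sent to one engine before moving on.

```python
def simulate(n_batches, n_engines=1, burst=1,
             prep_us=2.0, push_us=4.0, exec_us=3.0, irq_us=2.0):
    """Toy model: one submitter, round-robin over engines; returns us per batch."""
    cpu = 0.0                    # clock of the single submitting thread
    done = [0.0] * n_engines     # when each engine's current submission finishes
    queued = [0] * n_engines     # requests waiting in each engine's queue
    for i in range(n_batches):
        e = (i // burst) % n_engines         # engine for this request
        cpu += prep_us
        if queued[e] and done[e] <= cpu:
            # Engine e completed with work queued: coalesce it all into
            # one submission issued from the completion handler.
            done[e] += irq_us + push_us + queued[e] * exec_us
            queued[e] = 0
        if cpu < done[e]:
            queued[e] += 1                   # busy engine: cheap path
        else:
            cpu += push_us                   # idle engine: expensive path
            done[e] = cpu + exec_us
    for e in range(n_engines):               # drain the remaining queues
        if queued[e]:
            done[e] += irq_us + push_us + queued[e] * exec_us
    return max(done) / n_batches

single = simulate(1024)                          # one engine
split = simulate(1024, n_engines=2, burst=1)     # alternate every request
bursty = simulate(1024, n_engines=2, burst=64)   # 64 requests per engine per turn
print(f"{single:.2f} {split:.2f} {bursty:.2f}")
```

In this model, alternating engines on every request means each engine has gone idle by the time its next request arrives, so every submission pays the push cost; with longer bursts most requests are queued behind running work and coalesced.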

The conclusion, then, is that the parallel execution part of this test as written today isn't really measuring a meaningful quantity, and the pass-fail criterion in particular isn't telling us anything useful about the overhead (or latency) of various parts of the submission path.

I've written another test variant, which explores the NO-OP execution time as a function of both batch buffer size and the number of consecutive submissions to the same engine before switching to the next (burst size). Typical results look something like this:

IGT-Version: 1.15-gd09ad86 (x86_64) (Linux:4.8.0-rc4-dsg-00786-g9a8bc43-dsg-test-32 x86_64)

Time to exec 8-byte batch:        3.136µs (ring=render)
Time to exec 8-byte batch:        1.294µs (ring=bsd)
Time to exec 8-byte batch:        1.263µs (ring=blt)
Time to exec 8-byte batch:        1.276µs (ring=vebox)
Time to exec 8-byte batch:        1.745µs (ring=all, sequential)
Time to exec 8-byte batch:        5.605µs (ring=all, parallel/1)
Time to exec 8-byte batch:        5.583µs (ring=all, parallel/2)
Time to exec 8-byte batch:        4.780µs (ring=all, parallel/4)
Time to exec 8-byte batch:        3.870µs (ring=all, parallel/8)
Time to exec 8-byte batch:        2.883µs (ring=all, parallel/16)
Time to exec 8-byte batch:        2.155µs (ring=all, parallel/32)
Time to exec 8-byte batch:        1.560µs (ring=all, parallel/64)
Time to exec 8-byte batch:        1.221µs (ring=all, parallel/128)
Time to exec 8-byte batch:        1.302µs (ring=all, parallel/256)
Time to exec 8-byte batch:        1.417µs (ring=all, parallel/512)
Time to exec 8-byte batch:        1.624µs (ring=all, parallel/1024)
Time to exec 8-byte batch:        1.680µs (ring=all, parallel/2048)

Time to exec 4Kbyte batch:       12.588µs (ring=render)
Time to exec 4Kbyte batch:       11.291µs (ring=bsd)
Time to exec 4Kbyte batch:       11.837µs (ring=blt)
Time to exec 4Kbyte batch:       11.355µs (ring=vebox)
Time to exec 4Kbyte batch:       11.770µs (ring=all, sequential)
Time to exec 4Kbyte batch:       11.109µs (ring=all, parallel/1)
Time to exec 4Kbyte batch:       11.094µs (ring=all, parallel/2)
Time to exec 4Kbyte batch:       11.087µs (ring=all, parallel/4)
Time to exec 4Kbyte batch:       11.046µs (ring=all, parallel/8)
Time to exec 4Kbyte batch:       10.984µs (ring=all, parallel/16)
Time to exec 4Kbyte batch:       10.957µs (ring=all, parallel/32)
Time to exec 4Kbyte batch:       10.942µs (ring=all, parallel/64)
Time to exec 4Kbyte batch:       10.928µs (ring=all, parallel/128)
Time to exec 4Kbyte batch:       11.118µs (ring=all, parallel/256)
Time to exec 4Kbyte batch:       11.359µs (ring=all, parallel/512)
Time to exec 4Kbyte batch:       11.562µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch:       11.663µs (ring=all, parallel/2048)

which clearly shows the effect of failing to coalesce (small) requests. But even this doesn't really reveal the numbers that would be of most interest, i.e. minimum/typical/maximum values for:

1. overhead from execbuf call to submission queue
2. latency from execbuf to h/w execution start (if queue empty)
3. latency from h/w completion to ELSP update
4. overhead of completion processing
5. etc
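The size of the coalescing effect in the results table can be put into numbers with a few lines of Python. The values are copied from the "ring=all, parallel/N" rows above; the helper name is made up for illustration.

```python
# Per-batch times in microseconds, keyed by burst size N ("parallel/N").
small = {1: 5.605, 2: 5.583, 4: 4.780, 8: 3.870, 16: 2.883, 32: 2.155,
         64: 1.560, 128: 1.221, 256: 1.302, 512: 1.417, 1024: 1.624,
         2048: 1.680}                      # 8-byte batches
large = {1: 11.109, 2: 11.094, 4: 11.087, 8: 11.046, 16: 10.984, 32: 10.957,
         64: 10.942, 128: 10.928, 256: 11.118, 512: 11.359, 1024: 11.562,
         2048: 11.663}                     # 4Kbyte batches

def burst_penalty(times):
    """Return (burst size with the lowest time, burst-1 time / lowest time)."""
    best = min(times, key=times.get)
    return best, times[1] / times[best]

for name, times in (("8-byte", small), ("4Kbyte", large)):
    best, ratio = burst_penalty(times)
    print(f"{name}: best burst {best}, burst 1 is {ratio:.2f}x slower")
```

For the 8-byte batches, switching engines on every submission costs several times more per batch than the best burst size, whereas for the 4Kbyte batches the difference is within a few percent.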

.Dave.
