On 18/08/16 16:27, Dave Gordon wrote:

[snip]

Note that SKL GuC firmware 6.1 didn't support dual submission or lite
restore, whereas the next version (8.11) does. Therefore, with that
firmware we don't see the same slowdown when going to 1-at-a-time
round-robin. I have a different (new) test that shows this more clearly.

This is with GuC version 6.1:

skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS

Time to exec 8-byte batch:        3.428µs (ring=render)
Time to exec 8-byte batch:        2.444µs (ring=bsd)
Time to exec 8-byte batch:        2.394µs (ring=blt)
Time to exec 8-byte batch:        2.615µs (ring=vebox)
Time to exec 8-byte batch:        2.625µs (ring=all, sequential)
Time to exec 8-byte batch:       12.701µs (ring=all, parallel/1) ***
Time to exec 8-byte batch:        7.259µs (ring=all, parallel/2)
Time to exec 8-byte batch:        4.336µs (ring=all, parallel/4)
Time to exec 8-byte batch:        2.937µs (ring=all, parallel/8)
Time to exec 8-byte batch:        2.661µs (ring=all, parallel/16)
Time to exec 8-byte batch:        2.245µs (ring=all, parallel/32)
Time to exec 8-byte batch:        1.626µs (ring=all, parallel/64)
Time to exec 8-byte batch:        2.170µs (ring=all, parallel/128)
Time to exec 8-byte batch:        1.804µs (ring=all, parallel/256)
Time to exec 8-byte batch:        2.602µs (ring=all, parallel/512)
Time to exec 8-byte batch:        2.602µs (ring=all, parallel/1024)
Time to exec 8-byte batch:        2.607µs (ring=all, parallel/2048)

Time to exec 4Kbyte batch:       14.835µs (ring=render)
Time to exec 4Kbyte batch:       11.787µs (ring=bsd)
Time to exec 4Kbyte batch:       11.533µs (ring=blt)
Time to exec 4Kbyte batch:       11.991µs (ring=vebox)
Time to exec 4Kbyte batch:       12.444µs (ring=all, sequential)
Time to exec 4Kbyte batch:       16.211µs (ring=all, parallel/1)
Time to exec 4Kbyte batch:       13.943µs (ring=all, parallel/2)
Time to exec 4Kbyte batch:       13.878µs (ring=all, parallel/4)
Time to exec 4Kbyte batch:       13.841µs (ring=all, parallel/8)
Time to exec 4Kbyte batch:       14.188µs (ring=all, parallel/16)
Time to exec 4Kbyte batch:       13.747µs (ring=all, parallel/32)
Time to exec 4Kbyte batch:       13.734µs (ring=all, parallel/64)
Time to exec 4Kbyte batch:       13.727µs (ring=all, parallel/128)
Time to exec 4Kbyte batch:       13.947µs (ring=all, parallel/256)
Time to exec 4Kbyte batch:       12.230µs (ring=all, parallel/512)
Time to exec 4Kbyte batch:       12.147µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch:       12.617µs (ring=all, parallel/2048)

What this shows is that the submission overhead is ~3µs, which is
comparable with the execution time of a trivial (8-byte) batch but
insignificant compared with the time to execute the 4Kbyte batch. The
burst size therefore makes very little difference to the larger batches.
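
To put a number on that comparison, here is a minimal sketch that takes a
few of the timings from the GuC 6.1 run above and expresses each parallel
mode as a multiple of the sequential time for the same batch size. The
dictionary layout and the slowdown() helper are illustrative assumptions
only; they are not part of gem_exec_paranop or intel-gpu-tools.

```python
# Per-batch timings in microseconds, copied from the GuC 6.1 run above.
# Only a subset of the rows is included to keep the sketch short.
TIMES_US = {
    "8-byte": {"sequential": 2.625, "parallel/1": 12.701, "parallel/2048": 2.607},
    "4Kbyte": {"sequential": 12.444, "parallel/1": 16.211, "parallel/2048": 12.617},
}


def slowdown(batch, mode, baseline="sequential"):
    """Return a mode's per-batch time as a multiple of the baseline's.

    Hypothetical helper for illustration; not an intel-gpu-tools function.
    """
    times = TIMES_US[batch]
    return times[mode] / times[baseline]


if __name__ == "__main__":
    for batch in TIMES_US:
        for mode in ("parallel/1", "parallel/2048"):
            ratio = slowdown(batch, mode)
            print(f"{batch:7s} {mode:14s} {ratio:.2f}x sequential")
```

On these figures the 1-at-a-time case costs roughly 4.8x the sequential
time for 8-byte batches, but only about 1.3x for 4Kbyte batches.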

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
