Module: Mesa
Branch: master
Commit: c4290a52ddbe11a5e78179392ca47467b17a46ce
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=c4290a52ddbe11a5e78179392ca47467b17a46ce

Author: Eric Anholt <[email protected]>
Date:   Thu Oct 15 13:54:38 2020 -0700

docs/vc4: Move my old vc4 wiki's documentation into docs.mesa3d.org.

Reviewed-by: Erik Faye-Lund <[email protected]>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7174>

---

 docs/contents.rst    |   1 +
 docs/drivers/vc4.rst | 287 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 288 insertions(+)

diff --git a/docs/contents.rst b/docs/contents.rst
index 96d40ba1b29..2873bbf43b7 100644
--- a/docs/contents.rst
+++ b/docs/contents.rst
@@ -54,6 +54,7 @@
    drivers/freedreno
    drivers/llvmpipe
    drivers/openswr
+   drivers/vc4
    drivers/vmware-guest
    drivers/zink
 
diff --git a/docs/drivers/vc4.rst b/docs/drivers/vc4.rst
new file mode 100644
index 00000000000..d80b568cd20
--- /dev/null
+++ b/docs/drivers/vc4.rst
@@ -0,0 +1,287 @@
+VC4
+===
+
+Mesa's ``vc4`` graphics driver supports multiple implementations of
+Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
+through Raspberry Pi 3 hardware, and the driver is included as an
+option as of the 2016-02-09 Raspbian release using ``raspi-config``.
+On most other distributions such as Debian or Fedora, you need no
+configuration to enable the driver.
+
+This Mesa driver talks directly to the `vc4
+<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
+driver for scheduling graphics commands, and that module also provides
+KMS display support. The driver makes no use of the closed source VPU
+firmware on the VideoCore IV block, instead talking directly to the
+GPU block from Linux.
+
+GLES2 support
+-------------
+
+The vc4 driver is a nearly conformant GLES2 driver, and the hardware
+has achieved GLES2 conformance with other driver stacks.
+
+OpenGL support
+--------------
+
+Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
+mostly correct but with a few caveats.
+
+* 4-byte index buffers.
+
+GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers.
+To support them in vc4, we create a shadow copy of your index buffer
+with the indices truncated to 2 bytes. This is incorrect (and will
+fail an assertion in debug builds of Mesa) if any of the indices were
+greater than 65535. To fix that, we would need to detect this case and
+rewrite the index buffer and vertex buffers to do a series of draws,
+each with small indices and new vertex attrib bindings.
+
+To avoid this problem, ensure that all index buffers are written using
+``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
+with updated vertex attrib bindings.
+
+* Occlusion queries
+
+The VC4 hardware has no support for occlusion queries. GL 2.0
+requires that you support the occlusion queries extension, but you can
+report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
+GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
+"we want the functions to be present everywhere, but we want it to be
+optional for hardware to support it." Sadly, gallium doesn't yet allow
+the driver to report 0 query bits.
+
+* Primitive mode
+
+VC4 doesn't support reducing triangles/quads/polygons to lines and
+points like desktop GL. If the front/back modes matched, we could
+rewrite the index buffer to the new primitive type, but we don't. If
+the front/back modes don't match, we would need to run the vertex
+shader in software, classify the prims, write new index buffers, and
+emit (possibly many) new draw calls to rasterize the new prims in the
+same order.
+
+Bug Reporting
+-------------
+
+VC4 rendering bugs should go to Mesa's gitlab `issues
+<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.
+
+By far the easiest way to communicate bug reports for rendering
+problems is to take an apitrace. This passes exactly the drawing you
+saw to the developer, without the developer needing to download and
+build the application and replicate whatever steps you took to produce
+the problem. Traces attached to bug reports should ideally be small.
+
+For GPU hangs, if you can get a short apitrace that produces the
+problem, that's still the best. If the problem takes a long time to
+reproduce or you can't capture it in a trace, describing how to
+reproduce and including a GPU hang dump would be the most
+useful. Install `vc4-gpu-tools
+<https://github.com/anholt/vc4-gpu-tools/>`__ and use
+``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
+provide useful information.
+
+Tiled Rendering
+---------------
+
+VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
+32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
+looks like::
+
+    (CPU) Allocate space to store a list of draw commands per tile
+    (CPU) Set up a command list per tile that does:
+        Either load the current tile's color buffer from memory, or clear it.
+        Either load the current tile's depth buffer from memory, or clear it.
+        Branch into the draw list for the tile
+        Store the depth buffer if anybody might read it.
+        Store the color buffer if anybody might read it.
+    (GPU) Initialize the per-tile draw call lists to empty.
+    (GPU) Run all draw calls collecting vertex data
+    (GPU) For each tile covered by a draw call's primitive:
+        Emit state packets to the list to update it to the current draw call's state.
+        Emit a primitive description into the tile's draw call list.
+
+Tiled rendering avoids the need for large render target caches, at the
+expense of increasing the cost of vertex processing. Unlike some tiled
+renderers, VC4 has no non-tiled rendering mode.
+
+Performance Tricks
+------------------
+
+* Reducing memory bandwidth by clearing.
+
+Even if your drawing is going to cover the entire render target, it's
+more efficient for VC4 if you emit a ``glClear()`` of the color and
+depth buffers. This means we can skip the load of the previous state
+from memory, in favor of a cheap GPU-side ``memset()`` of the tile
+buffer before we start running the draw calls.
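
To get a rough feel for how much load traffic a clear can save, the tile sizes given under "Tiled Rendering" can be turned into a back-of-the-envelope estimate. The following is a minimal sketch, not driver code: the function names are hypothetical, and it assumes a 4-byte-per-pixel color buffer and that every tile touched is loaded in full, so the result is only an upper bound. The same tile arithmetic shows how few tiles a small scissor rectangle touches.

```python
def tile_size(msaa=False):
    """Tile edge length in pixels: 64 without MSAA, 32 with MSAA."""
    return 32 if msaa else 64


def tile_count(width, height, msaa=False):
    """Number of tiles needed to cover a width x height render target."""
    t = tile_size(msaa)
    # Integer ceiling division: partially covered edge tiles still count.
    return ((width + t - 1) // t) * ((height + t - 1) // t)


def color_load_bytes(width, height, bytes_per_pixel=4):
    """Upper bound on bytes read when every non-MSAA tile's color buffer
    is loaded from memory instead of cleared.

    Assumption (not from the driver): whole tiles are loaded at
    bytes_per_pixel bytes per pixel.
    """
    t = tile_size(msaa=False)
    return tile_count(width, height) * t * t * bytes_per_pixel


def tiles_in_rect(x, y, w, h, msaa=False):
    """Number of tiles touched by a scissor-style rectangle."""
    t = tile_size(msaa)
    cols = (x + w - 1) // t - x // t + 1
    rows = (y + h - 1) // t - y // t + 1
    return cols * rows


if __name__ == "__main__":
    print(tile_count(1920, 1080))            # tiles for a 1080p target
    print(color_load_bytes(1920, 1080))      # bytes avoided by clearing
    print(tiles_in_rect(0, 0, 640, 480))     # tiles under a 640x480 scissor
```

Under these assumptions a 1920x1080 target spans 510 tiles, so skipping the color load avoids reading roughly 8 MB per frame, and a 640x480 scissor touches only 80 of those tiles.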
+
+* Reducing memory bandwidth with scissoring.
+
+If all draw calls for the frame are with a ``glScissor()`` to only
+part of the screen, then we can skip setting up the tiles outside that
+area, which means a little less memory used setting up the empty bins,
+and a lot less memory used loading/storing the unchanged tiles.
+
+* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.
+
+If we don't know who might use the contents of the framebuffer's depth
+or color in the future, then we have to store it for later. If you use
+``glInvalidateFramebuffer()`` before accessing the results of your
+rendering, then we can skip the store of the depth or color
+buffer. Note that this optimization is not yet implemented in the
+driver.
+
+* Avoid non-constant GLSL array indexing
+
+In VC4 the only non-constant-index array access supported in hardware
+is uniforms. For everything else (inputs, outputs, temporaries), we
+have to lower them to an IF ladder like::
+
+    if (index == 0)
+        return array[0]
+    else if (index == 1)
+        return array[1]
+    ...
+
+This is very expensive as we probably have to execute every branch of
+every IF statement due to it being a SIMD machine. So, it is
+recommended (if you can) to avoid non-uniform non-constant array
+indexing.
+
+Note that if you do variable indexing within a bounded loop that Mesa
+can unroll, that can actually count as constant indexing.
+
+* Increasing GPU memory: increase the CMA pool size
+
+The memory for the VC4 driver is allocated from the standard Linux CMA
+pool. The size of this pool defaults to 64 MB. To increase this, pass
+an additional parameter on the kernel command line. Edit the boot
+partition's ``cmdline.txt`` to add::
+
+    cma=256M@256M
+
+``cmdline.txt`` is a single line with whitespace separated parameters.
+
+The first value is the size of the pool and the second value is the
+start address of the pool.
+The pool size can be increased further, but it must fit into memory,
+so size + start address must be below 1024M (Pi 2, 3, 3+) or 512M
+(Pi B, B+, Zero, Zero W). Increasing the pool also reduces the memory
+available to Linux.
+
+* Decrease firmware memory
+
+The firmware allocates a fixed chunk of memory before booting
+Linux. If firmware functions are not required, this amount can be
+reduced.
+
+In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
+decoding, or to 64 if you do.
+
+Performance debugging
+---------------------
+
+* Step 1: Known issues
+
+The first tool to look at is running your application with the
+environment variable ``VC4_DEBUG=perf`` set. This will report debug
+information for many known causes of performance problems on the
+console. Not all of them will cause visible performance improvements
+when fixed, but it's a good first step to see what might be going
+wrong.
+
+* Step 2: CPU vs GPU
+
+The primary question is figuring out whether the CPU is busy in your
+application, the CPU is busy in the GL driver, the GPU is waiting for
+the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
+point where the CPU is waiting for the GPU infrequently but for a
+significant amount of time (however long it takes the GPU to draw a
+frame).
+
+Start with ``top`` while your application is running. Is the CPU usage
+around 90%+? If so, then our performance analysis will be with
+sysprof. If it's not very high, is the GPU staying busy? We don't have
+a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
+useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
+means that the GPU is currently busy processing some rendering job.
+
+* sysprof for CPU usage
+
+If the CPU is totally busy and the GPU isn't terribly busy, there is
+an excellent tool for debugging: sysprof. Install, run as root (so you
+can get system-wide profiling), hit play and later stop.
+The top-left area shows the flat profile sorted by total time of that
+symbol plus its descendants. The top few are generally uninteresting
+(main() and its descendants consuming a lot), but eventually you can
+get down to something interesting. Click it, and to the right you get
+the callchains to descendants -- where all that time actually went. On
+the other hand, the lower left shows callers -- double-clicking those
+selects that as the symbol to view, instead.
+
+Note that you need debug symbols for the callgraphs in sysprof to
+work, which is where most of its value is. Most distributions offer
+debug symbol packages from their builds which can be installed
+separately, and sysprof will find them. I've found that on ARM, the
+debug packages are not enough, and if someone could determine what is
+necessary for callgraphs in debugging, that would be really helpful.
+
+* perf for CPU waits on GPU
+
+If the CPU is not very busy and the GPU is not very busy, then we're
+probably ping-ponging between the two. Most cases of this would be
+noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
+this happens, use the perf tool from the Linux kernel (note: unrelated
+to ``VC4_DEBUG=perf``)::
+
+    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena
+
+If you want to see the whole system's stalls for a period of time
+(very useful!), use the ``-a`` flag instead of a particular command
+name. Just ``^C`` when you're done capturing data.
+
+At exit, you'll have ``perf.data`` in the current directory. You can print
+out the results with::
+
+    perf report | less
+
+* Debugging for GPU fully busy
+
+As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
+performance counters in OpenGL.
+Install apitrace, and trace your application with::
+
+    apitrace trace <application>          # for GLX applications
+    apitrace trace -a egl <application>   # for EGL applications
+
+Once you've captured a trace, you can see what counters are available
+and replay it while looking at some of those counters::
+
+    apitrace replay <application>.trace --list-metrics
+
+    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading
+
+Multiple counters can be captured at once with commas separating them.
+
+Once you've found which draw calls are surprisingly expensive in one
+of the counters, you can work out which ones they were at the GL level
+by opening the trace up in qapitrace and using ``^-G`` to jump to that
+call number and ``^-L`` to look up the GL state at that call.
+
+shader-db
+---------
+
+shader-db is often used as a proxy for real-world app performance when
+working on the compiler in Mesa. On vc4, there is a lot of
+state-dependent code in the shaders (like blending or vertex attribute
+format handling), so the typical `shader-db
+<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
+areas for optimization. Instead, anholt wrote a `new one
+<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
+apitraces. Once you have a collection of traces, starting from
+`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
+you can test a compiler change in this shader-db with::
+
+    ./run.py > before
+    (cd ../mesa && make install)
+    ./run.py > after
+    ./report.py before after

_______________________________________________
mesa-commit mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/mesa-commit
