http://keithp.com/blogs/Sharpening_the_Intel_Driver_Focus/
Sharpening the Intel Driver Focus
This week, we finished up our 2009 Q1 release of the Intel driver. Most of the effort for this quarter has been to stabilize the recent work, focusing on serious bugs and testing as many combinations as we could manage. For the last year or so, we’ve been busy rebuilding the driver, adding new ways of managing memory, setting modes and communicating between user space and the kernel. Because all of these changes cross multiple projects (X/Mesa/Linux), we’ve tried to make sure we supported all of the possible combinations. Let’s see what options we’ve got:

Mode Setting: User, Kernel
Direct Rendering: None, DRI1, DRI2
Memory Management: X server, GEM
2D acceleration: None, XAA, EXA, UXA
Pick One From Each Column

Now, many of the above choices can be made independently: you can use User mode setting with DRI1, classic memory management and XAA. Or you can select Kernel mode setting with DRI1, GEM and EXA. With 2 × 3 × 2 × 4 = 48 combinations, you can imagine that not all of them have been tested.
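As a rough check on that arithmetic, the sketch below enumerates one pick from each column. The option labels are taken from the options named in this post and are illustrative only; they are not actual driver or configuration-file settings, and the names `OPTIONS` and `enumerate_configs` are made up for this example.

```python
from itertools import product

# Illustrative labels only: one list per column of choices.
OPTIONS = {
    "mode_setting": ["user", "kernel"],
    "direct_rendering": ["none", "DRI1", "DRI2"],
    "memory_management": ["X server", "GEM"],
    "accel_2d": ["none", "XAA", "EXA", "UXA"],
}


def enumerate_configs(options):
    """Return every way of picking exactly one value from each column."""
    names = list(options)
    return [dict(zip(names, choice)) for choice in product(*options.values())]


all_configs = enumerate_configs(OPTIONS)
print(len(all_configs))  # 48
```

Every entry in `all_configs` is one configuration that would, in principle, need testing, which is where the burden comes from.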
Two years ago, you had a lot fewer choices (only user mode setting, none or DRI1 direct rendering, only X server memory management and only none, XAA or EXA acceleration: 1 × 2 × 1 × 3 = 6 choices). Even then, choosing between XAA and EXA was quite contentious: EXA would thrash memory badly, while XAA would effectively disable acceleration for pixmaps as soon as it ran out of its (tiny) off-screen space.

In moving towards our eventual goal of a KMS/GEM/DRI2 world, we’ve felt obligated to avoid removing options until that goal worked best for as many people as possible. So, instead of forcing people to switch to brand new code that hasn’t been entirely stable or fast, we’ve tried to make sure that each release of the driver has at least continued to work with the older options. However, some of the changes we’ve made have caused performance regressions in these older options, which doesn’t exactly make people happy: the old code runs slow, and the new code isn’t quite ready for prime time in all situations.

One option here would be to stop shipping code and sit around working on the ‘perfect’ driver, to be released soon after the heat-death of the universe. Instead, we decided (without much discussion, I’ll have to admit) to keep shipping stuff, make it work as well as we knew how, and engage the community in helping us make this fairly significant transition to our new world order. We did, however, make a very conscious choice to push out new code quickly, since getting exposure to real users is often the best way to make sure you’re not making terrible mistakes in the design. The thinking was that users could always switch back to the ‘old’ code if the new code caused problems.
Of course, sometimes that ‘old’ code saw fairly significant changes while the new code was integrated… You can imagine that our internal testing people haven’t been entirely happy with this plan either. Our count of bugs has been far too high for far too long, and while we spent the last three months doing nothing but fixing things, it’s still a lot higher than I’d like to see.

Performance Differences

Only a few things in the above lists have obvious performance implications: choose XAA and your performance for modern applications will suffer, as it offers no acceleration for the Render extension. So, why does switching from EXA to UXA change the performance characteristics of the X server so much? The simple answer is that UXA, GEM and KMS haven’t been tweaked on every platform yet.

For example, hardware rendering performance is affected by how memory is accessed by the drawing engine. There are two ways of mapping pixels, “linear” and “tiled”. In linear mode, pixels are stored in sequential addresses all the way across each scanline; subsequent scanlines are at ever higher addresses. It’s a simple plan, and all of the software rendering code in the X server assumes this model. In tiled mode, rectangular chunks of the screen are stored in adjacent areas in memory; a block of 128x8 pixels forms an ‘X tile’ in the Intel hardware. Drawing to vertically adjacent pixels in this mode means touching the same page, reducing PTE (page table entry) thrashing compared with linear mode. For systems with a limited number of PTEs and limited caches inside the graphics hardware, tiled mode offers tremendous performance improvements. However, getting everything lined up to hit tiled mode is a pain, and on some hardware, in some configurations, it doesn’t happen, so you see a huge drop in performance.

Similarly, mapping pages in and out of the GTT (the graphics translation table) sometimes requires that the contents be flushed from CPU or GPU caches.
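To make the linear and tiled layouts described above concrete, here is a minimal sketch that counts how many memory pages a short vertical line touches under each layout. It assumes 32-bit pixels (so a 128x8 tile is exactly 4096 bytes), 4 KiB pages, a surface width that is a multiple of the tile width, and a simple row-major ordering of tiles and of rows within a tile. The real hardware layout has more detail than this, and all of the function names are hypothetical.

```python
BYTES_PER_PIXEL = 4        # assumption: 32-bit pixels
TILE_W, TILE_H = 128, 8    # X tile size in pixels, as given in the text
PAGE_SIZE = 4096           # assumption: 4 KiB pages
TILE_BYTES = TILE_W * TILE_H * BYTES_PER_PIXEL  # 4096 bytes, one page


def linear_offset(x, y, width):
    """Byte offset of pixel (x, y) when scanlines are stored end to end."""
    return (y * width + x) * BYTES_PER_PIXEL


def tiled_offset(x, y, width):
    """Byte offset of pixel (x, y) in a simplified tiled layout.

    Tiles are stored one after another in row-major order, and the
    rows inside a tile are stored consecutively (assumed ordering).
    """
    tiles_per_row = width // TILE_W
    tile_x, in_x = divmod(x, TILE_W)
    tile_y, in_y = divmod(y, TILE_H)
    tile_index = tile_y * tiles_per_row + tile_x
    return tile_index * TILE_BYTES + (in_y * TILE_W + in_x) * BYTES_PER_PIXEL


def pages_touched(offset_fn, x, y0, y1, width):
    """Set of page numbers hit by a vertical line at column x, rows y0..y1-1."""
    return {offset_fn(x, y, width) // PAGE_SIZE for y in range(y0, y1)}


WIDTH = 1024  # pixels; one linear scanline is then exactly one page
linear_pages = pages_touched(linear_offset, 10, 0, 8, WIDTH)
tiled_pages = pages_touched(tiled_offset, 10, 0, 8, WIDTH)
print(len(linear_pages), len(tiled_pages))  # 8 1
```

With a 1024-pixel-wide surface, an 8-pixel vertical line touches 8 different pages in linear mode but only 1 in this tiled model, which is the effect that reduces PTE thrashing.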
Now, GPU cache flushing isn’t cheap, but we end up doing it all the time as that’s how rendering contents are guaranteed to become visible on the screen. CPU cache flushing, on the other hand, is something you’re never “supposed” to do, as all I/O operations over PCI and communication between CPU cores are cache-coherent. Except for the GPU. So, we end up using some fairly dire slow-paths in the CPU whenever we end up doing this. UXA isn’t supposed to hit cache flushing paths while drawing, but sometimes it still happens, and when it does, UXA loses performance. On the other hand, failing to dynamically map objects into the GTT means that some objects don’t fit, so EXA spends a huge amount of time copying data around, in which case EXA suffers.

The difference between DRI1 and DRI2 is due in part to the context switch necessary to get buffer swap commands from the DRI2 application to the X server, which owns the ‘real’ front buffer. For an application like glxgears, which draws almost nothing and spends most of its time clearing and swapping, the impact can be significant (note: glxgears is not a benchmark; this is just one of many reasons). On the other hand, having private back buffers means that partially obscured applications will draw faster, not having to loop over clip rectangles in the main rendering loop.

The obvious result here is that we’re at a point where application performance goes all over the map, depending on the hardware platform and the particular set of configuration options selected.

Light at Tunnel’s End

The good news is that our redesign is now complete, and we have the architecture we want in place throughout the system: global graphics memory management, kernel mode setting and per-window 3D buffers. This means that the rate of new code additions to the driver has dropped dramatically, almost to zero. Going forward, users should expect this ‘perfect’ combination to work more reliably, faster and better as time goes by.
Right now, we continue to spend all of our time stabilizing the code and fixing bugs. A minor but important piece of this work is to get UXA running without GEM so that we have EXA-like performance on older kernels. That should be fairly straightforward, as UXA shares all of the same basic EXA acceleration code. The EXA pixmap migration stuff works best when it works in the most simplistic fashion possible (move to GPU when drawing, move out only under memory pressure), which is something we can provide in the GEM emulation layer already present under UXA.

Our overall plan is to focus our efforts on the ‘one true configuration’. The best way to do that is to work on reducing the number of supported configurations until we get to just that one. First on the block are XAA and EXA: XAA because no one should have to use that anymore, and EXA because it’s just UXA with some pixmap management stuff we don’t need. There’s no reason UXA should be slower than EXA, once the various hidden performance bugs are fixed.

At the same time, DRI1 support will be removed. We cannot support compositing managers under DRI1, nor can we support frame buffer resize and a host of other new features. You’ll still get a desktop without DRI1; you just won’t get accelerated OpenGL. With the necessary infrastructure in the kernel and X server already released, this seems like the right time to switch off a huge pile of code. Initial measurements from this work show that we’ll be shrinking our codebase by about 10%.

Moving beyond this next quarterly release, the remaining ‘legacy’ piece is the user mode setting code. Something like 50% of the code in the 2D driver relates to this, so removing it will rather significantly reduce our code base. You can only imagine how excited we are about this prospect. The goal is to take the driver we’ve got and produce a leaner, faster, more stable driver in the next few releases to come.