Just reposting this with a new subject line and less preamble.
----- Original Message ----
>
> Well the thing is I can't believe we don't know enough to do this in some
> way generically, but maybe the TTM vs GEM thing proves its not possible.

I don't think there's anything particularly wrong with the GEM
interface -- I just need to know that the implementation can be fixed
so that performance doesn't suck as hard as it does in the current one,
and that people's political views on basic operations like mapping
buffers don't get in the way of writing a decent driver.

We've run a few benchmarks against i915 drivers in all their
permutations, and to summarize, the results look like:

- for GPU-bound apps, there are small differences, perhaps up to 10%.
  I'm really not concerned about these (yet).

- for CPU-bound apps, the overheads introduced by Intel's approach to
  buffer handling impose a significant penalty in the region of 50-100%.

I think the latter is the significant result -- none of these
experiments in memory management significantly change the command
stream the hardware has to operate on, so what we're varying is
essentially the CPU behaviour needed to achieve that command stream.
And it is in CPU usage where GEM (and Keith/Eric's now-abandoned TTM
driver) significantly disappoint.

Or to put it another way, GEM & master/TTM seem to burn huge amounts of
CPU just running the memory manager. This isn't true for master/no-ttm
or for i915tex using userspace sub-allocation, where the CPU penalty
for getting decent memory management seems to be minimal relative to
the non-ttm baseline.

If there's a political desire to not use userspace sub-allocation, then
whatever kernel-based approach you want to investigate should
nonetheless make some effort to hit reasonable performance goals -- and
neither of the two current kernel-allocation-based approaches is at all
impressive.

Keith

==============================================================

And on an i945G, dual core Pentium D 3GHz 2MB cache, FSB 800 MHz,
single-channel ram:

Openarena timedemo at 640x480:
--------------------------------------------

master w/o TTM:
840 frames, 17.1 seconds: 49.0 fps
12.24s user 1.02s system 63% cpu 20.880 total

master with TTM:
840 frames, 15.8 seconds: 53.1 fps
13.51s user 5.15s system 95% cpu 19.571 total

i915tex_branch:
840 frames, 13.8 seconds: 61.0 fps
12.54s user 2.34s system 85% cpu 17.506 total

gem:
840 frames, 15.9 seconds: 52.8 fps
11.96s user 4.44s system 83% cpu 19.695 total

KW: It's less obvious here than in some of the tests below, but the
pattern is still clear -- compared to master/no-ttm, i915tex is getting
about the same ratio of fps to CPU usage, whereas both master/ttm and
gem are significantly worse, burning much more CPU per fps, with a
large chunk of the extra CPU being spent in the kernel.

The particularly worrying thing about GEM is that it isn't hitting
*either* 100% cpu *or* maximum framerates from the hardware -- that's
really not very good, as it implies hardware is being left idle
unnecessarily.

glxgears:

A: ~1029 fps, 20.63user 2.88system 1:00.00elapsed 39%CPU (master, no ttm)
B: ~1072 fps, 23.97user 18.06system 1:00.00elapsed 70%CPU (master, ttm)
C: ~1128 fps, 22.38user 5.21system 1:00.00elapsed 45%CPU (i915tex, new)
D: ~1167 fps, 23.14user 9.07system 1:00.00elapsed 53%CPU (i915tex, old)
F: ~1112 fps, 24.70user 21.95system 1:00.00elapsed 77%CPU (gem)

KW: The high CPU overhead imposed by GEM and (non-suballocating)
master/TTM should be pretty clear here. master/TTM burns 30% of CPU
just running the memory manager!! GEM gets slightly higher framerates
but uses even more CPU than master/TTM.

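As a rough check on the "CPU per fps" comparison, the glxgears figures above can be reduced to two numbers per driver: fps per percent of CPU, and the share of CPU time spent in the kernel. The sketch below is only an illustration; the function names and the choice of "fps per CPU percent" as the metric are assumptions made here, not something taken from the benchmark scripts.

```python
# Rows copied from the glxgears results above:
# (label, fps, user seconds, system seconds, CPU percent)
GLXGEARS = [
    ("master, no ttm", 1029, 20.63, 2.88, 39),
    ("master, ttm", 1072, 23.97, 18.06, 70),
    ("i915tex, new", 1128, 22.38, 5.21, 45),
    ("i915tex, old", 1167, 23.14, 9.07, 53),
    ("gem", 1112, 24.70, 21.95, 77),
]


def efficiency(fps, user_s, system_s, cpu_percent):
    """Return (fps per CPU percent, fraction of CPU time spent in the kernel).

    Both metrics are illustrative choices, not the author's own definitions.
    """
    fps_per_cpu = fps / cpu_percent
    kernel_share = system_s / (user_s + system_s)
    return fps_per_cpu, kernel_share


def summarize(rows):
    """Map each driver label to its efficiency() tuple."""
    return {label: efficiency(fps, u, s, cpu) for label, fps, u, s, cpu in rows}


if __name__ == "__main__":
    for label, (fps_per_cpu, kernel_share) in summarize(GLXGEARS).items():
        print(f"{label:16s} {fps_per_cpu:5.1f} fps per CPU%  "
              f"{kernel_share:5.1%} of CPU time in kernel")
```

On these inputs master/no-ttm and i915tex (new) come out at roughly 25-26 fps per CPU percent, while master/ttm and gem land near 14-15, with over 40% of their CPU time spent in the kernel.
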
fgl_glxgears -fbo:

A: n/a
B: ~244 fps, 7.03user 5.30system 1:00.01elapsed 20%CPU (master, ttm)
C: ~255 fps, 6.24user 1.71system 1:00.00elapsed 13%CPU (i915tex, new)
D: ~260 fps, 6.60user 2.44system 1:00.00elapsed 15%CPU (i915tex, old)
F: ~258 fps, 7.56user 6.44system 1:00.00elapsed 23%CPU (gem)

KW: GEM & master/ttm burn more cpu to build/submit the same command
streams.

openarena 1280x1024:

A: 840 frames, 44.5 seconds: 18.9 fps (master, no ttm)
B: 840 frames, 40.8 seconds: 20.6 fps (master, ttm)
C: 840 frames, 40.4 seconds: 20.8 fps (i915tex, new)
D: 840 frames, 37.9 seconds: 22.2 fps (i915tex, old)
F: 840 frames, 40.3 seconds: 20.8 fps (gem)

KW: no cpu measurements taken here, but almost certainly GPU bound. A
lot of similar numbers; I don't believe the deltas have anything in
particular to do with memory management interface choices...

ipers:

A: ~285000 Poly/sec (master, no ttm)
B: ~217000 Poly/sec (master, ttm)
C: ~298000 Poly/sec (i915tex, new)
D: ~227000 Poly/sec (i915tex, old)
F: ~125000 Poly/sec (gem, GPU lockup on first attempt)

KW: no cpu measurements in this run, but all are almost certainly 100%
pinned on CPU.
- i915tex (in particular i915tex, new) shows similar performance to
  classic, i.e. low cpu overhead for this memory manager.
- GEM is significantly worse even than master/ttm -- hopefully this is
  a bug rather than a necessary characteristic of the interface.

texdown:

A: total texels=393216000.000000 time=3.004000 (master, no ttm)
B: total texels=434110464.000000 time=3.000000 (master, ttm)
C: (i915tex new --- whoops, crashes)
D: total texels=1111490560.000000 time=3.002000 (i915tex old)
F: total texels=279969792.000000 time=3.004000 (gem)

Note the huge (roughly 3x-4x) performance lead of i915tex, despite the
embarrassing crash in the newer version. I suspect this is unrelated to
command handling and probably somebody has disabled or regressed some
aspect of the texture upload path...

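To put a number on that lead, the texdown lines above can be converted to texels per second and compared against the i915tex (old) run. This is a minimal sketch with hypothetical helper names; it only re-does the arithmetic on the figures already listed.

```python
# Rows copied from the texdown results above:
# (label, total texels, elapsed seconds)
TEXDOWN = [
    ("master, no ttm", 393216000, 3.004),
    ("master, ttm", 434110464, 3.000),
    ("i915tex, old", 1111490560, 3.002),
    ("gem", 279969792, 3.004),
]


def texel_rates(rows):
    """Map each label to its upload rate in texels per second."""
    return {label: texels / seconds for label, texels, seconds in rows}


def lead_over(rates, leader):
    """Return the leader's rate divided by each other entry's rate."""
    return {label: rates[leader] / rate
            for label, rate in rates.items() if label != leader}


if __name__ == "__main__":
    rates = texel_rates(TEXDOWN)
    for label, factor in lead_over(rates, "i915tex, old").items():
        print(f"i915tex, old uploads at {factor:.2f}x the rate of {label}")
```

With these inputs the factors fall between about 2.6x (against master/ttm) and about 4.0x (against gem).
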
NOTE: The reason that i915tex does so well relative to master/no-ttm is
that we can upload directly to "VRAM". master/no-ttm treats VRAM as a
cache & always keeps a second copy of the texture safe in main memory.
Hence performance isn't great for texture uploads on master/no-ttm.

Here's what we're seeing on an i915, 3GHz Celeron, 256kB cache. Dual
channel. Reportdamage disabled. DRM master:

=======================================================================
*Test*                 *i915tex_branch*            *i915 master, TTM*          *i915 master, classic*
gears                  1033fps, 70.1% CPU          726fps, 100% CPU            955fps, 56% CPU
openarena              47.1fps, 17.9u, 2.7s time   31.5fps, 21.1u, 8.7s time   39fps, 17.9u, 1.3s time
Texdown                1327MB/s                    551MB/s                     572MB/s
Texdown, subimage      1014MB/s                    134MB/s                     148MB/s
Ipers, no help screen  255 000 tri/s, 100% cpu     139 000 tri/s, 100% cpu     241 000 tri/s, 100% cpu

( no gem results on this machine ... )

I would summarize the results like this:

- master/no-ttm has a basically "free" memory manager in terms of CPU
  overhead
- master/ttm and GEM gain a proper memory manager but introduce a huge
  CPU overhead & consequent performance regression
- i915tex makes use of userspace sub-allocation to resolve that
  regression & achieve comparable efficiency to master/no-ttm.
- a separate regression seems to have killed texture upload performance
  on master/ttm relative to its ancestor i915tex.

Keith