-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

After listening to a couple of fairly vocal people squawk about the x86-64
dispatch stubs, I spent some time investigating the issues raised. The
primary issue is that the TLS versions of the stubs contain an unnecessary
function call to get the dispatch pointer.

    _x86_64_get_dispatch:
        movq    [EMAIL PROTECTED](%rip), %rax
        movq    %fs:(%rax), %rax
        ret

    ...

    glVertex3f:
        call    [EMAIL PROTECTED]
        movq    1088(%rax), %r11
        jmp     *%r11

I used the attached patch to regenerate src/mesa/x86-64/glapi_x86-64.S. The
modified dispatch stubs have the code of _x86_64_get_dispatch inline.

    glVertex3f:
        movq    [EMAIL PROTECTED](%rip), %rax
        movq    %fs:(%rax), %rax
        movq    1088(%rax), %r11
        jmp     *%r11

I then used progs/tests/api_speed.py [1] to calculate the difference in API
overhead between the two. Both versions were built with the following make
command:

    make CC='ccache distcc gcc' CXX='ccache distcc gcc' \
        ARCH_FLAGS='-DGLX_USE_TLS -m64 -fPIC -fvisibility=hidden' \
        linux-dri-x86-64

I had to rebuild in src/glut/glx *without* -fvisibility=hidden. GLUT (and
probably GLU) needs its public interfaces marked with the proper visibility
(as is done in the rest of Mesa). Either that or the GLUT Makefile needs to
filter -fvisibility=[a-z]* out of its CFLAGS.

The results are not impressive. The libGL.so with the modified dispatch
routines is 13KiB larger. The measured API overhead was, at best, 1 clock
cycle faster. In most cases the measured difference was much, much less than
the resolution of the measurement apparatus (e.g., glFogCoordfEXT scored
71.284420 for the original vs. 71.280840 for the modified).

Given these results, I'm inclined to leave the x86-64 assembly dispatch stubs
as they are. Evidence showing either a benchmark where the modified dispatch
stubs are faster or some flaw in my testing methodology would, naturally,
give me reason to revisit this issue. In the meantime, I am considering it
closed.

If someone is really excited about improving the state of things on x86-64,
they might choose to investigate adding code to dynamically generate dispatch
functions for extension functions newly registered by a DRI driver at
run-time. This is currently done for x86, SPARC, and Alpha, but not for
x86-64, PowerPC, or IA-64.

Happy hacking.

[1] On x86-64 systems you may need to build api_speed with
'-DCONFIG_X86_TSC' explicitly set on the command line. Even though I have
this set in my kernel config, linux/config.h doesn't properly set it. The
result is that get_cycles always returns 0.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFEA4pPX1gOwKyEAw8RAtY+AJ9/LKXLGCnzTQsNlrw/7moViHDFmACfdvNA
SO8sv8rZsRspJpVJ9enXt9E=
=prKj
-----END PGP SIGNATURE-----
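
To put the glFogCoordfEXT figures from the message in perspective, the
following is a minimal sketch that computes the absolute and relative
difference between two overhead scores. The name compare_overhead is
illustrative and is not part of api_speed.py, and the sketch assumes both
scores are in the same unit. It also rejects a non-positive baseline, since a
score of zero would suggest the cycle counter is not working, as in the
get_cycles case described in [1].

```python
def compare_overhead(original, modified):
    """Return (absolute, relative) improvement of modified over original.

    Positive values mean the modified build measured lower overhead.
    This is an illustrative helper, not code from api_speed.py.
    """
    if original <= 0:
        # A zero baseline would mean the timer returned nothing useful.
        raise ValueError("original score must be positive")
    delta = original - modified
    return delta, delta / original


# Scores quoted in the message for glFogCoordfEXT.
delta, relative = compare_overhead(71.284420, 71.280840)
print("delta: %.6f  relative: %.4f%%" % (delta, relative * 100.0))
```

For the quoted scores this works out to a difference of about 0.0036, or
roughly 0.005% of the original figure.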
Index: src/mesa/glapi/gl_x86-64_asm.py
===================================================================
RCS file: /cvs/mesa/Mesa/src/mesa/glapi/gl_x86-64_asm.py,v
retrieving revision 1.6
diff -u -d -r1.6 gl_x86-64_asm.py
--- src/mesa/glapi/gl_x86-64_asm.py	2 Dec 2005 00:25:06 -0000	1.6
+++ src/mesa/glapi/gl_x86-64_asm.py	27 Feb 2006 23:17:43 -0000
@@ -237,7 +237,8 @@
 		print '\t.type\tGL_PREFIX(%s), @function' % (f.name)
 		print 'GL_PREFIX(%s):' % (f.name)
 		print '#if defined(GLX_USE_TLS)'
-		print '[EMAIL PROTECTED]'
+		print '[EMAIL PROTECTED](%rip), %rax'
+		print '\tmovq\t%fs:(%rax), %rax'
 		print '\tmovq\t%u(%%rax), %%r11' % (f.offset * 8)
 		print '\tjmp\t*%r11'
 		print '#elif defined(PTHREADS)'
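
As a rough illustration of what the patch changes, here is a standalone
sketch that builds the text of one TLS stub in either form. It is not the
generator script itself: the function names are hypothetical, and because
the archive obscured the real symbol operands, the caller passes placeholder
strings for them. The slot number 136 is inferred from the 1088-byte offset
in the message (1088 / 8), matching the `f.offset * 8` expression in the
patch.

```python
def call_prologue(call_target):
    """Original form: fetch the dispatch pointer via a helper call."""
    return ["\tcall\t%s" % call_target]


def inline_prologue(tls_operand):
    """Patched form: load the dispatch pointer from TLS inline."""
    return [
        "\tmovq\t%s(%%rip), %%rax" % tls_operand,
        "\tmovq\t%fs:(%rax), %rax",
    ]


def stub(name, slot, prologue):
    """Return assembly text for one dispatch stub.

    slot is an index into a table of 8-byte function pointers, so the
    byte offset used by the stub is slot * 8.
    """
    lines = ["%s:" % name]
    lines.extend(prologue)
    lines.append("\tmovq\t%d(%%rax), %%r11" % (slot * 8))
    lines.append("\tjmp\t*%r11")
    return "\n".join(lines)


# Placeholder operands; the real ones are obscured in the archived message.
print(stub("glVertex3f", 136, call_prologue("<get-dispatch-target>")))
print(stub("glVertex3f", 136, inline_prologue("<tls-dispatch-operand>")))
```

The inline form trades the call/ret pair for one extra instruction in every
stub, which is consistent with the larger libGL.so reported in the message.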