After listening to a couple of fairly vocal people squawk about the x86-64
dispatch stubs, I spent some time investigating the issues they raised.  The
primary issue is that the TLS versions of the stubs contain an
unnecessary function call to get the dispatch pointer.

_x86_64_get_dispatch:
        movq    [EMAIL PROTECTED](%rip), %rax
        movq    %fs:(%rax), %rax
        ret

        ...

glVertex3f:
        call    [EMAIL PROTECTED]
        movq    1088(%rax), %r11
        jmp     *%r11

I used the attached patch to regenerate src/mesa/x86-64/glapi_x86-64.S.
The modified dispatch stubs have the code of _x86_64_get_dispatch inlined.

glVertex3f:
        movq    [EMAIL PROTECTED](%rip), %rax
        movq    %fs:(%rax), %rax
        movq    1088(%rax), %r11
        jmp     *%r11
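
As a rough illustration of how a generator script might emit such an
inlined stub, here is a minimal Python sketch.  It is not the actual
gl_x86-64_asm.py code: inline_tls_stub and TLS_LOAD are hypothetical names,
and TLS_LOAD is only a placeholder for the first instruction, whose operand
is obscured in this message.  The one grounded detail is that the
displacement is the dispatch slot times 8, so 1088 corresponds to slot 136.

```python
# Sketch only, under the assumptions stated above.
# TLS_LOAD stands in for the instruction that loads the TLS offset of the
# dispatch pointer into %rax; the real operand is not reproduced here.
TLS_LOAD = "<load TLS offset of dispatch pointer into %rax>"


def inline_tls_stub(name, slot):
    """Return the assembly text of an inlined TLS dispatch stub."""
    return "\n".join([
        "%s:" % name,
        "\t" + TLS_LOAD,
        # Read the per-thread dispatch table pointer through %fs.
        "\tmovq\t%fs:(%rax), %rax",
        # Each table entry is an 8-byte pointer, hence slot * 8.
        "\tmovq\t%d(%%rax), %%r11" % (slot * 8),
        # Tail-jump to the driver's implementation.
        "\tjmp\t*%r11",
    ])


if __name__ == "__main__":
    print(inline_tls_stub("glVertex3f", 136))
```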

I then used progs/tests/api_speed.py [1] to calculate the difference in
API overhead between the two.  Both versions were built with the
following make command:

    make CC='ccache distcc gcc' CXX='ccache distcc gcc' \
        ARCH_FLAGS='-DGLX_USE_TLS -m64 -fPIC -fvisibility=hidden' \
        linux-dri-x86-64

I had to rebuild in src/glut/glx *without* -fvisibility=hidden.  GLUT
(and probably GLU) needs its public interfaces marked with the proper
visibility (as is done in the rest of Mesa).  Either that or the GLUT
Makefile needs to filter -fvisibility=[a-z]* out of its CFLAGS.

The results are not impressive.  The libGL.so with the modified dispatch
routines is 13 KiB larger.  The measured API overhead was, at best, 1
clock cycle lower.  In most cases the measured difference was far smaller
than the resolution of the measurement apparatus (e.g.,
glFogCoordfEXT scored 71.284420 for the original vs. 71.280840 for the
modified).
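
To put the glFogCoordfEXT numbers in perspective, the arithmetic can be
sketched in a few lines of Python.  The two scores are the ones quoted
above; the variable names are only illustrative, and I am assuming the
scores are in clock cycles, as the rest of this message suggests.

```python
# Scores quoted above for glFogCoordfEXT (assumed to be clock cycles).
original = 71.284420   # stubs that call _x86_64_get_dispatch
modified = 71.280840   # stubs with the dispatch lookup inlined

delta = original - modified      # absolute difference
relative = delta / original      # fraction of the original overhead

print("delta    = %.6f cycles" % delta)
print("relative = %.4f%%" % (relative * 100.0))
```

The difference works out to well under a hundredth of a cycle, which is
far below anything a cycle counter can resolve.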

Given these results, I'm inclined to leave the x86-64 assembly dispatch
stubs as they are.  Evidence showing either a benchmark where the
modified dispatch stubs are faster or showing some flaw in my testing
methodology would, naturally, give me reason to revisit this issue.  In
the meantime, I am considering it closed.

If someone is really excited about improving the state of things on
x86-64, they might choose to investigate adding code to dynamically
generate dispatch functions for extension functions that a DRI driver
registers at run-time.  This is currently done for x86, SPARC,
and Alpha, but not for x86-64, PowerPC, or IA-64.

Happy hacking.

[1] On x86-64 systems you may need to build api_speed with
'-DCONFIG_X86_TSC' explicitly set on the command line.  Even though I
have this set in my kernel config, linux/config.h doesn't properly set
it.  The result is that get_cycles always returns 0.
Index: src/mesa/glapi/gl_x86-64_asm.py
===================================================================
RCS file: /cvs/mesa/Mesa/src/mesa/glapi/gl_x86-64_asm.py,v
retrieving revision 1.6
diff -u -d -r1.6 gl_x86-64_asm.py
--- src/mesa/glapi/gl_x86-64_asm.py     2 Dec 2005 00:25:06 -0000       1.6
+++ src/mesa/glapi/gl_x86-64_asm.py     27 Feb 2006 23:17:43 -0000
@@ -237,7 +237,8 @@
                print '\t.type\tGL_PREFIX(%s), @function' % (f.name)
                print 'GL_PREFIX(%s):' % (f.name)
                print '#if defined(GLX_USE_TLS)'
-               print '[EMAIL PROTECTED]'
+               print '[EMAIL PROTECTED](%rip), %rax'
+               print '\tmovq\t%fs:(%rax), %rax'
                print '\tmovq\t%u(%%rax), %%r11' % (f.offset * 8)
                print '\tjmp\t*%r11'
                print '#elif defined(PTHREADS)'