I have made a new release of my Church-State system

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-2.tar.gz

which incorporates performance work described in these blog posts:

http://subvert-the-dominant-paradigm.net/blog/?p=42
http://subvert-the-dominant-paradigm.net/blog/?p=43

* Inline slot caches

As part of my performance work I have implemented inline slot caches for
slot reads and writes in Church. These are implemented by patching the
program code at runtime with assembly code that checks the type of the
target object and loads the correct offset for slot access.

In this example we see the code that prepares a call to
‘church-fixup-initial-slot-access’. The first two arguments on the stack
(%esp) and 0×4(%esp) are the argument-count and closure pointer used in
the State calling convention. The next two arguments 0×8(%esp) and 0xc(%
esp) are the object being accessed and the symbol representing the name
of the slot to be accessed.


0x080ce733 mov    %ebx,0xc(%esp)
0x080ce737 mov    %edx,0x8(%esp)
0x080ce73b mov    %ecx,0x4(%esp)
0x080ce73f mov    %eax,(%esp)
0x080ce742 call   0x80ab3b4 
0×080ce747 mov    %eax,%eax
0×080ce749 nop
0×080ce74a nop
0×080ce74b nop
0×080ce74c nop    



The fixup routine gets the type of the object and examines all the
parent classes to determine the correct offset for accessing the slot in
this object. It then generates x86 machine code and patches the calling
function. At the moment I do this by directly emitting a byte sequence
for each instruction, this is quite crude and error-prone but manageable
when such a small amount of code is being generated.


              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)

              (write-word! patch-start #x08244c8b)
              (write-byte! patch-start #x81)
              (write-byte! patch-start #x39)
              (write-word! patch-start obj-type)
                                        ;je
              (write-byte! patch-start #x74)
              (write-byte! patch-start #x7)
...


First the old call is overwritten with nops and then we emit some
comparison and jump instructions. The final output looks like this:


0x080ce733 mov    %ebx,0xc(%esp)
0x080ce737 mov    %edx,0x8(%esp)
0x080ce73b mov    %ecx,0x4(%esp)
0x080ce73f mov    %eax,(%esp)
0x080ce742 nop
0x080ce743 nop
0x080ce744 nop
0x080ce745 nop
0x080ce746 nop
0x080ce747 mov    0x8(%esp),%ecx
0x080ce74b cmpl   $0x8302993,(%ecx)
0x080ce751 je     0x80ce75a 
0×080ce753 call   0×80ab8ba 
0×080ce758 jmp    0×80ce760 
0×080ce75a mov    0×4(%ecx),%eax
0×080ce760 mov    %eax,%eax
0×080ce762 nop
0×080ce763 nop
0×080ce764 nop
0×080ce765 nop


The untagged object pointer is moved into %ecx and the first word (which
points to the class wrapper for this object) is compared with the
literal address of the class wrapper seen the first time. If it is the
same, we simply load the slot at the precomputed offset (0×4) and store
it in %eax. If not we jump to a runtime function which does a
conventional (but much slower) lookup.


* Compiling dispatch matchers at runtime

After implementing inline slot caches I tried various approaches for
caching method lookup. At first I thought I could store a global hash
table which was keyed by combining the hash values of the types of the
arguments passed to a function. Then by comparing the arguments types
with the types stored in the table entries I could find the correct
method to dispatch to. This turned out to be slower than my current
implementation which simply walks the cons lists that describe each
method’s type pattern and calling “type?” to test whether the argument
is a subtype of the target type.

My eventual solution was to turn this lookup process into straight-line
code by dynamically creating machine code to execute these tests.

Consider the following Church code:


length s:string
   ......

length n:nil
   0

length l:cons
   ......                                                                       
                                                                          

length a:array
   ......



Here we check the type of the argument to distinguish between nil,
strings, cons-lists and arrays. For nil and cons types we can generate
an efficient check against the object tag, for strings and arrays we
call out to “type?”. The following state code is generated for these
tests:


 (DEFINE DISPATCH-MATCHER31
  (LAMBDA (ARG1 ARG2 ARG3 ARG4 ARG5 ARG6 ARG7)
    (LET ((ARG-COUNT (LOAD-ARGUMENT-COUNT)))
      (IF (= ARG-COUNT 1)
          (BEGIN
           (IF
            (AND (= (BAND ARG1 LOWTAG_BITS) TAG_REF)
                 (CHURCH-IF (TYPE? ARG1 137075987) 1 0))
            (RETURN (CALL-C 134993347 1 (LOAD-CLOSURE-POINTER) ARG1)))
           (IF
            (AND (= (BAND ARG1 LOWTAG_BITS) TAG_REF)
                 (CHURCH-IF (TYPE? ARG1 137075507) 1 0))
            (RETURN (CALL-C 134968518 1 (LOAD-CLOSURE-POINTER) ARG1)))
           (IF (= (BAND ARG1 LOWTAG_BITS) TAG_CONS)
               (RETURN (CALL-C 134968094 1 (LOAD-CLOSURE-POINTER) ARG1)))
           (IF (= ARG1 TAG_NIL)
               (RETURN (CALL-C 134968046 1 (LOAD-CLOSURE-POINTER) ARG1))))))
    (CHURCH-DISPATCH-MATCHER-FAILED 1))))


The first two tests check for the TAG_REF tag which is used for normal
objects. If the tag matches we call “type?” with the literal address
representing the class of the target type. If the test matches we can
call the associated method directly.

The other two tests only compare the tag, making it quite efficient for
primitive types like cons, fixnums, true and nil.

After generating this code the calling function is patched to jump to
this matcher method directly.

Together with the inline slot caches these modifications speed the
system up by about a factor of two.

Future work involves determining more precisely when to compile these
dispatchers (at the moment I do it after a symbol has been dispatched
through 5000 times) and optimizing the generated tests. (Even in this
example there is redundant check for TAG_REF). It is probably also
possible to be more efficient when overlapping types are involved.





_______________________________________________
fonc mailing list
[email protected]
http://vpri.org/mailman/listinfo/fonc

Reply via email to