Hey all! This is an update along the road to Guile 3. Check https://lists.gnu.org/archive/html/guile-devel/2017-11/msg00016.html for the previous entry.
Since 25 November there have been around 100 commits. First I merged in patches from stable-2.0, including the patches corresponding to the improvements in the Guile 2.2.3 stable series release.

Then I started to look at "instruction explosion" for vector-ref et al. The idea is to decompose an operation like vector-ref into its constituent parts. In the concrete case of vector-ref, we have to check that the vector is a heap object, then that its heap object tag is "vector"; we have to extract the length from the heap object; then we have to check that the index is an integer between 0 and length-1; and finally we dereference the field in the vector. Instruction explosion turns each of these steps into its own primcall or branch.

One thing that became apparent was that with instruction explosion, we'd have a lot more control flow. Information that the optimizer used to learn in a specific way (e.g. via specialized type inference / effects analysis handlers for vector-ref) would instead be learned through generic control flow. Concretely:

  scheme@(guile-user)> ,x (lambda (v i) (vector-ref v i))
  Disassembly of #<procedure 29f5e50 at <unknown port>:1:3 (v i)> at #x29f5b4c:

     0    (assert-nargs-ee/locals 3 1)  ;; 4 slots (2 args)   at (unknown file):1:3
     1    (immediate-tag=? 2 7 0)       ;; heap-object?       at (unknown file):1:17
     3    (jne 23)                      ;; -> L3
     4    (heap-tag=? 2 127 13)         ;; vector?
     6    (jne 20)                      ;; -> L3
     7    (word-ref/immediate 3 2 0)
     8    (ursh/immediate 3 3 8)
     9    (immediate-tag=? 1 3 2)       ;; fixnum?
    11    (jne 13)                      ;; -> L2
    12    (untag-fixnum 0 1)
    13    (s64-imm<? 0 0)
    14    (jl 8)                        ;; -> L1
    15    (s64<? 0 3)
    16    (jnl 6)                       ;; -> L1
    17    (mov 3 0)
    18    (uadd/immediate 3 3 1)
    19    (scm-ref 2 2 3)
    20    (handle-interrupts)
    21    (return-values 2)             ;; 1 value
  L1:
    22    (throw/value+data 1 201)      ;; #(out-of-range "vector-ref" "Argument 2 out of range: ~S")
  L2:
    24    (throw/value+data 1 225)      ;; #(wrong-type-arg "vector-ref" "Wrong type argument in position 2 (expecting small integer): ~S")
  L3:
    26    (throw/value+data 2 239)      ;; #(wrong-type-arg "vector-ref" "Wrong type argument in position 1 (expecting vector): ~S")

So this is a bit horrible and I need to make the disassembler do a better job, but anyway. Instructions 1 through 6 check that V is a vector; instructions 7 and 8 extract the length; 9 and 11 check that the index is a fixnum; 12 extracts the fixnum value as an untagged 64-bit integer; and 13 through 16 check that the index is in range. L1, L2, and L3 are bailouts.

The idea here is that if this vector-ref is followed by some other operation on the vector, we'll at least get to elide the vector? checks, and maybe we can reuse the length extraction too. Et cetera.

The optimizer makes decisions like when to elide redundant checks based on flow information. Unfortunately, for historical reasons the "throw" terms actually did "continue" from the optimizer's perspective: whereas the information flowing to e.g. L3 shouldn't flow at all to instruction 7, the IR had no way to denote terms that don't continue at all. To fix this I had to make some changes to the IR. On the plus side, $throw is now its own term kind with no continuation. Also, $branch is a term instead of an expression shoehorned into $continue, and $prompt is a term too. Maybe one day we'll get a $select term for proper case compilation.

Finally, the instruction explosion itself. I refactored the Tree-IL->CPS compiler to allow individual primcalls to have custom lowering routines, and tightened up $primcall so that it now always continues to $kargs.
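To make the exploded sequence concrete, here is a hedged C sketch of the same checks. The tag values and object layout below are simplified assumptions for illustration, not Guile's actual representation; the comments map each step to the corresponding VM instruction:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model of a tagged Scheme value; NOT Guile's real tagging scheme. */
typedef uintptr_t scm;

struct heap_object {
  uintptr_t tag_and_length;   /* low byte: heap tag; upper bits: length */
  scm fields[];
};

#define FIXNUM_TAG 2    /* assumed immediate tag for small integers */
#define VECTOR_TAG 13   /* assumed heap-object tag for vectors */

static int is_heap_object(scm x)   { return (x & 7) == 0; }        /* immediate-tag=? */
static int is_fixnum(scm x)        { return (x & 3) == FIXNUM_TAG; }
static int64_t untag_fixnum(scm x) { return (int64_t) x >> 2; }    /* untag-fixnum */
static scm tag_fixnum(int64_t n)   { return ((scm) n << 2) | FIXNUM_TAG; }

/* vector-ref, exploded into its constituent checks and a raw load. */
static scm vector_ref(scm v, scm idx) {
  if (!is_heap_object(v)) abort();                /* -> L3 bailout */
  struct heap_object *obj = (struct heap_object *) v;
  if ((obj->tag_and_length & 127) != VECTOR_TAG)  /* heap-tag=? vector? */
    abort();                                      /* -> L3 bailout */
  uint64_t len = obj->tag_and_length >> 8;        /* word-ref + ursh */
  if (!is_fixnum(idx)) abort();                   /* -> L2 bailout */
  int64_t i = untag_fixnum(idx);
  if (i < 0 || (uint64_t) i >= len) abort();      /* s64 compares; -> L1 */
  return obj->fields[i];                          /* scm-ref: raw load */
}
```

Once the checks are separate operations like this, a later vector operation on the same object can simply reuse the facts already established on this path.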
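The IR change can be pictured as a small tagged-term model. This is an illustrative sketch in C, not Guile's data structures; only the term names come from the text above:

```c
/* Sketch of CPS term kinds after the refactor: $throw is its own term
   with no continuation, and $branch and $prompt are terms in their own
   right rather than expressions shoehorned into $continue. */
enum term_kind { TERM_CONTINUE, TERM_BRANCH, TERM_PROMPT, TERM_THROW };

/* What flow analysis cares about: how many successors a term has.
   Because $throw has none, facts learned at a bailout like L3 can
   never flow back into the happy path (e.g. to instruction 7). */
static int successor_count(enum term_kind k) {
  switch (k) {
  case TERM_CONTINUE: return 1;  /* always continues to a $kargs */
  case TERM_BRANCH:   return 2;  /* false and true continuations */
  case TERM_PROMPT:   return 2;  /* body and handler */
  case TERM_THROW:    return 0;  /* does not continue at all */
  }
  return -1;
}
```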
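And here, in hand-written C over a toy untagged vector (again an illustration, not compiler output), is the shape that check elision is aiming for: the heap-object and vector? checks, the length extraction, and the index untagging all happen once before a loop, so the loop body is just a compare, a raw load, and an add:

```c
#include <stdint.h>

/* Sum a toy vector's fields. The only per-iteration check is the loop
   bound (the s64<? compare); the load itself is raw and unchecked,
   like the scm-ref instruction. */
static int64_t v_sum(const int64_t *fields, uint64_t len) {
  int64_t sum = 0;
  for (uint64_t n = 0; n < len; n++)
    sum += fields[n];
  return sum;
}
```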
Then I added custom lowering routines for vector-ref et al, and for cons and other things. The CPS IR refactors allowed me to remove some now-useless passes (elide-values, prune-bailouts). There were some optimizer bugs, but generally things were already in a good state. For example, here's a vector sum routine:

  scheme@(guile-user)> (define (v-sum v)
                         (let lp ((n 0) (sum 0))
                           (if (< n (vector-length v))
                               (lp (+ n 1) (+ sum (vector-ref v n)))
                               sum)))
  scheme@(guile-user)> ,x v-sum
  Disassembly of #<procedure v-sum (v)> at #x1fa2750:

     0    (assert-nargs-ee/locals 2 3)  ;; 5 slots (1 arg)    at (unknown file):1:0
     1    (make-short-immediate 4 2)    ;; 0
     2    (immediate-tag=? 3 7 0)       ;; heap-object?       at (unknown file):1:51
     4    (jne 39)                      ;; -> L5
     5    (heap-tag=? 3 127 13)         ;; vector?
     7    (jne 36)                      ;; -> L5
     8    (word-ref/immediate 2 3 0)
     9    (ursh/immediate 2 2 8)
    10    (imm-s64<? 2 0)                                     at (unknown file):1:46
    11    (jnl 29)                      ;; -> L4
    12    (load-s64 1 0 0)                                    at (unknown file):1:89
    15    (s64<? 1 2)
    16    (jnl 22)                      ;; -> L3
    17    (mov 4 1)
    18    (uadd/immediate 4 4 1)
    19    (scm-ref 4 3 4)
    20    (add/immediate 4 4 0)                               at (unknown file):1:82
    21    (load-s64 1 0 1)
    24    (s64<? 1 2)                                         at (unknown file):1:46
    25    (jnl 11)                      ;; -> L2
  L1:
    26    (handle-interrupts)
    27    (mov 0 1)                                           at (unknown file):1:74
    28    (uadd/immediate 0 0 1)
    29    (uadd/immediate 1 1 1)                              at (unknown file):1:89
    30    (scm-ref 1 3 1)
    31    (add 4 4 1)                                         at (unknown file):1:82
    32    (s64<? 0 2)                                         at (unknown file):1:46
    33    (jnl 7)                       ;; -> L4
    34    (mov 1 0)                                           at (unknown file):1:70
    35    (j -9)                        ;; -> L1
  L2:
    36    (mov 0 1)                                           at (unknown file):1:46
    37    (j 3)                         ;; -> L4
  L3:
    38    (throw/value+data 4 204)      ;; #(out-of-range "vector-ref" "Argument 2 out of range: ~S")   at (unknown file):1:89
  L4:
    40    (handle-interrupts)
    41    (mov 3 4)
    42    (return-values 2)             ;; 1 value
  L5:
    43    (throw/value+data 3 233)      ;; #(wrong-type-arg "vector-length" "Wrong type argument in position 1 (expect…")   at (unknown file):1:51

The bit between L1 and L2 is the loop. Note that the loop does no dynamic checks besides checking that the unboxed index is within bounds -- notably the vector?
check, the length computation, and the index untagging were all hoisted out of the loop. The "scm-ref" instruction is the raw, unchecked memory accessor that will correspond to a single load when compiled to machine code.

This code runs a little slower than it did a month ago, because instruction explosion generally means more instructions than we had before. But this is expected, and I expect the corresponding amount of machine code to be lower once we do emit machine code, sometimes significantly so. Doing things this way also takes away the temptation for an eventual JIT to do optimization-like tasks: we keep all of that in the generic CPS passes, so the JIT should be easy to write and still generate good code. Until then, though, hacking on Guile is a bit slow in terms of compile time.

Next up is instruction explosion for other data structures (boxes, structs, strings, etc.); then I have to figure out what to do about arithmetic that I can't specialize (like that "add" at IP 31). I guess polymorphic inline caches are the real solution there; we'll see.

OK, that's the update. Happy new year and happy hacking with Guile :)

Cheers,

Andy