[racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Tony Garnock-Jones

On 2011-05-04 12:04 PM, Matthias Felleisen wrote:

I still believe that the Java implementation (just under 1s without
their 'Google' contract) benefits from typed dispatches.


Maybe it does, but it's almost certain that it is benefiting from inline 
caching at send sites (i.e. dynamic type information) much more than it 
will be benefiting from static type information.


A quick-and-dirty comparison of raw send performance on my Mac:

  Language    Virtual machine           Nanoseconds per send
  ----------------------------------------------------------
  Java        Hotspot 64-bit 1.6.0_24    1.4
  Smalltalk   Cog r2382                 21
  Smalltalk   SqueakVM 4.2.4beta1U     122
  Racket      Racket v5.1             ~350

Note that Cog is a JITting VM and SqueakVM is a plain (but very well 
optimised) interpreter. Both Cog and SqueakVM use a couple of levels of 
method lookup cache.
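
For readers unfamiliar with the technique, a method lookup cache memoises the (class, selector) -> method mapping so that repeated sends skip the full superclass-chain walk. A minimal sketch, assuming a toy class representation (the `klass` struct, `lookup-method`, and `send*` names are illustrative, not Racket's or Squeak's actual dispatch code):

```racket
#lang racket

;; Toy class representation: name, optional superclass, method table.
;; (Illustrative only -- not Racket's actual class machinery.)
(struct klass (name super methods))

;; Slow path: walk the superclass chain on every send.
(define (lookup-method cls selector)
  (cond [(not cls) (error 'send "message not understood: ~a" selector)]
        [(hash-ref (klass-methods cls) selector #f)]
        [else (lookup-method (klass-super cls) selector)]))

;; A global (class, selector) -> method table, conceptually like the
;; first-level lookup cache in Cog/SqueakVM (the real details differ).
(define method-cache (make-hash))

(define (cached-lookup cls selector)
  (hash-ref! method-cache (cons cls selector)
             (lambda () (lookup-method cls selector))))

;; Example: a two-level hierarchy; the second send hits the cache.
(define point%  (klass 'point #f (hash 'magnitude (lambda (self) 5))))
(define point3% (klass 'point3 point% (hash)))
(define (send* obj cls selector . args)
  (apply (cached-lookup cls selector) obj args))
```

The first `send*` on `point3%` misses and walks to `point%`; every subsequent send on that (class, selector) pair is a single hash probe.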


A simple experiment I just performed suggests that a monomorphic inline 
cache hit can reduce the time needed for a send in Racket from 350ns to 
around 60ns, which is a massive win. I've attached the program I used to 
measure this, FWIW. (Run it using command-line Racket, not DrRacket: I 
got some *very* weird timing artifacts out of DrRacket during this 
experiment!)


The question, then, is: how do we implement MICs or PICs using Racket's 
macro system? Each send site needs to expand into


 - a piece of global state
 - a test involving that global state
 - a possible update to that global state

Hypothesising some kind of (let-static) form that introduces a 
lexically-scoped piece of global state, this kind of thing might Just 
Work to provide a speedup of almost six-fold on almost-monomorphic 
send-heavy code:


(define-syntax cached-send
  (syntax-rules ()
    ((_ obj msg arg ...)
     (let-static ((bc (box #f))
                  (bm (box #f)))
       (let* ((tmp obj)
              (cls (object-ref tmp)))
         (if (eq? (unbox bc) cls)
             ((unbox bm) tmp arg ...)
             (let-values (((method _)
                           (find-method/who 'send tmp 'msg)))
               (set-box! bc cls)
               (set-box! bm method)
               (method tmp arg ...))))))))

Regards,
  Tony
#lang racket
(require racket/private/class-internal)

;; An ordinary Racket class.
(define a%
  (class* object% ()
    (super-new)
    (define/public (op x) (+ x 1))))

;; Representation of a trivial vtable.
(struct ob (vt state) #:transparent)

;; A simple vtable providing a single method named op.
(define (b%-vt selector)
  (case selector
    ((op) (lambda (self x) (+ x 2)))
    (else (error 'dnu))))

;; A simple class, using b%-vt as its behaviour.
(define (b%)
  (ob b%-vt 'no-state))

;; An uncached send to a struct ob.
(define-syntax unmemo-send
  (syntax-rules ()
    ((_ obj msg arg ...)
     (let ((tmp obj))
       (((ob-vt tmp) 'msg) tmp arg ...)))))

;; A quasi-cached send to a struct ob.
;;
;; A real cache would have per-send-site state rather than a single
;; (!) global variable.
(define *memo-class* #f)
(define *memo-method* #f)
(define-syntax memo-send
  (syntax-rules ()
    ((_ obj msg arg ...)
     (let* ((tmp obj)
            (cls (ob-vt tmp)))
       (if (eq? *memo-class* cls)
           (*memo-method* tmp arg ...)
           (let ((method (cls 'msg)))
             (set! *memo-class* cls)
             (set! *memo-method* method)
             (method tmp arg ...)))))))

;; Test objects.
(define a0 (new a%))
(define b0 (b%))

;; Syntax: (measure-ns exp)
;;
;; Expands to an expression that repeats exp NREPEATS times,
;; measuring the elapsed time, and returns the number of nanoseconds
;; of CPU time used *per iteration*, excluding any GC time.
(define NREPEATS 500)
(define-syntax measure-ns
  (syntax-rules ()
    ((_ exp)
     (call-with-values (lambda ()
                         (pretty-print `(measuring exp))
                         (time-apply (lambda ()
                                       (do ((i 0 (+ i 1)))
                                           ((= i NREPEATS))
                                         exp))
                                     '()))
       (lambda (results cpu real gc)
         (/ (* 10.0 (/ (- cpu gc) 1000.0))
            NREPEATS))))))

;; Main program.

;; Measure the time for a null measure-ns loop first, then measure the
;; operations of interest, subtracting the null-time overhead
;; measurement from each to get an estimate of the time taken for the
;; interesting operation.

(let ((null-time (measure-ns 123)))
  (define (report-on t)
(let ((name (first t))
  (ns/op (second t)))
  (write (list name (- ns/op null-time)))
  (newline)))
  (for-each report-on
`(
  ;; Report on the loop overhead for sanity checking.
  (constant ,null-time)

  ;; How long does a plain Scheme addition operation take?
  (simple-add
  

Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Matthias Felleisen

As I was reading a paper, I encountered the word 'cache' and remembered that I 
had wanted to implement a caching version of the central method in my central 
data structure. That took me about 10 minutes, and I now have good times for 
the game: 

with contracts: 1.8s per game 
without ctrcts: 1.2s per game 

I am now within 20% of the fastest Java implementation in my class. That's 
good. 

-- Matthias

p.s. Okay, I'll say it: algorithms matter, too. 
_
  For list-related administrative tasks:
  http://lists.racket-lang.org/listinfo/dev


Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Jos Koot
 

 -Original Message-
 From: dev-boun...@racket-lang.org 
 [mailto:dev-boun...@racket-lang.org] On Behalf Of Matthias Felleisen
 Sent: 04 May 2011 22:58
 To: Tony Garnock-Jones; D Herring
 Cc: dev List
 Subject: Re: [racket-dev] Inline caching (was Re: my '312' 
 this semester,how we compare to others)
 
 
 As I was reading a paper, I encountered the word 'cache' and 
 remembered that I had wanted to implement a caching version 
 of the central method in my central data structure. That took 
 me about 10 minutes, and I now have good times for the game: 
 
 with contracts: 1.8s per game 
 without ctrcts: 1.2s per game 
 
 I am now within 20% of the fastest Java implementation in my 
 class. That's good. 
 
 -- Matthias
 
 p.s. Okay, I'll say it: algorithms matter, too. 


That to me seems most important.
Of course implementation of an algorithm requires wisdom too:

Is caching of data faster than recomputing it?
How do we deal with a shortage of RAM?
How do we optimize software? (nowadays surely not by hand)
Do software developers inspire hardware designers, or may/should it be the
other way around?
There is a tendency to separate programmers from the hardware equipment they
are using. Is this wise?

I think hardware designers must understand programmers and vice versa.
Most important, both parties must understand their customers.

mho, Jos



Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Tony Garnock-Jones
The attached (highly experimental) patch seems to improve the 
performance of normal sends (in the case of a cache hit) by roughly 
100%-150%. The difference between this mere factor-of-two improvement 
and the factor of six to ten I was seeing earlier is, I speculate, 
related to object-ref taking a lot of time.


Interestingly, the performance of (send something message) is, with this 
patch, seemingly roughly on par with the performance of generics 
captured using generic and invoked using send-generic.


I haven't yet tried running a full racket system with this patch 
applied. I wonder if it makes a difference to the interactive 
performance of DrRacket?


Major weaknesses of the approach:

 - overhead on cache miss is unknown
 - not safe-for-space yet (weak boxes are immutable)

The approach should generalize easily to polymorphic inline caches.
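
A sketch of that polymorphic generalisation, using the struct-based vtable objects from the earlier benchmark script so it runs without class-internal. A real PIC would need per-send-site state (via something like let-static); a single global cache, and the arbitrary choice of two entries, suffice to show the probe-and-update logic:

```racket
#lang racket

;; Toy vtable objects in the style of the earlier benchmark script.
(struct ob (vt state) #:transparent)

(define (b%-vt selector)
  (case selector
    ((op) (lambda (self x) (+ x 2)))
    (else (error 'dnu "~a" selector))))

(define (c%-vt selector)
  (case selector
    ((op) (lambda (self x) (+ x 3)))
    (else (error 'dnu "~a" selector))))

;; At most two (vtable . method) entries, newest first.
(define *pic* '())

(define (pic-send obj msg . args)
  (define vt (ob-vt obj))
  (define method
    (cond [(assq vt *pic*) => cdr]            ; PIC hit: linear probe
          [else                                ; PIC miss: full lookup
           (define m (vt msg))
           ;; insert at the front, evicting anything beyond two entries
           (set! *pic* (cons (cons vt m)
                             (take *pic* (min 1 (length *pic*)))))
           m]))
  (apply method obj args))
```

With two behaviours alternating at one site, a monomorphic cache would miss every time; the PIC keeps both entries live and hits on each.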

Regards,
  Tony
diff --git a/collects/racket/private/class-internal.rkt b/collects/racket/private/class-internal.rkt
index d00ef10..0a96cba 100644
--- a/collects/racket/private/class-internal.rkt
+++ b/collects/racket/private/class-internal.rkt
@@ -3734,19 +3734,30 @@
   (let ()
 
 (define (do-method traced? stx form obj name args rest-arg?)
-  (with-syntax ([(sym method receiver)
-                 (generate-temporaries (syntax (1 2 3)))])
+  (with-syntax ([(sym in-object in-class state method receiver)
+                 (generate-temporaries (syntax (1 2 3 4 5 6)))]
+                [*cached-state*
+                 (syntax-local-lift-expression
+                  (syntax (cons #f #f)))])
     (quasisyntax/loc stx
       (let*-values ([(sym) (quasiquote (unsyntax (localize name)))]
-                    [(method receiver)
-                     (find-method/who '(unsyntax form)
-                                      (unsyntax obj)
-                                      sym)])
+                    [(in-object) (unsyntax obj)]
+                    [(in-class) (and (object? in-object) (object-ref in-object))]
+                    [(state) *cached-state*]
+                    [(method)
+                     (if (and in-class (eq? (car state) in-class))
+                         (cdr state)
+                         (let-values ([(m r)
+                                       (find-method/who '(unsyntax form)
+                                                        in-object
+                                                        sym)])
+                           (set! *cached-state* (cons in-class m))
+                           m))])
         (unsyntax
          (make-method-call
           traced?
           stx
-          (syntax/loc stx receiver)
+          (syntax/loc stx in-object)
           (syntax/loc stx method)
           (syntax/loc stx sym)
           args

Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Asumu Takikawa
On 2011-05-04 18:49:13 -0400, Tony Garnock-Jones wrote:
 The attached (highly experimental) patch seems to improve the
 performance of normal sends (in the case of a cache hit) by roughly
 100% - 150%. The difference between this mere factor of two
 improvement and the factor of six-through-ten I was seeing earlier
 is, I speculate, related to object-ref taking a lot of time.
 
 Interestingly, the performance of (send something message) is, with
 this patch, seemingly roughly on par with the performance of
 generics captured using generic and invoked using send-generic.
 
 I haven't yet tried running a full racket system with this patch
 applied. I wonder if it makes a difference to the interactive
 performance of DrRacket?

Wow, impressive! I've been benchmarking with the DrRacket interactive
tests already for contracts, so I can run my test driver and get some
numbers for that.

Cheers,
Asumu


Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Tony Garnock-Jones

On 2011-05-04 6:54 PM, Asumu Takikawa wrote:

Wow, impressive! I've been benchmarking with the DrRacket interactive
tests already for contracts, so I can run my test driver and get some
numbers for that.


That'd be great. I mean, it'll probably just break, but if it doesn't... 
it'd be interesting :-)


Tony


Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Ryan Culpepper

On 05/04/2011 01:57 PM, Tony Garnock-Jones wrote:

On 2011-05-04 12:04 PM, Matthias Felleisen wrote:

I still believe that the Java implementation (just under 1s without
their 'Google' contract) benefits from typed dispatches.


Maybe it does, but it's almost certain that it is benefiting from inline
caching at send sites (i.e. dynamic type information) much more than it
will be benefiting from static type information.

A quick-and-dirty comparison of raw send performance on my Mac:

Language    Virtual machine           Nanoseconds per send

Java        Hotspot 64-bit 1.6.0_24    1.4
Smalltalk   Cog r2382                 21
Smalltalk   SqueakVM 4.2.4beta1U     122
Racket      Racket v5.1             ~350

Note that Cog is a JITting VM and SqueakVM is a plain (but very well
optimised) interpreter. Both Cog and SqueakVM use a couple of levels of
method lookup cache.

A simple experiment I just performed suggests that a monomorphic inline
cache hit can reduce the time needed for a send in Racket from 350ns to
around 60ns, which is a massive win. I've attached the program I used to
measure this, FWIW. (Run it using command-line Racket, not DrRacket: I
got some *very* weird timing artifacts out of DrRacket during this
experiment!)

The question, then, is: how do we implement MICs or PICs using Racket's
macro system? Each send site needs to expand into

- a piece of global state
- a test involving that global state
- a possible update to that global state

Hypothesising some kind of (let-static) form that introduces a
lexically-scoped piece of global state,


Here's 'let-static':

  (define-syntax (let-static stx)
    (syntax-case stx ()
      [(let-static ([var rhs] ...) . body)
       (with-syntax ([(gvar ...)
                      (syntax-local-lift-values-expression
                       (length (syntax->list #'(var ...)))
                       #'(values rhs ...))])
         #'(let-syntax ([var (make-rename-transformer #'gvar)] ...)
             . body))]))
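
To illustrate what the form does, here is a toy use (the `count!` and `spin` names are hypothetical): each textual occurrence of the macro gets its own lifted module-level cell, which survives across evaluations of that site.

```racket
#lang racket

;; let-static as above (with syntax->list).
(define-syntax (let-static stx)
  (syntax-case stx ()
    [(let-static ([var rhs] ...) . body)
     (with-syntax ([(gvar ...)
                    (syntax-local-lift-values-expression
                     (length (syntax->list #'(var ...)))
                     #'(values rhs ...))])
       #'(let-syntax ([var (make-rename-transformer #'gvar)] ...)
           . body))]))

;; Each textual occurrence of (count!) gets its own lifted counter
;; cell; repeated evaluation of the same site keeps incrementing it.
(define-syntax-rule (count!)
  (let-static ([n (box 0)])
    (set-box! n (add1 (unbox n)))
    (unbox n)))

(define (spin k)
  (for/last ([i (in-range k)]) (count!)))
```

A send-site cache is exactly this pattern with the box holding a (class . method) pair instead of a counter.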

 this kind of thing might Just

Work to provide a speedup of almost six-fold on almost-monomorphic
send-heavy code:



(define-syntax cached-send
  (syntax-rules ()
    ((_ obj msg arg ...)
     (let-static ((bc (box #f))
                  (bm (box #f)))
       (let* ((tmp obj)
              (cls (object-ref tmp)))
         (if (eq? (unbox bc) cls)
             ((unbox bm) tmp arg ...)
             (let-values (((method _)
                           (find-method/who 'send tmp 'msg)))
               (set-box! bc cls)
               (set-box! bm method)
               (method tmp arg ...))))))))


That code has a space-safety problem: the caches might keep around 
classes that should be garbage collected. It also has a race condition: 
there's a period of time when the class cache has been updated but the 
method cache hasn't.


The safe-for-space issue can be fixed by using weak boxes. The race 
condition could be fixed by using one weak box instead of two... but 
there's no structure that combines a class and a particular method 
implementation that's strongly held elsewhere. Creating a new pair for 
the weak box to hold works, but it means that every GC is likely to 
flush the cache; maybe that's okay. It also means that every cache miss 
triggers heap allocation, and I don't know any way around that.
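
The single-weak-box variant described here might look like the following sketch (send/cache2 is a hypothetical name; toy vtable objects stand in for real classes so the sketch runs without class-internal, and it assumes set! through a let-static binding works, as the attached script's send/cache0 relies on):

```racket
#lang racket

;; Ryan's let-static (with syntax->list), repeated so this is
;; self-contained.
(define-syntax (let-static stx)
  (syntax-case stx ()
    [(let-static ([var rhs] ...) . body)
     (with-syntax ([(gvar ...)
                    (syntax-local-lift-values-expression
                     (length (syntax->list #'(var ...)))
                     #'(values rhs ...))])
       #'(let-syntax ([var (make-rename-transformer #'gvar)] ...)
           . body))]))

;; Toy vtable objects standing in for real classes.
(struct ob (vt state) #:transparent)
(define (b%-vt selector)
  (case selector
    ((op) (lambda (self x) (+ x 2)))
    (else (error 'dnu "~a" selector))))

;; One weak box holding a freshly allocated (vtable . method) pair:
;; a single set! updates both halves atomically, and the cache
;; retains neither strongly.
(define-syntax send/cache2
  (syntax-rules ()
    ((_ obj-expr msg arg ...)
     (let-static ([*memo* (make-weak-box #f)])
       (let* ([obj obj-expr]
              [vt (ob-vt obj)]
              [cached (weak-box-value *memo*)])
         (if (and cached (eq? (car cached) vt))
             ((cdr cached) obj arg ...)
             (let ([method (vt 'msg)])
               ;; weak boxes are immutable: replace the whole box
               (set! *memo* (make-weak-box (cons vt method)))
               (method obj arg ...))))))))

(define (run x) (send/cache2 (ob b%-vt 'st) op x))
```

Reading the weak box's contents once into `cached` before testing it also avoids the window where a GC clears the box between the test and the call.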


I've attached an updated benchmark script. Still looks like a win on 
speed, but slightly less than before.


Ryan
#lang racket
(require racket/private/class-internal)

(define-syntax (let-static stx)
  (syntax-case stx ()
    [(let-static ([var rhs] ...) . body)
     (with-syntax ([(gvar ...)
                    (syntax-local-lift-values-expression
                     (length (syntax->list #'(var ...)))
                     #'(values rhs ...))])
       #'(let-syntax ([var (make-rename-transformer #'gvar)] ...)
           . body))]))

;; An ordinary Racket class.
(define a%
  (class* object% ()
    (super-new)
    (define/public (op x) (+ x 1))))
;;  Cached versions of send 

;; (Real version should base cache on class, not object.)

;; send/cache0
;; Uses 2 vars.
;; Problem: not safe for space!
;; Problem: race condition in 2 set!s.
(define-syntax send/cache0
  (syntax-rules ()
    [(_ obj-expr msg arg ...)
     (let-static ([*memo-class* #f]
                  [*memo-method* #f])
       (let ([obj obj-expr])
         (let ([f (if (eq? obj *memo-class*)
                      *memo-method*
                      (let-values ([(method _obj)
                                    (find-method/who 'send obj 'msg)])
                        (set! *memo-class* obj)
                        (set! *memo-method* method)
                        method))])
           (f obj arg ...))))]))

;; send/cache1
;; Uses 2 weak-boxes to be safe for space.
;; Problem: race condition in 2 set!s.
(define-syntax send/cache1
  (syntax-rules ()
((_ obj-expr msg arg ...)
 (let-static ([*memo-class* (make-weak-box #f)]
  [*memo-method* (make-weak-box #f)])
   (let* ([obj obj-expr]

Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Matthias Felleisen

Wow. Why didn't I think of asking for this before :-) 


On May 4, 2011, at 7:11 PM, Ryan Culpepper wrote:

 On 05/04/2011 01:57 PM, Tony Garnock-Jones wrote:
 On 2011-05-04 12:04 PM, Matthias Felleisen wrote:
 I still believe that the Java implementation (just under 1s without
 their 'Google' contract) benefits from typed dispatches.
 
 Maybe it does, but it's almost certain that it is benefiting from inline
 caching at send sites (i.e. dynamic type information) much more than it
 will be benefiting from static type information.
 
 A quick-and-dirty comparison of raw send performance on my Mac:
 
 Language    Virtual machine           Nanoseconds per send
 
 Java        Hotspot 64-bit 1.6.0_24    1.4
 Smalltalk   Cog r2382                 21
 Smalltalk   SqueakVM 4.2.4beta1U     122
 Racket      Racket v5.1             ~350
 
 Note that Cog is a JITting VM and SqueakVM is a plain (but very well
 optimised) interpreter. Both Cog and SqueakVM use a couple of levels of
 method lookup cache.
 
 A simple experiment I just performed suggests that a monomorphic inline
 cache hit can reduce the time needed for a send in Racket from 350ns to
 around 60ns, which is a massive win. I've attached the program I used to
 measure this, FWIW. (Run it using command-line Racket, not DrRacket: I
 got some *very* weird timing artifacts out of DrRacket during this
 experiment!)
 
 The question, then, is: how do we implement MICs or PICs using Racket's
 macro system? Each send site needs to expand into
 
 - a piece of global state
 - a test involving that global state
 - a possible update to that global state
 
 Hypothesising some kind of (let-static) form that introduces a
 lexically-scoped piece of global state,
 
 Here's 'let-static':
 
  (define-syntax (let-static stx)
    (syntax-case stx ()
      [(let-static ([var rhs] ...) . body)
       (with-syntax ([(gvar ...)
                      (syntax-local-lift-values-expression
                       (length (syntax->list #'(var ...)))
                       #'(values rhs ...))])
         #'(let-syntax ([var (make-rename-transformer #'gvar)] ...)
             . body))]))
 
  this kind of thing might Just
 Work to provide a speedup of almost six-fold on almost-monomorphic
 send-heavy code:
 
 (define-syntax cached-send
   (syntax-rules ()
     ((_ obj msg arg ...)
      (let-static ((bc (box #f))
                   (bm (box #f)))
        (let* ((tmp obj)
               (cls (object-ref tmp)))
          (if (eq? (unbox bc) cls)
              ((unbox bm) tmp arg ...)
              (let-values (((method _)
                            (find-method/who 'send tmp 'msg)))
                (set-box! bc cls)
                (set-box! bm method)
                (method tmp arg ...))))))))
 
 That code has a space-safety problem: the caches might keep around classes 
 that should be garbage collected. It also has a race condition: there's a 
 period of time when the class cache has been updated but the method cache 
 hasn't.
 
 The safe-for-space issue can be fixed by using weak boxes. The race condition 
 could be fixed by using one instead of two... but there's no structure that 
 combines a class and a particular method implementation that's strongly held 
 elsewhere. Creating a new pair for the weak box to hold works, but it means 
 that every GC is likely to flush the cache, but maybe that's okay. It also 
 means that every cache miss triggers heap allocation, but I don't know any 
 way around that.
 
 I've attached an updated benchmark script. Still looks like a win on speed, 
 but slightly less than before.
 
 Ryan
 time-method-send-v2.rkt




Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread Ryan Culpepper

On 05/04/2011 04:49 PM, Tony Garnock-Jones wrote:

The attached (highly experimental) patch seems to improve the
performance of normal sends (in the case of a cache hit) by roughly 100%
- 150%. The difference between this mere factor of two improvement and
the factor of six-through-ten I was seeing earlier is, I speculate,
related to object-ref taking a lot of time.

Interestingly, the performance of (send something message) is, with this
patch, seemingly roughly on par with the performance of generics
captured using generic and invoked using send-generic.

I haven't yet tried running a full racket system with this patch
applied. I wonder if it makes a difference to the interactive
performance of DrRacket?


Robby: What happened to the drracket automated gui tests? I remember 
there used to be an environment variable to enable the button that ran 
them, but I can't find it in the manual or in the source tree.


Ryan


Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

2011-05-04 Thread D Herring

On 05/04/2011 03:57 PM, Tony Garnock-Jones wrote:

A simple experiment I just performed suggests that a monomorphic
inline cache hit can reduce the time needed for a send in Racket from
350ns to around 60ns, which is a massive win. I've attached the
program I used to measure this, FWIW. (Run it using command-line
Racket, not DrRacket: I got some *very* weird timing artifacts out of
DrRacket during this experiment!)

The question, then, is: how do we implement MICs or PICs using
Racket's macro system? Each send site needs to expand into

- a piece of global state
- a test involving that global state
- a possible update to that global state


You might borrow some ideas from this description of call-site caching 
in SBCL.

http://random-state.net/log/3507261942.html

Later,
Daniel
