[racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
On 2011-05-04 12:04 PM, Matthias Felleisen wrote:
> I still believe that the Java implementation (just under 1s without
> their 'Google' contract) benefits from typed dispatches.

Maybe it does, but it's almost certain that it is benefiting from
inline caching at send sites (i.e. dynamic type information) much more
than it is benefiting from static type information.

A quick-and-dirty comparison of raw send performance on my Mac:

  Language    Virtual machine           Nanoseconds per send
  Java        Hotspot 64-bit 1.6.0_24     1.4
  Smalltalk   Cog r2382                    21
  Smalltalk   SqueakVM 4.2.4beta1U        122
  Racket      Racket v5.1                ~350

Note that Cog is a JITting VM and SqueakVM is a plain (but very well
optimised) interpreter. Both Cog and SqueakVM use a couple of levels of
method lookup cache.

A simple experiment I just performed suggests that a monomorphic inline
cache hit can reduce the time needed for a send in Racket from 350ns to
around 60ns, which is a massive win. I've attached the program I used
to measure this, FWIW. (Run it using command-line Racket, not DrRacket:
I got some *very* weird timing artifacts out of DrRacket during this
experiment!)

The question, then, is: how do we implement MICs or PICs using Racket's
macro system? Each send site needs to expand into

 - a piece of global state
 - a test involving that global state
 - a possible update to that global state

Hypothesising some kind of (let-static) form that introduces a
lexically-scoped piece of global state, this kind of thing might Just
Work to provide a speedup of almost six-fold on almost-monomorphic
send-heavy code:

  (define-syntax cached-send
    (syntax-rules ()
      ((_ obj msg arg ...)
       (let-static ((bc (box #f))
                    (bm (box #f)))
         (let* ((tmp obj)
                (cls (object-ref tmp)))
           (if (eq? (unbox bc) cls)
               ((unbox bm) tmp arg ...)
               (let-values (((method _) (find-method/who 'send tmp 'msg)))
                 (set-box! bc cls)
                 (set-box! bm method)
                 (method tmp arg ...))))))))

Regards,
  Tony

#lang racket

(require racket/private/class-internal)

;; An ordinary Racket class.
(define a%
  (class* object% ()
    (super-new)
    (define/public (op x) (+ x 1))))

;; Representation of a trivial vtable.
(struct ob (vt state) #:transparent)

;; A simple vtable providing a single method named op.
(define (b%-vt selector)
  (case selector
    ((op) (lambda (self x) (+ x 2)))
    (else (error 'dnu "message not understood: ~v" selector))))

;; A simple class, using b%-vt as its behaviour.
(define (b%) (ob b%-vt 'no-state))

;; An uncached send to a struct ob.
(define-syntax unmemo-send
  (syntax-rules ()
    ((_ obj msg arg ...)
     (let ((tmp obj))
       (((ob-vt tmp) 'msg) tmp arg ...)))))

;; A quasi-cached send to a struct ob.
;;
;; A real cache would have per-send-site state rather than a single
;; (!) global variable.
(define *memo-class* #f)
(define *memo-method* #f)
(define-syntax memo-send
  (syntax-rules ()
    ((_ obj msg arg ...)
     (let* ((tmp obj)
            (cls (ob-vt tmp)))
       (if (eq? *memo-class* cls)
           (*memo-method* tmp arg ...)
           (let ((method (cls 'msg)))
             (set! *memo-class* cls)
             (set! *memo-method* method)
             (method tmp arg ...)))))))

;; Test objects.
(define a0 (new a%))
(define b0 (b%))

;; Syntax: (measure-ns exp)
;;
;; Expands to an expression that repeats exp NREPEATS times,
;; measuring the elapsed time, and returns the number of nanoseconds
;; of CPU time used *per iteration*, excluding any GC time.
(define NREPEATS 500)
(define-syntax measure-ns
  (syntax-rules ()
    ((_ exp)
     (call-with-values
      (lambda ()
        (pretty-print `(measuring exp))
        (time-apply (lambda ()
                      (do ((i 0 (+ i 1)))
                          ((= i NREPEATS))
                        exp))
                    '()))
      (lambda (results cpu real gc)
        (/ (* 10.0 (/ (- cpu gc) 1000.0)) NREPEATS))))))

;; Main program.
;;
;; Measure the time for a null measure-ns loop first, then measure the
;; operations of interest, subtracting the null-time overhead
;; measurement from each to get an estimate of the time taken for the
;; interesting operation.
(let ((null-time (measure-ns 123)))
  (define (report-on t)
    (let ((name (first t))
          (ns/op (second t)))
      (write (list name (- ns/op null-time)))
      (newline)))
  (for-each report-on
            `(;; Report on the loop overhead for sanity checking.
              (constant ,null-time)
              ;; How long does a plain Scheme addition operation take?
              (simple-add ,(measure-ns (+ 123 123)))
              ;; The sends of interest.
              (send ,(measure-ns (send a0 op 123)))
              (unmemo-send ,(measure-ns (unmemo-send b0 op 123)))
              (memo-send ,(measure-ns (memo-send b0 op 123))))))
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
As I was reading a paper, I encountered the word 'cache' and remembered
that I had wanted to implement a caching version of the central method
in my central data structure. That took me about 10 minutes, and I now
have good times for the game:

  with contracts:    1.8s per game
  without contracts: 1.2s per game

I am now within 20% of the fastest Java implementation in my class.
That's good. -- Matthias

p.s. Okay, I'll say it: algorithms matter, too.

_
For list-related administrative tasks:
  http://lists.racket-lang.org/listinfo/dev
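[Editor's note: the kind of method-result caching described above might be sketched as follows. This is a guess at the shape of the change, not Matthias's actual code: `board%` and `moves-for` are hypothetical stand-ins for the real game data structure, and the sketch assumes the expensive method is a pure function of its argument.]

```racket
#lang racket

;; Sketch only: board% and moves-for are hypothetical stand-ins.
;; The hash table caches results keyed on the method's argument, so
;; repeated calls with the same argument skip the expensive work.
(define board%
  (class object%
    (super-new)
    (define cache (make-hash))   ; argument -> previously computed result
    (define/public (moves-for n)
      (hash-ref! cache n
                 (lambda ()
                   ;; Placeholder for the real, expensive computation.
                   (for/list ([i (in-range n)]) (* i i)))))))

(define b (new board%))
(send b moves-for 4)   ; computed: '(0 1 4 9)
(send b moves-for 4)   ; second call answered from the cache
```

`hash-ref!` both looks up the key and installs the computed value when it is missing, so lookup and cache update are a single operation.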
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
-Original Message-
From: dev-boun...@racket-lang.org [mailto:dev-boun...@racket-lang.org] On Behalf Of Matthias Felleisen
Sent: 04 May 2011 22:58
To: Tony Garnock-Jones; D Herring
Cc: dev List
Subject: Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)

> As I was reading a paper, I encountered the word 'cache' and
> remembered that I had wanted to implement a caching version of the
> central method in my central data structure. That took me about 10
> minutes, and I now have good times for the game:
>
>   with contracts:    1.8s per game
>   without contracts: 1.2s per game
>
> I am now within 20% of the fastest Java implementation in my class.
> That's good. -- Matthias
>
> p.s. Okay, I'll say it: algorithms matter, too.

That to me seems most important. Of course implementation of an
algorithm requires wisdom too: Is caching of data faster than
recomputing it? How do we deal with a shortage of RAM? How do we
optimize software? (nowadays surely not by hand) Do software developers
inspire hardware designers, or may/should it be the other way around?
There is a tendency to separate programmers from the hardware equipment
they are using. Is this wise? I think hardware designers must
understand programmers and vice versa. Most important, both parties
must understand their customers.

mho, Jos
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
The attached (highly experimental) patch seems to improve the
performance of normal sends (in the case of a cache hit) by roughly
100% - 150%. The difference between this mere factor-of-two improvement
and the factor of six-through-ten I was seeing earlier is, I speculate,
related to object-ref taking a lot of time.

Interestingly, the performance of (send something message) is, with
this patch, seemingly roughly on par with the performance of generics
captured using generic and invoked using send-generic.

I haven't yet tried running a full Racket system with this patch
applied. I wonder if it makes a difference to the interactive
performance of DrRacket?

Major weaknesses of the approach:

 - overhead on cache miss is unknown
 - not safe-for-space yet (weak boxes are immutable)

The approach should generalize easily to polymorphic inline caches.

Regards,
  Tony

diff --git a/collects/racket/private/class-internal.rkt b/collects/racket/private/class-internal.rkt
index d00ef10..0a96cba 100644
--- a/collects/racket/private/class-internal.rkt
+++ b/collects/racket/private/class-internal.rkt
@@ -3734,19 +3734,30 @@
   (let ()
     (define (do-method traced? stx form obj name args rest-arg?)
-      (with-syntax ([(sym method receiver)
-                     (generate-temporaries (syntax (1 2 3)))])
+      (with-syntax ([(sym in-object in-class state method receiver)
+                     (generate-temporaries (syntax (1 2 3 4 5 6)))]
+                    [*cached-state*
+                     (syntax-local-lift-expression
+                      (syntax (cons #f #f)))])
         (quasisyntax/loc stx
           (let*-values ([(sym) (quasiquote (unsyntax (localize name)))]
-                        [(method receiver)
-                         (find-method/who '(unsyntax form)
-                                          (unsyntax obj)
-                                          sym)])
+                        [(in-object) (unsyntax obj)]
+                        [(in-class) (and (object? in-object)
+                                         (object-ref in-object))]
+                        [(state) *cached-state*]
+                        [(method)
+                         (if (and in-class (eq? (car state) in-class))
+                             (cdr state)
+                             (let-values ([(m r)
+                                           (find-method/who '(unsyntax form)
+                                                            in-object
+                                                            sym)])
+                               (set! *cached-state* (cons in-class m))
+                               m))])
             (unsyntax
              (make-method-call traced?
                                stx
-                               (syntax/loc stx receiver)
+                               (syntax/loc stx in-object)
                                (syntax/loc stx method)
                                (syntax/loc stx sym)
                                args
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
On 2011-05-04 18:49:13 -0400, Tony Garnock-Jones wrote:
> The attached (highly experimental) patch seems to improve the
> performance of normal sends (in the case of a cache hit) by roughly
> 100% - 150%. [...]
> I haven't yet tried running a full Racket system with this patch
> applied. I wonder if it makes a difference to the interactive
> performance of DrRacket?

Wow, impressive! I've been benchmarking with the DrRacket interactive
tests already for contracts, so I can run my test driver and get some
numbers for that.

Cheers,
Asumu
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
On 2011-05-04 6:54 PM, Asumu Takikawa wrote:
> Wow, impressive! I've been benchmarking with the DrRacket interactive
> tests already for contracts, so I can run my test driver and get some
> numbers for that.

That'd be great. I mean, it'll probably just break, but if it
doesn't... it'd be interesting :-)

Tony
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
On 05/04/2011 01:57 PM, Tony Garnock-Jones wrote:
> [...]
> The question, then, is: how do we implement MICs or PICs using
> Racket's macro system? Each send site needs to expand into
>
>  - a piece of global state
>  - a test involving that global state
>  - a possible update to that global state
>
> Hypothesising some kind of (let-static) form that introduces a
> lexically-scoped piece of global state,

Here's 'let-static':

  (define-syntax (let-static stx)
    (syntax-case stx ()
      [(let-static ([var rhs] ...) . body)
       (with-syntax ([(gvar ...)
                      (syntax-local-lift-values-expression
                       (length (syntax->list #'(var ...)))
                       #'(values rhs ...))])
         #'(let-syntax ([var (make-rename-transformer #'gvar)] ...)
             . body))]))

> this kind of thing might Just Work to provide a speedup of almost
> six-fold on almost-monomorphic send-heavy code:
>
>   (define-syntax cached-send
>     (syntax-rules ()
>       ((_ obj msg arg ...)
>        (let-static ((bc (box #f))
>                     (bm (box #f)))
>          (let* ((tmp obj)
>                 (cls (object-ref tmp)))
>            (if (eq? (unbox bc) cls)
>                ((unbox bm) tmp arg ...)
>                (let-values (((method _) (find-method/who 'send tmp 'msg)))
>                  (set-box! bc cls)
>                  (set-box! bm method)
>                  (method tmp arg ...))))))))

That code has a space-safety problem: the caches might keep around
classes that should be garbage collected. It also has a race condition:
there's a period of time when the class cache has been updated but the
method cache hasn't.

The safe-for-space issue can be fixed by using weak boxes. The race
condition could be fixed by using one instead of two... but there's no
structure that combines a class and a particular method implementation
that's strongly held elsewhere. Creating a new pair for the weak box to
hold works, but it means that every GC is likely to flush the cache;
maybe that's okay. It also means that every cache miss triggers heap
allocation, but I don't know any way around that.

I've attached an updated benchmark script. Still looks like a win on
speed, but slightly less than before.

Ryan

#lang racket

(require racket/private/class-internal)

(define-syntax (let-static stx)
  (syntax-case stx ()
    [(let-static ([var rhs] ...) . body)
     (with-syntax ([(gvar ...)
                    (syntax-local-lift-values-expression
                     (length (syntax->list #'(var ...)))
                     #'(values rhs ...))])
       #'(let-syntax ([var (make-rename-transformer #'gvar)] ...)
           . body))]))

;; An ordinary Racket class.
(define a%
  (class* object% ()
    (super-new)
    (define/public (op x) (+ x 1))))

;; Cached versions of send
;; (Real version should base cache on class, not object.)

;; send/cache0
;; Uses 2 vars.
;; Problem: not safe for space!
;; Problem: race condition in 2 set!s.
(define-syntax send/cache0
  (syntax-rules ()
    [(_ obj-expr msg arg ...)
     (let-static ([*memo-class* #f]
                  [*memo-method* #f])
       (let ([obj obj-expr])
         (let ([f (if (eq? obj *memo-class*)
                      *memo-method*
                      (let-values ([(method _obj)
                                    (find-method/who 'send obj 'msg)])
                        (set! *memo-class* obj)
                        (set! *memo-method* method)
                        method))])
           (f obj arg ...))))]))

;; send/cache1
;; Uses 2 weak-boxes to be safe for space.
;; Problem: race condition in 2 set!s.
(define-syntax send/cache1
  (syntax-rules ()
    ((_ obj-expr msg arg ...)
     (let-static ([*memo-class* (make-weak-box #f)]
                  [*memo-method* (make-weak-box #f)])
       (let* ([obj obj-expr])
         (let ([f (if (eq? obj (weak-box-value *memo-class*))
                      (weak-box-value *memo-method*)
                      (let-values ([(method _obj)
                                    (find-method/who 'send obj 'msg)])
                        (set! *memo-class* (make-weak-box obj))
                        (set! *memo-method* (make-weak-box method))
                        method))])
           (f obj arg ...)))))))
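[Editor's note: the one-box variant Ryan describes in prose (a single weak box holding a freshly allocated pair) did not survive in the archived attachment. The following is a guess at what it might look like, reusing the `let-static` helper and the `find-method/who` import from the script above; `send/cache2` and its internal names are invented here. Like the versions above, it caches on the object rather than the class.]

```racket
;; send/cache2 (hypothetical; not from the original attachment)
;; Uses one weak box holding a fresh (object . method) pair, so both
;; cache fields are updated in a single set!, closing the race in the
;; two-set! versions. The pair is only weakly held, so any GC may flush
;; the cache, and every miss allocates a pair and a weak box.
(define-syntax send/cache2
  (syntax-rules ()
    [(_ obj-expr msg arg ...)
     (let-static ([*memo* (make-weak-box #f)])
       (let* ([obj obj-expr]
              [entry (weak-box-value *memo*)])
         (let ([f (if (and entry (eq? (car entry) obj))
                      (cdr entry)
                      (let-values ([(method _obj)
                                    (find-method/who 'send obj 'msg)])
                        (set! *memo* (make-weak-box (cons obj method)))
                        method))])
           (f obj arg ...))))]))
```

If the weak box's pair has been collected, `weak-box-value` returns `#f` and the send simply falls through to the miss path, so a GC-flushed cache degrades to a normal lookup rather than an error.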
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
Wow. Why didn't I think of asking for this before :-)

On May 4, 2011, at 7:11 PM, Ryan Culpepper wrote:
> On 05/04/2011 01:57 PM, Tony Garnock-Jones wrote:
> > [...]
>
> Here's 'let-static': [...]
>
> That code has a space-safety problem: the caches might keep around
> classes that should be garbage collected. It also has a race
> condition: there's a period of time when the class cache has been
> updated but the method cache hasn't. [...]
>
> I've attached an updated benchmark script. Still looks like a win on
> speed, but slightly less than before.
>
> Ryan
>
> time-method-send-v2.rkt
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
On 05/04/2011 04:49 PM, Tony Garnock-Jones wrote:
> The attached (highly experimental) patch seems to improve the
> performance of normal sends (in the case of a cache hit) by roughly
> 100% - 150%. [...]
> I haven't yet tried running a full Racket system with this patch
> applied. I wonder if it makes a difference to the interactive
> performance of DrRacket?

Robby: What happened to the drracket automated gui tests? I remember
there used to be an environment variable to enable the button that ran
them, but I can't find it in the manual or in the source tree.

Ryan
Re: [racket-dev] Inline caching (was Re: my '312' this semester, how we compare to others)
On 05/04/2011 03:57 PM, Tony Garnock-Jones wrote:
> A simple experiment I just performed suggests that a monomorphic
> inline cache hit can reduce the time needed for a send in Racket from
> 350ns to around 60ns, which is a massive win. [...]
> The question, then, is: how do we implement MICs or PICs using
> Racket's macro system? Each send site needs to expand into
>
>  - a piece of global state
>  - a test involving that global state
>  - a possible update to that global state

You might borrow some ideas from this description of call-site caching
in SBCL.

http://random-state.net/log/3507261942.html

Later,
Daniel