Re: [Chicken-hackers] Floating point performance

2019-04-24 Thread Peter Bex
On Thu, Apr 18, 2019 at 06:28:11PM +0200, Peter Bex wrote:
> Now, fp+ is only inlineable if the scrutinizer can prove that it's adding
> flonums, otherwise it falls back to a CPS call.  I'm sure we can change
> that relatively easily by making it into an inline function that uses
> check_flonum or so.  We could rename the current one to C_a_u_i_flonum_plus,
> which is more correct anyway since it's unsafe and may crash when given
> another kind of object.
> 
> Of course this means several more intrinsics will have to be added as
> safe versions for each of the specific flonum operators.  Thoughts?

OK, maybe we can do this differently, and in a more controlled way, by
automating it.  I've created http://bugs.call-cc.org/ticket/1611 to track this.

I believe this approach also allows us to get rid of some of the rewrites
in c-platform.scm, which I've always found quite an eyesore.

Cheers,
Peter


signature.asc
Description: PGP signature
___
Chicken-hackers mailing list
Chicken-hackers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-hackers


Re: [Chicken-hackers] Floating point performance

2019-04-19 Thread felix . winkelmann
> To make this code specialize two things are needed:
> 
> 1. Infer more specific types for (recursive) functions.
> 
> 2. Prove that a function is always called with correct arguments.
> 
> If we can do this then we can effectively re-walk the function with the
> arguments assumed to be of the correct types. This should cause all
> calls to be specialized.
> 
> Feature 1. is hard and lots of work, but doable. It requires, for example,
> adding support for unifying two type variables.
> 
> Feature 2. is probably not that hard for the simple cases like above.
> And handling the simple cases might be enough.
> 
> [...]
> >
> > That's only slower due to a C_trace call.  With -d0 it produces
> > more or less identical code with -O3.
> 
> I do get a nice speedup with -O3 if I annotate the sum inside the fp+
> call: (the float sum).

I was about to suggest that; there is also "assume", IIRC.

Doing global flow-analysis is a completely different problem from
what the scrutinizer currently implements, and it was never intended
to do such an analysis. This is really, really hard, just as megane
says. There is also the question of whether all the additional complexity
needed for this is worth the effort (we are still talking about a
micro-benchmark here), and the scrutinizer is already terribly complex.

Regarding suggestion 2: one could track calls to unexported ("block-mode")
functions and so build a set of possible argument types. But this
cannot be done without some form of declaration or compile-mode;
otherwise, debugging cases where things are "optimized" even though
procedures escape (via low-level APIs, or simply because of wrong
assumptions by the user) will drive you insane.


felix





Re: [Chicken-hackers] Floating point performance

2019-04-19 Thread megane


Peter Bex  writes:

[...]
>
> Of course this means several more intrinsics will have to be added as
> safe versions for each of the specific flonum operators.  Thoughts?

I can't see a reason why not, except that it's a lot of code to write.

>
> I also wonder why the scrutinizer can't detect that these are always
> flonum arguments.  Probably because it's a recursive loop?  These
> local vars never escape, so it should be possible to make assumptions
> about them.

Here's the relevant -debug 2 output:

(##core#app
 (let ((doloop1617 (##core#undefined)))
   (let ((t20 (set! doloop1617
               (##core#lambda
                (i18 sum19)
                (if (chicken.fixnum#fx= i18 size13)
                    (chicken.base#print sum19)
                    (##core#app
                     doloop1617
                     (chicken.fixnum#fx+ i18 '1)
                     (chicken.flonum#fp+
                      sum19
                      (srfi-4#f64vector-ref v14 i18))))))))
     (let () doloop1617)))
 '0
 '0.0)

For the fp+ call to be specialized, both of the arguments would have to
be inferred to be floats. The second one is; 'sum19', however, is not.

In the scrutinizer the function arguments are always of type * at the
beginning of a function's body. Then their types are refined with
predicates in 'if' branches or with calls to enforcing functions. Here,
before the fp+ call, 'sum19' is only used as an argument to print, which
doesn't refine anything.

To make this code specialize two things are needed:

1. Infer more specific types for (recursive) functions.

2. Prove that a function is always called with correct arguments.

If we can do this then we can effectively re-walk the function with the
arguments assumed to be of the correct types. This should cause all
calls to be specialized.

Feature 1. is hard and lots of work, but doable. It requires, for example,
adding support for unifying two type variables.

Feature 2. is probably not that hard for the simple cases like above.
And handling the simple cases might be enough.

[...]
>
> That's only slower due to a C_trace call.  With -d0 it produces
> more or less identical code with -O3.

I do get a nice speedup with -O3 if I annotate the sum inside the fp+
call: (the float sum).



[Chicken-hackers] Floating point performance

2019-04-18 Thread Peter Bex
Hi guys,

I came across this post[1] and could not resist writing something like it
in CHICKEN:

(import srfi-4 (chicken fixnum) (chicken flonum))

(let* ((size (* 3200))
       (v (make-f64vector size)))
  (time
   (do ((i 0 (fx+ i 1))
        (sum 0.0 (fp+ sum (f64vector-ref v i))))
       ((fx= i size) (print sum)))))

The above code is actually slower with fp+ than with just +!

I checked, and the reason is that with + the generated C contains a goto
loop, due to the use of C_s_a_i_plus(), which is inlineable.
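The difference can be sketched in plain C. This is illustrative only, not
actual CHICKEN output (real generated code works on C_word values and
trampolines): with an inlineable "+", the whole do-loop can stay inside one
C function as a flat goto loop.

```c
#include <stddef.h>

/* Illustrative sketch of the loop shape an inlineable "+" permits:
   the do-loop head becomes a plain C label and the tail call a goto,
   so no CPS call happens per iteration. */
static double sum_with_goto(const double *v, size_t size) {
  size_t i = 0;
  double sum = 0.0;
loop:                          /* loop head: a plain label */
  if (i == size) return sum;   /* the (fx= i size) exit test */
  sum += v[i];                 /* inlined add, no call */
  i++;
  goto loop;                   /* tail call to doloop becomes goto */
}
```

With a non-inlineable fp+, each iteration would instead have to build a
continuation and make a CPS call, which is what makes the fp+ version slower.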

Now, fp+ is only inlineable if the scrutinizer can prove that it's adding
flonums, otherwise it falls back to a CPS call.  I'm sure we can change
that relatively easily by making it into an inline function that uses
check_flonum or so.  We could rename the current one to C_a_u_i_flonum_plus,
which is more correct anyway since it's unsafe and may crash when given
another kind of object.
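A rough C sketch of the checked-inline idea, under loud assumptions: the
tagged-object layout, make_flonum, and the NULL fallback below are invented
for illustration and are not the actual CHICKEN runtime representation or
API; only the spirit of check_flonum and the safe/unsafe split comes from
the text above.

```c
#include <stdlib.h>

/* Toy tagged-value model, for illustration only. */
typedef struct { int tag; double value; } obj;
enum { FLONUM_TAG = 1 };

static obj *make_flonum(double d) {
  obj *o = malloc(sizeof *o);
  o->tag = FLONUM_TAG;
  o->value = d;
  return o;
}

/* Check the tag before trusting the payload. */
static int check_flonum(const obj *x) {
  return x != NULL && x->tag == FLONUM_TAG;
}

/* Unsafe variant: assumes both arguments are flonums and may crash
   or misbehave when given another kind of object. */
static obj *flonum_plus_unsafe(const obj *a, const obj *b) {
  return make_flonum(a->value + b->value);
}

/* Safe inline variant: verify first, fall back otherwise.  In the real
   compiler the fallback branch would be a CPS call into the generic "+";
   here NULL simply marks that path. */
static obj *flonum_plus_safe(const obj *a, const obj *b) {
  if (!check_flonum(a) || !check_flonum(b))
    return NULL;                      /* stand-in for the CPS fallback */
  return flonum_plus_unsafe(a, b);    /* fast inline path */
}
```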

Of course this means several more intrinsics will have to be added as
safe versions for each of the specific flonum operators.  Thoughts?

I also wonder why the scrutinizer can't detect that these are always
flonum arguments.  Probably because it's a recursive loop?  These
local vars never escape, so it should be possible to make assumptions
about them.


For completeness, I also tried named let:

(let* ((size (* 3200))
       (v (make-f64vector size)))
  (time
   (let lp ((i 0)
            (sum 0.0))
     (if (fx= i size)
         (print sum)
         (lp (fx+ i 1) (fp+ sum (f64vector-ref v i)))))))

That's only slower due to a C_trace call.  With -d0 it produces
more or less identical code with -O3.

With -O5 we get into more interesting territory, as we get unboxed
flonum references as a result of choosing the unsafe fp+ call.
We could get the same speedup at safe optimization levels if the
scrutinizer were able to deduce the types here.
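The -O5 effect can also be illustrated with a C sketch (again invented for
illustration, not real generated code): at safe levels each intermediate sum
is a freshly allocated flonum, while the unboxing enabled by the unsafe fp+
lets the accumulator live in a raw C double.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct { double value; } boxed_flonum;   /* toy boxed flonum */

static boxed_flonum *box(double d) {
  boxed_flonum *b = malloc(sizeof *b);
  b->value = d;
  return b;
}

/* Boxed accumulation: one heap allocation per element, roughly the
   cost of keeping every intermediate sum as a heap flonum. */
static double sum_boxed(const double *v, size_t n) {
  boxed_flonum *sum = box(0.0);
  for (size_t i = 0; i < n; i++)
    sum = box(sum->value + v[i]);
  return sum->value;
}

/* Unboxed accumulation: the sum stays in a C double (a register);
   only the final result would need boxing. */
static double sum_unboxed(const double *v, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += v[i];
  return sum;
}
```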

Cheers,
Peter

[1] https://jackmott.github.io//programming/2016/07/22/making-obvious-fast.html

