Hello list,

This is a call for testers concerning an experimental OCaml compiler
back-end that uses SSE2 instructions for floating-point arithmetic.
This code generation strategy was discussed before on this list, and I
include below a summary in Q&A style.

The new back-end is being considered for inclusion in the next major
release (3.12), but performance testing done so far at INRIA and by
Caml Consortium members is not conclusive.  Additional results
from members of this list would therefore be very welcome.

We're not terribly interested in small (< 50 LOC), Shootout-style
benchmarks, since their performance is very sensitive to code and data
placement.  However, if some of you have a sizeable (> 500 LOC) body
of float-intensive Caml code, we'd be very interested to hear about
the compared speed of the SSE2 back-end and the old back-end on your
code.

Switching to Q&A style:

Q: Where can I get the code?

A: From the SVN repository:

svn checkout http://caml.inria.fr/svn/ocaml/branches/sse2 ocaml-sse2

Source-code only.  Very lightly tested under Windows, so you might be
better off testing under Unix.

Q: What is this SSE2 thingy?

A: An extension of the Intel/AMD x86 instruction set that provides,
among other things, 64-bit float arithmetic instructions operating
over 64-bit float registers.  Before SSE2, the only way to perform
64-bit float arithmetic on x86 was the x87 instructions, which compute
in 80-bit precision and use a stack instead of registers.

Q: Why this sudden interest in SSE2?

A: SSE2 has several potential advantages over x87, including:

- The register-based SSE2 model fits the OCaml back-end much better
  than the stack-based x87 model.  In particular, "let"-bound intermediate
  results of type "float" can be kept in SSE2 registers, while in
  the current x87 mode they are systematically flushed to the stack.

- SSE2 implements exactly 64-bit IEEE arithmetic, giving float results
  that are consistent with those obtained on other platforms and with
  the OCaml bytecode interpreter.  The 80-bit format of x87 produces
  different results and can causes surprises such as "double rounding"
  errors.  (For more explanations, see David Monniaux's excellent article,
  http://hal.archives-ouvertes.fr/hal-00128124/ )

- Some x86 processors execute SSE2 instructions faster than their x87
  counterparts.  This speed difference was notable on the Pentium 4
  in particular, but is much smaller on more recent processors such as
  Core 2.

Note that x86-64 bits systems as well as Mac OS X already use SSE2 as
their default floating-point model.

SSE2 also has some potential disadvantages:

- The instructions are bigger than x87 instructions, causing some
  increase in code size and potentially some decrease in instruction
  cache efficiency.

- Computing intermediate results in 80-bit precision, like x87 does,
  can improve the numerical stability of poorly-conditioned float
  computations, although it doesn't make a difference for well-written
  numerical code.

Q: Is SSE2 universally available on x86 processors?

A: Not universally but pretty close.  SSE2 made its debut in 2000, in
the Pentium 4 processor.  All x86 machines built in the last 4 years
or so support SSE2, but pre-Pentium 4 and pre-Athlon64 processors do not.

Q: So if you adopt this new back-end, OCaml will stop working on my
trusty 1995-vintage Pentium?

A: No.  Under friendly pressure from our Debian friends, we agreed to
keep the x87 back-end alive for a while in parallel with the SSE2
back-end.  The x87 back-end is selected at configuration time if the
processor doesn't support SSE2 or if a special flag is given to the
configure script.

Q: I observed a 20% (speedup|slowdown)!  Should I tell the world about it?

A: If your benchmark spends all its time in 10 lines of OCaml, maybe
not.  On such small codes, variations in code and data placement alone
(without changing the instructions that are actually executed) can
result in performance variations by 20%, so this is just experimental
noise.  Larger programs are less sensitive to this noise, which is why
we're much more interested in results obtained on real OCaml
applications.  Finally, one micro-benchmark slowed down by a factor of
2 for reasons we couldn't explain.

Q: What are those inconclusive results you mentioned?

A: On medium-sized numerical kernels (e.g. FFT, Gaussian process
regression), we've observed speedups of about 8% on Core 2 processors
and somewhat higher on recent AMD processors.  On bigger OCaml
applications that perform floating-point computations but not
exclusively, the performance difference was lost in the noise.

Looking forward to interesting experimental results,

- Xavier Leroy

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Reply via email to