Re: GCC, Mac OS X & the future

Simon Marlow Fri, 01 Jul 2011 12:05:42 -0700

On 30/06/11 17:43, David Peixotto wrote:

I have made the changes necessary to compile GHC with llvm-gcc. The
major change was to use the pthread API for thread-local storage to
access the gct variable during garbage collection. My measurements
indicate this causes an average slowdown of about 5% for GC-heavy
programs. The changes are available from the `clang` branch on my
github fork.

Sounds good. One question: did you measure the GC performance with
-threaded? The thread-specific variable in the GC is only used with
-threaded.


Cheers,
        Simon

git://github.com/dmpots/ghc.git clang

The branch contains only two new patches. One patch changes the gc to
use pthreads for thread local storage when the `llvm_CC_FLAVOR`
symbol is defined by the preprocessor, and the other patch defines
the symbol based on an autoconf test. The autoconf patch may be a bit
heavy-handed because it is really just checking to see if the
`__llvm__` symbol is defined by the C compiler. I based it on an
answer from a stack overflow question:
http://stackoverflow.com/questions/1617877/how-to-detect-llvm-and-its-version-through-define-directives.
I'm open to suggestions on improving either patch.

I've been using the following configure line to test the llvm-gcc
support.

$ CC=/usr/bin/llvm-gcc ./configure --with-gcc=/usr/bin/llvm-gcc

The validate script finds the same errors with or without my
patches.

For the performance measurements, I looked at the fibon benchmarks
and the nofib gc benchmarks. Both benchmarks were tested on MacOS X
10.6 with a 64-bit GHC.

The fibon benchmarks show an average slowdown of 3% in execution
time, but the gc time slows down by an average of 10%. The nofib gc
benchmarks show an average execution time slowdown of 5%.

The detailed results are below. In the fibon results, a negative
number means that the llvm-gcc version is slower and a positive
number means it was faster. The efficiency column is the percent of
total execution time spent in the garbage collector.

Fibon Results
-----------------------------------------------------------------
Benchmark       MutCPUTime    GCCPUTime TotalCPUTime   Efficiency
-----------------------------------------------------------------
Agum                +8.52%      -10.95%       +4.14%       78.56%
BinaryTrees         -0.06%      -16.01%       -6.09%       64.40%
Blur                -0.19%       -3.03%       -0.22%       99.06%
Bzlib               -2.65%       -3.08%       -2.66%       99.90%
Chameneos          -22.01%      -11.22%      -21.95%       99.55%
Cpsa                +1.82%       -8.68%       +0.88%       91.23%
Crypto              -1.13%      -15.03%       -8.91%       48.58%
FFT2d               +3.98%       -5.58%       +3.52%       95.26%
FFT3d               +0.44%       -3.25%       +0.35%       97.50%
Fannkuch            -1.92%       -7.41%       -2.27%       93.87%
Fgl                 +3.18%      -11.23%       -2.71%       60.60%
Fst                 -0.21%      -19.60%       -3.84%       81.98%
Funsat              +0.55%      -11.31%       -4.35%       60.39%
Gf                  +0.29%       -9.78%       -2.80%       70.16%
HaLeX               +3.59%      -16.01%       +2.86%       96.39%
Happy               -0.37%      -13.65%       -6.07%       59.52%
Hgalib              +2.09%       -9.85%       +1.14%       92.11%
Laplace             +0.04%       -5.34%       -0.17%       96.07%
MMult              -13.61%       -6.31%      -13.38%       97.27%
Mandelbrot          +0.12%       -3.76%       +0.11%       99.81%
Nbody               +0.11%       -4.12%       +0.08%       99.35%
Palindromes        +15.02%      -15.63%       -4.46%       41.55%
Pappy               +2.18%      -11.66%       -9.90%       20.66%
Pidigits            +0.49%      -21.80%       -3.83%       81.35%
QuickCheck          -2.02%       +2.75%       -1.14%       81.42%
Regex               +3.39%       -6.62%       +2.92%       95.41%
Simgi               +5.20%      -16.37%       -0.10%       76.39%
SpectralNorm        +0.13%         ----       +0.13%      100.00%
TernaryTrees        +2.79%       -9.93%       -3.66%       51.37%
Xsact               -0.52%      -14.58%       -6.84%       57.94%
-----------------------------------------------------------------
Min                -22.01%      -21.80%      -21.95%       20.66%
Mean                +0.31%       -9.97%       -2.97%       79.59%
Max                +15.02%       +2.75%       +4.14%      100.00%


In the nofib results a positive number means the llvm-gcc version was
slower and a negative number means it was faster (sorry for the
inconsistency!)

NoFib Results
------------------------------------------------------------------
Program               Size    Allocs   Runtime   Elapsed  TotalMem
------------------------------------------------------------------
circsim             -75.0%     +0.0%     +5.6%     +5.1%     +0.0%
constraints         -75.6%     +0.0%     +6.6%     +6.2%     +0.0%
gc_bench            -75.9%     +0.0%     +8.6%     +8.4%     +0.0%
lcss                -76.2%     +0.0%     +7.7%     +6.8%     +0.0%
power               -74.6%     +0.0%     +5.4%     +4.6%     +0.9%
spellcheck          -80.3%     +0.0%     -1.0%     -1.9%     +0.0%
------------------------------------------------------------------
Min                 -80.3%     +0.0%     -1.0%     -1.9%     +0.0%
Max                 -74.6%     +0.0%     +8.6%     +8.4%     +0.9%
Geometric Mean      -76.3%     +0.0%     +5.4%     +4.8%     +0.1%

On Jun 27, 2011, at 6:18 PM, David Peixotto wrote:

I'll take a look at getting the llvm-gcc route going by switching
the gct variable to use pthread_getspecific() on mac os x. I can do
some benchmarking to measure the impact.

I was playing around just to get the compilation to succeed. After
a small change in StgCRun.c, the compile went through, but then it
was getting a segfault in the stage 2 compiler because of the
global register variables.

I thought that llvm-gcc would complain about the global register
variables, but it seems to accept them and generate the assembly
code to read and write them. Only problem is it will also use these
registers for other purposes, so the gct was getting stomped which
was causing the segfaults.

So from what I can see llvm-gcc dies at compile time when given
__thread variables and accepts global register variables but can
generate code that stomps on the register.

-David

On Jun 24, 2011, at 3:23 AM, Simon Marlow wrote:

On 21/06/2011 05:51, Manuel M T Chakravarty wrote:

austin seipp:

(CC'ing Dan so he can chime in, for those who don't IRC.)

Dan Knapp (dankna on freenode) is running OS X Lion on his
machine (and the corresponding new Xcode tools, I believe), and
apparently Apple have gone the whole way in the next release
and made 'gcc' a symbolic link to 'llvm-gcc' by default.


Just like my prediction ;)

It's likely that will soon be clang, given llvm-gcc is
already deprecated as of LLVM 2.9. There is still a regular
GCC bundled with Lion apparently, ISTR Dan saying the
executable was under /Developer under the name
'i686-apple-darwin-gcc-4.2' or somesuch, but I can't verify
that (Snow Leopard here.) Anyone with lion want to chime in?


I would assume that 'gcc-4.2' will still point to the
traditional GCC for a while.  Especially with C++, clang is
still behind and there are still the odd code generator bugs in
LLVM that require code generation with traditional gcc.

Dan was working on build fixes/RTS fixes last week to try and
make GHC build cleanly with the pthread_getspecific and work
with compilers other than GCC. I think he did make some good
headway in this area, but his work isn't done either.

Considering global register variables are a rather rare and
intricate GCC extension, it's much more likely that we will
see __thread support in Clang first (TLS also has
implications for C++0x I've heard them say.) It's not on
their short-term TODO list, however. In the mean time if
apple were to remove GCC entirely for some reason, we'd
still need Dan's patches, wouldn't we?


If we could move to clang (on OS X) that would be ideal, but as
I wrote above I seriously doubt that Apple will entirely remove
gcc (at least not before whatever cat comes after Lion).  So,
for the time being, and until we can use clang, I think it
would be wise to use 'gcc-4.2' as a default on OS X (instead of
'gcc', which appears to morph into llvm-gcc soon).  If we do
that for GHC 7.2, then GHC 7.2 won't break once Apple flips the
sym link over.

Simon, what do you think?


I have no strong opinions, you guys know the platform much better
than me, so I'm happy to go with whatever you think makes the
most sense.

One thing I would keep an eye on is the performance of the GC,
because the handling of the gct thread-local variable is
critical.  I can help you with some quick benchmarks if you want
to test out changes.

Cheers, Simon

Manuel

On Sun, Jun 19, 2011 at 9:43 PM, Manuel M T Chakravarty wrote:

As llvm-gcc on OS X seems to require some work, I wonder
whether we should by default build with the 'gcc-4.2'
executable on OS X (which uses the traditional gcc
backend), instead of the generic 'gcc' (probably still
using 'gcc' as a fallback in configure if 'gcc-4.2' is not
available).  Then, when Apple makes the switch, binary GHC
packages will continue to work.

Manuel

PS: I am all for resolving the problems with llvm-gcc, but
that will likely take a while.  It'd be good to get a fix
into 7.2, though.

Simon Marlow:

On 01/06/2011 13:30, Manuel M T Chakravarty wrote:

Simon Marlow:

On 01/06/2011 07:11, Manuel M T Chakravarty wrote:

Simon Marlow:

On 30/05/2011 14:59, Manuel M T Chakravarty
wrote:

It is no secret that Apple is moving away from the
traditional GCC backend to LLVM.  In fact,
Xcode (which bundles all command line developer
tools on the Mac) today comes with two flavours
of gcc: 'gcc' and 'llvm-gcc', which AFAIK only
differ in the backend that is being used.
Currently, the default is the traditional GCC
backend, but it takes no precognition to
realise that this will eventually change.  The
'gcc' executable will use the LLVM backend and,
at least for a while, the traditional backend
will still be available under a different
name.

Unfortunately, GHC will break at this point as
the LLVM backend does not support pinned global
registers.  ('llvm-gcc' happily accepts the
register assignment, but fails with a runtime
error during code generation.)


This shouldn't be a problem.  We don't use pinned
global registers any more, except in one place -
the GC (see rts/sm/GCTDecl.h).  There it's
optional, but you lose a bit of performance by
not using a pinned register.  It's not a huge
deal.

Have you tried building GHC with llvm-gcc?  I
think I tried it on the RTS a year or so ago to
check the LLVM output against gcc (LLVM wasn't
quite as good at the time).


Yes, I tried and it failed, while compiling the
RTS, with

sorry, unimplemented: LLVM cannot handle register
variable ‘R1’, report a bug

This was using the 64bit version of GHC.  I'll have
a closer look.


Perhaps that was when compiling StgCRun.c? It doesn't
actually need register variables (on x86_64 at
least), but it does include the header files, so that
probably needs some #ifdefery somewhere for
llvm-gcc.


Yes, it's in 'StgCRun.c'.   Ok, and how about on i386
(or do you want to phase that arch out)?


It doesn't look like the x86 code in StgCRun.c uses
registers either. The sparc version does, but it could be
rewritten.

The other place, as I mentioned above, is
rts/sm/GCTDecl.h, which will need to use a different
method for declaring the garbage collector's
thread-local state variable, gct.  On x86_64 I found
that using a fixed register was the fastest, but
using a thread-local variable (the __thread modifier)
also works.


Just to make sure I understand correctly, are you
saying that using a thread-local variable is already
implemented as an option,


Yes - look at the series of #ifdefs in that file, it's
pretty straightforward to change how gct is declared for
a particular platform.

However, I've just done some poking around and it seems
that __thread is not supported on OS X:

http://lifecs.likai.org/2010/05/mac-os-x-thread-local-storage.html

see also this thread about Clang:


http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-March/013673.html

It seems there might be support for __thread in the future, but not in
the short term.


It seems our very own David Peixotto tried building GHC
with Clang a year ago and ran into the same thing:

http://www.dmpots.com/blog/2010/05/08/building-ghc-with-clang.html

So this is less than ideal. The short-term fix would be to #define gct
to be a call to pthread_getspecific(). The call will be inlined - the
OS X headers define pthread_getspecific in terms of some inline
assembly, but the optimiser won't know anything about the inline
assembly so it won't be able to common up multiple loads of gct, and
that probably means it won't perform well. If that's the case, then the
solution is to load up gct into a temporary in the performance-critical
functions in the GC (evacuate(), scavenge_block()), and add it as an
argument to inline functions. I'd rather avoid having to do all that if
possible.


If you want to benchmark the GC, there are some good
programs in nofib/gc.

Cheers, Simon



_______________________________________________ Cvs-ghc
mailing list [email protected]
http://www.haskell.org/mailman/listinfo/cvs-ghc




-- Regards, Austin


