I have made the changes necessary to compile GHC with llvm-gcc. The major
change was to use the pthread API for thread-local storage to access the gct
variable during garbage collection. My measurements indicate this causes an
average slowdown of about 5% for GC-heavy programs. The changes are available
from the `clang` branch on my GitHub fork:

    git://github.com/dmpots/ghc.git clang

The branch contains only two new patches. One patch changes the GC to use
pthreads for thread-local storage when the `llvm_CC_FLAVOR` symbol is defined
by the preprocessor, and the other patch defines the symbol based on an
autoconf test. The autoconf patch may be a bit heavy-handed because it is
really just checking whether the `__llvm__` symbol is defined by the C
compiler. I based it on an answer to a Stack Overflow question:

    http://stackoverflow.com/questions/1617877/how-to-detect-llvm-and-its-version-through-define-directives

I'm open to suggestions on improving either patch.
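
The pthread-based access to gct described above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the actual
patch: the `gc_thread` struct, `set_gct`, `demo_gct`, and `USING_LLVM_CC` are
hypothetical names invented for the example. Only the pthread calls and the
compiler-predefined `__llvm__` macro are real.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical flag mirroring the idea of the autoconf check:
 * LLVM-based C compilers predefine __llvm__. */
#if defined(__llvm__)
#define USING_LLVM_CC 1
#else
#define USING_LLVM_CC 0
#endif

/* Hypothetical stand-in for the RTS's per-thread GC state. */
typedef struct gc_thread_ {
    int thread_index;
} gc_thread;

static pthread_key_t  gct_key;
static pthread_once_t gct_key_once = PTHREAD_ONCE_INIT;

static void make_gct_key(void) {
    int rc = pthread_key_create(&gct_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Every read of gct turns into a pthread_getspecific() call. */
#define gct ((gc_thread *)pthread_getspecific(gct_key))

/* Bind the calling thread's GC state to the key. */
static void set_gct(gc_thread *t) {
    pthread_once(&gct_key_once, make_gct_key);
    int rc = pthread_setspecific(gct_key, t);
    assert(rc == 0);
    (void)rc;
}

static void *worker(void *arg) {
    set_gct((gc_thread *)arg);
    return gct;                 /* report what this thread sees as gct */
}

/* Returns 1 if the main thread and a worker each see their own gct. */
int demo_gct(void) {
    gc_thread main_state = { 1 }, worker_state = { 2 };
    pthread_t tid;
    void *seen_by_worker = NULL;

    set_gct(&main_state);
    if (pthread_create(&tid, NULL, worker, &worker_state) != 0) return 0;
    if (pthread_join(tid, &seen_by_worker) != 0) return 0;

    return gct == &main_state && seen_by_worker == &worker_state;
}
```

Because `gct` expands to a function call here, each use costs a lookup that a
pinned register or a `__thread` variable would avoid, which is the likely
source of the GC slowdown measured below.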

I've been using the following configure line to test the llvm-gcc support:

    $ CC=/usr/bin/llvm-gcc ./configure --with-gcc=/usr/bin/llvm-gcc

The validate script finds the same errors with or without my patches.

For the performance measurements, I looked at the fibon benchmarks and the
nofib gc benchmarks. Both benchmark suites were run on Mac OS X 10.6 with a
64-bit GHC.

The fibon benchmarks show an average slowdown of 3% in execution time, but the
GC time slows down by an average of 10%. The nofib gc benchmarks show an
average execution time slowdown of 5%.

The detailed results are below. In the fibon results, a negative number means
that the llvm-gcc version was slower and a positive number means it was faster.
The efficiency column is the percent of total execution time spent in the
mutator, that is, outside the garbage collector (a program with negligible GC
time, such as SpectralNorm, shows 100%).

Fibon Results
-----------------------------------------------------------------
Benchmark      MutCPUTime   GCCPUTime   TotalCPUTime   Efficiency
Agum               +8.52%     -10.95%         +4.14%       78.56%
BinaryTrees        -0.06%     -16.01%         -6.09%       64.40%
Blur               -0.19%      -3.03%         -0.22%       99.06%
Bzlib              -2.65%      -3.08%         -2.66%       99.90%
Chameneos         -22.01%     -11.22%        -21.95%       99.55%
Cpsa               +1.82%      -8.68%         +0.88%       91.23%
Crypto             -1.13%     -15.03%         -8.91%       48.58%
FFT2d              +3.98%      -5.58%         +3.52%       95.26%
FFT3d              +0.44%      -3.25%         +0.35%       97.50%
Fannkuch           -1.92%      -7.41%         -2.27%       93.87%
Fgl                +3.18%     -11.23%         -2.71%       60.60%
Fst                -0.21%     -19.60%         -3.84%       81.98%
Funsat             +0.55%     -11.31%         -4.35%       60.39%
Gf                 +0.29%      -9.78%         -2.80%       70.16%
HaLeX              +3.59%     -16.01%         +2.86%       96.39%
Happy              -0.37%     -13.65%         -6.07%       59.52%
Hgalib             +2.09%      -9.85%         +1.14%       92.11%
Laplace            +0.04%      -5.34%         -0.17%       96.07%
MMult             -13.61%      -6.31%        -13.38%       97.27%
Mandelbrot         +0.12%      -3.76%         +0.11%       99.81%
Nbody              +0.11%      -4.12%         +0.08%       99.35%
Palindromes       +15.02%     -15.63%         -4.46%       41.55%
Pappy              +2.18%     -11.66%         -9.90%       20.66%
Pidigits           +0.49%     -21.80%         -3.83%       81.35%
QuickCheck         -2.02%      +2.75%         -1.14%       81.42%
Regex              +3.39%      -6.62%         +2.92%       95.41%
Simgi              +5.20%     -16.37%         -0.10%       76.39%
SpectralNorm       +0.13%        ----         +0.13%      100.00%
TernaryTrees       +2.79%      -9.93%         -3.66%       51.37%
Xsact              -0.52%     -14.58%         -6.84%       57.94%
-----------------------------------------------------------------
Min               -22.01%     -21.80%        -21.95%       20.66%
Mean               +0.31%      -9.97%         -2.97%       79.59%
Max               +15.02%      +2.75%         +4.14%      100.00%

In the nofib results, a positive number means the llvm-gcc version was slower
and a negative number means it was faster (sorry for the inconsistency!).

NoFib Results
------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
------------------------------------------------------------------------------
        circsim         -75.0%     +0.0%     +5.6%     +5.1%     +0.0%
        constraints     -75.6%     +0.0%     +6.6%     +6.2%     +0.0%
        gc_bench        -75.9%     +0.0%     +8.6%     +8.4%     +0.0%
        lcss            -76.2%     +0.0%     +7.7%     +6.8%     +0.0%
        power           -74.6%     +0.0%     +5.4%     +4.6%     +0.9%
        spellcheck      -80.3%     +0.0%     -1.0%     -1.9%     +0.0%
------------------------------------------------------------------------------
        Min             -80.3%     +0.0%     -1.0%     -1.9%     +0.0%
        Max             -74.6%     +0.0%     +8.6%     +8.4%     +0.9%
        Geometric Mean  -76.3%     +0.0%     +5.4%     +4.8%     +0.1%
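
For readers who want to check the summary row, here is a small sketch of how a
geometric mean of percentage changes can be computed. It assumes the row is the
geometric mean of the per-program ratios (for example, +5.6% becomes 1.056),
which is a reading of the table rather than something stated in this message.
The function names and the `nofib_runtime` array are hypothetical. Under that
assumption, the Runtime column comes out at roughly +5.4%.

```c
#include <assert.h>
#include <stddef.h>

/* n-th root of p (p > 0) by Newton's method, so the example
 * builds without linking libm. */
static double nth_root(double p, int n) {
    double x = 1.0;
    for (int iter = 0; iter < 100; iter++) {
        double xn1 = 1.0;                       /* x^(n-1) */
        for (int k = 0; k < n - 1; k++) xn1 *= x;
        x = ((n - 1) * x + p / xn1) / n;
    }
    return x;
}

/* Geometric mean of percentage changes, returned as a percentage change. */
double geomean_percent(const double *pct, size_t count) {
    assert(count > 0);
    double product = 1.0;
    for (size_t i = 0; i < count; i++) {
        product *= 1.0 + pct[i] / 100.0;        /* +5.6% -> ratio 1.056 */
    }
    return (nth_root(product, (int)count) - 1.0) * 100.0;
}

/* Runtime column of the NoFib table, in table order. */
static const double nofib_runtime[] = { 5.6, 6.6, 8.6, 7.7, 5.4, -1.0 };
```

Calling `geomean_percent(nofib_runtime, 6)` gives the Runtime entry; the other
columns can be checked the same way.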
On Jun 27, 2011, at 6:18 PM, David Peixotto wrote:
> I'll take a look at getting the llvm-gcc route going by switching the gct
> variable to use pthread_getspecific() on mac os x. I can do some benchmarking
> to measure the impact.
>
> I was playing around just to get the compilation to succeed. After a small
> change in StgCRun.c, the compile went through but then it was getting a
> segfault in the stage 2 compiler because of the global register variables.
>
> I thought that llvm-gcc would complain about the global register variables,
> but it seems to accept them and generate the assembly code to read and write
> them. Only problem is it will also use these registers for other purposes, so
> the gct was getting stomped which was causing the segfaults.
>
> So from what I can see llvm-gcc dies at compile time when given __thread
> variables and accepts global register variables but can generate code that
> stomps on the register.
>
> -David
>
> On Jun 24, 2011, at 3:23 AM, Simon Marlow wrote:
>
>> On 21/06/2011 05:51, Manuel M T Chakravarty wrote:
>>> austin seipp:
>>>> (CC'ing Dan so he can chime in, for those who don't IRC.)
>>>>
>>>> Dan Knapp (dankna on freenode) is running OS X Lion on his machine
>>>> (and corresponding new xcode tools I believe,) and apparently Apple
>>>> have gone the whole way in the next release and by default making
>>>> 'gcc' a symbolic link to 'llvm-gcc.'
>>>
>>> Just like my prediction ;)
>>>
>>>> It's likely that will soon be
>>>> clang, given llvm-gcc is already deprecated as of LLVM 2.9. There is
>>>> still a regular GCC bundled with Lion apparently, ISTR Dan saying the
>>>> executable was under /Developer under the name
>>>> 'i686-apple-darwin-gcc-4.2' or somesuch, but I can't verify that (Snow
>>>> Leopard here.) Anyone with lion want to chime in?
>>>
>>> I would assume that 'gcc-4.2' will still point to the traditional GCC for a
>>> while. Especially with C++, clang is still behind and there are still the
>>> odd code generator bugs in LLVM that require code generation with
>>> traditional gcc.
>>>
>>>> Dan was working on build fixes/RTS fixes last week to try and make GHC
>>>> build cleanly with the pthread_getspecific and work with compilers
>>>> other than GCC. I think he did make some good headway in this area,
>>>> but his work isn't done either.
>>>>
>>>> Considering global register variables are a rather rare and intricate
>>>> GCC extension, it's much more likely that we will see __thread support
>>>> in Clang first (TLS also has implications for C++0x I've heard them
>>>> say.) It's not on their short-term TODO list, however. In the mean
>>>> time if apple were to remove GCC entirely for some reason, we'd still
>>>> need Dan's patches, wouldn't we?
>>>
>>> If we could move to clang (on OS X) that would be ideal, but as I wrote
>>> above I seriously doubt that Apple will entirely remove gcc (at least not
>>> before whatever cat comes after Lion). So, for the time being, and until
>>> we can use clang, I think it would be wise to use 'gcc-4.2' as a default on
>>> OS X (instead of 'gcc', which appears to morph into llvm-gcc soon). If we
>>> do that for GHC 7.2, then GHC 7.2 won't break once Apple flips the sym link
>>> over.
>>>
>>> Simon, what do you think?
>>
>> I have no strong opinions, you guys know the platform much better than me,
>> so I'm happy to go with whatever you think makes the most sense.
>>
>> One thing I would keep an eye on is the performance of the GC, because the
>> handling of the gct thread-local variable is critical. I can help you with
>> some quick benchmarks if you want to test out changes.
>>
>> Cheers,
>> Simon
>>
>>
>>
>>> Manuel
>>>
>>>
>>>> On Sun, Jun 19, 2011 at 9:43 PM, Manuel M T Chakravarty
>>>> <[email protected]> wrote:
>>>>> As llvm-gcc on OS X seems to require some work, I wonder whether we
>>>>> should by default build with the 'gcc-4.2' executable on OS X (which uses
>>>>> the traditional gcc backend), instead of the generic 'gcc' (probably
>>>>> still using 'gcc' as a fallback in configure if 'gcc-4.2' is not
>>>>> available). Then, when Apple makes the switch, binary GHC packages will
>>>>> continue to work.
>>>>>
>>>>> Manuel
>>>>>
>>>>> PS: I am all for resolving the problems with llvm-gcc, but that will
>>>>> likely take a while. It'd be good to get a fix into 7.2, though.
>>>>>
>>>>> Simon Marlow:
>>>>>> On 01/06/2011 13:30, Manuel M T Chakravarty wrote:
>>>>>>> Simon Marlow:
>>>>>>>> On 01/06/2011 07:11, Manuel M T Chakravarty wrote:
>>>>>>>>> Simon Marlow:
>>>>>>>>>> On 30/05/2011 14:59, Manuel M T Chakravarty wrote:
>>>>>>>>>>> It is no secret that Apple moves away from the traditional GCC
>>>>>>>>>>> backend to LLVM. In fact, Xcode (which bundles all command line
>>>>>>>>>>> developer tools on the Mac) today comes with two flavours of gcc:
>>>>>>>>>>> 'gcc' and 'llvm-gcc', which AFAIK only differ in the backend that is
>>>>>>>>>>> being used. Currently, the default is the traditional GCC backend,
>>>>>>>>>>> but it takes no precognition to realise that this will eventually
>>>>>>>>>>> change. The 'gcc' executable will use the LLVM backend and, at least
>>>>>>>>>>> for a while, the traditional backend will still be available under a
>>>>>>>>>>> different name.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, GHC will break at this point as the LLVM backend does
>>>>>>>>>>> not support pinned global registers. ('llvm-gcc' happily accepts the
>>>>>>>>>>> register assignment, but fails with a runtime error during code
>>>>>>>>>>> generation.)
>>>>>>>>>>
>>>>>>>>>> This shouldn't be a problem. We don't use pinned global registers
>>>>>>>>>> any more, except in one place - the GC (see rts/sm/GCTDecl.h).
>>>>>>>>>> There it's optional, but you lose a bit of performance by not using
>>>>>>>>>> a pinned register. It's not a huge deal.
>>>>>>>>>>
>>>>>>>>>> Have you tried building GHC with llvm-gcc? I think I tried it on
>>>>>>>>>> the RTS a year or so ago to check the LLVM output against gcc (LLVM
>>>>>>>>>> wasn't quite as good at the time).
>>>>>>>>>
>>>>>>>>> Yes, I tried and it failed, while compiling the RTS, with
>>>>>>>>>
>>>>>>>>> sorry, unimplemented: LLVM cannot handle register variable
>>>>>>>>> ‘R1’, report a bug
>>>>>>>>>
>>>>>>>>> This was using the 64bit version of GHC. I'll have a closer look.
>>>>>>>>
>>>>>>>> Perhaps that was when compiling StgCRun.c? It doesn't actually need
>>>>>>>> register variables (on x86_64 at least), but it does include the
>>>>>>>> header files, so that probably needs some #ifdefery somewhere for
>>>>>>>> llvm-gcc.
>>>>>>>
>>>>>>> Yes, it's in 'StgCRun.c'. Ok, and how about on i386 (or do you want
>>>>>>> to phase that arch out)?
>>>>>>
>>>>>> It doesn't look like the x86 code in StgCRun.c uses registers either.
>>>>>> The sparc version does, but it could be rewritten.
>>>>>>
>>>>>>>> The other place, as I mentioned above, is rts/sm/GCTDecl.h, which will
>>>>>>>> need to use a different method for declaring the garbage collector's
>>>>>>>> thread-local state variable, gct. On x86_64 I found that using a
>>>>>>>> fixed register was the fastest, but using a thread-local variable (the
>>>>>>>> __thread modifier) also works.
>>>>>>>
>>>>>>> Just to make sure I understand correctly, are you saying that using a
>>>>>>> thread-local variable is already implemented as an option,
>>>>>>
>>>>>> Yes - look at the series of #ifdefs in that file, it's pretty
>>>>>> straightforward to change how gct is declared for a particular platform.
>>>>>>
>>>>>> However, I've just done some poking around and it seems that __thread is
>>>>>> not supported on OS X:
>>>>>>
>>>>>> http://lifecs.likai.org/2010/05/mac-os-x-thread-local-storage.html
>>>>>>
>>>>>> see also this thread about Clang:
>>>>>>
>>>>>> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-March/013673.html
>>>>>>
>>>>>> It seems there might be support for __thread in the future, but not in
>>>>>> the short term.
>>>>>>
>>>>>> It seems our very own David Peixotto tried building GHC with Clang a
>>>>>> year ago and ran into the same thing:
>>>>>>
>>>>>> http://www.dmpots.com/blog/2010/05/08/building-ghc-with-clang.html
>>>>>>
>>>>>> So this is less than ideal. The short term fix would be to #define gct
>>>>>> to be a call to pthread_getspecific(). The call will be inlined - the
>>>>>> OS X headers define pthread_getspecific in terms of some inline
>>>>>> assembly, but the optimiser won't know anything about the inline
>>>>>> assembly so it won't be able to common up multiple loads of gct, and
>>>>>> that probably means it won't perform well. If that's the case, then the
>>>>>> solution is to load up gct into a temporary in the performance-critical
>>>>>> functions in the GC (evacuate(), scavenge_block()), and add it as an
>>>>>> argument to inline functions. I'd rather avoid having to do all that if
>>>>>> possible.
>>>>>>
>>>>>> If you want to benchmark the GC, there are some good programs in
>>>>>> nofib/gc.
>>>>>>
>>>>>> Cheers,
>>>>>> Simon
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Cvs-ghc mailing list
>>>>> [email protected]
>>>>> http://www.haskell.org/mailman/listinfo/cvs-ghc
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Austin
>>>
>>
>>
>>
>
>
>