Thanks very much for this information. My observations match your recommendations, insofar as I can test them.
Cheers,
John

On Mon, Jun 25, 2012 at 11:42 PM, Simon Marlow <marlo...@gmail.com> wrote:
> On 19/06/12 02:32, John Lato wrote:
>>
>> Thanks for the suggestions. I'll try them and report back. Although
>> I've since found that out of 3 not-identical systems, this problem
>> only occurs on one. So I may try different kernel/system libs and see
>> where that gets me.
>>
>> -qg is funny. My interpretation from the results so far is that, when
>> the parallel collector doesn't get stalled, it results in a big win.
>> But when parGC does stall, it's slower than disabling parallel GC
>> entirely.
>
>
> Parallel GC is usually a win for idiomatic Haskell code; it may or may not
> be a good idea for things like Repa - I haven't done much analysis of those
> types of programs yet. Experiment with the -A flag, e.g. -A1m is often
> better than the default if your processor has a large cache.
>
> However, the parallel GC will be a problem if one or more of your cores is
> being used by other process(es) on the machine. In that case, the GC
> synchronisation will stall and performance will go down the drain. You can
> often see this on a ThreadScope profile as a big delay during GC while the
> other cores wait for the delayed core. Make sure your machine is quiet
> and/or use one fewer cores than the total available. It's not usually a
> good idea to use hyperthreaded cores either.
>
> I'm also seeing unpredictable performance on a 32-core AMD machine with
> NUMA. I'd avoid NUMA for Haskell for the time being if you can. Indeed you
> get unpredictable performance on this machine even for single-threaded code,
> because it makes a difference on which node the pages of your executable are
> cached (I heard a rumour that Linux has some kind of a fix for this in the
> pipeline, but I don't know the details).
>
>
>> I had thought the last core parallel slowdown problem was fixed a
>> while ago, but apparently not?
>
> We improved matters by inserting some "yield"s into the spinlock loops.
> This helped a lot, but the problem still exists.
>
> Cheers,
> Simon
>
>
>> Thanks,
>> John
>>
>> On Tue, Jun 19, 2012 at 8:49 AM, Ben Lippmeier <b...@ouroborus.net> wrote:
>>>
>>> On 19/06/2012, at 24:48, Tyson Whitehead wrote:
>>>
>>>> On June 18, 2012 04:20:51 John Lato wrote:
>>>>>
>>>>> Given this, can anyone suggest any likely causes of this issue, or
>>>>> anything I might want to look for? Also, should I be concerned about
>>>>> the much larger gc_alloc_block_sync level for the slow run? Does that
>>>>> indicate the allocator waiting to alloc a new block, or is it
>>>>> something else? Am I on completely the wrong track?
>>>>
>>>> A total shot in the dark here, but wasn't there something about really
>>>> bad performance when you used all the CPUs on your machine under Linux?
>>>>
>>>> Presumably very tight coupling that is causing all the threads to stall
>>>> every time the OS needs to do something?
>>>
>>> This can be a problem for data-parallel computations (like in Repa). In
>>> Repa all threads in the gang are supposed to run for the same time, but
>>> if one gets swapped out by the OS then the whole gang is stalled.
>>>
>>> I tend to get the best results using -N7 for an 8-core machine.
>>>
>>> It is also important to enable thread affinity (with the -qa flag).
>>>
>>> For a Repa program on an 8-core machine I use +RTS -N7 -qa -qg
>>>
>>> Ben.
>>
>> _______________________________________________
>> Glasgow-haskell-users mailing list
>> Glasgow-haskell-users@haskell.org
>> http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
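For anyone finding this thread later, the recipe discussed above (build with -threaded and -rtsopts, run with one fewer capability than the machine has cores, enable thread affinity, and disable parallel GC for Repa-style workloads) can be sketched as a small launcher script. The binary name "myprog" is hypothetical; the RTS flags themselves (-N, -qa, -qg, -A1m) are the real ones mentioned in the thread. The script echoes the command rather than executing it, so the recipe is visible:

```shell
#!/bin/sh
# Sketch of the launch recipe from this thread. "myprog" is a hypothetical
# binary, assumed to be built with: ghc -O2 -threaded -rtsopts myprog.hs
CORES=$(getconf _NPROCESSORS_ONLN)   # logical cores reported by the OS
N=$((CORES - 1))                     # leave one core free, per Ben's advice
[ "$N" -lt 1 ] && N=1                # fallback for a single-core machine
# -N$N : use N capabilities
# -qa  : pin OS threads to cores (thread affinity)
# -qg  : disable parallel GC (helps when parallel GC stalls, as above)
# -A1m : 1 MB allocation area, often better than the default on large caches
echo "./myprog +RTS -N$N -qa -qg -A1m"
```

Whether -qg helps depends on the workload: Simon's point above is that parallel GC is usually a win for idiomatic Haskell code when no core is contended, so it is worth measuring both ways.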