RE: [Haskell-cafe] transparent parallelization

2007-09-19 Thread Tim Harris (RESEARCH)
 The problem with Haskell is not finding opportunities to parallelize;
 they are legion. Quite the opposite: there are so many that exploiting
 them all leaves your code slower than a sequential realization.  The
 hard part is making a good cost model and a good way to create coarser
 chunks of work.  It's not worthwhile to spawn a thread (even a very
 lightweight one) for virtually every subexpression.

 Automatic parallelization is easy; efficient parallelization is hard.

Absolutely.

We had another crack at this recently in an upcoming ICFP paper (draft online 
at http://research.microsoft.com/~tharris/papers/2007-fdip.pdf).

In that paper we start by collecting profiles of the thunk creation, entry,
and update operations for a set of benchmarks, and then conduct a limit study
of how fast these dependencies would allow the programs to be evaluated
(ignoring, for instance, interactions through the GC, cache effects, and the
costs of sparking thunks).

We then make the system increasingly practical: (i) setting a lower bound
on the compute-time of the thunks that we spark, (ii) predicting a thunk's
compute-time at the point where it is allocated, and finally (iii) building
a real implementation based on GHC 6.6.

The parallel speedup dwindles at each stage: this kind of automated approach 
looks plausible for using 2n cores instead of using n, but I'm sceptical 
about it being able to efficiently exploit n cores instead of 1.

Tim



Re: [Haskell-cafe] transparent parallelization

2007-09-18 Thread Thomas Schilling
On Tue, 2007-09-18 at 18:13 +0200, Thomas Girod wrote:
 Hi there. Being rather new to the realm of Haskell and functional
 programming, I've been reading about how much easier it is to
 parallelize code in a purely functional language (am I right in
 saying that?).
 
 My knowledge of parallelization is also very weak, but I've been
 thinking about this and I have a question. Let's say we take a small
 piece of code, like the quicksort algorithm.
 
  qsort [] = []
  qsort (x:xs) = qsort lesser ++ [x] ++ qsort greater
    where lesser  = filter (< x) xs
          greater = filter (>= x) xs
 
 (actually I didn't test this code, this is just for the example)
 
 Here, computing lesser and greater are two disjoint actions: we
 don't need the result of one to compute the other, and Haskell does
 not allow one to alter data in a way that would change the behaviour
 of the other. So I guess those could be run in parallel.
 
 Would it be fairly simple to automatically determine the parts of code
 that can be run in parallel, and the parts that cannot (at least in
 non-monadic code)?
 
 So the final question is: if it is possible to automatically identify
 segments of code that can be run in parallel, does [insert compiler name
 here] compile this code as a single-threaded program, or as a
 multithreaded one?
 
 I guess on single-processor machines it is more efficient to do it as a
 single thread - but with the many-core processors that are coming, that
 probably won't be true anymore.
 
 Sorry if this post is blowing off open doors (I guess this doesn't mean
 anything in English but it sounds cool) ...
 

Detecting parallelism is possible, but generally rather fragile.
Coarse-grained parallelism in the form of threads (or processes) is only
efficient if enough data can be processed in parallel.  This in turn is
determined by the data dependencies, which are hard to detect
automatically.  To preserve program semantics the analysis has to be
conservative, i.e., assume that two parts of a program depend on each
other unless it can prove otherwise.  (OpenMP, by contrast, relies on the
programmer to explicitly declare what to parallelize.)

A better way is to let the user specify the algorithm at a higher level.
One very promising technique for doing this is explored in Data Parallel
Haskell (DPH) [1].  In DPH you specify your algorithm as a functional
program operating on vectors, and nested parallelism is allowed, i.e.,
you can call parallel functions inside parallel functions.  Implemented
naïvely, this could easily lead to inefficiencies due to too little
workload per thread.  This is where GHC's rewriting capabilities kick in
and transform nested parallel programs into flat parallel programs.  I
really recommend reading the paper(s) (see [1]).

/ Thomas

[1] http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell




RE: [Haskell-cafe] transparent parallelization

2007-09-18 Thread bf3
 http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

Wow, this is cool stuff! It would be nice to have something like this for
the Playstation 3 :-)

Regarding parallelism, I wonder how this extension will compare to Sun's
Fortress language, if/when it is finally released.

Peter




Re: [Haskell-cafe] transparent parallelization

2007-09-18 Thread Dave Tapley
If I recall correctly, this property of qsort is exploited rather neatly
with Nested Data Parallelism, covered in this talk:
http://www.londonhug.net/2007/05/25/video-of-spjs-talk-is-now-online/

Good food for thought :)

Dave

On 18/09/2007, Thomas Girod [EMAIL PROTECTED] wrote:
 Hi there. Being rather new to the realm of Haskell and functional
 programming, I've been reading about how much easier it is to
 parallelize code in a purely functional language (am I right in saying
 that?).

 My knowledge of parallelization is also very weak, but I've been thinking
 about this and I have a question. Let's say we take a small piece of code,
 like the quicksort algorithm.

  qsort [] = []
  qsort (x:xs) = qsort lesser ++ [x] ++ qsort greater
    where lesser  = filter (< x) xs
          greater = filter (>= x) xs

 (actually I didn't test this code, this is just for the example)

 Here, computing lesser and greater are two disjoint actions: we don't
 need the result of one to compute the other, and Haskell does not allow
 one to alter data in a way that would change the behaviour of the other.
 So I guess those could be run in parallel.

 Would it be fairly simple to automatically determine the parts of code
 that can be run in parallel, and the parts that cannot (at least in
 non-monadic code)?

 So the final question is: if it is possible to automatically identify
 segments of code that can be run in parallel, does [insert compiler name
 here] compile this code as a single-threaded program, or as a
 multithreaded one?

 I guess on single-processor machines it is more efficient to do it as a
 single thread - but with the many-core processors that are coming, that
 probably won't be true anymore.

 Sorry if this post is blowing off open doors (I guess this doesn't mean
 anything in English but it sounds cool) ...

 Regards

 Thomas Girod







Re: [Haskell-cafe] transparent parallelization

2007-09-18 Thread Jan-Willem Maessen


On Sep 18, 2007, at 4:24 PM, [EMAIL PROTECTED] wrote:

 http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

 Wow, this is cool stuff! It would be nice to have something like this
 for the Playstation 3 :-)

 Regarding parallelism, I wonder how this extension will compare to
 Sun's Fortress language, if/when it is finally released.


The scope of the two is very different.  DPH proposes a single, rather
flexible data structure: nested Data Parallel Arrays (really as much
list-like as array-like).  The underlying data structure is manipulated
using bulk operations like zip, sum, and comprehensions.

By contrast, Fortress defines the notion of a "Generator", which you can
think of as akin to a parallel version of Data.Traversable or ListLike,
where the fundamental primitive is a generalization of foldP and mapP.
This is strictly more general: we can define many of the operations in
Data.PArr on arbitrary data types, permitting us to talk about the sum
of the elements of a set, or mapping a function across a distributed
array.  We can define nested data parallel arrays in Fortress.  There
isn't (yet) an equivalent of the RULES pragma that would permit Fortress
to optimize combinations of function calls.  However, clever uses of
type information and function overloading let Fortress do some
interesting program transformations of its own (e.g., early exit for
reductions with left zeros).  Finally, Fortress actually has a mechanism
for defining new types of comprehensions (though this isn't in the
language specification yet).

The other nice thing about Generators is that we can support consumption
of large or infinite things, if we're very careful about how we do the
consumption.  We're planning to write an equivalent of hGetContents that
works over blocks of file data in parallel where possible, but processes
streams as chunks of data become available.  It remains to be seen how
this will work out in practice, though.  Our goal is something
lazy-ByteString-like or rope-like.
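
If it helps to see the shape of the Generator idea in Haskell terms,
here is a rough, hypothetical rendering (my sketch, not Fortress syntax
or its actual API): the single primitive is a monoid-valued fold, and
because monoid combination is associative, an instance is free to split
its structure and reduce the halves in parallel.

  import Data.Monoid (Monoid(..), Sum(..))

  -- The one primitive: fold a monoid over the elements.
  class Generator g where
    foldG :: Monoid m => (a -> m) -> g a -> m

  -- Bulk operations are then defined once, for every Generator:
  sumG :: (Generator g, Num a) => g a -> a
  sumG = getSum . foldG Sum

  -- A balanced tree reduces its two halves independently, so a
  -- suitable scheduler can evaluate them in parallel.
  data Tree a = Leaf a | Branch (Tree a) (Tree a)

  instance Generator Tree where
    foldG f (Leaf x)     = f x
    foldG f (Branch l r) = foldG f l `mappend` foldG f r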



So: DPH: available today (-ish), one (very flexible) data structure.   
Bulk operations on a data structure run in parallel.  Relies on RULES  
+ special compiler support (am I getting that right?  You can fuse  
multiple traversals of a common argument, which isn't doable using  
RULES, right?) to make it all run nicely.  A well-established  
performance model, cribbed from NESL, for the PArr bits.
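
(For anyone who hasn't met RULES: a rewrite rule is a source-level hint
that GHC applies while optimizing.  The classic example, illustrative
only and not one of DPH's actual rules, fuses two traversals so the
intermediate list is never built; fusing two traversals that share a
common argument, as opposed to feeding one into the other, is the part
that seems to need support beyond RULES:)

  {-# RULES
  "map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
    #-}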


Fortress: Arrays and lists are currently built in.  Bulk operations on a
data structure can run in parallel.  There is the ability to define new
parallel types with carefully-tailored traversal (e.g., we have a
PureList, based on Ralf Hinze and Ross Paterson's finger trees, whose
traversal walks the tree structure in parallel).  No optimization RULES
yet (an interpreter doesn't optimize), but a good deal of type-based
code selection.  The implementation is less complete than DPH in general
(even the Generator API is in flux, though the fundamental use of a
foldP-like operation hasn't changed over time).


-Jan-Willem Maessen
 Longtime Haskell Hacker
 Fortress Library Developer

PS - By the way, working on Generators has increased my suspicion that
comprehensions do NOT have a naturally monadic structure (which bugged
me when I worked on parallel list traversal optimization in pH back in
1994).  It just happens that for cons-lists the two structures coincide.
If anyone else has had similarly subversive thoughts, I'd love to chat.



Re: [Haskell-cafe] transparent parallelization

2007-09-18 Thread Derek Elkins
On Tue, 2007-09-18 at 18:13 +0200, Thomas Girod wrote:
 Hi there. Being rather new to the realm of Haskell and functional
 programming, I've been reading about how much easier it is to
 parallelize code in a purely functional language (am I right in
 saying that?).
 
 My knowledge of parallelization is also very weak, but I've been
 thinking about this and I have a question. Let's say we take a small
 piece of code, like the quicksort algorithm.
 
  qsort [] = []
  qsort (x:xs) = qsort lesser ++ [x] ++ qsort greater
    where lesser  = filter (< x) xs
          greater = filter (>= x) xs
 
 (actually I didn't test this code, this is just for the example)
 
 Here, computing lesser and greater are two disjoint actions: we
 don't need the result of one to compute the other, and Haskell does
 not allow one to alter data in a way that would change the behaviour
 of the other. So I guess those could be run in parallel.

 Would it be fairly simple to automatically determine the parts of code
 that can be run in parallel, and the parts that cannot (at least in
 non-monadic code)?

The problem with Haskell is not finding opportunities to parallelize;
they are legion. Quite the opposite: there are so many that exploiting
them all leaves your code slower than a sequential realization.  The
hard part is making a good cost model and a good way to create coarser
chunks of work.  It's not worthwhile to spawn a thread (even a very
lightweight one) for virtually every subexpression.

Automatic parallelization is easy; efficient parallelization is hard.
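
To make the granularity point concrete, here is a minimal sketch of
doing the chunking by hand with par and pseq from the parallel library's
Control.Parallel; the cutoff of 25 is a made-up threshold playing the
role of the cost model that is hard to derive automatically:

  import Control.Parallel (par, pseq)

  -- Spark the left branch only while the problem is big enough to
  -- amortize the spark's overhead; below the cutoff, stay sequential.
  pfib :: Int -> Integer
  pfib n
    | n < 25    = sfib n
    | otherwise = x `par` (y `pseq` (x + y))
    where
      x = pfib (n - 1)
      y = pfib (n - 2)

  -- Plain sequential version used below the cutoff.
  sfib :: Int -> Integer
  sfib n
    | n < 2     = fromIntegral n
    | otherwise = sfib (n - 1) + sfib (n - 2)

  main :: IO ()
  main = print (pfib 35)  -- build with -threaded; run with +RTS -N2

Without the cutoff, every recursive call would be sparked, and the
program would spend more time managing sparks than computing.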
