Re: [Haskell-cafe] optimising for vector units

2004-07-29 Thread Jan-Willem Maessen - Sun Labs East
Ronny Wichers Schreur wrote:
Jan-Willem Maessen writes (in the Haskell Cafe):
My ultimate goal was parallelization.  However, I'm now convinced it 
would take about 2-3 programmer-years more of effort to realize that 
goal.  Parallel garbage collection (which is *not* optional on a 
parallel Haskell implementation, as our experience with pH 
demonstrated) and very fine-grained thunk-update protocols are both 
tricky technical challenges.  And that leaves aside the question of 
whether there's enough coarse-grained work buried among all that 
fine-grained work to make it all worthwhile. 

I'm not sure how to interpret this. Will you also have solved
the granularity problem in these two to three years?
Probably not.  One might hope to do well by choosing an appropriate 
schedule (favor leaves locally, roots when work is moved between 
processors).  But there's no evidence generating the work lazily would 
even leave enough work to be done.

There's a difference between a system that would work (and probably 
work OK for 2-4 processors) and a system that would scale.  The latter 
would also require some hard thought by the application programmer, as 
well.

-Jan-Willem Maessen
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-29 Thread Ronny Wichers Schreur
Jan-Willem Maessen writes (in the Haskell Cafe):
My ultimate goal was parallelization.  However, I'm now convinced it 
would take about 2-3 programmer-years more of effort to realize that 
goal.  Parallel garbage collection (which is *not* optional on a 
parallel Haskell implementation, as our experience with pH demonstrated) 
and very fine-grained thunk-update protocols are both tricky technical 
challenges.  And that leaves aside the question of whether there's 
enough coarse-grained work buried among all that fine-grained work to 
make it all worthwhile. 
I'm not sure how to interpret this. Will you also have solved
the granularity problem in these two to three years?
Cheers,
Ronny Wichers Schreur
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-28 Thread MR K P SCHUPKE

Erm, when I said no overhead I meant there is no overhead to choosing
an instruction from a different thread compared to choosing an instruction
from the same thread... Obviously the overall scheduling overhead will 
increase. 

>The real killer, of course, is memory latency.  The cache resources
>required to hold pending work can't be used to hold heap data

I would have thought cache thrashing caused by two CPUs accessing the
same address repeatedly would be the real killer... Of course this
is not so much of a problem with hyper-threading where both
'virtual' CPUs access the same cache.

>* All your functions must be strict in their arguments,

I don't see that I am making that assumption. Obviously parts of
the program that are in the IO monad are strict, so we start with
the demand for a value required strictly by an IO function. This
value will be the result of a function, so we take the graph
representing the function, and carry on as suggested. This is
definitely lazy as parameters not affecting the result will never
be evaluated.

I appreciate there are some unanswered hard questions, but most
of these are to do with efficiency - not whether it is possible...

Having an inefiicient implementation now rather than nothing 
might just be what is needed to get people looking at the
efficiency issues.

Thanks for the references by the way - although I am quite
familiar with the Monsoon architecture - and the Pentium
architecture (which since PPro has been internaly dataflow...
SuperScalar is really a dataflow computer emulating a
traditional processor, and interpreting its instructions)

Keean./
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-28 Thread Jan-Willem Maessen - Sun Labs East
MR K P SCHUPKE wrote:
That was me.  I think you're underestimating the cost of starting
threads even in this very lightweight world.

Maybe... Perhaps haskell could be made to resemble dataflow instructions
more... If when a computation completes we insert the result directly
into the data structure which represents function, we can infact pick any
function for execution with _no_ overhead. (In a true dataflow system any
instruction from any thread can be executed with no overhead)
With architectural support for dataflow execution, this is true 
(though hardware support for scheduling necessarily imposes resource 
limitations).  You might be rather interested in a couple of books 
about the Monsoon dataflow machine, and the compiler for the 
functional language (Id) used to prgram it---Id is similar in many 
ways to Haskell.  The Haskell dialect pH was basically Id with Haskell 
syntax, and early versions used the Id back end to generate Monsoon code.

@book{102865,
 author = {Gregory M. Papadopoulos},
 title = {Implementation of a general-purpose dataflow multiprocessor},
 year = {1991},
 isbn = {0-262-66069-5},
 publisher = {MIT Press},
 }
@book{103033,
 author = {Kenneth R. Traub},
 title = {Implementation of non-strict functional programming languages},
 year = {1991},
 isbn = {0-262-70042-5},
 publisher = {MIT Press},
 }
Interestingly, even in a general-purpose dataflow system it became 
obvious that pushing data to and from memory was rather inefficient. 
As a result, Monsoon had registers to handle loop code, and ran inner 
loops sequentially.  A discussion of the tradeoffs between sequential 
and parallel loops can be found here:

@inproceedings{165203,
 author = {Boon Seong Ang},
 title = {Efficient implementation of sequential loops in dataflow 
computation},
 booktitle = {Proceedings of the conference on Functional programming 
languages and computer architecture},
 year = {1993},
 isbn = {0-89791-595-X},
 pages = {169--178},
 location = {Copenhagen, Denmark},
 doi = {http://doi.acm.org/10.1145/165180.165203},
 publisher = {ACM Press},
 }

The suggestion I guess is to only use instructions that get their arguments
from main memory. 
In the absence of architectural support for dataflow, this is the 
simplest option by far.

> This way any instruction can be sequenced with no overhead
on any CPU. With modern on-die core-speed caches this can be almost as fast
a registers (with good cache access patterns)
>
> ... Note that I am only suggesting
> interleaving instructions at the function level, so registers can 
be used within
> functions... of course as things get more and more parallel we may see
> hardware with no registers, just pipelined high speed cache access. 
(The
> hardware may well use registers to pre-fetch cache values, but that can
> be made transparent to the software)...

Alas, I think your making a few naive assumptions here:
* All your functions must be strict in their arguments, yet your 
language as a whole is somehow non-strict.  How?
* Tracking which computations are partially satisfied has no overhead 
(we'll need to match arguments to computations somehow).
* Choosing a function which is ready to run has no overhead (we'll 
need a queue, a stack, or some similar data structure to manage ready 
instructions).
* The extra instructions do not interfere in any material way with the 
instruction level parallelism that may have existed in our program 
(doubtful, all those scheduling instructions, operand loads, and 
result stores consume issue bandwidth at the very least).

There's been quite a bit of work on splitting non-strict code up into 
strict threads which can be run in much the way I think you're 
imagining.  The following paper describes some of the basics:
@inproceedings{141568,
 author = {Kenneth R. Traub and David E. Culler and Klaus E. Schauser},
 title = {Global analysis for partitioning non-strict programs into 
sequential threads},
 booktitle = {Proceedings of the 1992 ACM conference on LISP and 
functional programming},
 year = {1992},
 isbn = {0-89791-481-3},
 pages = {324--334},
 location = {San Francisco, California, United States},
 doi = {http://doi.acm.org/10.1145/141471.141568},
 publisher = {ACM Press},
 }

The real killer, of course, is memory latency.  The cache resources 
required to hold pending work can't be used to hold heap data and the 
like instead.  You're going to increase your cache miss rate as well. 
 The result?  A drop in performance.

A multi-threaded processor can help in masking cache miss latency 
(while increasing the miss rate of an individual thread).  But it is 
still limited by the fundamental bandwidth limitations of the 
load/store units.  You will, on the whole, still be doing 
substantially worse than an ordinary program.

Why bother at all?  The same reason we bother with laziness (which has 
essentially the same set of problems in different clothing).  It's 
easy to write clean, beautiful programs.  And I do still hold out ho

Re: [Haskell-cafe] optimising for vector units

2004-07-28 Thread MR K P SCHUPKE
>That was me.  I think you're underestimating the cost of starting
>threads even in this very lightweight world.

Maybe... Perhaps haskell could be made to resemble dataflow instructions
more... If when a computation completes we insert the result directly
into the data structure which represents function, we can infact pick any
function for execution with _no_ overhead. (In a true dataflow system any
instruction from any thread can be executed with no overhead)

The point it at any time we have N functions ready for execution (a function
is ready for execution when all its arguments are ready)... we can pick and
execute any (or all of these if enough execution units are ready) of these.


The suggestion I guess is to only use instructions that get their arguments
from main memory. This way any instruction can be sequenced with no overhead
on any CPU. With modern on-die core-speed caches this can be almost as fast
a registers (with good cache access patterns) ... Note that I am only suggesting
interleaving instructions at the function level, so registers can be used within
functions... of course as things get more and more parallel we may see
hardware with no registers, just pipelined high speed cache access. (The 
hardware may well use registers to pre-fetch cache values, but that can
be made transparent to the software)...

Hardware manufacturers have hit the limit for pure sequential execution speed,
so more parallelism is the only way forward (see Intels revised roadmap, they
abandoned the pentium 4 and 5 and have focused on an updated _low_power_
pentium 3M, and are planning multi core versions for more speed).

C and other imperitive languages focus toom
much on  the how and not the what to be easy to use in such multi-cpu
environments... A language with abstracts and hides the parallelism could
well take off in a big way.


Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-27 Thread Jan-Willem Maessen - Sun Labs East
MR K P SCHUPKE wrote:
A final point someone made about the cost of starting threads... Surely 
the obvious approach
is to start one OS thread per execution unit, and do all the thread 
starting with the very
lightweight haskell threads...
That was me.  I think you're underestimating the cost of starting 
threads even in this very lightweight world.  Again, tens of 
instructions between startup, shutdown, and (very important) 
synchronization to make sure that other threads see the results which 
were produced.  It *can* be done without locks, but it often can't be 
done without memory fences of some sort.

If you actually have to start OS threads, the cost is thousands or 
tens of thousands of cycles.

Even with very lightweight thread creation, you probably don't want to 
 do a multiplication in a separate thread---unless it's, say, an 
Integer multiply with a minimum of a thousand digits or so.

-Jan-Willem Maessen
Keean
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-27 Thread MR K P SCHUPKE
Just wanted to explore a few points you made:
Jon Cast wrote:
No.  A function may not use all of its arguments, so there's no point in
evaluating all arguments---you have to find out which ones are actually
needed.  Terminology note: GHC uses the STG language internally.  In
STG, evaluation is always triggered by a case expression, which
evaluates precisely one expression (not necessarily an argument of
anything, either!)
If a  function never uses an argument you dont need to evaluate it... 
remember in
dataflow we start from the end of the program working backwards so we know
all future demands on function arguments...

Now, (:) cannot `spark' a thread to evaluate xn1 in downFrom, because
that would end up sparking infinitely many threads---which may be
possible, but nothing you've said suggests that it is.
 

Again think of lazy lists in their 'stream' interpretation. We model them as
a FIFO queue and can set high and low water marks in the FIFO. We can use
hysteresis to ensure work gets done in reasonably large chunks.
I guess my description didn't include all the details of the model I had 
in mind...
you obviously dont want to unroll all recursive functions... I was 
thinking along
the lines of expressions like

   let a = c*d + e*f
where you can obviously execute the two multiplies in parallel (the 
multiplies
being the arguments to the (+) function.

Also it is obvious you don't need a new thread for one of the arguments 
(a function
of one argument would use the same thread)...

In the general recursive case:
fna x y z = fna (x+1) (y-1) (z*3)
again we can do the three simple arithmetic operations in parallel but 
have to wait for all
of them to finish before the call to (fna) can be made. So there isn't 
an infinite expansion
of threads there is one thead that does the recursion and one of the 
arithmetic operations
and two that are forked and reaped each iteration.

This is true for tree reduction, but not for graph reduction.  Consider,
e.g., the fact that the call to product in choose and the recursive call
to downFrom walk down (and hence evaluate) the same list.  Since the
cons cells making up this list are evaluated by two different functions,
some sort of locking is indeed necessary to ensure that you don't end up
with two threads for each of the (infinitely many) cons cells in the
list.
 

The list is read only (this is a declarative language) no locking is 
necessary on read
only data structures.

I suspect I haven't fully understood the difficaulty in doing this,
care to enlighten me?
   

A final point someone made about the cost of starting threads... Surely 
the obvious approach
is to start one OS thread per execution unit, and do all the thread 
starting with the very
lightweight haskell threads...

I guess in this case I can see where locking would be necessary...
Keean
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-27 Thread Jan-Willem Maessen - Sun Labs East
Ketil Malde wrote:
Jan-Willem Maessen - Sun Labs East <[EMAIL PROTECTED]> writes:
There are, I believe, a couple of major challenges:
  * It's easy to identify very small pieces of parallel work, but much
harder to identify large, yet finite, pieces of work.  Only the
latter are really worth parallelizing.
By the former, are you thinking of so small grain that it is handled
by out-of-order execution units in the CPU?  And/or the C compiler?
I'm thinking of so small grain that the cost of fine-grained thread 
creation (tens of instructions) is comparable to the cost of running 
the thread itself (hundreds of instructions).

ILP is a bit of a sideshow here.  Usually there's plenty of integer 
ILP hanging about already.

  * If you don't compute speculatively, you'll never find enough work
to do.

Although I'm not familiar with the issues, my point is that the number
of CPUs available, even in common household pee cees, is already more
than one (P4 hyper-threading), and could be something like eight in
the not-so-distant future.  It no longer matters (much) if you waste
`cycles, cycles are cheap.  (The next next IA64, Montecito is 1.7G
transistors, including 24Mb on-chip cache.  The P4 is big, but you
could fit thirty of them in that space.  No way Montecito is going to
have anywhere near 30x the performance)
So speculative execution, even if you end up throwing away 50% of the
work you do, could in theory make your program faster anyway.  This is
a headache for C programs; my hope would be that a functional language
would make it easier.
Be careful.  A P4 will slow down if you get it hot enough.  So 
"throwing away" that bit of extra performance may actually make things 
slower...  And who gets to use the throwaway performance, anyhow?  You 
want it, the OS wants it, other applications on your machine want 
it---you have to be able to adjust to varying amounts of "extra" 
compute power hanging around.  [Actually, speaking from experience, 
half the compute power on my SMP ends up going to the GUI, the OS, or 
my browser---and it's great to have those things not slow down 
uniprocessor compute-bound tasks.]

  * If you compute speculatively, you need some way to *stop* working
on useless, yet infinite computations.

And you need to choose which computations to start working on, I guess.
Predicting the future never was easy :-)
Indeed.
[perhaps getting off-topic, but hey, this is -cafe]
That's why I subscribe to -cafe at all!
-Jan-Willem Maessen
-kzm
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-27 Thread Ketil Malde
Jan-Willem Maessen - Sun Labs East <[EMAIL PROTECTED]> writes:

I missed this bit:

> I'm building compilers for supercomputers at Sun

So, any plans for compilers for functional languages making
use of Niagara? 

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread Ketil Malde
Jan-Willem Maessen - Sun Labs East <[EMAIL PROTECTED]> writes:

> There are, I believe, a couple of major challenges:
>* It's easy to identify very small pieces of parallel work, but much
>  harder to identify large, yet finite, pieces of work.  Only the
>  latter are really worth parallelizing.

By the former, are you thinking of so small grain that it is handled
by out-of-order execution units in the CPU?  And/or the C compiler?

>* If you don't compute speculatively, you'll never find enough work
>  to do.

Although I'm not familiar with the issues, my point is that the number
of CPUs available, even in common household pee cees, is already more
than one (P4 hyper-threading), and could be something like eight in
the not-so-distant future.  It no longer matters (much) if you waste
`cycles, cycles are cheap.  (The next next IA64, Montecito is 1.7G
transistors, including 24Mb on-chip cache.  The P4 is big, but you
could fit thirty of them in that space.  No way Montecito is going to
have anywhere near 30x the performance)

So speculative execution, even if you end up throwing away 50% of the
work you do, could in theory make your program faster anyway.  This is
a headache for C programs; my hope would be that a functional language
would make it easier.

>* If you compute speculatively, you need some way to *stop* working
>  on useless, yet infinite computations.

And you need to choose which computations to start working on, I guess.
Predicting the future never was easy :-)

[perhaps getting off-topic, but hey, this is -cafe]

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread Jan-Willem Maessen - Sun Labs East
Ketil Malde wrote:
I'm sure somebody, somewhere, is working on speculative execution of
Haskell code.  
I think we both graduated (myself from MIT, Robert Ennals from 
Cambridge; I'm building compilers for supercomputers at Sun).  If 
anyone else is working seriously on speculative evaluation of Haskell 
at the moment, please drop me a line, I'd love to hear of your existence.

> Now that Sun is building Niagara with IIRC 8 cores on a
chip, Intel is rumoured to put up to 16 cores on the post-Montecito
Tukwila, Sony/IBM's Cell is said to be some kind of parallel design,
and every self-respecting chip vendor is putting at least two cores on
a chip and/or adding multithreaded cores.
One would expect a lazy and pure language to be excellent for
parallelization, since the programmer is generally removed from the
actual flow of execution anyway.  
As I recall, this was one of the original goals of the Grip project 
(which built the original incarnation of GHC, if I have my history 
straight).

Indeed, if you read any early paper on strictness analysis, you'd 
think that it was all about parallelizing lazy code.  I'm now of the 
opinion that strictness analysis tells us where to *serialize* our 
code, because parallelism just won't be worth it.

There are, I believe, a couple of major challenges:
  * It's easy to identify very small pieces of parallel work, but much
harder to identify large, yet finite, pieces of work.  Only the
latter are really worth parallelizing.
  * If you don't compute speculatively, you'll never find enough work
to do.
  * If you compute speculatively, you need some way to *stop* working
on useless, yet infinite computations.  Rob and I had two rather
different takes on the same basic idea: run with some sort of
resource limits, and stop working when you exhaust them.  Roughly
speaking, Rob then made sure you never speculated that bit of code
again, while I just kept relying on the bounds to kick in.  On
pretty eager code, I did all right, but on code that actually
needed the laziness (this was after a good bit of compiler munging
to wring out most of the unnecessary laziness) I got
killed---Rob's approach fixed the mistake by patching the code,
and did much better thereafter.
My ultimate goal was parallelization.  However, I'm now convinced it 
would take about 2-3 programmer-years more of effort to realize that 
goal.  Parallel garbage collection (which is *not* optional on a 
parallel Haskell implementation, as our experience with pH 
demonstrated) and very fine-grained thunk-update protocols are both 
tricky technical challenges.  And that leaves aside the question of 
whether there's enough coarse-grained work buried among all that 
fine-grained work to make it all worthwhile.  Anyone want to pay me to 
do this? :-)

> At some point (for some n), being
able to spawn n threads will gain you more than a factor c constant
overhead, and Haskell programs, with a run-time system that can
evaluate expressions in paralllel, will outperform single threaded C
code. 
Only if you can keep the granularity of the work large.  This hasn't 
been so easy.

-Jan-Willem Maessen
Eager Haskell Implementor
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread Jon Cast
MR K P SCHUPKE <[EMAIL PROTECTED]> wrote:
> As far as I understand it haskells lazy evaluation resembles dataflow
> computation - IE values are computed on demand.
> 
> This problem has already been solved for hardware, you end up with
> what is termed a super-scalar architecture.
> 
> Instructions read from memory fill a re-order buffer. As computations
> complete value slots in the reorder buffer are filled with the result
> values. When an instruction has all its values it can be executed, and
> will be assigned to the next available execution unit.
> 
> I am not too sure of the internal architecture of GHC, but I can
> imagine that funtions waiting to be run will have slots for arguments,
> that will be filled in when values become available.

No.  A function may not use all of its arguments, so there's no point in
evaluating all arguments---you have to find out which ones are actually
needed.  Terminology note: GHC uses the STG language internally.  In
STG, evaluation is always triggered by a case expression, which
evaluates precisely one expression (not necessarily an argument of
anything, either!).

> The whole point is that a function with three parameters can spark
> three threads

This is actually a bad idea: the point of non-strict languages and lazy
evaluation is to allow lots of values which are (a) unused and (b)
expensive to evaluate.  If you evaluate everything (even in parallel),
there's a lot of wasted work if the programmer is making optimal use of
the non-strictness of the language.

Furthermore, if de-forestation fails anywhere, you've likely got another
problem.  Consider the following program:

> n `choose` m = product $ take m $ [n, n-1..]

Which translates (after some optimisation) to STG equivalent to:

> downFrom n = let n1 = (-) n 1
>  in let xn1 = downFrom n1
>  in let xn2 = n:xn1
>  in xn2
> choose n m = let xn1 = downFrom n
>  in let xn2 = take m xn1
>  in product xn2

Now, (:) cannot `spark' a thread to evaluate xn1 in downFrom, because
that would end up sparking infinitely many threads---which may be
possible, but nothing you've said suggests that it is.

> to calculate each parameter without worrying about the side affects of
> each function. No locking is necessary because you only spark one
> thread for each argument.

This is true for tree reduction, but not for graph reduction.  Consider,
e.g., the fact that the call to product in choose and the recursive call
to downFrom walk down (and hence evaluate) the same list.  Since the
cons cells making up this list are evaluated by two different functions,
some sort of locking is indeed necessary to ensure that you don't end up
with two threads for each of the (infinitely many) cons cells in the
list.

> I suspect I haven't fully understood the difficaulty in doing this,
> care to enlighten me?

Certainly.  See above.

Jon Cast
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread MR K P SCHUPKE
As far as I understand it haskells lazy evaluation resembles
dataflow computation - IE values are computed on demand.

This problem has already been solved for hardware, you end up
with what is termed a super-scalar architecture. 

Instructions read from memory fill a re-order buffer. As computations
complete value slots in the reorder buffer are filled with the result
values. When an instruction has all its values it can be executed, and
will be assigned to the next available execution unit.

I am not too sure of the internal architecture of GHC, but I can imagine
that funtions waiting to be run will have slots for arguments, that will
be filled in when values become available. 

The whole point is that a function with three parameters can spark
three threads to calculate each parameter without worrying about
the side affects of each function. No locking is necessary because
you only spark one thread for each argument.

I suspect I haven't fully understood the difficaulty in doing this,
care to enlighten me?

Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread Ketil Malde
Jon Cast <[EMAIL PROTECTED]> writes:

>> factor c constant overhead,
>   ^^

> What makes you think the overhead is constant?

(Referring to the overhead introduced by boxing and such, not
parallelizing.  Sorry if that wasn't clear) 

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread Jon Cast
Ketil Malde <[EMAIL PROTECTED]> wrote:
> MR K P SCHUPKE <[EMAIL PROTECTED]> writes:



> One would expect a lazy and pure language

Not lazy!  See below.

> to be excellent for parallelization, since the programmer is generally
> removed from the actual flow of execution anyway.  At some point (for
> some n), being able to spawn n threads will gain you more than a
> factor c constant overhead,
  ^^

What makes you think the overhead is constant?  Without Optimistic
Evaluation, Haskell programs mutate /heavily/, and even with Optimistic
Evaluation I'm sure there is some mutation going on.  With a truly
parallel system, every mutation has to be locked in some way; reducing
the number of locking operations to merely constant overhead is an
interesting (as in extremely difficult) research problem.

> and Haskell programs, with a run-time system that can evaluate
> expressions in paralllel, will outperform single threaded C code.
> 
> (But it probably isn't that simple, or we would have it already :-)

Nope.

Jon Cast
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread Ketil Malde
MR K P SCHUPKE <[EMAIL PROTECTED]> writes:

> Well I have to say the dataflow style of lazy programming made me think
> Haskell would be ideal for multi-processor use (and now HyperThreading
> is common most PCs have more than one processor from the code's point
> of view)...

> I was disappointed to find GHC only uses one thread, and therefore will
> only use one CPU.

I'm sure somebody, somewhere, is working on speculative execution of
Haskell code.  Now that Sun is building Niagara with IIRC 8 cores on a
chip, Intel is rumoured to put up to 16 cores on the post-Montecito
Tukwila, Sony/IBM's Cell is said to be some kind of parallel design,
and every self-respecting chip vendor is putting at least two cores on
a chip and/or adding multithreaded cores.

Even if there's a lot of rumor-mongering going on, it seems fairly
clear that you can only go on for so long, adding more cache to your
old designs, and the bottleneck then becomes the inherent parallelism
in your code.

One would expect a lazy and pure language to be excellent for
parallelization, since the programmer is generally removed from the
actual flow of execution anyway.  At some point (for some n), being
able to spawn n threads will gain you more than a factor c constant
overhead, and Haskell programs, with a run-time system that can
evaluate expressions in paralllel, will outperform single threaded C
code. 

(But it probably isn't that simple, or we would have it already :-)

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread MR K P SCHUPKE
>Correct me if I'm wrong but I thought that there was some overhead to
>making FFI (even safe/unsafe)

You are not wrong... so it would only be worth it if you could
execute a whole bunch of C at the same time. But it is possible...

Consider: Use a monad to sequence vector operations on the Vector class.
The monad instead of directly executing vector instructions compiles
a syntax tree, or a list of instructions. This list of instructions
is passed to the C code, which sequences the primitive vector ops
into a block of code, and returns a function pointer via the FFI.

You just call the whole block with the right parameters. 

This is probably getting to be more work than integrating stuff into
GHC though. It is also doing stuff dynamically that could be done
statically at compile time.

If GCC is providing vector abstractions then it would seem a lot
less work for GHC to incorporate 'vector' types directly...

Anyone like to comment more specifically on what would be involved
in plumbing a new datatype into GHC that handles vectors? Assuming
we wanted an interface like:

myFn :: Vector Word8 Four -> Vector Word8 Four.

Also vector is not a good name, as with vectors:

"a * b" would be the cross-product, but with
SIMD we get 4 scalar multiplies...

Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread Duncan Coutts
On Mon, 2004-07-26 at 10:58, MR K P SCHUPKE wrote:
> In the absence of such a standard, the best that can be done is to
> abstract vectorisation by word size and number of words, and supply
> a software implementation of all vector ops to use if the hardware
> does not support certain primitives.

Recent version of gcc provides a semi-portable vector extension. You
define some vector data types and gcc will make use of whatever vector
capability exists on the machine you are compiling for. Otherwise it
uses narrower vector units if possible or just scaler units if that is
all that is available.

http://gcc.gnu.org/onlinedocs/gcc-3.4.1/gcc/Vector-Extensions.html

Correct me if I'm wrong but I thought that there was some overhead to
making FFI (even safe/unsafe) which would probably kill any advantage
from using small vector units. In which case it'd need to be implemented
as ghc primitive ops which is rather more work.

Duncan

___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-26 Thread MR K P SCHUPKE
Well I have to say the dataflow style of lazy programming made me think
Haskell would be ideal for multi-processor use (and now HyperThreading
is common most PCs have more than one processor from the code's point
of view)...

I was disappointed to find GHC only uses one thread, and therefore will
only use one CPU.

Vectorisation is much more limited than hyperthreading - as the instructions
on each vector unit must be the same... The easiest implementation would be
to add primitive vector types that map directly to the underlying SIMD
operations. This is probably a good starting point - it would be nice to
vectorise normal expressions, but that requires an optimisation search to
find instructions that can be grouped together ... this becomes a lot
easier if the primitives are supported as you just need to transform
the code rather than dealing with non-portable assembly instructions.

The problem is that different platforms support different SIMD style
instructions... What would be really needed is a IEEE/ISO standard
for vector instructions much like the current one for floating point.

In the absence of such a standard, the best that can be done is to
abstract vectorisation by word size and number of words, and supply
a software implementation of all vector ops to use if the hardware
does not support certain primitives.

A futher point is that for the compiler to know the vector size, you
would have to resort to type level naturals... something like:

data Nil
data Suc x = Suc x

type Four = Suc (Suc (Suc (Suc Nil)))

myFn :: Vector Four Word16

Of course this would need to be integrated with the compiler. As
an interim measure a C-library providing primitive operations
on vectors could be written and used via the FFI.

Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] optimising for vector units

2004-07-25 Thread Ben Lippmeier
Matthew Roberts wrote:
Does anybody know if any of the Haskell compilers are able to optimise for
vector units (like MMX, SSE(2), 3D_Now! and AltiVec)?
 

No, not as yet. FP systems don't generally provide enough control over 
how data is laid out in memory to be able to invoke SIMD operations on 
it (or control data locality).

I suppose you could add an unboxed Float32x4 type and appropriate 
instances of IOArrays etc to GHC, but if you wanted to do anything with 
it you'd have to use specialised unboxed operations.. and it'd probably 
be more trouble than just writing it in assembler.

I would have thought that if a developer cared enough about the 
performance of their program to turn to non-portable SIMD extensions, 
they'd want to write it in assembler anyway so they had absolute control 
over what was going on..

... though it would be nice to be able to define
a + b :: (Float, Float, Float, Float) -> (Float, Float, Float, Float) -> 
(Float, Float, Float, Float)

and expect it to go via SSE..
Ben.
My investigations have revealed that c requires special programming
constructs to make the most of these processor capabilites (i.e. the
compiler is not able to work it out for itself).
Is there any work on getting haskell compilers to work this kind of thing
out?
Cheers,
Matt
 

___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] optimising for vector units

2004-07-25 Thread Matthew Roberts
Hi,

Does anybody know if any of the Haskell compilers are able to optimise for
vector units (like MMX, SSE(2), 3D_Now! and AltiVec)?

My investigations have revealed that c requires special programming
constructs to make the most of these processor capabilites (i.e. the
compiler is not able to work it out for itself).

Is there any work on getting haskell compilers to work this kind of thing
out?

Cheers,

Matt
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe