Re: [Haskell-cafe] optimising for vector units
Ronny Wichers Schreur wrote:

> Jan-Willem Maessen writes (in the Haskell Cafe):
>
> > My ultimate goal was parallelization. However, I'm now convinced it
> > would take about 2-3 programmer-years more of effort to realize that
> > goal. Parallel garbage collection (which is *not* optional on a
> > parallel Haskell implementation, as our experience with pH
> > demonstrated) and very fine-grained thunk-update protocols are both
> > tricky technical challenges. And that leaves aside the question of
> > whether there's enough coarse-grained work buried among all that
> > fine-grained work to make it all worthwhile.
>
> I'm not sure how to interpret this. Will you also have solved the
> granularity problem in these two to three years?

Probably not. One might hope to do well by choosing an appropriate
schedule (favor leaves locally, roots when work is moved between
processors), but there's no evidence that generating the work lazily
would even leave enough work to be done.

There's a difference between a system that would work (and probably work
OK for 2-4 processors) and a system that would scale. The latter would
also require some hard thought by the application programmer as well.

-Jan-Willem Maessen
Re: [Haskell-cafe] optimising for vector units
Jan-Willem Maessen writes (in the Haskell Cafe):

> My ultimate goal was parallelization. However, I'm now convinced it
> would take about 2-3 programmer-years more of effort to realize that
> goal. Parallel garbage collection (which is *not* optional on a
> parallel Haskell implementation, as our experience with pH
> demonstrated) and very fine-grained thunk-update protocols are both
> tricky technical challenges. And that leaves aside the question of
> whether there's enough coarse-grained work buried among all that
> fine-grained work to make it all worthwhile.

I'm not sure how to interpret this. Will you also have solved the
granularity problem in these two to three years?

Cheers, Ronny Wichers Schreur
Re: [Haskell-cafe] optimising for vector units
Erm, when I said no overhead I meant there is no overhead to choosing an
instruction from a different thread compared to choosing an instruction
from the same thread... Obviously the overall scheduling overhead will
increase.

> The real killer, of course, is memory latency. The cache resources
> required to hold pending work can't be used to hold heap data

I would have thought cache thrashing caused by two CPUs accessing the
same address repeatedly would be the real killer... Of course this is
not so much of a problem with hyper-threading, where both 'virtual' CPUs
share the same cache.

> * All your functions must be strict in their arguments,

I don't see that I am making that assumption. Obviously parts of the
program that are in the IO monad are strict, so we start with the demand
for a value required strictly by an IO function. This value will be the
result of a function, so we take the graph representing the function and
carry on as suggested. This is definitely lazy, as parameters not
affecting the result will never be evaluated.

I appreciate there are some unanswered hard questions, but most of these
are to do with efficiency - not whether it is possible... Having an
inefficient implementation now rather than nothing might just be what is
needed to get people looking at the efficiency issues.

Thanks for the references, by the way - although I am quite familiar
with the Monsoon architecture - and the Pentium architecture (which
since the PPro has been internally dataflow... a superscalar CPU is
really a dataflow computer emulating a traditional processor, and
interpreting its instructions).

Keean.
Re: [Haskell-cafe] optimising for vector units
MR K P SCHUPKE wrote:
> > That was me. I think you're underestimating the cost of starting
> > threads even in this very lightweight world.
>
> Maybe... Perhaps Haskell could be made to resemble dataflow
> instructions more... If when a computation completes we insert the
> result directly into the data structure which represents the function,
> we can in fact pick any function for execution with _no_ overhead. (In
> a true dataflow system any instruction from any thread can be executed
> with no overhead)

With architectural support for dataflow execution, this is true (though
hardware support for scheduling necessarily imposes resource
limitations). You might be rather interested in a couple of books about
the Monsoon dataflow machine, and the compiler for the functional
language (Id) used to program it---Id is similar in many ways to
Haskell. The Haskell dialect pH was basically Id with Haskell syntax,
and early versions used the Id back end to generate Monsoon code.

@book{102865,
  author    = {Gregory M. Papadopoulos},
  title     = {Implementation of a general-purpose dataflow multiprocessor},
  year      = {1991},
  isbn      = {0-262-66069-5},
  publisher = {MIT Press},
}

@book{103033,
  author    = {Kenneth R. Traub},
  title     = {Implementation of non-strict functional programming languages},
  year      = {1991},
  isbn      = {0-262-70042-5},
  publisher = {MIT Press},
}

Interestingly, even in a general-purpose dataflow system it became
obvious that pushing data to and from memory was rather inefficient. As
a result, Monsoon had registers to handle loop code, and ran inner loops
sequentially. A discussion of the tradeoffs between sequential and
parallel loops can be found here:

@inproceedings{165203,
  author    = {Boon Seong Ang},
  title     = {Efficient implementation of sequential loops in dataflow computation},
  booktitle = {Proceedings of the conference on Functional programming languages and computer architecture},
  year      = {1993},
  isbn      = {0-89791-595-X},
  pages     = {169--178},
  location  = {Copenhagen, Denmark},
  doi       = {http://doi.acm.org/10.1145/165180.165203},
  publisher = {ACM Press},
}

> The suggestion I guess is to only use instructions that get their
> arguments from main memory.

In the absence of architectural support for dataflow, this is the
simplest option by far.

> This way any instruction can be sequenced with no overhead on any CPU.
> With modern on-die core-speed caches this can be almost as fast as
> registers (with good cache access patterns)
>
> ... Note that I am only suggesting interleaving instructions at the
> function level, so registers can be used within functions... of course
> as things get more and more parallel we may see hardware with no
> registers, just pipelined high speed cache access. (The hardware may
> well use registers to pre-fetch cache values, but that can be made
> transparent to the software)...

Alas, I think you're making a few naive assumptions here:

* All your functions must be strict in their arguments, yet your
  language as a whole is somehow non-strict. How?
* Tracking which computations are partially satisfied has no overhead
  (we'll need to match arguments to computations somehow).
* Choosing a function which is ready to run has no overhead (we'll need
  a queue, a stack, or some similar data structure to manage ready
  instructions).
* The extra instructions do not interfere in any material way with the
  instruction-level parallelism that may have existed in our program
  (doubtful: all those scheduling instructions, operand loads, and
  result stores consume issue bandwidth at the very least).

There's been quite a bit of work on splitting non-strict code up into
strict threads which can be run in much the way I think you're
imagining. The following paper describes some of the basics:

@inproceedings{141568,
  author    = {Kenneth R. Traub and David E. Culler and Klaus E. Schauser},
  title     = {Global analysis for partitioning non-strict programs into sequential threads},
  booktitle = {Proceedings of the 1992 ACM conference on LISP and functional programming},
  year      = {1992},
  isbn      = {0-89791-481-3},
  pages     = {324--334},
  location  = {San Francisco, California, United States},
  doi       = {http://doi.acm.org/10.1145/141471.141568},
  publisher = {ACM Press},
}

The real killer, of course, is memory latency. The cache resources
required to hold pending work can't be used to hold heap data and the
like instead. You're going to increase your cache miss rate as well.
The result? A drop in performance. A multi-threaded processor can help
in masking cache miss latency (while increasing the miss rate of an
individual thread), but it is still limited by the fundamental bandwidth
limitations of the load/store units. You will, on the whole, still be
doing substantially worse than an ordinary program.

Why bother at all? The same reason we bother with laziness (which has
essentially the same set of problems in different clothing). It's easy
to write clean, beautiful programs. And I do still hold out hope...
Re: [Haskell-cafe] optimising for vector units
> That was me. I think you're underestimating the cost of starting
> threads even in this very lightweight world.

Maybe... Perhaps Haskell could be made to resemble dataflow instructions
more... If when a computation completes we insert the result directly
into the data structure which represents the function, we can in fact
pick any function for execution with _no_ overhead. (In a true dataflow
system any instruction from any thread can be executed with no overhead)

The point is that at any time we have N functions ready for execution (a
function is ready for execution when all its arguments are ready)... we
can pick and execute any of these (or all of them, if enough execution
units are free).

The suggestion I guess is to only use instructions that get their
arguments from main memory. This way any instruction can be sequenced
with no overhead on any CPU. With modern on-die core-speed caches this
can be almost as fast as registers (with good cache access patterns)
... Note that I am only suggesting interleaving instructions at the
function level, so registers can be used within functions... of course
as things get more and more parallel we may see hardware with no
registers, just pipelined high speed cache access. (The hardware may
well use registers to pre-fetch cache values, but that can be made
transparent to the software)...

Hardware manufacturers have hit the limit for pure sequential execution
speed, so more parallelism is the only way forward (see Intel's revised
roadmap: they abandoned the Pentium 4 and 5 and have focused on an
updated _low_power_ Pentium 3M, and are planning multi-core versions for
more speed). C and other imperative languages focus too much on the how
and not the what to be easy to use in such multi-CPU environments... A
language which abstracts and hides the parallelism could well take off
in a big way.

Keean.
Re: [Haskell-cafe] optimising for vector units
MR K P SCHUPKE wrote:
> A final point someone made about the cost of starting threads...
> Surely the obvious approach is to start one OS thread per execution
> unit, and do all the thread starting with the very lightweight Haskell
> threads...

That was me. I think you're underestimating the cost of starting threads
even in this very lightweight world. Again, tens of instructions between
startup, shutdown, and (very important) synchronization to make sure
that other threads see the results which were produced. It *can* be done
without locks, but it often can't be done without memory fences of some
sort.

If you actually have to start OS threads, the cost is thousands or tens
of thousands of cycles. Even with very lightweight thread creation, you
probably don't want to do a multiplication in a separate thread---unless
it's, say, an Integer multiply with a minimum of a thousand digits or
so.

-Jan-Willem Maessen
Re: [Haskell-cafe] optimising for vector units
Just wanted to explore a few points you made:

Jon Cast wrote:
> No. A function may not use all of its arguments, so there's no point
> in evaluating all arguments---you have to find out which ones are
> actually needed. Terminology note: GHC uses the STG language
> internally. In STG, evaluation is always triggered by a case
> expression, which evaluates precisely one expression (not necessarily
> an argument of anything, either!)

If a function never uses an argument you don't need to evaluate it...
remember in dataflow we start from the end of the program working
backwards, so we know all future demands on function arguments...

> Now, (:) cannot `spark' a thread to evaluate xn1 in downFrom, because
> that would end up sparking infinitely many threads---which may be
> possible, but nothing you've said suggests that it is.

Again, think of lazy lists in their 'stream' interpretation. We model
them as a FIFO queue and can set high and low water marks in the FIFO.
We can use hysteresis to ensure work gets done in reasonably large
chunks.

I guess my description didn't include all the details of the model I had
in mind... you obviously don't want to unroll all recursive functions...
I was thinking along the lines of expressions like

    let a = c*d + e*f

where you can obviously execute the two multiplies in parallel (the
multiplies being the arguments to the (+) function). Also it is obvious
you don't need a new thread for one of the arguments (a function of one
argument would use the same thread)...

In the general recursive case:

    fna x y z = fna (x+1) (y-1) (z*3)

again we can do the three simple arithmetic operations in parallel, but
have to wait for all of them to finish before the call to (fna) can be
made. So there isn't an infinite expansion of threads: there is one
thread that does the recursion and one of the arithmetic operations, and
two that are forked and reaped each iteration.

> This is true for tree reduction, but not for graph reduction.
> Consider, e.g., the fact that the call to product in choose and the
> recursive call to downFrom walk down (and hence evaluate) the same
> list. Since the cons cells making up this list are evaluated by two
> different functions, some sort of locking is indeed necessary to
> ensure that you don't end up with two threads for each of the
> (infinitely many) cons cells in the list.

The list is read only (this is a declarative language); no locking is
necessary on read-only data structures. I suspect I haven't fully
understood the difficulty in doing this, care to enlighten me?

A final point someone made about the cost of starting threads... Surely
the obvious approach is to start one OS thread per execution unit, and
do all the thread starting with the very lightweight Haskell threads...
I guess in this case I can see where locking would be necessary...

Keean
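[A minimal sketch of the "lazy list as a FIFO with a high water mark"
idea mentioned above, built from ordinary Control.Concurrent primitives
rather than any compiler support. The names streamWithBound and
highWater are invented for this illustration; the producer forces list
elements ahead of the consumer, but only up to a fixed bound.]

    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan (newChan, readChan, writeChan)
    import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
    import Control.Monad (forM_, replicateM)

    -- Run a producer thread that evaluates list elements ahead of the
    -- consumer, never more than 'highWater' elements ahead.  The
    -- returned action hands back the next element on demand.
    streamWithBound :: Int -> [a] -> IO (IO a)
    streamWithBound highWater xs = do
      free <- newQSem highWater
      out  <- newChan
      _ <- forkIO $ forM_ xs $ \x -> do
             waitQSem free            -- block if we are too far ahead
             x `seq` writeChan out x  -- force the element, then hand it over
      return $ do
        v <- readChan out
        signalQSem free               -- consumer frees a slot
        return v

    main :: IO ()
    main = do
      next     <- streamWithBound 8 (map (\k -> k * k) [1 :: Integer ..])
      firstTen <- replicateM 10 next
      print firstTen

This is only a demonstration of the scheduling shape (bounded run-ahead
with hysteresis coming from the semaphore); it says nothing about
whether the per-element synchronization cost is worth it, which is
exactly the point being argued in this thread.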
Re: [Haskell-cafe] optimising for vector units
Ketil Malde wrote:
> Jan-Willem Maessen - Sun Labs East <[EMAIL PROTECTED]> writes:
>
> > There are, I believe, a couple of major challenges:
> > * It's easy to identify very small pieces of parallel work, but much
> >   harder to identify large, yet finite, pieces of work. Only the
> >   latter are really worth parallelizing.
>
> By the former, are you thinking of so small grain that it is handled
> by out-of-order execution units in the CPU? And/or the C compiler?

I'm thinking of so small a grain that the cost of fine-grained thread
creation (tens of instructions) is comparable to the cost of running the
thread itself (hundreds of instructions). ILP is a bit of a sideshow
here. Usually there's plenty of integer ILP hanging about already.

> > * If you don't compute speculatively, you'll never find enough work
> >   to do.
>
> Although I'm not familiar with the issues, my point is that the number
> of CPUs available, even in common household pee cees, is already more
> than one (P4 hyper-threading), and could be something like eight in
> the not-so-distant future. It no longer matters (much) if you waste
> cycles, cycles are cheap. (The next next IA64, Montecito, is 1.7G
> transistors, including 24Mb on-chip cache. The P4 is big, but you
> could fit thirty of them in that space. No way Montecito is going to
> have anywhere near 30x the performance.) So speculative execution,
> even if you end up throwing away 50% of the work you do, could in
> theory make your program faster anyway. This is a headache for C
> programs; my hope would be that a functional language would make it
> easier.

Be careful. A P4 will slow down if you get it hot enough. So "throwing
away" that bit of extra performance may actually make things slower...
And who gets to use the throwaway performance, anyhow? You want it, the
OS wants it, other applications on your machine want it---you have to be
able to adjust to varying amounts of "extra" compute power hanging
around. [Actually, speaking from experience, half the compute power on
my SMP ends up going to the GUI, the OS, or my browser---and it's great
to have those things not slow down uniprocessor compute-bound tasks.]

> > * If you compute speculatively, you need some way to *stop* working
> >   on useless, yet infinite computations.
>
> And you need to choose which computations to start working on, I
> guess. Predicting the future never was easy :-)

Indeed.

> [perhaps getting off-topic, but hey, this is -cafe]

That's why I subscribe to -cafe at all!

-Jan-Willem Maessen
Re: [Haskell-cafe] optimising for vector units
Jan-Willem Maessen - Sun Labs East <[EMAIL PROTECTED]> writes:

I missed this bit:

> I'm building compilers for supercomputers at Sun

So, any plans for compilers for functional languages making use of
Niagara?

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] optimising for vector units
Jan-Willem Maessen - Sun Labs East <[EMAIL PROTECTED]> writes:

> There are, I believe, a couple of major challenges:
> * It's easy to identify very small pieces of parallel work, but much
>   harder to identify large, yet finite, pieces of work. Only the
>   latter are really worth parallelizing.

By the former, are you thinking of so small grain that it is handled by
out-of-order execution units in the CPU? And/or the C compiler?

> * If you don't compute speculatively, you'll never find enough work
>   to do.

Although I'm not familiar with the issues, my point is that the number
of CPUs available, even in common household pee cees, is already more
than one (P4 hyper-threading), and could be something like eight in the
not-so-distant future. It no longer matters (much) if you waste cycles;
cycles are cheap. (The next next IA64, Montecito, is 1.7G transistors,
including 24Mb of on-chip cache. The P4 is big, but you could fit thirty
of them in that space. No way Montecito is going to have anywhere near
30x the performance.) So speculative execution, even if you end up
throwing away 50% of the work you do, could in theory make your program
faster anyway. This is a headache for C programs; my hope would be that
a functional language would make it easier.

> * If you compute speculatively, you need some way to *stop* working
>   on useless, yet infinite computations.

And you need to choose which computations to start working on, I guess.
Predicting the future never was easy :-)

[perhaps getting off-topic, but hey, this is -cafe]

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] optimising for vector units
Ketil Malde wrote:
> I'm sure somebody, somewhere, is working on speculative execution of
> Haskell code.

I think we both graduated (myself from MIT, Robert Ennals from
Cambridge; I'm building compilers for supercomputers at Sun). If anyone
else is working seriously on speculative evaluation of Haskell at the
moment, please drop me a line, I'd love to hear of your existence.

> Now that Sun is building Niagara with IIRC 8 cores on a chip, Intel is
> rumoured to put up to 16 cores on the post-Montecito Tukwila,
> Sony/IBM's Cell is said to be some kind of parallel design, and every
> self-respecting chip vendor is putting at least two cores on a chip
> and/or adding multithreaded cores.
>
> One would expect a lazy and pure language to be excellent for
> parallelization, since the programmer is generally removed from the
> actual flow of execution anyway.

As I recall, this was one of the original goals of the Grip project
(which built the original incarnation of GHC, if I have my history
straight). Indeed, if you read any early paper on strictness analysis,
you'd think that it was all about parallelizing lazy code. I'm now of
the opinion that strictness analysis tells us where to *serialize* our
code, because parallelism just won't be worth it.

There are, I believe, a couple of major challenges:

* It's easy to identify very small pieces of parallel work, but much
  harder to identify large, yet finite, pieces of work. Only the latter
  are really worth parallelizing.
* If you don't compute speculatively, you'll never find enough work to
  do.
* If you compute speculatively, you need some way to *stop* working on
  useless, yet infinite computations.

Rob and I had two rather different takes on the same basic idea: run
with some sort of resource limits, and stop working when you exhaust
them. Roughly speaking, Rob then made sure you never speculated that bit
of code again, while I just kept relying on the bounds to kick in. On
pretty eager code, I did all right, but on code that actually needed the
laziness (this was after a good bit of compiler munging to wring out
most of the unnecessary laziness) I got killed---Rob's approach fixed
the mistake by patching the code, and did much better thereafter.

My ultimate goal was parallelization. However, I'm now convinced it
would take about 2-3 programmer-years more of effort to realize that
goal. Parallel garbage collection (which is *not* optional on a parallel
Haskell implementation, as our experience with pH demonstrated) and very
fine-grained thunk-update protocols are both tricky technical
challenges. And that leaves aside the question of whether there's enough
coarse-grained work buried among all that fine-grained work to make it
all worthwhile.

Anyone want to pay me to do this? :-)

> At some point (for some n), being able to spawn n threads will gain
> you more than a factor c constant overhead, and Haskell programs, with
> a run-time system that can evaluate expressions in parallel, will
> outperform single threaded C code.

Only if you can keep the granularity of the work large. This hasn't been
so easy.

-Jan-Willem Maessen
Eager Haskell Implementor
Re: [Haskell-cafe] optimising for vector units
MR K P SCHUPKE <[EMAIL PROTECTED]> wrote:
> As far as I understand it Haskell's lazy evaluation resembles dataflow
> computation - i.e. values are computed on demand.
>
> This problem has already been solved for hardware, you end up with
> what is termed a super-scalar architecture.
>
> Instructions read from memory fill a re-order buffer. As computations
> complete, value slots in the re-order buffer are filled with the
> result values. When an instruction has all its values it can be
> executed, and will be assigned to the next available execution unit.
>
> I am not too sure of the internal architecture of GHC, but I can
> imagine that functions waiting to be run will have slots for
> arguments, that will be filled in when values become available.

No. A function may not use all of its arguments, so there's no point in
evaluating all arguments---you have to find out which ones are actually
needed.

Terminology note: GHC uses the STG language internally. In STG,
evaluation is always triggered by a case expression, which evaluates
precisely one expression (not necessarily an argument of anything,
either!).

> The whole point is that a function with three parameters can spark
> three threads

This is actually a bad idea: the point of non-strict languages and lazy
evaluation is to allow lots of values which are (a) unused and (b)
expensive to evaluate. If you evaluate everything (even in parallel),
there's a lot of wasted work if the programmer is making optimal use of
the non-strictness of the language.

Furthermore, if de-forestation fails anywhere, you've likely got another
problem. Consider the following program:

> n `choose` m = product $ take m $ [n, n-1..]

Which translates (after some optimisation) to STG equivalent to:

> downFrom n = let n1 = (-) n 1
>              in let xn1 = downFrom n1
>                 in let xn2 = n:xn1
>                    in xn2
>
> choose n m = let xn1 = downFrom n
>              in let xn2 = take m xn1
>                 in product xn2

Now, (:) cannot `spark' a thread to evaluate xn1 in downFrom, because
that would end up sparking infinitely many threads---which may be
possible, but nothing you've said suggests that it is.

> to calculate each parameter without worrying about the side effects of
> each function. No locking is necessary because you only spark one
> thread for each argument.

This is true for tree reduction, but not for graph reduction. Consider,
e.g., the fact that the call to product in choose and the recursive call
to downFrom walk down (and hence evaluate) the same list. Since the cons
cells making up this list are evaluated by two different functions, some
sort of locking is indeed necessary to ensure that you don't end up with
two threads for each of the (infinitely many) cons cells in the list.

> I suspect I haven't fully understood the difficulty in doing this,
> care to enlighten me?

Certainly. See above.

Jon Cast
Re: [Haskell-cafe] optimising for vector units
As far as I understand it, Haskell's lazy evaluation resembles dataflow
computation - i.e. values are computed on demand.

This problem has already been solved for hardware: you end up with what
is termed a super-scalar architecture.

Instructions read from memory fill a re-order buffer. As computations
complete, value slots in the re-order buffer are filled with the result
values. When an instruction has all its values it can be executed, and
will be assigned to the next available execution unit.

I am not too sure of the internal architecture of GHC, but I can imagine
that functions waiting to be run will have slots for arguments, that
will be filled in when values become available. The whole point is that
a function with three parameters can spark three threads to calculate
each parameter without worrying about the side effects of each function.
No locking is necessary because you only spark one thread for each
argument.

I suspect I haven't fully understood the difficulty in doing this, care
to enlighten me?

Keean.
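[For concreteness, here is roughly what "spark one thread per argument"
looks like with GHC's existing par/pseq combinators from
Control.Parallel (the parallel package; compile with -threaded and run
with +RTS -N). The argument computations argX/argY/argZ and combine are
invented for the example; sparks are advisory hints, not guaranteed
threads, which is one way GHC sidesteps the costs discussed later in
the thread.]

    import Control.Parallel (par, pseq)

    -- Hypothetical expensive argument computations.
    argX, argY, argZ :: Integer
    argX = sum     [1 .. 2000000]
    argY = product [1 .. 20000]
    argZ = sum     [k * k | k <- [1 .. 2000000]]

    combine :: Integer -> Integer -> Integer -> Integer
    combine x y z = x + y + z

    -- Spark possible parallel evaluation of argX and argY, force argZ
    -- in the current thread, then make the call once all are demanded.
    result :: Integer
    result = argX `par` argY `par` argZ `pseq` combine argX argY argZ

    main :: IO ()
    main = print result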
Re: [Haskell-cafe] optimising for vector units
Jon Cast <[EMAIL PROTECTED]> writes:

> > factor c constant overhead,
>            ^^
> What makes you think the overhead is constant?

(Referring to the overhead introduced by boxing and such, not
parallelizing. Sorry if that wasn't clear)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] optimising for vector units
Ketil Malde <[EMAIL PROTECTED]> wrote:
> MR K P SCHUPKE <[EMAIL PROTECTED]> writes:
>
> One would expect a lazy and pure language

Not lazy! See below.

> to be excellent for parallelization, since the programmer is generally
> removed from the actual flow of execution anyway. At some point (for
> some n), being able to spawn n threads will gain you more than a
> factor c constant overhead,
           ^^
What makes you think the overhead is constant? Without Optimistic
Evaluation, Haskell programs mutate /heavily/, and even with Optimistic
Evaluation I'm sure there is some mutation going on. With a truly
parallel system, every mutation has to be locked in some way; reducing
the number of locking operations to merely constant overhead is an
interesting (as in extremely difficult) research problem.

> and Haskell programs, with a run-time system that can evaluate
> expressions in parallel, will outperform single threaded C code.
>
> (But it probably isn't that simple, or we would have it already :-)

Nope.

Jon Cast
Re: [Haskell-cafe] optimising for vector units
MR K P SCHUPKE <[EMAIL PROTECTED]> writes:

> Well I have to say the dataflow style of lazy programming made me
> think Haskell would be ideal for multi-processor use (and now
> HyperThreading is common most PCs have more than one processor from
> the code's point of view)...
>
> I was disappointed to find GHC only uses one thread, and therefore
> will only use one CPU.

I'm sure somebody, somewhere, is working on speculative execution of
Haskell code.

Now that Sun is building Niagara with IIRC 8 cores on a chip, Intel is
rumoured to put up to 16 cores on the post-Montecito Tukwila, Sony/IBM's
Cell is said to be some kind of parallel design, and every
self-respecting chip vendor is putting at least two cores on a chip
and/or adding multithreaded cores. Even if there's a lot of
rumor-mongering going on, it seems fairly clear that you can only go on
for so long adding more cache to your old designs, and the bottleneck
then becomes the inherent parallelism in your code.

One would expect a lazy and pure language to be excellent for
parallelization, since the programmer is generally removed from the
actual flow of execution anyway. At some point (for some n), being able
to spawn n threads will gain you more than a factor c constant overhead,
and Haskell programs, with a run-time system that can evaluate
expressions in parallel, will outperform single threaded C code.

(But it probably isn't that simple, or we would have it already :-)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] optimising for vector units
> Correct me if I'm wrong but I thought that there was some overhead to
> making FFI (even safe/unsafe)

You are not wrong... so it would only be worth it if you could execute a
whole bunch of C at the same time. But it is possible... Consider:

Use a monad to sequence vector operations on the Vector class. The
monad, instead of directly executing vector instructions, compiles a
syntax tree, or a list of instructions. This list of instructions is
passed to the C code, which sequences the primitive vector ops into a
block of code, and returns a function pointer via the FFI. You just call
the whole block with the right parameters.

This is probably getting to be more work than integrating stuff into
GHC, though. It is also doing stuff dynamically that could be done
statically at compile time. If GCC is providing vector abstractions then
it would seem a lot less work for GHC to incorporate 'vector' types
directly...

Anyone like to comment more specifically on what would be involved in
plumbing a new datatype into GHC that handles vectors? Assuming we
wanted an interface like:

    myFn :: Vector Word8 Four -> Vector Word8 Four

Also, vector is not a good name, as with vectors "a * b" would be the
cross-product, but with SIMD we get 4 scalar multiplies...

Keean.
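[A minimal sketch of the "monad that compiles an instruction list"
approach described above. Everything here (VOp, VecM, Reg, compile, the
tiny instruction set) is invented for illustration; a real
implementation would hand the resulting [VOp] across the FFI to C code
that emits SIMD instructions and returns a callable block, which this
sketch does not attempt.]

    -- A deep embedding: the monad records primitive vector instructions
    -- instead of executing them.
    newtype Reg = Reg Int deriving (Show, Eq)

    data VOp
      = LoadConst Reg [Float]   -- load an immediate 4-wide vector
      | Add Reg Reg Reg         -- dst := a + b, element-wise
      | Mul Reg Reg Reg         -- dst := a * b, element-wise
      deriving Show

    -- A small state/writer monad: fresh register names plus emitted code.
    newtype VecM a = VecM { runVecM :: Int -> (a, Int, [VOp]) }

    instance Functor VecM where
      fmap f (VecM g) = VecM $ \n -> let (a, n', ops) = g n in (f a, n', ops)

    instance Applicative VecM where
      pure x = VecM $ \n -> (x, n, [])
      VecM mf <*> VecM mx = VecM $ \n ->
        let (f, n1, ops1) = mf n
            (x, n2, ops2) = mx n1
        in (f x, n2, ops1 ++ ops2)

    instance Monad VecM where
      VecM mx >>= k = VecM $ \n ->
        let (x, n1, ops1) = mx n
            (y, n2, ops2) = runVecM (k x) n1
        in (y, n2, ops1 ++ ops2)

    emit :: VOp -> VecM ()
    emit op = VecM $ \n -> ((), n, [op])

    fresh :: VecM Reg
    fresh = VecM $ \n -> (Reg n, n + 1, [])

    constant :: [Float] -> VecM Reg
    constant xs = do r <- fresh; emit (LoadConst r xs); return r

    vadd, vmul :: Reg -> Reg -> VecM Reg
    vadd a b = do r <- fresh; emit (Add r a b); return r
    vmul a b = do r <- fresh; emit (Mul r a b); return r

    -- "Compile" just returns the instruction list; the C side would turn
    -- it into one block of SIMD code, amortising the FFI call overhead.
    compile :: VecM a -> [VOp]
    compile m = let (_, _, ops) = runVecM m 0 in ops

    example :: [VOp]
    example = compile $ do
      a <- constant [1, 2, 3, 4]
      b <- constant [5, 6, 7, 8]
      c <- vmul a b
      vadd c a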
Re: [Haskell-cafe] optimising for vector units
On Mon, 2004-07-26 at 10:58, MR K P SCHUPKE wrote:
> In the absence of such a standard, the best that can be done is to
> abstract vectorisation by word size and number of words, and supply a
> software implementation of all vector ops to use if the hardware does
> not support certain primitives.

Recent versions of gcc provide a semi-portable vector extension. You
define some vector data types and gcc will make use of whatever vector
capability exists on the machine you are compiling for. Otherwise it
uses narrower vector units if possible, or just scalar units if that is
all that is available.

http://gcc.gnu.org/onlinedocs/gcc-3.4.1/gcc/Vector-Extensions.html

Correct me if I'm wrong, but I thought that there was some overhead to
making FFI calls (even safe/unsafe) which would probably kill any
advantage from using small vector units. In which case it'd need to be
implemented as GHC primitive ops, which is rather more work.

Duncan
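[A sketch of what the FFI route would look like, assuming a hand-written
C routine `void vec4_add(float *dst, const float *a, const float *b)`
built with gcc's vector extension; that C function and this wrapper are
both hypothetical, and the marshalling shown is exactly the per-call
overhead Duncan is worried about.]

    {-# LANGUAGE ForeignFunctionInterface #-}
    module Vec4FFI (vec4Add) where

    import Foreign.Marshal.Array (allocaArray, peekArray, withArray)
    import Foreign.Ptr (Ptr)

    -- Assumed C routine compiled with gcc's vector extension.
    foreign import ccall unsafe "vec4_add"
      c_vec4_add :: Ptr Float -> Ptr Float -> Ptr Float -> IO ()

    -- Add two 4-element vectors by marshalling through temporary arrays.
    -- For four floats, the marshalling likely costs more than the SIMD
    -- addition saves; batching many operations per call is what would
    -- make this worthwhile.
    vec4Add :: [Float] -> [Float] -> IO [Float]
    vec4Add a b
      | length a == 4 && length b == 4 =
          withArray a $ \pa ->
          withArray b $ \pb ->
          allocaArray 4 $ \pd -> do
            c_vec4_add pd pa pb
            peekArray 4 pd
      | otherwise = ioError (userError "vec4Add: expected 4-element vectors")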
Re: [Haskell-cafe] optimising for vector units
Well I have to say the dataflow style of lazy programming made me think
Haskell would be ideal for multi-processor use (and now HyperThreading
is common, most PCs have more than one processor from the code's point
of view)...

I was disappointed to find GHC only uses one thread, and therefore will
only use one CPU.

Vectorisation is much more limited than hyper-threading, as the
instructions on each vector unit must be the same... The easiest
implementation would be to add primitive vector types that map directly
to the underlying SIMD operations. This is probably a good starting
point - it would be nice to vectorise normal expressions, but that
requires an optimisation search to find instructions that can be grouped
together ... this becomes a lot easier if the primitives are supported,
as you just need to transform the code rather than dealing with
non-portable assembly instructions.

The problem is that different platforms support different SIMD-style
instructions... What would really be needed is an IEEE/ISO standard for
vector instructions, much like the current one for floating point. In
the absence of such a standard, the best that can be done is to abstract
vectorisation by word size and number of words, and supply a software
implementation of all vector ops to use if the hardware does not support
certain primitives.

A further point is that for the compiler to know the vector size, you
would have to resort to type-level naturals... something like:

    data Nil
    data Suc x = Suc x
    type Four = Suc (Suc (Suc (Suc Nil)))

    myFn :: Vector Four Word16

Of course this would need to be integrated with the compiler. As an
interim measure a C library providing primitive operations on vectors
could be written and used via the FFI.

Keean.
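[A sketch of the type-level-size idea above, worked out without any
compiler support. The class SizeNat, the proxy type P, and a Vector
represented as a wrapped list are all invented for illustration; a real
SIMD-backed type would use packed, unboxed storage and live in the
compiler/runtime.]

    {-# LANGUAGE ScopedTypeVariables #-}

    import Data.Word (Word16)

    data Nil
    data Suc n

    type Four = Suc (Suc (Suc (Suc Nil)))

    -- A phantom proxy, used only to pass a type-level size around.
    data P n = P

    -- Reflect a type-level natural back to a run-time Int.
    class SizeNat n where
      sizeNat :: P n -> Int

    instance SizeNat Nil where
      sizeNat _ = 0

    instance SizeNat n => SizeNat (Suc n) where
      sizeNat _ = 1 + sizeNat (P :: P n)

    -- The vector: its length lives only in the type.
    newtype Vector n a = Vector [a]
      deriving Show

    -- Build a vector, checking the runtime length against the type-level one.
    fromList :: forall n a. SizeNat n => [a] -> Maybe (Vector n a)
    fromList xs
      | length xs == sizeNat (P :: P n) = Just (Vector xs)
      | otherwise                       = Nothing

    -- Element-wise combination: the types force both arguments (and the
    -- result) to have the same length, so no runtime check is needed.
    vzip :: (a -> a -> a) -> Vector n a -> Vector n a -> Vector n a
    vzip f (Vector xs) (Vector ys) = Vector (zipWith f xs ys)

    example :: Maybe (Vector Four Word16)
    example = do
      a <- fromList [1, 2, 3, 4]
      b <- fromList [10, 20, 30, 40]
      return (vzip (+) a b)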
Re: [Haskell-cafe] optimising for vector units
Matthew Roberts wrote:
> Does anybody know if any of the Haskell compilers are able to optimise
> for vector units (like MMX, SSE(2), 3D_Now! and AltiVec)?

No, not as yet. FP systems don't generally provide enough control over
how data is laid out in memory to be able to invoke SIMD operations on
it (or control data locality). I suppose you could add an unboxed
Float32x4 type and appropriate instances of IOArrays etc. to GHC, but if
you wanted to do anything with it you'd have to use specialised unboxed
operations... and it'd probably be more trouble than just writing it in
assembler.

I would have thought that if a developer cared enough about the
performance of their program to turn to non-portable SIMD extensions,
they'd want to write it in assembler anyway, so they had absolute
control over what was going on...

... though it would be nice to be able to define

    a + b :: (Float, Float, Float, Float)
          -> (Float, Float, Float, Float)
          -> (Float, Float, Float, Float)

and expect it to go via SSE..

Ben.

> My investigations have revealed that C requires special programming
> constructs to make the most of these processor capabilities (i.e. the
> compiler is not able to work it out for itself). Is there any work on
> getting Haskell compilers to work this kind of thing out?
>
> Cheers, Matt
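[The four-wide Float type Ben sketches above can at least be given the
interface one would want, even though nothing here generates SSE code:
each operation below is just four ordinary scalar operations, and the
strict fields only make the representation SIMD-friendly in principle.
The Float4 type and its Num instance are invented for this sketch.]

    data Float4 = Float4 !Float !Float !Float !Float
      deriving (Eq, Show)

    instance Num Float4 where
      Float4 a b c d + Float4 e f g h = Float4 (a + e) (b + f) (c + g) (d + h)
      Float4 a b c d - Float4 e f g h = Float4 (a - e) (b - f) (c - g) (d - h)
      Float4 a b c d * Float4 e f g h = Float4 (a * e) (b * f) (c * g) (d * h)
      negate (Float4 a b c d) = Float4 (negate a) (negate b) (negate c) (negate d)
      abs    (Float4 a b c d) = Float4 (abs a) (abs b) (abs c) (abs d)
      signum (Float4 a b c d) = Float4 (signum a) (signum b) (signum c) (signum d)
      fromInteger n = let x = fromInteger n in Float4 x x x x

    example :: Float4
    example = Float4 1 2 3 4 + Float4 10 20 30 40   -- Float4 11 22 33 44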
[Haskell-cafe] optimising for vector units
Hi,

Does anybody know if any of the Haskell compilers are able to optimise
for vector units (like MMX, SSE(2), 3D_Now! and AltiVec)?

My investigations have revealed that C requires special programming
constructs to make the most of these processor capabilities (i.e. the
compiler is not able to work it out for itself). Is there any work on
getting Haskell compilers to work this kind of thing out?

Cheers, Matt