[Haskell-cafe] Re: What I learned from my first serious attempt at low-level Haskell programming
OK, so there are a few extra costs for unknown tail calls. First, there might be a pipeline stall if the jump address isn't loaded into a register far enough in advance of the jump. Second, you need to pass information about the number of arguments in the caller. And finally, you need to test the number of arguments in the callee. Under the best of circumstances the argument check costs a load immediate, a compare, and a non-taken branch. But looking at the output from GHC, I doubt this best case holds.

-- Lennart

On Apr 10, 2007, at 23:13, Stefan O'Rear wrote:

> On Tue, Apr 10, 2007 at 07:59:04PM +0100, Lennart Augustsson wrote:
>> So then tail calls should be very cheap when most of the arguments don't change.
>>
>> On Apr 10, 2007, at 10:17, Simon Marlow wrote:
>>> Lennart Augustsson wrote:
>>>> It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)
>>> Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call.
>>> Cheers, Simon
>
> The tail call in question is NOT statically known (it is a variant of CPS), so simpleminded tail recursion optimizations will not help much.
>
> Stefan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
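[Editor's note: the "unknown tail call" being discussed can be sketched like this — a minimal, hypothetical example, not the code from the thread. The continuation is an ordinary closure, so the callee and its arity are only known at run time, which is where the argument check Lennart describes comes in.]

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- CPS-style loop: the call `k acc` below goes through a function
-- value, so it is an "unknown" call whose arity must be checked at
-- the call site, unlike the direct recursive call to sumTo itself.
sumTo :: Int -> Int -> (Int -> r) -> r
sumTo !i !acc k
  | i == 0    = k acc                 -- unknown tail call
  | otherwise = sumTo (i - 1) (acc + i) k

main :: IO ()
main = print (sumTo 10 0 id)          -- 55
```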
[Haskell-cafe] Re: What I learned from my first serious attempt at low-level Haskell programming
Lennart Augustsson wrote:
> So then tail calls should be very cheap when most of the arguments don't change.

Yes, but the problem tends to be the arguments that do change, and the fact that they are passed on the stack. A C loop would keep the loop-carried variables in registers. On x86_64 you get better code because some of the args are passed in registers, but it's not really proper loop optimisation. We know what needs to be done, and we plan to make progress on this soon, hopefully over the summer.

Cheers,
Simon
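[Editor's note: a concrete instance of the kind of loop under discussion — a generic sketch, not code from the thread. Both arguments change on every iteration, so they are the loop-carried variables that a C compiler would keep in registers but that GHC's stack-based convention of the time shuffled through memory.]

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

sumLoop :: Int -> Int
sumLoop n = go n 0
  where
    -- Both i and acc change on each tail call, so neither can simply
    -- stay put in its stack slot across iterations.
    go :: Int -> Int -> Int
    go !i !acc
      | i == 0    = acc
      | otherwise = go (i - 1) (acc + i)

main :: IO ()
main = print (sumLoop 100)   -- 5050
```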
[Haskell-cafe] Re: What I learned from my first serious attempt at low-level Haskell programming
Lennart Augustsson wrote:
> It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)

Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call.

Cheers,
Simon
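[Editor's note: a toy version of the problem this analysis solves — nothing like GHC's actual implementation, and the register names are hypothetical. The parallel moves of a tail call's arguments must be ordered so each value is read before it is overwritten; an acyclic dependency chain can simply be sequenced, while a cycle (one non-trivial SCC) needs a temporary.]

```haskell
module Main where

type Reg = String

-- Sequence parallel moves [(dst, src)] (assuming distinct dsts and no
-- self-moves). A move is safe once no other pending move reads its dst;
-- if no move is safe, only cycles remain, so break one with "tmp".
sequenceMoves :: [(Reg, Reg)] -> [(Reg, Reg)]
sequenceMoves [] = []
sequenceMoves ms =
  case [ m | m@(d, _) <- ms, d `notElem` [ s | (d', s) <- ms, d' /= d ] ] of
    (m : _) -> m : sequenceMoves (filter (/= m) ms)
    []      -> case ms of              -- only cycles remain:
      ((d, s) : rest) -> ("tmp", s)    -- save s, redirect d to read tmp
                         : sequenceMoves ((d, "tmp") : rest)
      []              -> []

main :: IO ()
main = do
  -- A simple shift needs no temporary ...
  print (sequenceMoves [("a", "b"), ("b", "c")])
  -- ... but a swap is one cyclic SCC and does:
  print (sequenceMoves [("a", "b"), ("b", "a")])
```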
[Haskell-cafe] Re: What I learned from my first serious attempt at low-level Haskell programming
So then tail calls should be very cheap when most of the arguments don't change.

On Apr 10, 2007, at 10:17, Simon Marlow wrote:

> Lennart Augustsson wrote:
>> It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)
> Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call.
> Cheers, Simon
[Haskell-cafe] Re: What I learned from my first serious attempt at low-level Haskell programming
On Tue, Apr 10, 2007 at 07:59:04PM +0100, Lennart Augustsson wrote:
> So then tail calls should be very cheap when most of the arguments don't change.
>
> On Apr 10, 2007, at 10:17, Simon Marlow wrote:
>> Lennart Augustsson wrote:
>>> It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)
>> Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call.
>> Cheers, Simon

The tail call in question is NOT statically known (it is a variant of CPS), so simpleminded tail recursion optimizations will not help much.

Stefan
Re: [Haskell-cafe] RE: What I learned from my first serious attempt at low-level Haskell programming
> | 6. The inliner is a bit too greedy. Removing the slow-path code from
> |    singleton doesn't help because popSingleton is only used once; but
> |    if I explicitly {-# NOINLINE popSingleton #-}, the code for
> |    singleton itself becomes much smaller, and inlinable (15% perf
> |    gain). Plus the new singleton doesn't allocate memory, so I can
> |    use even MORE realWorld#s.
>
> That's a hard one! Inlining functions that are called just once is usually a huge win. I don't know how to spot what you did in an automated way.

Yeah. We found this to be an issue with the Mercury compiler. We processed functions (well, okay, predicates, in the case of Mercury) in dependency order. We experimented with top-down and bottom-up order. Bottom-up inlining is great for eliminating all the little access and convenience functions one writes, and top-down gets the case above (at least most of the time). IIRC, our experiments showed that overall, bottom-up inlining performed significantly better than top-down or arbitrary order.

Bottom-up inlining worked really well around the leaves because it frequently replaced a call (requiring register saves, etc.) with structure packing/unpacking, which doesn't require register saves/restores. Thus it eliminated calls altogether. It is also advantageous when it allows producers and consumers to be merged, eliminating memory allocations (as noted above). That said, I had better point out that Mercury is strict, which simplifies things rather.

Andrew Appel's code generator that used dynamic programming to select between different generated code sequences comes to mind as potential inspiration for a super-duper inliner.

cheers,
Tom

--
Dr Thomas Conway          You are beautiful; but learn to work,
[EMAIL PROTECTED]         for you cannot eat your beauty.
                              -- Congo proverb
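[Editor's note: the bottom-up order Tom describes can be illustrated with a toy inliner — nothing like the real Mercury compiler, with a made-up expression type and size threshold. Processing definitions callees-first means each body has its small callees inlined before we judge whether it is itself small enough to inline into later callers.]

```haskell
module Main where

data Expr = Lit Int | Add Expr Expr | Call String
  deriving (Eq, Show)

size :: Expr -> Int
size (Lit _)   = 1
size (Add a b) = 1 + size a + size b
size (Call _)  = 1

-- Replace calls to known-small definitions by their bodies.
inlineWith :: [(String, Expr)] -> Expr -> Expr
inlineWith env (Add a b)  = Add (inlineWith env a) (inlineWith env b)
inlineWith env e@(Call f) = maybe e id (lookup f env)
inlineWith _   e          = e

threshold :: Int
threshold = 5

-- Walk definitions bottom-up (callees first), accumulating the ones
-- that end up small enough to inline into their callers.
bottomUp :: [(String, Expr)] -> [(String, Expr)]
bottomUp = go []
  where
    go _   [] = []
    go env ((f, body) : rest) =
      let body' = inlineWith env body
          env'  = if size body' <= threshold then (f, body') : env else env
      in (f, body') : go env' rest

main :: IO ()
main = mapM_ print (bottomUp
  [ ("one",  Lit 1)
  , ("two",  Add (Call "one") (Call "one"))    -- inlined, stays small
  , ("four", Add (Call "two") (Call "two")) ]) -- grows past the threshold
```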
[Haskell-cafe] Re: What I learned from my first serious attempt at low-level Haskell programming
Stefan O'Rear wrote:
> 2. Parameters are very expensive. Our type of functions that build (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString], where the Int# is the current write pointer. Adding an extra Int# to cache the size of the array (rather than calling sMBA# each time) slowed the code down ~2x. Conversely, moving the write pointer into the byte array (storing it in bytes 0#, 1#, 2#, and 3#) sped the code up by 4x.

If you were measuring on x86 then parameters are passed on the stack, which may be expensive. On x86_64 the first 3 arguments are passed in registers, which is usually a win, but if the function immediately does an eval they need to be saved on the stack anyway. Still, 4x sounds like a lot; perhaps you managed to avoid a stack check in the inner loop or something.

> 3. MBA# is just as fast as Addr#, and garbage collected to boot.

Not really surprising, that.

> 4. You can't keep track of which version of the code is which, what is a regression, and what is an enhancement. Don't even try. Next time I try something like this I will make as much use of darcs as possible.

Absolutely. If you'd used darcs, then we could peer in more detail at changes that you thought gave counter-intuitive results.

Simon Peyton-Jones wrote:
> | 5. State# threads clog the optimizer quite effectively. Replacing
> |    st(n-1)# with realWorld# everywhere I could count on data
> |    dependencies to do the same job doubled performance.
>
> The idea is that the optimiser should allow you to write at a high level, and do the book-keeping for you. When it doesn't, I like to know, and preferably fix it. If you had a moment to boil out a small, reproducible example of this kind of optimisation failure (with as few dependencies as poss), then I'll look to see if the optimiser can be cleverer.

Yes, and *please* add some of this folklore to the performance wiki at http://haskell.org/haskellwiki/Performance, if you have the time.

Cheers,
Simon
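[Editor's note: the trick of moving the write pointer into the buffer itself can be illustrated at the Foreign level — a sketch of the idea only, not Stefan's MBA#-based code; all names here are made up. Keeping the write pointer in bytes 0..3 means the hot loop passes one argument (the buffer) instead of two.]

```haskell
module Main where

import Foreign.Marshal.Alloc (mallocBytes, free)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)
import Data.Word (Word8)
import Data.Int (Int32)

headerSize :: Int
headerSize = 4   -- bytes 0..3 hold the write pointer, as in the post

newBuffer :: Int -> IO (Ptr Word8)
newBuffer n = do
  p <- mallocBytes (headerSize + n)
  pokeByteOff p 0 (fromIntegral headerSize :: Int32)  -- start past header
  return p

-- Append one byte, reading and updating the stored write pointer.
pushByte :: Ptr Word8 -> Word8 -> IO ()
pushByte p w = do
  i <- peekByteOff p 0 :: IO Int32
  pokeByteOff p (fromIntegral i) w
  pokeByteOff p 0 (i + 1)

main :: IO ()
main = do
  p <- newBuffer 16
  mapM_ (pushByte p) [72, 105]           -- write two bytes ("Hi")
  i <- peekByteOff p 0 :: IO Int32
  print i                                -- 6: header (4) + 2 bytes written
  free p
```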
[Haskell-cafe] RE: What I learned from my first serious attempt at low-level Haskell programming
| 5. State# threads clog the optimizer quite effectively. Replacing
|    st(n-1)# with realWorld# everywhere I could count on data
|    dependencies to do the same job doubled performance.

The idea is that the optimiser should allow you to write at a high level, and do the book-keeping for you. When it doesn't, I like to know, and preferably fix it. If you had a moment to boil out a small, reproducible example of this kind of optimisation failure (with as few dependencies as poss), then I'll look to see if the optimiser can be cleverer.

| 6. The inliner is a bit too greedy. Removing the slow-path code from
|    singleton doesn't help because popSingleton is only used once; but
|    if I explicitly {-# NOINLINE popSingleton #-}, the code for
|    singleton itself becomes much smaller, and inlinable (15% perf
|    gain). Plus the new singleton doesn't allocate memory, so I can
|    use even MORE realWorld#s.

That's a hard one! Inlining functions that are called just once is usually a huge win. I don't know how to spot what you did in an automated way.

thanks

Simon
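[Editor's note: a minimal sketch of the NOINLINE trick from point 6. The functions here are hypothetical stand-ins — the real singleton/popSingleton are not shown in the thread. Keeping the rare, bulky path out of line keeps the hot caller small enough for GHC to inline.]

```haskell
module Main where

-- Cold path: kept out of line so the caller stays small.
{-# NOINLINE slowPath #-}
slowPath :: Int -> Int
slowPath x = sum [1 .. x]          -- stand-in for something big and rare

-- Hot path: now tiny, so GHC is happy to inline it at call sites.
{-# INLINE fastOrSlow #-}
fastOrSlow :: Int -> Int
fastOrSlow x
  | x < 10    = x + 1              -- common, cheap case
  | otherwise = slowPath x         -- rare case, one out-of-line call

main :: IO ()
main = print (map fastOrSlow [3, 100])   -- [4,5050]
```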
Re: [Haskell-cafe] Re: What I learned from my first serious attempt at low-level Haskell programming
It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)

-- Lennart

On Apr 5, 2007, at 19:17, Claus Reinke wrote:

>> Stefan O'Rear wrote:
>>> 2. Parameters are very expensive. Our type of functions that build (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString], where the Int# is the current write pointer. Adding an extra Int# to cache the size of the array (rather than calling sMBA# each time) slowed the code down ~2x. Conversely, moving the write pointer into the byte array (storing it in bytes 0#, 1#, 2#, and 3#) sped the code up by 4x.
>>
>> If you were measuring on x86 then parameters are passed on the stack, which may be expensive. On x86_64 the first 3 arguments are passed in registers, which is usually a win, but if the function immediately does an eval they need to be saved on the stack anyway. Still, 4x sounds like a lot; perhaps you managed to avoid a stack check in the inner loop or something.
>
> just out of curiosity: what does GHC do with stack parameters in loops/tail calls? there tends to be a conflict, as the old set of parameters is needed to build the new one for the recursive call/next loop iteration, while one wants to get rid of the old set before doing the call. unless one keeps the parameter frames away from the stack, or can figure out a safe order in which to overwrite the parameters in the frame, that seems to imply some copying, even though that may be cheap for few/small parameters per frame.
>
> many years ago, I saw an abstract machine and RISC processor design, aiming for good fp support, that used two parameter stacks instead of one for just this reason. instead of moving stack frames around on a single stack, parameters were read from one stack and built up on the other, followed by a cheap stack switch before the call. perhaps something like this might be of use here, too?
>
> claus
>
> ___
> Libraries mailing list
> [EMAIL PROTECTED]
> http://www.haskell.org/mailman/listinfo/libraries
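[Editor's note: the two-stack scheme Claus recalls can be modelled in a few lines — a purely illustrative simulation, not any real machine design. Each iteration reads its arguments from one stack and builds the next frame on the other; the "stack switch" is then O(1), with no in-place permutation of the old frame.]

```haskell
module Main where

type Stack = [Int]

-- One iteration of a sum loop: read (i, acc) from the read stack,
-- build the next frame on the build stack, then switch: the freshly
-- built stack becomes the read stack for the next call.
iter :: (Stack, Stack) -> (Stack, Stack)
iter (i : acc : _, build) = ((i - 1) : (acc + i) : build, [])
iter s = s

run :: (Stack, Stack) -> Int
run st@(i : acc : _, _)
  | i == 0    = acc
  | otherwise = run (iter st)
run _ = 0

main :: IO ()
main = print (run ([10, 0], []))   -- sum of 1..10 = 55
```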