OK, so there are few extra costs for unknown tail calls.
First, there might be a pipeline stall if the jump address isn't
loaded into a register far enough in advance before the jump.
Second, you need to pass information about the number of arguments in
the caller.
And finally, you need to
Lennart Augustsson wrote:
So then tail calls should be very cheap when most of the arguments don't
change.
Yes, but the problem tends to be the arguments that change, and the fact that
they are passed on the stack. A C loop would keep the loop-carried variables in
registers. On x86_64 you
Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments on
the stack before a tail call that minimizes that number of moves and
temporary locations. Lmlc did this 20 years ago. :)
Right, and that's what GHC does too, with a strongly-connected-component
So then tail calls should be very cheap when most of the arguments
don't change.
On Apr 10, 2007, at 10:17 , Simon Marlow wrote:
Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments
on the stack before a tail call that minimizes that number of
On Tue, Apr 10, 2007 at 07:59:04PM +0100, Lennart Augustsson wrote:
So then tail calls should be very cheap when most of the arguments
don't change.
On Apr 10, 2007, at 10:17 , Simon Marlow wrote:
Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the
| 6. The inliner is a bit too greedy. Removing the slow-path code from
|singleton doesn't help because popSingleton is only used once; but
|if I explicitly {-# NOINLINE popSingleton #-}, the code for
|singleton itself becomes much smaller, and inlinable (15% perf
|gain). Plus
Stefan O'Rear wrote:
2. Parameters are very expensive. Our type of functions that build
(ignoring CPS for the time being) was MBA# - Int# - [ByteString],
where the Int# is the current write pointer. Adding an extra Int#
to cache the size of the array (rather than calling sMBA# each
| 5. State# threads clog the optimizer quite effectively. Replacing
|st(n-1)# with realWorld# everywhere I could count on data
|dependencies to do the same job doubled performance.
The idea is that the optimiser should allow you to write at a high level, and
do the book keeping for you.
It's not that hard to figure out an order to permute the arguments on
the stack before a tail call that minimizes that number of moves and
temporary locations. Lmlc did this 20 years ago. :)
-- Lennart
On Apr 5, 2007, at 19:17 , Claus Reinke wrote:
Stefan O'Rear wrote:
2.