[Haskell-cafe] Re: What I learned from my first serious attempt low-level Haskell programming

2007-04-11 Thread Lennart Augustsson

OK, so there are a few extra costs for unknown tail calls.
First, there might be a pipeline stall if the jump address isn't  
loaded into a register far enough in advance before the jump.
Second, you need to pass information about the number of arguments in  
the caller.

And finally, you need to test the number of arguments in the callee.

Under the best of circumstances the argument check costs a load  
immediate, a compare, and a non-taken branch.  But looking at the  
output from ghc I doubt this is the case.
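The known/unknown distinction can be illustrated with a small sketch (hypothetical code, not Stefan's actual program): in `sumKnown` the callee and its arity are visible at the call site, so the tail call can be a direct jump, while in `sumVia` the tail call goes through `k`, whose arity is only known at run time, so the generated code goes through the argument-checking generic-apply path discussed above.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Known tail call: go's arity is statically visible, so the
-- recursive call compiles to a plain jump with no arity check.
sumKnown :: Int -> Int
sumKnown n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

-- Unknown tail call: the final call through k must check k's
-- arity at run time (GHC's generic-apply machinery), paying the
-- extra costs Lennart enumerates.
sumVia :: (Int -> Int) -> Int -> Int
sumVia k n = go 0 1
  where
    go !acc i
      | i > n     = k acc          -- unknown tail call
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumKnown 100, sumVia id 100)
```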


-- Lennart

On Apr 10, 2007, at 23:13 , Stefan O'Rear wrote:


On Tue, Apr 10, 2007 at 07:59:04PM +0100, Lennart Augustsson wrote:

So then tail calls should be very cheap when most of the arguments
don't change.

On Apr 10, 2007, at 10:17 , Simon Marlow wrote:


Lennart Augustsson wrote:

It's not that hard to figure out an order to permute the arguments
on the stack before a tail call that minimizes that number of
moves and temporary locations.  Lmlc did this 20 years ago. :)


Right, and that's what GHC does too, with a strongly-connected-
component analysis of the dependencies between assignments of the
args for the tail call.

Cheers,
Simon


The tailcall in question is NOT statically known (it is a variant of
CPS), so simpleminded tail recursion optimizations will not help much.

Stefan


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Re: What I learned from my first serious attempt low-level Haskell programming

2007-04-11 Thread Simon Marlow

Lennart Augustsson wrote:
So then tail calls should be very cheap when most of the arguments don't 
change.


Yes, but the problem tends to be the arguments that change, and the fact that 
they are passed on the stack.  A C loop would keep the loop-carried variables in 
registers.  On x86_64 you get better code because some of the args are passed in 
registers, but it's not really proper loop optimisation.


We know what needs to be done, and we plan to make progress on this soon, 
hopefully over the summer.


Cheers,
Simon


[Haskell-cafe] Re: What I learned from my first serious attempt low-level Haskell programming

2007-04-10 Thread Simon Marlow

Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments on 
the stack before a tail call that minimizes that number of moves and 
temporary locations.  Lmlc did this 20 years ago. :)


Right, and that's what GHC does too, with a strongly-connected-component 
analysis of the dependencies between assignments of the args for the tail call.
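A toy version of such an analysis can be written with `Data.Graph` (an illustrative sketch, not GHC's actual code): each pair `(dst, src)` assigns stack slot `dst` from slot `src` before the tail call, a move must run before any move that overwrites its source, and a cyclic component signals that one temporary location is needed to break the cycle.

```haskell
import Data.Graph (SCC (..), flattenSCC, stronglyConnComp)

-- Order argument moves for a tail call.  Move a must precede
-- move b when b overwrites a's source slot; stronglyConnComp
-- returns the components in dependency order, and a CyclicSCC
-- marks a cycle needing a temporary.
orderMoves :: [(Int, Int)] -> [SCC (Int, Int)]
orderMoves moves = stronglyConnComp
  [ (m, d, [ d' | (d', s') <- moves, s' == d ])  -- edges to moves reading d
  | m@(d, _) <- moves ]

main :: IO ()
main = do
  print (map flattenSCC (orderMoves [(1, 2), (2, 3)]))  -- a safe serial order
  print (map flattenSCC (orderMoves [(1, 2), (2, 1)]))  -- a swap: one cycle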


Cheers,
Simon



[Haskell-cafe] Re: What I learned from my first serious attempt low-level Haskell programming

2007-04-10 Thread Lennart Augustsson
So then tail calls should be very cheap when most of the arguments  
don't change.


On Apr 10, 2007, at 10:17 , Simon Marlow wrote:


Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments  
on the stack before a tail call that minimizes that number of  
moves and temporary locations.  Lmlc did this 20 years ago. :)


Right, and that's what GHC does too, with a strongly-connected- 
component analysis of the dependencies between assignments of the  
args for the tail call.


Cheers,
Simon




[Haskell-cafe] Re: What I learned from my first serious attempt low-level Haskell programming

2007-04-10 Thread Stefan O'Rear
On Tue, Apr 10, 2007 at 07:59:04PM +0100, Lennart Augustsson wrote:
 So then tail calls should be very cheap when most of the arguments  
 don't change.
 
 On Apr 10, 2007, at 10:17 , Simon Marlow wrote:
 
 Lennart Augustsson wrote:
 It's not that hard to figure out an order to permute the arguments  
 on the stack before a tail call that minimizes that number of  
 moves and temporary locations.  Lmlc did this 20 years ago. :)
 
 Right, and that's what GHC does too, with a strongly-connected- 
 component analysis of the dependencies between assignments of the  
 args for the tail call.
 
 Cheers,
  Simon

The tailcall in question is NOT statically known (it is a variant of
CPS), so simpleminded tail recursion optimizations will not help much. 

Stefan


Re: [Haskell-cafe] RE: What I learned from my first serious attempt low-level Haskell programming

2007-04-08 Thread Thomas Conway

| 6. The inliner is a bit too greedy.  Removing the slow-path code from
|singleton doesn't help because popSingleton is only used once; but
|if I explicitly {-# NOINLINE popSingleton #-}, the code for
|singleton itself becomes much smaller, and inlinable (15% perf
|gain).  Plus the new singleton doesn't allocate memory, so I can
|use even MORE realWorld#s.

That's a hard one!  Inlining functions that are called just once is a huge win 
usually. I don't know how to spot what you did in an automated way.


Yeah. We found this to be an issue with the Mercury compiler. We
processed functions (well, okay predicates, in the case of Mercury) in
dependency order. We experimented with top down and bottom up order.
Bottom up inlining is great for eliminating all the little access and
convenience functions one writes, and top down gets the case above (at
least most of the time). IIRC, our experiments showed that overall,
bottom up inlining performed significantly better than top down, or
arbitrary order.

Bottom up inlining worked really well round the leaves because it
frequently replaced a call (requiring register saves, etc) with
structure packing/unpacking which didn't require register
saves/restores. Thus it eliminated calls altogether. It is also
advantageous when it allows producers and consumers to be merged,
eliminating memory allocations (as noted above). That said, I had
better point out that Mercury is strict, which simplifies things
rather.

Andrew Appel's code generator that used dynamic programming to select
between different generated code sequences comes to mind as potential
inspiration for a super-duper inliner.

cheers,
Tom
--
Dr Thomas Conway  You are beautiful; but learn to work,
[EMAIL PROTECTED] for you cannot eat your beauty.
 -- Congo proverb


[Haskell-cafe] Re: What I learned from my first serious attempt low-level Haskell programming

2007-04-05 Thread Simon Marlow

Stefan O'Rear wrote:
 2. Parameters are very expensive.  Our type of functions that build
   (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString],
where the Int# is the current write pointer.  Adding an extra Int#
to cache the size of the array (rather than calling sMBA# each
time) slowed the code down ~2x.  Conversely, moving the write
pointer into the byte array (storing it in bytes 0#, 1#, 2#, and
3#) sped the code by 4x.

If you were measuring on x86 then parameters are passed on the stack, which may 
be expensive.  On x86_64 the first 3 arguments are passed in registers, which is 
usually a win, but if the function immediately does an eval they need to be 
saved on the stack anyway.  Still, 4x sounds like a lot, perhaps you managed to 
avoid a stack check in the inner loop or something.
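The write-pointer trick Stefan describes can be sketched in safe-API form (hypothetical names; the real code used `MutableByteArray#` primops directly): slot 0 of the array holds the write cursor, so the inner loop threads one fewer parameter.

```haskell
import Data.Array.IO (IOUArray, getElems, newArray, readArray, writeArray)

-- Allocate a buffer whose slot 0 stores the next free index.
newBuffer :: Int -> IO (IOUArray Int Int)
newBuffer n = do
  arr <- newArray (0, n) 0
  writeArray arr 0 1        -- data starts at index 1
  return arr

-- Append a value: the cursor lives in the array itself, not in
-- an extra loop parameter.
pushByte :: IOUArray Int Int -> Int -> IO ()
pushByte arr v = do
  i <- readArray arr 0
  writeArray arr i v
  writeArray arr 0 (i + 1)  -- bump cursor in place

main :: IO ()
main = do
  arr <- newBuffer 4
  mapM_ (pushByte arr) [10, 20, 30]
  getElems arr >>= print
```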


 3. MBA# is just as fast as Addr#, and garbage collected to boot.

Not really surprising, that.

 4. You can't keep track of which version of the code is which, what is
a regression, and what is an enhancement.  Don't even try.  Next
time I try something like this I will make as much use of darcs as
possible.

Absolutely - if you'd used darcs, then we could peer in more detail at changes 
that you thought gave counter-intuitive results.


Simon Peyton-Jones wrote:

| 5. State# threads clog the optimizer quite effectively.  Replacing
|st(n-1)# with realWorld# everywhere I could count on data
|dependencies to do the same job doubled performance.

The idea is that the optimiser should allow you to write at a high level, and 
do the book keeping for you.  When it doesn't, I like to know, and preferably 
fix.

If you had a moment to boil out a small, reproducible example of this kind of 
optimisation failure (with as few dependencies as poss), then I'll look to see 
if the optimiser can be cleverer.


Yes, and *please* add some of this folklore to the performance wiki at 
http://haskell.org/haskellwiki/Performance, if you have the time.


Cheers,
Simon


[Haskell-cafe] RE: What I learned from my first serious attempt low-level Haskell programming

2007-04-05 Thread Simon Peyton-Jones
| 5. State# threads clog the optimizer quite effectively.  Replacing
|st(n-1)# with realWorld# everywhere I could count on data
|dependencies to do the same job doubled performance.

The idea is that the optimiser should allow you to write at a high level, and 
do the book keeping for you.  When it doesn't, I like to know, and preferably 
fix.

If you had a moment to boil out a small, reproducible example of this kind of 
optimisation failure (with as few dependencies as poss), then I'll look to see 
if the optimiser can be cleverer.
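For readers following along, a sketch of what a State# thread is, written against GHC's exported `IO` constructor (illustrative only): every action consumes and returns a state token, and the data dependency on that token (`s0`, then `s1`) is what sequences the effects. Replacing tokens with `realWorld#`, as Stefan did, drops those artificial dependencies and relies on the real data flow instead.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.IO (IO (..))

-- Two actions sequenced explicitly by threading the state token
-- through an unboxed-tuple case, as GHC does under IO's newtype.
twice :: IO ()
twice = IO (\s0 ->
  case unIO (putStrLn "first") s0 of
    (# s1, _ #) -> unIO (putStrLn "second") s1)
  where
    unIO (IO f) = f

main :: IO ()
main = twice
```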

|
| 6. The inliner is a bit too greedy.  Removing the slow-path code from
|singleton doesn't help because popSingleton is only used once; but
|if I explicitly {-# NOINLINE popSingleton #-}, the code for
|singleton itself becomes much smaller, and inlinable (15% perf
|gain).  Plus the new singleton doesn't allocate memory, so I can
|use even MORE realWorld#s.

That's a hard one!  Inlining functions that are called just once is a huge win 
usually. I don't know how to spot what you did in an automated way.
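The pattern under discussion looks roughly like this (a toy with hypothetical names, not Stefan's code): fencing the cold path off with NOINLINE keeps the hot wrapper small enough that GHC will inline it at every call site.

```haskell
-- The rarely-taken slow path, kept out of line so it does not
-- bloat the caller's unfolding.
{-# NOINLINE slowPath #-}
slowPath :: Int -> Int
slowPath n = sum [1 .. n]        -- stands in for a big, rare code path

-- The small, common fast path; now cheap enough to inline.
{-# INLINE bump #-}
bump :: Int -> Int
bump n
  | n < 1000  = n + 1
  | otherwise = slowPath n

main :: IO ()
main = print (bump 5, bump 2000)
```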

thanks

Simon


Re: [Haskell-cafe] Re: What I learned from my first serious attempt low-level Haskell programming

2007-04-05 Thread Lennart Augustsson
It's not that hard to figure out an order to permute the arguments on  
the stack before a tail call that minimizes that number of moves and  
temporary locations.  Lmlc did this 20 years ago. :)


-- Lennart

On Apr 5, 2007, at 19:17 , Claus Reinke wrote:


Stefan O'Rear wrote:
 2. Parameters are very expensive.  Our type of functions that build
    (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString],
    where the Int# is the current write pointer.  Adding an extra Int#
    to cache the size of the array (rather than calling sMBA# each
    time) slowed the code down ~2x.  Conversely, moving the write
    pointer into the byte array (storing it in bytes 0#, 1#, 2#, and
    3#) sped the code by 4x.
If you were measuring on x86 then parameters are passed on the  
stack, which may be expensive.  On x86_64 the first 3 arguments  
are passed in registers, which is usually a win, but if the  
function immediately does an eval they need to be saved on the  
stack anyway.  Still, 4x sounds like a lot, perhaps you managed to  
avoid a stack check in the inner loop or something.


just out of curiosity: what does GHC do with stack parameters in
loops/tail calls?

there tends to be a conflict, as the old set of parameters is needed
to build the new one for the recursive call/next loop iteration, while
one wants to get rid of the old set before doing the call. unless one
keeps the parameter frames away from the stack, or can figure out a
safe order in which to overwrite the parameters in the frame, that
seems to imply some copying, even though that may be cheap for
few/small parameters per frame.

many years ago, i saw an abstract machine and RISC processor design
aiming for good fp support that used two parameter stacks instead of
one for just this reason. instead of moving stack frames around on a
single stack, parameters were read from one stack, and built up on the
other, followed by a cheap stack switch before the call. perhaps
something like this might be of use here, too?

claus


___
Libraries mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/libraries

