[Haskell-cafe] Fusion for fun and profi (Was: newbie optimization question)

2007-10-28 Thread Don Stewart
rendel:
 Prabhakar Ragde wrote:
 divisors i = [j | j-[1..i-1], i `mod` j == 0]
 main = print [i | i-[1..1], i == sum (divisors i)]
 
 Jerzy Karczmarczuk wrote:
 My point didn't concern that point. Haskell compiler cannot change an
 algorithm using lists into something which deals with indexable arrays,
 usually faster. Indexing may be faster than the indirection, and the
 allocation of memory costs. And there is laziness...
 
 This may be true, but it isn't relevant in this case, since the obvious 
 c program doesn't need any arrays, only two loops:
 
 for (int i = 1; i = 1; i++) {
   int sum = 0;
   for (int j = 1; j  i; j++)
 if (i % j == 0)
   sum += i;
   if (sum == i)
 print(i);
 }
 
 Loops can be expressed with lazy lists in Haskell. Therefore, the 
 presented Haskell program is perfectly equivalent to the obvious c 
 program.

So what we would hope is that GHC could transform a set of composed lazy list 
functions
into a doubly nested strict loop in Int#... 

Let's see if we can get that result from GHC, using a bit of fusion.

First, to explain what is happening, let's first replace the `mod` call with
`rem`, which is faster, and then desugar the list comprehension and
enumerations syntax, to expose the underlying program:

default(Int)

divisors i   = filter (\j - i `rem`j == 0) (enumFromTo 1 (i-1))
main = print $ filter (\i - i == sum (divisors i)) (enumFromTo 1 1)

Looking at this we see some good chances for fusion to take place: the
enumFromTo should fuse twice with 'filter', using build/foldr fusion.

And with stream fusion, the left fold 'sum' should also fuse with pipeline that
results from divisors. So my prediction would be that this program would run
slightly faster with stream fusion. Let's see...

Compiling with -O2 and ghc 6.8 snapshot, with build/foldr fusion, we see
two fusion sites, as expected, and a spec-constr of 'sum:

$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats
RuleFired
1 SPEC Data.List.sum
2 fold/build

Good, running this:

$ time ./A-stream
[6,28,496,8128]
./A-stream  1.29s user 0.02s system 99% cpu 1.319 total

Now we can try with stream fusion, using the stream-fusible list library
here:
http://www.cse.unsw.edu.au/~dons/code/streams/list/

To use these list functions in preference to the default, we have to
import:

import Prelude hiding (filter,sum,enumFromTo)
import Data.List.Stream

and since the base library doesn't include stream fusible enumFromTo,
we'll have to write our own definition in terms of stream:

import Data.Stream (enumFromToInt,unstream)
enumFromTo i j = unstream (enumFromToInt i j)

Ok, that's easy. Compiling this, we hope to see 3 fusion sites, and all
heap-allocated Ints removed:

$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats -package list
RuleFired
2 filter - fusible
1 sumInt - fusible
1 sum spec Int
3 STREAM stream/unstream fusion

Terrific! The 'sum' was specialised to Int, then translated to a stream
version, the two filters also were translated, then 3 fusion sites were found
and fused.  Our program should now run faster:

$ time ./A-stream
[6,28,496,8128]
./A-stream  1.23s user 0.01s system 99% cpu 1.251 total

And so it does, with no list allocated for the sum loop. In fact the entire 
program reduces
to a strict unboxed nested loop:

unfold = Int# - [Int]
wloop_sum_sV5 :: Int# - Int# - Int#

So almost identical types to the C program (bar for the return [Int]). 

Finally, we can manually translate the C code into a confusing set of nested
loops with interleaved IO,

main = loop 1 
  where
loop !i | i  1 = return ()
| otherwise = if i == go i 0 1 then print i  loop (i+1) 
   else loop (i+1)

go !i !s !j | j = i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
else go i s (j+1)
 | otherwise = s

And we get *no speed benefit* at all!

time ./A-loop  1.24s user 0.00s system 98% cpu 1.256 total

So the lesson is: write in a high level style, and the compiler can do the work
for you. Or, GHC is pretty smart on high level code.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusion for fun and profi (Was: newbie optimization question)

2007-10-28 Thread Stefan O'Rear
On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
 Finally, we can manually translate the C code into a confusing set of nested
 loops with interleaved IO,
 
 main = loop 1 
   where
 loop !i | i  1 = return ()
 | otherwise = if i == go i 0 1 then print i  loop (i+1) 
else loop (i+1)
 
 go !i !s !j | j = i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
 else go i s (j+1)
  | otherwise = s
 
 And we get *no speed benefit* at all!
 
 time ./A-loop  1.24s user 0.00s system 98% cpu 1.256 total
 
 So the lesson is: write in a high level style, and the compiler can do the 
 work
 for you. Or, GHC is pretty smart on high level code.

IO blocks unboxing in GHC.  How fast is your mock-C code refactored to
do IO outside of the loops only?

Stefan


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusion for fun and profi (Was: newbie optimization question)

2007-10-28 Thread Don Stewart
stefanor:
 On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
  Finally, we can manually translate the C code into a confusing set of nested
  loops with interleaved IO,
  
  main = loop 1 
where
  loop !i | i  1 = return ()
  | otherwise = if i == go i 0 1 then print i  loop (i+1) 
 else loop (i+1)
  
  go !i !s !j | j = i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
  else go i s (j+1)
   | otherwise = s
  
  And we get *no speed benefit* at all!
  
  time ./A-loop  1.24s user 0.00s system 98% cpu 1.256 total
  
  So the lesson is: write in a high level style, and the compiler can do the 
  work
  for you. Or, GHC is pretty smart on high level code.
 
 IO blocks unboxing in GHC.  How fast is your mock-C code refactored to
 do IO outside of the loops only?

It doesn't! The above code yields:

Main.$wloop :: GHC.Prim.Int#
   - GHC.Prim.State# GHC.Prim.RealWorld
   - (# GHC.Prim.State# GHC.Prim.RealWorld, () #)

$wgo_rMK :: GHC.Prim.Int# - GHC.Prim.Int# - GHC.Prim.Int# - GHC.Prim.Int#
where
$s$wgo_rMI :: GHC.Prim.Int# - GHC.Prim.Int# - GHC.Prim.Int#
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusion for fun and profi (Was: newbie optimization question)

2007-10-28 Thread Stefan O'Rear
On Sun, Oct 28, 2007 at 01:43:07PM -0700, Don Stewart wrote:
 stefanor:
  IO blocks unboxing in GHC.  How fast is your mock-C code refactored to
  do IO outside of the loops only?
 
 It doesn't! The above code yields:
 
 Main.$wloop :: GHC.Prim.Int#
- GHC.Prim.State# GHC.Prim.RealWorld
- (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
 
 $wgo_rMK :: GHC.Prim.Int# - GHC.Prim.Int# - GHC.Prim.Int# - 
 GHC.Prim.Int#
 where
 $s$wgo_rMI :: GHC.Prim.Int# - GHC.Prim.Int# - GHC.Prim.Int#

Ah, interesting.  I was reading something too general from
http://hackage.haskell.org/trac/ghc/ticket/1592#comment:5.

Stefan


signature.asc
Description: Digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusion for fun and profi (Was: newbie optimization question)

2007-10-28 Thread Tim Chevalier
On 10/28/07, Stefan O'Rear [EMAIL PROTECTED] wrote:
 On Sun, Oct 28, 2007 at 01:43:07PM -0700, Don Stewart wrote:
  stefanor:
   IO blocks unboxing in GHC.  How fast is your mock-C code refactored to
   do IO outside of the loops only?
 
  It doesn't! The above code yields:
 
  Main.$wloop :: GHC.Prim.Int#
 - GHC.Prim.State# GHC.Prim.RealWorld
 - (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
 
  $wgo_rMK :: GHC.Prim.Int# - GHC.Prim.Int# - GHC.Prim.Int# - 
  GHC.Prim.Int#
  where
  $s$wgo_rMI :: GHC.Prim.Int# - GHC.Prim.Int# - GHC.Prim.Int#

 Ah, interesting.  I was reading something too general from
 http://hackage.haskell.org/trac/ghc/ticket/1592#comment:5.


Right, unboxing is successful here because the arguments are marked as
strict with (!) annotations. The IO hack described in that bug report
would only be a problem if the arguments were not marked strict, and
were used in code that has to be executed after an IO call.

Although I don't think all of the strictness annotations are even
necessary here.  In any case, loop must evaluate i before the call to
print; there would only be an issue if it called an IO function before
doing anything that demands i.

Cheers,
Tim

-- 
Tim Chevalier * catamorphism.org * Often in error, never in doubt
Accordingly, computer scientists commonly choose models which have
bottoms, but prefer them topless. -- Davey  Priestley, _Introduction
to Lattices and Order_
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusion for fun and profi (Was: newbie optimization question)

2007-10-28 Thread Don Stewart
dons:
 stefanor:
  On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
   Finally, we can manually translate the C code into a confusing set of 
   nested
   loops with interleaved IO,
   
   main = loop 1 
 where
   loop !i | i  1 = return ()
   | otherwise = if i == go i 0 1 then print i  loop (i+1) 
  else loop (i+1)
   
   go !i !s !j | j = i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
   else go i s (j+1)
| otherwise = s
   
   And we get *no speed benefit* at all!
   
   time ./A-loop  1.24s user 0.00s system 98% cpu 1.256 total
   
   So the lesson is: write in a high level style, and the compiler can do 
   the work
   for you. Or, GHC is pretty smart on high level code.
  

Oh, and we can fuse with the print loop too, yielding an entire program
of type Int# - IO (), and really no intermediate lists (even for the return
list).

Again, we need to use the fusible version of mapM_:

import Prelude hiding (filter,sum,enumFromTo,mapM_,sequence_,map,foldr)
import Data.List.Stream
import Data.Stream (enumFromToInt,unstream)

enumFromTo i j = unstream (enumFromToInt i j)

mapM_ f as =  sequence_ (map f as)
sequence_ ms   =  foldr () (return ()) ms
-- ^ fuse happily

default(Int)


main = mapM_ print $ filter (\i - i == sum (divisors i)) (enumFromTo 1 
1)
divisors i = filter (\j - i `rem`j == 0) (enumFromTo 1 (i-1))

And we see the map and foldr fuse in sequence, which in turn fuses with the 
filter (and rest of the program):

18 RuleFired
5 STREAM stream/unstream fusion
2 filter - fusible
1 foldr - fusible
1 map - fusible
1 sumInt - fusible

The program flattens to a single nested loop,

Main.$wloop_foldr :: GHC.Prim.Int#
 - GHC.Prim.State# GHC.Prim.RealWorld
 - (# GHC.Prim.State# GHC.Prim.RealWorld, () #)

which really is equivalent in terms of control flow and intermediate structures
to the C program.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe