Re: Re[2]: [Haskell-cafe] Fusing foldr's

2007-10-30 Thread Josef Svenningsson
On 10/29/07, Bulat Ziganshin [EMAIL PROTECTED] wrote:
 you may also look at these data:

   1,225,416 bytes allocated in the heap
 152,984 bytes copied during GC (scavenged)
   8,448 bytes copied during GC (not scavenged)
  86,808 bytes maximum residency (1 sample(s))

   3 collections in generation 0 (  0.00s)
   1 collections in generation 1 (  0.00s)

 if your hypothesis is true, amount of data copied and number of
 generation-1 collection should be much less in the second case

Indeed.

avg4:
880,935,612 bytes allocated in the heap
319,064,404 bytes copied during GC (scavenged)
318,965,812 bytes copied during GC (not scavenged)
201,080,832 bytes maximum residency (9 sample(s))

   1681 collections in generation 0 (  1.67s)
  9 collections in generation 1 ( 13.62s)

avgP:
1,761,224,604 bytes allocated in the heap
714,644 bytes copied during GC (scavenged)
593,184 bytes copied during GC (not scavenged)
184,320 bytes maximum residency (2 sample(s))

   1908 collections in generation 0 (  0.04s)
  2 collections in generation 1 (  0.00s)

Allocation is cheap, copying expensive.

All the best,

/Josef
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[4]: [Haskell-cafe] Fusing foldr's

2007-10-30 Thread Bulat Ziganshin
Hello Josef,

Tuesday, October 30, 2007, 4:13:04 PM, you wrote:

 201,080,832 bytes maximum residency (9 sample(s))
1681 collections in generation 0 (  1.67s)
   9 collections in generation 1 ( 13.62s)

 184,320 bytes maximum residency (2 sample(s))
1908 collections in generation 0 (  0.04s)
   2 collections in generation 1 (  0.00s)

 Allocation is cheap, copying expensive.

not copying itself, but generation-1 garbage collections. while g-0
collection scans 256kb which lives in CPU cache, g-1 collection scans
entire 100-200 mb of data that is very slow. try to use -H1g option,
though :)

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusing foldr's

2007-10-29 Thread Josef Svenningsson
On 10/28/07, Isaac Dupree [EMAIL PROTECTED] wrote:
 Josef Svenningsson wrote:
  Less bogus timing:
  avg4: 18.0s
  avgS: 2.2s
  avgP: 17.4s
 
  OK, so these figures make an even stronger case for my conclusion :-)
  Single traversal can be much faster than multiple traversals *when
  done right*.

 Did you use +RTS -N2 on your program (or whatever it is that makes GHC
 actually use multiple threads? or is that not necessary?)  Anyway I
 assume you wouldn't get better than 9.0s, which is still much worse than
 2.2s.

Oh, this is getting embarrassing.. Indeed, I forgot to use +RTS -N2.
But using those flags yielded a very interesting result:

avgP: 4.3s

Superlinear speedup!? As you say, I would have expected something
slightly larger than 9s. I think what happens here is that for avg4
the entire list has to be kept in memory between the two traversals
whereas for avgP the beginning of the list can be garbage collected
incrementally as the two threads traverse it. This could mean that the
list never moves to the second generation in the memory manager and
that can maybe account for the additional time savings. I'm not sure
how to verify that this is the case though.

Cheers,

/Josef
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusing foldr's

2007-10-29 Thread Josef Svenningsson
On 10/29/07, Josef Svenningsson [EMAIL PROTECTED] wrote:
 But using those flags yielded a very interesting result:

 avgP: 4.3s

 Superlinear speedup!? As you say, I would have expected something
 slightly larger than 9s. I think what happens here is that for avg4
 the entire list has to be kept in memory between the two traversals
 whereas for avgP the beginning of the list can be garbage collected
 incrementally as the two threads traverse it. This could mean that the
 list never moves to the second generation in the memory manager and
 that can maybe account for the additional time savings. I'm not sure
 how to verify that this is the case though.

Bulat kindly suggested I use +RTS -s to monitor the garbage collectors
behavior. It seems my hypothesis was right.

avg4:
387 Mb total memory in use
MUT   time2.43s  (  2.47s elapsed)
GCtime   15.32s  ( 16.05s elapsed)

avgP (+RTS -N2):
3 Mb total memory in use
MUT   time4.61s  (  2.51s elapsed)
GCtime0.04s  (  0.06s elapsed)

So it seems that the garbage collector takes an awful lot of time when
we allocate a big list like this. Hmmm. Strikes me as somewhat
suboptimal.

Cheers,

/Josef
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[2]: [Haskell-cafe] Fusing foldr's

2007-10-29 Thread Bulat Ziganshin
Hello Josef,

Monday, October 29, 2007, 2:08:54 PM, you wrote:

 that can maybe account for the additional time savings. I'm not sure
 how to verify that this is the case though.

 Bulat kindly suggested I use +RTS -s to monitor the garbage collectors
 behavior. It seems my hypothesis was right.

you may also look at these data:

  1,225,416 bytes allocated in the heap
152,984 bytes copied during GC (scavenged)
  8,448 bytes copied during GC (not scavenged)
 86,808 bytes maximum residency (1 sample(s))

  3 collections in generation 0 (  0.00s)
  1 collections in generation 1 (  0.00s)

if your hypothesis is true, amount of data copied and number of
generation-1 collection should be much less in the second case


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusing foldr's

2007-10-27 Thread Josef Svenningsson
On 10/26/07, Dan Weston [EMAIL PROTECTED] wrote:
 Thanks for letting me know about the Data.Strict library on Hackage. I
 will definitely make use of that! BTW, you left out an import
 Data.List(foldl') in your example.

Yes, Data.Strict can be pretty handy for getting the right strictness.
Sorry about the missing import.

 My timing test is an order of magnitude worse than yours. Do you have an
 extra zero in your list endpoint?

   I fed these functions to ghc with the -O2 and -threaded flags and
   timed them using the list [1..1000]. The result (best times out of
   several runs):
   avg4: 284 ms
   avgS: 184 ms
   avgP: 248 ms

 Using ghc -threaded -O2 --make Avg.hs for ghc 6.6.1, I ran your tests on
 [1..1000] and got the user times:

 avg4: 12.75 s
 avgS:  3.65 s
 avgP: 15.56 s

 The funny thing is that avg4/avgS = 3.5 for and only 1.5 for you. I
 understand that with only 1 processor my avgP time may be twice yours,
 but not the avgS or avg4.

Oooops.. My numbers are totally bogus. I had code that looked like the
following:
\begin{code}
main = do
time avg4 [1..1000]
time avg4 [1..1000]
time avg4 [1..1000]
time avgS [1..1000]
time avgS [1..1000]
time avgS [1..1000]
time avgP [1..1000]
time avgP [1..1000]
time avgP [1..1000]
\end{code}
Not very elegant I know but I thought it would do the job. Apparently
I was wrong. GHC does common subexpression elimination on all the
lists so they're all shared between the different calls. Of course,
the first function call would always take long time but I ignored it,
thinking it was some anomaly. Anyway, I was totally sure that GHC only
did cse on constructor expressions and not on arbitrary computations.
Guess I was wrong. A little searching revealed the following quote by
Simon PJ:

 GHC does a very simple form of CSE. If it sees
let x = e in e
 it replaces the inner 'e' by x.  But that's all at the moment.

Lesson learned.

Less bogus timing:
avg4: 18.0s
avgS: 2.2s
avgP: 17.4s

OK, so these figures make an even stronger case for my conclusion :-)
Single traversal can be much faster than multiple traversals *when
done right*.

 I have the following machine:

 Main memory size: 2026 Mbytes
 Num Processors: 1
 Processor Type: Intel(R) Xeon(TM) CPU 2.80GHz x32
 Clock Speed: 2790 MHZ

In case you're still interested my machine looks like this:

Memory: 2026 Mbytes
Processor: AMD Turion(tm) 64 X2 Mobile Technology TL-56
Clock Speed: 1800MHz

All the best,

/Josef
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusing foldr's

2007-10-27 Thread Isaac Dupree

Josef Svenningsson wrote:

Less bogus timing:
avg4: 18.0s
avgS: 2.2s
avgP: 17.4s

OK, so these figures make an even stronger case for my conclusion :-)
Single traversal can be much faster than multiple traversals *when
done right*.


Did you use +RTS -N2 on your program (or whatever it is that makes GHC 
actually use multiple threads? or is that not necessary?)  Anyway I 
assume you wouldn't get better than 9.0s, which is still much worse than 
2.2s.


Isaac
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusing foldr's

2007-10-26 Thread Josef Svenningsson
Sorry for reacting so late on this mail. I'm digging through some old mails...

On 10/12/07, Dan Weston [EMAIL PROTECTED] wrote:
 Always check optimizations to make sure they are not pessimizations!

 Actually, traversing the list twice is very cheap compared to space
 leakage, and accumulating pairs requires tuple boxing and unboxing which
 I don't know how to get GHC not to do.

I agree hole-heartedly that replacing multiple traversals with a
single traversal should be done with care as it more often than not
results in a pessimization. Indeed you showed just that with your
examples! But I'd thought it'd be interesting to see how it can
actually be an improvement if done carefully.

\begin{code}
import Control.Arrow
import qualified Data.Strict.Tuple as T
import Data.Strict.Tuple (Pair(..))
import Control.Parallel

avg4 = uncurry (/) . (foldl' (+) 0  foldl' (\x y - x + 1) 0)
avgS = T.uncurry (/) . foldl' (\p n - ((+n) *!* (+1)) p) (0 :!: 0)
avgP = uncurry (/) . (foldl' (+) 0 ! foldl' (\x y - x + 1) 0)

(*!*) f g (a :!: b) = f a :!: g b

(!) f g a = fa `par` (fa,ga)
  where fa = f a
ga = g a
\end{code}

avg4 is the function which was best among Dan's benchmarks. avgS uses
strict tuples. I just threw in avgP for fun, it traverses the lists in
parallel. Note: I do have a dual core machine so it makes sense to try
avgP.

I fed these functions to ghc with the -O2 and -threaded flags and
timed them using the list [1..1000]. The result (best times out of
several runs):
avg4: 284 ms
avgS: 184 ms
avgP: 248 ms

It seems doing a single traversal can be faster if your write your
function carefully. Doing the traversal in parallel was beneficial but
not as good as the single traversal.

Cheers,

/Josef
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusing foldr's

2007-10-26 Thread Dan Weston
Thanks for letting me know about the Data.Strict library on Hackage. I 
will definitely make use of that! BTW, you left out an import 
Data.List(foldl') in your example.


My timing test is an order of magnitude worse than yours. Do you have an 
extra zero in your list endpoint?


 I fed these functions to ghc with the -O2 and -threaded flags and
 timed them using the list [1..1000]. The result (best times out of
 several runs):
 avg4: 284 ms
 avgS: 184 ms
 avgP: 248 ms

Using ghc -threaded -O2 --make Avg.hs for ghc 6.6.1, I ran your tests on 
[1..1000] and got the user times:


avg4: 12.75 s
avgS:  3.65 s
avgP: 15.56 s

The funny thing is that avg4/avgS = 3.5 for and only 1.5 for you. I 
understand that with only 1 processor my avgP time may be twice yours, 
but not the avgS or avg4.


I have the following machine:

Main memory size: 2026 Mbytes
Num Processors: 1
Processor Type: Intel(R) Xeon(TM) CPU 2.80GHz x32
Clock Speed: 2790 MHZ

Josef Svenningsson wrote:

Sorry for reacting so late on this mail. I'm digging through some old mails...

On 10/12/07, Dan Weston [EMAIL PROTECTED] wrote:

Always check optimizations to make sure they are not pessimizations!

Actually, traversing the list twice is very cheap compared to space
leakage, and accumulating pairs requires tuple boxing and unboxing which
I don't know how to get GHC not to do.


I agree hole-heartedly that replacing multiple traversals with a
single traversal should be done with care as it more often than not
results in a pessimization. Indeed you showed just that with your
examples! But I'd thought it'd be interesting to see how it can
actually be an improvement if done carefully.

\begin{code}
import Control.Arrow
import qualified Data.Strict.Tuple as T
import Data.Strict.Tuple (Pair(..))
import Control.Parallel

avg4 = uncurry (/) . (foldl' (+) 0  foldl' (\x y - x + 1) 0)
avgS = T.uncurry (/) . foldl' (\p n - ((+n) *!* (+1)) p) (0 :!: 0)
avgP = uncurry (/) . (foldl' (+) 0 ! foldl' (\x y - x + 1) 0)

(*!*) f g (a :!: b) = f a :!: g b

(!) f g a = fa `par` (fa,ga)
  where fa = f a
ga = g a
\end{code}

avg4 is the function which was best among Dan's benchmarks. avgS uses
strict tuples. I just threw in avgP for fun, it traverses the lists in
parallel. Note: I do have a dual core machine so it makes sense to try
avgP.

I fed these functions to ghc with the -O2 and -threaded flags and
timed them using the list [1..1000]. The result (best times out of
several runs):
avg4: 284 ms
avgS: 184 ms
avgP: 248 ms

It seems doing a single traversal can be faster if your write your
function carefully. Doing the traversal in parallel was beneficial but
not as good as the single traversal.

Cheers,

/Josef





___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Fusing foldr's

2007-10-12 Thread Dan Weston

Always check optimizations to make sure they are not pessimizations!

Actually, traversing the list twice is very cheap compared to space 
leakage, and accumulating pairs requires tuple boxing and unboxing which 
I don't know how to get GHC not to do.


Your avg3 (along with several attempts of mine to fix the problem) gave 
stack overflows on a large list.


Only avg4 below (traversing the list twice with strict accumulation) 
didn't blow up on large lists, even though avg5 and avg6 were intended 
to be strict.


Prelude Control.Arrow Data.List
 let avg4 = uncurry (/) . (foldl' (+) 0  foldl' (\x y - x + 1) 0)
  in avg4 [1..1000]
500.5
-- This took 13 sec on my machine

Prelude Control.Arrow Data.List let avg3 = uncurry (/) . foldr (\x 
(s,n) - (s + x,n + 1)) (0,0) in avg3 [1..1000]

*** Exception: stack overflow
-- This fails in 1 sec

Prelude Control.Arrow Data.List
 let avg5 = uncurry (/) . foldl' (\(s,n) x - (s + x,n + 1)) (0,0)
  in avg5 [1..1000]
*** Exception: stack overflow
-- This fails in 100 sec

Prelude Control.Arrow Data.List
 let avg6 = uncurry (/) . foldl' (\sn x - (fst sn+x,snd sn+1)) (0,0)
  in avg6 [1..1000]
*** Exception: stack overflow
-- This fails in 30 sec

Prelude Control.Arrow Data.List
 let avg3 = uncurry (/) . foldr (\n - (+n) *** (+1)) (0, 0)
  in avg3 [1..1000]
*** Exception: stack overflow
-- This fails in 2 sec

Tim Newsham wrote:

Just goofing around with arrows and foldr while reading Hutton's
excellent paper on folds (http://www.cs.nott.ac.uk/~gmh/fold.pdf).

Wondering if this can be done automatically and more generally?

module Main where
import Control.Arrow
import Data.List

-- sum and length expressed as foldr.
fsum = foldr (\n - (+n)) 0
flen = foldr (\n - (+1)) 0

-- compute average using arrows..
-- compute the sum of a list, compute the length, and do a divide.
-- this traverses the list twice using two foldrs.
avg1 = uncurry (/) . (fsum  flen)
avg2 = uncurry (/) . (foldr (\n - (+n)) 0  foldr (\n - (+1)) 0)

-- But the two foldr's can be fused together
-- here we're mixing the two foldr constants 0 and 0 to (0,0)
-- and we're mixing the two functions (\n - (+n)) and
-- (\n - (+1)) to (\n - (+n) *** (+1)).
avg3 = uncurry (/) . foldr (\n - (+n) *** (+1)) (0, 0)

main = do
print $ avg1 [1,2,3,4]
print $ avg2 [1,2,3,4]
print $ avg3 [1,2,3,4]

Tim Newsham
http://www.thenewsh.com/~newsham/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe





___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Fusing foldr's

2007-10-11 Thread Tim Newsham

Just goofing around with arrows and foldr while reading Hutton's
excellent paper on folds (http://www.cs.nott.ac.uk/~gmh/fold.pdf).

Wondering if this can be done automatically and more generally?

module Main where
import Control.Arrow
import Data.List

-- sum and length expressed as foldr.
fsum = foldr (\n - (+n)) 0
flen = foldr (\n - (+1)) 0

-- compute average using arrows..
-- compute the sum of a list, compute the length, and do a divide.
-- this traverses the list twice using two foldrs.
avg1 = uncurry (/) . (fsum  flen)
avg2 = uncurry (/) . (foldr (\n - (+n)) 0  foldr (\n - (+1)) 0)

-- But the two foldr's can be fused together
-- here we're mixing the two foldr constants 0 and 0 to (0,0)
-- and we're mixing the two functions (\n - (+n)) and
-- (\n - (+1)) to (\n - (+n) *** (+1)).
avg3 = uncurry (/) . foldr (\n - (+n) *** (+1)) (0, 0)

main = do
print $ avg1 [1,2,3,4]
print $ avg2 [1,2,3,4]
print $ avg3 [1,2,3,4]

Tim Newsham
http://www.thenewsh.com/~newsham/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe