Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Ozgur Akgun
Richard,

I found this very annoying when I first realised GHC doesn't do any CSE,
given that the result of pure functions only depend on their parameters.

Even though CSE usually sounds good, when you ask, they go and find obscure
examples in which it causes great trouble :)

I think, there should at least be a compiler flag or something similar to
enforce CSE on some structures. But currently, as others pointed out, you
need to bind and do your sub expression elimination yourself.

Best,

On 18 May 2010 17:30, Richard Warburton richard.warbur...@gmail.com wrote:

 A colleague of mine pointed out that ghc wasn't performing as he
 expected when optimising some code.  I wonder if anyone could offer
 any insight as to why its not noting this common subexpression:

 main = print $ newton 4 24

 newton a 0 = a
 newton a n =  ((newton a (n-1))^2 + a)/(2*(newton a (n-1)))

 Compiled with 'ghc -O3 --make perf.hs', results in:

 real0m5.544s
 user0m5.492s
 sys 0m0.008s

 However if we factor out the repeated call to the newton method:

 main = print $ newton2 4 24

 newton2 a 0 = a
 newton2 a n =  (x^2 + a)/(2*x)
where
  x = newton2 a (n-1)

 real0m0.004s
 user0m0.004s
 sys 0m0.004s

 It looks to me like Referential transparency should make this a sound
 optimisation to apply, but ghc isn't doing it even on -O3.

 regards,

  Richard
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe




-- 
Ozgur Akgun
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Michael Lesniak
Hello,

 Even though CSE usually sounds good, when you ask, they go and find obscure
 examples in which it causes great trouble :)
Do you (or others) have any particular mean but understandable example?

Cheers,
  Michael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Ivan Lazar Miljenovic
Michael Lesniak mlesn...@uni-kassel.de writes:

 Even though CSE usually sounds good, when you ask, they go and find obscure
 examples in which it causes great trouble :)
 Do you (or others) have any particular mean but understandable
 example?

http://www.haskell.org/haskellwiki/GHC:FAQ#Does_GHC_do_common_subexpression_elimination.3F

All hail Google! :p

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Michael Lesniak
Hello,

 http://www.haskell.org/haskellwiki/GHC:FAQ#Does_GHC_do_common_subexpression_elimination.3F

 All hail Google! :p
You're right :D. Thanks!

- Michael


-- 
Dipl.-Inf. Michael C. Lesniak
University of Kassel
Programming Languages / Methodologies Research Group
Department of Computer Science and Electrical Engineering

Wilhelmshöher Allee 73
34121 Kassel

Phone: +49-(0)561-804-6269
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Daniel Fischer
On Saturday 22 May 2010 15:00:25, Thomas Schilling wrote:
 Actually, in this case it would be safe to do CSS.  Because

  a) the function is strict in both arguments so GHC creates a worker
 which only uses unboxed types
  b) this cannot cause any space leaks (it contains no pointers)

 The generated Core does look pretty weird, though:

 $wnewton =
   \ (ww_s115 :: Double#) (ww1_s119 :: Int#) -
 case ww1_s119 of ds_Xr8 {
   __DEFAULT -
 case ^_r11D
(case $wnewton ww_s115 (-# ds_Xr8 1)
 of ww2_s11d { __DEFAULT -
 D# ww2_s11d  -- box the result of $wnewton
 })
lvl_r11B
 of _ { D# x_avk --- unbox it again

The boxing is due to the use of (^).
If you write x*x instead of x^2, it can use the primop *## and needn't box 
it.
As a side effect, the original time leak probably wouldn't have occured 
with x*x instead of x^2 because one would've made it
   let x = newton a (n-1) in (x*x +a) / (2*x)
instead of writing out newton a (n-1) thrice anyway, wouldn't one?

 case $wnewton ww_s115 (-# ds_Xr8 1)
 of ww2_s11d { __DEFAULT -
 /##
   (+## x_avk ww_s115) (*## 2.0 ww2_s11d)
 }
 };
   0 - ww_s115
 }

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Daniel Fischer
On Saturday 22 May 2010 16:48:27, Daniel Fischer wrote:
 The boxing is due to the use of (^).
 If you write x*x instead of x^2, it can use the primop *## and needn't
 box it.
 As a side effect, the original time leak probably wouldn't have occured
 with x*x instead of x^2 because one would've made it
let x = newton a (n-1) in (x*x +a) / (2*x)
 instead of writing out newton a (n-1) thrice anyway, wouldn't one?


Even if. With

newton :: Double - Int - Double
newton a 0 = a
newton a n =
(((newton a (n-1)) * (newton a (n-1)) ) + a)/(2*(newton a (n-1)))

(and optimisations of course), GHC does share newton a (n-1).

Lesson: Writing x^2 is a baad thing.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Brandon S. Allbery KF8NH

On May 22, 2010, at 08:30 , Ozgur Akgun wrote:
Even though CSE usually sounds good, when you ask, they go and find  
obscure examples in which it causes great trouble :)



They're not actually that obscure.  Most of them can be summarized  
thusly:  anything that would cause a fold to cause a stack or heap  
overflow if you get the strictness wrong is likely to cause the same  
problems if CSE optimizes it, for much the same reason.  And, worse,  
if you have CSE, the only way to avoid this would be to disable it;  
the current situation at least permits you to manually perform CSE via  
let-binding.  (In particular, it's often impossible to manipulate  
strictness; consider computing an average over a list as sum / length  
(obscure?):   the only way to avoid a large list exploding in memory  
usage is to pull apart sum and length and thread them together,  
*no* strictness annotation will help.)


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon universityKF8NH




PGP.sig
Description: This is a digitally signed message part
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-22 Thread Thomas Schilling
On 22 May 2010 16:06, Daniel Fischer daniel.is.fisc...@web.de wrote:
 On Saturday 22 May 2010 16:48:27, Daniel Fischer wrote:
 The boxing is due to the use of (^).
 If you write x*x instead of x^2, it can use the primop *## and needn't
 box it.
 As a side effect, the original time leak probably wouldn't have occured
 with x*x instead of x^2 because one would've made it
    let x = newton a (n-1) in (x*x +a) / (2*x)
 instead of writing out newton a (n-1) thrice anyway, wouldn't one?


 Even if. With

 newton :: Double - Int - Double
 newton a 0 = a
 newton a n =
    (((newton a (n-1)) * (newton a (n-1)) ) + a)/(2*(newton a (n-1)))

 (and optimisations of course), GHC does share newton a (n-1).

 Lesson: Writing x^2 is a baad thing.

Interesting.  Clearly GHC needs a better partial evaluator! :)  (^) is
not inlined because it's recursive (or rather it's worker is) and
there also is no SPECIALISE pragma for Double - Integer - Double.
Yes, it's Integer, not Int, because the literal 2 defaults to
Integer.

It doesn't seem to be possible to add SPECIALISE pragmas for non-local
functions.  If I copy over the definition of (^) no pragma is needed.
GHC creates an worker for Double# - Integer - Double# and that seems
to be sufficient to make CSE work.



-- 
Push the envelope.  Watch it bend.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Performance Issue

2010-05-18 Thread Richard Warburton
A colleague of mine pointed out that ghc wasn't performing as he
expected when optimising some code.  I wonder if anyone could offer
any insight as to why its not noting this common subexpression:

main = print $ newton 4 24

newton a 0 = a
newton a n =  ((newton a (n-1))^2 + a)/(2*(newton a (n-1)))

Compiled with 'ghc -O3 --make perf.hs', results in:

real0m5.544s
user0m5.492s
sys 0m0.008s

However if we factor out the repeated call to the newton method:

main = print $ newton2 4 24

newton2 a 0 = a
newton2 a n =  (x^2 + a)/(2*x)
where
  x = newton2 a (n-1)

real0m0.004s
user0m0.004s
sys 0m0.004s

It looks to me like Referential transparency should make this a sound
optimisation to apply, but ghc isn't doing it even on -O3.

regards,

  Richard
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-18 Thread Bryan O'Sullivan
On Tue, May 18, 2010 at 9:30 AM, Richard Warburton 
richard.warbur...@gmail.com wrote:

 A colleague of mine pointed out that ghc wasn't performing as he
 expected when optimising some code.  I wonder if anyone could offer
 any insight as to why its not noting this common subexpression:


GHC performs almost no common subexpression elimination, the reasons being
that it can introduce space leaks and undesired extra laziness.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-18 Thread Richard Warburton
 A colleague of mine pointed out that ghc wasn't performing as he
 expected when optimising some code.  I wonder if anyone could offer
 any insight as to why its not noting this common subexpression:

 GHC performs almost no common subexpression elimination, the reasons being
 that it can introduce space leaks and undesired extra laziness.

Is there any way to encourage it to do so, for example compilation
flags?  Or is it generally best to hand apply these kind of
optimisations.

regards,

  Richard
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-18 Thread Serguey Zefirov
2010/5/18 Richard Warburton richard.warbur...@gmail.com:
 GHC performs almost no common subexpression elimination, the reasons being
 that it can introduce space leaks and undesired extra laziness.
 Is there any way to encourage it to do so, for example compilation
 flags?  Or is it generally best to hand apply these kind of
 optimisations.

I think that handmade common expression elimination improves overall
quality of code. ;)

(as it amounts to refactoring it)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2010-05-18 Thread Bryan O'Sullivan
On Tue, May 18, 2010 at 9:43 AM, Richard Warburton 
richard.warbur...@gmail.com wrote:

 Is there any way to encourage it to do so, for example compilation flags?


No. It would be difficult for the compiler to see when CSE is or is not safe
to apply, and so it doesn't have any code to perform full CSE at all.


 Or is it generally best to hand apply these kind of optimisations.


Since the compiler won't do it for you, the answer is yes :-)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] performance issue and HGL

2010-02-14 Thread Vojtěch Knyttl
Hello,

I've created a simple Mandelbrot set generator, using HGL, the source is 
viewable at:

http://pastebin.dqd.cz/cUmg/

1. The problem is, that it is very slow. It is obvious that what takes the most 
time is the mandel function computation. I have no idea, how it can be 
improved.

2. What is the easiest way of changing a color and plotting a pixel with it? Do 
I always have to create a pen for it? Isn't there a simple way? Already – 
using a line to plot a pixel is probably not the best way…

Thanks___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Performance Issue

2009-02-26 Thread James Swaine
i'm implementing a benchmark which includes a detailed specification for a
random number generator.  for any of the kernels outlined in the benchmark,
i might have to generate a set of random numbers R, which has a length n,
using the following formulas:

R[k] = ((2^-46)(X[k])) mod 2^46, where

X[k] = (a^k)s

where the values of a and s are constant and defined below.
many of the kernels in the benchmark require a large number of randoms to be
generated (in the tens of millions).  when i invoke the following getRandAt
function that many times to build up a list, evaluation of the list takes
forever (somewhere between 5 and 10 minutes).  i've tried optimizing this
several different ways, with no luck.  i though i might post my code here
and see if anyone notices anything i'm doing wrong that might be causing
such a large bottleneck:

--constants
a :: Int64
a = 5^13

divisor :: Int64
divisor = 2^46

multiplier :: Float
multiplier = 2**(-46)


--gets r[k], which is the value at the kth
--position in the overall sequence of
--pseudorandom numbers
getRandAt :: Int64 - Int64 - Float
getRandAt 0 seed = multiplier * (fromIntegral seed)
getRandAt k seed = multiplier * (fromIntegral x_next)
where
x_prev = (a^k * seed) `mod` divisor
x_next = (a * x_prev) `mod` divisor

thanks all in advance for your help!
-james
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2009-02-26 Thread Don Stewart
james.swaine:
 i'm implementing a benchmark which includes a detailed specification for a
 random number generator.  for any of the kernels outlined in the benchmark, i
 might have to generate a set of random numbers R, which has a length n, using
 the following formulas:
 
 R[k] = ((2^-46)(X[k])) mod 2^46, where
 
 X[k] = (a^k)s
 
 where the values of a and s are constant and defined below. 
 many of the kernels in the benchmark require a large number of randoms to be
 generated (in the tens of millions).  when i invoke the following getRandAt
 function that many times to build up a list, evaluation of the list takes
 forever (somewhere between 5 and 10 minutes).  i've tried optimizing this
 several different ways, with no luck.  i though i might post my code here and
 see if anyone notices anything i'm doing wrong that might be causing such a
 large bottleneck:
 
 --constants
 a :: Int64
 a = 5^13   
 
 divisor :: Int64
 divisor = 2^46
 
 multiplier :: Float
 multiplier = 2**(-46)
 
 
 --gets r[k], which is the value at the kth
 --position in the overall sequence of
 --pseudorandom numbers
 getRandAt :: Int64 - Int64 - Float
 getRandAt 0 seed = multiplier * (fromIntegral seed)
 getRandAt k seed = multiplier * (fromIntegral x_next)
 where
 x_prev = (a^k * seed) `mod` divisor
 x_next = (a * x_prev) `mod` divisor
 
 thanks all in advance for your help!


Using ghc -O2 --make

There's nothing wrong with this code, really:

Z.$wgetRandAt :: Int# - Int# - Float#

and an inner loop of:

Z.$w$j :: Int# - Float#
Z.$w$j =
  \ (w_sHx :: Int#) -
case Z.^1 Z.lit1 Z.lvl of w1_XFs { I64# ww_XFv -
case ww_XFv of wild_aFB {
  __DEFAULT -
case minBound3 of wild1_aFC { I64# b1_aFE -
case ==# w_sHx b1_aFE of wild2_aFG {
  False -
case modInt# w_sHx wild_aFB of wild3_aFJ { __DEFAULT -
timesFloat#
  (powerFloat# __float 2.0 __float -46.0)
  (int2Float# wild3_aFJ)
};
  True -
case wild_aFB of wild3_aFM {
  __DEFAULT -
case modInt# w_sHx wild3_aFM of wild4_aFN { __DEFAULT -
timesFloat#
  (powerFloat# __float 2.0 __float -46.0)
  (int2Float# wild4_aFN)
};
  (-1) -
overflowError
`cast` (CoUnsafe (forall a_aFS. a_aFS) Float#
:: forall a_aFS. a_aFS ~ Float#)
}
}
};
  0 -
divZeroError
`cast` (CoUnsafe (forall a_aFT. a_aFT) Float#
:: forall a_aFT. a_aFT ~ Float#)
}

Which is just fine.

Inlining those constants explicitly might be a good idea, then we get an outer 
loop of:

Z.$wgetRandAt :: Int# - Int# - Float#
Z.$wgetRandAt =
  \ (ww_sHG :: Int#) (ww1_sHK :: Int#) -
case ww_sHG of wild_B1 {
  __DEFAULT -
case Z.lvl3 of wild1_aEd { I64# x#_aEf -
case Z.^ Z.a (I64# wild_B1)
of wild2_XFe { I64# x#1_XFh -
case Z.lvl1 of w_aE7 { I64# ww2_aE9 -
case ww2_aE9 of wild3_aFB {
  __DEFAULT -
case minBound3 of wild11_aFC { I64# b1_aFE -
let {
  ww3_aFz [ALWAYS Just L] :: Int#

  ww3_aFz = *# x#1_XFh ww1_sHK } in
case ==# ww3_aFz b1_aFE of wild21_aFG {
  False -
case modInt# ww3_aFz wild3_aFB
of wild31_aFJ { __DEFAULT -
Z.$w$j (*# x#_aEf wild31_aFJ)
};
  True -
case wild3_aFB of wild31_aFM {
  __DEFAULT -
case modInt# ww3_aFz wild31_aFM
of wild4_aFN { __DEFAULT -
Z.$w$j (*# x#_aEf wild4_aFN)
};
  (-1) -
overflowError
`cast` (CoUnsafe (forall a_aFS. a_aFS) Float#
:: forall a_aFS. a_aFS ~ Float#)
}
  0 -
divZeroError
`cast` (CoUnsafe (forall a_aFT. a_aFT) Float#
:: forall a_aFT. a_aFT ~ Float#)
}
}
}
};
  0 -
timesFloat#
  (powerFloat# __float 2.0 __float -46.0)
  (int2Float# ww1_sHK)
}

which is the fast path, then error / bounds checking.

this looks perfectly acceptable.

What does sound troublesome is using lazy lists .. that's more likely to be the 
bottleneck.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2009-02-26 Thread Luke Palmer
2009/2/26 James Swaine james.swa...@gmail.com

 --gets r[k], which is the value at the kth
 --position in the overall sequence of
 --pseudorandom numbers
 getRandAt :: Int64 - Int64 - Float
 getRandAt 0 seed = multiplier * (fromIntegral seed)
 getRandAt k seed = multiplier * (fromIntegral x_next)
 where
 x_prev = (a^k * seed) `mod` divisor
 x_next = (a * x_prev) `mod` divisor


One thing that comes to mind is that this exponentiation, with a very big
exponent, could potentially take a very long time. I believe that GHC
implements (^) using a repeated squaring technique, so it runs in log(k)
time, which ought to be no problem.  I'm not sure about other compilers
though.

Also note:

(a^k * seed) `mod` divisor = ((a^k `mod` divisor) * seed) `mod` divisor =
(a^(k `mod` phi(divisor)) * seed) `mod` divisor.

Where phi is the Euler totient function: phi(2^46) = 2^23.

Modulo errors... it's been a while since I've done this stuff.

Luke
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance Issue

2009-02-26 Thread Daniel Fischer
Am Donnerstag, 26. Februar 2009 18:48 schrieb Luke Palmer:
 2009/2/26 James Swaine james.swa...@gmail.com

  --gets r[k], which is the value at the kth
  --position in the overall sequence of
  --pseudorandom numbers
  getRandAt :: Int64 - Int64 - Float
  getRandAt 0 seed = multiplier * (fromIntegral seed)
  getRandAt k seed = multiplier * (fromIntegral x_next)
  where
  x_prev = (a^k * seed) `mod` divisor
  x_next = (a * x_prev) `mod` divisor

 One thing that comes to mind is that this exponentiation, with a very big
 exponent, could potentially take a very long time. I believe that GHC
 implements (^) using a repeated squaring technique, so it runs in log(k)
 time, which ought to be no problem.  I'm not sure about other compilers
 though.

Another thing: if you don't need to pick random indices, but use them in 
order, it may be faster to have a list of the random numbers :

randInts = iterate ((`mod` divisor) . (*a)) seed

or carry x[k] around as state.


 Also note:

 (a^k * seed) `mod` divisor = ((a^k `mod` divisor) * seed) `mod` divisor =
 (a^(k `mod` phi(divisor)) * seed) `mod` divisor.

 Where phi is the Euler totient function: phi(2^46) = 2^23.

phi(2^n) = 2^(n-1).

Apart from that, correct.


 Modulo errors... it's been a while since I've done this stuff.

 Luke

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe