Re: [Haskell-cafe] Profiling nested case
Hi! On Sat, Jul 12, 2008 at 3:33 AM, Max Bolingbroke [EMAIL PROTECTED] wrote: If findColor had been a function defined in terms of foldr rather than using explicit recursion, then theres a good chance GHC 6.9 would have fused it with the list to yield your optimized, loop unrolled, version: My first version was with msum. Is this also covered by this fusion? (And it is interesting that my own recursion version is faster than the version with msum. Why?) Incidentally, if in your most recent email castRayScene2 was your only used of castRay, GHC would have inlined the whole definition into that use site and you would have got castRayScene1 back again. It is a little more tricky. I choose in an IO monad which scene it will render (selected by a program argument). So at compile time it does not yet know which one it will use. But there is a finite number of possibilities (in my case two) - why not inline both versions and at run time choose one? Mitar ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
2008/7/12 Mitar [EMAIL PROTECTED]: So that I can easily change the type everywhere. But it would be much nicer to write: data Quaternion a = Q !a !a !a !a deriving (Eq,Show) Only the performance of Num instance functions of Quaternion is then quite worse. You can probably use a specialization pragma to get around that. -- Jedaï ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
Hi! On Fri, Jul 18, 2008 at 3:54 PM, Chaddaï Fouché [EMAIL PROTECTED] wrote: So that I can easily change the type everywhere. But it would be much nicer to write: data Quaternion a = Q !a !a !a !a deriving (Eq,Show) Only the performance of Num instance functions of Quaternion is then quite worse. You can probably use a specialization pragma to get around that. But why is this not automatic? If I use Quaternions of only one type in the whole program then why it does not make specialized version for it? At least with -O2 switch. Why exactly are polymorphic functions slower? Is not this just a question of type checking (and of writing portable/reusable code)? But later on in a compiler process we do know of which type exactly is the called function - so we could use a function as it would be written only for that type. Something what specialization is doing as I see. I thought this is always done. Mitar ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
2008/7/18 Mitar [EMAIL PROTECTED]: On Sat, Jul 12, 2008 at 3:33 AM, Max Bolingbroke [EMAIL PROTECTED] wrote: If findColor had been a function defined in terms of foldr rather than using explicit recursion, then theres a good chance GHC 6.9 would have fused it with the list to yield your optimized, loop unrolled, version: My first version was with msum. Is this also covered by this fusion? Note that as I said the fusion only applies to 6.9 onwards. However, assuming that you were using msum at the list monad then since msum = concat is defined in terms of foldr there is a chance it could happen. The best guide to this kind of thing is not asking me but rather looking at the Core output by GHC for your particular program. (And it is interesting that my own recursion version is faster than the version with msum. Why?) I don't know why you find this suprising :-). Your own version is specialized exactly for the situation you wish to use it for. msum is a generic combinator, which naturally makes it less amenable to optimization because there is less information available about it's usage pattern. Note that GHC will of course try to do its best to de-specialize the msum function for any particular scenario it's used in, through inlining etc, but there are no guarantees. It is a little more tricky. I choose in an IO monad which scene it will render (selected by a program argument). So at compile time it does not yet know which one it will use. But there is a finite number of possibilities (in my case two) - why not inline both versions and at run time choose one? If there was only one static occurance of each identifier in your module then it would do that. This is because there is no code size implications for inlining a function into it's unitary use site. Inlining is not a cure-all for performance though: if you inline too much then you increase code size and hence increase the amount of main memory you're reading and reduce instruction cache hits, not to mention fill up your disk with multi-MB binaries. Max ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
Hi! (I will reply propely later, I have a project to finish and GHC is playing me around and does not want to cooperate.) This project of mine is getting really interesting. Is like playing table tennis with GHC. Some time it gives a nice ball, sometimes I have to run around after it. But I just wanted to make a simple raycasting engine. Not debug GHC. But it is interesting - just I do not have time just know for playing the change code - compile - run on known input - check if time elapsed increased (I have to do this even when I am thinking that I am optimizing things or even when I am thinking that I am just refactoring code - moving constants to definitions...). And this is a slow process because every iteration runs for a few minutes. The next beautiful example in this series is this function for computing 4D Julia set fractal: julia4DFractal :: BasicReal - World julia4DFractal param (x,y,z) = julia4D (Q (x / scale) (y / scale) (z / scale) param) iterations where c = (Q (-0.08) 0.0 (-0.8) (-0.03)) alphaBlue = VoxelColor 0 0 (2 / scale) (2 / scale) scale = fromIntegral sceneHeight / 1.8 threshold = 16 iterations = 100 :: Int julia4D _ 0= (# alphaBlue, 1 #) -- point is (probably) not in the set julia4D q it | qMagnitudeSquared q threshold = (# noColor, 1 #) -- point is in the set | otherwise = julia4D (qSquared q + c) (it - 1) where distance = scale * (qMagnitude qN) / (2 * (qMagnitude qN')) * log (qMagnitude qN) (# qN, qN' #) = disIter q (Q 1 0 0 0) iterations where disIter qn qn' 0 = (# qn, qn' #) disIter qn qn' i | qMagnitudeSquared qn threshold = (# qn, qn' #) | otherwise = disIter (qSquared qn + c) (2 * qn * qn') (i - 1) Please observe that distance is never used. And this is also what GHC warns. But the trick is that with having this part of a code in there, the program virtually never finishes (I killed it after 15 minutes). If I remove everything on and after the where distance line it finishes in 30 seconds. OK, the problem is with (# qN, qN' #), if this is changed to normal (qN, qN'), then it works. But to notice this ... This is something you have to spend a day for. Mitar ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
On Sat, Jul 12, 2008 at 8:57 PM, Mitar [EMAIL PROTECTED] wrote: julia4DFractal :: BasicReal - World julia4DFractal param (x,y,z) = julia4D (Q (x / scale) (y / scale) (z / scale) param) iterations where c = (Q (-0.08) 0.0 (-0.8) (-0.03)) alphaBlue = VoxelColor 0 0 (2 / scale) (2 / scale) scale = fromIntegral sceneHeight / 1.8 threshold = 16 iterations = 100 :: Int julia4D _ 0= (# alphaBlue, 1 #) -- point is (probably) not in the set julia4D q it | qMagnitudeSquared q threshold = (# noColor, 1 #) -- point is in the set | otherwise = julia4D (qSquared q + c) (it - 1) where distance = scale * (qMagnitude qN) / (2 * (qMagnitude qN')) * log (qMagnitude qN) (# qN, qN' #) = disIter q (Q 1 0 0 0) iterations where disIter qn qn' 0 = (# qn, qn' #) disIter qn qn' i | qMagnitudeSquared qn threshold = (# qn, qn' #) | otherwise = disIter (qSquared qn + c) (2 * qn * qn') (i - 1) Please observe that distance is never used. And this is also what GHC warns. But the trick is that with having this part of a code in there, the program virtually never finishes (I killed it after 15 minutes). If I remove everything on and after the where distance line it finishes in 30 seconds. OK, the problem is with (# qN, qN' #), if this is changed to normal (qN, qN'), then it works. But to notice this ... This is something you have to spend a day for. My guess is that it was premature optimization that created this bug. Unboxed tuples are not the best answer for every situation. They are evaluated strictly! Which means: unboxedBottom x | False = (# 0, 0 #) | otherwise = unboxedBottom x let (# x, y #) = unboxedBottom 0 in 42 Is an infinite loop, not 42 as you would expect. So when you write: where (# ... #) = something You are requiring your program to evaluate 'something' regardless of whether it is needed. Unboxed tuples should be taken in the same vain as explicit strictness annotations: almost never use them, and let GHC do the work for you. If you are in that phase where you are doing performance tweaks and you think GHC's strictness analysis might not be picking up on some strict behavior in your program, add the annotation. If it makes it faster, great; if it doesn't change things, take it out! Best to underconstrain your program. But these days I try to make my programs fast by making the structure of my program apparent to the compiler, not by forcing it to do things in a certain way. Admittedly making the structure of a program apparent to the compiler is a rather subtle and brittle process. I'm sure people have at least brainstormed ways to help the compiler more. Luke ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
Hi! My guess is that it was premature optimization that created this bug. It is the root of all evil. ;-) Unboxed tuples are not the best answer for every situation. They are evaluated strictly! Then I have not understood the last paragraph correctly: http://www.haskell.org/ghc/docs/latest/html/users_guide/primitives.html Oh, no. It is like you say. I also use -funbox-strict-fields and Q is defined with strict fields. But I tried also without the switch and is it the same (it takes forever). But then qN and qN' does not have unboxed types. So it should be lazy? If you are in that phase where you are doing performance tweaks and you think GHC's strictness analysis might not be picking up on some strict behavior in your program, add the annotation. If it makes it faster, great; if it doesn't change things, take it out! Best to underconstrain your program. I completely agree. I am also a firm believer in the clean and pure code where I would leave all optimization to compiler and just write semantics into a program. But this project just showed me that there is still a long way of compiler development before that would be possible (and usable). That some simple refactoring of code which is not really changing semantics have a big influence on a performance because compiler uses it differently (polymorphic types instead of hardcoded types, passing function as an parameter instead of hardcode it). For example I have now defined my types as: type BasicReal = Double data Quaternion = Q !BasicReal !BasicReal !BasicReal !BasicReal deriving (Eq,Show) So that I can easily change the type everywhere. But it would be much nicer to write: data Quaternion a = Q !a !a !a !a deriving (Eq,Show) Only the performance of Num instance functions of Quaternion is then quite worse. Mitar ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
Hi! This is not all. If I compare performance of those two semantically same functions: castRayScene1 :: Ray - ViewportDotColor castRayScene1 (Ray vd o d) = ViewportDotColor vd (castRay' noColor 0) where castRay' color@(VoxelColor _ _ _ alpha) depth | depth depthOfField = color | alpha 0.001 = castRay' pointColor (depth + distance') | alpha 0.999 = color | otherwise = castRay' newColor (depth + distance') where (# pointColor, distance #) = worldScene (o + (d * depth)) distance' = max 1 distance newColor = addColor color pointColor and: castRay :: World - Ray - ViewportDotColor castRay world (Ray vd o d) = ViewportDotColor vd (castRay' noColor 0) where castRay' color@(VoxelColor _ _ _ alpha) depth | depth depthOfField = color | alpha 0.001 = castRay' pointColor (depth + distance') | alpha 0.999 = color | otherwise = castRay' newColor (depth + distance') where (# pointColor, distance #) = world (o + (d * depth)) distance' = max 1 distance newColor = addColor color pointColor castRayScene2 :: Ray - ViewportDotColor castRayScene2 = castRay worldScene is the program which uses castRayScene1 1.35 faster then the program which uses castRayScene2 (37 seconds versus 50 seconds). (Compiler with GHC 6.8.3 and -O2 switch. Program is executing almost just this function over and over again.) It is somehow award that passing function as an argument slow down the program so much. Is not Haskell a functional language and this such (functional) code reuse is one of its main points? Of course. I could use some preprocessor/template engine to change/find/replace castRay-like function into a castRayScene1 before compiling. But this somehow kills the idea of a compiler? Smart compiler. Which should do things for you? The same as my previous example. Where a hard-coded list was not optimized. (Like it would change during the execution.) It looks like my program would be interpreted and not compiled. Mitar ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
On 2008 Jul 11, at 19:46, Mitar wrote: It is somehow award that passing function as an argument slow down the program so much. Is not Haskell a functional language and this such (functional) code reuse is one of its main points? That is in fact the case; GHC's version of various Prelude functions refactors them to avoid passing functional arguments. IIRC the problem is that, while Haskell is indeed functional, passing a polymorphic function as an argument causes the runtime to have to look up which type is needed for every call, whereas if it's factored out it can be computed only once and (implicitly) let-bound. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED] system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED] electrical and computer engineering, carnegie mellon universityKF8NH ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
It is somehow award that passing function as an argument slow down the program so much. Is not Haskell a functional language and this such (functional) code reuse is one of its main points? I can think of a few reasons why function passing is slow: * There is an overhead to closure creation (I don't know how big this is in practice, but it can be significant) * GHC has little information about what function-typed variables mean, because the function they are bound to is not known until runtime. This precludes the use of inlining, rewrite rules etc, which are absolutely key factors in making Haskell fast Regarding your first example: currently GHC does not do loop unrolling. It probably should though because loop unrolling reduces braches and increases the amount of information about the execution path that is available statically (as in e.g. the case liberation transformation), which probably explains the increased performance you saw by doing it manually in your first email. Earlier this year I implemented something somewhat similar though: fusable list literals. In this example from your first email: world :: SpacePoint - VoxelColor world point = findColor [redSphere (0,50,0) 50, greenSphere (25,-50,0) 50, blueSphere (-150,0,150) 50] where findColor [] = noColor findColor (f:fs) = case f point of Just v - v Nothing - findColor fs If findColor had been a function defined in terms of foldr rather than using explicit recursion, then theres a good chance GHC 6.9 would have fused it with the list to yield your optimized, loop unrolled, version: world :: SpacePoint - VoxelColor world point = case redSphere (0,50,0) 50 point of Just v - v Nothing - case greenSphere (25,-50,0) 50 point of Just v - v Nothing - case blueSphere (-150,0,150) 50 point of Just v - v Nothing - noColor Incidentally, if in your most recent email castRayScene2 was your only used of castRay, GHC would have inlined the whole definition into that use site and you would have got castRayScene1 back again. So, GHC does try its best to make higher order functions fast :-). But it's quite tricky! All the best, Max ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Profiling nested case
2008/7/9 Mitar [EMAIL PROTECTED]: And it took 15 s. And also the profiling was like I would anticipate. Calculating points coordinates and checking spheres takes almost all time. So any suggestions how could I build a list of objects to check at runtime and still have this third performance? Why this big difference? I think the speed difference really comes from using a list to hold the spheres in your first two examples, to referring to them directly in case statements in your last example. Lists are going to introduce indirections and therefore are slower than referring directly to the values themselves. Maybe you would have better luck using arrays? Template Haskell is also an option - if you want to hard code your scene in another module, TH can turn it into that kind of case statement. Of course, as the scenes get more complex a series of nested cases isn't going to be too effecient. Justin ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Profiling nested case
Hi! I am making a simple raycasting engine and have a function which take a point in space and return a color of an object (if there is any) at this point in space. And because the whole thing is really slow (or was really slow) on simple examples I decided to profile it. It takes around 60 seconds for a 640x480 px image with 400 depth of field. This is at worst 122,880,000 calculations (if the scene is rather empty) of a coordinate of a point in space and then checking for a color. And 60 seconds look really a lot to me for that. So I went profiling and found out that the strange part of code is the main color checking function which has a list of objects (at this time the list is hardcoded). It looks like this: world :: SpacePoint - VoxelColor world point = case msum . sequence elements $ point of Just v - v Nothing - noColor where elements = [redSphere (0,50,0) 50, greenSphere (25,-50,0) 50, blueSphere (-150,0,150) 50] So three spheres in a world and I check if the point is in any of them. Like that: sphere :: SpacePoint - BasicReal - VoxelColor - WorldElement -- center of sphere, it's radius, it's color sphere (x0,y0,z0) r color (x,y,z) | x' * x' + y' * y' + z' * z' = r * r = Just color | otherwise= Nothing where x' = x - x0 y' = y - y0 z' = z - z0 redSphere :: SpacePoint - BasicReal - WorldElement redSphere c r = sphere c r redColor So profiling told me that world function takes 38.4 % of all running time. So I decided to play with it. Maybe a more direct approach would be better: world :: SpacePoint - VoxelColor world point = findColor [redSphere (0,50,0) 50, greenSphere (25,-50,0) 50, blueSphere (-150,0,150) 50] where findColor [] = noColor findColor (f:fs) = case f point of Just v - v Nothing - findColor fs Great, it improved. To 40 s. But still it was too much. I tried this: world :: SpacePoint - VoxelColor world point = case redSphere (0,50,0) 50 point of Just v - v Nothing - case greenSphere (25,-50,0) 50 point of Just v - v Nothing - case blueSphere (-150,0,150) 50 point of Just v - v Nothing - noColor And it took 15 s. And also the profiling was like I would anticipate. Calculating points coordinates and checking spheres takes almost all time. So any suggestions how could I build a list of objects to check at runtime and still have this third performance? Why this big difference? (I am using GHC 6.8.3 with -O2 compile switch.) (The * operator is casting a ray, that is multiplying a ray direction vector with a scalar factor.) Mitar Main-case.prof Description: Binary data Main-rec.prof Description: Binary data Main-seq.prof Description: Binary data ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe