Wow, now performance is on par with Java ;)So slow division was main problem, that and GC . Thanks!
> From: [email protected] > To: [email protected] > CC: [email protected] > Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow? > Date: Sat, 1 Dec 2012 21:12:29 +0100 > > On Samstag, 1. Dezember 2012, 16:09:05, Branimir Maksimovic wrote: > > All in all even unboxed array is about 10 times slower than Java version. > > I don't understand why is even unboxed array so slow. > > It's not the unboxed arrays that are slow. > > Your code has a couple of weak spots, and GHC's native code generator has a > weakness that bites here. > > On my box, I don't quite have a 10× difference to my translation to Java, > it's > a bit less than 7× (0.82s vs 0.12s - I don't want to bring my box to its > knees > by running something that takes 3GB+ of RAM, so I run the unboxed array part > only) with the LLVM backend and 8× (0.93s) with the native code generator. > That's in the same ballpark, though. > > So what's the deal? > > Main.main_$s$wa1 [Occ=LoopBreaker] > :: GHC.Prim.Int# > -> GHC.Prim.Int# > -> GHC.Prim.State# GHC.Prim.RealWorld > -> GHC.Types.Int > -> GHC.Types.Int > -> GHC.Types.Int > -> ... > > Your loops carry boxed Ints around, that's always a bad sign. In this case it > doesn't hurt too much, however, since these values are neither read nor > substituted during the loop (they're first and last index of the array and > number of elements). Additionally, they carry an IOUArray constructor around. > That is unnecessary. Eliminating a couple of dead parameters > > > init' a = do > (_,n) <- getBounds a > let init k > | k > n = return () > | otherwise = do > let x = fromIntegral $ k + k `div` 3 > unsafeWrite a k x > init (k+1) > init 0 > > partial_sum a = do > (_,n) <- getBounds a > let ps i s > | i > n = return () > | otherwise = do > k <- unsafeRead a i > let l = s + k > unsafeWrite a i l > ps (i+1) l > k <- unsafeRead a 0 > ps 1 k > > brings the time for the native code generator down to 0.82s, and for the LLVM > backend the time remains the same. > > Next problem, you're using `div` for the division. > > `div` does some checking and potentially fixup (not here, since everything is > non-negative) after the machine division because `div` is specified to satisfy > > a = (a `div` b) * b + (a `mod` b) > > with 0 <= a `mod` b < abs b. > > That is in itself slower than the pure machine division you get with quot. > > So let's see what we get with `quot`. > > 0.65s with the native code generator, and 0.13 with the LLVM backend. > > Whoops, what's that? > > The problem is, as can be seen by manually replacing k `quot` 3 with > > (k *2863311531) `shiftR` 33 > > (requires 64-bit Ints; equivalent in Java: k*28..1L >> 33), when the native > backend, the LLVM backend and Java (as well as C) all take more or less the > same time [well, the NCG is a bit slower than the other two, 0.11s, 0.11s, > 0.14s], that division is a **very** slow operation. > > Java and LLVM know how to replace the division by the constant 3 with a > mulitplication, a couple of shifts and an addition (since we never have > negative numbers here, just one multiplication and shift suffice, but neither > Java nor LLVM can do that on their own because it's not guaranteed by the > type). The native code generator doesn't - not yet. > > So the programme spends the majority of the time dividing. The array reads > and > writes are on par with Java's (and, for that matter, C's). > > If you make the divisor a parameter instead of a compile time constant, the > NCG is not affected at all, the LLVM backend gives you equal performance (it > can't optimise a division by a divisor it doesn't know). Java is at an > advantage there, after a while the JIT sees that it might be a good idea to > optimise the division and so its time only trebles.
_______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
