agreed, this level of unrolling would cause problems even in most C compilers! When I write unrolled simd C code, I use a fixed number of variables that corresponds to the # of registers that are live on my target arch (yes, internally the turn into SSA which then does an ANF/CPS style and such), but by shadowing/resuing a fixed number of names, i can make it "syntactically" clear what the lifetimes of my variables is.
But yeah, branch predictors are pretty good on modern hardware, a loop is worth considering. On Sun, Jan 12, 2014 at 10:06 PM, Ben Lippmeier <[email protected]> wrote: > > On 10/01/2014, at 6:17 , Adam Wick <[email protected]> wrote: > > > That’s the problem with SHA, then. The implementation (and the spec, > really) is essentially a long combination of the form: > > > > let x_n5 = small_computation x_n1 x_n2 x_n3 x_n4 > > x_n6 = small_computation x_n2 x_n3 x_n4 x_n5 > > … > > > > Which has ~70 entries. The actual number of live variables alive at any > time should be relatively small, but if slots aren’t getting reused there’s > going to be some significant blowup. (To be honest, I had figured — and > thought I had validated — that doing it this way would give the compiler > the best chance at generating optimal code, but it appears I merely set > myself up to hit this limitation several years later.) > > If this [1] is the current version then I don't think there is any > performance reason to manually unroll the loops like that. If you write a > standard tail-recursive loop then the branch predictor in the processor > should make the correct prediction for all iterations except the last one. > You'll get one pipeline stall at the end due to a mis-predicted backward > branch, but it won't matter in terms of absolute percentage of execution > time. You generally only need to worry about branches if the branch flag > flips between True and False frequently. > > If you care deeply about performance then on some processors it can be > helpful to unfold this sort of code so that the SHA constants are > represented as literals in the instruction stream instead of in static data > memory -- but that ability is very processor specific and you'd need to > really stare at the emitted assembly code to see if it's worthwhile. > > Ben. > > > [1] https://github.com/GaloisInc/SHA/blob/master/Data/Digest/Pure/SHA.hs > > > _______________________________________________ > ghc-devs mailing list > [email protected] > http://www.haskell.org/mailman/listinfo/ghc-devs >
_______________________________________________ ghc-devs mailing list [email protected] http://www.haskell.org/mailman/listinfo/ghc-devs
