I understand from your later post that is was in fact specialized, but
how do I make sure it _is_ specialized?

-ddump-tc seems to give the generalized type, so it seems you'd need
to look at the -ddump-simpl output if you want to know whether a
local function is specialized.

http://www.haskell.org/haskellwiki/Performance/GHC#Looking_at_the_Core

Can I just add a type signature in the dist_fast definition for euclidean,

If you need it at exactly one type, that is the simplest way. There's
also the SPECIALIZE pragma

http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#specialize-pragma

and for local and non-exported functions '-O' should enable auto-specialization

http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html


After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet:

http://hackage.haskell.org/trac/ghc/ticket/3123
http://hackage.haskell.org/trac/ghc/wiki/Inlining

And even if that gets implemented, it doesn't apply directly to your
case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation.

This seems rather serious, and might be one of the bigger reasons why
I'm getting nowhere close to C in terms of performance...

You should be able to get near (a small factor) without it, but not having it leaves a substantial gap in performance, especially for simple loop bodies (there is also the case of enabling fusion over multiple loop iterations, but that may call for proper fix-point fusion).

The loop body is ridiculously small, so it would make sense to
unroll it somewhat to help avoid the loop overhead.
However, it seems like GHC isn't able to do that now.

Apart from the links above, the new backend also has relevant TODO
items: http://hackage.haskell.org/trac/ghc/wiki/BackEndNotes

Is there any way to unroll the loop myself, to speed things up?
Seems hard, because I'm using uvector...

You could 'cabal unpack' uvector, hand-unroll the core loop all
its operations get fused into, then reinstall the modified package..
(perhaps that should be a package configuration option, at least
until GHC gets loop or recursion unrolling - since this would be
a temporary workaround, it would probably not be worth it to have multiple package versions with different unroll factors; if this actually helps uvector users with performance in practice, they could suggest it as a patch).

Claus


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to