Hi,

> Hi, if I apply '%res = fadd <4 x float> %arg0, %arg1' on an SSE2
> capable machine, will that `fadd` instruction generate operations that
> work in the XMM 128-bit registers?
>
> I get "addps %xmm1, %xmm0" for that.
>
> nicholas: Awesome, that's an SSE instruction. Thanks for your help.
rl has pointed out that we want to do arbitrarily complicated operations on each pair of chunks in a vector. Adding a specific primop to act on, say, Float4# = packed 4 x 32-bit floats seems naive: when 256-bit = 8 x 32-bit float SIMD processors come around, we would have to add that whole set of primops all over again.

What would be nice is a `zipWith#` primitive, e.g.

    zipFloatArrayWith# :: Array# Float#
                       -> Array# Float#
                       -> (Float# -> Float# -> Float#)
                       -> Array# Float#

Fusion optimisation can then optimise the Haskell expression

    let u = sin (v + 2*w) + sqrt (v + 1)

to

    let u = zipFloatArrayWith# v w (\ v' w' -> sin (v' + 2*w') + sqrt (v' + 1))

which uses the zipFloatArrayWith# primitive over Float#s.

We can then leave it up to the code generator, which knows about CPU specifics, to choose the chunk size. That is, on an old, pre-SSE x86 the processing would be done on individual Float#s, i.e. in LLVM we would use `float`, whereas on an SSE machine with SIMD operations on 4 x float, the processing would be done on Float4# chunks, i.e. in LLVM we would use `<4 x float>`.

I'm not certain which direction this should take. Thoughts? Maybe we should provide both: Float4# (vector) primitives, for individual operations and programmer control, and zipWith# primitives that fire with different vector widths depending on the architecture.

Vivian
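P.S. To make the semantics concrete, here is a minimal reference model of what the proposed primop would compute. Everything below is illustrative, not a real GHC API: Data.Vector.Unboxed (from the vector package) stands in for Array# Float#, and zipFloatArrayWith is the lifted analogue of zipFloatArrayWith#. The real primop would live in the unlifted world and leave the chunk width to the code generator.

    import qualified Data.Vector.Unboxed as V

    -- Lifted stand-in for the proposed
    --   zipFloatArrayWith# :: Array# Float# -> Array# Float#
    --                      -> (Float# -> Float# -> Float#) -> Array# Float#
    -- Argument order follows the primop sketch above: arrays first,
    -- element function last.
    zipFloatArrayWith :: V.Vector Float -> V.Vector Float
                      -> (Float -> Float -> Float) -> V.Vector Float
    zipFloatArrayWith u w f = V.zipWith f u w

    -- The fused form of  sin (v + 2*w) + sqrt (v + 1)  from the example:
    fused :: V.Vector Float -> V.Vector Float -> V.Vector Float
    fused v w = zipFloatArrayWith v w
                  (\ v' w' -> sin (v' + 2*w') + sqrt (v' + 1))

    main :: IO ()
    main = print (fused (V.fromList [0, 1, 2, 3])
                        (V.fromList [0.5, 1, 1.5, 2]))

Note that this definition says nothing about chunk width, which is the whole point: whether the loop body is emitted over `float` or `<4 x float>` (or 8 x float later) is entirely the code generator's business.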
_______________________________________________
Glasgow-haskell-bugs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs
