Hi,

> Hi, if I apply '%res = fadd <4 x float> %arg0, %arg1' on an SSE2
> capable machine, will that `fadd` instruction generate operations that
> work on the 128-bit XMM registers?
> > I get "addps %xmm1, %xmm0" for that
> > nicholas: Awesome, that's an SSE instruction. Thanks for your help.
>

rl has pointed out that we want to do arbitrarily complicated operations on
each pair of chunks in a vector.

It seems naive to add a specific primop for each width, say Float4# = a
packed 4 x 32-bit float: when 256-bit (8 x 32-bit float) SIMD processors
come around, we then have to add that whole set of primops all over again.

What would be nice is a `zipWith#` primitive, e.g.

zipFloatArrayWith# :: Array# Float# -> Array# Float#
                   -> (Float# -> Float# -> Float#) -> Array# Float#
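
To pin down what it computes (as opposed to how it is lowered), here is a
list-level model in plain Haskell; the names are mine, and it uses boxed
Float rather than Float# so that it actually runs:

    -- Reference semantics only: the primitive is just zipWith with the
    -- function argument last.  The interesting part is the lowering,
    -- not the result.
    zipFloatArrayWith :: [Float] -> [Float] -> (Float -> Float -> Float)
                      -> [Float]
    zipFloatArrayWith xs ys f = zipWith f xs ys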

Fusion optimisation can then optimise the Haskell expression

> let u = sin (v + 2*w) + sqrt (v + 1)

to

> let u = zipWith v w (\ v' w' -> sin (v' + 2*w') + sqrt (v' + 1))

which uses the zipFloatArrayWith# primitive over Float#s.
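
Spelled out with the list model above, the unfused and fused forms compute
the same array, but the fused form makes a single traversal with no
intermediate arrays:

    -- Unfused: three traversals and two intermediate arrays.
    uUnfused :: [Float] -> [Float] -> [Float]
    uUnfused v w = zipWith (+) (map sin (zipWith (\x y -> x + 2*y) v w))
                               (map (\x -> sqrt (x + 1)) v)

    -- Fused: one traversal through zipFloatArrayWith.
    uFused :: [Float] -> [Float] -> [Float]
    uFused v w =
      zipFloatArrayWith v w (\v' w' -> sin (v' + 2*w') + sqrt (v' + 1))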

We can then leave it to the code generator, which knows the CPU
specifics, to choose the chunk size.  That is, on an old, pre-SSE x86
the processing would be done on individual Float#s, i.e. in LLVM we
would use `float`, whereas on an SSE machine that has SIMD operations
on 4 x float, the processing is done on Float4# chunks, i.e. in LLVM we
would use `<4 x float>`.

I'm not certain which direction this should take.  Thoughts?

Maybe we should provide both: Float4# (vector) primitives, for
individual operations and explicit programmer control, and zipWith#
primitives that the code generator lowers to different vector widths
depending upon the architecture?
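
For the explicit route, a boxed stand-in gives the flavour; Float4 below
is a hypothetical placeholder for an unboxed Float4#:

    data Float4 = Float4 !Float !Float !Float !Float

    -- What a fixed-width primop like plusFloat4# would compute; on an
    -- SSE machine the real thing should lower to a single addps.
    plusFloat4 :: Float4 -> Float4 -> Float4
    plusFloat4 (Float4 a b c d) (Float4 a' b' c' d') =
      Float4 (a + a') (b + b') (c + c') (d + d')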

Vivian
