#3557: SIMD operations in GHC.Prim
---------------------------------+------------------------------------------
    Reporter:  guest             |        Owner:  vivian      
        Type:  feature request   |       Status:  new         
    Priority:  normal            |    Milestone:  _|_         
   Component:  Compiler (NCG)    |      Version:  6.11        
    Keywords:                    |     Testcase:              
   Blockedby:                    |   Difficulty:  Unknown     
          Os:  Unknown/Multiple  |     Blocking:              
Architecture:  Unknown/Multiple  |      Failure:  None/Unknown
---------------------------------+------------------------------------------
Changes (by vivian):

 * cc: haskell.vivian.mcph...@… (added)
 * owner:  => vivian


Comment:

 Okay, this might take a while.  First set of questions:

 I'm thinking of making a module `GHC.Prim.SSE` that contains the new ops.

 1) SSE instructions are CPU-specific, so we need a way to check whether
 the CPU supports the various SSE extensions (SSE, SSE2, SSE3, SSE4, ...),
 e.g. via the CPUID instruction.

 a) If an extension is '''not''' supported, does the primop simply not get
 defined, or do we hand-code a fallback definition for it?  And how do we
 propagate this information to user code?  We could have functions
 {{{
 sse :: Bool
 sse2 :: Bool
 sse3 :: Bool
 ...
 }}}
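 As a strawman, user code could branch on such flags.  Here is a minimal,
 pure-Haskell sketch; the `sse2` constant and `dotProduct` are stand-ins
 invented for illustration, not real exports of the proposed module:
 {{{
 module Main where

 -- Hypothetical feature flag; a real implementation would query CPUID
 -- once at program start (or let the RTS do it).
 sse2 :: Bool
 sse2 = False

 -- Pick a vectorised or scalar path depending on CPU support.  Here
 -- both branches fall back to the scalar path, since the SSE2 path
 -- does not exist yet.
 dotProduct :: [Float] -> [Float] -> Float
 dotProduct xs ys
   | sse2      = scalarDot xs ys  -- would call the SSE2 path if defined
   | otherwise = scalarDot xs ys

 scalarDot :: [Float] -> [Float] -> Float
 scalarDot xs ys = sum (zipWith (*) xs ys)

 main :: IO ()
 main = print (dotProduct [1, 2, 3, 4] [5, 6, 7, 8])
 }}}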

 b) This also affects cross-compilation, where checking the CPU of the
 build machine doesn't tell us about the capabilities of the target
 machine.

 c) Do we include memory-management primops?  Specifically, there are
 opcodes that bypass the cache (non-temporal loads/stores), which is
 helpful for live, streaming data.
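 For illustration, a non-temporal store primop might look like the
 following; the name and the `Xmm128#` type are invented for this sketch
 (MOVNTPS is the underlying x86 instruction):
 {{{
 -- Hypothetical, for illustration only: store a 128-bit value to
 -- memory without polluting the cache (what MOVNTPS does at the
 -- machine level).
 writeXmmNonTemporalOffAddr#
   :: Addr# -> Int# -> Xmm128# -> State# RealWorld -> State# RealWorld
 }}}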

 2) One of the instructions is a dot-product instruction that takes
 (A0,A1,A2,A3), (B0,B1,B2,B3) as packed 32-bit floats and returns
 (A0*B0)+(A1*B1)+(A2*B2)+(A3*B3).  This would work really well with a
 streaming data type: the first pass (over a vector of 32-bit floats)
 computes 4-element partial dot products, and the next pass computes the
 sum of those partial results.
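 As a model of that two-pass scheme, here is plain (non-SIMD) Haskell
 that computes a partial dot product per 4-element chunk in a first pass
 -- the SSE4.1 DPPS instruction does such a 4-wide dot product in
 hardware -- and sums the chunk results in a second pass:
 {{{
 module Main where

 import Data.List (unfoldr)

 -- Split a list into chunks of four (the last chunk may be shorter).
 chunks4 :: [a] -> [[a]]
 chunks4 = unfoldr (\xs -> if null xs then Nothing else Just (splitAt 4 xs))

 -- Pass 1: one partial dot product per 4-element chunk (the part DPPS
 -- would do in hardware).
 partialDots :: [Float] -> [Float] -> [Float]
 partialDots xs ys =
   zipWith (\as bs -> sum (zipWith (*) as bs)) (chunks4 xs) (chunks4 ys)

 -- Pass 2: sum the per-chunk results.
 dot :: [Float] -> [Float] -> Float
 dot xs ys = sum (partialDots xs ys)

 main :: IO ()
 main = print (dot [1 .. 8] [1 .. 8])
 }}}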

 a) I seem to recall that `Data.Vector.Unboxed` is faster than
 `Data.Vector.Storable`.  My initial thought for packing/unpacking four
 32-bit values into a 128-bit unit would be to use peek/poke.  Do unboxed
 tuples give any guarantees about alignment and layout in memory?  It
 would be great to have a function like
 {{{
 packFloat :: Float# -> Float# -> Float# -> Float# -> Xmm128#
 packFloat a b c d = (# a, b, c, d #)
 }}}
 The real win is when we have contiguous sequences of well-aligned floats
 in memory, so we can fetch 128-bit chunks at a time, bypassing the cache
 if necessary.
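 The peek/poke route can be sketched today with `Foreign`: round-trip
 four Floats through a 16-byte-aligned buffer, which is the alignment an
 aligned 128-bit SSE load (MOVAPS) requires.  `packUnpack` is a name
 invented for this sketch:
 {{{
 module Main where

 import Foreign.Marshal.Alloc (allocaBytesAligned)
 import Foreign.Ptr (Ptr, castPtr)
 import Foreign.Storable (peekElemOff, pokeElemOff)

 -- Write four Floats into a 16-byte-aligned scratch buffer and read
 -- them back, mimicking pack followed by unpack.
 packUnpack :: [Float] -> IO [Float]
 packUnpack fs@[_, _, _, _] =
   allocaBytesAligned 16 16 $ \p -> do
     let fp = castPtr p :: Ptr Float
     mapM_ (\(i, f) -> pokeElemOff fp i f) (zip [0 .. 3] fs)
     mapM (peekElemOff fp) [0 .. 3]
 packUnpack _ = error "packUnpack: expects exactly four Floats"

 main :: IO ()
 main = packUnpack [1, 2, 3, 4] >>= print
 }}}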

 b) Also, what is the relationship between boxed and unboxed numbers?  Is
 `Float# -> Float` a no-op?  The 'unboxed' vectors in `Data.Vector.Unboxed`
 appear to hold not types like {{{Int#, Int8#}}} but rather {{{Int,
 Int8}}}.
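 For the record, `Float` is literally a box around the raw value
 ({{{data Float = F# Float#}}} in GHC), so `Float# -> Float` is a
 constructor application, not a no-op:
 {{{
 {-# LANGUAGE MagicHash #-}

 module Main where

 import GHC.Exts (Float (F#), Float#)

 -- Boxing wraps the raw value in the F# constructor (an allocation,
 -- unless GHC can cancel it away) ...
 box :: Float# -> Float
 box f = F# f

 -- ... and matching on F# unboxes it again.
 unbox :: Float -> Float#
 unbox (F# f) = f

 main :: IO ()
 main = print (box (unbox 1.5))
 }}}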

 So, my plan is to start by adding primops to GHC.Prim, then follow the
 code through the compiler to Cmm and the code generators, making changes
 as necessary.

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/3557#comment:5>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler