All, I've been toying with the SSE code generation in GHC 7.7 and Geoffrey Mainland's work to integrate this into the 'vector' library in order to generate SIMD code from high-level Haskell code.
While working with this, I wrote some simple code for testing purposes, then compiled it into LLVM IR and x86_64 assembly in order to figure out how 'good' the resulting code would be. First and foremost: I'm really impressed. Whilst there's most certainly room for improvement (one of these is touched upon in this mail; I also noticed unnecessary constant memory reads inside a tight loop), the initial results look very promising, especially taking into account how high-level the source code is. This is pretty amazing!

As an example, here's 'test.hs':

    {-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-}
    module Test (sum) where

    import Prelude hiding (sum)
    import Data.Int (Int32)

    import Data.Vector.Unboxed (Vector)
    import qualified Data.Vector.Unboxed as U

    sum :: Vector Int32 -> Int32
    sum v = U.mfold' (+) (+) 0 v

When compiling this into assembly (compiler/library version details at the end of this message), the 'sum' function yields (among other things) this inner loop:

    .LBB2_3:                      # %c1C0
                                  # =>This Inner Loop Header: Depth=1
        prefetcht0 (%rsi)
        movdqu -1536(%rsi), %xmm1
        paddd %xmm1, %xmm0
        addq $16, %rsi
        addq $4, %rcx
        cmpq %rdx, %rcx
        jl .LBB2_3

The full LLVM IR and assembler output are attached to this message.

Whilst this is a nice and tight loop, I noticed the use of 'movdqu', the SSE instruction for memory access that is not known to be 128-bit aligned. For aligned memory, 'movdqa' can be used instead, and this can have a major performance impact. (At the IR level the cause is visible in the attached test.ll: the vector load in the inner loop, '%ln1Jc = load <4 x i32>* %ln1Jb, align 1, !tbaa !5', only promises 1-byte alignment, so the backend has to select the unaligned instruction; see the illustration below.)

Whilst I understand why this code is currently generated as-is (also for other sample inputs), I wondered whether there are plans or approaches to tackle this. In some cases (e.g. in 'sum') this could be done by performing the scalar calculation at the beginning of the vector up until an aligned boundary, then using aligned access, and finally handling the tail using scalars again (a rough sketch follows below). But I assume, on the other hand, that this is not trivial when multiple 'source' vectors are used in the calculation, and it becomes even more complex with AVX code, which needs 256-bit alignment.

Whilst I can't propose an out-of-the-box solution, I'd like to point at the 'vector-simd' code [1] I wrote some months ago, which might offer some ideas (a condensed sketch of the idea is also included below). In this package, I created an unboxed vector-like type whose alignment is tracked at the type level, and functions which consume a vector declare the minimal alignment they require. As such, vectors can be allocated at the minimal alignment required of them, throughout all code using them.

As an example, if I'd use this code (off the top of my head):

    sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
           => Vector o1 a -> Vector o2 a
    sseFoo = undefined

    avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2,
               AlignedToAtLeast A32 o3)
           => Vector o1 a -> Vector o2 a -> Vector o3 a
    avxFoo = undefined

the type of

    combinedFoo v = avxFoo sv sv
      where
        sv = sseFoo v

would automagically be

    combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A32 o2)
                => Vector o1 a -> Vector o2 a

and when using this

    v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])

the allocated argument vector (the result of Vector.fromList) will be 16-byte aligned, as expected/required for the SSE function to use aligned loads internally (assuming no unaligned slices are supported, etc.), whilst the intermediate result 'sv' of 'sseFoo' will be 32-byte aligned, as required by 'avxFoo'.
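To make the movdqu/movdqa connection concrete, here is a minimal hand-written illustration (not taken from the attached files) of how the 'align' attribute on an IR load drives instruction selection; whether 'movdqa' is actually emitted of course still depends on the target and optimisation settings:

    ; Sketch: the same <4 x i32> load, differing only in the promised
    ; alignment (typed-pointer syntax of the LLVM 3.1 era, as in test.ll).
    define <4 x i32> @load_unaligned(<4 x i32>* %p) {
      %v = load <4 x i32>* %p, align 1    ; must assume unaligned -> movdqu
      ret <4 x i32> %v
    }

    define <4 x i32> @load_aligned(<4 x i32>* %p) {
      %v = load <4 x i32>* %p, align 16   ; 16-byte promise -> movdqa possible
      ret <4 x i32> %v
    }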
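For reference, here is a rough sketch of the head/body/tail strategy mentioned above, written in plain Haskell over a raw pointer. This is purely illustrative: 'sumAligned' is hypothetical and not part of 'vector', the "body" loop is a scalar stand-in for where the movdqa-style aligned vector loads would go, and the pointer is assumed to be at least element (4-byte) aligned, as it is for an Int32 buffer:

    {-# LANGUAGE BangPatterns #-}
    module AlignedSum (sumAligned) where

    import Data.Bits ((.&.))
    import Data.Int (Int32)
    import Foreign.Ptr (Ptr, ptrToIntPtr)
    import Foreign.Storable (peekElemOff)

    -- Sum 'n' Int32s at 'p': a scalar head up to the first 16-byte
    -- boundary, a multiple-of-4-elements "SIMD" body (scalar stand-in
    -- here), and a scalar tail.
    sumAligned :: Ptr Int32 -> Int -> IO Int32
    sumAligned p n = do
        let addr    = fromIntegral (ptrToIntPtr p) :: Int
            -- elements before the first 16-byte boundary; (-addr) .&. 15
            -- is the number of bytes up to that boundary
            headLen = min n (((negate addr) .&. 15) `div` 4)
            bodyLen = ((n - headLen) `div` 4) * 4
        h <- scalarSum 0 headLen
        b <- scalarSum headLen (headLen + bodyLen)  -- aligned vector loads here
        t <- scalarSum (headLen + bodyLen) n
        return (h + b + t)
      where
        scalarSum from to = go from 0
          where
            go !i !acc
                | i >= to   = return acc
                | otherwise = do
                    x <- peekElemOff p i
                    go (i + 1) (acc + x)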
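And a condensed, hedged reconstruction of the type-level machinery behind 'vector-simd'; the actual package may structure its classes and instances differently, and 'Vector' below is a payload-free placeholder:

    {-# LANGUAGE EmptyDataDecls, MultiParamTypeClasses, FlexibleInstances,
                 ScopedTypeVariables #-}
    module AlignSketch where

    import Foreign.Storable (Storable)

    -- Phantom alignment tags
    data A16  -- 16-byte (SSE)
    data A32  -- 32-byte (AVX)

    -- 'AlignedToAtLeast n o' witnesses that alignment 'o' satisfies
    -- requirement 'n'; stricter alignments satisfy weaker requirements.
    class AlignedToAtLeast n o
    instance AlignedToAtLeast A16 A16
    instance AlignedToAtLeast A16 A32
    instance AlignedToAtLeast A32 A32

    -- Placeholder for a vector whose buffer alignment 'o' is tracked in
    -- the type; the real type wraps a suitably aligned buffer.
    data Vector o a = Vector

    sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
           => Vector o1 a -> Vector o2 a
    sseFoo _ = Vector

    avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2,
               AlignedToAtLeast A32 o3)
           => Vector o1 a -> Vector o2 a -> Vector o3 a
    avxFoo _ _ = Vector

    -- The intermediate vector is pinned at A32, the strictest requirement
    -- among its consumers, so only A16 is demanded of the argument.
    combinedFoo :: forall a o1 o2.
                   (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A32 o2)
                => Vector o1 a -> Vector o2 a
    combinedFoo v = avxFoo sv sv
      where
        sv = sseFoo v :: Vector A32 a

Note that in this sketch 'sv' has to be pinned to A32 by hand; the real package presumably lets inference pick the strictest alignment, which is exactly the 'automagic' behaviour described above.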
Attached: test.ll and test.s, the compilation results of test.hs using

    $ ghc-7.7.20130302 -keep-llvm-files \
        -package-db=cabal-dev/packages-7.7.20130302.conf \
        -fforce-recomp -S test.hs

GHC is from HEAD/master, compiled on my Fedora 18 system against the system LLVM (3.1), with 'primitive' at commit 8aef578fa5e7fb9fac3eac17336b722cbae2f921 from git://github.com/mainland/primitive.git and 'vector' at commit e1a6c403bcca07b4c8121753daf120d30dedb1b0 from git://github.com/mainland/vector.git.

Nicolas

[1] https://github.com/NicolasT/vector-simd
---------- test.ll ----------

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-linux-gnu"

declare ccc i8* @memcpy(i8*, i8*, i64)
declare ccc i8* @memmove(i8*, i8*, i64)
declare ccc i8* @memset(i8*, i64, i64)
declare ccc i64 @newSpark(i8*, i8*)

!0 = metadata !{metadata !"top"}
!1 = metadata !{metadata !"stack",metadata !0}
!2 = metadata !{metadata !"heap",metadata !0}
!3 = metadata !{metadata !"rx",metadata !2}
!4 = metadata !{metadata !"base",metadata !0}
!5 = metadata !{metadata !"other",metadata !0}

%__stginit_Test_struct = type <{}>
@__stginit_Test = global %__stginit_Test_struct<{}>

%Test_zdwa_closure_struct = type <{i64}>
@Test_zdwa_closure = global %Test_zdwa_closure_struct<{i64 ptrtoint (void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @Test_zdwa_info to i64)}>

%Test_sum1_closure_struct = type <{i64}>
@Test_sum1_closure = global %Test_sum1_closure_struct<{i64 ptrtoint (void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @Test_sum1_info to i64)}>

%Test_sum_closure_struct = type <{i64}>
@Test_sum_closure = global %Test_sum_closure_struct<{i64 ptrtoint (void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @Test_sum_info to i64)}>

%S1DM_srt_struct = type <{}>
@S1DM_srt = internal constant %S1DM_srt_struct<{}>

%s1xB_entry_struct = type <{i64, i64, i64}>
@s1xB_info_itable = internal constant %s1xB_entry_struct<{i64 8589934602, i64 8589934593, i64 9}>, section "X98A__STRIP,__me1", align 8

define internal cc 10 void @s1xB_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me2" {
c1AJ:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 %R3_Arg, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1xr = alloca i64, i32 1
  %ls1xy = alloca i64, i32 1
  %ls1xB = alloca i64, i32 1
  %ln1EB = load i64* %R3_Var
  store i64 %ln1EB, i64* %ls1xr
  %ln1EC = load i64* %R2_Var
  store i64 %ln1EC, i64* %ls1xy
  %ln1ED = load i64* %R1_Var
  store i64 %ln1ED, i64* %ls1xB
  %ln1EE = load i64* %ls1xr
  %ln1EF = load i64* %ls1xB
  %ln1EG = add i64 %ln1EF, 14
  %ln1EH = inttoptr i64 %ln1EG to i64*
  %ln1EI = load i64* %ln1EH, !tbaa !5
  %ln1EJ = icmp sge i64 %ln1EE, %ln1EI
  br i1 %ln1EJ, label %c1AN, label %c1AM
c1AM:
  %ln1EK = load i64* %ls1xr
  %ln1EL = add i64 %ln1EK, 1
  store i64 %ln1EL, i64* %R3_Var
  %ln1EM = load i64* %ls1xy
  %ln1EN = load i64* %ls1xB
  %ln1EO = add i64 %ln1EN, 6
  %ln1EP = inttoptr i64 %ln1EO to i64*
  %ln1EQ = load i64* %ln1EP, !tbaa !5
  %ln1ER = load i64* %ls1xB
  %ln1ES = add i64 %ln1ER, 22
  %ln1ET = inttoptr i64 %ln1ES to i64*
  %ln1EU = load i64* %ln1ET, !tbaa !5
  %ln1EV = load i64* %ls1xr
  %ln1EW = add i64 %ln1EU, %ln1EV
  %ln1EX = shl i64 %ln1EW, 2
  %ln1EY = add i64 %ln1EX, 16
  %ln1EZ = add i64 %ln1EQ, %ln1EY
  %ln1F0 = inttoptr i64 %ln1EZ to i32*
  %ln1F1 = load i32* %ln1F0, !tbaa !5
  %ln1F2 = sext i32 %ln1F1 to i64
  %ln1F3 = add i64 %ln1EM, %ln1F2
  %ln1F4 = trunc i64 %ln1F3 to i32
  %ln1F5 = sext i32 %ln1F4 to i64
  store i64 %ln1F5, i64* %R2_Var
  %ln1F6 = load i64* %ls1xB
  store i64 %ln1F6, i64* %R1_Var
  %ln1F7 = load i64** %Base_Var
  %ln1F8 = load i64** %Sp_Var
  %ln1F9 = load i64** %Hp_Var
  %ln1Fa = load i64* %R1_Var
  %ln1Fb = load i64* %R2_Var
  %ln1Fc = load i64* %R3_Var
  %ln1Fd = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @s1xB_info( i64* %ln1F7, i64* %ln1F8, i64* %ln1F9, i64 %ln1Fa, i64 %ln1Fb, i64 %ln1Fc, i64 undef, i64 undef, i64 undef, i64 %ln1Fd ) nounwind
  ret void
c1AN:
  %ln1Fe = load i64* %ls1xy
  store i64 %ln1Fe, i64* %R1_Var
  %ln1Ff = load i64** %Sp_Var
  %ln1Fg = getelementptr inbounds i64* %ln1Ff, i32 0
  %ln1Fh = bitcast i64* %ln1Fg to i64*
  %ln1Fi = load i64* %ln1Fh, !tbaa !1
  %ln1Fj = inttoptr i64 %ln1Fi to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1Fk = load i64** %Base_Var
  %ln1Fl = load i64** %Sp_Var
  %ln1Fm = load i64** %Hp_Var
  %ln1Fn = load i64* %R1_Var
  %ln1Fo = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1Fj( i64* %ln1Fk, i64* %ln1Fl, i64* %ln1Fm, i64 %ln1Fn, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Fo ) nounwind
  ret void
}

%Test_zdwa_entry_struct = type <{i64, i64, i64}>
@Test_zdwa_info_itable = constant %Test_zdwa_entry_struct<{i64 4294967301, i64 0, i64 15}>, section "X98A__STRIP,__me3", align 8

define cc 10 void @Test_zdwa_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me4" {
c1Bf:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1xj = alloca i64, i32 1
  %ln1FV = load i64* %R2_Var
  store i64 %ln1FV, i64* %ls1xj
  %ln1FW = load i64** %Sp_Var
  %ln1FX = getelementptr inbounds i64* %ln1FW, i32 -4
  %ln1FY = ptrtoint i64* %ln1FX to i64
  %ln1FZ = load i64* %SpLim_Var
  %ln1G0 = icmp ult i64 %ln1FY, %ln1FZ
  br i1 %ln1G0, label %c1Cf, label %c1Ce
c1Ce:
  %ln1G1 = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @c1Bg_info to i64
  %ln1G2 = load i64** %Sp_Var
  %ln1G3 = getelementptr inbounds i64* %ln1G2, i32 -1
  store i64 %ln1G1, i64* %ln1G3, !tbaa !1
  %ln1G4 = load i64* %ls1xj
  store i64 %ln1G4, i64* %R1_Var
  %ln1G5 = load i64** %Sp_Var
  %ln1G6 = getelementptr inbounds i64* %ln1G5, i32 -1
  %ln1G7 = ptrtoint i64* %ln1G6 to i64
  %ln1G8 = inttoptr i64 %ln1G7 to i64*
  store i64* %ln1G8, i64** %Sp_Var
  %ln1G9 = load i64** %Base_Var
  %ln1Ga = load i64** %Sp_Var
  %ln1Gb = load i64** %Hp_Var
  %ln1Gc = load i64* %R1_Var
  %ln1Gd = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @stg_ap_0_fast( i64* %ln1G9, i64* %ln1Ga, i64* %ln1Gb, i64 %ln1Gc, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Gd ) nounwind
  ret void
c1Cf:
  %ln1Ge = load i64* %ls1xj
  store i64 %ln1Ge, i64* %R2_Var
  %ln1Gf = ptrtoint %Test_zdwa_closure_struct* @Test_zdwa_closure to i64
  store i64 %ln1Gf, i64* %R1_Var
  %ln1Gg = load i64** %Base_Var
  %ln1Gh = getelementptr inbounds i64* %ln1Gg, i32 -1
  %ln1Gi = bitcast i64* %ln1Gh to i64*
  %ln1Gj = load i64* %ln1Gi, !tbaa !4
  %ln1Gk = inttoptr i64 %ln1Gj to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1Gl = load i64** %Base_Var
  %ln1Gm = load i64** %Sp_Var
  %ln1Gn = load i64** %Hp_Var
  %ln1Go = load i64* %R1_Var
  %ln1Gp = load i64* %R2_Var
  %ln1Gq = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1Gk( i64* %ln1Gl, i64* %ln1Gm, i64* %ln1Gn, i64 %ln1Go, i64 %ln1Gp, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Gq ) nounwind
  ret void
}

declare cc 10 void @stg_ap_0_fast(i64* noalias nocapture, i64* noalias nocapture, i64* noalias nocapture, i64, i64, i64, i64, i64, i64, i64) align 8

%c1Bg_entry_struct = type <{i64, i64}>
@c1Bg_info_itable = internal constant %c1Bg_entry_struct<{i64 0, i64 32}>, section "X98A__STRIP,__me5", align 8

define internal cc 10 void @c1Bg_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me6" {
c1Bg:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 undef, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1yF = alloca i64, i32 1
  %ls1xu = alloca i64, i32 1
  %ls1xv = alloca i64, i32 1
  %ls1xs = alloca i64, i32 1
  %lc1Bn = alloca i64, i32 1
  %ls1xH = alloca i64, i32 1
  %ls1xX = alloca <4 x i32>, i32 1
  %ls1xL = alloca i64, i32 1
  %ln1I5 = load i64** %Hp_Var
  %ln1I6 = getelementptr inbounds i64* %ln1I5, i32 4
  %ln1I7 = ptrtoint i64* %ln1I6 to i64
  %ln1I8 = inttoptr i64 %ln1I7 to i64*
  store i64* %ln1I8, i64** %Hp_Var
  %ln1I9 = load i64* %R1_Var
  store i64 %ln1I9, i64* %ls1yF
  %ln1Ia = load i64** %Hp_Var
  %ln1Ib = ptrtoint i64* %ln1Ia to i64
  %ln1Ic = load i64** %Base_Var
  %ln1Id = getelementptr inbounds i64* %ln1Ic, i32 35
  %ln1Ie = bitcast i64* %ln1Id to i64*
  %ln1If = load i64* %ln1Ie, !tbaa !4
  %ln1Ig = icmp ugt i64 %ln1Ib, %ln1If
  br i1 %ln1Ig, label %c1Cb, label %c1BR
c1BR:
  %ln1Ih = load i64* %ls1yF
  %ln1Ii = add i64 %ln1Ih, 7
  %ln1Ij = inttoptr i64 %ln1Ii to i64*
  %ln1Ik = load i64* %ln1Ij, !tbaa !5
  store i64 %ln1Ik, i64* %ls1xu
  %ln1Il = load i64* %ls1yF
  %ln1Im = add i64 %ln1Il, 15
  %ln1In = inttoptr i64 %ln1Im to i64*
  %ln1Io = load i64* %ln1In, !tbaa !5
  store i64 %ln1Io, i64* %ls1xv
  %ln1Ip = load i64* %ls1yF
  %ln1Iq = add i64 %ln1Ip, 23
  %ln1Ir = inttoptr i64 %ln1Iq to i64*
  %ln1Is = load i64* %ln1Ir, !tbaa !5
  store i64 %ln1Is, i64* %ls1xs
  %ln1It = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @s1xB_info to i64
  %ln1Iu = load i64** %Hp_Var
  %ln1Iv = getelementptr inbounds i64* %ln1Iu, i32 -3
  store i64 %ln1It, i64* %ln1Iv, !tbaa !2
  %ln1Iw = load i64* %ls1xu
  %ln1Ix = load i64** %Hp_Var
  %ln1Iy = getelementptr inbounds i64* %ln1Ix, i32 -2
  store i64 %ln1Iw, i64* %ln1Iy, !tbaa !2
  %ln1Iz = load i64* %ls1xs
  %ln1IA = load i64** %Hp_Var
  %ln1IB = getelementptr inbounds i64* %ln1IA, i32 -1
  store i64 %ln1Iz, i64* %ln1IB, !tbaa !2
  %ln1IC = load i64* %ls1xv
  %ln1ID = load i64** %Hp_Var
  %ln1IE = getelementptr inbounds i64* %ln1ID, i32 0
  store i64 %ln1IC, i64* %ln1IE, !tbaa !2
  %ln1IF = load i64** %Hp_Var
  %ln1IG = ptrtoint i64* %ln1IF to i64
  %ln1IH = add i64 %ln1IG, -22
  store i64 %ln1IH, i64* %lc1Bn
  %ln1II = load i64* %ls1xs
  %ln1IJ = load i64* %ls1xs
  %ln1IK = srem i64 %ln1IJ, 4
  %ln1IL = sub i64 %ln1II, %ln1IK
  store i64 %ln1IL, i64* %ls1xH
  %ln1IM = insertelement <4 x i32> < i32 0, i32 0, i32 0, i32 0 >, i32 0, i32 0
  %ln1IN = insertelement <4 x i32> %ln1IM, i32 0, i32 1
  %ln1IO = insertelement <4 x i32> %ln1IN, i32 0, i32 2
  %ln1IP = insertelement <4 x i32> %ln1IO, i32 0, i32 3
  %ln1IQ = bitcast <4 x i32> %ln1IP to <4 x i32>
  store <4 x i32> %ln1IQ, <4 x i32>* %ls1xX, align 1
  store i64 0, i64* %ls1xL
  br label %s1xV
s1xV:
  %ln1IR = load i64* %ls1xL
  %ln1IS = load i64* %ls1xH
  %ln1IT = icmp sge i64 %ln1IR, %ln1IS
  br i1 %ln1IT, label %c1C1, label %c1C0
c1C0:
  %ln1IU = load i64* %ls1xu
  %ln1IV = add i64 %ln1IU, 16
  %ln1IW = load i64* %ls1xv
  %ln1IX = load i64* %ls1xL
  %ln1IY = add i64 %ln1IW, %ln1IX
  %ln1IZ = shl i64 %ln1IY, 2
  %ln1J0 = add i64 %ln1IZ, 1536
  %ln1J1 = add i64 %ln1IV, %ln1J0
  %ln1J2 = inttoptr i64 %ln1J1 to i8*
  store i64 undef, i64* %R3_Var
  store i64 undef, i64* %R4_Var
  store i64 undef, i64* %R5_Var
  store i64 undef, i64* %R6_Var
  store float undef, float* %F1_Var
  store double undef, double* %D1_Var
  store float undef, float* %F2_Var
  store double undef, double* %D2_Var
  store float undef, float* %F3_Var
  store double undef, double* %D3_Var
  store float undef, float* %F4_Var
  store double undef, double* %D4_Var
  store float undef, float* %F5_Var
  store double undef, double* %D5_Var
  store float undef, float* %F6_Var
  store double undef, double* %D6_Var
  call ccc void (i8*,i32,i32,i32)* @llvm.prefetch( i8* %ln1J2, i32 0, i32 3, i32 1 )
  %ln1J3 = load <4 x i32>* %ls1xX, align 1
  %ln1J4 = load i64* %ls1xu
  %ln1J5 = add i64 %ln1J4, 16
  %ln1J6 = load i64* %ls1xv
  %ln1J7 = load i64* %ls1xL
  %ln1J8 = add i64 %ln1J6, %ln1J7
  %ln1J9 = shl i64 %ln1J8, 2
  %ln1Ja = add i64 %ln1J5, %ln1J9
  %ln1Jb = inttoptr i64 %ln1Ja to <4 x i32>*
  %ln1Jc = load <4 x i32>* %ln1Jb, align 1, !tbaa !5
  %ln1Jd = add <4 x i32> %ln1J3, %ln1Jc
  %ln1Je = bitcast <4 x i32> %ln1Jd to <4 x i32>
  store <4 x i32> %ln1Je, <4 x i32>* %ls1xX, align 1
  %ln1Jf = load i64* %ls1xL
  %ln1Jg = add i64 %ln1Jf, 4
  store i64 %ln1Jg, i64* %ls1xL
  br label %s1xV
c1C1:
  %ln1Jh = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @c1Bm_info to i64
  %ln1Ji = load i64** %Sp_Var
  %ln1Jj = getelementptr inbounds i64* %ln1Ji, i32 -3
  store i64 %ln1Jh, i64* %ln1Jj, !tbaa !1
  %ln1Jk = load i64* %ls1xL
  store i64 %ln1Jk, i64* %R3_Var
  store i64 0, i64* %R2_Var
  %ln1Jl = load i64* %lc1Bn
  store i64 %ln1Jl, i64* %R1_Var
  %ln1Jm = load <4 x i32>* %ls1xX, align 1
  %ln1Jn = load i64** %Sp_Var
  %ln1Jo = getelementptr inbounds i64* %ln1Jn, i32 -2
  %ln1Jp = bitcast i64* %ln1Jo to <4 x i32>*
  store <4 x i32> %ln1Jm, <4 x i32>* %ln1Jp, align 1, !tbaa !1
  %ln1Jq = load i64** %Sp_Var
  %ln1Jr = getelementptr inbounds i64* %ln1Jq, i32 -3
  %ln1Js = ptrtoint i64* %ln1Jr to i64
  %ln1Jt = inttoptr i64 %ln1Js to i64*
  store i64* %ln1Jt, i64** %Sp_Var
  %ln1Ju = load i64** %Base_Var
  %ln1Jv = load i64** %Sp_Var
  %ln1Jw = load i64** %Hp_Var
  %ln1Jx = load i64* %R1_Var
  %ln1Jy = load i64* %R2_Var
  %ln1Jz = load i64* %R3_Var
  %ln1JA = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @s1xB_info( i64* %ln1Ju, i64* %ln1Jv, i64* %ln1Jw, i64 %ln1Jx, i64 %ln1Jy, i64 %ln1Jz, i64 undef, i64 undef, i64 undef, i64 %ln1JA ) nounwind
  ret void
c1Cb:
  %ln1JB = load i64** %Base_Var
  %ln1JC = getelementptr inbounds i64* %ln1JB, i32 41
  store i64 32, i64* %ln1JC, !tbaa !4
  %ln1JD = load i64* %ls1yF
  store i64 %ln1JD, i64* %R1_Var
  %ln1JE = load i64** %Base_Var
  %ln1JF = load i64** %Sp_Var
  %ln1JG = load i64** %Hp_Var
  %ln1JH = load i64* %R1_Var
  %ln1JI = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @stg_gc_unpt_r1( i64* %ln1JE, i64* %ln1JF, i64* %ln1JG, i64 %ln1JH, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1JI ) nounwind
  ret void
}

declare ccc void @llvm.prefetch(i8*, i32, i32, i32)
declare cc 10 void @stg_gc_unpt_r1(i64* noalias nocapture, i64* noalias nocapture, i64* noalias nocapture, i64, i64, i64, i64, i64, i64, i64) align 8

%c1Bm_entry_struct = type <{i64, i64}>
@c1Bm_info_itable = internal constant %c1Bm_entry_struct<{i64 451, i64 32}>, section "X98A__STRIP,__me7", align 8

define internal cc 10 void @c1Bm_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me8" {
c1Bm:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 undef, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1xX = alloca <4 x i32>, i32 1
  %ln1Kr = load i64** %Sp_Var
  %ln1Ks = getelementptr inbounds i64* %ln1Kr, i32 1
  %ln1Kt = bitcast i64* %ln1Ks to <4 x i32>*
  %ln1Ku = load <4 x i32>* %ln1Kt, align 1, !tbaa !1
  %ln1Kv = bitcast <4 x i32> %ln1Ku to <4 x i32>
  store <4 x i32> %ln1Kv, <4 x i32>* %ls1xX, align 1
  %ln1Kw = load i64* %R1_Var
  %ln1Kx = load <4 x i32>* %ls1xX, align 1
  %ln1Ky = extractelement <4 x i32> %ln1Kx, i32 0
  %ln1Kz = sext i32 %ln1Ky to i64
  %ln1KA = add i64 %ln1Kw, %ln1Kz
  %ln1KB = trunc i64 %ln1KA to i32
  %ln1KC = sext i32 %ln1KB to i64
  %ln1KD = load <4 x i32>* %ls1xX, align 1
  %ln1KE = extractelement <4 x i32> %ln1KD, i32 1
  %ln1KF = sext i32 %ln1KE to i64
  %ln1KG = add i64 %ln1KC, %ln1KF
  %ln1KH = trunc i64 %ln1KG to i32
  %ln1KI = sext i32 %ln1KH to i64
  %ln1KJ = load <4 x i32>* %ls1xX, align 1
  %ln1KK = extractelement <4 x i32> %ln1KJ, i32 2
  %ln1KL = sext i32 %ln1KK to i64
  %ln1KM = add i64 %ln1KI, %ln1KL
  %ln1KN = trunc i64 %ln1KM to i32
  %ln1KO = sext i32 %ln1KN to i64
  %ln1KP = load <4 x i32>* %ls1xX, align 1
  %ln1KQ = extractelement <4 x i32> %ln1KP, i32 3
  %ln1KR = sext i32 %ln1KQ to i64
  %ln1KS = add i64 %ln1KO, %ln1KR
  %ln1KT = trunc i64 %ln1KS to i32
  %ln1KU = sext i32 %ln1KT to i64
  store i64 %ln1KU, i64* %R1_Var
  %ln1KV = load i64** %Sp_Var
  %ln1KW = getelementptr inbounds i64* %ln1KV, i32 4
  %ln1KX = ptrtoint i64* %ln1KW to i64
  %ln1KY = inttoptr i64 %ln1KX to i64*
  store i64* %ln1KY, i64** %Sp_Var
  %ln1KZ = load i64** %Sp_Var
  %ln1L0 = getelementptr inbounds i64* %ln1KZ, i32 0
  %ln1L1 = bitcast i64* %ln1L0 to i64*
  %ln1L2 = load i64* %ln1L1, !tbaa !1
  %ln1L3 = inttoptr i64 %ln1L2 to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1L4 = load i64** %Base_Var
  %ln1L5 = load i64** %Sp_Var
  %ln1L6 = load i64** %Hp_Var
  %ln1L7 = load i64* %R1_Var
  %ln1L8 = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1L3( i64* %ln1L4, i64* %ln1L5, i64* %ln1L6, i64 %ln1L7, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1L8 ) nounwind
  ret void
}

%Test_sum1_entry_struct = type <{i64, i64, i64}>
@Test_sum1_info_itable = constant %Test_sum1_entry_struct<{i64 4294967301, i64 0, i64 15}>, section "X98A__STRIP,__me9", align 8

define cc 10 void @Test_sum1_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me10" {
c1Dh:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1yn = alloca i64, i32 1
  %ln1LG = load i64* %R2_Var
  store i64 %ln1LG, i64* %ls1yn
  %ln1LH = load i64** %Sp_Var
  %ln1LI = getelementptr inbounds i64* %ln1LH, i32 -1
  %ln1LJ = ptrtoint i64* %ln1LI to i64
  %ln1LK = load i64* %SpLim_Var
  %ln1LL = icmp ult i64 %ln1LJ, %ln1LK
  br i1 %ln1LL, label %c1Dw, label %c1Dv
c1Dv:
  %ln1LM = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @c1Di_info to i64
  %ln1LN = load i64** %Sp_Var
  %ln1LO = getelementptr inbounds i64* %ln1LN, i32 -1
  store i64 %ln1LM, i64* %ln1LO, !tbaa !1
  %ln1LP = load i64* %ls1yn
  store i64 %ln1LP, i64* %R2_Var
  %ln1LQ = load i64** %Sp_Var
  %ln1LR = getelementptr inbounds i64* %ln1LQ, i32 -1
  %ln1LS = ptrtoint i64* %ln1LR to i64
  %ln1LT = inttoptr i64 %ln1LS to i64*
  store i64* %ln1LT, i64** %Sp_Var
  %ln1LU = load i64** %Base_Var
  %ln1LV = load i64** %Sp_Var
  %ln1LW = load i64** %Hp_Var
  %ln1LX = load i64* %R1_Var
  %ln1LY = load i64* %R2_Var
  %ln1LZ = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @Test_zdwa_info( i64* %ln1LU, i64* %ln1LV, i64* %ln1LW, i64 %ln1LX, i64 %ln1LY, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1LZ ) nounwind
  ret void
c1Dw:
  %ln1M0 = load i64* %ls1yn
  store i64 %ln1M0, i64* %R2_Var
  %ln1M1 = ptrtoint %Test_sum1_closure_struct* @Test_sum1_closure to i64
  store i64 %ln1M1, i64* %R1_Var
  %ln1M2 = load i64** %Base_Var
  %ln1M3 = getelementptr inbounds i64* %ln1M2, i32 -1
  %ln1M4 = bitcast i64* %ln1M3 to i64*
  %ln1M5 = load i64* %ln1M4, !tbaa !4
  %ln1M6 = inttoptr i64 %ln1M5 to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1M7 = load i64** %Base_Var
  %ln1M8 = load i64** %Sp_Var
  %ln1M9 = load i64** %Hp_Var
  %ln1Ma = load i64* %R1_Var
  %ln1Mb = load i64* %R2_Var
  %ln1Mc = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1M6( i64* %ln1M7, i64* %ln1M8, i64* %ln1M9, i64 %ln1Ma, i64 %ln1Mb, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Mc ) nounwind
  ret void
}

%c1Di_entry_struct = type <{i64, i64}>
@c1Di_info_itable = internal constant %c1Di_entry_struct<{i64 0, i64 32}>, section "X98A__STRIP,__me11", align 8

define internal cc 10 void @c1Di_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me12" {
c1Di:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 undef, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1yp = alloca i64, i32 1
  %ln1MU = load i64** %Hp_Var
  %ln1MV = getelementptr inbounds i64* %ln1MU, i32 2
  %ln1MW = ptrtoint i64* %ln1MV to i64
  %ln1MX = inttoptr i64 %ln1MW to i64*
  store i64* %ln1MX, i64** %Hp_Var
  %ln1MY = load i64* %R1_Var
  store i64 %ln1MY, i64* %ls1yp
  %ln1MZ = load i64** %Hp_Var
  %ln1N0 = ptrtoint i64* %ln1MZ to i64
  %ln1N1 = load i64** %Base_Var
  %ln1N2 = getelementptr inbounds i64* %ln1N1, i32 35
  %ln1N3 = bitcast i64* %ln1N2 to i64*
  %ln1N4 = load i64* %ln1N3, !tbaa !4
  %ln1N5 = icmp ugt i64 %ln1N0, %ln1N4
  br i1 %ln1N5, label %c1Ds, label %c1Dp
c1Dp:
  %ln1N6 = ptrtoint [0 x i64]* @base_GHCziInt_I32zh_con_info to i64
  %ln1N7 = load i64** %Hp_Var
  %ln1N8 = getelementptr inbounds i64* %ln1N7, i32 -1
  store i64 %ln1N6, i64* %ln1N8, !tbaa !2
  %ln1N9 = load i64* %ls1yp
  %ln1Na = load i64** %Hp_Var
  %ln1Nb = getelementptr inbounds i64* %ln1Na, i32 0
  store i64 %ln1N9, i64* %ln1Nb, !tbaa !2
  %ln1Nc = load i64** %Hp_Var
  %ln1Nd = ptrtoint i64* %ln1Nc to i64
  %ln1Ne = add i64 %ln1Nd, -7
  store i64 %ln1Ne, i64* %R1_Var
  %ln1Nf = load i64** %Sp_Var
  %ln1Ng = getelementptr inbounds i64* %ln1Nf, i32 1
  %ln1Nh = ptrtoint i64* %ln1Ng to i64
  %ln1Ni = inttoptr i64 %ln1Nh to i64*
  store i64* %ln1Ni, i64** %Sp_Var
  %ln1Nj = load i64** %Sp_Var
  %ln1Nk = getelementptr inbounds i64* %ln1Nj, i32 0
  %ln1Nl = bitcast i64* %ln1Nk to i64*
  %ln1Nm = load i64* %ln1Nl, !tbaa !1
  %ln1Nn = inttoptr i64 %ln1Nm to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1No = load i64** %Base_Var
  %ln1Np = load i64** %Sp_Var
  %ln1Nq = load i64** %Hp_Var
  %ln1Nr = load i64* %R1_Var
  %ln1Ns = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1Nn( i64* %ln1No, i64* %ln1Np, i64* %ln1Nq, i64 %ln1Nr, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Ns ) nounwind
  ret void
c1Ds:
  %ln1Nt = load i64** %Base_Var
  %ln1Nu = getelementptr inbounds i64* %ln1Nt, i32 41
  store i64 16, i64* %ln1Nu, !tbaa !4
  %ln1Nv = load i64* %ls1yp
  store i64 %ln1Nv, i64* %R1_Var
  %ln1Nw = load i64** %Base_Var
  %ln1Nx = load i64** %Sp_Var
  %ln1Ny = load i64** %Hp_Var
  %ln1Nz = load i64* %R1_Var
  %ln1NA = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @stg_gc_unbx_r1( i64* %ln1Nw, i64* %ln1Nx, i64* %ln1Ny, i64 %ln1Nz, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1NA ) nounwind
  ret void
}

@base_GHCziInt_I32zh_con_info = external global [0 x i64]
declare cc 10 void @stg_gc_unbx_r1(i64* noalias nocapture, i64* noalias nocapture, i64* noalias nocapture, i64, i64, i64, i64, i64, i64, i64) align 8

%Test_sum_entry_struct = type <{i64, i64, i64}>
@Test_sum_info_itable = constant %Test_sum_entry_struct<{i64 4294967301, i64 0, i64 15}>, section "X98A__STRIP,__me13", align 8

define cc 10 void @Test_sum_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me14" {
c1DE:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ln1NI = load i64* %R2_Var
  store i64 %ln1NI, i64* %R2_Var
  %ln1NJ = load i64** %Base_Var
  %ln1NK = load i64** %Sp_Var
  %ln1NL = load i64** %Hp_Var
  %ln1NM = load i64* %R1_Var
  %ln1NN = load i64* %R2_Var
  %ln1NO = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @Test_sum1_info( i64* %ln1NJ, i64* %ln1NK, i64* %ln1NL, i64 %ln1NM, i64 %ln1NN, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1NO ) nounwind
  ret void
}

@llvm.used = appending global [4 x i8*] [i8* bitcast (%c1Di_entry_struct* @c1Di_info_itable to i8*), i8* bitcast (%c1Bm_entry_struct* @c1Bm_info_itable to i8*), i8* bitcast (%c1Bg_entry_struct* @c1Bg_info_itable to i8*), i8* bitcast (%s1xB_entry_struct* @s1xB_info_itable to i8*)], section "llvm.metadata"
.file "/tmp/ghc19964_0/ghc19964_0.bc" .data .type Test_zdwa_closure,@object # @Test_zdwa_closure .globl Test_zdwa_closure .align 8 Test_zdwa_closure: .quad Test_zdwa_info .size Test_zdwa_closure, 8 .type Test_sum1_closure,@object # @Test_sum1_closure .globl Test_sum1_closure .align 8 Test_sum1_closure: .quad Test_sum1_info .size Test_sum1_closure, 8 .type Test_sum_closure,@object # @Test_sum_closure .globl Test_sum_closure .align 8 Test_sum_closure: .quad Test_sum_info .size Test_sum_closure, 8 .section ".note.GNU-stack","",@progbits .text .type s1xB_info_itable,@object # @s1xB_info_itable .align 8 s1xB_info_itable: .quad 8589934602 # 0x20000000a .quad 8589934593 # 0x200000001 .quad 9 # 0x9 .size s1xB_info_itable, 24 .text .align 8, 0x90 .type s1xB_info,@function s1xB_info: # @s1xB_info # BB#0: # %c1AJ movq %r14, %rax movq 14(%rbx), %rcx cmpq %rsi, %rcx jle .LBB0_3 # BB#1: # %c1AM.lr.ph movq 22(%rbx), %rdx addq %rsi, %rdx movq 6(%rbx), %rdi leaq 16(%rdi,%rdx,4), %rdx .align 16, 0x90 .LBB0_2: # %c1AM # =>This Inner Loop Header: Depth=1 addl (%rdx), %eax movslq %eax, %rax addq $4, %rdx incq %rsi cmpq %rsi, %rcx jg .LBB0_2 .LBB0_3: # %c1AN movq (%rbp), %rcx movq %rax, %rbx jmpq *%rcx # TAILCALL .Ltmp0: .size s1xB_info, .Ltmp0-s1xB_info .text .type Test_zdwa_info_itable,@object # @Test_zdwa_info_itable .globl Test_zdwa_info_itable .align 8 Test_zdwa_info_itable: .quad 4294967301 # 0x100000005 .quad 0 # 0x0 .quad 15 # 0xf .size Test_zdwa_info_itable, 24 .text .globl Test_zdwa_info .align 8, 0x90 .type Test_zdwa_info,@function Test_zdwa_info: # @Test_zdwa_info # BB#0: # %c1Bf leaq -32(%rbp), %rax cmpq %r15, %rax jae .LBB1_1 # BB#2: # %c1Cf movq -8(%r13), %rax movl $Test_zdwa_closure, %ebx jmpq *%rax # TAILCALL .LBB1_1: # %c1Ce movq $c1Bg_info, -8(%rbp) addq $-8, %rbp movq %r14, %rbx jmp stg_ap_0_fast # TAILCALL .Ltmp1: .size Test_zdwa_info, .Ltmp1-Test_zdwa_info .text .type c1Bg_info_itable,@object # @c1Bg_info_itable .align 8 c1Bg_info_itable: .quad 0 # 0x0 .quad 32 # 0x20 .size c1Bg_info_itable, 16 .text .align 8, 0x90 .type c1Bg_info,@function c1Bg_info: # @c1Bg_info # BB#0: # %c1Bg movq %r12, %rax leaq 32(%rax), %r12 cmpq 280(%r13), %r12 jbe .LBB2_1 # BB#8: # %c1Cb movq $32, 328(%r13) jmp stg_gc_unpt_r1 # TAILCALL .LBB2_1: # %c1BR movq 23(%rbx), %rcx movq 7(%rbx), %rsi movq 15(%rbx), %rdi movq $s1xB_info, 8(%rax) movq %rsi, 16(%rax) movq %rcx, 24(%rax) movq %rcx, %rdx sarq $63, %rdx shrq $62, %rdx addq %rcx, %rdx movq %rdi, (%r12) andq $-4, %rdx pxor %xmm0, %xmm0 xorl %eax, %eax testq %rdx, %rdx movq %rax, %rcx jle .LBB2_4 # BB#2: # %c1C0.lr.ph leaq 1552(%rsi,%rdi,4), %rsi pxor %xmm0, %xmm0 xorl %ecx, %ecx .align 16, 0x90 .LBB2_3: # %c1C0 # =>This Inner Loop Header: Depth=1 prefetcht0 (%rsi) movdqu -1536(%rsi), %xmm1 paddd %xmm1, %xmm0 addq $16, %rsi addq $4, %rcx cmpq %rdx, %rcx jl .LBB2_3 .LBB2_4: # %c1C1 movq $c1Bm_info, -24(%rbp) movdqu %xmm0, -16(%rbp) movq -8(%r12), %rdx cmpq %rcx, %rdx jle .LBB2_7 # BB#5: # %c1AM.lr.ph.i subq %rcx, %rdx addq (%r12), %rcx movq -16(%r12), %rax leaq 16(%rax,%rcx,4), %rcx xorl %eax, %eax .align 16, 0x90 .LBB2_6: # %c1AM.i # =>This Inner Loop Header: Depth=1 addl (%rcx), %eax movslq %eax, %rax addq $4, %rcx decq %rdx jne .LBB2_6 .LBB2_7: # %s1xB_info.exit pextrd $3, %xmm0, %ecx addl %eax, %ecx pextrd $2, %xmm0, %eax addl %ecx, %eax pextrd $1, %xmm0, %ecx addl %eax, %ecx movd %xmm0, %eax addl %ecx, %eax movslq %eax, %rbx movq 8(%rbp), %rax addq $8, %rbp jmpq *%rax # TAILCALL .Ltmp2: .size c1Bg_info, .Ltmp2-c1Bg_info .text .type c1Bm_info_itable,@object # 
@c1Bm_info_itable .align 8 c1Bm_info_itable: .quad 451 # 0x1c3 .quad 32 # 0x20 .size c1Bm_info_itable, 16 .text .align 8, 0x90 .type c1Bm_info,@function c1Bm_info: # @c1Bm_info # BB#0: # %c1Bm movdqu 8(%rbp), %xmm0 pextrd $3, %xmm0, %eax addl %ebx, %eax pextrd $2, %xmm0, %ecx addl %eax, %ecx pextrd $1, %xmm0, %eax addl %ecx, %eax movd %xmm0, %ecx addl %eax, %ecx movslq %ecx, %rbx movq 32(%rbp), %rax addq $32, %rbp jmpq *%rax # TAILCALL .Ltmp3: .size c1Bm_info, .Ltmp3-c1Bm_info .text .type Test_sum1_info_itable,@object # @Test_sum1_info_itable .globl Test_sum1_info_itable .align 8 Test_sum1_info_itable: .quad 4294967301 # 0x100000005 .quad 0 # 0x0 .quad 15 # 0xf .size Test_sum1_info_itable, 24 .text .globl Test_sum1_info .align 8, 0x90 .type Test_sum1_info,@function Test_sum1_info: # @Test_sum1_info # BB#0: # %c1Dh leaq -8(%rbp), %rax cmpq %r15, %rax jae .LBB4_1 # BB#3: # %c1Dw movq -8(%r13), %rax movl $Test_sum1_closure, %ebx jmpq *%rax # TAILCALL .LBB4_1: # %c1Dv movq $c1Di_info, -8(%rbp) leaq -40(%rbp), %rcx cmpq %r15, %rcx jae .LBB4_4 # BB#2: # %c1Cf.i movq -8(%r13), %rcx movq %rax, %rbp movl $Test_zdwa_closure, %ebx jmpq *%rcx # TAILCALL .LBB4_4: # %c1Ce.i movq $c1Bg_info, -16(%rbp) addq $-16, %rbp movq %r14, %rbx jmp stg_ap_0_fast # TAILCALL .Ltmp4: .size Test_sum1_info, .Ltmp4-Test_sum1_info .text .type c1Di_info_itable,@object # @c1Di_info_itable .align 8 c1Di_info_itable: .quad 0 # 0x0 .quad 32 # 0x20 .size c1Di_info_itable, 16 .text .align 8, 0x90 .type c1Di_info,@function c1Di_info: # @c1Di_info # BB#0: # %c1Di movq %r12, %rax leaq 16(%rax), %r12 cmpq 280(%r13), %r12 jbe .LBB5_1 # BB#2: # %c1Ds movq $16, 328(%r13) jmp stg_gc_unbx_r1 # TAILCALL .LBB5_1: # %c1Dp movq $base_GHCziInt_I32zh_con_info, 8(%rax) movq %rbx, 16(%rax) movq 8(%rbp), %rax addq $8, %rbp leaq -7(%r12), %rbx jmpq *%rax # TAILCALL .Ltmp5: .size c1Di_info, .Ltmp5-c1Di_info .text .type Test_sum_info_itable,@object # @Test_sum_info_itable .globl Test_sum_info_itable .align 8 Test_sum_info_itable: .quad 4294967301 # 0x100000005 .quad 0 # 0x0 .quad 15 # 0xf .size Test_sum_info_itable, 24 .text .globl Test_sum_info .align 8, 0x90 .type Test_sum_info,@function Test_sum_info: # @Test_sum_info # BB#0: # %c1DE leaq -8(%rbp), %rax cmpq %r15, %rax jae .LBB6_1 # BB#3: # %c1Dw.i movq -8(%r13), %rax movl $Test_sum1_closure, %ebx jmpq *%rax # TAILCALL .LBB6_1: # %c1Dv.i movq $c1Di_info, -8(%rbp) leaq -40(%rbp), %rcx cmpq %r15, %rcx jae .LBB6_4 # BB#2: # %c1Cf.i.i movq -8(%r13), %rcx movq %rax, %rbp movl $Test_zdwa_closure, %ebx jmpq *%rcx # TAILCALL .LBB6_4: # %c1Ce.i.i movq $c1Bg_info, -16(%rbp) addq $-16, %rbp movq %r14, %rbx jmp stg_ap_0_fast # TAILCALL .Ltmp6: .size Test_sum_info, .Ltmp6-Test_sum_info .type __stginit_Test,@object # @__stginit_Test .bss .globl __stginit_Test .align 8 __stginit_Test: .size __stginit_Test, 0