2009/9/23 Nicolas Boulay <[email protected]>

> 2009/9/23 Andre Pouliot <[email protected]>:
> >
> > 2009/9/23 Nicolas Boulay <[email protected]>
> >>
> >> 2009/9/23 Kenneth Ostby <[email protected]>:
> >> > Nicolas Boulay:
> >> >> 2009/9/23 Kenneth Ostby <[email protected]>:
> >> >>> Hi,
> >> >>>
> >> >>> Nicolas Boulay:
> >> >>>> 2009/9/23 Hugh Fisher <[email protected]>:
> >> >>>>> Andre Pouliot wrote:
> >> >> <...>
> >> >>
> >> >> I don't understand your point. Does that mean the ALU will be full
> >> >> but the other units will be unused? For example, the adder and the
> >> >> multiplier could be separate units, and both could be filled at the
> >> >> same time (a MAC instead of MUL + an adder should be better).
> >> >
> >> > Aaah, the joy of terminology. If you take a look at the shader unit
> >> > figure in [1], you can see how we plan to have several ALUs in a
> >> > single shader. All those ALUs will execute the same instruction over
> >> > different threads. Thus, exposing the ALUs to the software developer
> >> > only adds complexity on both the hardware and software sides.
> >> > Furthermore, the software side will in most cases only have to
> >> > duplicate the same instruction over several ALUs.
> >> >
> >>
> >> For me that's the definition of SIMD code. How do you deal with
> >> branches? Do you execute both branches and discard one? If you use
> >> masked vectors it looks very much like the new 512-bit vector
> >> instructions from Intel and Larrabee (AVX?).
> >
> > The microarchitecture looks the same as SIMD. The main difference is
> > that the data being processed by each ALU is independent of the others.
>
> It's very much like the SIMD of Larrabee, which uses vectorized
> load/store (a different address for each vector element).
> For more information:
> http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf p51
>
> > For branch management we have so far proposed a few solutions: minimum
> > and maximum operations to compare two values and select the
> > corresponding result. The other improvement is a loop unit; it should
> > help for small code loops of fixed length.
> > Both of those should help reduce branching. Still, if some branches do
> > happen, there will be a caching mechanism. The kernel will continue to
> > execute the threads that correspond to the first branch. Once the
> > kernel has finished executing, it will go back in the program to the
> > previous branch to finish executing the remaining threads.
>
> So you want some vectorized conditional moves, and some "conditional
> vector masks", like Intel's for Larrabee :) (it's a kind of predicate)
> (VPU is the name of this instruction set)
>
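The branch handling described above (run the threads on the taken path first, then go back to the branch and run the rest) can be modeled in a few lines of Python. This is a toy illustration with made-up names and kernel paths, not the actual dispatcher logic:

```python
# Toy model of SIMT branch divergence (illustrative only): every "ALU"
# runs the same kernel; a write mask disables result writes for threads
# not on the current path, then the other path is re-executed.

def run_diverged(values):
    results = [None] * len(values)

    # First pass: threads whose condition holds; the others are masked off.
    mask = [v >= 0 for v in values]
    for i, v in enumerate(values):
        if mask[i]:                 # result write enabled
            results[i] = v * 2      # "then" path of the kernel

    # Second pass: revert to the branch and run the remaining threads.
    for i, v in enumerate(values):
        if not mask[i]:
            results[i] = -v         # "else" path of the kernel

    return results

print(run_diverged([3, -1, 0, -4]))   # [6, 1, 0, 4]
```

Note that both passes walk all the lanes: that is exactly the wasted shader resource the divergence discussion is about.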
There is no vector. What happens is that the write to the result register is
disabled on the threads that don't follow one branch of the kernel. Once the
kernel has finished executing a few threads, it reverts to the last branch to
execute the rest. So each time a branch diverges, some of the shader
resources are wasted. That is why we want to prevent branching.

> If you add loop controls inside the ALU it becomes hard to program it.
> Do you want a kind of "repeat n times" for an instruction, like in a
> DSP (but DSPs have memory-to-ALU addressing modes)? That becomes
> complex.

The loop will be within the instruction decoder and dispatcher. There will
be 2 instructions, "loopstart id, n" and "loopend id"; it will be for
fixed-length loops and should be rather straightforward to program.

> >>
> >> > That being said, after having finished my coffee, and had some time
> >> > to think, we might be able to utilize LIW, although I'm still unsure
> >> > about the cost to benefit ratio. Imagine if we, in what we call the
> >> > ALUs, include several functional units, adders, multipliers, &c. We
> >> > can use LIW in order to fully utilize them. However, this comes with
> >> > the added cost of logic and design complexity. The simple way to
> >> > solve this could be to add a single multiply-adder unit inside each
> >> > ALU, and thus we avoid the LIW problem altogether.
> >> >
> >>
> >> An x86 instruction uses 2 register addresses, 1 for reading, 1 for
> >> read/write. It's compact, but fast only with register renaming.
> >> A typical RISC operation is 2r1w: 3 addresses, 2 reads, one write. A
> >> MAC operation is 3 reads, 1 write.
> >>
> >> An LIW could be seen as a 6-read, 3-write execution unit.
> >
> > The most frequent instruction will be one with 3 operands, 2 reads and
> > 1 write. The instructions will be fixed length; we don't want to fall
> > into the trap of variable-length instructions.
> > Also, the instructions will not contain any constants; those will be
> > in the constant registers.
>
> In this list many years ago, some statistics were displayed on the use
> of combined instructions. The most common were "ADD ADD" and "ADD MUL",
> so 3r1w should be used, at least for ADDMUL (a*b+c); ADDADD (a+b+c)
> could also be used often. An adder with 3 inputs should be smaller than
> 2 simple adder units?

The ALU will only be able to do 1 math/logical operation at once, on 2
operands. Complex instructions like ADDMUL and other 3-operand operations
cannot be supported, for simplicity's sake.

> >>
> >> From your terminology, it looks like an ALU with a lot of register
> >> ports (for example a MAC/MUL unit, beside load/store, beside a
> >> complete ALU without MUL).
> >
> > The ALU is split from the register file for simplicity's sake. Still,
> > one feature of the ALU is that it will be completely integrated. What
> > I mean is that the same ALU will do integer and float operations. The
> > ALU will be split into different stages, which will allow a reuse of
> > the different logic blocks. For example, the multiplier will serve to
> > do the multiplication of all data types, and it will also act as a
> > barrel shifter. A barrel shifter is needed both for the float add and
> > if we want to realize shifts bigger than 1 bit per clock.
>
> It's hard to know whether it's better to have one more ALU, or a split,
> more complex ALU with a better utilization rate.

The advantage of having 1 ALU is that we can share some of the resources
and fix the pipeline length; this allows us to make a simple control
mechanism to prevent data dependency issues.

> > The only part that may not be integrated is the divider; its status is
> > still unclear at the moment. We know we need one, but we aren't sure
> > how to implement it and how the kernel will react to that instruction.
> > We have 2 main choices: either a serial divider or a pipelined
> > divider.
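The point that the multiplier can double as a barrel shifter rests on the identity that a shift by k bits is a multiply by 2^k, so one multiplier array can serve both roles. A quick Python check of the identity (illustrative only; the hardware data path is of course fixed-width):

```python
# Illustrative: a multi-bit left shift is just a multiply by a power of
# two, which is why a single multiplier array can also serve as a barrel
# shifter for shifts bigger than 1 bit per clock.
for x in (0, 1, 5, 0xDEAD):
    for k in range(16):
        assert (x << k) == x * (1 << k)
print("shift-as-multiply identity holds")
```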
> > The serial divider's advantage is its relatively small size; the
> > downside is that it can only process 1 request at once and takes
> > multiple cycles. The pipelined divider's advantage is that we could
> > probably realize 1 division per cycle, but the problem would be the
> > size requirement.
>
> You should only implement 1-cycle operations. If you really need div,
> pipeline (1/x) with MUL, with enough guard bits to have the required
> precision. There are lots of 1-cycle operations for complex functions
> (1/x, 1/sqrt(x)).

Any division operation, even when it's supposedly 1 cycle, is in reality:
1 operation issued per cycle, but with a latency between 25 and 64 cycles,
depending on the operation requested and the data type. Doing so would
require ~64 subtractors if we support fractional-result divide for
integers, or 32 subtractors if we support divide and modulo only. Some
operations can use constant values for complex functions; not all of them,
but a few.
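Nicolas's suggestion (pipeline 1/x through the multiplier with enough guard bits) usually means a Newton-Raphson reciprocal. A sketch in Python, where the seed and step count are illustrative, not a hardware specification; the real trade-off against the 25-64-cycle subtractor array would depend on the precision targets:

```python
# Sketch of building division out of multiplies (Newton-Raphson 1/x),
# as an alternative to a serial subtract-based divider. Each iteration
# is r' = r * (2 - d*r), which roughly doubles the number of correct
# bits, so it maps onto one MUL plus one MAC per step.

def reciprocal(d, steps=5):
    assert 0.5 <= d < 1.0              # assume a normalized mantissa
    r = 48/17 - (32/17) * d            # classic linear initial estimate
    for _ in range(steps):
        r = r * (2.0 - d * r)          # one MUL + one MAC per step
    return r

def divide(a, d):
    return a * reciprocal(d)           # a/d computed as a * (1/d)

print(abs(divide(6.0, 0.75) - 8.0) < 1e-9)   # True
```

With 5 steps this converges to double-precision accuracy on the normalized range; a hardware version would tune the seed table and step count to the supported data types.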
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
