Re: [Open-graphics] OGA2 SIMD/MIMD

Andre Pouliot Wed, 23 Sep 2009 08:56:21 -0700

2009/9/23 Nicolas Boulay <[email protected]>

> 2009/9/23 Kenneth Ostby <[email protected]>:
> > Nicolas Boulay:
> >>2009/9/23 Kenneth Ostby <[email protected]>:
> >>> Hi,
> >>>
> >>> Nicolas Boulay:
> >>>>2009/9/23 Hugh Fisher <[email protected]>:
> >>>>> Andre Pouliot wrote:
> >><...>
> >>>>
> >>>>Personally, LIW is what I prefer: expose every unit of the shader in
> >>>>the instruction word. Then it becomes a software challenge to optimise
> >>>>them.
> >>>
> >>> I'm unsure whether LIW is a good option for this architecture. As
> >>> Andre mentioned earlier, we have a lot of threads that need to
> >>> execute the same instruction over data in close spatial locality.
> >>> Hence, there is really no use in having fine-grained control over
> >>> the different units in a single shader, since in most cases they are
> >>> going to execute the same instruction anyway. Thus, including LIW
> >>> will only increase the complexity of the hardware, without providing
> >>> any substantial gains.
> >>>
> >>
> >>I don't understand your point. Does that mean that the ALU will be full
> >>but the other units will be unused? For example, the adder and the
> >>multiplier could be separate units, and both could be filled at the
> >>same time (a MAC instead of MUL + an adder should be better).
> >
> > Aaah, the joy of terminology. If you take a look at the shader unit
> > figure in [1], you can see how we plan to have several ALUs in a single
> > shader. All those ALUs will execute the same instruction over
> > different threads. Thus, exposing the ALUs to the software developer
> > only adds more complexity on both the hardware and software. Furthermore,
> > the software side will in most cases only have to duplicate the same
> > instruction over several ALUs.
> >
>
> For me that's the definition of SIMD code. How do you deal with
> branches? Do you execute both branches and discard one? If you
> use masked vectors, it looks very much like the new 512-bit vector
> instructions from Intel and Larrabee (AVX?).
>


The microarchitecture looks the same as SIMD. The main difference is that
the data processed by each ALU is independent of the others.

For branch management we have so far proposed a few solutions. The first is
minimum and maximum operations, which compare two values and select the
corresponding result. The other improvement is a loop unit, which should
help with small loops of fixed length.
Both of these should help reduce the number of branches. If some branches
still happen, there will be a caching mechanism: the kernel will continue to
execute the threads that correspond to the first branch. Once the kernel
has finished executing, it will go back in the program to the previous
branch to finish executing the remaining threads.
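
To make the two ideas above concrete, here is a rough software sketch. It is
only an illustration, not the proposed hardware: `clamp` shows how min/max
can replace a pair of branches, and `run_divergent` (a hypothetical name,
with a made-up per-thread computation) shows the "run the first branch, then
go back for the remaining threads" order.

```python
def clamp(x, lo, hi):
    # One max followed by one min; no branch is needed to pick the result.
    return min(max(x, lo), hi)


def run_divergent(values):
    """Each thread computes x * 2 if x >= 0, else -x (illustrative only)."""
    taken = [x >= 0 for x in values]      # per-thread branch condition
    results = [None] * len(values)
    passes = []                           # which threads ran in each pass

    # First pass: only the threads that follow the first branch execute.
    first = [i for i, t in enumerate(taken) if t]
    for i in first:
        results[i] = values[i] * 2
    passes.append(first)

    # Second pass: return to the branch and finish the deferred threads.
    deferred = [i for i, t in enumerate(taken) if not t]
    for i in deferred:
        results[i] = -values[i]
    passes.append(deferred)

    return results, passes
```

In this model the cost of a divergent branch is one extra pass over the
kernel, which is why reducing branches with min/max and the loop unit pays
off.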


> > That being said, after having finished my coffee, and had some time to
> > think, we might be able to utilize LIW, although I'm still unsure about
> > the cost to benefit ratio. Imagine if we, in what we call the ALUs,
> > include several functional units (adders, multipliers, &c.); we can
> > then use LIW in order to fully utilize them. However, this comes with
> > the added
> > cost of logic, and design complexity. The simple way to solve this could
> > be to add a single multiply-adder unit inside each ALU, and thus we
> > avoid the LIW problem altogether.
> >
>
> An x86 instruction uses 2 register addresses: 1 for reading, 1 for
> read/write. It's compact but fast only with register renaming.
> A typical RISC operation is 2r1w: 3 addresses, 2 reads, one write. A MAC
> operation is 3 reads, 1 write.
>
> An LIW could be seen as a 6-read, 3-write execution unit.
>

 The most frequent instruction will have 3 operands: 2 reads and 1 write.
Instructions will be fixed length; we don't want to fall into the trap of
variable-length instructions. Also, instructions will not contain any
constants; those will be in the constant register.
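
As a sketch of what such a fixed-length, 3-operand word could look like:
the field layout below (8-bit opcode, 8-bit destination, two 8-bit sources
in a 32-bit word) is purely an assumption for illustration, since no
encoding has been decided. There is no immediate field; a constant would be
referenced through a register number like any other operand.

```python
FIELD_BITS = 8                      # assumed width of every field
FIELD_MASK = (1 << FIELD_BITS) - 1


def encode(opcode, dst, src_a, src_b):
    """Pack one 2-read/1-write instruction into a 32-bit word."""
    for field in (opcode, dst, src_a, src_b):
        if not 0 <= field <= FIELD_MASK:
            raise ValueError("field does not fit in 8 bits")
    return (opcode << 24) | (dst << 16) | (src_a << 8) | src_b


def decode(word):
    """Unpack a 32-bit word into (opcode, dst, src_a, src_b)."""
    return (
        (word >> 24) & FIELD_MASK,
        (word >> 16) & FIELD_MASK,
        (word >> 8) & FIELD_MASK,
        word & FIELD_MASK,
    )
```

Because every word has the same size and the same field positions, the
decoder is just a handful of shifts and masks.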


> From your terminology, it looks like an ALU with a lot of register
> ports (for example a MAC/MUL unit, beside load/store, beside a complete
> ALU without MUL).
>

The ALU is split from the register file for simplicity's sake. Still, one
feature of the ALU is that it will be completely integrated: the same ALU
will do both integer and float operations. The ALU will be split into
different stages, which will allow the different logic blocks to be reused.
For example, the multiplier will handle the multiplication of all data
types and will also act as a barrel shifter. A barrel shifter is needed both
for the float add and for shifts of more than 1 bit per clock.
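
The reason a multiplier can double as a barrel shifter is that shifting by
n bits is a multiplication by a power of two. A small sketch, assuming
32-bit unsigned operands and a multiplier that produces the full 64-bit
product (both are assumptions here, not settled design points):

```python
MASK32 = 0xFFFFFFFF


def shl_via_mul(x, n):
    """Left shift by n (0..31): keep the low 32 bits of x * 2**n."""
    return ((x & MASK32) * (1 << n)) & MASK32


def shr_via_mul(x, n):
    """Logical right shift by n (0..31): high 32 bits of x * 2**(32 - n)."""
    if n == 0:
        return x & MASK32
    return ((x & MASK32) * (1 << (32 - n))) >> 32
```

The only extra logic needed is the decoder that turns the shift amount into
the one-hot value 2**n fed to the multiplier.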

The only part that may not be integrated is the divider; its status is
still unclear at the moment. We know we need one, but we aren't sure how
to implement it or how the kernel will react to that instruction. We
have 2 main choices: a serial divider or a pipelined divider. The
serial divider's advantage is its relatively small size; the downside is
that it can only process 1 request at a time and takes multiple cycles.
With a pipelined divider we could probably complete 1 division per cycle,
but the problem would be the size requirement.
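
To illustrate the trade-off, here is a sketch of a serial divider using
plain restoring division on unsigned integers. The algorithm choice and the
32-bit width are assumptions for the example; each loop iteration stands for
one clock cycle on the single shared subtract/compare stage.

```python
def serial_divide(dividend, divisor, width=32):
    """Unsigned restoring division, one quotient bit per cycle."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0
    quotient = 0
    cycles = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:      # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
        cycles += 1
    return quotient, remainder, cycles
```

A pipelined version would unroll the same loop into one hardware stage per
bit: a new division could then start every cycle, at roughly `width` times
the area of the serial stage.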


>
> >
> >>
> >>>>
> >>>>One other solution is having word aligned instructions. So you could
> >>>>have 32, 64, 128 bits instructions size.
> >>>
> >>> Before we decide on the length of the instruction, it would be fun to
> >>> further investigate some stuff from real life. And this is where we can
> >>> benefit from some of the software dudes out there. I would like to see
> >>> how big the average shader code is, compared to the available memory we
> >>> have on the underlying technology. By my initial calculations
> >>> here, if we assume 32'000 instructions in a kernel (which from what I
> >>> have seen is a lot), we use about 250KB [1] to store it using 64 bit
> >>> instruction words.  That also leaves us with a lot of flexibility in
> >>> the instruction word, and the decoding should really not be that
> >>> hard either. However, depending on the underlying technology, 250KB
> >>> might be a lot of RAM.
>

The standard block RAM size for Spartan is 18,432 bits (18 Kbit). With
32-bit instructions (parity bits unused), that is 512 instructions per
BRAM. For 32,000 instructions we would need about 63 of them (64 for a
full 32,768)...
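
The arithmetic can be checked with a few lines. This sketch assumes that
only the 16 Kbit data portion of each 18,432-bit block is used (the parity
bits are ignored) and that instructions are not split across blocks;
`brams_needed` is just a helper name for the example.

```python
import math

BRAM_DATA_BITS = 16 * 1024   # assumed usable bits per block, parity unused


def brams_needed(n_instructions, instr_bits=32, bram_bits=BRAM_DATA_BITS):
    """Number of block RAMs required to hold a kernel of the given size."""
    per_bram = bram_bits // instr_bits
    return math.ceil(n_instructions / per_bram)


def kernel_bytes(n_instructions, instr_bits):
    """Raw storage for the kernel in bytes."""
    return n_instructions * instr_bits // 8
```

With these assumptions, 32,000 instructions of 32 bits take 63 blocks, a
full 32,768-entry store takes 64, and the same kernel with 64-bit words
takes 125 blocks, or 256,000 bytes (250 KB).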

> >>
> >>I hope you could put more than a single RISC instruction in 64 bits!
> >>If you fit 3 "basic" instructions in 64 bits, you should divide your
> >>result by 3.
> >
> > Yup, I haven't been thinking a lot about how to structure the ISA yet,
> > and of course, using 64 bits for a RISC-ish ISA is a waste of space. The
> > 64 bit was just to get an example of a worst-case kernel size. However,
> > it would still be interesting to get some metric on the average shader
> > size though, so we can get a better feeling of how big real-world
> > programs are.
> >
> > [1]
> >
> http://docs.google.com/View?id=dfsp4qpd_41dtrrskfb#Specification_for_Shaders_9367_2463043036062943
> >
> > --
> > Life on the earth might be expensive, but it
> > includes an annual free trip around the sun.
> >
> > Kenneth Østby
> > http://langly.org
> >
> >
> >
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>

