On Wed, Sep 23, 2009 at 6:36 AM, Kenneth Ostby <[email protected]> wrote:
> Nicolas Boulay:
>>2009/9/23 Kenneth Ostby <[email protected]>:
>>> Hi,
>>>
>>> Nicolas Boulay:
>>>>2009/9/23 Hugh Fisher <[email protected]>:
>>>>> Andre Pouliot wrote:
>><...>
>>>>
>>>>Personally, LIW is what I prefer: expose every unit of the shader in
>>>>the instruction word. Then it becomes a software challenge to optimise
>>>>them.
>>>
>>> I'm unsure whether LIW is a good option for this architecture, because,
>>> as Andre mentioned earlier, we have a lot of threads that need
>>> to execute the same instruction over data with close spatial
>>> locality. Hence, there is really no use in having fine-grained control
>>> over the different units in a single shader, since in most cases they
>>> are going to execute the same instruction anyway. Thus, including LIW
>>> will only increase the complexity of the hardware without providing any
>>> substantial gains.
>>>
>>
>>I don't understand your point. Do you mean that the ALU will be full
>>but the other units will be unused? For example, the adder and the
>>multiplier could be separate units, and both could be filled at the
>>same time (a MAC instead of a MUL plus an adder should be better).
>
> Aaah, the joy of terminology. If you take a look at the shader unit
> figure in [1], you can see how we plan to have several ALUs in a single
> shader. All those ALUs will execute the same instruction over
> different threads. Thus, exposing the ALUs to the software developer
> only adds more complexity on both the hardware and software sides.
> Furthermore, the software side will in most cases only have to duplicate
> the same instruction over several ALUs.
>
> That being said, after having finished my coffee and had some time to
> think, we might be able to utilize LIW, although I'm still unsure about
> the cost-to-benefit ratio. If we include several functional units
> (adders, multipliers, &c.) inside what we call the ALUs, we can use
> LIW in order to fully utilize them. However, this comes at the added
> cost of logic and design complexity. A simpler way to solve this could
> be to add a single multiply-add unit inside each ALU, and thus
> avoid the LIW problem altogether.

[Side note:  When I say "add", I mean "add and sub", but they would
use the same hardware.  Also, I forgot to mention logical ops and
several other things, but that's a nit-pick.]

These instructions are rare:

- div
- convert
- memory load/store (more common than the others, but rarer than add and mult)

These instructions are common:

- add
- mult

This can be had for free:

- Flow control (but it will be completely absent from many kernels)


So there's no point in adding extra instruction bits for anything
other than add and mult.  Also, since we won't tend to mix fp and int,
there's no point in providing simultaneous access to fp and int add
and mul.

So if we do LIW, I propose this:

Slot 0:  any instruction (add, sub, div, flow control, memory, etc.)
Slot 1:  any of fp add, fp mul, int add, int mul not conflicting with slot 0.

I'm assuming that we include vector instructions (even if they get
unrolled into scalars).

Now, you could do an int add and an fp add at the same time or an add
and a mul, or whatever.  Or you could do one at the same time as a
memory op.

I fear that the code bloat will be so bad that we'll get killed by the
icache misses, completely defeating the gains we get from LIW.


>
>
>>
>>>>
>>>>One other solution is having word-aligned instructions, so you could
>>>>have 32-, 64-, or 128-bit instruction sizes.
>>>
>>> Before we decide on the length of the instruction, it would be fun to
>>> further investigate some numbers from real life. And this is where we can
>>> benefit from some of the software dudes out there. I would like to see
>>> how big the average shader code is, compared to the memory available
>>> on the underlying technology. From my initial calculations
>>> here, if we assume 32,000 instructions in a kernel (which, from what I
>>> have seen, is a lot), we use about 250KB [1] to store it using 64-bit
>>> instruction words. That also leaves us with a lot of flexibility in the
>>> instruction word, and the decoding should really not be that hard
>>> either. However, depending on the underlying technology, 250KB might be
>>> a lot of RAM.
>>
>>I hope you could put more than a single RISC instruction in 64 bits!
>>If you can fit 3 "basic" instructions in 64 bits, you should divide your
>>result by 3.
>
> Yup, I haven't been thinking a lot about how to structure the ISA yet,
> and of course, using 64 bits for a RISC-ish ISA is a waste of space. The
> 64 bits were just to get an example of a worst-case kernel size. However,
> it would still be interesting to get some metrics on average shader
> size, so we can get a better feeling for how big real-world
> programs are.
>
> [1]
> http://docs.google.com/View?id=dfsp4qpd_41dtrrskfb#Specification_for_Shaders_9367_2463043036062943
>
> --
> Life on the earth might be expensive, but it
> includes an annual free trip around the sun.
>
> Kenneth Østby
> http://langly.org
>
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>



-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
