2012/6/7 André Pouliot <[email protected]>:
> Hi,
> I'll answer each email independently for simplicity's sake.
>
>
> On 2012-06-06 04:07, Nicolas Boulay wrote:
>>
>> Hello,
>>
>> Could you add diagonal matrices of size 2 and 4 as data types? These
>> would serve as complex and quaternion datatypes.
>>
>> Maybe strings (UTF-8) could be added? This is quite far from GPGPU,
>> but string management is a real pain to handle in pure software.
>>
>> Nvidia has a bunch of vector formats coded as 16-bit numbers, e.g.
>> RGB texture values using 5-5-5 bits per channel. "S3TC" is something
>> similar that packs a few pixels into 64 bits, etc. These kinds of
>> formats avoid a lot of shifting and bit manipulation, and still save
>> data bandwidth.
>>
>> You should also add a way to define arrays of all these data types
>> (with sizes defined at runtime), so you could introduce a kind of
>> "map" instruction or behavior that applies an instruction to the
>> whole array. This could help make tiny inner loops fast. It looks
>> like the repeat instruction of some DSPs, but I think it is more
>> comprehensible to attach the behavior to the data itself.
>
> For the types we use some basic data types, and we try to keep it
> simple. Each data type we have to support adds complexity we don't
> want, and most can be expressed with another data type.

In an Nvidia paper from a few _years_ ago, they used many kinds of
basic data types; this saves a lot of data manipulation.
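For example, unpacking a single 5-5-5 texel in software shows the shift-and-mask work that a native packed format would make free. A sketch in C; the field order here is only an assumption, real formats differ between APIs:

```c
#include <stdint.h>

/* Unpack a 16-bit 5-5-5 texel (1 bit unused) into 8-bit channels.
 * The bit layout (R in bits 14-10, G in 9-5, B in 4-0) is assumed
 * for illustration; real texture formats vary per API. */
static void unpack_rgb555(uint16_t t, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (uint8_t)(((t >> 10) & 0x1F) << 3);  /* 5 bits -> top of 8 */
    *g = (uint8_t)(((t >>  5) & 0x1F) << 3);
    *b = (uint8_t)(( t        & 0x1F) << 3);
}
```

Every texel fetch costs three shifts and three masks in software; hardware support makes that free and keeps the bandwidth saving.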

>
> For vectors and matrices, we don't support them. The choice was made
> because each ALU is a scalar unit and can process a vector/matrix as
> scalar components. The latency is greater, but we are optimizing
> hardware resource utilization.
>

You don't like multicycle instructions? They could help reduce code
size and simplify instruction scheduling, and you could even
anticipate some pipeline hazards.

> The big thing is that we have a "kernel" that controls multiple ALUs.
> Each ALU has its own thread. So essentially you are controlling, for
> example, N different threads of data from the same program.

You mean that the "kernel" is a kind of decoder, and the same
instruction goes to each ALU? So how do you manage branches?
A single decoder, even with some form of task splitting, feeding many
ALUs is the exact definition of SIMD! SSE is only a simple way of
doing it (load/store of consecutive words in memory, for example).
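The usual SIMD answer to branches is predication: every lane executes both arms and a per-lane mask selects the result, so no lane ever takes a real jump. A scalar sketch of the idea in C (the lane count and the mask trick are illustrative, not a description of your design):

```c
#include <stdint.h>

#define LANES 4

/* All lanes execute both arms of the "branch"; a per-lane mask
 * (all-ones or all-zeros) picks which result is kept. */
static void select_masked(const int32_t x[LANES], int32_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        int32_t then_val = x[i] * 2;            /* arm for x >= 0 */
        int32_t else_val = -x[i];               /* arm for x <  0 */
        int32_t mask = -(int32_t)(x[i] >= 0);   /* -1 or 0 */
        out[i] = (then_val & mask) | (else_val & ~mask);
    }
}
```

The cost is that divergent lanes do wasted work: with a single decoder, both arms are always issued.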

>
> In parallel to that, each stage of the pipeline is controlled by a
> different kernel. So for one ALU you could have 8 different threads
> running different kernels and data sets. If you have 4 ALUs, that
> means you have 32 threads running concurrently, controlled by 8
> kernels.

How do you ensure that the data for these 32 threads stay in the
cache? And 32 threads are not enough to hide the minimum of 100 cycles
spent waiting on DRAM.
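The arithmetic here is just Little's law: requests in flight = latency x issue rate. A tiny sketch (the numbers below are illustrative, not OGA2 figures):

```c
/* Little's law: to keep the memory system busy, the number of
 * outstanding requests must equal the access latency (cycles)
 * times the rate at which loads are issued (requests/cycle). */
static int requests_in_flight(int latency_cycles, int issues_per_cycle)
{
    return latency_cycles * issues_per_cycle;
}
```

With 100-cycle DRAM and 4 ALUs each wanting one load per cycle, you would need 400 requests in flight; 32 threads with one outstanding load each fall far short.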

> It could be interesting to save data bandwidth, but we aren't trying
> to do that yet. We were trying to keep it simple for the few people
> who will write the code.
>
> For a string data type, I must say I don't know of any processor that
> supports that datatype natively. It's usually all software. I know of
> a few architectures that support BCD integers and floats, but that's
> for banking systems.

The latest versions of SSE (the SSE4.2 string instructions) can use
XMM registers to process strings.

>
>>
>> ++ and -- are very annoying operations because they mean using the
>> same register for both read and write.
>
> You can, but you could also target another register. It's just that
> the operation is so common that having a dedicated instruction for it
> can be interesting. Unless we use a constant register for the value 1.
>
>> Some new CPUs use special encodings for heavily used constants. So
>> some instructions reserve 3 bits to encode 8 constants such as 1, 2,
>> 4, 8, 16, instead of only taking a full immediate.
>>
>> Read-after-write dependencies should be broken to make better use of
>> the pipeline. I have already thought about a "load load" instruction
>> to better encode accesses like "pointer->tab[i]". This also hides
>> more memory latency: if the two loads both miss the cache, the core
>> waits only once instead of twice.
>
>
> Normally we have a deep enough pipeline and enough kernels running not
> to care about read-write hazards. For memory access, we assume that
> most of the data set will be available locally. With each section
> broken into small work units, we believe that will be the case most of
> the time.

How can you guarantee that the data will be in the cache? This seems
impossible with 16 textures of a few hundred kB each and a screen of
megapixels.

What will happen is that you will switch kernel/thread on every load,
and you will thrash the cache continuously, because so many threads
will be waiting on too many data streams. So your cache hit rate will
be very low.

>
>>
>> You should add a MACC operation, which is the most used instruction
>> (d = a*b + c). You should also think about a fast way to do
>> polynomial evaluation, as in (((x+a)*x+b)*x+c)*x+d...; this is used
>> a lot in GPGPU to approximate mathematical functions, in particular
>> trigonometric functions. It can be optimized because there is only
>> one variable, reused many times, plus a bunch of constants. The most
>> common constants could also be hardwired.
>
> See the previous point: a simple ALU means each operation is seen as a
> scalar operation, and that operation will be broken down into multiple
> simple instructions.

If the broken-down version is just as fast, there is no problem. But
MACC is the most used combined instruction by a wide margin (80% of
shader instructions? more?). You could almost double the speed of the
shader if you support FMACC.

The problem with polynomial evaluation is how you retrieve the
constants. If memory loads are slow, you will lose a lot of cycles
waiting on memory.

> For the constants, there will be some values in a constant register
> file to help with the calculations. It could include approximations of
> trigonometric functions to help speed up the result.
>
>> 1/sqrt(x) is missing; it could be a one-cycle instruction, and it's
>> much faster than computing sqrt() and 1/x separately.
>
> Divide was still a question without an answer; it was debated, but we
> never decided what the best option was.
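For comparison, the well-known software fallback is the bit-trick approximation of 1/sqrt(x) (the "Quake III" version) with one Newton step; a dedicated instruction would of course beat it:

```c
#include <stdint.h>
#include <string.h>

/* Classic bit-level approximation of 1/sqrt(x) for float, refined
 * by one Newton-Raphson iteration (roughly 0.2% max error). */
static float rsqrt_approx(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);          /* reinterpret float bits */
    i = 0x5f3759dfu - (i >> 1);        /* magic initial guess */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);     /* one Newton step */
    return y;
}
```

Even this trick costs an integer shift, a subtract, and several float ops per call, which is why a hardware rsqrt is worth having.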
>
>>
>> Kenneth's remark about thread management looks like what AMD has done
>> for Bulldozer: many decoders that feed many ALUs, instead of having
>> one ALU shared by many decoders. I don't think it's possible to have
>> a single decoder for a sea of ALUs. It would look like a large SIMD
>> processor, like a Cray computer. SSE is missing some generic
>> instructions, such as vector-of-pointer loads, to really vectorize
>> all kinds of loops.
>
> It's not SIMD, it's more like SIMT: each ALU works on an independent
> data set. That means no vector instructions. We evaluated the hardware
> usage of scalar versus vector, and you have a lot of wasted resources
> with a vector processor.

Vector processors are a waste only if the ALUs can only be fed through
a fixed-width bus (like a RISC pipeline duplicated 4x to match the
register width). If the ALUs are more independent, the problem is not
the same at all. If you can do a vector load from different addresses,
it is not the same problem either. The bottleneck then becomes the
number of ports on the register bank and the way ALU use is
interleaved between threads.

The design you use is a SIMD design, if what you call the "kernel" is
a single decoder. The independent data sets could be handled with
vectors of pointers: a register contains different addresses, and each
address is fetched independently. This is what game designers want
from a CPU. But the pressure will be on the load/store unit, which
will always be the slowest part.
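This "vector of pointer" load is what later ISAs call a gather: one instruction that fetches from N independent addresses. A scalar C sketch of the semantics (the width is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define WIDTH 4

/* Gather: out[i] = base[idx[i]] for each lane. One instruction
 * names all WIDTH addresses, but the load/store unit still has
 * to service every address individually, which is the pressure
 * point mentioned above. */
static void gather_i32(const int32_t *base, const uint32_t idx[WIDTH],
                       int32_t out[WIDTH])
{
    for (size_t i = 0; i < WIDTH; i++)
        out[i] = base[idx[i]];
}
```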

>
>> Regards,
>> Nicolas
>> 2012/6/6 Andre Pouliot<[email protected]>:
>>>
>>> Hi everyone,
>>>
>>> I have two documents to share. They are both related to OGA2 and the
>>> programmable architecture we were planning.
>>>
>>> The first one is the architecture description for OGA2 as it was
>>> discussed between me and Kenneth. It is now a few years old; time
>>> passes fast. If people want to rework it, it needs some reorganizing
>>> and some updates. Contact me and I'll enable editing for those who
>>> ask.
>>>
>>> https://docs.google.com/document/pub?id=1yE70dWsRPmg723tfxouQHdK5Mlq3khU8gCRmiu1vxNI
>>>
>>> The other document is the breakdown of an ALU for the shader that
>>> does both float and integer. It's based mostly on the instruction set
>>> in the specification. I still need to find the original document; I
>>> have only found the PDF.
>>> https://docs.google.com/open?id=0B0gdvUojV4mJUWhldEZIWWx0UEk
>>> https://docs.google.com/open?id=0B0gdvUojV4mJUWhldEZIWWx0UEk
>>>
>>> If you have questions after looking at those documents, I'll try to
>>> answer them as best as I can.
>>>
>>> Have fun
>>>
>>> André
>>> _______________________________________________
>>> Open-graphics mailing list
>>> [email protected]
>>> http://lists.duskglow.com/mailman/listinfo/open-graphics
>>> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>
>