Re: [Open-graphics] OGA2 specification and ALU

André Pouliot Wed, 06 Jun 2012 18:49:20 -0700

Hi,
I'll answer each email independently for simplicity's sake.


On 2012-06-06 04:07, Nicolas Boulay wrote:

Hello,

Could you add diagonals of matrices of size 2 and 4 as data types?
These are the complex and quaternion data types.

Maybe strings (UTF-8) could be added? This is quite far from GPGPU, but
string management is really a pain in pure software.

Nvidia has a bunch of vector formats coded as 16-bit numbers for RGB
values, using 5 5 5 bits for each texel. "S3TC" is something like
that: it packs a few pixels in 64 bits, etc. This kind of data avoids
a lot of shifts and bit manipulation, and still saves data bandwidth.

You should also add a way to define arrays of all these data types
(sizes are defined at runtime), so you could introduce a kind of "map"
instruction or behavior to apply an instruction to the whole array.
This could help to get fast tiny inner loops. This looks like the
repeat instruction of some DSPs, but I think it's more comprehensible
to link the behavior to the data themselves.

For the types, we use some basic data types and we try to keep it simple. Each data type we need to support adds complexity we don't want, and it could be included in another data type.

As for vectors and matrices, we don't support them. The choice was made because each ALU is a scalar unit and can process a vector/matrix as scalar components. The latency is greater, but we are optimizing hardware resource utilization.
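To illustrate what processing a vector as scalar components looks like, here is a minimal sketch of a 4-component dot product done with one scalar multiply and one scalar add per component. The function name is made up for this example and is not taken from the OGA2 instruction set.

```python
def dot4_scalar(a, b):
    """Dot product of two 4-component vectors using scalar steps only.

    Hypothetical illustration: a scalar ALU would issue one multiply and
    one add per component, trading latency for simpler hardware.
    """
    acc = 0.0
    for i in range(4):
        product = a[i] * b[i]   # scalar multiply
        acc = acc + product     # scalar add
    return acc


print(dot4_scalar([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))
```

The vector version would finish in fewer steps, but the scalar unit stays busy on every step regardless of the vector width.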

The big thing is that we have a "kernel" that controls multiple ALUs. Each ALU has its own thread. So essentially you are controlling, for example, N different threads of data from the same program.

In parallel to that, each step of the pipeline is controlled by a different kernel. So for one ALU you could have 8 different threads running different kernels and data sets. If you have 4 ALUs, that means you have 32 threads running concurrently, controlled by 8 kernels.
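The thread count above can be sketched as a simple table: one thread per (kernel, ALU) pair. The names below are assumptions made for this example only.

```python
def thread_table(num_alus, kernels_in_flight):
    """List every (kernel, alu) pair; each pair is one hardware thread.

    Assumption for illustration: every kernel in flight drives every ALU,
    and each pipeline step holds a different kernel.
    """
    return [(kernel, alu)
            for kernel in range(kernels_in_flight)
            for alu in range(num_alus)]


threads = thread_table(num_alus=4, kernels_in_flight=8)
print(len(threads))  # 4 ALUs x 8 kernels
```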

It could be interesting to save data bandwidth, but we aren't trying to do that yet. We were trying to keep it simple for the few people who will write the code.
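For reference, this is the kind of shift-and-mask work that software has to do when a packed 16-bit RGB format is not a native data type. The sketch assumes the common layout with red in bits 14-10, green in bits 9-5 and blue in bits 4-0; that layout is an assumption of this example, not something specified for OGA2.

```python
def pack_rgb555(r, g, b):
    """Pack three 5-bit channels into one 16-bit word (assumed layout)."""
    return ((r & 0x1F) << 10) | ((g & 0x1F) << 5) | (b & 0x1F)


def unpack_rgb555(word):
    """Recover the three 5-bit channels with shifts and masks."""
    r = (word >> 10) & 0x1F
    g = (word >> 5) & 0x1F
    b = word & 0x1F
    return r, g, b


word = pack_rgb555(31, 0, 31)
print(hex(word), unpack_rgb555(word))
```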

For the string data type, I must say I don't know of any processor that supports that data type natively. It's usually all software. I know of a few architectures that support BCD integers and floats, but those are for banking systems.


++ and -- are very annoying functions because they mean using the same
register to read and write.

You can, but you could also target another register. It's just that the operation is so common that having a dedicated instruction for it can be interesting. Unless we use a constant register for the value 1.

Some new CPUs use an encoding for some heavily used constants. So there
are some instructions which reserve 3 bits for coding 8 constants such
as 1 2 4 8 16, and not only an immediate number.

Read-after-write dependencies should be broken to make better use of
the pipeline. I have already thought about a "load load" instruction to
better encode expressions such as "pointer->tab[i]". This also hides
more memory latency: if the 2 loads both miss the cache, the core will
wait only a single time instead of 2.

Normally we have a deep enough pipeline and enough kernels running to not care about read/write problems. For memory access, we suppose that most of the data set will be available locally. With each section broken into small work units, we believe that would be the case most of the time.


You should add the MACC operation, which is the most used instruction
(d=a*b+c). You should also think about a fast way to do polynomial
evaluation such as ((((x+a)*x+b)*x+c)*x+d)... This is used a lot in
GPGPU to approximate mathematical functions and for trigonometric
functions. This could be optimised because there is only one variable,
reused a lot, and a bunch of constants. The most common constants could
also be hardwired.

See the previous point: a simple ALU means each operation is seen as a scalar operation, so that operation will be broken down into multiple simple instructions. For the constants, there will be some values in a constant register file to help some calculations. It could be some approximation of trigonometric functions to help speed up the result.
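As a sketch of how the polynomial form quoted above, ((((x+a)*x+b)*x+c)*x+d), breaks down into simple scalar steps: one add to start, then one multiply and one add per remaining constant. The function name and the idea of passing the constants as a list are assumptions for this example; in hardware they would come from the constant register file.

```python
def eval_poly_scalar(x, consts):
    """Evaluate ((((x+a)*x+b)*x+c)*x+d) using only scalar mul and add.

    `consts` holds [a, b, c, d] (hypothetical stand-in for constant
    registers). Only x is reused; everything else is a constant.
    """
    acc = x + consts[0]          # (x + a)
    for c in consts[1:]:
        acc = acc * x            # scalar multiply by the reused variable
        acc = acc + c            # scalar add of the next constant
    return acc


print(eval_poly_scalar(2, [1, 2, 3, 4]))
```

With a MACC instruction, each multiply/add pair in the loop would collapse into a single operation.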

1/sqrt(x) is missing, it could be a one cycle instruction, it's much
faster than sqrt() and 1/x alone.

Divide was still a question without an answer. It was debated, but the best option was never decided.


Kenneth's remark about thread management looks like what AMD has done
for Bulldozer: many decoders that fill many ALUs, instead of having one
ALU for many decoders. I don't think it's possible to have a single
decoder for a sea of ALUs. It looks like a large SIMD processor such as
a Cray computer. SSE is missing some generic instructions, such as a
vector-of-pointers load, to really vectorise all kinds of loops.

It's not SIMD, it's more SIMT: each ALU works on an independent data set. That means no vector instructions. We evaluated the hardware use of scalar versus vector, and you have a lot of wasted resources with a vector processor.
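A minimal sketch of the SIMT idea: one instruction stream is issued to every ALU, but each ALU applies it to its own scalar registers. The tiny two-opcode instruction format below is invented for this example and is not the OGA2 encoding.

```python
def run_simt(program, register_files):
    """Issue each instruction once; every ALU executes it on its own data.

    `program` is a list of (op, dst, src1, src2) tuples and
    `register_files` has one list of scalar registers per ALU
    (both are hypothetical formats for illustration).
    """
    for op, dst, src1, src2 in program:
        for regs in register_files:      # same instruction, independent data
            if op == "mul":
                regs[dst] = regs[src1] * regs[src2]
            elif op == "add":
                regs[dst] = regs[src1] + regs[src2]
            else:
                raise ValueError(f"unknown op: {op}")
    return register_files


program = [("mul", 2, 0, 1),   # r2 = r0 * r1
           ("add", 2, 2, 0)]   # r2 = r2 + r0
print(run_simt(program, [[2, 3, 0], [4, 5, 0]]))
```

Every operation here is scalar; the parallelism comes from the number of ALUs, not from the width of an instruction.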

Regards,
Nicolas
2012/6/6 Andre Pouliot<[email protected]>:

Hi everyone,

I have two documents to share. They are both related to OGA2 and the
programmable architecture we were planning.

The first one is the architecture description for OGA2 as was
discussed between me and Kenneth. It's now a few years old; time passes
fast. If people want to rework it, it needs some reorganizing and some
updates. Contact me and I'll allow edits for those who ask.
https://docs.google.com/document/pub?id=1yE70dWsRPmg723tfxouQHdK5Mlq3khU8gCRmiu1vxNI

The other document is the breakdown of an ALU for the shader that does
both float and integer. It's based mostly on the instruction set in
the specification. I still need to find the original document; so far
I have only found the PDF.
https://docs.google.com/open?id=0B0gdvUojV4mJUWhldEZIWWx0UEk

If you have questions after looking at those documents, I'll try to
answer as best I can.

Have fun

André
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)

