On 2012-06-06 11:29, Nicolas Boulay wrote:
You could add split loads to the instruction section.

The DRAM bus is at least 64 bits wide, sometimes 128 bits, but most of
the time you have 2 or 3 64-bit buses. DRAM has a minimum burst of
2 elements; bursts of 4 seem to have vanished. So for every load
you will get 128 bits of data. Most of the time this is used to fill
a cache line, but if you have a dedicated register bank that supports
128-bit writes from memory, and read/write access for every type
from the ALU side, you could use a true preload technique. Consecutive
loads become register accesses instead of loads/stores with a cache hit;
the datapath is still much faster. It's as if you made the write buffer
accessible.
For data access we were thinking of going with a message-passing bus, so the width of the data received from the RAM doesn't matter much, since the data is packetized and travels between the functional units. Normally each kernel receives its data from a query issued by a scheduler before the kernel starts.


Prefetch techniques are hard to get right. The software becomes obsolete
with every new hardware release (prefetch too soon and the data will
be discarded but the bandwidth has been used; too late and you still
pay the latency penalty). A hardware prefetcher should rather do nothing
than do something wrong, because performance will suffer. Preload
will always be good, because the data stays in a register.
No prefetch, except in the sense that the data is transferred from memory to the thread registers/local memory. Normally, while a kernel is running it shouldn't need to fetch new data before finishing its work. The latency penalty isn't a problem: we expect a lot of latency, and we plan to run other kernels in the shader at the same time.
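To make the prefetch-timing tradeoff above concrete, here is a minimal C sketch, assuming a GCC/Clang-style __builtin_prefetch; the distance constant is a hypothetical tuning knob, and the right value depends on the exact hardware, which is Nicolas's point about software prefetch going stale:

```c
#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead.
 * Too small a distance and the data arrives late (latency penalty);
 * too large and it may be evicted before use (wasted bandwidth). */
#define PREFETCH_DIST 16  /* hypothetical tuning value */

long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            /* args: address, 0 = read, 1 = low temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
        s += a[i];
    }
    return s;
}
```

A hardware prefetcher avoids this per-machine tuning, but as noted it should fall back to doing nothing rather than prefetching wrongly.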


Regards,
Nicolas

2012/6/6 Nicolas Boulay<[email protected]>:
no problem.

2012/6/6 André Pouliot<[email protected]>:
Hi Nicolas,

I'll try to answer you today. But could I send you my answer on the list?
You do raise a few good points that I discussed with Kenneth before. It would
be interesting for everyone to see the discussion.

André


On 2012-06-06 04:07, Nicolas Boulay wrote:
Hello,

Could you add diagonal matrices of size 2 and 4 as data types? These
cover the complex and quaternion datatypes.

Maybe strings (UTF-8) could be added? This is quite far from GPGPU, but
string management is really a pain to handle in pure software.

Nvidia has a bunch of vector formats coded as 16-bit numbers, e.g. RGB
values using 5 bits per channel for each texel. "S3TC" is something
similar that packs a few pixels into 64 bits, etc. This kind of data type
avoids a lot of shifting and bit manipulation, and still saves data bandwidth.
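As an illustration of the bit manipulation such packed formats would save, here is a small C sketch unpacking one RGB555 texel into 8-bit channels. The channel layout (R in the high bits, top bit unused) is an assumption for illustration; real formats vary by API:

```c
#include <stdint.h>

/* Unpack a 16-bit RGB555 texel into 8-bit channels. */
void unpack_rgb555(uint16_t texel, uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint8_t r5 = (texel >> 10) & 0x1F;
    uint8_t g5 = (texel >> 5)  & 0x1F;
    uint8_t b5 =  texel        & 0x1F;
    /* Replicate the top bits so 31 maps to 255 and 0 maps to 0. */
    *r = (uint8_t)((r5 << 3) | (r5 >> 2));
    *g = (uint8_t)((g5 << 3) | (g5 >> 2));
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
}
```

With a native packed data type, all of these shifts and masks would disappear into the load path.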

You should also add a way to define arrays of all these data types (with
sizes defined at runtime), so you could introduce a kind of "map"
instruction or behavior that applies an instruction to the whole array.
This could help produce fast tiny inner loops. It looks like the repeat
instruction of some DSPs, but I think it's more comprehensible to attach
the behavior to the data themselves.
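A software model of this "map" behavior, as a minimal C sketch (the function names are illustrative, not from any proposed ISA): on hardware this could be a single repeated instruction, while here it is just the tight inner loop such an instruction would replace:

```c
#include <stddef.h>

/* Apply one operation to every element of an array whose size is
 * only known at runtime -- the behavior attached to the data. */
typedef float (*map_fn)(float);

void map_array(float *dst, const float *src, size_t n, map_fn f)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = f(src[i]);
}

/* Example operation. */
static float twice(float x) { return 2.0f * x; }
```

A dedicated instruction would remove the per-element loop overhead (increment, compare, branch) that this C version pays on every iteration.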

++ and -- are very annoying operations because they mean using the same
register for both read and write.

Some new CPUs use special encodings for some heavily used constants. So
some instructions reserve 3 bits to encode 8 constants such as 1, 2, 4,
8, 16, instead of only taking an immediate number.

Read-after-write dependencies should be broken to make better use of the
pipeline. I have already thought about a "load load" instruction to
better encode accesses like "pointer->tab[i]". This also hides more
memory latency: if both loads miss the cache, the core will wait
only once instead of twice.
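A plain C model of the access pattern in question (the struct and names are illustrative): the two loads are serially dependent, which is what a fused "load load" instruction would issue as one operation so that a miss on each costs only one round of latency:

```c
#include <stddef.h>

/* "pointer->tab[i]" needs two dependent loads: first the pointer
 * field, then the indexed element. */
struct node {
    int *tab;
};

int load_load(const struct node *p, size_t i)
{
    int *t = p->tab;   /* load 1: may miss the cache       */
    return t[i];       /* load 2: depends on load 1's value */
}
```

The second load cannot even be issued until the first completes, which is why the two misses serialize on a conventional pipeline.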

You should add a MACC operation, which is the most used instruction
(d = a*b + c). You should also think about a fast way to do polynomial
evaluation, as in ((((x+a)*x+b)*x+c)*x+d)... This is used a lot in GPGPU
to approximate mathematical functions, including the trigonometric ones.
It could be optimised because there is only one variable reused a
lot and a bunch of constants. The most common constants could also be
hardwired.

1/sqrt(x) is missing; it could be a one-cycle instruction, and it's much
faster than doing sqrt() followed by 1/x.
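For context on why 1/sqrt(x) deserves its own opcode, here is the well-known software fallback (the "fast inverse square root" bit trick with one Newton-Raphson step) as a C sketch; a hardware rsqrt instruction collapses all of this into a single operation:

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x): bit-trick initial guess plus one
 * Newton-Raphson refinement step. Accurate to roughly 0.2%. */
float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);      /* reinterpret float bits   */
    i = 0x5f3759df - (i >> 1);     /* magic initial guess      */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y); /* one Newton step          */
    return y;
}
```

Even this approximation costs several multiplies and an integer shift per call, versus one cycle for a dedicated instruction.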

Kenneth's remark about thread management looks like what AMD has done
for Bulldozer: many decoders feeding many ALUs, instead of having one
ALU for many decoders. I don't think it's possible to have a single
decoder for a sea of ALUs. It would look like a large SIMD processor
such as a Cray computer. SSE is missing some generic instructions, such
as loads through a vector of pointers, to really vectorise all kinds of loops.

Regards,
Nicolas
2012/6/6 Andre Pouliot<[email protected]>:
Hi everyone,

I have two documents to share. They are both related to OGA2 and the
programmable architecture we were planning.

The first one is the architecture description for OGA2 as it was
discussed between me and Kenneth. It's now a few years old; time passes
fast. If people want to rework it, it needs some reorganizing and some
updates. Contact me and I'll give edit access to those who ask.

https://docs.google.com/document/pub?id=1yE70dWsRPmg723tfxouQHdK5Mlq3khU8gCRmiu1vxNI

The other document is the breakdown of an ALU for the shader that does
both float and integer. It's based mostly on the instruction set in
the specification. I still need to find the original document; I have
only found the PDF.
https://docs.google.com/open?id=0B0gdvUojV4mJUWhldEZIWWx0UEk

If you have questions after looking at those documents, I'll try to
answer as best as I can.

Have fun

André
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)


