2009/9/28 Kenneth Ostby <[email protected]>

> Timothy Normand Miller:
> >On Mon, Sep 21, 2009 at 9:41 PM, Hugh Fisher <[email protected]>
> wrote:
> >> Loads and stores are mostly of matrices (eg skinning), or materials
> >> and colors which are one or more 3/4-way RGB/RGBA vectors.
> >
> >Good argument for vector load instructions.  I can totally buy that.
>
> I see the point of vector load instructions; however, as a computer
> architect, I'm wondering if there are other ways to solve this, mainly
> because I want to expose a simple ISA to the software developers
> (software developers here being the compiler and OpenGL driver
> developers).
>
> What strikes me, and this isn't well thought through yet, is that we
> might be able to "emulate", for lack of a better term, the performance
> gain of a vector load instruction by instead increasing the cache line
> size from the naive 32 bits to Y*(number of ALUs). This will have the
> effect that every time we make a request for address X, we will fetch
> the entire line. Although it might not be the entire vector for a
> single thread, we can, by storing our data in an interleaved fashion,
> make sure that it's the data for the threads that are to be scheduled
> on the cores in that round.
>
>
From my point of view, if software compatibility between chip revisions is
not mandatory and if the instruction word size is not critical, it could be
very interesting to be very explicit about load/store requests, with
instructions that make it possible to mask the latency of the memory.

"Preload" is an explicit technique, compared to the "prefetching" done by a
cache: a preload is just a load issued very early in the code, putting the
value in a register that is used later. For example, you load data n while
you compute data n+1 and store computed data n+2. Usually we could do this
with normal registers, but it increases register allocation pressure.
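To illustrate, here is a rough C sketch of that pipelining pattern (the
function and its names are purely illustrative, not the actual ISA): the
load for element n+1 is issued while element n is being computed, so the
memory latency overlaps with the ALU work at the cost of one extra live
register.

```c
#include <stddef.h>

/* Software-pipelined loop: "preload" the value needed by the next
 * iteration while computing with the current one.  The extra variable
 * `next` plays the role of the preload register. */
void scale(const float *in, float *out, size_t len, float k)
{
    if (len == 0)
        return;

    float next = in[0];              /* preload the first element */
    for (size_t n = 0; n < len; n++) {
        float cur = next;            /* value loaded in an earlier step */
        if (n + 1 < len)
            next = in[n + 1];        /* issue the load for iteration n+1 */
        out[n] = cur * k;            /* compute with the current value */
    }
}
```

A real compiler will often do this transformation itself; the point here is
only to show the pattern the explicit instruction would encode.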

It could look like a set of specific registers four times wider than the
normal ones. A preload is then simply a load of 4 values at the same time
(or only 2 if we need only 2 values). You then need a register move between
the wide register set and the normal one.
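A minimal C sketch of such a wide register, assuming 4 lanes of 32-bit
floats (the struct and the helper names are hypothetical, chosen only to
make the load/move split concrete):

```c
#include <string.h>

/* Hypothetical 4-wide "load/store register". */
typedef struct {
    float lane[4];
} wide_reg;

/* One wide load: fills all four lanes from consecutive memory
 * in a single request. */
static wide_reg wide_load(const float *addr)
{
    wide_reg r;
    memcpy(r.lane, addr, sizeof r.lane);
    return r;
}

/* Register move from the wide register set to a normal scalar one. */
static float wide_move(const wide_reg *r, int lane)
{
    return r->lane[lane];
}
```

One wide load followed by cheap lane moves replaces four separate memory
requests, which is where the latency masking comes from.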

The optimal size of this "load/store register" is what makes it possible to
really use all the available bandwidth of the DRAM interface (64 bits with a
burst size of 2 + ???). It's also interesting on the store side, because you
don't need a "write buffer" to really use all the bandwidth.

If you look carefully at this architecture, it is very much like a cache,
but an explicit cache dedicated to each instruction flow. I think this will
be much more efficient than a "normal" cache between the core and the DRAM,
because with so many threads in flight, any memory access locality will be
broken.

The right number of registers depends on the number of "data streams" and on
the number of duplications needed to mask the latency of the I/O.

Regards,
Nicolas


> --
> Life on the earth might be expensive, but it
> includes an annual free trip around the sun.
>
> Kenneth Østby
> http://langly.org
>
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>