> On 4/20/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
>> I don't think it's wise to use SIMD ALU here. All scalar code will use
>> the
>> SIMD FPU with 3 FMUL unit idle. Because everything is strongly
>> parrallel,
>> i think it's better to stay scalar.
>
> If there are enough independent scalars that can be scheduled, you can
> pack them and run them in parallel.
>

So you need logic to detect when packing is possible, and you need a
switch that connects the different register banks to the FPUs.

For what?

The only advantage over 4 cores depends on the size of the control logic,
that is, on whether it is negligible compared with the size of a 32-bit
FPU. On the other hand, a 32-bit switch could be big.

The goal of the shader is to maximise use of the FPU, or more precisely of
the FMUL instruction.

So you could create an instruction word that looks like this:
- Operation code 1 + addr of read register 1 + addr of read register 2 +
  addr of write register 1
  (this is for MOV, LOAD & STORE, and maybe for logical ops such as "<" and
  ">", so it could also do FADD and FSUB)
- Operation code 2 + addr of read register 4 + addr of read register 5 +
  addr of write register 2
  OPCODE2 could be small (FMUL, integer MUL, what else?)

There are 2 register banks, so each bank could be a memory with 4 read
ports and 1 write port. Each read could access either bank, but each write
could only access its dedicated bank. It depends on the technology you can
afford (full custom or not... memories with 4 read and 2 write ports are
maybe common nowadays).
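
A minimal sketch of that register-file arrangement, as a behavioural model
only. The class name, the 64 registers per bank, and the choice of the top
address bit as bank select are assumptions made for illustration; none of
them come from the proposal itself.

```python
class RegisterFile:
    """Two banks: reads may hit either bank, each write port owns one bank."""

    def __init__(self, regs_per_bank=64):
        self.regs_per_bank = regs_per_bank
        self.banks = [[0.0] * regs_per_bank for _ in range(2)]

    def _split(self, addr):
        # Assumed mapping: addresses 0..63 are bank 0, 64..127 are bank 1.
        if not 0 <= addr < 2 * self.regs_per_bank:
            raise IndexError(f"register address {addr} out of range")
        return divmod(addr, self.regs_per_bank)

    def read(self, addr):
        # Any of the read ports can reach both banks.
        bank, index = self._split(addr)
        return self.banks[bank][index]

    def write(self, port, addr, value):
        # Write port 0 is wired to bank 0 only, write port 1 to bank 1 only.
        bank, index = self._split(addr)
        if bank != port:
            raise ValueError(f"write port {port} cannot reach bank {bank}")
        self.banks[bank][index] = value
```

With this split, the compiler or assembler would have to place the result
of each of the two operations in the bank owned by its write port.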

Then you add:
- Predicate
  That's a very easy way to make small "if" statements without breaking
  the pipeline (like CMOVE in x86). The predicate is the address of a
  register, and it tells whether that register is null or not. If the
  register is null, the current operations are cancelled.
- Predicate + Imm8
  That's the way to handle loops, jumps, and the repeat instruction of some
  DSPs. If the register is not null, PC+IMM8 is taken with a delay slot of
  1; otherwise PC+1 is used.
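
A small interpreter sketch of these two mechanisms, assuming the Imm8
offset is relative to the jump instruction itself and that the instruction
right after a taken jump (the delay slot) always executes. The dict-based
instruction format and the helper `add_imm` are hypothetical, chosen only
to keep the example short.

```python
def add_imm(dst, imm):
    """Hypothetical ALU op: regs[dst] += imm."""
    def op(regs):
        regs[dst] += imm
    return op


def run(program, regs, max_steps=1000):
    """Run instruction dicts with optional 'pred' and either 'op' or 'jump'."""
    pc = 0
    pending_target = None  # target of a taken jump, waiting for the delay slot
    for _ in range(max_steps):
        if pc >= len(program):
            break
        ins = program[pc]
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction is the delay slot: execute it, then land.
            next_pc = pending_target
            pending_target = None
        pred = ins.get("pred")
        enabled = pred is None or regs[pred] != 0
        if enabled:  # a null predicate register cancels the instruction
            if "jump" in ins:
                pending_target = pc + ins["jump"]
            else:
                ins["op"](regs)
        pc = next_pc
    return regs


# Loop three times using r1 as counter; r2 accumulates.
loop = [
    {"op": add_imm(2, 10)},   # 0: r2 += 10
    {"op": add_imm(1, -1)},   # 1: r1 -= 1
    {"pred": 1, "jump": -2},  # 2: if r1 != 0, go back to 0
    {"op": add_imm(2, 1)},    # 3: delay slot, runs on every pass
]
result = run(loop, [0, 3, 0])  # result == [0, 0, 33]
```

The delay slot executes on each of the three passes, which is why r2 ends
at 33 rather than 31.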

The instruction word is big :) I have read somewhere that 32 vec4
registers are needed, so you must have at least 128 32-bit registers. If
you add some tricks such as R0 == 0 and some special-purpose registers, a
register address will need 8 bits. Depending on how you encode the
opcodes, you will reach 80 bits for an instruction word.
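
One way the bit budget could add up, as a sketch: six 8-bit register
addresses, one 8-bit predicate register address, and the Imm8 take 64
bits, which leaves 16 bits for the two opcodes. The 10/6 split between
the opcodes and all field names below are assumptions for illustration.

```python
FIELDS = [  # (name, width in bits); the op1/op2 widths are assumptions
    ("op1", 10), ("rd1a", 8), ("rd1b", 8), ("wr1", 8),
    ("op2", 6), ("rd2a", 8), ("rd2b", 8), ("wr2", 8),
    ("pred", 8), ("imm8", 8),
]
WORD_BITS = sum(width for _, width in FIELDS)  # 80 with the widths above


def encode(**values):
    """Pack named fields into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack an instruction word back into a dict of raw field values."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out
```

The Imm8 is stored here as a raw 8-bit value; treating it as a signed
offset would be up to the decoder.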

The "jump part" could manage the PC "directly" with a delay slot. The
predicate could enable or disable the instruction (you could use a
128-bit register to avoid the need to compare each register with zero).
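
One possible reading of the 128-bit register, sketched below: keep one
flag bit per register, updated whenever that register is written, so that
evaluating a predicate is a single bit test. The class and method names
are hypothetical.

```python
class FlaggedRegisters:
    """Register file with one 'non-zero' flag bit per register."""

    def __init__(self, count=128):
        self.values = [0.0] * count
        self.nonzero_flags = 0  # bit i is set when register i is non-zero

    def write(self, index, value):
        self.values[index] = value
        # The zero compare happens once, at write time.
        if value != 0:
            self.nonzero_flags |= 1 << index
        else:
            self.nonzero_flags &= ~(1 << index)

    def predicate(self, index):
        # One bit test instead of a 32-bit compare at issue time.
        return bool((self.nonzero_flags >> index) & 1)
```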

Then you have the equivalent of 2 pipelines, with one dedicated to FMUL so
that you can try to use it every cycle. You could do out-of-order
completion, depending on the length of each unit's pipeline. That's not a
problem if you use a scoreboard.
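
A simplified scoreboard sketch showing in-order issue with out-of-order
completion. It issues one operation per cycle rather than the two-operation
word described above, and the latencies are made-up values; both are
assumptions to keep the model small.

```python
LATENCY = {"FMUL": 4, "FADD": 2, "MOV": 1}  # hypothetical pipeline depths


def schedule(program):
    """In-order issue, out-of-order completion.

    program: list of (op, dst, src_a, src_b) register tuples.
    Returns a list of (op, issue_cycle, done_cycle).
    """
    ready_at = {}  # scoreboard: register -> cycle its pending write lands
    cycle = 0
    trace = []
    for op, dst, src_a, src_b in program:
        # Stall while a source (RAW) or the destination (WAW) is pending.
        hazards = [ready_at.get(reg, 0) for reg in (dst, src_a, src_b)]
        issue = max([cycle] + hazards)
        done = issue + LATENCY[op]
        ready_at[dst] = done
        trace.append((op, issue, done))
        cycle = issue + 1  # at most one issue per cycle in this model
    return trace


trace = schedule([
    ("FMUL", 1, 2, 3),  # long latency
    ("MOV", 4, 5, 5),   # independent, finishes before the FMUL
    ("FADD", 6, 1, 4),  # waits for the FMUL result in r1
])
```

In this run the MOV issues after the FMUL but completes first, and the
FADD is held back until r1 is ready.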

You do not need a reorder buffer (ROB), as in current x86, to handle
instruction faults correctly (in case of a fault, the faulting instruction
could come before some instructions that have already retired; that would
make a general-purpose CPU impossible to debug, but I don't think a GPU
needs that).

Nicolas Boulay

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
