> On 4/20/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
>> I don't think it's wise to use the SIMD ALU here. All scalar code will
>> use the SIMD FPU with 3 FMUL units idle. Because everything is strongly
>> parallel, I think it's better to stay scalar.
>
> If there are enough independent scalars that can be scheduled, you can
> pack them and run them in parallel.
>
So you need logic to detect that a pack is possible, and you need a
switch that connects the different register banks to the FPU. For what?
The only advantage over 4 cores depends on the size of the control
logic: it depends on whether that logic is negligible compared with the
size of a 32-bit FPU. On the other hand, a 32-bit switch could be big.

The goal of the shader is to maximise the use of the FPU, or more
precisely of the FMUL instruction. So you could create an instruction
word which looks like this:

- Operation code 1 + address of read register 1 + address of read
  register 2 + address of write register 1 (these are for MOV,
  LOAD & STORE and maybe for logical ops such as "<" and ">", so it
  could also do FADD and FSUB)
- Operation code 2 + address of read register 4 + address of read
  register 5 + address of write register 2

OPCODE2 could be small (FMUL, integer MUL, what else?).

There are 2 register banks. So you could use memories with 4 read ports
and 1 write port for the register banks. Each read could access both
banks, but a write could only access its dedicated bank. It depends on
the technology you can afford (full custom or not... memories with
4 read and 2 write ports are maybe common nowadays).

Then you add:

- Predicate
  That's a very easy way to make small "if" statements without breaking
  the pipeline (like CMOVE in x86). A predicate is an access to a
  register that says whether this register is null or not. If the
  register is null, the current operations are cancelled.
- Predicate + Imm8
  That's the way to handle loops, jumps and the repeat instruction of
  some DSPs. If the register is not null, PC+IMM8 is performed with a
  delay slot of 1; otherwise PC+1 is used.

The instruction word is big :) I have read somewhere that 32 vec4
registers are needed. So you must have at least 128 32-bit registers.
If you add some tricks such as R0 == 0, and some special registers, a
register address will need 8 bits. Depending on how you encode the
opcodes, you will reach 80 bits for an instruction word. The "jump part"
could manage the PC "directly" with a delay slot.
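As a rough check of the 80-bit figure, here is a minimal Python sketch
of one possible field layout. Only the 8-bit register addresses and the
8-bit immediate come from the description above. The opcode widths
(5 and 3 bits), the field names, the field order and the pack/unpack
helpers are my assumptions for illustration, not a settled encoding.

```python
# One possible layout of the instruction word, most significant field first.
# Assumptions (not from the proposal): opcode widths, field names, field order.
FIELDS = [
    ("opcode1", 5),    # assumed width: MOV, LOAD/STORE, compares, FADD, FSUB
    ("read1", 8),      # address of read register 1
    ("read2", 8),      # address of read register 2
    ("write1", 8),     # address of write register 1
    ("opcode2", 3),    # assumed width: FMUL, integer MUL, ...
    ("read4", 8),      # address of read register 4
    ("read5", 8),      # address of read register 5
    ("write2", 8),     # address of write register 2
    ("pred_exec", 8),  # register tested to cancel the current operations
    ("pred_jump", 8),  # register tested for the PC+IMM8 jump
    ("imm8", 8),       # jump offset
]

# Total width of the instruction word in bits.
WORD_BITS = sum(width for _, width in FIELDS)


def pack(values):
    """Pack a dict of field values into one integer instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split an integer instruction word back into its fields."""
    values = {}
    # Walk the fields from least significant to most significant.
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

With these assumed widths the fields add up to 80 bits: 6 register
addresses (48 bits), 2 predicate addresses (16 bits), the immediate
(8 bits) and 8 bits of opcode in total.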
The predicate could enable or disable the instruction (you could use a
128-bit register, one flag bit per register, to avoid the need to
compare each register with zero).

Then you have the equivalent of 2 pipelines, with one dedicated to FMUL,
to try to use it on every cycle. You could do out-of-order completion
depending on the length of the pipeline of each unit. That's not a
problem if you use a scoreboard. You do not need a reorder buffer (ROB)
as in current x86 to handle instruction faults correctly (in case of a
fault, the faulty instruction could come before some already retired
instructions; that would make a general-purpose CPU impossible to debug,
but I don't think a GPU needs that).

Nicolas Boulay
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
