"They *don't* run in parallel (as much as they appear), instead when each thread stalls (say, to do a memory lookup), another thread will run. It's just pipelined."
This scheme has the problem of removing access locality: data access patterns will suffer under such a system. I don't know how fast a CPU could be with a lot of registers, a lot of load/store addressing modes to avoid data realignment, and a lot of basic types to easily decode pixel packing; the main goal would be to keep the floating-point multiplication unit busy on every cycle.

Maybe the next step will be a CPU that has matrices (2*2, 3*3, 4*4), vectors (2, 3, 4, n) and diagonals (2 for complex, 4 for quaternion) as basic types, as a way to fully use many ALUs with a single program.

Regards,
Nicolas

2011/8/22 Luke Kenneth Casson Leighton <[email protected]>:
> this from nick on the llvm list
>
> ---------- Forwarded message ----------
> From: Nick Lewycky <[email protected]>
> Date: Mon, Aug 22, 2011 at 4:42 AM
> Subject: Re: [LLVMdev] Xilinx zynq-7000 (7030) as a Gallium3D LLVM FPGA target
> To: Luke Kenneth Casson Leighton <[email protected]>
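
The switch-on-stall scheduling described in the quoted passage at the top of this message can be illustrated with a toy simulation. This is only a rough sketch under stated assumptions, not a model of any real GPU or of the Zynq: the function name `run`, the op encoding ('c' for a one-cycle compute op, 'm' for a memory lookup) and the fixed `mem_latency` of 4 cycles are all made up for illustration. It also does not model caches at all, so it shows the latency-hiding benefit but not the loss of access locality discussed above.

```python
def run(threads, mem_latency=4):
    """Simulate one issue slot shared by several hardware threads.

    threads: list of strings; 'c' = compute op, 'm' = memory lookup (hypothetical encoding).
    Returns (cycles, trace), where trace[n] is the thread that issued in cycle n, or None.
    """
    n = len(threads)
    pc = [0] * n          # index of the next op for each thread
    ready_at = [0] * n    # first cycle in which each thread may issue again
    trace = []
    cycle = 0
    last = 0              # thread that issued most recently
    while any(pc[i] < len(threads[i]) for i in range(n)):
        issued = None
        # Keep running the current thread while it is ready;
        # only move on to another thread when it stalls or finishes.
        for k in range(n):
            i = (last + k) % n
            if pc[i] < len(threads[i]) and ready_at[i] <= cycle:
                op = threads[i][pc[i]]
                pc[i] += 1
                if op == "m":
                    # The thread cannot issue again until the lookup returns.
                    ready_at[i] = cycle + 1 + mem_latency
                issued = i
                last = i
                break
        trace.append(issued)  # None means the issue slot sat idle this cycle
        cycle += 1
    return cycle, trace


if __name__ == "__main__":
    one, _ = run(["mcmc"])
    four, trace = run(["mcmc"] * 4)
    print("1 thread :", one, "cycles for 4 ops")
    print("4 threads:", four, "cycles for 16 ops,", trace.count(None), "idle")
```

With one thread the issue slot sits idle during every memory lookup; with four threads the same slot is almost always busy, even though no two threads ever issue in the same cycle. That is the sense in which the threads are pipelined rather than parallel.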
