James Richard Tyrer wrote:
> André Pouliot wrote:
>> If we do the same with a fixed pipeline, and we suppose we do the
>> same 100 operations but unrolled and we run at 100 MHz, we have the
>> same requirement for the multipliers: 20 stages of 4 multipliers per
>> stage (RGBA), so that's 80 multipliers. The difference now is that
>> while a pixel is doing 1 operation, the other 99 stages also have a
>> pixel in them, so that corresponds to 1 pixel/cycle being processed.
>> At 100 MHz, that translates at 60 fps to approximately 1.6 million
>> pixels per frame. That number of pixels is a screen bigger than
>> 1280*1024. So you can do 1280*1024 at 60 fps.
>>
>> I hope you see why the fixed function was preferred.
>
> No, I don't. You have asserted that with more hardware you will get
> more throughput. I cannot argue with that. But the question is how
> to get the best performance with a given amount of hardware (as well
> as the related question of how much hardware we will have available).
> I understand the issues involved, but I am not certain of the answer.
> I know that systolic arrays are faster, but IIUC, they also take more
> hardware/throughput.

It is throwing hardware at the problem, and it works. The evaluation
made in the past was for using the old approach (fixed pipeline) of the
3D video cards while adopting some of the new features (floating point)
for graphics. The first graphics cards used a fixed pipeline because it
was smaller and gave sufficient performance for most jobs. Actually,
for a given level of performance, a processor requires more hardware.

>> The approach with the CPU does give a lot of flexibility.
>
> I am not talking about using a CPU at all. I am talking about using
> 4-word vector processors optimized for graphics to be controlled by
> microcode.

I used the wrong terminology there; I meant a processor, not a CPU.
The problem remains the same: even if you use microcode, you cannot get
near 1 operation per cycle for a processor in an FPGA and do it fast.
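To make the figures above concrete, here is a quick arithmetic check.
This is only a sketch: the pipeline numbers (100 MHz, 60 fps, 20 stages
of 4 multipliers) are the ones quoted above, while the 3-cycles-per-
instruction cost for a soft processor is an assumed number, used purely
for illustration.

```python
# Fixed pipeline: once the 100-stage pipeline is full, it retires
# one pixel per clock, regardless of its depth.
clock_hz = 100_000_000                 # 100 MHz, as quoted above
fps = 60
pixels_per_frame = clock_hz // fps
print(pixels_per_frame)                # 1666666 pixel budget per frame
assert 1280 * 1024 < pixels_per_frame  # so 1280x1024 fits at 60 fps

# Multiplier cost of the unrolled pipeline: 20 stages with
# 4 multipliers each (one per RGBA channel).
print(20 * 4)                          # 80 multipliers

# Soft processor in an FPGA: each vector MAC also pays for
# fetch/decode/writeback. 3 cycles per instruction is an assumed
# figure, for illustration only.
macs_per_pixel = 4                     # 4 row dot products in a 4x4 transform
cycles_per_instr = 3                   # assumption
print(macs_per_pixel * cycles_per_instr)  # 12 clocks per pixel, vs 1
```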
It is either fast but multicycle, or 1 cycle but slow.

>> But gaining that flexibility, you lose in performance and gain in
>> complexity on the hardware side. The fixed function pipeline does
>> seem bigger at first glance.
>
> It doesn't just seem bigger, it is bigger. To do more multiplies at
> the same time, you will need more multiplier arrays.

Yes, it is bigger than a single processor that does the same thing, but
the performance level is not the same either.

>> But it's much more straightforward to design than a CPU.
>
> The RTL for the Sun graphics processor is free for the download! But,
> if designing it, I see no difference. You have 4 MAC units and a
> sequencer to tell it what to do. For more versatility, the microcode
> can be in RAM.

The RTL for the Sun SPARC processor is free. But put the logic used in
the Niagara 1 chip into an FPGA and it is too big to fit in the 3s4000,
even for only 1 core.
http://fpga.sunsource.net/index.html

>> Also, you need a lot more processors to get near the performance of
>> a dedicated pipeline, so the hardware requirement for equal
>> performance is at a disadvantage of at least 2 to 10 times for a
>> processor.
>
> I am a hardware person. I fully understand that a systolic processor
> system is faster. I also understand how much hardware is needed.
> With sufficient hardware, you can do my sample problem at the rate of
> one output per clock. HOWEVER, it will require 9 hardware multipliers
> and 6 adders vs. only 3 of each for the vector processor. To do a
> 4-vector * 4x4 transform matrix (which is required for RGBA pixels),
> it will require 16 hardware multipliers and 12 adders.
>
> Are we really going to have that kind of hardware available?

No, we don't have that kind of hardware to spare if we do it fully
parallel and try to do it in one cycle. But we could easily do it using
something like a pipeline. For your sample problem, we receive more
data than we generate, so if we receive 4 new data words (RGBA) per
clock (continuously) and we synchronize them right, we still output
4 results every 3 clocks; we have a quiet period of 2 clocks while we
are processing data. The hardware requirement would be 4 multipliers,
4 adders, and some small control logic, and it would take fewer
resources than a processor doing the same job. The big requirement
would be synchronizing the data, and that part is rather easy. If you
insist on processing one datum per clock, we just put 3 blocks in
parallel, and it would take 12 multipliers and 12 adders: less than
4 SIMD processors.

> Next question: if we have that type of hardware available, is this
> the best way to deploy it? With the same 16 hardware multipliers and
> a few more adders, we could have 4 4-word SIMD vector processors.
> IIUC, if we could then do 4 operations in parallel, we would get the
> same throughput.
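The time-multiplexed scheme described above can be sketched in
software. This is a model of the idea, not RTL: one MAC (a multiplier
plus an adder) per output component, each accumulating one row dot
product over n clocks, so an n x n transform needs only n multipliers
and n adders instead of the n*n multipliers of the fully parallel
one-cycle version.

```python
def transform(matrix, vec):
    """Model of a time-multiplexed transform unit: len(vec) MAC
    units run side by side (the inner loop), and each input vector
    takes len(vec) clocks (the outer loop)."""
    n = len(vec)
    acc = [0.0] * n              # one accumulator register per MAC
    for clk in range(n):         # n clocks per input vector
        for mac in range(n):     # the n MACs operate concurrently in hardware
            acc[mac] += matrix[mac][clk] * vec[clk]
    return acc

# 3x3 sample problem: 3 multipliers + 3 adders here, versus
# 9 multipliers + 6 adders for the fully parallel one-cycle version.
m = [[1, 0, 0],
     [0, 2, 0],
     [0, 0, 3]]
print(transform(m, [1.0, 1.0, 1.0]))   # [1.0, 2.0, 3.0]
```

Putting 3 such blocks side by side, as described above, gives one
result per clock at 3 times the multiplier and adder cost.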
> Either way, it will saturate those 16 multipliers to do 4x4
> transforms, so all of the rest of your other 99 stages is going to
> have to be additional hardware.

It would still be less efficient with vector processors, unless you
assume they don't need to fetch or store the data they process. If you
insist on processing one datum per clock, we just put 3 blocks in
parallel, as I described before, and it would take 12 multipliers and
12 adders: less than 4 SIMD processors.

>> It's a problem the hardware always faces: the balance between
>> flexibility and performance.
>
> Yes, there is always a balance between flexibility and performance.
> However, to get the performance you are talking about, it looks to me
> like we are going to need a massive number of hardware multiplier
> arrays and the exponent adders and control to go with them. We then
> need to ask, if we have that much hardware available, is it better to
> have it in a fixed-purpose systolic processor array, or in long-word
> FPUs that can do MACs in parallel? ATI & nVidia have decided against
> fixed-purpose processor arrays.

ATI actually uses stream processors in their architecture, if I
remember right, and nVidia uses small scalar processors. Both need to
run a lot of cores on an ASIC, at high frequency, and there is still a
diminishing performance gain for the total of the implemented hardware.
Also, the power consumption is going through the roof. They didn't so
much decide as have the need for programmable shaders imposed on them
by DirectX. Maybe they would have decided eventually to go that way
even if not forced by the requirements of DirectX, but it's a moot
point, since it didn't happen.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
