On Mon, Sep 14, 2009 at 2:53 PM, Nicolas Boulay <[email protected]> wrote:

> I have 2 main comments on the document.
>
> The shaders should not be SIMD; most shaders do not use vector
> operations, so on scalar code 3 out of 4 ALUs will not be used. I
> think NVIDIA is going from 2-way SIMD to scalar shaders.
It's not. One of the original criticisms we all had of our own document
(well, André's and Kenneth's) is that the diagram is misleading. It's
MIMD, with the ability to keep track of when sibling threads are or are
not running the same instruction sequence. A simple case is where a
shader can run two sets of four threads. It starts out that way until
some thread hits a conditional that causes its execution flow to
diverge. This is trivial to detect, and we'd just split it off so that
the shader was running one set of four, one set of three, and one set
of one (and so forth) in round-robin fashion.

> From a previous study on shader code posted here, the shader looks
> like a CPU that should do an fmul per cycle, so a LIW structure looks
> best. The load/store unit is a complex one, so it can handle all the
> types of data without any other CPU instructions (2D structure access,
> textures encoded in different color schemes such as YUV or RGBA with
> different bit depths per channel). An NVIDIA paper mentioned 18
> different datatypes!
>
> The second comment is about the ease of shader programming. Some game
> developers have said that GPU programming is far more complex than
> their budget allows. They prefer to sacrifice performance for ease of
> programming. Larrabee looks to meet this goal. But using 512-bit
> vectors is not so easy either!

We need to decide if we want to go the (current) traditional GPU route
or do something innovative. Until we come up with something innovative
enough, we'll default to the traditional. This isn't a solution, but
some people seem to suggest it helps: using LLVM as an intermediate
language seems to ease the translation.

Something innovative would be to find a way to partition a shader
program into requestor and consumer processes. An optimization
available to fixed-function pipelines is that read requests can be
issued into a queue (to the memory system), with the results flowing
through another queue (to the rest of the pipeline).
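To make that split concrete, here's a toy Python sketch of the idea
(everything here is invented for illustration: the flat `MEMORY` array,
the thread-per-stage structure, and unbounded Python queues stand in
for what would really be fixed-depth hardware FIFOs):

```python
# Toy sketch of a requestor/consumer (decoupled access/execute) split.
from queue import Queue
from threading import Thread

MEMORY = list(range(100))  # stand-in for DRAM


def requestor(addresses, request_q):
    """Knows only how to compute addresses; issues reads ahead of use."""
    for addr in addresses:
        request_q.put(addr)
    request_q.put(None)  # end-of-stream marker


def memory_system(request_q, data_q):
    """Services queued reads in order, so accesses can be kept
    nicely ordered (e.g. batched to stay on one memory row)."""
    while (addr := request_q.get()) is not None:
        data_q.put(MEMORY[addr])
    data_q.put(None)


def consumer(data_q, results):
    """Knows only how to process data once it arrives; it never
    computes an address, so it never stalls on one."""
    while (item := data_q.get()) is not None:
        results.append(item * 2)  # stand-in for shader arithmetic


def run(addresses):
    request_q, data_q, results = Queue(), Queue(), []
    stages = [Thread(target=requestor, args=(addresses, request_q)),
              Thread(target=memory_system, args=(request_q, data_q)),
              Thread(target=consumer, args=(data_q, results))]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```

Note that the requestor never touches the data queue. The moment a read
result is needed to compute the next address, an edge runs backward from
consumer to requestor and the clean two-queue structure breaks down.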
The requestor knows enough to know what address to request, and the
consumer knows enough to process the data when it arrives. The
advantage is that you get perfect streaming and high throughput.

In shader programs, you have read instructions that always stall. The
usual solution is to have thousands of threads. The problem is that
memory accesses won't stream in ascending order, or even necessarily
stay on the same memory row before moving on to another (and
precharge+activate is expensive). If we could automatically divide the
dataflow graph of a shader program between requestor and consumer, we
could absorb the read latency and keep accesses nicely ordered.

The problem is program flows that depend on data. Those aren't that
common (in graphics, although they are in supercomputing). If it's a
simple if/then, we can use speculative execution. But if some piece of
read data is used as an address (for instance), that creates a cycle in
the dataflow graph, which is hard to deal with in some cases.

> On SOC systems, it's common to use "accelerators" for different tasks
> (offload JPEG compressor, crypto DMA, ...). But most of the time
> these units are not flexible at all, and it could be faster to do the
> job on the CPU to avoid the communication!
>
> I prefer the idea of including specific instructions. It's far easier
> for the coder to use: communication is free, context switching is
> included. And these specific instructions, if they take one cycle,
> are always faster than any other instruction combination. Larrabee
> uses a specific instruction for classical rasterisation.

I read a paper on Larrabee rasterization where they discussed the
challenges posed by the fact that they did NOT have special
instructions just for rasterization.

> In that case, the software should be written at the same time as the
> hardware, to find the easiest way to achieve usable maximum
> performance. The usual shaders are not the easiest path!
That's the beauty of prototyping in an FPGA. :)

> Regards,
> Nicolas
>
> 2009/9/14 Andre Pouliot <[email protected]>:
>> Hi everyone,
>>
>> The current work on OGA1 and OGD1 is still progressing. Not as fast
>> as we expected at first, but still going forward.
>>
>> With the experience we acquired working on OGA1, a small group of
>> developers started a specification for the next architecture revision
>> of the Open Graphics Project: the Open Graphics Architecture 2 (OGA2).
>>
>> The new revision will have programmable shaders that can also be used
>> for GPGPU. The current working document can be seen here:
>> http://docs.google.com/View?id=dfsp4qpd_41dtrrskfb
>>
>> As everyone can see, it's a work in progress. We welcome everyone to
>> read the specification and to discuss, criticize, and bring
>> suggestions on OGML.
>>
>> If you have any question or comment on OGA2, we'll be happy to answer
>> as best we can.
>>
>> André Pouliot

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
