On Mon, Sep 14, 2009 at 2:53 PM, Nicolas Boulay
<[email protected]> wrote:
> I have 2 main comments on the document.
>
> The shader should not be SIMD; most shaders do not use vector
> operations, so on scalar code 3 out of 4 ALUs will not be used. I think
> that NVIDIA is going from 2-way SIMD to scalar shaders.

It's not.  One of the original criticisms we all had of our own
document (well, André's and Kenneth's) is that the diagram is
misleading.  It's MIMD with the ability to keep track of when sibling
threads are or are not running the same instruction sequence.  A
simple case is where a shader can run two sets of four threads.  It
starts out that way until some thread hits a conditional that causes
its execution flow to diverge.  This is trivial to detect, and we'd
just split it off so that the shader was running one set of four, one
set of three, and one set of one (and so forth) in round-robin
fashion.
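To make the split concrete, here is a toy Python model of that idea (not
part of the original mail; all names are hypothetical). Threads that share
an instruction sequence run as one group; a divergent branch splits the
group, and the pieces are scheduled round-robin:

```python
from collections import deque

def run_divergent(threads, branches):
    """Toy model of MIMD with convergence tracking: threads sharing an
    instruction sequence run as one group; when a branch diverges, the
    group splits, and the pieces are scheduled round-robin."""
    groups = deque([frozenset(threads)])      # start fully converged
    for branch in branches:                   # one potentially divergent branch
        group = groups.popleft()              # next group in round-robin order
        taken = frozenset(t for t in group if branch(t))
        for piece in (taken, group - taken):  # split on the branch outcome
            if piece:
                groups.append(piece)
    return [sorted(g) for g in groups]

# Four threads; thread 3 diverges at a conditional, leaving a set of
# three and a set of one, as described above.
print(run_divergent([0, 1, 2, 3], [lambda t: t == 3]))
# → [[3], [0, 1, 2]]
```

Detecting the split is just evaluating the branch per thread; the
interesting hardware cost is tracking when groups re-converge, which this
sketch deliberately leaves out.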

>
> From a previous study of shader code posted here, the shader looks
> like a CPU that should do an fmul per cycle, so an LIW structure looks
> best. The load/store unit is a complex one: it has to handle all the
> types of data without any other CPU instructions (2D structure access,
> textures encoded in different color schemes such as YUV or RGBA with
> different bits per channel). An NVIDIA paper spoke of 18 different data types!
>
> The second comment is about the ease of shader programming.
> Some game developers have said that GPU programming is far more complex
> than their budgets allow. They prefer to sacrifice performance for
> ease of programming. Larrabee looks to meet this goal. But using
> 512-bit vectors is not so easy either!

We need to decide if we want to go the (current) traditional GPU route
or something innovative.  Until we come up with something innovative
enough, we'll default to the traditional.  This isn't a solution in
itself, but some people suggest that using LLVM as an intermediate
language eases the translation.

Something innovative would be to find a way to partition a shader
program into requestor and consumer processes.  An optimization
available for fixed-function pipelines is that read requests can be
issued into a queue (to the memory system) and another queue (to the
rest of the pipeline).  The requestor knows enough to compute the
next address, and the receiver knows enough to process the data
when it arrives.  The advantage is that you get perfect streaming and
high throughput.  In shader programs, you have read instructions that
always stall.  The solution is to have thousands of threads.  The
problem is that memory accesses won't stream in ascending order or
even necessarily stay on the same memory row before moving on to
another memory row (and precharge+activate is expensive).  If we could
automatically divide the dataflow graph in a shader program between
requestor and consumer, we could absorb the read latency and keep
accesses nicely ordered.  The problem is program flows that depend on
data.  Those aren't that common (in graphics, although they are in
supercomputing).  If it's a simple if/then, then we can use
speculative execution.  But if some piece of read data is used as an
address (for instance), then that creates a cycle in the dataflow
graph, which is hard to deal with in some cases.
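The payoff of the requestor/consumer split can be sketched in a few lines
of Python (again not from the original mail; the memory model and all
names are invented for illustration). The requestor half re-orders the
request stream so the memory system sees far fewer row switches, while
the consumer half just processes data as it arrives:

```python
from collections import deque

ROW_SIZE = 4  # hypothetical DRAM row of 4 words

def row_switches(addresses):
    """Count precharge+activate events (row changes) in an access stream."""
    switches, current = 0, None
    for a in addresses:
        row = a // ROW_SIZE
        if row != current:
            switches += 1
            current = row
    return switches

def requestor(addrs):
    """Requestor half: knows only which addresses to fetch.
    Issues them in ascending order so the memory system streams."""
    for a in sorted(addrs):
        yield a

def consumer(data_queue, combine):
    """Consumer half: knows only how to process data once it arrives."""
    result = 0
    while data_queue:
        result = combine(result, data_queue.popleft())
    return result

# Per-thread demand order is scattered across rows...
demand = [0, 5, 1, 9, 2, 13, 3, 6]
# ...but the decoupled requestor re-orders the request stream:
issued = list(requestor(demand))
print(row_switches(demand), row_switches(issued))  # → 8 4

memory = {a: a * a for a in demand}        # fake DRAM contents
arrived = deque(memory[a] for a in issued)
total = consumer(arrived, lambda acc, x: acc + x)
```

Note the cheat that makes this work: the combining step here is
order-independent (a sum), so the consumer can accept data in issue order.
A data-dependent address (the cycle in the dataflow graph mentioned above)
would force the requestor to wait on the consumer, which is exactly the
hard case.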

> On SoC systems, it's common to use "accelerators" for different tasks
> (offload JPEG compressors, crypto DMA, ...). But most of the time these
> units are not flexible at all, and it can be faster to do the job on
> the CPU to avoid the communication!
>
> I prefer the idea of including specific instructions. It's far easier
> for the coder to use: communication is free, and context switching is
> included. And these specific instructions, if they take one cycle, are
> always faster than any other instruction combination.

> Larrabee uses a
> specific instruction for classical rasterisation.

I read a paper on Larrabee rasterization where they discussed the
challenges arising from the fact that they did NOT have special
instructions just for rasterization.

> In such cases, the software should be written at the same time as the
> hardware to find the easiest way to achieve usable maximum
> performance. The usual shaders are not the easiest path!

That's the beauty of prototyping in an FPGA.  :)

>
> Regards,
> Nicolas
>
> 2009/9/14 Andre Pouliot <[email protected]>:
>> Hi everyone,
>>
>> The current work on OGA1 and OGD1 is still progressing. Not as fast as we
>> expected at first, but still going forward.
>>
>> With the experience we acquired working on OGA1, a small group of developers
>> started a specification for the next architecture revision of the Open
>> Graphics Project: the Open Graphics Architecture 2 (OGA2).
>>
>> That new revision will have programmable shaders that can also be used for
>> GPGPU. The current working document can be seen here:
>> http://docs.google.com/View?id=dfsp4qpd_41dtrrskfb
>>
>> As everyone can see, it's a work in progress. We welcome everyone to read
>> the specification and discuss, criticize and bring suggestions on OGML.
>>
>> If you have any questions or comments on OGA2, we'll be happy to answer as
>> best we can.
>>
>> André Pouliot
>>
>> _______________________________________________
>> Open-graphics mailing list
>> [email protected]
>> http://lists.duskglow.com/mailman/listinfo/open-graphics
>> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>>



-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project