> On 4/20/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>
>> > On 4/20/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >
>> >> I don't think it's wise to use the SIMD ALU here. All scalar code will
>> >> use the SIMD FPU with 3 FMUL units idle. Because everything is strongly
>> >> parallel, I think it's better to stay scalar.
>> >
>> > If there are enough independent scalars that can be scheduled, you can
>> > pack them and run them in parallel.
>>
>> So you need the logic to detect that a pack is possible, and you need the
>> switch that permits connecting the different register banks to the FPUs.
>
> No. The compiler can optimize this stuff.
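To make the disagreement concrete: the claim is that a compiler can pack four independent scalar multiplies into one SIMD operation. A minimal sketch with x86 SSE intrinsics (an assumption for illustration; the shader ISA under discussion is different) shows both the win and the pack cost:

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* One SIMD FMUL replaces four independent scalar FMULs -- but only
 * after the compiler has proved the four multiplies are independent,
 * and only after paying for the loads/shuffles that gather the
 * operands into one 128-bit register (the "pack" cost above). */
static void mul4_packed(const float a[4], const float b[4], float r[4])
{
    __m128 va = _mm_loadu_ps(a);          /* pack: 4 floats in one load */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_mul_ps(va, vb)); /* one vector multiply */
}
```

If the four scalars do not already sit contiguously in memory or in one register, the gather/scatter shuffles around the single `_mm_mul_ps` can eat the gain, which is the point being argued above.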
So you want to pack and unpack SIMD registers. That's a cutting-edge technique that very few use in normal computing (with SSE2, AltiVec, ...), and compilers are quite bad at it. And if you optimise the pack/unpack instructions in hardware, that means switches that are big and slow. FMUL is a one-cycle operation; if you need 3 pack instructions before it, there is no point.

> Have a look at the instruction set you came up with; 19 out of 27 ops are
> vector ops. It is going to be far more important to optimize vector ops
> than to ensure full utilisation of the silicon at all times.

Look at the real code posted here! Most operations are purely scalar.

> If you can do a dot product in 10 instructions (8 load, 1 vmul, 1 store)
> then that is a big gain over 17 instructions (8 load, 4 smul, 3 add, 1
> store). If you have a wide memory bus and can fetch four floats in one op,
> so much the better; now we are down to 4 instructions instead of 17.
> Vector operations are very likely to dominate a shader (any 3D processing
> for that matter) therefore the whole architecture should have the goal of
> optimising vector operations.

If I have understood shaders correctly, loads are only for textures; everything else is transmitted through specific registers, so those loads are implicit. Basically, a DOT product takes one cycle on a vector architecture and 4 on a scalar LIW architecture (as André Pouliot and I explained), provided you can interleave the instructions correctly (7 instructions of latency otherwise, with a 3-cycle-latency FPU). So for DOT it's roughly the same. And because a scalar CPU will be almost 4 times smaller than a vector shader, you could put 4 scalar cores where you put 1 vector core.

When you look at the compiled code posted here, you see a lot of MOV, scalar MUL, etc. In this precise case, the SIMD unit is completely underused. I don't have access here to the ASM that was published, but this is the kind of code to optimise.
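For reference, the "4 smul, 3 add" scalar count being traded against one vmul corresponds to code of this shape (a plain-C sketch, not the actual shader code posted on the list):

```c
/* Scalar 4-element dot product: 4 MULs + 3 ADDs, the "17 instruction"
 * case once the 8 loads and 1 store are counted around it. On a
 * 3-cycle-latency FPU the three adds form a dependency chain; a
 * scalar LIW core hides that latency only if it can interleave
 * other independent dot products in between -- which is where the
 * "4 cycles if interleaved, 7 otherwise" figures above come from. */
static float dot4_scalar(const float a[4], const float b[4])
{
    return a[0] * b[0]
         + a[1] * b[1]
         + a[2] * b[2]
         + a[3] * b[3];
}
```

A vector unit collapses this to a single DOT/vmul op per result; the argument above is that on real shader code, with its many scalar MOVs and MULs, that wide unit then sits mostly idle.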
I know that if only vector code is used, a vector shader will be faster (because there are fewer read-after-write dependency problems than on a scalar CPU), but shaders use a lot of scalar code.

> Tom
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
