On Wed, Oct 28, 2009 at 4:31 PM, Petter Urkedal <[email protected]> wrote:
> On 2009-10-27, Yann Guidon wrote:
>> Mojn,
>>
>> Petter Urkedal wrote:
>> > 4.  I like the idea of adding minimum and maximum functions in the
>> > instruction set; if we need them, that is.  They should not require much
>> > logic.  But in this case, note that there is a difference between signed
>> > and unsigned.  Do we want both?  On the other hand it only takes 2 to 3
>> > cycles to compute any of these if my idea of the instruction set is
>> > correct.
>>
>> Contrary to Timothy's opinion, I think that
>> UMIN/SMIN and UMAX/SMAX are worth it if you have enough bits in
>> the opcode. They spare conditional jumps and similar control stuffs,
>> and the logic is quite simple : take the carry out of the add/sub
>> unit, XOR with one bit of the opcode (which discriminates
>> MAX and MIN) and send the result to the "write enable" signal
>> of the register set's write port.
>
> Reusing the add/sub unit this way seems like a good idea.
>
> I made a sketch of how to implement the integer part of the ALU by
> combining unary modifiers on the inputs with a binary function on the
> following stages:
>
>    ======  ======  ======  ==================
>    bop     moda    modb    final instructions
>    ======  ======  ======  ==================
>    and     not_x   not_y   and, andn, nor
>    nand    not_x   not_y   nand, orn, or
>    xor     [?]     not_y   xorn
>    add     [?]     neg_x   sub [1]
>    shift   s/u     neg_x   lsl, lsr, asl, asr [1]
>    mul     [?]     [?]     mul
>    min     s/u     not_xy  smin, smax, umin, umax [2]

Xilinx has an "addsub" macro that seems to be more efficient than
passing an inverted operand in with a carry-in.  Of course, getting
more flexibility is usually better.  But keep in mind that doing
things the MIPS way simplifies the addsub by not requiring any
carry-in.

Actually, let me show you how I solved the problem of carry-in before.
 Let's say you have this:

wire carry_in;
wire [7:0] a, b;
wire [8:0] sum;

And I want to sum a and b with the carry in.  Xilinx produces a crap
result if I do this:

assign sum = a+b+carry_in;

It works, but the delay is WAY longer than what you'd get if you
didn't include the carry-in.  The way I solved it was to decide
between add and sub based on the carry and 1's compliment of b also
based on the carry.  That is, a+b+1 == a-(~b), so:

addsub9 as1 (.A({0'b0, a}), .B({0'b0, b} ^ {9{carry_in}}), .sum(c),
.subtract(carry_in ^ do_subtract));

Of course, this is only important if we have condition codes, and I
don't want to have condition codes.

>
> where
>
>    bop -- basic binary operation
>    moda, modb -- modifier bits
>    s/u -- selects between signed and unsigned
>    not_x -- bitwise negation of the x operand
>    not_y -- bitwise negation of the y operand
>    not_xy -- bitwise negation of both operands
>    neg_x -- inversion of the x operand
>
> I've left out div and the upper bits of multiply for now.  We can split
> up {shift, min} into {sshift, ushift, smin, umin} so that moda is only
> used for a single purpose.  I opted for compressing the instruction word
> over simplifying logic.
>
> [1] The "neg" modifiers are in fact implemented as "not" at this stage,
> with one flag passed down to the next stage.  For "add" the flag
> connects to the carry bit.  For "shift" I'm sure we can find another
> simple solution, esp since the effective width is only 5 bits plus sign.
>
> [2] The "not" modifiers only applies to the inputs to the add/sub unit.
> The actual write-back is taken from the original operands.
>
> [?] It's not obvious what are the most useful functions here, but we can
> get "not" for free if we don't have a strong preference.
>
> The s/u modifier is just passed down to the binary unit.  For shift,
> it's used for filling the upper bits of right shifts.  For min/max:
>
>> I don't remember exactly but you can perform signed/unsigned
>> integer MIN/MAX by XORing the MSB of the operands.
>
> Let's see.  Assume we have the unsigned order.  The signed order is the
> same if the operands are both non-negative or both negative.  If only
> one is negative then the negative operand is the larger, according to
> unsigned comparison.  So yes, the signed order is the 3-port XOR of the
> signs of the operands and the unsigned order.


FYI:

max(x,y) = (x + y + abs(x-y))*0.5
min(x,y) = (x + y - abs(x-y))*0.5

Mind you, for us, it would be WAY worse to do this than to use
branches, although this might be useful on something like an x86 doing
SSE, since it can be done entirely without branches.

Honestly, while I see the value in this, the fact is, it saves us
three instructions and 2 or 3 cycles (depending on which is smaller).
This is one of those things where it's OBVIOUS how to do it with other
instructions, which to me means we can table this until we get some
real statistics.  We might add min/max macros to the bytecode level
(assuming LLVM supports it).  For now, it would get expanded.  Later,
if we add min/max instructions, they would get magically converted to
the right opcodes.

>
>> This is pointless for FP. However if the critical datapath
>> must be balanced, I would put the XORs at the same place,
>> that is just before the ADD/SUB unit, instead of at the end
>> (XOR of both operands instead of one result).
>
> Maybe starting with the integer ordering for FP isn't that too far off.
> Looking at IEEE 754,
>
>  * +0, normalised and subnormal positive floats, and Inf are correctly
>    ordered by the integer ordering,
>
>  * -0, normalised and subnormal negative floats, and -Inf are ordered
>    opposite to their integer order, and
>
>  * we probably don't care about the ordering of NaN.
>
>> Furthermore, I have seen (in old/past studies) that MIN/MAX
>> are used a lot in signal processing and graphics, for
>> clipping to coordinates and stuffs like that, so since
>> it's so "simple" to implement, I usually include it
>> in my CPUs' instruction sets, along with a complete ROP2
>> (8 boolean operations : OR/ORN/NOR/AND/ANDN/NAND/XOR/XORN)
>
> That sounds right to me.  (And every cycle spend on trivial operations
> is waisting the floating point adder and multiplier, so it's important
> to do what we can to optimise the trivial parts within what the
> instruction width admits.)

We STILL don't know how much time these shaders will waste waiting on
memory reads.  Until we know that, we won't know what optimizations
are even worth doing.

Remember, there's a tradeoff that we don't know about.  When bundling
together lots of these shaders, will we run out of RAM blocks first,
or will we run out of random logic first?  (Assuming we know how much
space we need for things like the memory system, video, etc.)

If we run out of logic first, then we're better off having simpler
processors.  Then we can fit in more.  Each one individually will be
slower, but the aggregate will be better (probably).  If we run out of
BRAMs first, then adding specialized opcodes MAY speed things up,
depending on the memory throughput.

We all want to do some kind of clever, cool processor here.  But the
fact is, it's smarter to identify the TRUE bottlenecks and be clever
about adding opcodes for those later.  Right now, we don't have any
useful knowledge about some of these issues.  So KISS first.

Consider this:  How often are we going to do bit-wise logic operations
in a shader program?  Did you know that you can implement an OR by
combining AND and XOR?  AND is useful because we can use that to do a
float absolute value.  XOR is useful because we can use that to so a
float NOT.  But when do we care to use OR for much of anything?

Just to be a jerk, I'm going to suggest that we take bitwise OR off
the table and require that it be implemented using XOR and AND.


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)

Reply via email to