Re: [Open-graphics] OGA2 Basic Shader Instructions

Timothy Normand Miller Thu, 29 Oct 2009 08:00:01 -0700

On Wed, Oct 28, 2009 at 4:42 AM, Petter Urkedal <[email protected]> wrote:
> On 2009-10-27, Timothy Normand Miller wrote:
>> On Mon, Oct 26, 2009 at 5:45 PM, Petter Urkedal <[email protected]> wrote:
>> >> > 4.  I like the idea of adding minimum and maximum functions in the
>> >> > instruction set; if we need them, that is.  They should not require much
>> >> > logic.  But in this case, note that there is a difference between signed
>> >> > and unsigned.  Do we want both?  On the other hand it only takes 2 to 3
>> >> > cycles to compute any of these if my idea of the instruction set is
>> >> > correct.
>> >>
>> > On 2009-10-26, Timothy Normand Miller wrote:
>> >> We CAN implement this with branches, which are a bit cheaper because
>> >> we don't have a delay slot.  We'll sort of automatically have both
>> >> signed and unsigned variants of branch instructions in some cases.
>> >
>> > I'm hoping we can support a fairly complete set of branch conditions.
>> > If we have space in the instruction word, conditional write-back would
>> > reduce min and max to two instructions, though that seems questionable
>> > at the moment.  Still, I'm not against min/max primitives if they're
>> > common.
>>
>> We can either have a write-back flag, or we can make r0 the bit bucket.
>
> I was really thinking of conditioned instructions, not just a fixed
> throw-away of the result.  E.g. computation of min as
>
>        ;; Computation of r1 := min {r1, r2}
>        sub r1, r2, r0
>        move r2, r1 if_neg r0
>
> Though without flags, the conditional modifier is only applicable to
> unary instructions, which severely limits its usability.


I just don't see where you're going to put the bits for that.

One option, actually, is to make the regfile banked or have a sliding
window.  Now, we can use fewer bits for register numbers, and we have
room for other stuff.  We could do this by splitting DECODE into
DECODE and REG, where DECODE does the offset.
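
As a rough sketch of what "DECODE does the offset" could mean (not a
settled design): the instruction carries a short register field, and
DECODE adds a window or bank base to get the physical index that REG
then reads.  The sizes below (256 physical registers, 5-bit fields) and
the names decode_reg and window_base are assumptions for illustration.

```python
# Hypothetical sizes, chosen only for illustration.
PHYS_REGS = 256   # physical register file entries (assumed)
FIELD_BITS = 5    # width of a register field in the instruction (assumed)

def decode_reg(field, window_base):
    """DECODE stage: map a short register field to a physical index."""
    assert 0 <= field < (1 << FIELD_BITS)
    # Wrap around so a window near the top of the file stays in range.
    return (window_base + field) % PHYS_REGS
```

With these numbers each register field shrinks from 8 bits to 5, so a
three-operand instruction would free up 9 bits for other uses.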

I'm not particularly fond of sliding window register files.

>
>> >> Will we need special float branch instructions?  We may need special
>> >> ge/le instructions as usual since we can't recover those from the
>> >> difference, but for floats and ints, the sign bits are in the same
>> >> place and zero is always zero.
>> >
>> > You are not thinking of two-operand compare-and-branch instructions, but
>> > rather test instructions which store 0 or 1?  I outlined an option of
>> > using bits 32 and 33 of the BRAM below before I realised what you meant.
>>
>> Well, actually, that's an excellent idea.  The sign bit is only one
>> bit, so we don't need anything special for that.  When doing
>> write-back we can check to see if the value is zero and set a flag
>> bit.  We could also store whether or not the computation was the
>> result of an overflow.  This would be like processor flags, but one
>> set for each register.  (And since we get 4 bits free, why not use
>> them!)
>>
>> The main problem with the overflow bit is that it would get lost on a
>> context switch.  Right now, we can't do context switches, so it
>> doesn't matter.  But what if we change our minds in the future?
>
> By context switches, do you mean writing to main memory?  That would
> require logic to recode 36 bit words into 32 bit words, e.g. by shifting
> the top four bits into a 32 bit register and writing it between every 8
> normal writes.  Though, this may bring up other issues with our memory
> cache and dealing with memory alignment.

It's a complication we should avoid.  The KISS principle should be
applied to every decision.

>
>> Having a zero flag is a good idea since it's easy to recompute every
>> time.  I'm not bothered by having a few extra compare instructions.
>> Most will just rely on subtract, but some signed ones will require
>> special compares, which we know about from the MIPS instruction set.
>>
>> Here are some ideas for "summary flags" that make use of the extra
>> bits in the register word:
>> - zero
>> - fp infinity
>> - fp NaN
>
> Contrary to carry and overflow, these are computable from the lower 32
> bits.  So, this would be purely an optimisation, but do we need it
> considering the depth of the pipeline?

Not strictly.  In fact, we have a good number of cycles between when a
branch is issued and when we need to know the address of the next
instruction.
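
To illustrate that these summary flags really are a pure function of
the lower 32 bits, here is a minimal sketch.  It assumes IEEE 754
single-precision layout for the fp cases; the flag bit assignments and
the names summary_flags and pack36 are made up for the example.

```python
# Hypothetical bit positions within the extra 4 bits of a 36-bit word.
ZERO, FP_INF, FP_NAN = 0x1, 0x2, 0x4

def summary_flags(word32):
    """Recompute the summary flags from the 32-bit value alone."""
    word32 &= 0xFFFFFFFF
    flags = 0
    if word32 == 0:
        flags |= ZERO
    exponent = (word32 >> 23) & 0xFF   # IEEE 754 single: 8 exponent bits
    mantissa = word32 & 0x7FFFFF       # 23 mantissa bits
    if exponent == 0xFF:
        # All-ones exponent: infinity if mantissa is zero, else NaN.
        flags |= FP_INF if mantissa == 0 else FP_NAN
    return flags

def pack36(word32):
    """Write-back: store the flags in bits 32..35 above the value."""
    return (summary_flags(word32) << 32) | (word32 & 0xFFFFFFFF)
```

Note that the zero flag here is the integer test; float negative zero
(0x80000000) would not set it unless we special-case it.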

>
> Here is another thought.  We don't need write-back for branches.  If 8
> bits suffice for short jumps, that leaves room for two operands.  Thus,
> we could make branch instructions which share logic with the arithmetic
> instructions, up to the point where the write-back or jump happens.
> However, the implied functionality seems redundant, and this
> approach may leave us no bits for selecting the condition on which to
> branch.  So instead they could be based solely on subtraction:
>
>    ifeq ri, rj, target
>    ifneq ri, rj, target
>    ifule ri, rj, target  ; if (uint)ri ≤ (uint)rj then jump_to target
>    ifult ri, rj, target
>    ifsle ri, rj, target
>    ifslt ri, rj, target
>
> The latter four instructions will coincide with min and max logic.  The
> idea here is to test and act on the carry and overflow flags on the same
> cycle they are generated, so that we don't need to save them.  If 8 bit
> target is too narrow, the compiler can generate

One reason I like separating the compare instruction from the branch
instruction is that we now get to reuse an existing instruction (sub,
subf), and we get to choose the one correct for the datatype.

>
>    ifule ri, rj, l0
>    jump target
> l0:
>
> which is still the same number of instructions (and fewer cycles on the
> average) as
>
>    sub ri, rj, rk
>    ifnpos rk, target
>
> If we don't have a register fixed to zero, we may also want
>
>    ifzero ri, target
>    ifnzero ri, target  ; nonzero
>    ifnpos ri, target   ; non-positive
>    ifneg ri, target
>    ifpos ri, target
>    ifnneg ri, target
>
> Moreover, these can have 16 bit relative targets.  We probably still
> want
>
>    jump far_target, wb  ; used for calls
>    jump far_target

The way I see the branch instruction is as follows:
- 8-bit opcode (with condition embedded in it)
- 8-bit register number (that has the result of sub or other compare)
- 16-bit PC offset for branch target.

The conditions are the same as MIPS, which can only consider zero and
negative (and we can add inf and nan if we like in a separate set of
opcodes).  This leaves out some possible comparisons, which requires a
few extra compare instructions.

We can also have an unconditional branch with a 24-bit address.
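
A minimal sketch of that layout, to make the field widths concrete.
The field order (opcode in the top byte), the signed interpretation of
the 16-bit offset, and all function names are assumptions; the
condition names are borrowed from the ifzero/ifneg/... list quoted
above and only look at zero and sign.

```python
def encode_branch(opcode, reg, offset):
    """Pack 8-bit opcode, 8-bit register, 16-bit signed PC offset."""
    assert 0 <= opcode < 256 and 0 <= reg < 256
    assert -32768 <= offset < 32768
    return (opcode << 24) | (reg << 16) | (offset & 0xFFFF)

def decode_branch(word):
    """Unpack a branch word into (opcode, reg, signed offset)."""
    opcode = (word >> 24) & 0xFF
    reg = (word >> 16) & 0xFF
    offset = word & 0xFFFF
    if offset & 0x8000:          # sign-extend the 16-bit offset
        offset -= 0x10000
    return opcode, reg, offset

def encode_jump(opcode, address):
    """Unconditional branch: 8-bit opcode plus 24-bit address."""
    assert 0 <= opcode < 256 and 0 <= address < (1 << 24)
    return (opcode << 24) | address

def branch_taken(cond, value):
    """Evaluate a condition using only the zero test and the sign bit."""
    value &= 0xFFFFFFFF
    zero = value == 0
    neg = bool(value & 0x80000000)
    return {"zero": zero, "nzero": not zero,
            "neg": neg, "nneg": not neg,
            "pos": not zero and not neg,
            "npos": zero or neg}[cond]
```

For example, encode_branch(0x12, 5, -4) gives 0x1205FFFC, and decoding
that word returns (0x12, 5, -4) again.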

>> I'm not sure what else.  We have sign for free.  We can live without
>> overflow.  What else might we want to know about quickly?
>
> I think not having overflow basically means that we can't easily branch
> on a comparison where the difference of the operands exceeds the signed
> range.  That can be a problem if the shader specification requires that
> "if (x <= y)" does the right thing for the full 32 bit range of x and y.
> The solution may be, as you have mentioned, to have dedicated compare
> instructions.  After all, one rarely uses both the result and the
> overflow/carry flag of the same subtraction except for implementing
> multi-word integers.

In the case of multi-word integers, wouldn't we mostly be using carry,
not overflow?
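
A small sketch of what I mean: when adding multi-word integers, it is
the carry out of each 32-bit word that feeds the next one; signed
overflow would only matter for the most significant word.  The helper
names add32 and add64 are made up for the example.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b, carry_in=0):
    """32-bit add returning (result, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64(a_lo, a_hi, b_lo, b_hi):
    """64-bit add built from two 32-bit adds chained through carry."""
    lo, carry = add32(a_lo, b_lo)
    hi, carry = add32(a_hi, b_hi, carry)
    return lo, hi, carry
```

Adding 1 to 0x00000000FFFFFFFF this way yields lo = 0 and hi = 1, with
the carry from the low word doing all the work.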

>
>> > Also, these are constant shifts only.
>> > In HQ we did variable shifts and even treated negative and big exponents
>> > (RHSs) correctly.  I don't think it cost us that much hardware.
>>
>> Why do they have to be constant shifts?  We can use a decoder to turn
>> a binary value into a one-hot and make that the multiplier.  The
>> decoder plus the multiplier will be smaller than a barrel shifter,
>> especially considering that the multiplier has multiple uses.
>
> My mistake, I thought the suggestion was to use the multiplication
> instruction as is.  Still, how does this compare to sharing the logic
> with rotate and the other shift?  I guess an n-cold is as easy to make as
> a 1-hot, so given the (sign-independent) rot-result, it remains to AND
> it with either an |y|-cold upper or lower mask, depending on the sign of
> y.

To perform a shift, we would compute the one-hot multiplier necessary
to get the right result to appear in the upper 32 bits of the product.
To perform a rotate, we would compute the one-hot necessary to get
the right result when we bitwise OR the upper and lower 32 bits of the
product.
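
Here is a rough model of the idea, assuming a 32x32 multiplier with a
64-bit product and logical (unsigned) shifts only.  The function names
are hypothetical, and taking left shifts from the lower half of the
product is my assumption about how that case would be handled.

```python
MASK32 = 0xFFFFFFFF

def one_hot(n):
    """Decoder: binary shift amount -> one-hot multiplier."""
    return 1 << n

def mul64(a, b):
    """Model the multiplier: return (upper, lower) 32-bit halves."""
    product = (a & MASK32) * b
    return (product >> 32) & MASK32, product & MASK32

def shift_left(x, n):       # valid for 0 <= n <= 31
    upper, lower = mul64(x, one_hot(n))
    return lower

def shift_right(x, n):      # valid for 1 <= n <= 32
    # Multiply by 2**(32 - n) so the result lands in the upper half.
    upper, lower = mul64(x, one_hot(32 - n))
    return upper

def rotate_left(x, n):      # valid for 0 <= n <= 31
    # Bits pushed out of the lower half reappear in the upper half.
    upper, lower = mul64(x, one_hot(n))
    return upper | lower
```

One wrinkle: a right shift by zero would need a multiplier of 2**32,
which does not fit in a 32-bit one-hot, so that case would have to be
handled separately (e.g. as a left shift by zero).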

Where do rotate instructions get used?



-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
