Hi Maciej,

I'm glad to hear from you again!

>  I think that should be a GAS warning really (similarly to macros that 
> expand to multiple instructions in a delay slot) as people ought to be 
> allowed to do what they wish, and then `-Werror' can be used for code 
> quality enforcement (and possibly disabled on a case-by-case basis).

How can one reasonably prevent the bug when compiling a whole Linux
distribution with thousands of packages if GAS merely warns and proceeds
in many cases?

> > [ In theory, GAS could actually insert NOPs prior to the noreorder 
> > directive,
> > to make the loop longer that six instructions, but GAS does not have that
> > kind of capability. Another option that GCC prevents is to place a NOP in
> > the delay slot. ]
> 
>  Well, GAS does have that capability, although of course it is not enabled 
> for `noreorder' code.

Hmm? I'm unable to make sense of that, since such NOPs are unaffected by
noreorder. This is what I meant:

loop:   addiu   $5,$5,1
        addiu   $4,$4,1
        lb      $2,-1($5)
        nop                     /* These two NOPs would prevent the */
        nop                     /* bug while preserving noreorder. */
        .set    noreorder
        .set    nomacro
        bne     $2,$3,loop
        sb      $2,-1($4)
        .set    macro
        .set    reorder

> For generated code I think however that usually it 
> will be cheaper performance-wise if a non-trivial delay-slot instruction 
> is never produced in the affected cases (i.e. a dummy NOP is always used).

I suppose so too. One observation is that R5900 BogoMIPS went down from
584 to 195 when switching from the generic kernel variant

        .set    noreorder
1:      bnez    a0,1b
        addiu   a0,a0,-1
        .set    reorder

that is undefined for R5900 in arch/mips/lib/delay.c, to a special variant

        beqz    a0,2f
1:      addiu   a0,a0,-1
        bnez    a0,1b
2:

for the R5900 where GAS will place a NOP in the branch delay slot. With
loop unrolling BogoMIPS could be inflated again, of course. In general,
unrolling is a bit better for the R5900 than general MIPS. Perhaps GCC
could be informed about that, possibly via profile-guided optimisation.

> > A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not
> > make explicit use of the branch delay slot, as suggested by the patch below?
> > Then GCC will emit
> > 
> > loop:       addiu   $5,$5,1
> >     addiu   $4,$4,1
> >     lb      $2,-1($5)
> >     sb      $2,-1($4)
> >     bne     $2,$3,loop
> > 
> > that GAS will adjust in the ELF object to
> > 
> >    4:       24a50001        addiu   a1,a1,1
> >    8:       24840001        addiu   a0,a0,1
> >    c:       80a2ffff        lb      v0,-1(a1)
> >   10:       a082ffff        sb      v0,-1(a0)
> >   14:       1443fffb        bne     v0,v1,4
> >   18:       00000000        nop
> > 
> > where a NOP is placed in the delay slot to avoid the bug. For longer loops,
> > this relies on GAS making appropriate use of the delay slot. I'm not sure if
> > GAS is good at that, though?
> 
>  I'm sort-of surprised that GCC has produced `reorder' code here, making 
> it rely on GAS for delay slot scheduling.  Have you used an unusual set of 
> options that prevents GCC from making `noreorder' code, which as I recall 
> is the usual default (not for the MIPS16 mode IIRC, as well as some 
> obscure corner cases)?

I used the suggested patch, and recompiled the kernel with it, and verified
with the machine code tool that there were no undefined loops. Wouldn't that
be enough?

> > diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
> > index e17b1d522f0..acd31a8960c 100644
> > --- a/gcc/config/mips/mips.md
> > +++ b/gcc/config/mips/mips.md
> > @@ -1091,6 +1091,7 @@
> >  
> >  (define_delay (and (eq_attr "type" "branch")
> >                (not (match_test "TARGET_MIPS16"))
> > +              (not (match_test "TARGET_FIX_R5900"))
> >                (eq_attr "branch_likely" "yes"))
> >    [(eq_attr "can_delay" "yes")
> >     (nil)
> > @@ -1100,6 +1101,7 @@
> >  ;; not annul on false.
> >  (define_delay (and (eq_attr "type" "branch")
> >                (not (match_test "TARGET_MIPS16"))
> > +              (not (match_test "TARGET_FIX_R5900"))
> >                (ior (match_test "TARGET_CB_NEVER")
> >                     (and (eq_attr "compact_form" "maybe")
> >                          (not (match_test "TARGET_CB_ALWAYS")))
> > 
> 
>  I think you need to modify the default `can_delay' attribute definition 
> instead (in the same way).

I tried to limit the case to branch delays only, which is a rough
approximation. Call and jump delay slots are acceptable. Are you referring
to this piece?

;; Can the instruction be put into a delay slot?
(define_attr "can_delay" "no,yes"
  (if_then_else (and (eq_attr "type" "!branch,call,jump")
                     (eq_attr "hazard" "none")
                     (match_test "get_attr_insn_count (insn) == 1"))
                (const_string "yes")
                (const_string "no")))

> An improved future version might determine the 
> exact conditions as noted in your proposed commit description, however I'd 
> suggest making this simple change first.

Learning the exact conditions, in a hardware sense, would be interesting
indeed, as some short loops actually do appear to work despite being labeled
as undefined. A fairly deep understanding of the R5900 implementation is
essential for such an undertaking. :)

Fredrik

Reply via email to