Re: [PATCH 5/9] [SMS] Support new loop pattern

Ayal Zaks Tue, 11 Oct 2011 17:22:28 -0700

On Fri, Sep 30, 2011 at 5:22 PM, Roman Zhuykov <zhr...@ispras.ru> wrote:
> 2011/7/21  <zhr...@ispras.ru>:
>> This patch should be applied only after pending patches by Revital.
>
>
> Ping. New version is attached, it suits current trunk without
> additional patches.


Thanks for the ping.

> Also this related patch needs approval:
> http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01804.html
>
>> The loop should meet the following requirements.
>> First three are the same as for loop with doloop pattern:
>> ...
>> The next three describe the control part of new supported loops.
>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is

you'd probably want to bump to next instruction if falling through,
e.g., pc=(regF!=0)?label:pc+4

>>  flag register;
>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), 
>> where X
>>  is a constant, or maybe a register, which is not changed inside a loop;
>> - only one instruction modifies regC inside a loop (other can use regC, but 
>> not
>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>  step is a constant.
>
>> When doloop is succesfully scheduled by SMS, its number of
>> iterations of loop kernel should be decreased by the number of stages in a
>> schedule minus one, while other iterations expand to prologue and epilogue.
>> In new supported loops such approach can't be used, because some
>> instructions can use count register (regC).  Instead of this,
>> the final register value X in compare instruction regF=COMPARE(regC,X)
>> is changed to another value Y respective to the stage this instruction
>> is scheduled (Y = X - stage * step).

making sure this does not underflow; i.e., that the number of
iterations is no less than stage (you've addressed this towards the
end below).

>
> The main difference from doloop case is that regC can be used by some
> instructions in loop body.
> That's why we are unable to simply adjust regC initial value, but have
> to keep it's value correct on each particular iteration.
> So, we change comparison instruction accordingly.
>
> An example:
> int a[100];
> int main()
> {
>  int i;
>  for (i = 85; i > 12; i -= 5)
>      a[i] = i * i;
>  return a[15]-225;
> }
> ARM assembler with "-O2 -fno-auto-inc-dec":
>        ldr     r0, .L5
>        mov     r3, #85
>        mov     r2, r0
> .L2:
>        mul     r1, r3, r3
>        sub     r3, r3, #5
>        cmp     r3, #10
>        str     r1, [r2, #340]
>        sub     r2, r2, #20
>        bne     .L2
>        ldr     r0, [r0, #60]
>        sub     r0, r0, #225
>        bx      lr
> .L5:
>        .word   a
>
> Loop body is executed 15 times.
> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
> and following times:
> Stage  Time       Insn
> 0          5      mul     r1, r3, r3
> 1         10     sub     r3, r3, #5
> 1         11     cmp     r3, #10
> 1         11     str     r1, [r2, #340]
> 1         13     bne     .L2
> 2         16     sub     r2, r2, #20
>

branch is not scheduled last?

> To make new schedule correct the loop body
> should be executed 14 times and we change compare instruction:

the loop itself should execute 13 times.

> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
> In our example regC is r3, X is 10, step = -5, compare instruction
> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>

right. In general, if the compare is on stage s (starting from 0), it
will be executed s times in the epilog, so it should exit the loop
upon reaching Y = X - s * step.

> So, after SMS it looks like:
>        ldr     r0, .L5
>        mov     r3, #85
>        mov     r2, r0
> ;;prologue
>        mul     r1, r3, r3      ;;from stage 0 first iteration
>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>        cmp     r3, #10
>        str     r1, [r2, #340]
>        mul     r1, r3, r3      ;;from stage 0 second iteration
> ;;body
> .L2:
>        sub     r3, r3, #5
>        sub     r2, r2, #20
>        cmp     r3, #15         ;; new value to compare with is Y=15
>        str     r1, [r2, #340]
>        mul     r1, r3, r3
>        bne     .L2
> ;;epilogue
>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>        cmp     r3, #10
>        str     r1, [r2, #340]
>        sub     r2, r2, #20     ;;from stage 2 last iteration
>
>        ldr     r0, [r0, #60]
>        sub     r0, r0, #225
>        bx      lr
> .L5:
>        .word   a
>
> Real ARM assembler with SMS (after some optimizations and without dead code):
>        mov     r3, #85
>        ldr     r0, .L8
>        mul     r1, r3, r3
>        sub     r3, r3, #5
>        mov     r2, r0
>        str     r1, [r0, #340]
>        mul     r1, r3, r3
> .L2:
>        sub     r3, r3, #5
>        sub     r2, r2, #20
>        cmp     r3, #15
>        str     r1, [r2, #340]
>        mul     r1, r3, r3
>        bne     .L2
>        str     r1, [r2, #320]
>        ldr     r0, [r0, #60]
>        sub     r0, r0, #225
>        bx      lr
> .L8:
>        .word   a
>
>>
>> Testing of this appoach reveals two bugs, which do not appear while SMS was
>> used only for doloop loops.  Both these bugs happen due to the nature of the
>> flag register.  On x86_64 it is clobbered by most of arithmetic instructions.
>> The following situation happens when SMS is enabled without register renaming
>> (-fno-modulo-sched-allow-regmoves).  When data dependency graph is built, 
>> there
>> is a step when we generate anti-dependencies from last register use to first
>> write of this register at the next iteration.

is a step when we generate anti-dependencies from all last registers
uses (i.e., of last register def) to first write of this register at
the next iteration.

>> At this moment we should also
>> create such dependencies to all instructions which clobber the register to
>> prevent this clobbers being before last use is new schedule.
>>

well, we simply need to connect these last uses to either the first
write *or* the first clobber of this register at the next iteration,
according to whichever is first, no?

>> Here is an model of example:
>>
>> loop {
>> set1 regR
>> use1 regR
>> clobber regR
>> set2 regR
>> use2 regR
>> }
>>
>> If we create only use2->set1 anti-dependency (and no use2->cloober) the
>> following wrong schedule is possible:
>>
>> prologue {
>> set1 regR
>> use1 regR
>> clobber regR
>> }
>> kernel {
>> set2 regR
>> clobber regR (instruction from next iteration in terms of old loop kernel)
>> use2 regR
>> set1 regR (also from next iteration)
>> use1 regR (also from next iteration)
>> }
>> epilogue {
>> set2 regR
>> use2 regR
>> }
>>

strange; according to prolog (and epilog), clobber should appear after
use1 in the kernel, no? Aren't there (intra-iteration) dependencies
preventing the clobber to skip over use1 and/or set1?

>> This problem was succesfully fixed by creating a vector of all clobbering
>> instructions together with first write and adding all needed dependencies.
>>

seems like an overkill to me; we didn't draw an edge between every
last use and every write, because writes are kept in order by having
output dependences between them. So should be the case with clobbers.

This should ideally be solved by a dedicated (separate) patch.

>> The other bug happens only with -fmodulo-sched-allow-regmoves.  Here we
>> eliminate some anti-dependence edges in data dependency graph in order to
>> resolve them later by adding some register moves (renaming instructions).  
>> But
>> in situation as in example above compiler gives an ICE because it can't 
>> create
>> a register move, when regR is hardware flag register.  So we have to know 
>> which
>> register(s) cause anti-dependency in order to understand whether we can 
>> ignore
>> it.  I can't find any easy way to gather this information, so I create my own
>> structures to store this info and had implemented my own hooks for
>> sched_analyze function.  This leads to more complex interconnection between
>> ddg.c and modulo-sched.c.

Well, having ddg.c's/create_ddg_dep_from_intra_loop_link() consult
"Register sets from modulo scheduler structures" to determine if an
anti-dep can be eliminated, seems awkward. One should be able to build
a ddg independent of any modulo scheduler structures.
One simple solution is to keep all anti-dep edges of registers that
are clobbered anywhere in the loop. This might be overly conservative,
but perhaps not so if "On x86_64 it is clobbered by most of arithmetic
instructions".
If an anti-dep between a use and a dep of a clobber register is to be
removed, a dependence should be created between the use and a
clobbering instruction following the def. Hope this is clear.

This too should be solved by a dedicated (separate) patch, for easier digestion.

Presumably, the ddg already has all intra-iteration edges preventing
motion of clobbering instructions within an iteration, and we are only
missing inter-iteration deps or deps surfaced by eliminating
anti-deps, right?

You might consider introducing a new type of dependence for such
edges, "Clobber", if it would help.

>>
>> One more thing to point out is number of loop iterations. When number of
>> iterations of a loop is not known at compile time, SMS has to create two loop
>> versions (original and scheduled), and execute scheduled one only when real
>> number of iterations is bigger than number of stages.  In doloop case the
>> number of iterations simply equals to the count register value before the 
>> loop.
>> So SMS finds its constant initialization or makes two loop versions.  In new
>> supported loops number of iterations value is more complex.  It even can't be
>> calculated as (final_reg_value-start_reg_value)/step because of examples like
>> this:
>>
>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>
>> This loop has 22 iterations.  So, i decided to use get_simple_loop_desc
>> function which gives a structure with loop characteristics, some of them 
>> helps
>> to find iteration number:
>>
>> rtx niter_expr - The number of iterations of the loop;
>> bool const_iter - True if the loop iterates the constant number of times;
>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>
>> But we can use these expressions only after looking through some other fields
>> of returned structure:
>>
>> bool simple_p - True if we are able to say anything about number of 
>> iterations
>> of the loop;
>> rtx assumptions - Assumptions under that the rest of the information is 
>> valid;
>> rtx noloop_assumptions - Assumptions under which the loop ends before 
>> reaching
>> the latch;
>> rtx infinite - Condition under which the loop is infinite.
>>
>> I decide to allow SMS scheduling only when simple_p is true and other three
>> fields are NULL_RTX, or when simple_p is true and
>> flag_unsafe_loop_optimizations is set.  One more exception is infinite
>> condition, and the next separate patch is an attempt to process it.
>>

ok, still need to go over this rather lengthy and orthogonal (although
it exposes the bugs above) piece.

Ayal.


> --
> Roman Zhuykov
> zhr...@ispras.ru
>

Re: [PATCH 5/9] [SMS] Support new loop pattern

Reply via email to