Hi James,

> Would you mind taking a look at some of these techniques to see if you can
> reduce the size of the generated automata without causing too much
> trouble for code generation for Vulcan?

Thanks for the review. I will look into this.


with regards,
Virendra Pathak


On Mon, Jul 25, 2016 at 4:03 PM, James Greenhalgh
<james.greenha...@arm.com> wrote:
> On Wed, Jul 20, 2016 at 03:07:45PM +0530, Virendra Pathak wrote:
>> Hi gcc-patches group,
>>
>> Please find the patch for adding the basic scheduler for vulcan
>> in the aarch64 port.
>>
>> Tested the patch by building a cross aarch64-linux-gcc,
>> bootstrapping a native aarch64-unknown-linux-gnu compiler, and
>> running the GCC regression suite.
>>
>> Kindly review, and merge the patch to trunk if it is okay.
>>
>> There are a few TODOs in this patch which we plan to address
>> in a follow-up submission, e.g. crc and crypto
>> instructions, and further improving integer & fp load/store
>> based on the addressing mode of the address.
>
> Hi Virendra,
>
> Thanks for the patch. I have some concerns about the size of the
> automata that this description generates. As you can see
> (use (automata_option "stats") in the description to enable statistics),
> this scheduler description generates an automaton for Vulcan more than
> 10x larger than the second-largest description we have for AArch64
> (cortex_a53_advsimd):
>
>   Automaton `cortex_a53_advsimd'
>      9072 NDFA states,          49572 NDFA arcs
>      9072 DFA states,           49572 DFA arcs
>      4050 minimal DFA states,   23679 minimal DFA arcs
>       368 all insns         11 insn equivalence classes
>     0 locked states
>   28759 transition comb vector els, 44550 trans table els: use simple vect
>   44550 min delay table els, compression factor 2
>
>   Automaton `vulcan'
>     103223 NDFA states,          651918 NDFA arcs
>     103223 DFA states,           651918 DFA arcs
>     45857 minimal DFA states,   352255 minimal DFA arcs
>       368 all insns         28 insn equivalence classes
>     0 locked states
>   429671 transition comb vector els, 1283996 trans table els: use comb vect
>   1283996 min delay table els, compression factor 2
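>
> For reference, the statistics above can be enabled with the directive
> mentioned earlier; a minimal sketch (exact placement within vulcan.md
> assumed) is:
>
> ```lisp
> ;; Ask genautomata to print automaton statistics while the
> ;; machine description is processed at compiler build time.
> (automata_option "stats")
> ```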
>
> Such a large automaton increases compiler build time and memory consumption,
> often for little scheduling benefit.
>
> Normally such a large automaton comes from using a large repeat
> expression (*). For example in your modeling of divisions:
>
>> +(define_insn_reservation "vulcan_div" 13
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "sdiv,udiv"))
>> +  "vulcan_i1*13")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivs,fsqrts"))
>> +  "vulcan_f0*8|vulcan_f1*8")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivd,fsqrtd"))
>> +  "vulcan_f0*12|vulcan_f1*12")
>
> In other pipeline models, we try to keep these repeat numbers low to avoid
> the large state-space growth they cause. For example, the Cortex-A57
> pipeline model describes them as:
>
>   (define_insn_reservation "cortex_a57_fp_divd" 16
>     (and (eq_attr "tune" "cortexa57")
>          (eq_attr "type" "fdivd, fsqrtd, neon_fp_div_d, neon_fp_sqrt_d"))
>     "ca57_cx2_block*3")
>
> The lower accuracy is acceptable because of the nature of the scheduling
> model. For a machine with an issue rate of 4 like Vulcan, the compiler
> tries to find four instructions to schedule in each cycle it models
> before it advances the state of the automaton. If an instruction is
> modelled as blocking the "vulcan_i1" unit for 13 cycles, the scheduler
> would have to find up to 52 other instructions to issue before the next
> instruction which would use vulcan_i1. Because scheduling works within
> basic blocks, the chance of finding that many independent instructions is
> extremely low, so you'd never see the benefit of the 13-cycle block.
>
> I tried lowering the repeat expressions as so:
>
>> +(define_insn_reservation "vulcan_div" 13
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "sdiv,udiv"))
>> +  "vulcan_i1*3")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivs,fsqrts"))
>> +  "vulcan_f0*3|vulcan_f1*3")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivd,fsqrtd"))
>> +  "vulcan_f0*5|vulcan_f1*5")
>
> Which more than halves the size of the generated automaton:
>
>   Automaton `vulcan'
>     45370 NDFA states,          319261 NDFA arcs
>     45370 DFA states,           319261 DFA arcs
>     20150 minimal DFA states,   170824 minimal DFA arcs
>       368 all insns         28 insn equivalence classes
>     0 locked states
>   215565 transition comb vector els, 564200 trans table els: use comb vect
>   564200 min delay table els, compression factor 2
>
> The other technique we use to reduce the size of the generated automaton
> is to split the AdvSIMD/FP model off from the main pipeline description
> (the thunderx_main, thunderx_mult, thunderx_divide, and thunderx_simd
>  models take this approach even further and achieve very small automata
>  as a result).
>
> A change like wiring the vulcan_f0 and vulcan_f1 reservations to be
> cpu_units of a new define_automaton "vulcan_advsimd" would cut the size
> of the automaton by half again:
>
>   Automaton `vulcan'
>      8520 NDFA states,          52754 NDFA arcs
>      8520 DFA states,           52754 DFA arcs
>      2414 minimal DFA states,   19882 minimal DFA arcs
>       368 all insns         19 insn equivalence classes
>     0 locked states
>   21062 transition comb vector els, 45866 trans table els: use simple vect
>   45866 min delay table els, compression factor 2
>
>   Automaton `vulcan_simd'
>     12231 NDFA states,          85833 NDFA arcs
>     12231 DFA states,           85833 DFA arcs
>      9246 minimal DFA states,   66554 minimal DFA arcs
>       368 all insns         11 insn equivalence classes
>     0 locked states
>   84074 transition comb vector els, 101706 trans table els: use simple vect
>   101706 min delay table els, compression factor 2
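>
> As a rough sketch of that split (unit and automaton names assumed to
> match your patch; the surrounding declarations are omitted):
>
> ```lisp
> ;; Declare a separate automaton for the FP/AdvSIMD pipes.
> (define_automaton "vulcan_simd")
>
> ;; Move the FP units out of the main "vulcan" automaton into it;
> ;; reservations such as vulcan_fp_divsqrt_s then reference these
> ;; units unchanged, but their state is tracked separately.
> (define_cpu_unit "vulcan_f0,vulcan_f1" "vulcan_simd")
> ```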
>
> Finally, simplifying some of the remaining large expressions
> (vulcan_asimd_load*_mult, vulcan_asimd_load*_elts) can bring the size down
> by half again, making it much more in line with the sizes of the other
> AArch64 automata.
>
> Would you mind taking a look at some of these techniques to see if you can
> reduce the size of the generated automata without causing too much
> trouble for code generation for Vulcan?
>
> Ideally we want to keep the size of all models to a reasonable level to
> avoid bugs like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70473 .
>
> Thanks,
> James
>
>
>> Virendra Pathak  <virendra.pat...@broadcom.com>
>> Julian Brown  <jul...@codesourcery.com>
>>
>>         * config/aarch64/aarch64-cores.def: Change the scheduler
>>         to vulcan.
>>         * config/aarch64/aarch64.md: Include vulcan.md.
>>         * config/aarch64/vulcan.md: New file.
>>
>>
>
