Hi James,

> Would you mind taking a look at some of these techniques to see if you can
> reduce the size of the generated automata without causing too much
> trouble for code generation for Vulcan?
Thanks for the review. I will look into this.

With regards,
Virendra Pathak

On Mon, Jul 25, 2016 at 4:03 PM, James Greenhalgh <james.greenha...@arm.com> wrote:
> On Wed, Jul 20, 2016 at 03:07:45PM +0530, Virendra Pathak wrote:
>> Hi gcc-patches group,
>>
>> Please find the patch for adding the basic scheduler for vulcan
>> in the aarch64 port.
>>
>> Tested the patch by building a cross aarch64-linux-gcc, bootstrapping
>> a native aarch64-unknown-linux-gnu compiler, and running the gcc
>> regression suite.
>>
>> Kindly review and merge the patch to trunk, if the patch is okay.
>>
>> There are a few TODOs in this patch which we plan to address in the
>> next submission, e.g. crc and crypto instructions, and further
>> improving integer & fp load/store scheduling based on the addressing
>> mode of the address.
>
> Hi Virendra,
>
> Thanks for the patch. I have some concerns about the size of the
> automata that this description generates. As you can see
> (use (automata_option "stats") in the description to enable statistics),
> this scheduler description generates a 10x larger automaton for Vulcan
> than the second largest description we have for AArch64
> (cortex_a53_advsimd):
>
> Automaton `cortex_a53_advsimd'
>    9072 NDFA states,  49572 NDFA arcs
>    9072 DFA states,  49572 DFA arcs
>    4050 minimal DFA states,  23679 minimal DFA arcs
>    368 all insns  11 insn equivalence classes
>    0 locked states
>    28759 transition comb vector els, 44550 trans table els: use simple vect
>    44550 min delay table els, compression factor 2
>
> Automaton `vulcan'
>    103223 NDFA states,  651918 NDFA arcs
>    103223 DFA states,  651918 DFA arcs
>    45857 minimal DFA states,  352255 minimal DFA arcs
>    368 all insns  28 insn equivalence classes
>    0 locked states
>    429671 transition comb vector els, 1283996 trans table els: use comb vect
>    1283996 min delay table els, compression factor 2
>
> Such a large automaton increases compiler build time and memory
> consumption, often for little scheduling benefit.
>
> Normally such a large automaton comes from using a large repeat
> expression (*). For example, in your modelling of divisions:
>
>> +(define_insn_reservation "vulcan_div" 13
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "sdiv,udiv"))
>> +  "vulcan_i1*13")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivs,fsqrts"))
>> +  "vulcan_f0*8|vulcan_f1*8")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivd,fsqrtd"))
>> +  "vulcan_f0*12|vulcan_f1*12")
>
> In other pipeline models, we try to keep these repeat numbers low to
> avoid the large state-space growth they cause. For example, the
> Cortex-A57 pipeline model describes them as:
>
>   (define_insn_reservation "cortex_a57_fp_divd" 16
>     (and (eq_attr "tune" "cortexa57")
>          (eq_attr "type" "fdivd, fsqrtd, neon_fp_div_d, neon_fp_sqrt_d"))
>     "ca57_cx2_block*3")
>
> The lower accuracy is acceptable because of the nature of the scheduling
> model. For a machine with an issue rate of "4", like Vulcan, on each
> cycle it models the compiler tries to find four instructions to
> schedule before it advances the state of the automaton. If an
> instruction is modelled as blocking the "vulcan_i1" unit for 13 cycles,
> the scheduler would have to find up to 52 instructions to issue before
> the next instruction that uses vulcan_i1. Because scheduling works
> within basic blocks, the chance of finding that many independent
> instructions is extremely low, and so you'd never see the benefit of
> the 13-cycle block.
>
> I tried lowering the repeat expressions like so:
>
>> +(define_insn_reservation "vulcan_div" 13
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "sdiv,udiv"))
>> +  "vulcan_i1*3")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivs,fsqrts"))
>> +  "vulcan_f0*3|vulcan_f1*3")
>> +
>> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
>> +  (and (eq_attr "tune" "vulcan")
>> +       (eq_attr "type" "fdivd,fsqrtd"))
>> +  "vulcan_f0*5|vulcan_f1*5")
>
> which more than halves the size of the generated automaton:
>
> Automaton `vulcan'
>    45370 NDFA states,  319261 NDFA arcs
>    45370 DFA states,  319261 DFA arcs
>    20150 minimal DFA states,  170824 minimal DFA arcs
>    368 all insns  28 insn equivalence classes
>    0 locked states
>    215565 transition comb vector els, 564200 trans table els: use comb vect
>    564200 min delay table els, compression factor 2
>
> The other technique we use to reduce the size of the generated automaton
> is to split the AdvSIMD/FP model off from the main pipeline description
> (the thunderx_main, thunderx_mult, thunderx_divide, and thunderx_simd
> models take this approach even further and achieve very small automata
> as a result).
>
> A change like wiring the vulcan_f0 and vulcan_f1 reservations to be
> cpu_units of a new define_automaton "vulcan_simd" would cut the size
> of the automaton in half again:
>
> Automaton `vulcan'
>    8520 NDFA states,  52754 NDFA arcs
>    8520 DFA states,  52754 DFA arcs
>    2414 minimal DFA states,  19882 minimal DFA arcs
>    368 all insns  19 insn equivalence classes
>    0 locked states
>    21062 transition comb vector els, 45866 trans table els: use simple vect
>    45866 min delay table els, compression factor 2
>
> Automaton `vulcan_simd'
>    12231 NDFA states,  85833 NDFA arcs
>    12231 DFA states,  85833 DFA arcs
>    9246 minimal DFA states,  66554 minimal DFA arcs
>    368 all insns  11 insn equivalence classes
>    0 locked states
>    84074 transition comb vector els, 101706 trans table els: use simple vect
>    101706 min delay table els, compression factor 2
>
> Finally, simplifying some of the remaining large expressions
> (vulcan_asimd_load*_mult, vulcan_asimd_load*_elts) can bring the size
> down by half again, making it much more in line with the sizes of the
> other AArch64 automata.
>
> Would you mind taking a look at some of these techniques to see if you
> can reduce the size of the generated automata without causing too much
> trouble for code generation for Vulcan?
>
> Ideally we want to keep the size of all models at a reasonable level to
> avoid bugs like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70473 .
>
> Thanks,
> James
>
>> Virendra Pathak <virendra.pat...@broadcom.com>
>> Julian Brown <jul...@codesourcery.com>
>>
>> * config/aarch64/aarch64-cores.def: Change the scheduler
>> to vulcan.
>> * config/aarch64/aarch64.md: Include vulcan.md.
>> * config/aarch64/vulcan.md: New file.
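P.S. As a starting point for the rework, the automaton split suggested in the review might look roughly like the sketch below. This is a hypothetical fragment for config/aarch64/vulcan.md, not tested code: the vulcan_i0/vulcan_i1 and vulcan_f0/vulcan_f1 unit names and the reservation shown are assumptions based only on the snippets quoted in this thread, and the full patch may declare more units.

```lisp
;; Keep the integer pipes in the main automaton.
(define_automaton "vulcan")
(define_cpu_unit "vulcan_i0,vulcan_i1" "vulcan")

;; Move the FP/AdvSIMD pipes into a second automaton, so genautomata
;; builds two small state machines instead of one cross product.
(define_automaton "vulcan_simd")
(define_cpu_unit "vulcan_f0,vulcan_f1" "vulcan_simd")

;; Reservations that touch only vulcan_f0/vulcan_f1 are then tracked
;; entirely by the vulcan_simd automaton, e.g. (using the lowered
;; repeat count from the review):
(define_insn_reservation "vulcan_fp_divsqrt_d" 23
  (and (eq_attr "tune" "vulcan")
       (eq_attr "type" "fdivd,fsqrtd"))
  "vulcan_f0*5|vulcan_f1*5")
```

Reservations that mix units from both automata should still compose correctly, since a reservation string may name units from different automata; the win comes from the two state spaces being enumerated separately.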