> On Nov 25, 2025, at 22:29, Kyrylo Tkachov <[email protected]> wrote:
>
> Hi Maxim,
>
>> On 25 Nov 2025, at 09:31, Maxim Kuvyrkov <[email protected]> wrote:
>>
>> Hi Jennifer,
>
> Jennifer is no longer working on this, I’m shepherding this patch set on her
> behalf.
Hi Kyrill,

Thanks for the replies, that clears things up.  It would be nice to
copy-paste some of this history into olympus.md, but that's a matter of
taste.

I've reviewed this and it looks good to me.  I encourage AArch64
maintainers to approve.

Thanks!

--
Maxim Kuvyrkov
https://www.linaro.org

>
>
>>
>> I assume I'm missing something because I don't know enough about Olympus
>> microarchitecture -- but why use dispatch scheduling instead of DFA?
>> Reading through the optimization manual I don't see similarities with BDVERx
>> microarchitectures to explain the use of dispatch scheduling.  Could you add
>> a comment to olympus.md explaining why dispatch is preferred?
>>
>
> We’ve been thinking for a few years about the future of instruction
> scheduling for big out-of-order cores in AArch64 GCC.  For the big Neoverse
> cores we basically keep reusing the old Cortex-A57 model!
> But since the OoO machinery in these CPUs is so sophisticated and aggressive
> at hiding instruction latencies and resolving dependencies, we haven’t managed
> to do any better with the DFA approach.
> So we’ve implemented the dispatch scheduling hooks for aarch64 to try a
> different approach.  It tries to optimize for the frontend dispatch parts of
> the CPU described in various Software Optimization Guides, e.g. section
> “4.1 Dispatch Constraints” of the Neoverse V2 Software Optimization Guide.
> It made sense to focus on these constraints as they are easier to implement
> than new DFA descriptions (there are far fewer constraints to express than
> instruction latencies and throughput groups).
>
>
>> Also, can you share any benchmarking numbers comparing the new scheduling
>> model with the likes of Neoverse-V2 and/or the generic one (Cortex-A57)?
>>
>
> Jennifer shared some results for Neoverse V2-based Grace at:
> https://gcc.gnu.org/pipermail/gcc-patches/2025-July/691020.html
> Basically the takeaway is that it’s largely neutral vs the current DFA
> approach, with some slight wins in SIMD-heavy code like GROMACS.
> I can’t share Olympus numbers currently, but in this regard it behaves
> similarly to Grace.
> Thanks,
> Kyrill
>
>
>> Thanks!
>>
>> --
>> Maxim Kuvyrkov
>> https://www.linaro.org
>>
>>
>>> On Oct 24, 2025, at 03:21, Jennifer Schmitz <[email protected]> wrote:
>>>
>>> This patch enables dispatch scheduling for the NVIDIA Olympus core.
>>> The dispatch constraints are based on the Olympus CPU Core Software
>>> Optimization Guide
>>> (https://docs.nvidia.com/olympus-cpu-core-software-optimization-guide-dp12531-001v0-7.pdf).
>>>
>>> The patch was bootstrapped and tested on aarch64-linux-gnu with no
>>> regressions.
>>> OK for trunk?
>>>
>>> Signed-off-by: Jennifer Schmitz <[email protected]>
>>>
>>> gcc/
>>> 	* config/aarch64/aarch64.md: Include olympus.md.
>>> 	* config/aarch64/olympus.md: New file.
>>> 	* config/aarch64/tuning_models/olympus.h: Add dispatch
>>> 	constraints and enable dispatch scheduling.
>>> ---
>>>  gcc/config/aarch64/aarch64.md              |   1 +
>>>  gcc/config/aarch64/olympus.md              | 199 +++++++++++
>>>  gcc/config/aarch64/tuning_models/olympus.h | 363 ++++++++++++++++++++-
>>>  3 files changed, 561 insertions(+), 2 deletions(-)
>>>  create mode 100644 gcc/config/aarch64/olympus.md
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 98c65a74c8e..8aef3858a79 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -686,6 +686,7 @@
>>>
>>>  ;; Dispatch scheduling
>>>  (include "neoversev2.md")
>>> +(include "olympus.md")
>>>
>>>  ;; -------------------------------------------------------------------
>>>  ;; Jumps and other miscellaneous insns
>>> diff --git a/gcc/config/aarch64/olympus.md b/gcc/config/aarch64/olympus.md
>>> new file mode 100644
>>> index 00000000000..22b12016ffd
>>> --- /dev/null
>>> +++ b/gcc/config/aarch64/olympus.md
>>> @@ -0,0 +1,199 @@
>>> +;; Instruction attribute for dispatch scheduling for NVIDIA Olympus.
>>> +;; Copyright The GNU Toolchain Authors.
>>> +;;
>>> +;; This file is part of GCC.
>>> +;;
>>> +;; GCC is free software; you can redistribute it and/or modify it
>>> +;; under the terms of the GNU General Public License as published by
>>> +;; the Free Software Foundation; either version 3, or (at your option)
>>> +;; any later version.
>>> +;;
>>> +;; GCC is distributed in the hope that it will be useful, but
>>> +;; WITHOUT ANY WARRANTY; without even the implied warranty of
>>> +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> +;; General Public License for more details.
>>> +;;
>>> +;; You should have received a copy of the GNU General Public License
>>> +;; along with GCC; see the file COPYING3.  If not see
>>> +;; <http://www.gnu.org/licenses/>.
>>> +
>>> +;; Attribute that groups other instruction attributes into dispatch groups
>>> +;; for the Olympus core.  Dispatch groups are groups of pipelines for which
>>> +;; the SWOG specifies a dispatch constraint.  For example: Because the SWOG
>>> +;; contains a dispatch constraint for the V12 pipelines, there is an attribute
>>> +;; value "v12" that groups instructions that are processed by the V1 and V2
>>> +;; pipelines.
>>> +;; Values that contain a "_" represent combinations of dispatch groups.
>>> +;; For example, there are dispatch constraints for the M and V pipelines.  The
>>> +;; value "m_v" groups instructions that utilize the M as well as the
>>> +;; V pipelines, such that both dispatch constraints apply.
>>> +
>>> +(define_attr "olympus_dispatch"
>>> +  "none,b,i,m,m0,l,v,v0,v03,v12,v45,v0123,m_v,l_v,m_l,m_v0123,v_v0123,\
>>> +   l_v03,sa_d,sa_v0123,sa_v_v0123"
>>> +  (cond [(eq_attr "type" "branch,call")
>>> +           (const_string "b")
>>> +         (eq_attr "type" "adr,adc_reg,alu_ext,alu_imm,alu_sreg,alus_ext,\
>>> +           alus_imm,alus_shift_imm,alus_sreg,clz,csel,extend,logic_imm,\
>>> +           logic_reg,logics_imm,logics_reg,mov_imm,mov_reg,rbit,rev")
>>> +           (const_string "i")
>>> +         (ior
>>> +           (eq_attr "type" "bfm,bfx,crc,f_mrc,logic_shift_imm,\
>>> +             logics_shift_imm,memtag,mul,neon_from_gp,neon_from_gp_q,\
>>> +             rotate_imm,shift_reg,smull,sdiv,udiv,umull")
>>> +           (eq_attr "autodetect_type" "alu_shift_asr_op2,alu_shift_lsl_op2,\
>>> +             alu_shift_lsr_op2")
>>> +           (eq_attr "sve_type" "sve_pred_cnt_ctrl,sve_pred_cnt_scalar,\
>>> +             sve_pred_logical,sve_pred_misc"))
>>> +           (const_string "m")
>>> +         (eq_attr "sve_type" "sve_ffr")
>>> +           (const_string "m0")
>>> +         (ior
>>> +           (eq_attr "type" "f_loadd,f_loads,load_4,load_8,load_16,\
>>> +             neon_ldp,neon_ldp_q,neon_load1_1reg,neon_load1_1reg_q,\
>>> +             neon_load1_2reg,neon_load1_2reg_q,neon_load1_3reg,\
>>> +             neon_load1_3reg_q,neon_load1_4reg,neon_load1_4reg_q,\
>>> +             neon_load1_all_lanes")
>>> +           (eq_attr "sve_type" "sve_load_1reg"))
>>> +           (const_string "l")
>>> +         (ior
>>> +           (eq_attr "type" "crypto_aese,crypto_aesmc,crypto_pmull,faddd,fadds,\
>>> +             fccmpd,fccmps,fcmpd,fcmps,fcsel,fconstd,fconsts,fmuld,fmuls,\
>>> +             ffarithd,ffariths,fmacd,fmacs,f_mcr,f_minmaxd,f_minmaxs,fmov,\
>>> +             f_rintd,f_rints,neon_abs,neon_abs_q,\
>>> +             neon_add,neon_add_halve,neon_add_halve_narrow_q,neon_add_halve_q,\
>>> +             neon_add_long,neon_add_q,neon_add_widen,neon_abd,neon_abd_long,\
>>> +             neon_abd_q,neon_arith_acc,neon_arith_acc_q,neon_bsl,neon_bsl_q,\
>>> +             neon_cls,neon_cls_q,neon_compare,neon_compare_q,\
>>> +             neon_compare_zero,neon_compare_zero_q,neon_cnt,neon_cnt_q,\
>>> +             neon_dup,neon_dup_q,neon_ext,neon_ext_q,neon_fcadd,neon_fcmla,\
>>> +             neon_fp_abs_d,neon_fp_abs_d_q,neon_fp_abs_s,neon_fp_abs_s_q,\
>>> +             neon_fp_abd_d,neon_fp_abd_d_q,neon_fp_abd_s,neon_fp_abd_s_q,\
>>> +             neon_fp_addsub_d,neon_fp_addsub_d_q,neon_fp_addsub_s,\
>>> +             neon_fp_addsub_s_q,neon_fp_compare_d,neon_fp_compare_d_q,\
>>> +             neon_fp_compare_s,neon_fp_compare_s_q,neon_fp_mla_d,\
>>> +             neon_fp_mla_d_q,neon_fp_mla_d_scalar_q,neon_fp_mla_s,\
>>> +             neon_fp_mla_s_q,neon_fp_mla_s_scalar,neon_fp_mla_s_scalar_q,\
>>> +             neon_fp_minmax_d,neon_fp_minmax_d_q,neon_fp_minmax_s,\
>>> +             neon_fp_minmax_s_q,neon_fp_mul_d,neon_fp_mul_d_q,neon_fp_mul_s,\
>>> +             neon_fp_mul_s_q,neon_fp_mul_s_scalar,neon_fp_mul_s_scalar_q,\
>>> +             neon_fp_mul_d_scalar_q,neon_fp_neg_s,neon_fp_neg_s_q,\
>>> +             neon_fp_neg_d,neon_fp_neg_d_q,neon_fp_recps_d,\
>>> +             neon_fp_recps_d_q,neon_fp_recps_s,neon_fp_recps_s_q,\
>>> +             neon_fp_reduc_add_d,neon_fp_reduc_add_d_q,neon_fp_reduc_add_s,\
>>> +             neon_fp_reduc_add_s_q,neon_fp_reduc_minmax_d,\
>>> +             neon_fp_reduc_minmax_d_q,neon_fp_reduc_minmax_s,\
>>> +             neon_fp_reduc_minmax_s_q,neon_fp_rsqrts_d,neon_fp_rsqrts_d_q,\
>>> +             neon_fp_rsqrts_s,neon_fp_rsqrts_s_q,neon_logic,neon_logic_q,\
>>> +             neon_minmax,neon_minmax_q,neon_move,neon_move_narrow_q,\
>>> +             neon_move_q,neon_neg,neon_neg_q,neon_permute,neon_permute_q,\
>>> +             neon_qabs,neon_qabs_q,neon_qadd,neon_qadd_q,neon_qneg,neon_qneg_q,\
>>> +             neon_qsub,neon_qsub_q,neon_rev,neon_rev_q,neon_rbit,neon_rbit_q,\
>>> +             neon_sat_shift_imm,neon_sat_shift_imm_narrow_q,\
>>> +             neon_sat_shift_imm_q,neon_sat_shift_reg,neon_sat_shift_reg_q,\
>>> +             neon_shift_acc,neon_shift_acc_q,neon_shift_imm,\
>>> +             neon_shift_imm_long,neon_shift_imm_narrow_q,neon_shift_imm_q,\
>>> +             neon_shift_reg,neon_shift_reg_q,neon_sub,neon_sub_halve,\
>>> +             neon_sub_halve_narrow_q,neon_sub_halve_q,neon_sub_long,\
>>> +             neon_sub_q,neon_sub_widen,neon_tbl1,neon_tbl1_q,neon_tbl2,\
>>> +             neon_tbl2_q,neon_tbl3,neon_tbl3_q,neon_tbl4,neon_tbl4_q,\
>>> +             neon_tst,neon_tst_q,neon_zip,neon_zip_q")
>>> +           (eq_attr "sve_type" "sve_fp_arith,sve_fp_misc,\
>>> +             sve_fp_mul,sve_fp_reduc,sve_int_accum,sve_int_dot,sve_int_extend,\
>>> +             sve_int_general,sve_int_pmul,sve_int_shift"))
>>> +           (const_string "v")
>>> +         (ior
>>> +           (eq_attr "type" "crypto_sha1_fast,crypto_sha1_slow,crypto_sha1_xor,\
>>> +             crypto_sha256_fast,crypto_sha256_slow,crypto_sha3,crypto_sha512,\
>>> +             crypto_sm4")
>>> +           (eq_attr "sve_type" "sve_crypto_sha3"))
>>> +           (const_string "v0")
>>> +         (ior
>>> +           (eq_attr "type" "fccmpd,fccmps,fcmpd,fcmps,neon_fp_to_int_d,\
>>> +             neon_fp_to_int_d_q,neon_fp_to_int_s,neon_fp_to_int_s_q,\
>>> +             neon_to_gp,neon_to_gp_q")
>>> +           (eq_attr "sve_type" "sve_fp_assoc_add,sve_fp_cmp"))
>>> +           (const_string "v03")
>>> +         (ior
>>> +           (eq_attr "type" "fdivd,fdivs,fsqrtd,fsqrts,neon_fp_div_d,\
>>> +             neon_fp_div_d_q,neon_fp_div_s,neon_fp_div_s_q,neon_fp_sqrt_d,\
>>> +             neon_fp_sqrt_d_q,neon_fp_sqrt_s,neon_fp_sqrt_s_q")
>>> +           (eq_attr "sve_type" "sve_fp_div,sve_fp_exp,sve_fp_sqrt,\
>>> +             sve_int_extract,sve_int_bit_perm"))
>>> +           (const_string "v12")
>>> +         (eq_attr "sve_type" "sve_int_div")
>>> +           (const_string "v45")
>>> +         (ior
>>> +           (eq_attr "type" "crypto_sm3,f_cvt,f_cvtf2i,f_cvti2f,f_rintd,\
>>> +             f_rints,mla,neon_fp_cvt_narrow_d_q,neon_fp_cvt_narrow_s_q,\
>>> +             neon_fp_cvt_widen_h,neon_fp_cvt_widen_s,\
>>> +             neon_fp_recpe_d,neon_fp_recpe_d_q,neon_fp_recpe_s,\
>>> +             neon_fp_recpe_s_q,neon_fp_recpx_d,neon_fp_recpx_d_q,\
>>> +             neon_fp_recpx_s,neon_fp_recpx_s_q,neon_fp_round_d,\
>>> +             neon_fp_round_d_q,neon_fp_round_s,neon_fp_round_s_q,\
>>> +             neon_fp_rsqrte_d,neon_fp_rsqrte_d_q,neon_fp_rsqrte_s,\
>>> +             neon_fp_rsqrte_s_q,neon_int_to_fp_s,\
>>> +             neon_int_to_fp_s_q,neon_int_to_fp_d,neon_int_to_fp_d_q,\
>>> +             neon_mla_b,neon_mla_b_long,neon_mla_b_q,neon_mla_h,\
>>> +             neon_mla_h_long,neon_mla_h_q,neon_mla_h_scalar,\
>>> +             neon_mla_h_scalar_q,neon_mla_h_scalar_long,neon_mla_s,\
>>> +             neon_mla_s_long,neon_mla_s_q,neon_mla_s_scalar,\
>>> +             neon_mla_s_scalar_q,neon_mla_s_scalar_long,neon_mul_b,\
>>> +             neon_mul_b_long,neon_mul_b_q,neon_mul_d_long,neon_mul_h,\
>>> +             neon_mul_h_q,neon_mul_h_long,neon_mul_h_scalar,\
>>> +             neon_mul_h_scalar_long,neon_mul_h_scalar_q,neon_mul_s,\
>>> +             neon_mul_s_scalar_q,neon_mul_s_q,neon_mul_s_long,\
>>> +             neon_mul_s_scalar,neon_mul_s_scalar_long,neon_reduc_add,\
>>> +             neon_reduc_add_long,neon_reduc_add_q,neon_reduc_minmax,\
>>> +             neon_reduc_minmax_q,neon_sat_mla_b_long,neon_sat_mla_h_long,\
>>> +             neon_sat_mla_h_scalar_long,neon_sat_mla_s_long,\
>>> +             neon_sat_mla_s_scalar_long,neon_sat_mul_b,neon_sat_mul_b_q,\
>>> +             neon_sat_mul_b_long,neon_sat_mul_h,neon_sat_mul_h_q,\
>>> +             neon_sat_mul_h_long,neon_sat_mul_h_scalar,\
>>> +             neon_sat_mul_h_scalar_q,neon_sat_mul_h_scalar_long,\
>>> +             neon_sat_mul_s,neon_sat_mul_s_q,neon_sat_mul_s_long,\
>>> +             neon_sat_mul_s_scalar,neon_sat_mul_s_scalar_q,\
>>> +             neon_sat_mul_s_scalar_long,smlal,umlal")
>>> +           (eq_attr "sve_type" "sve_fp_cvt,sve_fp_log,sve_int_cvt,\
>>> +             sve_int_mul,sve_int_recip_est"))
>>> +           (const_string "v0123")
>>> +         (eq_attr "type" "neon_ins,neon_ins_q")
>>> +           (const_string "m_v")
>>> +         (ior
>>> +           (eq_attr "type" "neon_load1_one_lane,neon_load1_one_lane_q,\
>>> +             neon_load2_2reg,neon_load2_2reg_q,neon_load2_all_lanes,\
>>> +             neon_load2_all_lanes_q,neon_load2_one_lane,neon_load3_3reg,\
>>> +             neon_load3_3reg_q,neon_load3_all_lanes,neon_load3_all_lanes_q,\
>>> +             neon_load3_one_lane,neon_load4_4reg,neon_load4_4reg_q,\
>>> +             neon_load4_all_lanes,neon_load4_all_lanes_q,neon_load4_one_lane")
>>> +           (eq_attr "sve_type" "sve_load_2reg,sve_load_3reg,sve_load_4reg"))
>>> +           (const_string "l_v")
>>> +         (eq_attr "sve_type" "sve_load_pred,sve_pred_vec")
>>> +           (const_string "m_l")
>>> +         (eq_attr "sve_type" "sve_int_cmp_set,sve_int_index,sve_int_match")
>>> +           (const_string "m_v0123")
>>> +         (eq_attr "sve_type" "sve_int_reduc")
>>> +           (const_string "v_v0123")
>>> +         (eq_attr "sve_type" "sve_gatherload_32,sve_gatherload_64")
>>> +           (const_string "l_v03")
>>> +         (ior
>>> +           (eq_attr "type" "store_4,store_8,store_16")
>>> +           (eq_attr "sve_type" "sve_store_pred"))
>>> +           (const_string "sa_d")
>>> +         (ior
>>> +           (eq_attr "type" "f_stored,f_stores,neon_stp,neon_stp_q,\
>>> +             neon_store1_1reg,neon_store1_1reg_q,neon_store1_2reg,\
>>> +             neon_store1_2reg_q,neon_store1_3reg,neon_store1_3reg_q,\
>>> +             neon_store1_4reg,neon_store1_4reg_q")
>>> +           (eq_attr "sve_type" "sve_store_1reg"))
>>> +           (const_string "sa_v0123")
>>> +         (ior
>>> +           (eq_attr "type" "neon_store1_one_lane,neon_store1_one_lane_q,\
>>> +             neon_store2_2reg,neon_store2_2reg_q,neon_store2_one_lane,\
>>> +             neon_store2_one_lane_q,neon_store3_3reg,neon_store3_3reg_q,\
>>> +             neon_store3_one_lane,neon_store3_one_lane_q,neon_store4_4reg,\
>>> +             neon_store4_4reg_q,neon_store4_one_lane,neon_store4_one_lane_q")
>>> +           (eq_attr "sve_type" "sve_store_2reg,sve_store_3reg,sve_store_4reg,\
>>> +             sve_scatterstore_32,sve_scatterstore_64"))
>>> +           (const_string "sa_v_v0123")]
>>> +    (const_string "none")))
>>> \ No newline at end of file
>>> diff --git a/gcc/config/aarch64/tuning_models/olympus.h b/gcc/config/aarch64/tuning_models/olympus.h
>>> index d19aca8c323..404d79307df 100644
>>> --- a/gcc/config/aarch64/tuning_models/olympus.h
>>> +++ b/gcc/config/aarch64/tuning_models/olympus.h
>>> @@ -21,6 +21,8 @@
>>>  #define GCC_AARCH64_H_OLYMPUS
>>>
>>>  #include "generic.h"
>>> +#include "../aarch64-sched-dispatch.h"
>>> +#include "vec.h"
>>>
>>>  static struct cpu_regmove_cost olympus_regmove_cost =
>>>  {
>>> @@ -169,6 +171,362 @@ static cpu_prefetch_tune olympus_prefetch_tune =
>>>    -1  /* default_opt_level */
>>>  };
>>>
>>> +/* Olympus dispatch constraint types.  */
>>> +enum olympus_dispatch_constraint_type
>>> +{
>>> +  OLYMPUS_TOTAL_SLOTS,   /* total slots */
>>> +  OLYMPUS_M_PIPE,        /* m pipelines */
>>> +  OLYMPUS_M0_PIPE,       /* m0 pipeline */
>>> +  OLYMPUS_BRANCH_PIPE,   /* branch pipelines */
>>> +  OLYMPUS_L_SA_PIPE,     /* l, sa pipelines */
>>> +  OLYMPUS_SA_PIPE,       /* sa pipelines */
>>> +  OLYMPUS_V_PIPE,        /* v pipelines */
>>> +  OLYMPUS_V0123_PIPE,    /* v0, v1, v2, v3 pipelines */
>>> +  OLYMPUS_V03_PIPE,      /* v0, v3 pipelines */
>>> +  OLYMPUS_V12_PIPE,      /* v1, v2 pipelines */
>>> +  OLYMPUS_V45_PIPE,      /* v4, v5 pipelines */
>>> +  OLYMPUS_V0_PIPE        /* v0 pipeline */
>>> +};
>>> +
>>> +/* Olympus dispatch constraints for instruction scheduling.
>>> +   The numbers are based on the Olympus CPU Core SWOG, section 4.1.  */
>>> +static const int olympus_dispatch_max_slots[] = {
>>> +  10, /* total slots */
>>> +  6,  /* m pipelines */
>>> +  3,  /* m0 pipeline */
>>> +  3,  /* branch pipelines */
>>> +  8,  /* l, sa pipelines */
>>> +  4,  /* sa pipelines */
>>> +  6,  /* v pipelines */
>>> +  4,  /* v0, v1, v2, v3 pipelines */
>>> +  4,  /* v0, v3 pipelines */
>>> +  4,  /* v1, v2 pipelines */
>>> +  2,  /* v4, v5 pipelines */
>>> +  2   /* v0 pipeline */
>>> +};
>>> +
>>> +/* Olympus dispatch constraint callback function.
>>> +   Determines which constraints apply to an instruction and how many slots
>>> +   it requires.  Returns a vec of (constraint_index, slots_required) pairs.  */
>>> +static vec<std::pair<int, int>>
>>> +olympus_dispatch_constraint_callback (rtx_insn *insn)
>>> +{
>>> +  auto dispatch_group = get_attr_olympus_dispatch (insn);
>>> +  vec<std::pair<int, int>> constraints = vNULL;
>>> +
>>> +  /* In addition to deducting slots from the specific pipeline types required
>>> +     by an instruction, we keep track of the total number of slots required.
>>> +     There are different cases how total_slots is derived from the specific
>>> +     pipeline slots:
>>> +     Case 1: Single top-level pipeline type used
>>> +     Example groups: OLYMPUS_DISPATCH_B, OLYMPUS_DISPATCH_V_V0123
>>> +     Total_slots is equal to the number of slots for the top-level
>>> +     pipeline type.
>>> +     Example: Assume an instruction in the OLYMPUS_DISPATCH_V_V0123
>>> +     dispatch group is executed as 2 MOps: 1 utilizing any V pipeline and
>>> +     1 utilizing a V0123 pipeline.  It requires 1 slot in the
>>> +     OLYMPUS_V0123_PIPE constraint and a total of 2 slots in the
>>> +     OLYMPUS_V_PIPE constraint, because the V0123 pipelines are a subset of
>>> +     the V pipelines.  Total_slots is 2.
>>> +     Case 2: Multiple top-level pipeline types used
>>> +     Example groups: OLYMPUS_DISPATCH_M_V, OLYMPUS_DISPATCH_SA_V_V0123
>>> +     Total_slots is equal to the sum of the slots for the top-level
>>> +     pipeline types.  */
>>> +  int total_slots = 1;
>>> +
>>> +  switch (dispatch_group)
>>> +    {
>>> +    case OLYMPUS_DISPATCH_NONE:
>>> +    case OLYMPUS_DISPATCH_I:
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_B:
>>> +      constraints.safe_push ({OLYMPUS_BRANCH_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_M:
>>> +      constraints.safe_push ({OLYMPUS_M_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_M0:
>>> +      constraints.safe_push ({OLYMPUS_M_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_M0_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_L:
>>> +      {
>>> +        auto type_attr = get_attr_type (insn);
>>> +        int l_slots = 1;
>>> +        if (type_attr == TYPE_NEON_LDP_Q
>>> +            || type_attr == TYPE_NEON_LOAD1_2REG_Q
>>> +            || type_attr == TYPE_NEON_LOAD1_3REG
>>> +            || type_attr == TYPE_NEON_LOAD1_4REG)
>>> +          l_slots = 2;
>>> +        else if (type_attr == TYPE_NEON_LOAD1_3REG_Q)
>>> +          l_slots = 3;
>>> +        else if (type_attr == TYPE_NEON_LOAD1_4REG_Q)
>>> +          l_slots = 4;
>>> +        constraints.safe_push ({OLYMPUS_L_SA_PIPE, l_slots});
>>> +        total_slots = l_slots;
>>> +      }
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_V:
>>> +      {
>>> +        auto type_attr = get_attr_type (insn);
>>> +        int v_slots = 1;
>>> +        if (type_attr == TYPE_NEON_TBL3
>>> +            || type_attr == TYPE_NEON_FP_REDUC_MINMAX_D
>>> +            || type_attr == TYPE_NEON_FP_REDUC_MINMAX_S
>>> +            || get_attr_sve_type (insn) == SVE_TYPE_SVE_FP_REDUC)
>>> +          v_slots = 2;
>>> +        else if (type_attr == TYPE_NEON_TBL4
>>> +                 || type_attr == TYPE_NEON_FP_REDUC_MINMAX_D_Q
>>> +                 || type_attr == TYPE_NEON_FP_REDUC_MINMAX_S_Q)
>>> +          v_slots = 3;
>>> +        constraints.safe_push ({OLYMPUS_V_PIPE, v_slots});
>>> +        total_slots = v_slots;
>>> +      }
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_V0:
>>> +      constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V0123_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V03_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V0_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_V03:
>>> +      constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V0123_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V03_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_V12:
>>> +      {
>>> +        auto sve_type_attr = get_attr_sve_type (insn);
>>> +        int slots = (sve_type_attr == SVE_TYPE_SVE_INT_BIT_PERM) ? 2 : 1;
>>> +        constraints.safe_push ({OLYMPUS_V_PIPE, slots});
>>> +        constraints.safe_push ({OLYMPUS_V0123_PIPE, slots});
>>> +        constraints.safe_push ({OLYMPUS_V12_PIPE, slots});
>>> +        total_slots = slots;
>>> +      }
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_V45:
>>> +      constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V45_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_V0123:
>>> +      constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V0123_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_M_V:
>>> +      constraints.safe_push ({OLYMPUS_M_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +      total_slots = 2;
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_L_V:
>>> +      {
>>> +        auto type_attr = get_attr_type (insn);
>>> +        auto sve_type_attr = get_attr_sve_type (insn);
>>> +        int l_sa_slots = 1;
>>> +        int v_slots = 1;
>>> +        if (type_attr == TYPE_NEON_LOAD2_2REG
>>> +            || type_attr == TYPE_NEON_LOAD2_ALL_LANES
>>> +            || type_attr == TYPE_NEON_LOAD2_ALL_LANES_Q
>>> +            || type_attr == TYPE_NEON_LOAD2_ONE_LANE)
>>> +          v_slots = 2;
>>> +        else if (type_attr == TYPE_NEON_LOAD2_2REG_Q
>>> +                 || sve_type_attr == SVE_TYPE_SVE_LOAD_2REG)
>>> +          {
>>> +            l_sa_slots = 2;
>>> +            v_slots = 2;
>>> +          }
>>> +        else if (type_attr == TYPE_NEON_LOAD3_3REG
>>> +                 || type_attr == TYPE_NEON_LOAD3_ALL_LANES
>>> +                 || type_attr == TYPE_NEON_LOAD3_ONE_LANE
>>> +                 || sve_type_attr == SVE_TYPE_SVE_LOAD_3REG)
>>> +          {
>>> +            l_sa_slots = 2;
>>> +            v_slots = 3;
>>> +          }
>>> +        else if (type_attr == TYPE_NEON_LOAD3_3REG_Q)
>>> +          {
>>> +            l_sa_slots = 3;
>>> +            v_slots = 3;
>>> +          }
>>> +        else if (type_attr == TYPE_NEON_LOAD4_4REG
>>> +                 || type_attr == TYPE_NEON_LOAD4_ALL_LANES
>>> +                 || type_attr == TYPE_NEON_LOAD4_ONE_LANE)
>>> +          {
>>> +            l_sa_slots = 2;
>>> +            v_slots = 4;
>>> +          }
>>> +        else if (type_attr == TYPE_NEON_LOAD4_4REG_Q
>>> +                 || sve_type_attr == SVE_TYPE_SVE_LOAD_4REG)
>>> +          {
>>> +            l_sa_slots = 4;
>>> +            v_slots = 6;
>>> +          }
>>> +        constraints.safe_push ({OLYMPUS_L_SA_PIPE, l_sa_slots});
>>> +        constraints.safe_push ({OLYMPUS_V_PIPE, v_slots});
>>> +        total_slots = l_sa_slots + v_slots;
>>> +      }
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_M_L:
>>> +      constraints.safe_push ({OLYMPUS_M_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_L_SA_PIPE, 1});
>>> +      total_slots = 2;
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_M_V0123:
>>> +      constraints.safe_push ({OLYMPUS_M_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +      constraints.safe_push ({OLYMPUS_V0123_PIPE, 1});
>>> +      total_slots = 2;
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_V_V0123:
>>> +      constraints.safe_push ({OLYMPUS_V_PIPE, 2});
>>> +      constraints.safe_push ({OLYMPUS_V0123_PIPE, 1});
>>> +      total_slots = 2;
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_L_V03:
>>> +      {
>>> +        auto sve_type_attr = get_attr_sve_type (insn);
>>> +        int l_slots = 1;
>>> +        if (sve_type_attr == SVE_TYPE_SVE_GATHERLOAD_32)
>>> +          l_slots = 4;
>>> +        else if (sve_type_attr == SVE_TYPE_SVE_GATHERLOAD_64)
>>> +          l_slots = 2;
>>> +        constraints.safe_push ({OLYMPUS_L_SA_PIPE, l_slots});
>>> +        constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +        constraints.safe_push ({OLYMPUS_V0123_PIPE, 1});
>>> +        constraints.safe_push ({OLYMPUS_V03_PIPE, 1});
>>> +        total_slots = l_slots + 1;
>>> +      }
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_SA_D:
>>> +      constraints.safe_push ({OLYMPUS_SA_PIPE, 1});
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_SA_V0123:
>>> +      {
>>> +        /* According to the note in section 4.1 of the SWOG, MOps using the
>>> +           V0123 pipeline do not count towards the limits, when those MOps
>>> +           are in the same instruction as a MOp in the SA pipeline.  That is
>>> +           why total_slots is set to the number of slots for the SA pipelines,
>>> +           disregarding the slots for the V0123 pipelines.  */
>>> +        auto type_attr = get_attr_type (insn);
>>> +        if (type_attr == TYPE_NEON_STORE1_3REG
>>> +            || type_attr == TYPE_NEON_STORE1_3REG_Q
>>> +            || type_attr == TYPE_NEON_STORE1_4REG
>>> +            || type_attr == TYPE_NEON_STORE1_4REG_Q)
>>> +          {
>>> +            constraints.safe_push ({OLYMPUS_SA_PIPE, 2});
>>> +            constraints.safe_push ({OLYMPUS_V_PIPE, 2});
>>> +            constraints.safe_push ({OLYMPUS_V0123_PIPE, 2});
>>> +            total_slots = 2;
>>> +          }
>>> +        else
>>> +          {
>>> +            constraints.safe_push ({OLYMPUS_SA_PIPE, 1});
>>> +            constraints.safe_push ({OLYMPUS_V_PIPE, 1});
>>> +            constraints.safe_push ({OLYMPUS_V0123_PIPE, 1});
>>> +            total_slots = 1;
>>> +          }
>>> +      }
>>> +      break;
>>> +
>>> +    case OLYMPUS_DISPATCH_SA_V_V0123:
>>> +      {
>>> +        auto type_attr = get_attr_type (insn);
>>> +        auto sve_type_attr = get_attr_sve_type (insn);
>>> +        int sa_slots = 1;
>>> +        int v_slots = 2;
>>> +        int v0123_slots = 1;
>>> +        if (type_attr == TYPE_NEON_STORE2_2REG_Q
>>> +            || type_attr == TYPE_NEON_STORE4_ONE_LANE
>>> +            || type_attr == TYPE_NEON_STORE4_ONE_LANE_Q)
>>> +          v_slots = 3;
>>> +        else if (type_attr == TYPE_NEON_STORE3_3REG
>>> +                 || type_attr == TYPE_NEON_STORE3_ONE_LANE
>>> +                 || type_attr == TYPE_NEON_STORE3_ONE_LANE_Q
>>> +                 || sve_type_attr == SVE_TYPE_SVE_STORE_2REG)
>>> +          {
>>> +            sa_slots = 2;
>>> +            v_slots = 4;
>>> +            v0123_slots = 2;
>>> +          }
>>> +        else if (type_attr == TYPE_NEON_STORE3_3REG_Q)
>>> +          {
>>> +            sa_slots = 2;
>>> +            v_slots = 5;
>>> +            v0123_slots = 2;
>>> +          }
>>> +        else if (type_attr == TYPE_NEON_STORE4_4REG)
>>> +          v_slots = 5;
>>> +        else if (type_attr == TYPE_NEON_STORE4_4REG_Q)
>>> +          {
>>> +            sa_slots = 2;
>>> +            v_slots = 6;
>>> +            v0123_slots = 2;
>>> +          }
>>> +        else if (sve_type_attr == SVE_TYPE_SVE_STORE_3REG)
>>> +          {
>>> +            sa_slots = 3;
>>> +            v_slots = 6;
>>> +            v0123_slots = 3;
>>> +          }
>>> +        else if (sve_type_attr == SVE_TYPE_SVE_STORE_4REG)
>>> +          {
>>> +            sa_slots = 4;
>>> +            v_slots = 6;
>>> +            v0123_slots = 4;
>>> +          }
>>> +        else if (sve_type_attr == SVE_TYPE_SVE_SCATTERSTORE_32)
>>> +          {
>>> +            sa_slots = 4;
>>> +            v_slots = 5;
>>> +            v0123_slots = 4;
>>> +          }
>>> +        else if (sve_type_attr == SVE_TYPE_SVE_SCATTERSTORE_64)
>>> +          {
>>> +            sa_slots = 2;
>>> +            v_slots = 4;
>>> +            v0123_slots = 3;
>>> +          }
>>> +        constraints.safe_push ({OLYMPUS_SA_PIPE, sa_slots});
>>> +        constraints.safe_push ({OLYMPUS_V_PIPE, v_slots});
>>> +        constraints.safe_push ({OLYMPUS_V0123_PIPE, v0123_slots});
>>> +        /* We disregard slots from the V0123 pipelines in total_slots when
>>> +           the instruction also uses the SA pipelines, see comment in
>>> +           OLYMPUS_DISPATCH_SA_V0123.  */
>>> +        total_slots = sa_slots + (v_slots - v0123_slots);
>>> +      }
>>> +      break;
>>> +    }
>>> +
>>> +  /* Add total slots constraint  */
>>> +  constraints.safe_push ({OLYMPUS_TOTAL_SLOTS, total_slots});
>>> +
>>> +  return constraints;
>>> +}
>>> +
>>> +/* Olympus dispatch constraints configuration.  */
>>> +static const struct dispatch_constraint_info olympus_dispatch_constraint_info = {
>>> +  olympus_dispatch_max_slots,               /* max_slots */
>>> +  ARRAY_SIZE (olympus_dispatch_max_slots),  /* num_constraints */
>>> +  olympus_dispatch_constraint_callback      /* callback */
>>> +};
>>> +
>>>  static struct tune_params olympus_tunings =
>>>  {
>>>    &cortexa76_extra_costs,
>>> @@ -201,11 +559,12 @@ static struct tune_params olympus_tunings =
>>>    (AARCH64_EXTRA_TUNE_BASE
>>>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>> -   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),  /* tune_flags.  */
>>> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>> +   | AARCH64_EXTRA_TUNE_DISPATCH_SCHED),  /* tune_flags.  */
>>>    &olympus_prefetch_tune,
>>>    AARCH64_LDP_STP_POLICY_ALWAYS,  /* ldp_policy_model.  */
>>>    AARCH64_LDP_STP_POLICY_ALWAYS,  /* stp_policy_model.  */
>>> -  nullptr  /* dispatch_constraints.  */
>>> +  &olympus_dispatch_constraint_info  /* dispatch_constraints.  */
>>>  };
>>>
>>>  #endif /* GCC_AARCH64_H_OLYMPUS.  */
>>> --
>>> 2.34.1
>>
>>
>
