[Bug target/87832] AMD pipeline models are very costly size-wise
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #13 from CVS Commits ---

The master branch has been updated by Alexander Monakov:

https://gcc.gnu.org/g:ec1db9017939bb8289c9bd63aace66c0f3957ecd

commit r13-4956-gec1db9017939bb8289c9bd63aace66c0f3957ecd
Author: Alexander Monakov
Date:   Fri Dec 9 20:47:55 2022 +0300

    i386: correct division modeling in lujiazui.md

    Model the divider in Lujiazui processors as a separate automaton to
    significantly reduce the overall model size. This should also result
    in improved accuracy, as pipe 0 should be able to accept new
    instructions while the divider is occupied.

    It is unclear why integer divisions are modeled as if pipes 0-3 are
    all occupied. I've opted to keep a single-cycle reservation of all
    four pipes together, so GCC should continue trying to pack
    instructions around a division accordingly.

    Currently the top three symbols in insn-automata.o are:

    106102 r lujiazui_core_check
    106102 r lujiazui_core_transitions
    196123 r lujiazui_core_min_issue_delay

    This patch shrinks all lujiazui tables to:

       3 r lujiazui_decoder_min_issue_delay
      20 r lujiazui_decoder_transitions
      32 r lujiazui_agu_min_issue_delay
     126 r lujiazui_agu_transitions
     304 r lujiazui_div_base
     352 r lujiazui_div_check
     352 r lujiazui_div_transitions
    1152 r lujiazui_core_min_issue_delay
    1592 r lujiazui_agu_translate
    1592 r lujiazui_core_translate
    1592 r lujiazui_decoder_translate
    1592 r lujiazui_div_translate
    3952 r lujiazui_div_min_issue_delay
    9216 r lujiazui_core_transitions

    This continues the work on reducing i386 insn-automata.o size,
    started with similar fixes for division and multiplication
    instructions in znver.md.

    gcc/ChangeLog:

            PR target/87832
            * config/i386/lujiazui.md (lujiazui_div): New automaton.
            (lua_div): New unit.
            (lua_idiv_qi): Correct unit in the reservation.
            (lua_idiv_qi_load): Ditto.
            (lua_idiv_hi): Ditto.
            (lua_idiv_hi_load): Ditto.
            (lua_idiv_si): Ditto.
            (lua_idiv_si_load): Ditto.
            (lua_idiv_di): Ditto.
            (lua_idiv_di_load): Ditto.
            (lua_fdiv_SF): Ditto.
            (lua_fdiv_SF_load): Ditto.
            (lua_fdiv_DF): Ditto.
            (lua_fdiv_DF_load): Ditto.
            (lua_fdiv_XF): Ditto.
            (lua_fdiv_XF_load): Ditto.
            (lua_ssediv_SF): Ditto.
            (lua_ssediv_load_SF): Ditto.
            (lua_ssediv_V4SF): Ditto.
            (lua_ssediv_load_V4SF): Ditto.
            (lua_ssediv_V8SF): Ditto.
            (lua_ssediv_load_V8SF): Ditto.
            (lua_ssediv_SD): Ditto.
            (lua_ssediv_load_SD): Ditto.
            (lua_ssediv_V2DF): Ditto.
            (lua_ssediv_load_V2DF): Ditto.
            (lua_ssediv_V4DF): Ditto.
            (lua_ssediv_load_V4DF): Ditto.
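The separate-automaton approach described in this commit can be sketched in GCC machine-description syntax. A minimal sketch, assuming several details: the automaton and unit names (lujiazui_div, lua_div) come from the ChangeLog above, but the pipe unit names (lua_p0 through lua_p3) and all cycle counts below are illustrative placeholders, not the committed code.

```lisp
;; Dedicated automaton for the divider, so its long reservations no
;; longer multiply the state count of the main core automaton.
(define_automaton "lujiazui_div")

;; Single (non-pipelined) divider unit living in the new automaton.
(define_cpu_unit "lua_div" "lujiazui_div")

;; Illustrative reservation: occupy all four pipes together for one
;; cycle at issue (the single-cycle reservation the commit message
;; describes keeping), then hold only the divider afterwards.
(define_insn_reservation "lua_idiv_si" 21
  (and (eq_attr "cpu" "lujiazui")
       (and (eq_attr "type" "idiv")
            (and (eq_attr "mode" "SI")
                 (eq_attr "memory" "none"))))
  "lua_p0+lua_p1+lua_p2+lua_p3,lua_div*20")
```

Because lua_div lives in its own automaton, the generator builds a small separate table for the divider instead of folding its multi-cycle occupancy into every state of the core pipeline automaton, which is where the table blowup came from.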
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #12 from Martin Liška ---

Nice work, Alexander!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #11 from Alexander Monakov ---

Factoring out the Lujiazui divider shrinks its tables by almost 20x:

   3 r lujiazui_decoder_min_issue_delay
  20 r lujiazui_decoder_transitions
  32 r lujiazui_agu_min_issue_delay
 126 r lujiazui_agu_transitions
 304 r lujiazui_div_base
 352 r lujiazui_div_check
 352 r lujiazui_div_transitions
1152 r lujiazui_core_min_issue_delay
1592 r lujiazui_agu_translate
1592 r lujiazui_core_translate
1592 r lujiazui_decoder_translate
1592 r lujiazui_div_translate
3952 r lujiazui_div_min_issue_delay
9216 r lujiazui_core_transitions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #10 from Alexander Monakov ---

(In reply to Jan Hubicka from comment #9)
> Actually for older cores I think the manufacturers do not care much. I
> still have a working Bulldozer machine and I can do some testing.
> I think in the Bulldozer case I was basing the latency/throughput on data
> in Agner Fog's manuals.

Ahhh, how could I forget that his manuals have data for those cores too.
Thanks for the reminder! This solves the conundrum nicely:

- AMD Jaguar ('btver2' in GCC): int/fp division is not pipelined; separate
  int/fp dividers.
- AMD Bulldozer, Steamroller ('bdver1', 'bdver3'): int division is not
  pipelined (one divider); fp division is slightly pipelined (two
  independent dividers).
- Zhaoxin Lujiazui appears to use the same divider as the VIA Nano 3000,
  which is not pipelined.

So that's already enough to produce a decent patch.

> How do you test it?

For the AMD Zen patches I was using measurements by Andreas Abel
(https://uops.info/table_overview.html) and running a few experiments myself
by coding loops in NASM and timing them with 'perf stat' on a Zen 2 CPU.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #9 from Jan Hubicka ---

> Do you mean we should fix modeling of divisions there as well? I don't
> have latency/throughput measurements for those CPUs, nor access, so I
> can't run experiments myself, unfortunately.
>
> I guess you mean just making a patch to model the division units
> separately, leaving latency/throughput as in the current incorrect
> models, and leaving it to the manufacturers to correct? Alternatively,
> for AMD Bobcat and Bulldozer we might be able to crowd-source it
> eventually.

Actually, for older cores I think the manufacturers do not care much. I
still have a working Bulldozer machine and I can do some testing.

I think in the Bulldozer case I was basing the latency/throughput on data in
Agner Fog's manuals.

How do you test it?

Honza
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #8 from Alexander Monakov ---

(In reply to Jan Hubicka from comment #7)
> >  53730 r btver2_fp_min_issue_delay
> >  53760 r znver1_fp_transitions
> >  93960 r bdver3_fp_transitions
> > 106102 r lujiazui_core_check
> > 106102 r lujiazui_core_transitions
> > 196123 r lujiazui_core_min_issue_delay
> >
> > What shall we do with similar blowups in lujiazui and b[dt]ver[123]
> > models?
> Yes, I think that makes sense...

Do you mean we should fix modeling of divisions there as well? I don't have
latency/throughput measurements for those CPUs, nor access, so I can't run
experiments myself, unfortunately.

I guess you mean just making a patch to model the division units separately,
leaving latency/throughput as in the current incorrect models, and leaving
it to the manufacturers to correct? Alternatively, for AMD Bobcat and
Bulldozer we might be able to crowd-source it eventually.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #7 from Jan Hubicka ---

>  53730 r btver2_fp_min_issue_delay
>  53760 r znver1_fp_transitions
>  93960 r bdver3_fp_transitions
> 106102 r lujiazui_core_check
> 106102 r lujiazui_core_transitions
> 196123 r lujiazui_core_min_issue_delay
>
> What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?

Yes, I think that makes sense...

Honza
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #6 from Alexander Monakov ---

With these patches on trunk, the current situation is:

nm -CS -t d --defined-only gcc/insn-automata.o | sed 's/^[0-9]* 0*//' | sort -n | tail -40

  2496 r slm_base
  2527 r bdver3_load_min_issue_delay
  2746 r glm_base
  3892 r bdver1_fp_base
       r bdver1_ieu_min_issue_delay
  4492 r geode_base
  4608 r bdver3_ieu_transitions
  6402 r bdver1_load_transitions
  6720 r znver1_fp_min_issue_delay
  7862 r athlon_fp_check
  7862 r athlon_fp_transitions
  9122 r lujiazui_core_base
  9997 t internal_insn_latency(int, int, rtx_insn*, rtx_insn*)
 10108 r bdver3_load_transitions
 10498 r geode_check
 10498 r geode_transitions
 11632 r print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
 12575 r athlon_fp_min_issue_delay
 12742 r btver2_fp_check
 12742 r btver2_fp_transitions
 13896 r slm_check
 13896 r slm_transitions
 17149 t internal_min_issue_delay(int, DFA_chip*)
 17349 t internal_state_transition(int, DFA_chip*)
 17776 r bdver1_ieu_transitions
 20068 r bdver1_fp_check
 20068 r bdver1_fp_transitions
 26208 r slm_min_issue_delay
 27244 r bdver1_fp_min_issue_delay
 28518 r glm_check
 28518 r glm_transitions
 33690 r geode_min_issue_delay
 46980 r bdver3_fp_min_issue_delay
 49428 r glm_min_issue_delay
 53730 r btver2_fp_min_issue_delay
 53760 r znver1_fp_transitions
 93960 r bdver3_fp_transitions
106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
196123 r lujiazui_core_min_issue_delay

What shall we do with similar blowups in the lujiazui and b[dt]ver[123]
models?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #5 from CVS Commits ---

The master branch has been updated by Alexander Monakov:

https://gcc.gnu.org/g:d4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f

commit r13-4093-gd4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f
Author: Alexander Monakov
Date:   Tue Nov 1 17:53:13 2022 +0300

    i386: correct x87 multiplication modeling in znver.md

    All multiplication instructions are fully pipelined, except AVX256
    instructions on Zen 1, which issue over two cycles on a 128-bit unit.
    Correct the model accordingly to reduce the combinatorial explosion
    in the automaton tables.

    Top znver table sizes in insn-automata.o:

    Before:

     30056 r znver1_fp_min_issue_delay
    120224 r znver1_fp_transitions

    After:

      6720 r znver1_fp_min_issue_delay
     53760 r znver1_fp_transitions

    gcc/ChangeLog:

            PR target/87832
            * config/i386/znver.md (znver1_fp_op_mul): Correct cycles in
            the reservation.
            (znver1_fp_op_mul_load): Ditto.
            (znver1_mmx_mul): Ditto.
            (znver1_mmx_load): Ditto.
            (znver1_ssemul_ss_ps): Ditto.
            (znver1_ssemul_ss_ps_load): Ditto.
            (znver1_ssemul_avx256_ps): Ditto.
            (znver1_ssemul_avx256_ps_load): Ditto.
            (znver1_ssemul_sd_pd): Ditto.
            (znver1_ssemul_sd_pd_load): Ditto.
            (znver2_ssemul_sd_pd): Ditto.
            (znver2_ssemul_sd_pd_load): Ditto.
            (znver1_ssemul_avx256_pd): Ditto.
            (znver1_ssemul_avx256_pd_load): Ditto.
            (znver1_sseimul): Ditto.
            (znver1_sseimul_avx256): Ditto.
            (znver1_sseimul_load): Ditto.
            (znver1_sseimul_avx256_load): Ditto.
            (znver1_sseimul_di): Ditto.
            (znver1_sseimul_load_di): Ditto.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #4 from CVS Commits ---

The master branch has been updated by Alexander Monakov:

https://gcc.gnu.org/g:dd744f06c9952f92738b0860630085f0f0b99574

commit r13-4092-gdd744f06c9952f92738b0860630085f0f0b99574
Author: Alexander Monakov
Date:   Tue Nov 1 17:04:25 2022 +0300

    i386: correct x87 division modeling in znver.md

    Correct the modeling of division instructions in the SIMD/FP domain
    for AMD Zen architectures, and avoid a combinatorial explosion of the
    automaton tables by modeling the separate floating-point division
    unit and correcting reservations to reflect the reciprocal throughput
    of the corresponding instructions, similar to the earlier commit
    5cee5f94000 ("i386: correct integer division modeling in znver.md").

    Division is partially pipelined and some instructions have fractional
    throughput (e.g. Zen 3 can issue divss and divsd every 3.5 and 4.5
    cycles on average, respectively). Considering that these CPUs
    implement out-of-order execution, the model doesn't need to be exact
    to the last cycle, so simplify it by using 4/5 cycles for SF/DF
    modes, and by not modeling the fact that the FP3 pipe is occupied for
    one cycle.

    Top znver table sizes in insn-automata.o:

    Before:

    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions

    After:

     30056 r znver1_fp_min_issue_delay
    120224 r znver1_fp_transitions

    gcc/ChangeLog:

            PR target/87832
            * config/i386/znver.md (znver1_fdiv): New automaton.
            (znver1-fdiv): New unit.
            (znver1_fp_op_div): Correct unit and cycles in the reservation.
            (znver1_fp_op_div_load): Ditto.
            (znver1_fp_op_idiv_load): Ditto.
            (znver2_fp_op_idiv_load): Ditto.
            (znver1_ssediv_ss_ps): Ditto.
            (znver1_ssediv_ss_ps_load): Ditto.
            (znver1_ssediv_sd_pd): Ditto.
            (znver1_ssediv_sd_pd_load): Ditto.
            (znver1_ssediv_avx256_ps): Ditto.
            (znver1_ssediv_avx256_ps_load): Ditto.
            (znver1_ssediv_avx256_pd): Ditto.
            (znver1_ssediv_avx256_pd_load): Ditto.
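The floating-point divider separation this commit describes can be sketched along the following lines. The automaton and unit names (znver1_fdiv, znver1-fdiv) are from the ChangeLog above; the reservation shown, its latency, the issue-pipe name, and the divider occupancy are illustrative assumptions based on the "4 cycles for SF" simplification stated in the commit message, not the committed code.

```lisp
;; Separate floating-point divide automaton, mirroring the integer
;; divider fix from the earlier commit 5cee5f94000.
(define_automaton "znver1_fdiv")
(define_cpu_unit "znver1-fdiv" "znver1_fdiv")

;; Partially pipelined scalar/packed SF divide: the divider is held
;; for roughly the reciprocal throughput (4 cycles for SF in the
;; simplified model), not for the full instruction latency.
(define_insn_reservation "znver1_ssediv_ss_ps" 10
  (and (eq_attr "cpu" "znver1")
       (and (eq_attr "type" "ssediv")
            (and (eq_attr "mode" "SF,V4SF")
                 (eq_attr "memory" "none"))))
  "znver1-direct,znver1-fdiv*4")
```

The key point of the design is that a non-fully-pipelined unit with a long occupancy is cheap to track in its own small automaton, while the same occupancy inside the shared FP automaton multiplies into every reachable state.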
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #3 from Alexander Monakov ---

Followup patches have been posted at
https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amona...@ispras.ru/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #2 from CVS Commits ---

The master branch has been updated by Alexander Monakov:

https://gcc.gnu.org/g:5cee5f94000ee5eabce9b223c44c7923c1c69f61

commit r13-3589-g5cee5f94000ee5eabce9b223c44c7923c1c69f61
Author: Alexander Monakov
Date:   Mon Oct 31 17:35:57 2022 +0300

    i386: correct integer division modeling in znver.md

    In znver.md, division instructions have descriptions like

    (define_insn_reservation "znver1_idiv_DI" 41
                             (and (eq_attr "cpu" "znver1,znver2")
                                  (and (eq_attr "type" "idiv")
                                       (and (eq_attr "mode" "DI")
                                            (eq_attr "memory" "none"))))
                             "znver1-double,znver1-ieu2*41")

    which says that DImode idiv has latency 41 (which is correct) and
    that it occupies the 2nd integer execution unit for 41 consecutive
    cycles, but that is not correct:

    1) the division instruction is partially pipelined, and has
       throughput 1/14, not 1/41;

    2) for the most part it occupies a separate division unit, not the
       general arithmetic unit.

    Evidently, the interaction of such 41-cycle paths with the rest of
    the reservations causes a combinatorial explosion in the automaton.

    Fix this by modeling the integer division unit properly, and by
    correcting reservations to use the measured reciprocal throughput of
    those instructions (available from uops.info). A similar correction
    for floating-point divisions is left for a followup patch.

    Top 5 znver table sizes, before:

     68692 r znver1_ieu_check
     68692 r znver1_ieu_transitions
     99792 r znver1_ieu_min_issue_delay
    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions

    After:

      1454 r znver1_ieu_translate
      1454 r znver1_translate
      2304 r znver1_ieu_transitions
    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions

    gcc/ChangeLog:

            PR target/87832
            * config/i386/znver.md (znver1_idiv): New automaton.
            (znver1-idiv): New unit.
            (znver1_idiv_DI): Correct unit and cycles in the reservation.
            (znver1_idiv_SI): Ditto.
            (znver1_idiv_HI): Ditto.
            (znver1_idiv_QI): Ditto.
            (znver1_idiv_mem_DI): Ditto.
            (znver1_idiv_mem_SI): Ditto.
            (znver1_idiv_mem_HI): Ditto.
            (znver1_idiv_mem_QI): Ditto.
            (znver3_idiv_DI): Ditto.
            (znver3_idiv_SI): Ditto.
            (znver3_idiv_HI): Ditto.
            (znver3_idiv_QI): Ditto.
            (znver3_idiv_mem_DI): Ditto.
            (znver3_idiv_mem_SI): Ditto.
            (znver3_idiv_mem_HI): Ditto.
            (znver3_idiv_mem_QI): Ditto.
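The fix this commit describes can be sketched as follows. The automaton and unit names (znver1_idiv, znver1-idiv) are from the ChangeLog; the latency (41) and the 14-cycle divider occupancy come from the commit message's latency and 1/14 throughput figures, but the exact reservation string shown is an illustrative assumption, not the committed code.

```lisp
;; Separate automaton for the integer divider: the long divider
;; occupancy no longer blows up the state count of the IEU automaton.
(define_automaton "znver1_idiv")
(define_cpu_unit "znver1-idiv" "znver1_idiv")

;; Latency stays 41, but the divider is reserved only for the measured
;; reciprocal throughput (14 cycles), and the general arithmetic unit
;; (ieu2) is occupied only briefly at issue.
(define_insn_reservation "znver1_idiv_DI" 41
  (and (eq_attr "cpu" "znver1,znver2")
       (and (eq_attr "type" "idiv")
            (and (eq_attr "mode" "DI")
                 (eq_attr "memory" "none"))))
  "znver1-double,znver1-ieu2,znver1-idiv*14")
```

Compared with the old "znver1-ieu2*41" reservation, this both fixes the throughput model (a new division can start every 14 cycles, not every 41) and stops the scheduler's DFA generator from threading a 41-cycle busy pattern through the shared integer-unit automaton.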
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #1 from Alexander Monakov ---

Suggested partial fix for the integer-pipe side of the blowup:
https://inbox.sourceware.org/gcc-patches/4549f27b-238a-7d77-f72b-cc77df8ae...@ispras.ru/