[Bug target/87832] AMD pipeline models are very costly size-wise

2023-01-02 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #13 from CVS Commits  ---
The master branch has been updated by Alexander Monakov :

https://gcc.gnu.org/g:ec1db9017939bb8289c9bd63aace66c0f3957ecd

commit r13-4956-gec1db9017939bb8289c9bd63aace66c0f3957ecd
Author: Alexander Monakov 
Date:   Fri Dec 9 20:47:55 2022 +0300

i386: correct division modeling in lujiazui.md

Model the divider in Lujiazui processors as a separate automaton to
significantly reduce the overall model size. This should also result
in improved accuracy, as pipe 0 should be able to accept new
instructions while the divider is occupied.

It is unclear why integer divisions are modeled as if pipes 0-3 are all
occupied. I've opted to keep a single-cycle reservation of all four
pipes together, so GCC should continue trying to pack instructions
around a division accordingly.
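
As a rough illustration (a sketch, not the committed hunk): only the
lujiazui_div automaton and the lua_div unit below come from the ChangeLog;
the lua_p0 issue pipe, the attribute tests and all cycle counts are
illustrative placeholders.

;; Model the divider as its own automaton so that its long occupancy no
;; longer multiplies states in the core automaton.
(define_automaton "lujiazui_div")
(define_cpu_unit "lua_div" "lujiazui_div")

;; Issue through a core pipe for one cycle, then keep only the divider
;; busy for roughly the reciprocal throughput of the instruction, so
;; pipe 0 can accept new work while the division completes.
(define_insn_reservation "lua_fdiv_SF_sketch" 15
  (and (eq_attr "cpu" "lujiazui")
       (and (eq_attr "type" "fdiv")
            (and (eq_attr "mode" "SF")
                 (eq_attr "memory" "none"))))
  "lua_p0,lua_div*14")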

Currently the top three symbols in insn-automata.o are:

106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
196123 r lujiazui_core_min_issue_delay

This patch shrinks all lujiazui tables to:

3 r lujiazui_decoder_min_issue_delay
20 r lujiazui_decoder_transitions
32 r lujiazui_agu_min_issue_delay
126 r lujiazui_agu_transitions
304 r lujiazui_div_base
352 r lujiazui_div_check
352 r lujiazui_div_transitions
1152 r lujiazui_core_min_issue_delay
1592 r lujiazui_agu_translate
1592 r lujiazui_core_translate
1592 r lujiazui_decoder_translate
1592 r lujiazui_div_translate
3952 r lujiazui_div_min_issue_delay
9216 r lujiazui_core_transitions

This continues the work on reducing i386 insn-automata.o size started
with similar fixes for division and multiplication instructions in
znver.md.

gcc/ChangeLog:

PR target/87832
* config/i386/lujiazui.md (lujiazui_div): New automaton.
(lua_div): New unit.
(lua_idiv_qi): Correct unit in the reservation.
(lua_idiv_qi_load): Ditto.
(lua_idiv_hi): Ditto.
(lua_idiv_hi_load): Ditto.
(lua_idiv_si): Ditto.
(lua_idiv_si_load): Ditto.
(lua_idiv_di): Ditto.
(lua_idiv_di_load): Ditto.
(lua_fdiv_SF): Ditto.
(lua_fdiv_SF_load): Ditto.
(lua_fdiv_DF): Ditto.
(lua_fdiv_DF_load): Ditto.
(lua_fdiv_XF): Ditto.
(lua_fdiv_XF_load): Ditto.
(lua_ssediv_SF): Ditto.
(lua_ssediv_load_SF): Ditto.
(lua_ssediv_V4SF): Ditto.
(lua_ssediv_load_V4SF): Ditto.
(lua_ssediv_V8SF): Ditto.
(lua_ssediv_load_V8SF): Ditto.
(lua_ssediv_SD): Ditto.
(lua_ssediv_load_SD): Ditto.
(lua_ssediv_V2DF): Ditto.
(lua_ssediv_load_V2DF): Ditto.
(lua_ssediv_V4DF): Ditto.
(lua_ssediv_load_V4DF): Ditto.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-12-08 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #12 from Martin Liška  ---
Nice work Alexander!

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-12-07 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #11 from Alexander Monakov  ---
Factoring out the Lujiazui divider shrinks its tables by almost 20x:

3 r lujiazui_decoder_min_issue_delay
20 r lujiazui_decoder_transitions
32 r lujiazui_agu_min_issue_delay
126 r lujiazui_agu_transitions
304 r lujiazui_div_base
352 r lujiazui_div_check
352 r lujiazui_div_transitions
1152 r lujiazui_core_min_issue_delay
1592 r lujiazui_agu_translate
1592 r lujiazui_core_translate
1592 r lujiazui_decoder_translate
1592 r lujiazui_div_translate
3952 r lujiazui_div_min_issue_delay
9216 r lujiazui_core_transitions

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #10 from Alexander Monakov  ---
(In reply to Jan Hubicka from comment #9)
> Actually for older cores I think the manufacturers do not care much.  I
> still have a working Bulldozer machine and I can do some testing.
> I think in the Bulldozer case I was basing the latency/throughput on data
> in Agner Fog's manuals.

Ahhh, how could I forget that his manuals have data for those cores too. Thanks
for the reminder! This solves the conundrum nicely:

AMD Jaguar ('btver2' in GCC): int/fp division is not pipelined, separate int/fp
dividers;

AMD Bulldozer, Steamroller ('bdver1', 'bdver3'): int division is not pipelined
(one divider), fp division is slightly pipelined (two independent dividers);

Zhaoxin Lujiazui appears to use the same divider as VIA Nano 3000, which is not
pipelined.

So it's already enough to produce a decent patch.

> How do you test it?

For AMD Zen patches I was using measurements by Andreas Abel (
https://uops.info/table_overview.html ) and running a few experiments myself by
coding loops in NASM and timing them with 'perf stat' on a Zen 2 CPU.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #9 from Jan Hubicka  ---
> 
> Do you mean we should fix the modeling of divisions there as well? I don't
> have latency/throughput measurements for those CPUs, nor access to run
> experiments myself, unfortunately.
> 
> I guess you mean just making a patch to model the division units separately,
> leaving latency/throughput as in the current incorrect models, and leaving it
> to the manufacturers to correct? Alternatively, for AMD Bobcat and Bulldozer
> we might be able to crowd-source it eventually.
Actually for older cores I think the manufacturers do not care much.  I
still have a working Bulldozer machine and I can do some testing.
I think in the Bulldozer case I was basing the latency/throughput on data in
Agner Fog's manuals.  How do you test it?
Honza

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #8 from Alexander Monakov  ---
(In reply to Jan Hubicka from comment #7)
> > 53730 r btver2_fp_min_issue_delay
> > 53760 r znver1_fp_transitions
> > 93960 r bdver3_fp_transitions
> > 106102 r lujiazui_core_check
> > 106102 r lujiazui_core_transitions
> > 196123 r lujiazui_core_min_issue_delay
> > 
> > What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
> Yes, I think that makes sense...

Do you mean we should fix the modeling of divisions there as well? I don't
have latency/throughput measurements for those CPUs, nor access to run
experiments myself, unfortunately.

I guess you mean just making a patch to model the division units separately,
leaving latency/throughput as in the current incorrect models, and leaving it
to the manufacturers to correct? Alternatively, for AMD Bobcat and Bulldozer
we might be able to crowd-source it eventually.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #7 from Jan Hubicka  ---
> 53730 r btver2_fp_min_issue_delay
> 53760 r znver1_fp_transitions
> 93960 r bdver3_fp_transitions
> 106102 r lujiazui_core_check
> 106102 r lujiazui_core_transitions
> 196123 r lujiazui_core_min_issue_delay
> 
> What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
Yes, I think that makes sense...

Honza

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #6 from Alexander Monakov  ---
With these patches on trunk, the current situation is:

nm -CS -t d --defined-only gcc/insn-automata.o | sed 's/^[0-9]* 0*//' | sort -n | tail -40
2496 r slm_base
2527 r bdver3_load_min_issue_delay
2746 r glm_base
3892 r bdver1_fp_base
 r bdver1_ieu_min_issue_delay
4492 r geode_base
4608 r bdver3_ieu_transitions
6402 r bdver1_load_transitions
6720 r znver1_fp_min_issue_delay
7862 r athlon_fp_check
7862 r athlon_fp_transitions
9122 r lujiazui_core_base
9997 t internal_insn_latency(int, int, rtx_insn*, rtx_insn*)
10108 r bdver3_load_transitions
10498 r geode_check
10498 r geode_transitions
11632 r print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
12575 r athlon_fp_min_issue_delay
12742 r btver2_fp_check
12742 r btver2_fp_transitions
13896 r slm_check
13896 r slm_transitions
17149 t internal_min_issue_delay(int, DFA_chip*)
17349 t internal_state_transition(int, DFA_chip*)
17776 r bdver1_ieu_transitions
20068 r bdver1_fp_check
20068 r bdver1_fp_transitions
26208 r slm_min_issue_delay
27244 r bdver1_fp_min_issue_delay
28518 r glm_check
28518 r glm_transitions
33690 r geode_min_issue_delay
46980 r bdver3_fp_min_issue_delay
49428 r glm_min_issue_delay
53730 r btver2_fp_min_issue_delay
53760 r znver1_fp_transitions
93960 r bdver3_fp_transitions
106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
196123 r lujiazui_core_min_issue_delay

What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #5 from CVS Commits  ---
The master branch has been updated by Alexander Monakov :

https://gcc.gnu.org/g:d4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f

commit r13-4093-gd4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f
Author: Alexander Monakov 
Date:   Tue Nov 1 17:53:13 2022 +0300

i386: correct x87 multiplication modeling in znver.md

All multiplication instructions are fully pipelined, except AVX256
instructions on Zen 1, which issue over two cycles on a 128-bit unit.
Correct the model accordingly to reduce combinatorial explosion in
automaton tables.
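
In DFA terms the distinction looks roughly like this (a sketch rather than
the committed reservations; the znver1-fp0/znver1-fp1 pipe names, latencies
and attribute tests are illustrative assumptions):

;; Fully pipelined multiply: the issue pipe is reserved for a single
;; cycle, so another multiply can start on the next cycle.
(define_insn_reservation "sketch_mul_128b" 3
  (and (eq_attr "cpu" "znver1")
       (and (eq_attr "type" "ssemul")
            (eq_attr "mode" "V4SF")))
  "znver1-fp0|znver1-fp1")

;; Zen 1 issues an AVX256 multiply over two cycles on a 128-bit unit,
;; so the chosen pipe stays busy for two consecutive cycles.
(define_insn_reservation "sketch_mul_256b" 3
  (and (eq_attr "cpu" "znver1")
       (and (eq_attr "type" "ssemul")
            (eq_attr "mode" "V8SF")))
  "(znver1-fp0|znver1-fp1)*2")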

Top znver table sizes in insn-automata.o:

Before:

30056 r znver1_fp_min_issue_delay
120224 r znver1_fp_transitions

After:

6720 r znver1_fp_min_issue_delay
53760 r znver1_fp_transitions

gcc/ChangeLog:

PR target/87832
* config/i386/znver.md (znver1_fp_op_mul): Correct cycles in
the reservation.
(znver1_fp_op_mul_load): Ditto.
(znver1_mmx_mul): Ditto.
(znver1_mmx_load): Ditto.
(znver1_ssemul_ss_ps): Ditto.
(znver1_ssemul_ss_ps_load): Ditto.
(znver1_ssemul_avx256_ps): Ditto.
(znver1_ssemul_avx256_ps_load): Ditto.
(znver1_ssemul_sd_pd): Ditto.
(znver1_ssemul_sd_pd_load): Ditto.
(znver2_ssemul_sd_pd): Ditto.
(znver2_ssemul_sd_pd_load): Ditto.
(znver1_ssemul_avx256_pd): Ditto.
(znver1_ssemul_avx256_pd_load): Ditto.
(znver1_sseimul): Ditto.
(znver1_sseimul_avx256): Ditto.
(znver1_sseimul_load): Ditto.
(znver1_sseimul_avx256_load): Ditto.
(znver1_sseimul_di): Ditto.
(znver1_sseimul_load_di): Ditto.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-16 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #4 from CVS Commits  ---
The master branch has been updated by Alexander Monakov :

https://gcc.gnu.org/g:dd744f06c9952f92738b0860630085f0f0b99574

commit r13-4092-gdd744f06c9952f92738b0860630085f0f0b99574
Author: Alexander Monakov 
Date:   Tue Nov 1 17:04:25 2022 +0300

i386: correct x87 division modeling in znver.md

Correct the modeling of division instructions in the SIMD/FP domain
for AMD Zen architectures and avoid a combinatorial explosion of the
automaton tables by modeling the separate floating-point division unit
and correcting the reservations to reflect the reciprocal throughput of
the corresponding instructions, similar to the earlier commit
5cee5f94000 ("i386: correct integer division modeling in znver.md").

Division is partially pipelined and some instructions have fractional
throughput (e.g. Zen 3 can issue divss and divsd once every 3.5 and 4.5
cycles on average, respectively). Since these CPUs implement
out-of-order execution, the model doesn't need to be exact to the last
cycle, so simplify it by using 4/5 cycles for the SF/DF modes and by not
modeling the fact that the FP3 pipe is occupied for one cycle.
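
In reservation terms the simplification amounts to roughly the following
(a sketch, not the committed hunk; the znver1_fdiv automaton and the
znver1-fdiv unit match the ChangeLog below, while the latencies and
attribute tests are placeholders):

(define_automaton "znver1_fdiv")
(define_cpu_unit "znver1-fdiv" "znver1_fdiv")

;; divss: the fractional measured throughput is approximated by keeping
;; the divider busy for 4 cycles; the single FP3 issue cycle is not
;; modeled.
(define_insn_reservation "sketch_ssediv_ss" 10
  (and (eq_attr "cpu" "znver1")
       (and (eq_attr "type" "ssediv")
            (eq_attr "mode" "SF")))
  "znver1-fdiv*4")

;; divsd: likewise, approximated by a 5-cycle divider occupancy.
(define_insn_reservation "sketch_ssediv_sd" 13
  (and (eq_attr "cpu" "znver1")
       (and (eq_attr "type" "ssediv")
            (eq_attr "mode" "DF")))
  "znver1-fdiv*5")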

Top znver table sizes in insn-automata.o:

Before:

428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

After:

30056 r znver1_fp_min_issue_delay
120224 r znver1_fp_transitions

gcc/ChangeLog:

PR target/87832
* config/i386/znver.md (znver1_fdiv): New automaton.
(znver1-fdiv): New unit.
(znver1_fp_op_div): Correct unit and cycles in the reservation.
(znver1_fp_op_div_load): Ditto.
(znver1_fp_op_idiv_load): Ditto.
(znver2_fp_op_idiv_load): Ditto.
(znver1_ssediv_ss_ps): Ditto.
(znver1_ssediv_ss_ps_load): Ditto.
(znver1_ssediv_sd_pd): Ditto.
(znver1_ssediv_sd_pd_load): Ditto.
(znver1_ssediv_avx256_ps): Ditto.
(znver1_ssediv_avx256_ps_load): Ditto.
(znver1_ssediv_avx256_pd): Ditto.
(znver1_ssediv_avx256_pd_load): Ditto.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-07 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #3 from Alexander Monakov  ---
Followup patches have been posted at
https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amona...@ispras.ru/

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-11-01 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #2 from CVS Commits  ---
The master branch has been updated by Alexander Monakov :

https://gcc.gnu.org/g:5cee5f94000ee5eabce9b223c44c7923c1c69f61

commit r13-3589-g5cee5f94000ee5eabce9b223c44c7923c1c69f61
Author: Alexander Monakov 
Date:   Mon Oct 31 17:35:57 2022 +0300

i386: correct integer division modeling in znver.md

In znver.md, division instructions have descriptions like

(define_insn_reservation "znver1_idiv_DI" 41
  (and (eq_attr "cpu" "znver1,znver2")
       (and (eq_attr "type" "idiv")
            (and (eq_attr "mode" "DI")
                 (eq_attr "memory" "none"))))
  "znver1-double,znver1-ieu2*41")

which says that DImode idiv has latency 41 (which is correct) and that
it occupies the 2nd integer execution unit for 41 consecutive cycles,
but that is not correct:

1) the division instruction is partially pipelined, and has throughput
   1/14, not 1/41;

2) for the most part it occupies a separate division unit, not the
   general arithmetic unit.

Evidently, the interaction of such 41-cycle paths with the rest of the
reservations causes a combinatorial explosion in the automaton.

Fix this by modeling the integer division unit properly, and correcting
reservations to use the measured reciprocal throughput of those
instructions (available from uops.info). A similar correction for
floating-point divisions is left for a followup patch.
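
Concretely, the corrected description keeps the znver1-double decode step
from the reservation quoted above, but instead of tying up ieu2 for the
full latency it holds a dedicated division unit for the
reciprocal-throughput number of cycles. A sketch of the new shape (the
znver1_idiv automaton and the znver1-idiv unit match the ChangeLog below;
the exact reservation strings in the committed patch may differ):

(define_automaton "znver1_idiv")
(define_cpu_unit "znver1-idiv" "znver1_idiv")

;; Latency is still 41 cycles, but the divider is modeled as busy for
;; only 14 cycles (throughput 1/14), and znver1-ieu2 is freed.
(define_insn_reservation "sketch_idiv_DI" 41
  (and (eq_attr "cpu" "znver1,znver2")
       (and (eq_attr "type" "idiv")
            (and (eq_attr "mode" "DI")
                 (eq_attr "memory" "none"))))
  "znver1-double,znver1-idiv*14")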

Top 5 znver table sizes, before:

68692 r znver1_ieu_check
68692 r znver1_ieu_transitions
99792 r znver1_ieu_min_issue_delay
428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

After:

1454 r znver1_ieu_translate
1454 r znver1_translate
2304 r znver1_ieu_transitions
428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

gcc/ChangeLog:

PR target/87832
* config/i386/znver.md (znver1_idiv): New automaton.
(znver1-idiv): New unit.
(znver1_idiv_DI): Correct unit and cycles in the reservation.
(znver1_idiv_SI): Ditto.
(znver1_idiv_HI): Ditto.
(znver1_idiv_QI): Ditto.
(znver1_idiv_mem_DI): Ditto.
(znver1_idiv_mem_SI): Ditto.
(znver1_idiv_mem_HI): Ditto.
(znver1_idiv_mem_QI): Ditto.
(znver3_idiv_DI): Ditto.
(znver3_idiv_SI): Ditto.
(znver3_idiv_HI): Ditto.
(znver3_idiv_QI): Ditto.
(znver3_idiv_mem_DI): Ditto.
(znver3_idiv_mem_SI): Ditto.
(znver3_idiv_mem_HI): Ditto.
(znver3_idiv_mem_QI): Ditto.

[Bug target/87832] AMD pipeline models are very costly size-wise

2022-10-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

--- Comment #1 from Alexander Monakov  ---
Suggested partial fix for the integer-pipe side of the blowup:
https://inbox.sourceware.org/gcc-patches/4549f27b-238a-7d77-f72b-cc77df8ae...@ispras.ru/