> On 28 Jun 2024, at 07:03, Uros Bizjak <ubiz...@gmail.com> wrote:
>
> On Fri, Jun 28, 2024 at 7:29 AM liuhongt <hongtao....@intel.com> wrote:
>>
>> Move pass_stv2 and pass_rpad after pre_reload pass_late_combine, also
>> define target_insn_cost to prevent post_reload pass_late_combine from
>> reverting the optimization done in pass_rpad.
>>
>> Adjust testcases since pass_late_combine generates better code but
>> breaks scan assembly.
>>
>> i.e.
>> Under 32-bit target, gcc used to generate broadcast from stack and
>> then do the real operation.
>> After flate_combine, they're combined into embedded broadcast
>> operations.
>>
>> gcc/ChangeLog:
>>
>> * config/i386/i386-features.cc (ix86_rpad_gate): New function.
>> * config/i386/i386-options.cc (ix86_override_options_after_change):
>> Don't disable flate_combine.
>> * config/i386/i386-passes.def: Move pass_stv2 and pass_rpad
>> after pre_reload pass_late_combine.
>> * config/i386/i386-protos.h (ix86_rpad_gate): New declaration.
>> * config/i386/i386.cc (ix86_insn_cost): New function.
>> (TARGET_INSN_COST): Define.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Adjust
>> testcase.
>> * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Ditto.
>> * gcc.target/i386/avx512f-fmadd-sf-zmm-7.c: Ditto.
>> * gcc.target/i386/avx512f-fmsub-sf-zmm-7.c: Ditto.
>> * gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c: Ditto.
>> * gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c: Ditto.
>> * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Ditto.
>> * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Ditto.
>> * gcc.target/i386/pr91333.c: Ditto.
>> * gcc.target/i386/vect-strided-4.c: Ditto.
>
> LGTM.

Unfortunately, this breaks bootstrap on every x86_64 Darwin with a 32b multilib,
as well as on all x86 Darwin 32b hosts.

I have some analysis and will raise a BZ (I am also going to see if it can be
reproduced on a Linux to Darwin cross).

What’s happening is that the Darwin picbase load is being pushed from a single
instance in the prologue to multiple instances at the start of other blocks (I
cannot tell yet if this is shrink wrapping or some other effect).

Anyway, there is supposed to be only one picbase load per function, and some
code-gen (e.g. non-local gotos and similar) might well depend on that
assumption.

If the reason for the changed codegen is that the picbase load costing has been
over-estimated in the past, then we’d need to adjust both the generation of
picbase (to cater for multiple instances) and anything that depends on assuming
there’s only one.

If the reason for the codegen change is that the picbase load cost is now
under-estimated, that’s an easier fix.

Initial investigation only, I’ll try to raise a BZ tomorrow - it took a while
to bisect.

Iain