[Bug middle-end/28831] [11/12/13/14/15 Regression] Aggregate copy not elided when using a return value as a pass-by-value parameter

2024-05-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28831

--- Comment #44 from rguenther at suse dot de  ---
On Thu, 23 May 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28831
> 
> --- Comment #43 from Matthias Kretz (Vir)  ---
> I see this issue in SIMD programming. Example (on x86_64 with only '-O2', i.e.
> without AVX512) https://compiler-explorer.com/z/K64djP356:
> 
> typedef int V __attribute__((vector_size(64)));
> 
> V gen();
> 
> void g0(V const&, V const&);
> void g1(V, V);
> 
> void constref()
> {
>   g0(gen(), gen());
> }
> 
> void byvalue()
> {
>   g1(gen(), gen());
> }
> 
> Both the 'constref' and 'byvalue' cases copy every V argument before calling
> g0/g1.

The copy on GIMPLE is due to IL constraints:

  _10 = gen ();

   <bb 3> :
  D.2805 = _10;
  g0 (&D.2805, &D.2806);

when the call has a register-type return value the LHS of the call
statement has to be a register (an SSA name).  But the argument to
g0 has to be memory, so we get the extra copy.  Now, w/o AVX512
that "register" doesn't work out and we allocate it to memory,
causing a memory-to-memory copy.

That's also because in vector lowering we do not lower those
register-to-memory stores (doing that would possibly help a bit,
as would more clever expansion of the copy or of _10); see the
sketch below.
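
A sketch of what such lowered GIMPLE could look like (illustrative
only; assuming the 512-bit value is split into the four 128-bit
pieces available without AVX512, names made up):

  MEM <vector(4) int> [(void *)&D.2805] = BIT_FIELD_REF <_10, 128, 0>;
  MEM <vector(4) int> [(void *)&D.2805 + 16B] = BIT_FIELD_REF <_10, 128, 128>;
  MEM <vector(4) int> [(void *)&D.2805 + 32B] = BIT_FIELD_REF <_10, 128, 256>;
  MEM <vector(4) int> [(void *)&D.2805 + 48B] = BIT_FIELD_REF <_10, 128, 384>;

That is, the store would be done in pieces the target actually has
registers for, avoiding the memory-to-memory copy of the whole vector.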

[Bug tree-optimization/114072] gcc.dg/vect/vect-pr111779.c FAILs

2024-05-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072

--- Comment #6 from rguenther at suse dot de  ---
On Wed, 22 May 2024, ro at CeBiTec dot Uni-Bielefeld.DE wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072
> 
> --- Comment #5 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> > --- Comment #4 from rguenther at suse dot de  ---
> [...]
> >> I think the best we can do then is
> >> 
> >> /* { dg-skip-if "PR tree-optimization/114072" { be && { ! vect_shift_char 
> >> } } }
> >> */
> >> 
> >> Lets the test become UNSUPPORTED on 32 and 64-bit SPARC, but still PASS
> >> as before on 32 and 64-bit x86.
> >
> > Can we instead guard the scan-tree-dump?  This way the correctness
> > execute part still is exercised?
> 
> Sure, even if the result is somewhat hard to read with all those levels
> of braces:
> 
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { vect_int &&
> { le || { be && vect_shift_char } } } } } } */
> 
> This way, all of compile, execute, and scan are run on x86, while on
> sparc it's only compile, execute.

Looks good, pre-approved.

[Bug tree-optimization/114072] gcc.dg/vect/vect-pr111779.c FAILs

2024-05-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072

--- Comment #4 from rguenther at suse dot de  ---
On Wed, 22 May 2024, ro at CeBiTec dot Uni-Bielefeld.DE wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072
> 
> --- Comment #3 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> > --- Comment #2 from Richard Biener  ---
> > Hmm, is solaris-sparc big-endian?  It seems so.  That makes the bitfield
> 
> It is indeed.
> 
> > access require a VnQImode logical right shift (but little-endian doesn't
> > require it - it's actually bitfield endianess that matters).
> >
> > There is vect_shift_char you could use and somehow conditionalize that
> > on big-endianess.
> 
> I think the best we can do then is
> 
> /* { dg-skip-if "PR tree-optimization/114072" { be && { ! vect_shift_char } } 
> }
> */
> 
> Lets the test become UNSUPPORTED on 32 and 64-bit SPARC, but still PASS
> as before on 32 and 64-bit x86.

Can we instead guard the scan-tree-dump?  This way the correctness
execute part still is exercised?

[Bug tree-optimization/115138] [15 Regression] Bootstrap compare-debug fail after r15-580-gf3e5f4c58591f5

2024-05-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115138

--- Comment #2 from rguenther at suse dot de  ---
> On 17.05.2024 at 16:20, iains at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115138
> 
>Bug ID: 115138
>   Summary: [15 Regression] Bootstrap compare-debug fail after
>r15-580-gf3e5f4c58591f5
>   Product: gcc
>   Version: 14.0
>Status: UNCONFIRMED
>  Severity: normal
>  Priority: P3
> Component: tree-optimization
>  Assignee: unassigned at gcc dot gnu.org
>  Reporter: iains at gcc dot gnu.org
>CC: rguenth at gcc dot gnu.org
>  Target Milestone: ---
>Target: x86_64-darwin
> 
> The compare fail is a symbol name mismatch (AFAICT); the code is otherwise
> identical.
> 
> a single file fails: gcc/d/opover.o
> 
> There's:
> 
>.const
> _CSWTCH.155:
>.byte   38
>.byte   37
>.byte   40
>.byte   39
> 
> where the stage1 compiler (and x86_64 Linux) produces _CSWTCH.154
> 
> At present, still trying to figure out how to debug this further .. it's D so
> no preprocessed output - I guess will have to try tree dumps.

Did the followup fix maybe resolve this?

> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug rtl-optimization/101523] Huge number of combine attempts

2024-05-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #64 from rguenther at suse dot de  ---
On Sat, 4 May 2024, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523
> 
> --- Comment #61 from Segher Boessenkool  ---
> We used to do the wrong thing in combine.  Now that my fix was reverted, we
> still do.  This should be undone soonish,

As promised I'm going to revert the revert after 14.1 is released 
(hopefully tomorrow).  As for distros I have decided to include my
hack posted in 
https://gcc.gnu.org/pipermail/gcc-patches/2024-April/648725.html
for SUSE-based distros in GCC 13 and 14 as that seems to improve
the problematic memory use in our build farm.

[Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd

2024-05-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #4 from rguenther at suse dot de  ---
On Mon, 6 May 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908
> 
> --- Comment #3 from Matthias Kretz (Vir)  ---
> The stdx::simd implementation in this area is old and mainly tuned to be
> correct. I can rewrite the split and concat implementation to use
> __builtin_shufflevector (which wasn't available in GCC at the time when I
> originally implemented it). Doing so I can resolve this issue.
> 
> How do you want to handle this? Because it would certainly be nice if the
> compiler can optimize this in the same way as Clang can. Should I try to come
> up with a testcase that doesn't need stdx::simd and then improve stdx::simd
> independently?

Yes, that would be nice (ideally the testcase w/o stdx::simd would
even be a C testcase, with no C++ abstraction).

I do think stdx::simd should be improved to use __builtin_shufflevector
though.
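
A minimal C sketch of the kind of testcase I mean (names made up;
whether this matches the exact stdx::simd split/concat pattern would
need checking):

  typedef int v8si __attribute__((vector_size (32)));
  typedef int v4si __attribute__((vector_size (16)));

  static v4si lo (v8si x) { return __builtin_shufflevector (x, x, 0, 1, 2, 3); }
  static v4si hi (v8si x) { return __builtin_shufflevector (x, x, 4, 5, 6, 7); }

  v8si swap_halves (v8si x)
  {
    /* split + concat; ideally this folds to a single lane permute
       instead of going through memory.  */
    v4si l = lo (x), h = hi (x);
    return __builtin_shufflevector (h, l, 0, 1, 2, 3, 4, 5, 6, 7);
  }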

[Bug tree-optimization/114774] Missed DSE in simple code due to interleaving stores

2024-04-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

--- Comment #4 from rguenther at suse dot de  ---
On Fri, 19 Apr 2024, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774
> 
> --- Comment #3 from Jan Hubicka  ---
> > Yes, DSE walking doesn't "branch" but goes to some length handling some 
> > trivial
> > branches only.  Mainly to avoid compile-time issues.  It needs larger
> > re-structuring to fix that, but in principle it shouldn't be difficult. 
> 
> Looking into it: instead of having a simple outer loop it needs to
> maintain a worklist of defs to process, each annotated with a live bitmap,
> right?

Yeah, I have some patch on some branch somewhere ... IIRC it was broken
and miscompiled things and I got distracted ...

I will eventually get back to DSE for stage1 also because of some other
PRs.
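
Roughly the shape I'd expect (a sketch only, not working code; all
names are illustrative):

  /* Worklist item: continue the walk at the use sites of VDEF with
     the bytes in LIVE not yet proven dead.  */
  struct dse_walk_item { tree vdef; sbitmap live; };

  auto_vec<dse_walk_item> worklist;
  worklist.safe_push ({ gimple_vdef (stmt), live_bytes });
  while (!worklist.is_empty ())
    {
      dse_walk_item item = worklist.pop ();
      /* Walk the immediate uses of item.vdef; at control-flow
         branches push one item per path, each with its own copy of
         the live bitmap, instead of giving up.  */
    }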

> With Martin we noticed it while looking into push_back codegen. In case
> an aggregate is passed to a function call but does not escape, we now SRA
> it, removing all uses and corresponding clobbers, but we do not remove the
> stores (Martin was expecting DSE to do it). As a result the store stays
> around, which causes a store-forwarding stall (storing in pieces, loading
> as a whole). It seems easy to remove this specific store during SRA, but
> improving DSE here would be desirable.

Yes.

[Bug tree-optimization/114749] [13 Regression] RISC-V rv64gcv ICE: in vectorizable_load, at tree-vect-stmts.cc

2024-04-18 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114749

--- Comment #5 from rguenther at suse dot de  ---
On Wed, 17 Apr 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114749
> 
> --- Comment #4 from JuzheZhong  ---
> Hi, Patrick.
> 
> It seems that Richard didn't append the testcase in the patch.
> Could you send a patch to add the testcase for RISC-V port ?

I did not because of the extra RISC-V specific options needed; you
may have a better idea how to correctly put that into the riscv
target-specific harness (which eventually iterates over extra
options?).  Something like the sketch below, perhaps.
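
A guess at the needed testcase header (the exact effective target and
option set would need verifying against the riscv harness):

  /* { dg-do compile { target riscv_v } } */
  /* { dg-additional-options "-march=rv64gcv -mabi=lp64d" } */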

[Bug target/111231] [12/13/14 regression] armhf: Miscompilation with -O2/-fno-exceptions level (-fno-tree-vectorize is working)

2024-04-16 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111231

--- Comment #35 from rguenther at suse dot de  ---
On Tue, 16 Apr 2024, rearnsha at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111231
> 
> --- Comment #34 from Richard Earnshaw  ---
> To be honest, I'm more concerned that we aren't eliminating a lot of these
> copies during the gimple optimization phase.  The memcpy is really a type
> punning step (that's strictly ISO C compliant, rather than using the GCC union
> extension), so ideally we'd recognize that and eliminate as many of the copies
> as possible (perhaps using some form of view_convert or whatever gimple is
> appropriate for changing the view without changing the contents).

Yeah, there's currently no way to represent a change just in the
effective type that wouldn't generate code in the end but would still
serve as a barrier for these type-related optimizations.

When modifying the earlier store is an option then another possibility
would be to attach multiple effective types to it in some way.  Of course
that's pessimizing as well.

That said, the choice has been made to prune those "invalid" redundant
store removals but as we see here the implemented checks are not working
as intended.

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #17 from rguenther at suse dot de  ---
On Mon, 15 Apr 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635
> 
> --- Comment #16 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #14)
> > I think
> > 
> >   if (safelen)
> > {
> >   poly_uint64 val;
> >   safelen = OMP_CLAUSE_SAFELEN_EXPR (safelen);
> >   if (!poly_int_tree_p (safelen, &val))
> > safelen_int = 0;
> >   else
> > safelen_int = MIN (constant_lower_bound (val), INT_MAX);
> > 
> > should simply become
> > 
> > safelen_int = constant_upper_bound_with_limit (val, INT_MAX);
> > 
> > ?  Usually targets do have a limit on the actual length but I see
> > constant_upper_bound_with_limit doesn't query such.  But it would
> > be a more appropriate way to say there might be an actual target limit here?
> 
> OMP_CLAUSE_SAFELEN_EXPR is always an INTEGER_CST, the FEs verify that and 
> error
> if it is not.  So, I must say I don't really understand parts of the
> r8-5649-g9d2f08ab97be
> changes.  I can understand the intent to make max_vf a poly_int, but don't
> understand why safelen should be, what would it mean and when it would be set
> that way?

It would be only to "better" encode "infinity".  But I see loop->safelen
is 'int' and only [0, INT_MAX] is specified, so we'd conveniently have
say -1 to encode "infinity".  We'd have to special-case that value
anyway?

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #15 from rguenther at suse dot de  ---
On Mon, 15 Apr 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635
> 
> --- Comment #13 from Jakub Jelinek  ---
> (In reply to kugan from comment #12)
> > > Why?
> > > Then it just is INT_MAX value, which is a magic value that says that it is
> > > infinity.
> > > No need to say it is a poly_int infinity.
> > 
> > For this test case, omp_max_vf gets [16, 16] from the backend. This then
> > becomes 16. If we keep it as poly_int, it would pass maybe_lt (max_vf,
> > min_vf)) after applying safelen?
> 
> No.  You should just special case loop->safelen == INT_MAX to mean infinity in
> the comparisons where it currently doesn't work as infinity.

But then an actual safelen(INT_MAX) would need to be adjusted.

Maybe using a poly-int safelen internally is cleaner.

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #7 from rguenther at suse dot de  ---
> On 08.04.2024 at 16:55, tnfchris at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635
> 
> --- Comment #6 from Tamar Christina  ---
> (In reply to Jakub Jelinek from comment #4)
>> Now, with SVE/RISCV vectors the actual vectorization factor is a poly_int
>> rather than constant.  One possibility would be to use VLA arrays in those
>> cases, but then it will be hard to undo that later, or allow both shrinking
>> and growing the arrays and even turning them into VLA-like ones.
> 
> I think they are already VLAs of some kind going into vect, unless I've
> misunderstood the declaration:
> 
>  float D.28295[0:POLY_INT_CST [15, 16]];
>  float D.28293[0:POLY_INT_CST [15, 16]];
>  float D.28290[0:POLY_INT_CST [15, 16]];
> 
> it looks like during vectorization they are lowered though.

Maybe it's about upper vs. lower bound when setting loop->safelen from
'unlimited'.

> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #26 from rguenther at suse dot de  ---
On Mon, 8 Apr 2024, douglas.boffey at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
> 
> --- Comment #25 from Douglas Boffey  ---
> (In reply to rguent...@suse.de from comment #24)
> > dumpbin /headers executable.exe
> 
> ...
>   C0 size of stack reserve

OK, so that's the expected 12MB we adjust the stack reserve to.  It's
odd that you run into stack exhaustion (or maybe you don't and instead
run out of other memory).

I've now tried GCC 11.4 as you reportedly used on x86_64-linux and
can compile the testcase successfully with that using 2GB memory
and 75s compile-time.  Stack usage itself isn't measured but 8MB
were enough.

GCC 13 runs out of (8MB) stack for me.

[Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #24 from rguenther at suse dot de  ---
On Mon, 8 Apr 2024, douglas.boffey at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
> 
> --- Comment #23 from Douglas Boffey  ---
> (In reply to Richard Biener from comment #22)
> > Note we're using -Wl,--stack,12582912 when linking the GCC executables, I
> > wonder
> > if the reporter can verify the segfaulting executables have the correct 
> > stack
> > size set?
> 
> How do I find that out?

googling tells me

dumpbin /headers executable.exe

which should show a "size of stack reserve".  Alternatively if you
have objdump from binutils that might also show this info.  Basically
it's encoded in the binary (you want to check cc1plus.exe here).

There's also

editbin /stack:N executable.exe

where you can alter the stack reserve of the binary to N
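
For example, to restore the 12MB reserve we use at link time:

editbin /stack:12582912 cc1plus.exe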

[Bug libfortran/114304] [13/14 Regression] libgfortran I/O – bogus "Semicolon not allowed as separator with DECIMAL='point'"

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304

--- Comment #29 from rguenther at suse dot de  ---
On Mon, 8 Apr 2024, burnus at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304
> 
> --- Comment #28 from Tobias Burnus  ---
> Created attachment 57896
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57896&action=edit
> Testcase
> 
> It seems as if 'tabs' cause problems, e.g. for:
> 
>  profile_single_file= .true.
> 
> where there are two tabs before '='.
> 
> * * *
> 
> The problem seems to be that the new code uses:
> 
> -  eat_spaces (dtp);
>dtp->u.p.comma_flag = 0;
> +  c = next_char (dtp);
> +  if (c == ' ')
> +{
> +  eat_spaces (dtp);
> 
> Thus, it explicitly checks for ' ' while eat_spaces handles:
> 
>   while (c != EOF && (c == ' ' || c == '\r' || c == '\t'));
> 
> Testcase attached.
> 
> I think we need at least an "|| c == '\t'"; I guess '\r' isn't really required
> here, or is it?

Might be for \r\n line endings?  I'd keep it for the sake of preserving
previous behavior.  isspace(3) tests for \f, \n, \r, \t, \v and space
(but of course that all depends on the locale; I'm not sure whether
libgfortran needs to care for locales).
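
So the check should probably accept what eat_spaces accepts, e.g.
(a sketch following the hunk quoted above):

  c = next_char (dtp);
  if (c == ' ' || c == '\t' || c == '\r')
    {
      eat_spaces (dtp);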

[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e

2024-03-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683

--- Comment #25 from rguenther at suse dot de  ---
On Fri, 22 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683
> 
> --- Comment #24 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #23)
> > It looks like this could go to suitable_reference_p instead?
> 
> You mean return false for those making them not suitable at all?
> I thought without a write such references would act more like RS_ANY, but
> a reason I didn't just treat such references as RS_ANY rather than RS_NONZERO
> in suitable_reference_p was because of the assert that all refs in a component
> have the same ref_step_type but nothing actually comparing it before the
> assertion.

Hmm, true.

> But if you think I should just return false from suitable_reference_p if the
> step isn't a multiple of the sizes, I can surely try that.
> 
> > That said, I do wonder why with my patch split_data_refs_to_components
> > doesn't fixup.  I think it's supposed to handle the case where
> > dependences are unknown conservatively...
> 
> I'm afraid I'm not familiar with data refs enough to know what was going on.

I tried to follow what happens there and I'm also somewhat lost.

Anyway, I think fixing this in predcom in a convenient place even if
it might be not the true fix is OK.  You might want to put a comment
before any such fix indicating there might be more latent issues
in predcom or dependence analysis in general.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #17 from rguenther at suse dot de  ---
On Thu, 21 Mar 2024, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736
> 
> --- Comment #16 from Uroš Bizjak  ---
> (In reply to Richard Biener from comment #13)
> > The original testcase is fixed, appearantly slapping 'extern' on the int
> > makes it not effective.
> > 
> > Possibly better amend the
> > 
> >   if (VAR_P (inner) && DECL_HARD_REGISTER (inner))
> > return;
> > 
> > check though.  As indicated my fix fixed only VAR_DECL cases, there's
> > still pointer-based accesses (MEM_REF) to consider.  So possibly even
> > the following is necessary
> 
> I must admit that to create the patch from Comment #11 I just mindlessly
> searched for DECL_THREAD_LOCAL_P in asan.cc and amended the location with
> ADDR_SPACE_GENERIC_P check.

It might be that the DECL_THREAD_LOCAL_P handling is similarly broken
in that

int foo (int *p)
{
  return *p;
}

is instrumented even when p is a pointer to thread-local storage?  But
maybe that works fine and is handled in the runtime.  Of course the
runtime can't handle non-generic address spaces at all unless we can
convert all those addresses to ones in the generic address space
(given suitable overlap of course).
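
For the MEM_REF case, backing off based on the address space of the
access type might be what's needed, along the lines of (a sketch,
untested):

  /* In instrument_derefs: don't instrument accesses in a
     non-generic address space, the runtime can't handle them.  */
  if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (TREE_TYPE (t))))
    return;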

> However, ASAN should back off from annotating *any* gs: prefixed address. 
> 
> I'll test your patch from Comment #13 ASAP.

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

--- Comment #8 from rguenther at suse dot de  ---
On Thu, 21 Mar 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
> 
> Hongtao Liu  changed:
> 
>What|Removed |Added
> 
>  CC||liuhongt at gcc dot gnu.org
> 
> --- Comment #7 from Hongtao Liu  ---
> Another simple case is 
> 
> typedef int v4si __attribute__((vector_size(16)));
> typedef short v8hi __attribute__((vector_size(16)));
> 
> v8hi a;
> v4si b;
> void
> foo ()
> {
>b = __extension__(v4si){0, 0, 0, 0};
>a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};
> }
> 
> GCC generates 2 pxor
> 
> foo():
> vpxor   xmm0, xmm0, xmm0
> vmovdqa XMMWORD PTR b[rip], xmm0
> vpxor   xmm0, xmm0, xmm0
> vmovdqa XMMWORD PTR a[rip], xmm0
> ret

If we were to expose that vpxor before postreload we'd likely CSE it,
but we have

5: xmm0:V4SI=const_vector
  REG_EQUIV const_vector
6: [`b']=xmm0:V4SI
7: xmm0:V8HI=const_vector
  REG_EQUIV const_vector
8: [`a']=xmm0:V8HI

until the very end.  But since we have the same mode size on the xmm0
sets CSE could easily handle (integral) constants by hashing/comparing
on their byte representation rather than by using the RTX structure.
OTOH as we mostly have special constants allowed in the IL like this
treating all-zeros and all-ones specially might be good enough ...
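
A sketch of hashing by byte image (assuming native_encode_rtx is
usable in this context; illustrative only):

  /* Hash constants by their target byte image so that, e.g., V4SI
     and V8HI all-zeros vectors hash (and then compare) equal.  */
  auto_vec<target_unit, 64> bytes;
  unsigned size = GET_MODE_SIZE (mode).to_constant ();
  if (native_encode_rtx (mode, x, bytes, 0, size))
    hash = iterative_hash (bytes.address (), size, hash);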

[Bug tree-optimization/114345] FRE missing knowledge of semantics of IFN loads

2024-03-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114345

--- Comment #6 from rguenther at suse dot de  ---
On Fri, 15 Mar 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114345
> 
> --- Comment #5 from Tamar Christina  ---
> (In reply to Richard Biener from comment #4)
> > Well, the shuffling in .LOAD_LANES will be a bit awkward to do, but sure.  
> > We
> > basically lack "constant folding" of .LOAD_LANES and similarly of course
> > we can't see through .STORE_LANES of a constant when later folding a scalar
> > load from the same memory.
> 
> I guess it becomes harder with the 3 and 4 lane ones, but the 2 lanes one is
> just a single VEC_PERM_EXPR no?

It's all about constant folding and thus "shuffling" properly.  But if
you consider that the vector type might be punned, say by a later "long"
load from memory written by a .STORE_LANES with "int" lanes, it will get
interesting to follow non-consecutive bits... (read: that's not
implemented).  That said, a careful set of testcases should accompany
support for .LOAD_LANES and .STORE_LANES handling in VN.

I suppose it should be possible to leverage the GIMPLE FE for this.

[Bug libfortran/114304] [13/14 Regression] libgfortran I/O – bogus "Semicolon not allowed as separator with DECIMAL='point'"

2024-03-11 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304

--- Comment #10 from rguenther at suse dot de  ---
> On 11.03.2024 at 20:03, jvdelisle at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304
> 
> --- Comment #9 from Jerry DeLisle  ---
> Patch on comment #8 breaks all sorts of things. Not so obvious. I will try
> reverting the original hunk from pr105473 and then work from there.

Just to add: rejecting something we accepted before, when that doesn't
fix a rejects-valid, is something I think we shouldn't do on branches.
And given it affects the standard library it shouldn't be done when the
SONAME is not altered, as it might affect programs compiled with an
older libgfortran (maybe there's an argument for something like a
LIBGFORTRAN_STRICT environment variable to control such behavior if
really needed?)

> --
> You are receiving this mail because:
> You reported the bug.

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #16 from rguenther at suse dot de  ---
On Thu, 7 Mar 2024, gjl at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252
> 
> --- Comment #14 from Georg-Johann Lay  ---
> The code in the example is not a perfect bswap, it needs additional shuffling
> of bytes.  The tree passes must know that bswap is not a perfect fit.  There
> must be *some* criterion that depends on the permutation, and when a bswap is
> closer to the bswapped permutation than a non-bswapped permutation is to the
> original one.

I think we only support bswap + rotate as replacement (but just use that
if it matches).
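
For reference, this is the kind of source pattern the bswap pass does
replace (on little-endian this folds to one 32-bit load plus a bswap;
a sketch, not taken from the PR):

  unsigned int
  load_be32 (const unsigned char *p)
  {
    return ((unsigned int) p[0] << 24) | ((unsigned int) p[1] << 16)
           | ((unsigned int) p[2] << 8) | (unsigned int) p[3];
  }

A byte permutation that is neither a bswap nor bswap + rotate is left
alone.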

[Bug middle-end/105533] UBSAN: gcc/expmed.cc:3272:26: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int'

2024-03-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105533

--- Comment #10 from rguenther at suse dot de  ---
On Wed, 6 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105533
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||rguenth at gcc dot gnu.org
> 
> --- Comment #7 from Jakub Jelinek  ---
> The second UB is on
> #2  ao_ref_init_from_vn_reference (ref=, set=1, base_set=1,
> type=, ops=...) at ../../gcc/tree-ssa-sccvn.cc:1224
> 1224offset += op->off << LOG2_BITS_PER_UNIT;
> where op->off is negative.
> Isn't this just an unnecessary optimization?  I mean can't we just do
> offset += op->off * BITS_PER_UNIT;
> BITS_PER_UNIT is a constant 8 on all targets we support...

It's a habit from dealing with offset_int (but this is poly_int64)
where the multiply is possibly a lot more costly than a shift.

So yeah, a multiply is fine I guess.

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #6 from rguenther at suse dot de  ---
> On 06.03.2024 at 17:12, gjl at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252
> 
> --- Comment #5 from Georg-Johann Lay  ---
> (In reply to Richard Biener from comment #4)
>> So bswap on a value is just register shuffling, right?
> 
> The point is that there is no need for bswap in the first place, just have a
> look at the code that v13 generates.  It's 4 QI loads and that's it, no
> shuffling required at all.
> 
> But v14 dropped that, and the bswapsi (presumably due to previous flawed tree
> optmizations) is introduced by some tree pass.
> 
> There's nothing the backend can do about it.  So would you explain why you
> think it's a "target" issue?
> 
> Maybe the PR title I used is confusing and does not hit the point?

Why does the target say it has bswapsi then? In which case is that profitable?

> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #22 from rguenther at suse dot de  ---
On Tue, 5 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232
> 
> --- Comment #17 from Jakub Jelinek  ---
> Either change those too, or the splitter needs some variant what to do if 
> there
> is a mismatch.
> Though, optimize_size implies optimize_function_for_size_p (cfun), so if a
> named pattern uses && optimize_size and the insn it splits into uses
> optimize_function_for_size_p (cfun), it shouldn't fail.  The other direction 
> is
> not always true, optimize_function_for_size_p (cfun) can be true just because
> the function
> is cold, not just because it is -Os.

I think optimize_function_for_size_p (cfun) isn't always true if
optimize_size is, since it looks at the function-specific setting
of that flag, so you'd have to use opt_for_fn (cfun->decl, optimize_size).
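
I.e., in the insn condition, something like (a sketch):

  /* Look at the function-local -Os setting rather than the global
     optimize_size or the hot/cold-based predicate.  */
  if (opt_for_fn (cfun->decl, optimize_size))
    ...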

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #14 from rguenther at suse dot de  ---
On Tue, 5 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||jakub at gcc dot gnu.org
> 
> --- Comment #12 from Jakub Jelinek  ---
> Still, it would be nice to understand what changed 
> optimize_function_for_size_p
> (cfun)
> after IPA.  Is something adjusting node->count or node->frequency?
> Otherwise it should just depend on the optimize_size flag which should not
> change...

Maybe we share the TREE optimization node (it gets re-materialized
during LTO streaming) between hot and cold functions and the "first"
one getting in set_cfun will overwrite TREE_OPTIMIZATION_OPTABS in
init_tree_optimization_optabs (though it seems to overwrite things).

In any case I think the tree node sharing only looks at options, not
at function frequency, so having TREE_OPTIMIZATION_OPTABS be part of
the optimization node looks a bit fragile in the light of sharing.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #43 from rguenther at suse dot de  ---
On Mon, 4 Mar 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #41 from Richard Sandiford  ---
> (In reply to Richard Biener from comment #40)
> > So I wonder if we can use "local costing" to decide a gather is always OK
> > compared to the alternative with peeling for gaps.  On x86 gather tends
> > to be slow compared to open-coding it.
> Yeah, on SVE gathers are generally 'enabling' instructions rather than
> something to use for their own sake.
> 
> I suppose one problem is that we currently only try to use gathers for
> single-element groups.  If we make a local decision to use gathers while
> keeping that restriction, we could end up using gathers 'unnecessarily' while
> still needing to peel for gaps for (say) a two-element group.
> 
> That is, it's only better to use gathers than contiguous loads if by doing 
> that
> we avoid all need to peel for gaps (and if the cost of peeling for gaps was
> high enough to justify the cost of using gathers over consecutive loads).

Yep.  I do want to experiment with a way to have vectorizable_* register
multiple variants of vectorization and have ways to stitch together and
cost the overall vectorization as a cheaper (and more flexible)
alternative to "iteration".  It will to some extent blow up the number
of combinations to try, but there might be a way to use greedy
relaxation techniques to converge to a lowest-cost variant.

> One of the things on the list to do (once everything is SLP!) is to support
> loads with gaps directly via predication, so that we never load elements that
> aren't needed.  E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a
> 32-bit load (LD1W .S) to load only even-indexed elements.  So a single-element
> group with a group size of 2 could be done cheaply with just consecutive 
> loads,
> without peeling for gaps.

Yep.  Gap handling leaves to be desired (also when no predication is
available), I also plan to address some shortcomings in that area early
stage1.

Note that generally the idea is that gap peeling is very cheap - unless
that is the only reason to have an epilogue at all.  The exception might
be small round-trip loops, but those are best handled with predication,
where there's no good reason to do peeling for gaps at all.

[Bug other/114191] Flags "Warning" and "Target" don't mix well in target.opt files

2024-03-04 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114191

--- Comment #4 from rguenther at suse dot de  ---
On Mon, 4 Mar 2024, gjl at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114191
> 
> --- Comment #2 from Georg-Johann Lay  ---
> (In reply to Richard Biener from comment #1)
> > Wmisspelled-isr
> > Target C C++ Var(avr_warn_misspelled_isr) Init(1)
> > Warn if the ISR is misspelled, ...
> > 
> > should eventually work?
> 
> With that, the warnings appear as they should, but the option is not
> recognized:
> 
> $ avr-gcc signal.c -S -Wmisspelled-isr
> error: unrecognized command-line option '-Wmisspelled-isr'
> 
> $ avr-gcc signal.c -S -Wno-misspelled-isr
> error: unrecognized command-line option '-Wno-misspelled-isr'
> 
> $ avr-gcc signal.c -S -Werror=misspelled-isr
> error: '-Werror=misspelled-isr': '-Wmisspelled-isr' is not an option that
> controls warnings

Hmm.  This is how some other targets do it, grepping

grep ^W config/*/*.opt

(a bit noisy because of descriptions).

config/gcn/gcn.opt:Wopenacc-dims
config/gcn/gcn.opt:Warn about invalid OpenACC dimensions.
config/i386/mingw.opt:Wpedantic-ms-format
config/i386/mingw.opt:Warn about none ISO msvcrt scanf/printf width extensions.

would be not Target but

config/rs6000/darwin.opt:Waltivec-long-deprecated

uses Target (and thus is dysfunctional)

So it seems we're not set up for 'Target' warnings.
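
So presumably the avr entry needs to drop Target, along the lines of
(a sketch modeled on the working entries above; exact properties
unverified):

Wmisspelled-isr
C C++ Var(avr_warn_misspelled_isr) Init(1) Warning
Warn if the ISR is misspelled.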

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-01 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #34 from rguenther at suse dot de  ---
On Fri, 1 Mar 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #33 from Richard Sandiford  ---
> Can you give me a chance to look at it a bit when I'm back?  This doesn't
> feel like the way to go to me.

Sure.

[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since g:a0b1798042d033fd2cc2c806afbb77875dd2909b

2024-02-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #4 from rguenther at suse dot de  ---
On Wed, 28 Feb 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151
> 
> --- Comment #3 from Tamar Christina  ---
> > 
> > This was a correctness fix btw, so I'm not sure we can easily recover - we
> > could try using niter information for CHREC_VARIABLE but then there's
> > variable niter here so I don't see a chance.
> > 
> 
> It looks like it's mostly SVE suffering here. Adv. SIMD and scalar codegen
> seem to have vastly improved with it. We see ~10% improvements due to better
> addressing for scalar.
> 
> It also only happens at -O3 but -O2 is fine, which is weird as I was
> expecting IVopts to be enabled at -O2 too.

Note the underlying issue (analyzing { a, +, b } * c differently and
thus eventually failing to analyze dependent CHRECs) should be
independent of the architecture.

What was definitely missing is consideration of POLY_INT_CSTs (and
variable polys, as I think there's no range info for those).

We do eventually want to improve how ranger behaves here.  I'm not sure
why, when we do not provide a context 'stmt', it can't compute
a range valid at the SSA name's point of definition (so basically
compute the global range).

Maybe there's another API to do that.  But I thought it was convenient
to use range_of_expr as that also handles GENERIC expression trees
to some extent and those are common within SCEV.
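
What I mean, roughly (a sketch; signatures from memory):

  int_range_max r;
  /* With no context stmt this would ideally give a range valid at
     the definition of 'expr', i.e. the global range.  */
  if (get_range_query (cfun)->range_of_expr (r, expr))
    ...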

[Bug tree-optimization/114041] wrong code with _BitInt() and -O -fgraphite-identity

2024-02-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041

--- Comment #14 from rguenther at suse dot de  ---
On Wed, 28 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041
> 
> --- Comment #12 from Jakub Jelinek  ---
> I can change the comparison into assert, or defer that for stage1?

Defer I think, if you want to bother ...

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #15 from rguenther at suse dot de  ---
On Wed, 28 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #14 from Hongtao Liu  ---
> (In reply to rguent...@suse.de from comment #13)
> > On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> > > 
> > > --- Comment #11 from Hongtao Liu  ---
> > > 
> > > >Loop body is likely going to simplify further, this is difficult
> > > >to guess, we just decrease the result by 1/3.  */
> > > > 
> > > 
> > > This is introduced by r0-68074-g91a01f21abfe19
> > > 
> > > /* Estimate number of insns of completely unrolled loop.  We assume
> > > +   that the size of the unrolled loop is decreased in the
> > > +   following way (the numbers of insns are based on what
> > > +   estimate_num_insns returns for appropriate statements):
> > > +
> > > +   1) exit condition gets removed (2 insns)
> > > +   2) increment of the control variable gets removed (2 insns)
> > > +   3) All remaining statements are likely to get simplified
> > > +  due to constant propagation.  Hard to estimate; just
> > > +  as a heuristics we decrease the rest by 1/3.
> > > +
> > > +   NINSNS is the number of insns in the loop before unrolling.
> > > +   NUNROLL is the number of times the loop is unrolled.  */
> > > +
> > > +static unsigned HOST_WIDE_INT
> > > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +                         unsigned HOST_WIDE_INT nunroll)
> > > +{
> > > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  if (unr_insns <= 0)
> > > +    unr_insns = 1;
> > > +  unr_insns *= (nunroll + 1);
> > > +
> > > +  return unr_insns;
> > > +}
> > > 
> > > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > > likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> > > keeped.
> > > 
> > > +/* Estimate number of insns of completely unrolled loop.
> > > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > > +   the fact that in last copy everything after exit conditional
> > > +   is dead and that some instructions will be eliminated after
> > > +   peeling.
> > > 
> > > -   NINSNS is the number of insns in the loop before unrolling.
> > > -   NUNROLL is the number of times the loop is unrolled.  */
> > > +   Loop body is likely going to simplify futher, this is difficult
> > > +   to guess, we just decrease the result by 1/3.  */
> > > 
> > >  static unsigned HOST_WIDE_INT
> > > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +estimated_unrolled_size (struct loop_size *size,
> > >                           unsigned HOST_WIDE_INT nunroll)
> > >  {
> > > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > > +                             * (HOST_WIDE_INT) (size->overall
> > > +                                                - size->eliminated_by_peeling));
> > > +  if (!nunroll)
> > > +    unr_insns = 0;
> > > +  unr_insns += size->last_iteration
> > > +               - size->last_iteration_eliminated_by_peeling;
> > > +
> > > +  unr_insns = unr_insns * 2 / 3;
> > >    if (unr_insns <= 0)
> > >      unr_insns = 1;
> > > -  unr_insns *= (nunroll + 1);
> > > 
> > > It looks to me 1 / 3 overestimates the instructions that can be optimised 
> > > away,
> > > especially if we've subtracted eliminated_by_peeling
> > 
> > Yes, that 1/3 reduction is a bit odd - you could have the same effect
> > by increasing the instruction limit by 1/3, but that means it doesn't
> > really matter, does it?  It would be interesting to see if increasing
> > the limit by 1/3 and removing the above is neutral on SPEC?
> 
> Removing the 1/3 reduction gets ~2% improvement for 525.x264_r on SPR with
> -march=native -O3, with no big impact on other integer benchmarks.

454.calculix was always the benchmark to cross check as that benefits
from much unrolling.

I'm all for removing the 1/3 for innermost loop handling (in cunroll
the unrolled loop is then innermost).  I'm more concerned about
unrolling more than one level which is exactly what's required for
454.calculix.

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #13 from rguenther at suse dot de  ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #11 from Hongtao Liu  ---
> 
> >Loop body is likely going to simplify further, this is difficult
> >to guess, we just decrease the result by 1/3.  */
> > 
> 
> This is introduced by r0-68074-g91a01f21abfe19
> 
> /* Estimate number of insns of completely unrolled loop.  We assume
> +   that the size of the unrolled loop is decreased in the
> +   following way (the numbers of insns are based on what
> +   estimate_num_insns returns for appropriate statements):
> +
> +   1) exit condition gets removed (2 insns)
> +   2) increment of the control variable gets removed (2 insns)
> +   3) All remaining statements are likely to get simplified
> +  due to constant propagation.  Hard to estimate; just
> +  as a heuristics we decrease the rest by 1/3.
> +
> +   NINSNS is the number of insns in the loop before unrolling.
> +   NUNROLL is the number of times the loop is unrolled.  */
> +
> +static unsigned HOST_WIDE_INT
> +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +                         unsigned HOST_WIDE_INT nunroll)
> +{
> +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  if (unr_insns <= 0)
> +    unr_insns = 1;
> +  unr_insns *= (nunroll + 1);
> +
> +  return unr_insns;
> +}
> 
> And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> keeped.
> 
> +/* Estimate number of insns of completely unrolled loop.
> +   It is (NUNROLL + 1) * size of loop body with taking into account
> +   the fact that in last copy everything after exit conditional
> +   is dead and that some instructions will be eliminated after
> +   peeling.
> 
> -   NINSNS is the number of insns in the loop before unrolling.
> -   NUNROLL is the number of times the loop is unrolled.  */
> +   Loop body is likely going to simplify futher, this is difficult
> +   to guess, we just decrease the result by 1/3.  */
> 
>  static unsigned HOST_WIDE_INT
> -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +estimated_unrolled_size (struct loop_size *size,
>                           unsigned HOST_WIDE_INT nunroll)
>  {
> -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  HOST_WIDE_INT unr_insns = ((nunroll)
> +                             * (HOST_WIDE_INT) (size->overall
> +                                                - size->eliminated_by_peeling));
> +  if (!nunroll)
> +    unr_insns = 0;
> +  unr_insns += size->last_iteration
> +               - size->last_iteration_eliminated_by_peeling;
> +
> +  unr_insns = unr_insns * 2 / 3;
>    if (unr_insns <= 0)
>      unr_insns = 1;
> -  unr_insns *= (nunroll + 1);
> 
> It looks to me 1 / 3 overestimates the instructions that can be optimised 
> away,
> especially if we've subtracted eliminated_by_peeling

Yes, that 1/3 reduction is a bit odd - you could have the same effect
by increasing the instruction limit by 1/3, but that means it doesn't
really matter, does it?  It would be interesting to see if increasing
the limit by 1/3 and removing the above is neutral on SPEC?

Note this kind of "simplification guessing" is most important for
the 2nd-stage unrolling of an outer loop with an unrolled inner loop,
as there are 2nd-level recurrences to be optimized which the "eliminated
by peeling" heuristics do not get (but value-numbering would).  So another
thing to do would be to not do the 1/3 reduction for innermost loops
but only for loops further up.

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #12 from rguenther at suse dot de  ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #10 from Hongtao Liu  ---
> (In reply to Hongtao Liu from comment #9)
> > The original case is a little different from the one in PR.
> But the issue is similar, after cunrolli, GCC failed to vectorize the outer
> loop.
> 
> The interesting thing is in estimated_unrolled_size, the original unr_insns
> is 288 which is bigger than param_max_completely_peeled_insns (200), but
> unr_insns is decreased by 1/3 due to
> 
>Loop body is likely going to simplify further, this is difficult
>to guess, we just decrease the result by 1/3.  */
> 
> In practice, this loop body is not simplified by 1/3 of the instructions.
> 
> Considering the unroll factor is 16 and the unr_insns is large (192), I was
> wondering if we could add some heuristic to avoid complete loop unrolling,
> because usually for such a big loop, both the loop and BB vectorizer may
> not perform well.

There were several attempts at making the unroller guess less (that 1/3
reduction) and instead work out what actually will be simplified, to be
able to shrink those numbers.

My favorite (but never implemented) idea was to code-generate
optimistically while running value-numbering on-the-fly on the
code, and to cost the so-simplified unrolled code, stopping when we
reach a limit (and scrapping the so-far accumulated code).  While
reasonably "easy" for unrolled code that ends up without branches,
it gets complicated for branches.

My most recent attempt at improving was only for tracking what
unrolling estimates as ending up constant.

I think what might be the least controversical thing to do is to split
the instruction limit between the early cunrolli and the late cunroll
passes and lower the ones for cunrolli a lot.
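
For example (the first param exists today, the second is hypothetical,
just to illustrate the split):

  --param max-completely-peeled-insns=200         # late cunroll
  --param max-completely-peeled-insns-early=40    # early cunrolli, much lower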

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-02-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #28 from rguenther at suse dot de  ---
On Mon, 26 Feb 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #27 from Tamar Christina  ---
> Created attachment 57538
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57538&action=edit
> proposed1.patch
> 
> proposed patch, this gets the gathers and scatters back. doing regression run.

I don't think this will fly.

[Bug tree-optimization/114041] wrong code with _BitInt() and -O -fgraphite-identity

2024-02-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041

--- Comment #4 from rguenther at suse dot de  ---
On Thu, 22 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041
> 
> --- Comment #3 from Jakub Jelinek  ---
> I thought graphite is done only after bitint lowering.
> At that point, there should be just <= MAX_FIXED_MODE_SIZE BITINT_TYPEs around
> in the IL, with the exception of loading SSA_NAMEs from memory in order to 
> pass
> them to function calls, or undefined SSA_NAMEs, or usually TREE_ADDRESSABLE
> PARM_DECLs being converted to something else.
> Would be probably wise to punt on graphite optimizations on loops which have
> those larger SSA_NAMEs anywhere in the loop bodies.
> That said, ub4_0_17(D) is unsigned _BitInt(4).

I think we want to understand what goes wrong, not just disable
graphite on any BitInt.

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #23 from rguenther at suse dot de  ---
On Thu, 22 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #22 from Jakub Jelinek  ---
> Yeah, I was worried about partial ints.  Or it could be punt if TYPE_MODEs are
> different and at least one of them is BLKmode.

Well, then you can also check whether one mode is BLKmode.

Btw, I think forwprop is lucky if it really changes

  _2 = VIEW_CONVERT_EXPR<uint256_t>(x);

to

  _2 = (uint256_t) x_1(D);

because match on its own would create

  _2 = (uint256_t) x;

which would be invalid GIMPLE.  So are you sure it's not
update-address-taken first rewriting x to a register?

For 'foo' I see x becomes a register already during into-SSA so
another fix might be to re-consider and make x a non-register
because of that V_C_E?  (and hopefully the match pattern would
then not be applied, but I'm not 100% sure on that)

In principle there's nothing wrong with the transform; it's
the use of uint256_t that makes this problematic.

So maybe we should bite the bullet and try the memcpy folding
fix even at this stage ...

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #21 from rguenther at suse dot de  ---
On Thu, 22 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #20 from Jakub Jelinek  ---
> (In reply to rguent...@suse.de from comment #19)
> > I think the usual BLKmode check would be better here?  Apart from
> > that this looks correct, we shouldn't use a regular convert on
> > a non-register type.  In fact, it looks like all bitint types are
> > register types because we want SSA names for them.  A bit of a
> > "bad" design ...
> > 
> > We've used BLKmode checks elsewhere so I think it would be appropriate
> > here, too.
> 
> But we don't want to disallow turning VIEW_CONVERT_EXPR from one _BitInt to
> another _BitInt of the same size, even if they both have BLKmode.
> So, if we'd use BLKmode check, it would need to be one of the types is
> INTEGER_TYPE and the other type is BLKmode BITINT_TYPE (or vice versa).
> Or shall the test be TYPE_MODE (type) == TYPE_MODE (TREE_TYPE (@0)) regardless
> of precision?
> Or it could be the mode check && at least one of the involved types is
> BITINT_TYPE,
> that would maintain existing behavior when _BitInt isn't involved.
> 
> The #c18 patch passed bootstrap/regtest, the following:
> 2024-02-22  Jakub Jelinek  
> 
> PR tree-optimization/113988
> * match.pd (view_convert -> convert): Punt for VCEs involving
> BITINT_TYPEs if there is mode mismatch.
> 
> * gcc.dg/bitint-91.c: New test.
> 
> --- gcc/match.pd.jj 2024-02-19 09:42:16.583617451 +0100
> +++ gcc/match.pd2024-02-22 09:00:34.302420089 +0100
> @@ -4679,7 +4679,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(view_convert @0)
>(if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
> && (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE
> (@0)))
> -   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0)))
> +   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0))
> +   /* Punt for conversions with mode mismatches if BITINT_TYPEs are
> + involved.  */
> +   && (TYPE_MODE (type) == TYPE_MODE (TREE_TYPE (@0))

I like the mode check and I think we should avoid the transform
also for non-bit-int in case the mode differs, I can only think
of odd partial-int vs int mode cases here with pointer vs.
integer?

> +  || (TREE_CODE (type) != BITINT_TYPE
> +  && TREE_CODE (TREE_TYPE (@0)) != BITINT_TYPE)))
> (convert @0)))
> 
>  /* Strip inner integral conversions that do not change precision or size, or
> --- gcc/testsuite/gcc.dg/bitint-91.c.jj 2024-02-21 13:47:55.244885020 +0100
> +++ gcc/testsuite/gcc.dg/bitint-91.c2024-02-21 12:51:16.900026979 +0100
> @@ -0,0 +1,38 @@
> +/* PR tree-optimization/113988 */
> +/* { dg-do compile { target bitint } } */
> +/* { dg-options "-O2" } */
> +/* { dg-additional-options "-mavx512f" { target i?86-*-* x86_64-*-* } } */
> +
> +int i;
> +
> +#if __BITINT_MAXWIDTH__ >= 256
> +void
> +foo (void *p, _BitInt(256) x)
> +{
> +  __builtin_memcpy (p, &x, sizeof x);
> +}
> +
> +_BitInt(256)
> +bar (void *p, _BitInt(256) x)
> 
> passed
> make check-gcc check-g++ -j32 -k GCC_TEST_RUN_EXPENSIVE=1
> RUNTESTFLAGS="GCC_TEST_RUN_EXPENSIVE=1 dg.exp='*bitint* pr112673.c
> builtin-stdc-bit-*.c pr112566-2.c pr112511.c' dg-torture.exp=*bitint*
> dfp.exp=*bitint*"
> 
>

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #19 from rguenther at suse dot de  ---
On Wed, 21 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #17 from Jakub Jelinek  ---
> So, either we could somehow handle that case during expansion (treat it
> basically as VCE), or tweak the
> /* For integral conversions with the same precision or pointer
>conversions use a NOP_EXPR instead.  */
> (simplify
>   (view_convert @0)
>   (if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
>&& (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE 
> (@0)))
>&& TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0)))
>(convert @0)))
> match.pd rule not to do that for INTEGER_TYPEs with PRECISION >
> MAX_FIXED_TYPE_PRECISION (then we don't need the gimple-lower-bitint.cc 
> changes
> either).
> --- gcc/match.pd.jj 2024-02-19 09:42:16.583617451 +0100
> +++ gcc/match.pd2024-02-21 13:32:06.567816298 +0100
> @@ -4679,7 +4679,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(view_convert @0)
>(if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
> && (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE
> (@0)))
> -   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0)))
> +   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0))
> +   /* Punt for conversions from or to barely supported huge
> + INTEGER_TYPEs.  Those can handle just loads/stores/moves but
> + nothing else.  */
> +   && (TYPE_PRECISION (type) <= MAX_FIXED_MODE_SIZE
> +  || (TREE_CODE (type) != INTEGER_TYPE
> +  && TREE_CODE (TREE_TYPE (@0)) != INTEGER_TYPE)))
> (convert @0)))
> 
>  /* Strip inner integral conversions that do not change precision or size, or

I think the usual BLKmode check would be better here?  Apart from
that this looks correct, we shouldn't use a regular convert on
a non-register type.  In fact, it looks like all bitint types are
register types because we want SSA names for them.  A bit of a
"bad" design ...

We've used BLKmode checks elsewhere so I think it would be appropriate
here, too.

[Bug ipa/113476] [14 Regression] irange::maybe_resize leaks memory via IPA VRP

2024-02-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113476

--- Comment #17 from rguenther at suse dot de  ---
On Wed, 21 Feb 2024, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113476
> 
> --- Comment #8 from Aldy Hernandez  ---
> (In reply to Richard Biener from comment #5)
> > (In reply to Martin Jambor from comment #4)
> > > The right place where to free stuff in lattices post-IPA would be in
> > > ipa_node_params::~ipa_node_params() where we should iterate over lattices
> > > and deinitialize them or perhaps destruct the array because since
> > > ipcp_vr_lattice directly contains Value_Range which AFAIU directly 
> > > contains
> > > int_range_max which has a virtual destructor... does not look like a POD
> > > anymore.  This has escaped me when I was looking at the IPA-VR changes but
> > > hopefully it should not be too difficult to deal with.
> > 
> > OK, that might work for the IPA side.
> > 
> > It's quite unusual to introduce a virtual DTOR in the middle of the class
> > hierarchy though.  Grepping I do see quite some direct uses of 'irange'
> > and also 'vrange' which do not have the DTOR visible but 'irange' already
> > exposes and uses 'maybe_resize'.  I think those should only be introduced
> > in the class exposing the virtual DTOR (but why virtual?!).
> > 
> > Would be nice to have a picture of the range class hierarchies with
> > pointers on which types to use in which circumstances ...
> 
> For reference, you should always use the most base class you can.  If
> you can get the work done with the vrange API, use that.  If you're
> dealing with an integer, use irange.  The int_range<> derived class
> should only be used for defining a variable, so:
> 
> int_range<123> foobar; // Defined with storage.
> 
> if (is_a <irange> (foobar))
> {
>   irange &r = as_a <irange> (foobar);
>   ...
> }
> 
> // Use an irange reference here, not an int_range reference.
> void somefunc (const irange &r)
> {
> }
> 
> Also, the reason irange does not have a destructor is because it has
> no storage.  Only int_range<> has storage associated with it, so it is
> the only one able to see if the range grew:

But when I do

 irange *r = new int_range<>;
 delete r;

then it will fail to release memory?  Are virtual DTORs not exactly
invented for this, when you delete on a reference/pointer to a base
class but the object is really a derived one?

That's what I was referring to with "introducing a virtual DTOR
in a not base-class".

I still think that's bad OO design, no?
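
To make the object-lifetime concern concrete, a minimal C++ analogue
(base/derived are stand-ins, not the actual range classes):

  struct base {
    ~base () {}                // non-virtual destructor
  };

  struct derived : base {
    int *storage;
    derived () : storage (new int[128]) {}
    ~derived () { delete[] storage; }
  };

  void f ()
  {
    base *p = new derived;     // like: irange *r = new int_range<>;
    delete p;                  // undefined without a virtual ~base;
  }                            // ~derived never runs, storage leaks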

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #12 from rguenther at suse dot de  ---
On Mon, 19 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #10 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Jakub Jelinek from comment #7)
> > > > I think I can handle it like the VIEW_CONVERT_EXPR case, bet with
> > > > _BitInt(511) it would actually be a VCE, but when it is same size
> > > > BITINT_TYPE to INTEGER_TYPE we choose NOP_EXPR.
> > > > That said, I think it would be better if the memcpy folding used say 
> > > > vector
> > > > types instead of these extra large integer types.
> > > 
> > > Hmm.  Maybe we want a target hook to specify the "move mode" for a given
> > > byte size and then we can use bitwise_type_for_mode to get a type?
> > > 
> > > Maybe we can even get rid of that large integer mode requirement that way 
> > > ...
> > 
> > Or we refuse to use integer types for > MAX_FIXED_MODE_SIZE sizes and
> > instead always try VNQImode?  For memcpy folding I mean.
> 
> Yeah, and punt if that mode isn't supported.

Or simply use mode_for_size with MODE_VECTOR_INT (though that doesn't
check targetm.vector_mode_supported_p for vector modes).  Just
in case a target doesn't have QImode component support.
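
A rough sketch of that direction for the memcpy folding (an assumption of
how it could look, not a tested patch):

  /* Pick a vector integer mode covering SIZE bytes for the copy and
     only fold if the target really supports it.  */
  machine_mode mode;
  tree type = NULL_TREE;
  if (mode_for_size (size * BITS_PER_UNIT, MODE_VECTOR_INT, 0).exists (&mode)
      && targetm.vector_mode_supported_p (mode))
    type = bitwise_type_for_mode (mode);
  else
    /* Punt on folding the memcpy.  */;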

[Bug middle-end/111156] [14 Regression] aarch64 aarch64/sve/mask_struct_store_4.c failures

2024-02-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111156

--- Comment #22 from rguenther at suse dot de  ---
> On 15.02.2024 at 19:53, tnfchris at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111156
> 
> --- Comment #21 from Tamar Christina  ---
> (In reply to Richard Biener from comment #18)
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index 7cf9504398c..8deeecfd4aa 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -1280,8 +1280,11 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char
>> *swap,
>>&& rhs_code.is_tree_code ()
>>&& (TREE_CODE_CLASS (tree_code (first_stmt_code))
>>== tcc_comparison)
>> -   && (swap_tree_comparison (tree_code (first_stmt_code))
>> -   == tree_code (rhs_code)))
>> +   && ((swap_tree_comparison (tree_code (first_stmt_code))
>> +== tree_code (rhs_code))
>> +   || ((TREE_CODE_CLASS (tree_code (alt_stmt_code))
>> +== tcc_comparison)
>> +   && rhs_code == alt_stmt_code)))
>>   && !(STMT_VINFO_GROUPED_ACCESS (stmt_info)
>>&& (first_stmt_code == ARRAY_REF
>>|| first_stmt_code == BIT_FIELD_REF
>> 
>> should get you SLP but:
>> 
>> t.c:8:26: note:   === vect_slp_analyze_operations ===
>> t.c:8:26: note:   ==> examining statement: pretmp_29 = *_28;
>> t.c:8:26: missed:   unsupported load permutation
>> t.c:10:30: missed:   not vectorized: relevant stmt not supported: pretmp_29
>> = *_28;
>> 
>> t.c:8:26: note:   op template: pretmp_29 = *_28;
>> t.c:8:26: note: stmt 0 pretmp_29 = *_28;
>> t.c:8:26: note: stmt 1 pretmp_29 = *_28;
>> t.c:8:26: note: load permutation { 0 0 }
> 
> hmm with that applied I get:
> 
> sve-mis.c:8:26: note:   ==> examining statement: pretmp_29 = *_28;
> sve-mis.c:8:26: note:   Vectorizing an unaligned access.
> sve-mis.c:8:26: note:   vect_model_load_cost: unaligned supported by hardware.
> sve-mis.c:8:26: note:   vect_model_load_cost: inside_cost = 1, prologue_cost =
> 0 .
> 
> but it bails out at:
> 
> sve-mis.c:8:26: missed:   Not using elementwise accesses due to variable
> vectorization factor.
> sve-mis.c:10:25: missed:   not vectorized: relevant stmt not supported:
> .MASK_STORE (_5, 8B, _27, pretmp_29);
> sve-mis.c:8:26: missed:  bad operation or unsupported loop bound.
> 
> for me

I’ve used -fno-cost-model and looked at the SVE variant only.

> --
> You are receiving this mail because:
> You are the assignee for the bug.
> You are on the CC list for the bug.

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #35 from rguenther at suse dot de  ---
On Thu, 15 Feb 2024, rguenther at suse dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> 
> --- Comment #34 from rguenther at suse dot de  ---
> On Thu, 15 Feb 2024, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> > 
> > --- Comment #32 from Jakub Jelinek  ---
> > (In reply to Jan Hubicka from comment #31)
> [...]
> > > Not streaming value ranges is an omission on my side (I mistakenly
> > > assumed we do stream them).  We ought to stream them, since otherwise
> > > we will lose propagated return value ranges in partitioned programs,
> > > which is a pity.
> > 
> > Not sure if it is a good idea to start streaming them now in stage4, but 
> > sure,
> 
> Please don't do that now ;)
> 
> (we've not had ranges early originally, and even nonzero bits were not
> computed)
> 
> But yes, IPA CP jump functions with value-ranges might be a problem.
> I think it does instantiate them as local ranges, doesn't it?  We
> have streaming support for ranges, we just don't stream the auxiliary
> data for all SSA names (like also not points-to info).

Also remember we like to have a fix that's easily backportable, and
that's probably going to be resetting the info.  We can do something
more fancy for GCC 15

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #34 from rguenther at suse dot de  ---
On Thu, 15 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> 
> --- Comment #32 from Jakub Jelinek  ---
> (In reply to Jan Hubicka from comment #31)
[...]
> > Not streaming value ranges is an omission on my side (I mistakenly
> > assumed we do stream them).  We ought to stream them, since otherwise
> > we will lose propagated return value ranges in partitioned programs,
> > which is a pity.
> 
> Not sure if it is a good idea to start streaming them now in stage4, but sure,

Please don't do that now ;)

(we've not had ranges early originally, and even nonzero bits were not
computed)

But yes, IPA CP jump functions with value-ranges might be a problem.
I think it does instantiate them as local ranges, doesn't it?  We
have streaming support for ranges, we just don't stream the auxiliary
data for all SSA names (like also not points-to info).

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #18 from rguenther at suse dot de  ---
On Wed, 14 Feb 2024, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787
> 
> --- Comment #17 from Jan Hubicka  ---
> > > I guess PTA gets around by tracking points-to set also for non-pointer
> > > types and consequently it also gives up on any such addition.
> > 
> > It does.  But note it does _not_ for POINTER_PLUS where it treats
> > the offset operand as non-pointer.
> > 
> > > I think it is ipa-prop.c::unadjusted_ptr_and_unit_offset. It accepts
> > > pointer_plus expression, but does not look through POINTER_PLUS.
> > > We can restrict it further, but tracking base pointer is quite useful,
> > > so it would be nice to not give up completely.
> > 
> > It looks like that function might treat that
> > 
> >  ADDR_EXPR <TARGET_MEM_REF <...>>
> > 
> > as integer_zerop base.  It does
> > 
> >   if (TREE_CODE (op) == ADDR_EXPR) 
> > {
> >   poly_int64 extra_offset = 0; 
> >   tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
> >  &extra_offset);
> >   if (!base)
> > {
> >   base = get_base_address (TREE_OPERAND (op, 0));
> >   if (TREE_CODE (base) != MEM_REF)
> > break;
> >   offset_known = false;
> > }
> >   else
> > {
> >   if (TREE_CODE (base) != MEM_REF)
> > break;
> > 
> > with a variable offset we fall to the TREE_CODE (base) != MEM_REF
> > and will have offset_known == true.  Not sure what it does with
> > the result though (it's not the address of a decl).
> > 
> > This function seems to oddly special-case != MEM_REF ... (maybe
> > it wants to handle DECL_P () as finishing?
> 
> Hmm the function was definitely not written with TARGET_MEM_REF in mind,
> since it was originally used for IPA passes only.
> We basically want to handle stuff like
>  >foo
> or
>  &(ptr->foo)
> In the second case we want to continue the SSA walk to hopefully work
> out the origin of PTR.
> ipa-modref then looks if the base pointer is derived from function
> parameter or points to local or readonly memory to produce its summary.
> > 
> > Note get_addr_base_and_unit_offset will return NULL for
> > a TARGET_MEM_REF <, ..., offset> but TARGET_MEM_REF
> > itself if the base isn't an ADDR_EXPR, irrespective of whether
> > the offset within it is constant or not.
> 
> Hmm, interesting.  I would expect it to interpret the semantics of TMR
> and return base.
> > 
> > Not sure if the above is a problem, but it seems the only caller
> > will just call points_to_local_or_readonly_memory_p on the
> > ADDR_EXPR where refs_local_or_readonly_memory_p via
> > points_to_local_or_readonly_memory_p will eventually do
> > 
> >   /* See if memory location is clearly invalid.  */
> >   if (integer_zerop (t))
> > return flag_delete_null_pointer_checks;
> > 
> > and that might be a problem.  As said, we rely on
> > ADDR_EXPR <TARGET_MEM_REF <...>> to be an address computation
> > that's not subject to strict interpretation to allow IVOPTs
> > doing this kind of optimization w/o introducing some kind of
> > INTEGER_LEA <...>.  I know that's a bit awkward but we should
> > make sure this is honored by IPA as well.
> > 
> > I'd say
> > 
> > diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> > index 74c9b4e1d1e..45a770cf940 100644
> > --- a/gcc/ipa-fnsummary.cc
> > +++ b/gcc/ipa-fnsummary.cc
> > @@ -2642,7 +2642,8 @@ points_to_local_or_readonly_memory_p (tree t)
> > return true;
> >return !ptr_deref_may_alias_global_p (t, false);
> >  }
> > -  if (TREE_CODE (t) == ADDR_EXPR)
> > +  if (TREE_CODE (t) == ADDR_EXPR
> > +  && TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF)
> >  return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
> >return false;
> >  }
> > 
> > might eventually work?  Alternatively a bit less aggressive like
> > the following.
> > 
> > diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> > index 74c9b4e1d1e..7c79adf6440 100644
> > --- a/gcc/ipa-fnsummary.cc
> > +++ b/gcc/ipa-fnsummary.cc
> > @@ -2642,7 +2642,9 @@ points_to_local_or_readonly_memory_p (tree t)
> > return true;
> >return !ptr_deref_may_alias_global_p (t, false);
> >  }
> > -  if (TREE_CODE (t) == ADDR_EXPR)
> > +  if (TREE_CODE (t) == ADDR_EXPR
> > +  && (TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF
> > + || TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) != 
> > INTEGER_CST))
> >  return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
> >return false;
> >  }
> 
> Yes, those both look reasonable to me, perhaps less aggressive would be
> better.

Note I didn't check if it helps the testcase ..

> > 
> > A "nicer" solution might be to add a informational operand
> > to TARGET_MEM_REF, representing the base pointer to be used for
> > alias/points-to purposes.  

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #28 from rguenther at suse dot de  ---
On Wed, 14 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> 
> --- Comment #27 from Jakub Jelinek  ---
> So:
> --- gcc/ipa-icf.cc.jj   2024-02-10 11:25:09.645478952 +0100
> +++ gcc/ipa-icf.cc  2024-02-14 10:44:27.906216458 +0100
> @@ -1244,6 +1244,29 @@ sem_function::merge (sem_item *alias_ite
>else
>  create_alias = true;
> 
> +  unsigned i;
> +  tree name;
> +  FOR_EACH_SSA_NAME (i, name, original->get_fun ())
> +{
> +  /* We need to either merge or reset SSA_NAME_*_INFO.
> +For merging we don't preserve the mapping between
> +original and alias SSA_NAMEs from successful equals
> +calls.  */
> +  if (POINTER_TYPE_P (TREE_TYPE (name)))
> +{
> + if (SSA_NAME_PTR_INFO (name))
> +   {
> + gcc_assert (!flag_wpa);
> + SSA_NAME_PTR_INFO (name) = NULL;
> +   }
> +}
> +  else if (SSA_NAME_RANGE_INFO (name))
> +   {
> + gcc_assert (!flag_wpa);
> + SSA_NAME_RANGE_INFO (name) = NULL;
> +   }
> +}
> +
>if (redirect_callers)
>  {
>int nredirected = redirect_all_callers (alias, local_original);
> then?

Yeah, though can't we do this in the caller and thus only once for the
target when we merge more than 2 nodes?

> For the merging, I guess we'd need to move one of the 2 vec vectors
> from m_checker to the sem_function instead of throwing it away in
> sem_function::equals
> if returning true.

It might require quite some amount of memory though when N is big,
I'm not sure it's worth it?

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #16 from rguenther at suse dot de  ---
On Tue, 13 Feb 2024, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787
> 
> --- Comment #15 from Jan Hubicka  ---
> > 
> > IVOPTs does the above but it does it (or should) as
> > 
> >   offset = (uintptr)base2 - (uintptr)base1
> >   val = *((T *)((uintptr)base1 + i + offset))
> > 
> > which is OK for points-to as no POINTER_PLUS_EXPR is involved so the
> > resulting pointer points to both base1 and base2 (which isn't optimal
> > but correct).
> > 
> > If we somehow get back a POINTER_PLUS that's where things go wrong.
> > 
> > Doing the above in C code would be valid input so we have to treat
> > it correctly (OK, the standard only allows back-and-forth
> > pointer-to-integer casts w/o any adjustment, but of course we relax
> > this).
> 
> OK. Modrefs tracks base pointer for accesses and tries to prove that
> they are function parameters.  This should immitate ivopts:
> void
> __attribute__ ((noinline))
> set(int *a, unsigned long off)
> {
>   *(int *)((unsigned long)a + off) = 1;
> }
> int
> test ()
> {
>   int a;
>   int b = 0;
>   set (&a, (unsigned long)&b - (unsigned long)&a);
>   return b;
> }
> 
> Here set gets following gimple at modref2 time:
> __attribute__((noinline)) 
> void set (int * a, long unsigned int off)
> {
>   long unsigned int a.0_1;
>   long unsigned int _2;
>   int * _3; 
> 
>   <bb 2> [local count: 1073741824]:
>   a.0_1 = (long unsigned int) a_4(D);
>   _2 = a.0_1 + off_5(D); 
>   _3 = (int *) _2; 
>   *_3 = 1; 
>   return;
> 
> }
> 
> This is not pattern matched so modref does not think the access has a as
> a base:
> 
>   stores:
>   Base 0: alias set 1
> Ref 0: alias set 1
>   Every access
> 
> While for:
> 
> void
> __attribute__ ((noinline))
> set(int *a, unsigned long off)
> {
>   *(a+off/sizeof(int))=1;
> }
> 
> we produce:
> 
> __attribute__((noinline))
> void set (int * a, long unsigned int off)
> {
>   sizetype _1;
>   int * _2;
> 
>   <bb 2> [local count: 1073741824]:
>   _1 = off_3(D) & 18446744073709551612;
>   _2 = a_4(D) + _1;
>   *_2 = 1;
>   return;
> }
> 
> And this is understood:
> 
>   stores:
>   Base 0: alias set 1
> Ref 0: alias set 1
>   access: Parm 0
> 
> If we consider it correct to optimize out the conversion from and to
> pointer type, then I suppose any addition of pointer and integer which
> we do not see means that we need to give up on tracking base completely.
> 
> I guess PTA gets around by tracking points-to set also for non-pointer
> types and consequently it also gives up on any such addition.
> 
> But what we really get from relaxing this?
> > 
> > IVOPTs then in putting all of the stuff into 'offset' gets at
> > trying a TARGET_MEM_REF based on a NULL base but that's invalid.
> > We then resort to a LEA (ADDR_EXPR of TARGET_MEM_REF) to compute
> > the address which gets us into some fishy argument that it's
> > not valid to decompose ADDR_EXPR of TARGET_MEM_REF to
> > POINTER_PLUS of the TARGET_MEM_REF base and the offset.  But
> > that's how it is (points-to treats (address of) TARGET_MEM_REF
> > as pointing to anything ...).
> > 
> > > A quick fix would be to run IPA modref before ivopts, but I do not see 
> > > how such
> > > transformation can work with rest of alias analysis (PTA etc)
> > 
> > It does.  Somewhere IPA modref interprets things wrongly, I didn't figure
> > out here though.
> 
> 
> I guess PTA gets around by tracking points-to set also for non-pointer
> types and consequently it also gives up on any such addition.

It does.  But note it does _not_ for POINTER_PLUS where it treats
the offset operand as non-pointer.

> I think it is ipa-prop.c::unadjusted_ptr_and_unit_offset. It accepts
> pointer_plus expression, but does not look through POINTER_PLUS.
> We can restrict it further, but tracking base pointer is quite useful,
> so it would be nice to not give up completely.

It looks like that function might treat that

 ADDR_EXPR <TARGET_MEM_REF <...>>

as integer_zerop base.  It does

  if (TREE_CODE (op) == ADDR_EXPR) 
{
  poly_int64 extra_offset = 0; 
  tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
 &extra_offset);
  if (!base)
{
  base = get_base_address (TREE_OPERAND (op, 0));
  if (TREE_CODE (base) != MEM_REF)
break;
  offset_known = false;
}
  else
{
  if (TREE_CODE (base) != MEM_REF)
break;

with a variable offset we fall to the TREE_CODE (base) != MEM_REF
and will have offset_known == true.  Not sure what it does with
the result though (it's not the address of a decl).

This function seems to oddly special-case != MEM_REF ... (maybe
it wants to handle DECL_P () as finishing?

Note get_addr_base_and_unit_offset will return NULL for
a TARGET_MEM_REF <, ..., offset> but TARGET_MEM_REF
itself if the base isn't an ADDR_EXPR, irrespective of whether
the offset within it is constant or not.

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #14 from rguenther at suse dot de  ---
On Tue, 13 Feb 2024, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787
> 
> --- Comment #13 from Jan Hubicka  ---
> So my understanding is that ivopts does something like
> 
>  offset = &base2 - &base1
> 
> and then translate
>  val = base2[i]
> to
>  val = *((base1+i)+offset)
> 
> Where (base1+i) is then an iv variable.
> 
> I wonder if we consider doing memory reference with base changed via offset a
> valid transformation. Is there a way to tell when this happens?

IVOPTs does the above but it does it (or should) as

  offset = (uintptr)base2 - (uintptr)base1
  val = *((T *)((uintptr)base1 + i + offset))

which is OK for points-to as no POINTER_PLUS_EXPR is involved so the
resulting pointer points to both base1 and base2 (which isn't optimal
but correct).

If we somehow get back a POINTER_PLUS that's where things go wrong.

Doing the above in C code would be valid input so we have to treat
it correctly (OK, the standard only allows back-and-forth
pointer-to-integer casts w/o any adjustment, but of course we relax
this).
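
A C sketch of that input pattern (my own illustration of the relaxed
back-and-forth casts, not code from the PR):

  #include <stdint.h>

  double load_either (double *base1, double *base2, long i)
  {
    /* The addition happens on integers, so no POINTER_PLUS_EXPR is
       involved and the result may point to base1's or base2's object.  */
    uintptr_t offset = (uintptr_t) base2 - (uintptr_t) base1;
    return *(double *) ((uintptr_t) (base1 + i) + offset);
  }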

IVOPTs then in putting all of the stuff into 'offset' gets at
trying a TARGET_MEM_REF based on a NULL base but that's invalid.
We then resort to a LEA (ADDR_EXPR of TARGET_MEM_REF) to compute
the address which gets us into some fishy argument that it's
not valid to decompose ADDR_EXPR of TARGET_MEM_REF to
POINTER_PLUS of the TARGET_MEM_REF base and the offset.  But
that's how it is (points-to treats (address of) TARGET_MEM_REF
as pointing to anything ...).

> A quick fix would be to run IPA modref before ivopts, but I do not see how 
> such
> transformation can work with rest of alias analysis (PTA etc)

It does.  Somewhere IPA modref interprets things wrongly, I didn't figure
out here though.

[Bug c++/113852] -Wsign-compare doesn't warn on unsigned result types

2024-02-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113852

--- Comment #8 from rguenther at suse dot de  ---
On Mon, 12 Feb 2024, admin at computerquip dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113852
> 
> --- Comment #7 from Zachary L  ---
> (In reply to Richard Biener from comment #6)
> > Well, given athat a1 * a2 is carried out in 'int' you are invoking undefined
> > behavior if it overflows.  GCC assumes that doesn't happen so it's correct
> > to elide the diagnostic.  Unless you make overflow well-defined with 
> > -fwrapv.
> > 
> > I think that errors on the right side for the purpose of -Wsign-compare.
> 
> Removing a diagnostic because the program could be ill-formed seems backwards
> to me, especially since it seems like there's already logic in performing this
> diagnostic. Perhaps I've misunderstood the situation?

Well, we could, for each int * int diagnose a "may invoke undefined 
behavior if overflow happens at runtime" but that's hardly useful.
So we want to diagnose cases where it's _likely_ that there's an
issue.

For the case at hand it's not likely that ushort * ushort invokes
undefined behavior (because invoking undefined should be unlikely,
very much so).  So we choose to not diagnose it as "maybe".

Yes, with constexpr evaluation we actually _see_ we invoke undefined
behavior and diagnose that.  The subsequent diagnostic of
-Wsign-compare is then spurious though and IMO should not be given.
The error is the integer overflow invoking undefined behavior; even
if that didn't happen there would be no issue with comparing
signed/unsigned values.
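
For reference, a small example of the distinction (my own illustration):

  int f (unsigned short a1, unsigned short a2, int i)
  {
    /* a1 and a2 promote to int, so the multiplication is signed;
       e.g. 60000 * 60000 overflows INT_MAX and is undefined.  Still,
       warning on every ushort * ushort would be mostly noise, so
       -Wsign-compare stays quiet unless the overflow is actually seen,
       as in constexpr evaluation.  */
    return a1 * a2 < i;
  }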

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #17 from rguenther at suse dot de  ---
On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #16 from JuzheZhong  ---
> The FMA is generated in widening_mul PASS:
> 
> Before widening_mul (fab1):
> 
>   _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
>   _6 = _5 * 1.229982236431605997495353221893310546875e-1;
>   _8 = _4 + _6;

So this is x + (CST1 - x) * CST2 which we might fold/associate to
x * (1. - CST2) + CST1 * CST2

this looks like something for reassociation (it knows some rules,
like what it does in undistribute_ops_list, I'm not sure if that
comes into play here already, this would be doing the reverse
before).  A match.pd pattern also works, but it wouldn't be
general enough to handle more complicated but similar cases.
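
A quick numeric sanity check of that association (my own test program;
bit-exact equality of the two forms is not guaranteed, which is why this
is a reassociation and needs the usual unsafe-math freedoms):

  #include <stdio.h>

  int main (void)
  {
    double CST1 = 3.33314829616256247390992939472198486328125e-1;
    double CST2 = 1.229982236431605997495353221893310546875e-1;
    double x = 0.25;
    double a = x + (CST1 - x) * CST2;          /* original form */
    double b = x * (1. - CST2) + CST1 * CST2;  /* reassociated form */
    printf ("%.17g\n%.17g\n", a, b);
    return 0;
  }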

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #14 from rguenther at suse dot de  ---
On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #13 from JuzheZhong  ---
> Ok. I found the optimized tree:
> 
> 
>   _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
>   _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, 
> _4);
> 
> Let CST0 = 3.33314829616256247390992939472198486328125e-1,
> CST1 = 1.229982236431605997495353221893310546875e-1
> 
> The expression is equivalent to the following:
> 
> _5 = CST0 - _4;
> _8 = _5 * CST1 + _4;
> 
> That is:
> 
> _8 = (CST0 - _4) * CST1 + _4;
> 
> So, We should be able to re-associate it like Clang:
> 
> _8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);
> 
> Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation
> time.
> 
> Let's say CST2 = CST0 * CST1, CST3 = 1 - CST1, then we can re-associate as 
> Clang:
> 
> _8 = FMA (_4, CST3, CST2).
> 
> Any suggestions for this re-association ?  Is match.pd the right place to do 
> it
> ?

You need to look at the IL before we do .FMA forming, specifically 
before/after the late reassoc pass.  The pass applying match.pd
patterns everywhere is forwprop.

I also wonder which compilation flags you are using (note clang
has different defaults for example for -ftrapping-math)

[Bug ipa/113359] [13 Regression] LTO miscompilation of ceph on aarch64

2024-02-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359

--- Comment #13 from rguenther at suse dot de  ---
On Tue, 6 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359
> 
> --- Comment #12 from Jakub Jelinek  ---
> Just going from the demangled name of
> std::pair std::chrono::duration > > const,
> Context*>
> it would surprise me if it was ODR violation in the testcase, because class
> Context
> is only defined in Timer.ii, not in the other preprocessed source,
> ceph::mono_clock is defined in both but looks the same (and it is an empty class
> anyway), and the pair uses Context* as second type anyway, so it is unclear 
> how
> it could become something smaller than pointer.
> But, admittedly I haven't looked up at this under the debugger and not even
> sure where to look at that.

Might be also an interaction with IPA ICF in case there's a pointer to
the pair involved?

[Bug tree-optimization/113736] ICE: verify_gimple failed: incompatible types in 'PHI' argument 0 with _BitInt() struct copy to __seg_fs/gs

2024-02-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113736

--- Comment #3 from rguenther at suse dot de  ---
On Sat, 3 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113736
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>Last reconfirmed||2024-02-03
>  CC||rguenth at gcc dot gnu.org
>  Ever confirmed|0   |1
>  Status|UNCONFIRMED |NEW
> 
> --- Comment #1 from Jakub Jelinek  ---
> struct S { _BitInt(710) a; };
> struct T { struct S b[2]; };
> 
> void
> foo (__seg_gs struct T *p)
> {
>   struct S s;
>   p->b[0] = s;
> }
> 
> Bitint lowering changes here
>   MEM < _BitInt(768)> [( struct T *)p_2(D)] 
> =
> s_4(D);
> to
>   VIEW_CONVERT_EXPR(MEM < _BitInt(768)>
> [( struct T *)p_2(D)])[_5] = s_7(D);
> accesses in a loop.  Is that invalid and should have  also in
> the VCE type?  Or is this just a vectorizer bug?

I think that's OK, I will have a look.

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #19 from rguenther at suse dot de  ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #18 from JuzheZhong  ---
> (In reply to rguent...@suse.de from comment #17)
> > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > 
> > > --- Comment #16 from JuzheZhong  ---
> > > (In reply to rguent...@suse.de from comment #15)
> > > > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > > > 
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > > > 
> > > > > --- Comment #14 from JuzheZhong  ---
> > > > > Thanks Richard.
> > > > > 
> > > > > It seems that we can't fix this issue for now. Is that right ?
> > > > > 
> > > > > If I understand correctly, do you mean we should wait after SLP 
> > > > > representations
> > > > > are finished and then revisit this PR?
> > > > 
> > > > Yes.
> > > 
> > > It seems to be a big refactor work.
> > 
> > It's not too bad if people wouldn't continue to add features not 
> > implementing SLP ...
> > 
> > I wonder if I can do anything to help with SLP representations?
> > 
> > I hope to get back to this before stage1 re-opens and will post
> > another request for testing.  It's really mostly going to be making
> > sure all paths have coverage which means testing all the various
> > architectures - I can only easily test x86.  There's a branch
> > I worked on last year, refs/users/rguenth/heads/vect-force-slp,
> > which I use to hunt down cases not supporting SLP (it's a bit
> > overeager to trigger, and it has known holes so it's not really
> > a good starting point yet for folks to try other archs).
> 
> Ok. It seems that you are almost done with that but it needs more testing
> on various targets.
> 
> So, if I want to work on optimizing vectorization (start with TSVC),
> I should avoid touching the cases that fail to vectorize due to data
> reference/dependence analysis (e.g. this PR case, s116).

It depends on the actual case - the one in this bug at least looks like
half of it might be dealt with by the refactoring.

> and avoid adding new features into loop vectorizer, e.g. min/max reduction 
> with
> index (s315).

It's fine to add features if they work with SLP as well ;)  Note that
in the future SLP will also do the "single lane" case but it doesn't
do that on trunk.  Some features are difficult with multi-lane SLP
and probably not important in practice for that case, still handling
single-lane SLP will be important as otherwise the feature is lost.

> To not to make your SLP refactoring work heavier.
> 
> Am I right ?

Yes.  I've got early break vectorization to chase now, I was "finished"
with the parts I could exercise on x86_64 in autumn ...

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #17 from rguenther at suse dot de  ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #16 from JuzheZhong  ---
> (In reply to rguent...@suse.de from comment #15)
> > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > 
> > > --- Comment #14 from JuzheZhong  ---
> > > Thanks Richard.
> > > 
> > > It seems that we can't fix this issue for now. Is that right ?
> > > 
> > > If I understand correctly, do you mean we should wait after SLP 
> > > representations
> > > are finished and then revisit this PR?
> > 
> > Yes.
> 
> It seems to be a big refactor work.

It's not too bad if people wouldn't continue to add features not 
implementing SLP ...

> I wonder if I can do anything to help with SLP representations?

I hope to get back to this before stage1 re-opens and will post
another request for testing.  It's really mostly going to be making
sure all paths have coverage which means testing all the various
architectures - I can only easily test x86.  There's a branch
I worked on last year, refs/users/rguenth/heads/vect-force-slp,
which I use to hunt down cases not supporting SLP (it's a bit
overeager to trigger, and it has known holes so it's not really
a good starting point yet for folks to try other archs).

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #15 from rguenther at suse dot de  ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #14 from JuzheZhong  ---
> Thanks Richard.
> 
> It seems that we can't fix this issue for now. Is that right ?
> 
> If I understand correctly, do you mean we should wait after SLP 
> representations
> are finished and then revisit this PR?

Yes.

[Bug tree-optimization/113622] [11/12/13 Regression] ICE with vectors in named registers

2024-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113622

--- Comment #23 from rguenther at suse dot de  ---
On Tue, 30 Jan 2024, xry111 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113622
> 
> --- Comment #22 from Xi Ruoyao  ---
> On x86_64:
> 
> $ cat t.c
> typedef float __attribute__ ((vector_size (16))) vec;
> typedef int __attribute__ ((vector_size (16))) ivec;
> register vec a asm("xmm0"), b asm("xmm1");
> register ivec c asm("xmm2");
> 
> void
> test (void)
> {
>   for (int i = 0; i < 4; i++)
> c[i] = a[i] < b[i] ? -1 : 1;
> }
> $ gcc/cc1 -msse2 t.c -O2 -fno-vect-cost-model -nostdinc -ffixed-xmm{0,1,2}
> t.c: In function 'test':
> t.c:7:1: internal compiler error: in expand_expr_addr_expr_1, at expr.cc:9139
> 7 | test (void)
>   | ^~~~
> 0x10e6d6e expand_expr_addr_expr_1
> ../../gcc/gcc/expr.cc:9139
> 0x10e76e2 expand_expr_addr_expr
> ../../gcc/gcc/expr.cc:9252
> 0x10f73a7 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
> ../../gcc/gcc/expr.cc:12585
> 0x10e7dc8 expand_expr_real(tree_node*, rtx_def*, machine_mode, 
> expand_modifier,
> rtx_def**, bool)
> ../../gcc/gcc/expr.cc:9440
> 0xef7346 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> ../../gcc/gcc/expr.h:316
> 0x10e91fa expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
> ../../gcc/gcc/expr.cc:9762
> 0x10ef77d expand_expr_real_gassign(gassign*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
> ../../gcc/gcc/expr.cc:11096
> 0xf2db31 expand_gimple_stmt_1
> ../../gcc/gcc/cfgexpand.cc:4010
> 0xf2ddd4 expand_gimple_stmt
> ../../gcc/gcc/cfgexpand.cc:4071
> 0xf36844 expand_gimple_basic_block
> ../../gcc/gcc/cfgexpand.cc:6127
> 0xf38ff8 execute
> ../../gcc/gcc/cfgexpand.cc:6866
> 
> Should I open a new ticket or add back 14 Regression to the subject?

Please open a new ticket - this seems to be another vectorizer issue.

We end up with the invalid

_28 = (sizetype) 
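
For reference, the source-level analogue of taking such an address is
rejected outright, which is why the vectorizer must not introduce it; a
minimal sketch reusing the declarations from the testcase:

  typedef int __attribute__ ((vector_size (16))) ivec;
  register ivec c asm ("xmm2");

  void *g (void)
  {
    return (void *) &c;  /* error: address of global register
                            variable 'c' requested */
  }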

[Bug target/113059] [14 regression] fftw fails tests for -O3 -m32 -march=znver2 since r14-6210-ge44ed92dbbe9d4

2024-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113059

--- Comment #20 from rguenther at suse dot de  ---
On Tue, 30 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113059
> 
> --- Comment #18 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #17)
> > (In reply to Jakub Jelinek from comment #16)
> > > The question is revert what exactly?
> > > If we revert r14-6210, we get back the other P1.  Or do you mean revert
> > > r14-5355?
> > > I guess another option is move the vzeroupper pass one pass later, i.e.
> > > after pass_gcse.
> > 
> > I think moving mdreorg passes as late as possible esp. when they don't play
> > well with DF/notes is a good thing.  Maybe even after pass_rtl_dse2 and
> > thus after shrink-wrapping?
> 
> The thing is that the vzeroupper pass actually plays well with DF notes, the
> problem is that it now (in GCC 14) asks for them to be computed.
> The first issue was that vzeroupper before postreload_cse computed the notes,
> then
> postreload_cse CSEd something and made the REG_UNUSED invalid without killing
> them and then later passes went wrong because of the incorrect notes.
> This issue is that vzeroupper now after postreload_cse but before gcse2
> computes notes, then gcse2 CSEs something and makes REG_UNUSED invalid, rest 
> is
> the same.
> But, I believe gcse2 is the last CSE-ish pass.
> I wouldn't move it too much further, because I don't remember the interactions
> between vzeroupper, splitting and peepholes.

OK, so the "real" revert would then simply kill the notes actively
again after vzeroupper?  Btw, DSE also uses cselib; I'm not sure
whether it uses REG_UNUSED, but IIRC it does "redundant store" removal
and it does replace reads in some cases ... (not sure if after reload
though).

So for maximum safety if we'd have a way to kill off REG_UNUSED maybe
we should do that instead?  OTOH any "stray" valid REG_UNUSED
notes not causing issues with gcse or postreload_cse might not be
preserved and cause missed optimizations later ...

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #31 from rguenther at suse dot de  ---
On Tue, 30 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #30 from Richard Sandiford  ---
> (In reply to Richard Biener from comment #29)
> > But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> > VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> > masking for VECTOR_CST expansion as well, instead of trying to fixup
> > in do_compare_and_jump?
> But then how would ~ be implemented for things like 4-bit masks?
> If we use notqi2 then I assume the upper bits could be 1 rather than 0.

Yeah, I guess it's similar to expand_expr_real_1 'reduce_bit_field'
handling - we'd need to insert fixup code in strategic places
(or for ~ use xor with the proper mask).

The difficulty is that we can't make the backend do this unless
there are insn operands that allows it to infer the real precision
of the mode.  And for most insns the excess bits are irrelevant
anyway.

Still the CTOR case showed wrong-code issues with GCN, which possibly
means it has the same issue with VECTOR_CSTs as well.  IIRC that
was that all vectors are 1024bits, and its "fake" V4SImode insns
rely on accurate masked out upper bits.  That might hint that
compares are not enough here (but for non-compares the backend
might have a chance to fixup by infering the max. number of
active elements).

If we think that compares (but that would also be compares without
jump, aka a == b | c == d) are the only problematical case we can
also fixup at the uses rather than at the defs as 'reduce_bit_field'
tries to do.
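
A sketch of such a fixup at the use (an assumption about the shape, not
the committed fix; nunits stands for the number of mask elements):

  /* Zero the padding bits of vector-boolean values carried in an
     integer mode so excess bits cannot influence the comparison.  */
  rtx mask = GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1);
  op0 = expand_binop (mode, and_optab, op0, mask,
                      NULL_RTX, 1, OPTAB_LIB_WIDEN);
  op1 = expand_binop (mode, and_optab, op1, mask,
                      NULL_RTX, 1, OPTAB_LIB_WIDEN);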

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #10 from rguenther at suse dot de  ---
On Fri, 26 Jan 2024, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #9 from Robin Dapp  ---
> (In reply to rguent...@suse.de from comment #6)
> 
> > t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> > or not equal to 3
> > t.c:47:21: missed:   not falling back to elementwise accesses
> > t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> > *_3;
> > t.c:47:21: missed:  bad operation or unsupported loop bound.
> > 
> > where we don't consider using gather because we have a known constant
> > stride (20).  Since the stores are really scatters we don't attempt
> > to SLP either.
> > 
> > Disabling the above heuristic we get this vectorized as well, avoiding
> > gather/scatter by manually implementing them and using a quite high
> > VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> > faster code in the end).
> 
> I suppose you're referring to this?
> 
>   /* FIXME: At the moment the cost model seems to underestimate the
>  cost of using elementwise accesses.  This check preserves the
>  traditional behavior until that can be fixed.  */
>   stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>   if (!first_stmt_info)
> first_stmt_info = stmt_info;
>   if (*memory_access_type == VMAT_ELEMENTWISE
>   && !STMT_VINFO_STRIDED_P (first_stmt_info)
>   && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
>&& !DR_GROUP_NEXT_ELEMENT (stmt_info)
>&& !pow2p_hwi (DR_GROUP_SIZE (stmt_info
> {
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>  "not falling back to elementwise accesses\n");
>   return false;
> }
> 
> 
> I did some more tests on my laptop.  As said above the whole loop in lbm is
> larger and contains two ifs.  The first one prevents clang and GCC from
> vectorizing the loop, the second one
> 
> if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
> ux = 0.005;
> uy = 0.002;
> uz = 0.000;
> }
> 
> seems to be if-converted? by clang or at least doesn't inhibit vectorization.
> 
> Now if I comment out the first, larger if clang does vectorize the loop.  With
> the return false commented out in the above GCC snippet GCC also vectorizes,
> but only when both ifs are commented out.
> 
> Results (with both ifs commented out), -march=native (resulting in avx2), best
> of 3 as lbm is notoriously fickle:
> 
> gcc trunk vanilla: 156.04s
> gcc trunk with elementwise: 132.10s
> clang 17: 143.06s
> 
> Of course even the comment already said that costing is difficult and the
> change will surely cause regressions elsewhere.  However the 15% improvement
> with vectorization (or the 9% improvement of clang) IMHO show that it's surely
> useful to look into this further.  On top, the riscv clang seems to not care
> about the first if either and still vectorize.  I haven't looked closer what
> happens there, though.

Yes.  I think this shows we should remove the above hack and instead
try to fix the costing next stage1.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #19 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #18 from Richard Sandiford  ---
> (In reply to Tamar Christina from comment #17)
> > Well the mid-end has generated the right precision. The type it generates is
> > vector(4) <signed-boolean:1> vexit_reduc_67;
> > so it does say it's a single bit boolean.
> > 
> > Isn't this just an expand problem?
> That's what I meant.  expand is using a QImode comparison to compare things
> with 4-bit precision, so I think the masking should happen at that point.
> 
> How about doing the masking in do_compare_and_jump?

That sounds sensible.

Note that I wonder how to eliminate redundant maskings?  I suppose
eventually combine tracking nonzero bits where obvious would do
that?  For example for cmp:V4SI we know the bits will be zero but
I wonder if the RTL IL is obvious enough to derive this (or whether
there's a target hook for extra nonzero bit discovery, say for
unspecs).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #12 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #7 from Hongtao Liu  ---
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..33a8d539b4d 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned 
> char
> *ptr, int len,
> 
>unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
>unsigned int first_elt = off * elts_per_byte;
> -  unsigned int extract_elts = extract_bytes * elts_per_byte;
> +  unsigned int extract_elts = count;
>for (unsigned int i = 0; i < extract_elts; ++i)
> {
>   tree elt = VECTOR_CST_ELT (expr, first_elt + i);
> 
> Shouldn't we use count here? (It also fixed the hang.)

extract_bytes is capped by the buffer 'len':

  int total_bytes = CEIL (elt_bits * count, BITS_PER_UNIT);
..
  int extract_bytes = MIN (len, total_bytes - off);

we'd still need to effectively do that.  But yeah, using CEIL
makes extract_elts off.  Maybe we should simply calculate
extract_bits instead (but then use uint64 for that)
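
A sketch of that calculation (untested, my reading of the suggestion):

  /* Cap the extracted bits both by the buffer and by the elements
     actually present, then derive the element count exactly.  */
  uint64_t extract_bits
    = MIN ((uint64_t) extract_bytes * BITS_PER_UNIT,
           (uint64_t) elt_bits * count - (uint64_t) off * BITS_PER_UNIT);
  unsigned int extract_elts = extract_bits / elt_bits;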

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #6 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #5 from JuzheZhong  ---
> Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> 
> https://godbolt.org/z/MjbTbYf1G
> 
> But I am not sure X86 ICC is better or X86 Clang is better.

gather/scatter are possibly slow (and gather now has that Intel
security issue).  The reason is a "cost" one:

t.c:47:21: note:   ==> examining statement: _4 = *_3;
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
or not equal to 3
t.c:47:21: missed:   not falling back to elementwise accesses
t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
*_3;
t.c:47:21: missed:  bad operation or unsupported loop bound.

where we don't consider using gather because we have a known constant
stride (20).  Since the stores are really scatters we don't attempt
to SLP either.

Disabling the above heuristic we get this vectorized as well, avoiding
gather/scatter by manually implementing them and using a quite high
VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
faster code in the end).  But yes, I doubt that any of ICC or clang
vectorized codes are faster anywhere (but without specifying an
uarch you get some generic cost modelling applied).  Maybe SPR doesn't
have the gather bug and it does have reasonable gather and scatter
(zen4 scatter sucks).

.L3:
vmovsd  952(%rax), %xmm0
vmovsd  -8(%rax), %xmm2
addq$1280, %rsi
addq$1280, %rax
vmovhpd -168(%rax), %xmm0, %xmm1
vmovhpd -1128(%rax), %xmm2, %xmm2
vmovsd  -648(%rax), %xmm0
vmovhpd -488(%rax), %xmm0, %xmm0
vinsertf32x4$0x1, %xmm1, %ymm0, %ymm0
vmovsd  -968(%rax), %xmm1
vmovhpd -808(%rax), %xmm1, %xmm1
vinsertf32x4$0x1, %xmm1, %ymm2, %ymm2
vinsertf64x4$0x1, %ymm0, %zmm2, %zmm2
vmovsd  -320(%rax), %xmm0
vmovhpd -160(%rax), %xmm0, %xmm1
vmovsd  -640(%rax), %xmm0
vmovhpd -480(%rax), %xmm0, %xmm0
vinsertf32x4$0x1, %xmm1, %ymm0, %ymm1
vmovsd  -960(%rax), %xmm0
vmovhpd -800(%rax), %xmm0, %xmm8
vmovsd  -1280(%rax), %xmm0
vmovhpd -1120(%rax), %xmm0, %xmm0
vinsertf32x4$0x1, %xmm8, %ymm0, %ymm0
vinsertf64x4$0x1, %ymm1, %zmm0, %zmm0
vmovsd  -312(%rax), %xmm1
vmovhpd -152(%rax), %xmm1, %xmm8
vmovsd  -632(%rax), %xmm1
vmovhpd -472(%rax), %xmm1, %xmm1
vinsertf32x4$0x1, %xmm8, %ymm1, %ymm8
vmovsd  -952(%rax), %xmm1
vmovhpd -792(%rax), %xmm1, %xmm9
vmovsd  -1272(%rax), %xmm1
vmovhpd -1112(%rax), %xmm1, %xmm1
vinsertf32x4$0x1, %xmm9, %ymm1, %ymm1
vinsertf64x4$0x1, %ymm8, %zmm1, %zmm1
vaddpd  %zmm1, %zmm0, %zmm0
vaddpd  %zmm7, %zmm2, %zmm1
vfnmadd132pd%zmm3, %zmm2, %zmm1
vfmadd132pd %zmm6, %zmm5, %zmm0
valignq $3, %ymm1, %ymm1, %ymm2
vmovlpd %xmm1, -1280(%rsi)
vextractf64x2   $1, %ymm1, %xmm8
vmovhpd %xmm1, -1120(%rsi)
vextractf64x4   $0x1, %zmm1, %ymm1
vmovlpd %xmm1, -640(%rsi)
vmovhpd %xmm1, -480(%rsi)
vmovsd  %xmm2, -800(%rsi)
vextractf64x2   $1, %ymm1, %xmm2
vmovsd  %xmm8, -960(%rsi)
valignq $3, %ymm1, %ymm1, %ymm1
vmovsd  %xmm2, -320(%rsi)
vmovsd  %xmm1, -160(%rsi)
vmovsd  -320(%rax), %xmm1
vmovhpd -160(%rax), %xmm1, %xmm2
vmovsd  -640(%rax), %xmm1
vmovhpd -480(%rax), %xmm1, %xmm1
vinsertf32x4$0x1, %xmm2, %ymm1, %ymm2
vmovsd  -960(%rax), %xmm1
vmovhpd -800(%rax), %xmm1, %xmm8
vmovsd  -1280(%rax), %xmm1
vmovhpd -1120(%rax), %xmm1, %xmm1
vinsertf32x4$0x1, %xmm8, %ymm1, %ymm1
vinsertf64x4$0x1, %ymm2, %zmm1, %zmm1
vfnmadd132pd%zmm3, %zmm1, %zmm0
vaddpd  %zmm4, %zmm0, %zmm0
valignq $3, %ymm0, %ymm0, %ymm1
vmovlpd %xmm0, 14728(%rsi)
vextractf64x2   $1, %ymm0, %xmm2
vmovhpd %xmm0, 14888(%rsi)
vextractf64x4   $0x1, %zmm0, %ymm0
vmovlpd %xmm0, 15368(%rsi)
vmovhpd %xmm0, 15528(%rsi)
vmovsd  %xmm1, 15208(%rsi)
vextractf64x2   $1, %ymm0, %xmm1
vmovsd  %xmm2, 15048(%rsi)
valignq $3, %ymm0, %ymm0, %ymm0
vmovsd  %xmm1, 15688(%rsi)
vmovsd  %xmm0, 15848(%rsi)
cmpq%rdx, %rsi
jne .L3

[Bug tree-optimization/113281] [14 Regression] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027

2024-01-24 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

--- Comment #19 from rguenther at suse dot de  ---
On Wed, 24 Jan 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281
> 
> --- Comment #17 from Andrew Pinski  ---
> (In reply to rguent...@suse.de from comment #16)
> > On Wed, 24 Jan 2024, pinskia at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281
> > > 
> > > --- Comment #15 from Andrew Pinski  ---
> > > (In reply to Richard Biener from comment #14)
> > > > a >= 16 ? 0 : 32872 >> MIN (a, 15) (the MIN is still required to
> > > > avoid requiring masking).
> > > 
> > > Note maybe instead of MIN here we use `a & 0xf` since that will more 
> > > likely be
> > > cheaper than a MIN.
> > 
> > But it's incorrect (that's what I did originally).
> 
> But `(a>= 16)? 0: (32872>> (a&0xf))` is correct.
> 
> So is `(a>=16 ? 0 : 32872) >> ( a & 0xf)` .
> 
> Or is it you want to avoid the conditional here.

Yeah - at some point not trying to optimize the widening/shortening
is going to be cheaper, no?  Esp. if it is "just" because of
GIMPLE being limited with no way to express the desired semantics
for "out-of-bound" shifts (and no way to query target behavior for
them of course).

The widening/shortening might or might not have an effect on the
VF (we don't know) and we don't (yet) compute if the target
supports it.  We could force a smaller vector size (if the target
supports it) to avoid increasing the VF.
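
To make the difference concrete (my own worked example): for a = 17 an
unguarded 32872 >> (a & 0xf) yields 32872 >> 1 = 16436 instead of the
desired out-of-range result 0; under the guard both variants only ever
see a < 16, where MIN (a, 15) and a & 0xf both equal a:

  unsigned short f (unsigned int a)
  {
    /* Correct: the guard makes the masked amount coincide with a.  */
    return a >= 16 ? 0 : 32872 >> (a & 0xf);
  }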

[Bug tree-optimization/113281] [14 Regression] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027

2024-01-24 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

--- Comment #16 from rguenther at suse dot de  ---
On Wed, 24 Jan 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281
> 
> --- Comment #15 from Andrew Pinski  ---
> (In reply to Richard Biener from comment #14)
> > a >= 16 ? 0 : 32872 >> MIN (a, 15) (the MIN is still required to
> > avoid requiring masking).
> 
> Note maybe instead of MIN here we use `a & 0xf` since that will more likely be
> cheaper than a MIN.

But it's incorrect (that's what I did originally).

[Bug tree-optimization/113551] [13 Regression] Miscompilation with -O1 -funswitch-loops -fno-strict-overflow

2024-01-24 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113551

--- Comment #19 from rguenther at suse dot de  ---
On Tue, 23 Jan 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113551
> 
> --- Comment #13 from Andrew Pinski  ---
> (In reply to Yuxuan Shui from comment #12)
> > I think this is the MRE:
> > 
> > 
> > void bug(struct obj *dso) {
> > if (&dso->i) {
> > if (dso == (void *)0)
> > return;
> > 
> > assert_not_null(dso);
> > }
> > }
> 
> Except that is undefined ...
> Manually unswitching introduces the undefined behavior in the code.
> So even though the code was unswitched before GCC 13 incorrectly, GCC didn't
> take into that account before hand.
> 
> I am 99% sure what is happening is GCC sees `if (a+1)` and then assumes a
> can't be a nullptr.  Which is due to the undefinedness of adding 1 to a null
> ptr ... (for C that is).
> 
> Basically the unswitch is the issue ... Or maybe we should turn `if (a+1)`
> into just `if (a)` ...  Likewise for `if (&a->i)` into `if (a)`

It's not undefined on GIMPLE.  As said I believe this was fixed for GCC 14
but apparently not backported.  Bisection should tell.

[Bug tree-optimization/113467] [14 regression] libgcrypt-1.10.3 is miscompiled

2024-01-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113467

--- Comment #19 from rguenther at suse dot de  ---
> On 23.01.2024 at 18:06, tnfchris at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113467
> 
> --- Comment #18 from Tamar Christina  ---
> (In reply to Richard Biener from comment #7)
>> I do wonder whether LOOP_VINFO_EARLY_BREAKS_VECT_PEELED actually works (since
>> without early exits we cannot handle a non-empty latch because of correctness
>> issues).  I'd very much have preferred to deal with these by loop rotation
>> (there's the loop_ch pass).  We're still doing this, even when
>> LOOP_VINFO_EARLY_BREAKS_VECT_PEELED:
>> 
>>  /* We assume that the loop exit condition is at the end of the loop. i.e,
>> that the loop is represented as a do-while (with a proper if-guard
>> before the loop if needed), where the loop header contains all the
>> executable statements, and the latch is empty.  */
>>  if (!empty_block_p (loop->latch)
>>  || !gimple_seq_empty_p (phi_nodes (loop->latch)))
>>return opt_result::failure_at (vect_location,
>>   "not vectorized: latch block not
>> empty.\n");
>> 
>> so that's a bit odd (but loop_ch tries to ensure the latch is empty anyway).
>> 
>> Does the following fix the issue?
> 
> Not really sure I understand what the latch being empty has to do with
> LOOP_VINFO_EARLY_BREAKS_VECT_PEELED as the latch is still empty even with it.

The latch is everything after the IV exit.

> I guess if it's just going to disable it then wouldn't it be better to just
> always pick the latch exit rather than trying to do the whole analysis thing
> and maybe pick another exit while the main exit would have worked.

The point was to quickly see whether a peeled early exit vectorization is the
issue here.

[Bug ipa/107931] [12/13/14 Regression] -Og causes always_inline to fail since r12-6677-gc952126870c92cf2

2024-01-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107931

--- Comment #27 from rguenther at suse dot de  ---
On Tue, 23 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107931
> 
> --- Comment #26 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #25)
> > Btw, I'd rather go the opposite and make the testcase at hand always invalid
> > and diagnosed which means diagnose taking the address of always-inline
> > declared functions and never emit an out-of-line body for them.
> 
> That would need an exception at least for gnu extern inline always_inline
> functions,
> because the way they are used in glibc requires &open etc. to be valid
> (and then use the out-of-line open as fallback).

Sure, the C frontend should already resolve to the out-of-line open call
there; we shouldn't do this in the middle-end.  Yes, indirect 'open' will
then not be fortified, but so what.
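
For reference, the glibc pattern in question looks roughly like this
(a simplified sketch of the bits/fcntl2.h fortify wrapper; the real code
has more checks):

  extern int __open_alias (const char *__path, int __oflag, ...)
 __asm__ ("open");   /* the out-of-line open */

  extern __inline __attribute__ ((__gnu_inline__, __always_inline__))
  int open (const char *__path, int __oflag, ...)
  {
/* ... fortify argument checks elided ...  */
return __open_alias (__path, __oflag, __builtin_va_arg_pack ());
  }

so taking &open has to resolve to the out-of-line definition, not the
gnu_inline wrapper.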

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop

2024-01-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #15 from rguenther at suse dot de  ---
On Tue, 23 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #14 from JuzheZhong  ---
> I just tried again both GCC-13.2 and GCC-14 with -fno-vect-cost-model.
> 
> https://godbolt.org/z/enEG3qf5K
> 
> GCC-14 requires a scalar epilogue loop, whereas GCC-13.2 doesn't.
> 
> I believe it's not a cost model issue.

As said, please try to bisect to the point where we started to require
the epilogue.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop

2024-01-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #13 from rguenther at suse dot de  ---
On Tue, 23 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #12 from JuzheZhong  ---
> (In reply to Richard Biener from comment #11)
> > (In reply to Tamar Christina from comment #9)
> > > There is a weird costing going on in the PHI nodes though:
> > > 
> > > m_108 = PHI  1 times vector_stmt costs 0 in body 
> > > m_108 = PHI  2 times scalar_to_vec costs 0 in prologue
> > > 
> They have collapsed to 0, which can't be right...
> > 
> > Note this is likely because of the backend going wrong.
> > 
> > bool
> > vectorizable_phi (vec_info *,
> >   stmt_vec_info stmt_info, gimple **vec_stmt,
> >   slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> > {
> > ..
> > 
> >   /* For single-argument PHIs assume coalescing which means zero cost
> >  for the scalar and the vector PHIs.  This avoids artificially
> >  favoring the vector path (but may pessimize it in some cases).  */
> >   if (gimple_phi_num_args (as_a <gphi *> (stmt_info->stmt)) > 1)
> > record_stmt_cost (cost_vec, SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node),
> >   vector_stmt, stmt_info, vectype, 0, vect_body);
> > 
> > You could check if we call this with sane values.
> 
> Do you mean it's a RISC-V backend cost model issue?

I responded to Tamar, which means an aarch64 cost model issue - the
specific issue that the PHIs appear to have no cost.  I didn't look
at any of the rest.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #26 from rguenther at suse dot de  ---
On Fri, 19 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495
> 
> --- Comment #22 from JuzheZhong  ---
> (In reply to Richard Biener from comment #21)
> > I once tried to avoid df_reorganize_refs and/or optimize this with the
> > blocks involved but failed.
> 
> I am considering whether we should disable LICM for RISC-V by default if
> vector is enabled?
> Since a 10x compile-time explosion is really horrible.

I think that's a bad idea.  It only explodes for some degenerate cases.
The best would be to fix invariant motion to keep DF up-to-date so
it can stop using df_analyze_loop and instead analyze the whole function.
Or maybe change it to use the rtl-ssa framework instead.

There's already param_loop_invariant_max_bbs_in_loop:

  /* Process the loops, innermost first.  */
  for (auto loop : loops_list (cfun, LI_FROM_INNERMOST))
{
  curr_loop = loop;
  /* move_single_loop_invariants for very large loops is time consuming
 and might need a lot of memory.  For -O1 only do loop invariant
 motion for very small loops.  */
  unsigned max_bbs = param_loop_invariant_max_bbs_in_loop;
  if (optimize < 2)
max_bbs /= 10;
  if (loop->num_nodes <= max_bbs)
move_single_loop_invariants (loop);
}

it might be possible to restrict invariant motion to innermost loops
when the overall number of loops is too large (with a new param
for that).  And when the number of innermost loops also exceeds
the limit, avoid even that?  The above also misses an
optimize_loop_for_speed_p (loop) check (probably doesn't make
a difference, but you could try).
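
A sketch of that idea (param_loop_invariant_max_loops is a made-up name,
not an existing param):

  /* If the function has very many loops only do invariant motion
 for innermost loops.  */
  bool innermost_only
= number_of_loops (cfun) > (unsigned) param_loop_invariant_max_loops;

  for (auto loop : loops_list (cfun, innermost_only
   ? LI_ONLY_INNERMOST : LI_FROM_INNERMOST))
{
  curr_loop = loop;
  unsigned max_bbs = param_loop_invariant_max_bbs_in_loop;
  if (optimize < 2)
max_bbs /= 10;
  if (loop->num_nodes <= max_bbs
  && optimize_loop_for_speed_p (loop))
move_single_loop_invariants (loop);
}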

[Bug tree-optimization/113459] ICE: in as_a, at machmode.h:381 with memset() on a _BitInt() at -O1 and above

2024-01-18 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113459

--- Comment #4 from rguenther at suse dot de  ---
On Thu, 18 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113459
> 
> --- Comment #3 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #2)
> >   unsigned buflen = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (vr->type)) + 1;
> >   if (INTEGRAL_TYPE_P (vr->type))
> > buflen = GET_MODE_SIZE (SCALAR_INT_TYPE_MODE (vr->type)) + 1;
> > 
> > there's other spots using the pattern
> > 
> >   if (INTEGRAL_TYPE_P (type))
> > sz = GET_MODE_SIZE (SCALAR_INT_TYPE_MODE (type));
> > 
> > I wonder when GET_MODE_SIZE differs from TYPE_SIZE_UNIT?  PSImode?
> 
> I'm afraid I don't know either.
> 
> > Packed bitfields?  r10-6885-g5f9cd512c42786 added the INTEGRAL_TYPE_P
> > special-casing we now run into with the TYPE_SIZE_UNIT code being there
> > before.
> > 
> > Jakub, do you remember?
> 
> I bet the above comes from what the native_{encode,interpret}_int has been
> doing (and which has been tweaked for BITINT_TYPE).
> If we don't want to throw it away, we could just change it to
> if (INTEGRAL_TYPE_P (...) && TYPE_MODE (...) != BLKmode)

Yeah, that might work.  There are a few occurrences in tree-ssa-sccvn.cc
that need such adjustment.
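
Concretely, each such spot would then read (a sketch of the suggested
change):

  unsigned buflen = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (vr->type)) + 1;
  /* Only trust the mode size when the integral type actually has a
 scalar int mode; large BLKmode _BitInt falls back to TYPE_SIZE_UNIT.  */
  if (INTEGRAL_TYPE_P (vr->type)
  && TYPE_MODE (vr->type) != BLKmode)
buflen = GET_MODE_SIZE (SCALAR_INT_TYPE_MODE (vr->type)) + 1;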

[Bug tree-optimization/113372] wrong code with _BitInt() arithmetics at -O1

2024-01-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113372

--- Comment #18 from rguenther at suse dot de  ---
On Mon, 15 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113372
> 
> --- Comment #17 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #15)
> > (In reply to Jakub Jelinek from comment #14)
> > > Created attachment 57085 [details]
> > > gcc14-pr113372.patch
> > > 
> > > The non-propagation workaround which seems to fix^H^H^Hworkaround all 
> > > those
> > > 4 issues (PR90348 testcase actually doesn't FAIL anymore, but I've 
> > > verified
> > > the patch results in in and buf no longer being shared) can look like 
> > > this.
> > 
> > + || (INTEGRAL_TYPE_P (TREE_TYPE (use))
> > + && TYPE_PRECISION (TREE_TYPE (use)) == POINTER_SIZE)))
> > 
> > ptrofftype_p (TREE_TYPE (use))
> 
> Aren't there targets where pointers are larger than sizetype?
> I thought msp430, but that one uses __int20.

There are also address-spaces with pointer sizes different from
POINTER_SIZE.  I suppose tracking all INTEGRAL_TYPE_P uses and
instead relying on the def to identify a pointer source would work
as well.

> > I think it should be enough to look at gimple_assign_rhs1, that works
> > for single non-invariant [i], for conversions and for offsetting of
> > an invariant address (pointer-plus).
> 
> Is the invariant operand guaranteed to go first?  If it is a pointer, I guess
> POINTER_PLUS_EXPR enforces that, and for sizetype addition I guess an operand
> can't be an ADDR_EXPR, there would need to be a cast in a separate stmt.  So
> perhaps ok.

Yes, I think that works.

> As for Micha's fears, I can certainly try to dump statistics during
> bootstrap/regtest on how many variables were shared and/or their cumulative
> size without/with the patch and see if it has significant effects on 
> real-world
> code.

Might be interesting to dump the stack size saved due to sharing instead?

Do we really need to handle the PHI nodes btw?  With doing propagation
we could avoid marking certain use sites as mentions, like compares
of the address value.  But of course we'd have to give up for uses
in calls or other things we can't analyze.

[Bug tree-optimization/113372] wrong code with _BitInt() arithmetics at -O1

2024-01-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113372

--- Comment #13 from rguenther at suse dot de  ---
On Mon, 15 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113372
> 
> --- Comment #12 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #11)
> > So the key would be to make the object live again after a CLOBBER when such
> > address SSA name is used (but before any other explicit mention appears)?
> > 
> > The current algorithm relies on explitic mentions appearing, one original
> > idea was to turn those explicit mentions (or BLOCK starts) into
> > start-of-live CLOBBERs, but even that was shown to be not enough.
> > 
> > But yes, if we want to try to mitigate the problems somewhat without
> > doing a full solution then possibly even just looking at SSA defs
> > when POINTER_TYPE and the def is an ADDR_EXPR might work (in the
> > visit_* callbacks of the walk_stmt_load_store_addr_ops), no propagation
> > needed at all (basically just undo CSE virtually here for the simple
> > cases).
> 
> It couldn't be limited to just POINTER_TYPE, because it mostly is actually
> pointer-sized unsigned integer from IVOPTs, but yes, even without full
> propagation I think all the testcases I've mentioned could be solved by just
> doing the short walks in PHI arguments or other SSA_NAME uses (only PR111422
> needs the latter).
> For this PR, we'd need to visit for ivtmp.40_3 uses in PHI args the def-stmt:
>   ivtmp.40_3 = (unsigned long)   [(void *) +
> 8B];
> for PR90348 ivtmp.32_28's def-stmt:
>   ivtmp.32_28 = (unsigned long) 
> for PR110115 ivtmp.27_2's def-stmt:
>   ivtmp.27_2 = (unsigned long) 
> fpr PR111422 _44's def-stmt:
>   _44 =  + _43;
> So, if you think it is ok to use just a small hack rather than full 
> propagation
> (which could even handle const/pure calls perhaps if they return pointer or
> pointer-sized integer), I can implement that too.  Even the full propagation
> wouldn't handle everything of course, but could handle more than the small 
> hack
> (which would need to be limited to just say 4 def-stmts and wouldn't try to
> look through PHIs).

I think we don't expect arbitrarily complex obfuscation.  We do have
RPO order computed so propagation cost would be mostly a bunch of
bitmaps (possibly one per SSA name), but with PHIs we'd need to
iterate.  I wonder if this could be done along the propagation
add_scope_conflicts does itself - can we do this within
add_scope_conflicts_1 when for_conflict == false?  I'd think so
(though whether a change occurred is more difficult to track?).

What would you then consider a mention that's not explicit?
Any that we couldn't further propagate to its uses?  Thus also

  mem = _1;

with _1 having the address of some local?

[Bug tree-optimization/113372] wrong code with _BitInt() arithmetics at -O1

2024-01-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113372

--- Comment #5 from rguenther at suse dot de  ---
On Mon, 15 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113372
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||rguenth at gcc dot gnu.org
> 
> --- Comment #2 from Jakub Jelinek  ---
> Seems this is invalid variable sharing during RTL, I'm afraid we have tons of
> open bugs against that, and I'm afraid I don't know what to do on the
> bitint lowering side.
> Bitint lowering creates 3 large variables:
>   unsigned long bitint.8[75];
>   unsigned long bitint.7[75];
>   unsigned long bitint.6[100];
> where bitint.6 corresponds to
>   _3 = _2 % y_14(D);
> and later when _3 isn't needed anymore to
>   _8 = -y_14(D);
> bitint.7 corresponds to
>   _4 = -_3;
>   _5 = (unsigned _BitInt(4745)) _4;
> and bitint.8 corresponds to
>   _7 = _5 * _6;
> Reusing the same variable bitint.6 for 2 different _BitInt(6384) SSA_NAMEs
> ought to be fine, they aren't live at the same time.
> After lowering, we end up with:
>   .DIVMODBITINT (0B, 0, , 6384, , -8, , -6384);
>   // Above fills in bitint.6 array
> ...
>   loop which uses bitint.6 to compute bitint.7
>   bitint.6 ={v} {CLOBBER(eos)};
> ...
>   .MULBITINT (, 4745, , 4745, , -8);
>   // Above fills in bitint.8 array
> ...
>   loop to compute bitint.6
> ...
>   loop which uses bitint.6
> ...
>   _21 = MEM[(unsigned long *)];
>   _22 = (_BitInt(8)) _21;
>   _18 = _22;
>   return _18;
>   // I.e. use bitint.8 for the return value
> 
> But unfortunately RTL expansion decides to use the same underlying memory for
> bitint.6 and bitint.8 arrays, even when the second uses of bitint.6 happens
> when bitint.8 is live.
> 
> Or am I using incorrect CLOBBER kind when an underlying variable is used for
> more than one SSA_NAME?  Like should that be just eob rather than eos?

I think it's OK, but likely the address-taken of bitint.8 somehow breaks
things.  Does RTL expansion properly handle internal-functions here?

[Bug target/113059] [14 regression] fftw fails tests for -O3 -m32 -march=znver2 since r14-6210-ge44ed92dbbe9d4

2024-01-10 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113059

--- Comment #13 from rguenther at suse dot de  ---
On Wed, 10 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113059
> 
> --- Comment #12 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #11)
> > IMHO REG_UNUSED notes should stay correct or be removed.  Instead of a full
> > solution can we just wipe them in postreload-cse as we know that pass will
> > break them?  Any user then should recompute the notes.  Originally pass
> > properties were thought of a vehicle indicating these kind of IL features,
> > so we
> > could add PROP_rtl_unused_notes and if not set treat them as possibly
> > invalid when present (instead of actively wiping them), and when a pass
> > requires them re-compute them.
> > 
> > The problem with properties is that they are always set/unset irrespective
> > of whether a pass did something destroying it.  That could be
> > circumvented by postreload-cse only clearing the flag when it does
> > something.
> > 
> > Not (re-)using pass properties for this but a flag in struct function
> > works as well of course (or treat it as part of the DF state).
> 
> I was hoping we could have some df checking which would discover invalid
> REG_UNUSED notes and so we would know which passes need tweaking, but I'm
> afraid my df knowledge is insufficient for that.
> Dunno how often postreload-cse actually extends the lifetime of some register,
> which should determine whether we want to drop REG_UNUSED notes 
> unconditionally
> at the end of postreload-cse, or whether we e.g. want just some flag whether
> we've extended lifetime of something during it and only remove REG_UNUSED 
> notes
> if that flag is set.

In theory the insn analysis could look at REG_UNUSED notes and mark
them as "to be extended" (eventually even pointing to the note).  When
eliminating to such a marked REG we could either remove the note or
scrap all notes.

That said - not sure how exactly postreload-cse keeps its state.

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2023-12-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #3 from rguenther at suse dot de  ---
On Thu, 21 Dec 2023, fxue at os dot amperecomputing.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
> 
> --- Comment #2 from Feng Xue  ---
> (In reply to Richard Biener from comment #1)
> > See my proposal on the mailing list to lift the restriction of sticking to a
> > single vector size, I think this is another example showing this.  If you
> > use BB level vectorization by disabling loop vectorization but not SLP
> > vectorization the code should improve?
> 
> Yes, the loop is fully unrolled, and BB SLP would.

I suspect even when the loop isn't unrolled (just increase the iteration
count) the code would improve.

> I could not find the proposal, would you share me a link? Thanks

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640476.html

[Bug middle-end/113082] builtin transforms do not honor errno

2023-12-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113082

--- Comment #3 from rguenther at suse dot de  ---
On Tue, 19 Dec 2023, fw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113082
> 
> --- Comment #2 from Florian Weimer  ---
> (In reply to Richard Biener from comment #1)
> > Joseph - I wonder if the standard folks can be convinced to amend most
> > library function documentation as to not altering 'errno' (like memcpy,
> > strlen, etc.)?
> > 
> > Should we simply document our constraints on supported library
> > implementations?
> 
> We can add attributes to the glibc headers, similar to the throw and leaf
> annotation we have today. It would act as a reminder that if we clobber errno
> in these functions due to some implementation detail, we need to save/restore
> errno.

I guess a new 'noerrno' attribute would make sense.  What I've always
wanted is also a reliable way to distinguish accesses to 'errno'
from other accesses (unfortunately 'errno' is an lvalue so it's
address can be taken).  glibc uses __errno_location (), so an
additional attribute indicating the function returns the address
of 'errno' would be nice to have as well ('errno', just like we
have 'malloc' for malloc results?).  At the moment we can just
use TBAA ('errno' is an lvalue of type 'int') for disambiguation
but 'int' is a bit generic for that to be of much help.
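
To illustrate, such annotations might look like this in the glibc
headers (both attribute names are hypothetical, nothing like them
exists today):

  #include <stddef.h>

  /* Promises that the call leaves errno untouched.  */
  extern void *memcpy (void *__dest, const void *__src, size_t __n)
__attribute__ ((__noerrno__));

  /* Marks the function as returning the address of errno so the
 compiler can disambiguate errno accesses from other int accesses.  */
  extern int *__errno_location (void)
__attribute__ ((__errno__));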

[Bug tree-optimization/113026] Bogus -Wstringop-overflow warning on simple memcpy type loop

2023-12-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113026

--- Comment #5 from rguenther at suse dot de  ---
On Fri, 15 Dec 2023, avieira at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113026
> 
> --- Comment #4 from avieira at gcc dot gnu.org ---
> Drive by comments as it's been a while since I looked at this. I'm also
> surprised we didn't adjust the bounds. But why do we only subtract VF? Like 
> you
> say, if there's no loop around edge, can't we guarantee the epilogue will only
> need to iterate at most VF-1?  This is assuming we didn't take an early exit,
> if we do then we can't assume anything as the iterations 'reset'.

Subtracting can bring the max epilogue iterations down to 1; and yes,
we should also apply a max of VF-1, but in addition to the subtraction.
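
I.e. roughly max_epilogue_niters = MIN (max_niters - VF, VF - 1), valid
when there is no loop-around edge from an early exit.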

[Bug libstdc++/105562] [12 Regression] std::function::_M_invoker may be used uninitialized in std::regex move with -fno-strict-aliasing

2023-12-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105562

--- Comment #26 from rguenther at suse dot de  ---
On Wed, 6 Dec 2023, romain.geissler at amadeus dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105562
> 
> --- Comment #25 from Romain Geissler  ---
> So it means we should rather go for "silencing" workaround from comment #19 ?

No, I think people shouldn't care for warnings with -fsanitize=...
(or not use -Wall).

[Bug tree-optimization/112303] [14 Regression] ICE on valid code at -O3 on x86_64-linux-gnu: verify_flow_info failed since r14-3459-g0c78240fd7d519

2023-12-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112303

--- Comment #8 from rguenther at suse dot de  ---
On Tue, 5 Dec 2023, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112303
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||jakub at gcc dot gnu.org,
>||rguenth at gcc dot gnu.org
>Keywords|needs-bisection |
> 
> --- Comment #7 from Jakub Jelinek  ---
> Doesn't ICE since r14-6010-g2dde9f326ded84814a78c3044294b535c1f97b41
> No idea whether that was the fix for this or just something that made it
> latent.

I'm quite sure it just made it latent.

[Bug target/112773] [14 Regression] RISC-V ICE: in force_align_down_and_div, at poly-int.h:1828 on rv32gcv_zvl256b

2023-12-01 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112773

--- Comment #12 from rguenther at suse dot de  ---
On Fri, 1 Dec 2023, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112773
> 
> --- Comment #11 from Robin Dapp  ---
> When I define a vec_extract...bi pattern we don't enter the if (vec_extract) 
> in
> expmed because e.g.
> 
> bitsize = {1, 0}
> bitnum = {3, 4}
> 
> and GET_MODE_BITSIZE (innermode) = {1, 0} with innermode = BImode.
> 
> This fails multiple_p (bitnum, GET_MODE_BITSIZE (innermode), ).
> 
> It is a multiple of course, but still dependent on the actual vector length.
> (So we would also need to extract a [3 4] from the vector).
> 
> That would be the same as an extract_last with a CONST_M1 mask.  Maybe that's
> an option?  So if we have an extract_last and no loop len or mask fall back to
> an extract_last with a full mask?  That would delegate the length calculation
> to the backend.

Hm, yeah, but is that an issue?  I guess all vec_extract can end up
at a variable position with VLA?

[Bug target/112411] ICE: SIGSEGV with --param=min-nondebug-insn-uid=2147483647 on powerpc64le-unknown-linux-gnu

2023-11-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112411

--- Comment #6 from rguenther at suse dot de  ---
On Thu, 30 Nov 2023, zsojka at seznam dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112411
> 
> --- Comment #5 from Zdenek Sojka  ---
> Thank you for the evaluation.
>
> - params affecting inlining or unrolling often cause ICEs due to
> function::last_clique wraparound (it's a 16-bit uint)

Ah, do you have a testcase for this?  I see we have an assert
in IL verification but not really any handling of the overflow
itself (there is a way to gracefully handle it though!).  A testcase
would really be useful (in a new bug).

[Bug middle-end/32667] block copy with exact overlap is expanded as memcpy

2023-11-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32667

--- Comment #53 from rguenther at suse dot de  ---
On Tue, 28 Nov 2023, post+gcc at ralfj dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32667
> 
> --- Comment #51 from post+gcc at ralfj dot de ---
> Oh great, I love it when one part of the C standard just adds exceptions to
> statements made elsewhere. It's almost as if the authors want this to be as
> hard to understand as possible...
> 
> That then raises the question which version of the signature is actually used
> for building (and optimizing) the function: the one in the declaration or the
> one in the definition. Does the standard have an answer to that?

For the avoidance of doubt the frontends should drop non-semantic qualifiers
from declarations, just in case the middle-end tries to apply
semantics there.  Like it does for const qualified reference arguments
(OK, that's not C but C++).  The middle-end also uses the qualifiers
for diagnostic purposes.

[Bug tree-optimization/52252] An opportunity for x86 gcc vectorizer (gain up to 3 times)

2023-11-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

--- Comment #11 from rguenther at suse dot de  ---
On Tue, 28 Nov 2023, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252
> 
> Andrew Pinski  changed:
> 
>What|Removed |Added
> 
>  CC||pinskia at gcc dot gnu.org
> 
> --- Comment #10 from Andrew Pinski  ---
> Note there is also a missing scalar optimization here (which will improve
> the vectorized version in the end too).
> 
> Right now we have the following match pattern:
> /* MIN (~X, ~Y) -> ~MAX (X, Y)
>MAX (~X, ~Y) -> ~MIN (X, Y)  */
> (for minmax (min max)
>  maxmin (max min)
>  (simplify
>   (minmax (bit_not:s@2 @0) (bit_not:s@3 @1))
>   (bit_not (maxmin @0 @1)))
> 
> 
> But that does not match here due to the :s. I am not 100% sure but trading 2
> possible bit_not for adding another might end up improving things ...

We're lacking a way to say one of the bit_nots should be single-used while
multi-use of the other would be OK and a fair trade-off - not sure if that
would be enough here, of course.  That would mean changing to
a condition with single_use ().
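
An untested sketch of that in match.pd, trading the :s flags for an
explicit condition that at least one of the bit_nots goes dead:

/* MIN (~X, ~Y) -> ~MAX (X, Y)
   MAX (~X, ~Y) -> ~MIN (X, Y)  */
(for minmax (min max)
 maxmin (max min)
 (simplify
  (minmax (bit_not@2 @0) (bit_not@3 @1))
  (if (single_use (@2) || single_use (@3))
   (bit_not (maxmin @0 @1)))))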

[Bug lto/112716] LTO optimization with struct with variable size

2023-11-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112716

--- Comment #6 from rguenther at suse dot de  ---
On Mon, 27 Nov 2023, muecker at gwdg dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112716
> 
> --- Comment #5 from Martin Uecker  ---
> It works (and is required to work) for other types, e.g.
> 
> void bar(void *q);
> 
> [[gnu::noinline,gnu::noipa]]
> int foo(void *p, void *q)
> {
> int n = 5;
> int (*p2)[n] = p;
> (*p2)[0] = 1;
> bar(q);
> return (*p2)[0];
> }
> 
> void bar(void* q)
> {   
> int n = 5;
> int (*q2)[n] = q;
> (*q2)[0] = 2;
> }
> 
> One could argue that there is a weaker requirement for having an object of 
> type
> int[n] present than for struct { int x[n]; } because we do not access the 
> array
> directly but it decays to a pointer. (but then, other languages have array
> assignment, so why does the middle-end care about this C peculiarity?) 

So in theory we could disregard the VLA-sized components for TBAA,
which would make the access behave as if it were an int * indirect access.
I think if you write it as an array as above that's already what happens.

Note that even without LTO when you enable inlining you'd expose two
different structures with two different alias-sets, possibly leading
to wrong-code (look at the RTL expansion dump and check alias-sets).

As said, for arrays it probably works as-is since these get the alias
set of the component.

> There is also no indication in documentation that structs with variable size
> follow different rules than conventional structs.   So my preference would be
> to fix this somehow.  Otherwise we should document this as a limitation.

Locally declared structs in principle follow the same logic (but they
get promoted "global" due to implementation details I think).

[Bug bootstrap/111601] [14 Regression] profilebootstrap fails in stagestrain in libcody on x86_64-linux-gnu and powerpc64le-linux-gnu

2023-11-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111601

--- Comment #12 from rguenther at suse dot de  ---
On Mon, 27 Nov 2023, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111601
> 
> --- Comment #11 from Jakub Jelinek  ---
> r14-4662 still builds ok and r14-4668 fails, wonder which of those commits it
> is.

My bet is r14-4664-g04c9cf5c786b94

[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function

2023-11-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #9 from rguenther at suse dot de  ---
On Fri, 24 Nov 2023, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653
> 
> --- Comment #8 from Jan Hubicka  ---
> On ARM32 and other targets methods return this pointer.  Together with making
> the return value escape this probably completely disables any chance for IPA
> tracking of C++ data types...

Note that all call and global var escapes would also go to
"return escape", so the only extra bit you get is the reverse:
what escapes through the explicit return doesn't escape through calls.
We'd have

RETURN_ESCAPED = ESCAPED

and handle return by adding to RETURN_ESCAPED but not ESCAPED.
Then for example DSE needs to properly query whether sth escaped
through function return (ptr_deref_may_alias_global_p & friends).

The cost is an extra bitmap in struct function.

I will have a look if it's reasonably easy to implement.

[Bug middle-end/111655] [11/12/13/14 Regression] wrong code generated for __builtin_signbit and 0./0. on x86-64 -O2

2023-11-24 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655

--- Comment #14 from rguenther at suse dot de  ---
On Fri, 24 Nov 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655
> 
> --- Comment #13 from Alexander Monakov  ---
> > Then there is the MULT_EXPR x * x case
> 
> This is PR 111701.
> 
> It would be nice to clarify what "nonnegative" means in the contracts of this
> family of functions, because it's ambiguous for NaNs and negative zeros (x < 0
> is false while signbit is set, and x >= 0 is also false for positive NaNs).

Agreed, I think we're using it in both ways which is the problem in
the end.  Maybe having _compares_nonnegative and _sign_positive
would clarify this.

[Bug middle-end/32667] block copy with exact overlap is expanded as memcpy

2023-11-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32667

--- Comment #33 from rguenther at suse dot de  ---
On Thu, 23 Nov 2023, fw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32667
> 
> --- Comment #32 from Florian Weimer  ---
> There's this in standards.texi:
> 
> Most of the compiler support routines used by GCC are present in
> @file{libgcc}, but there are a few exceptions.  GCC requires the
> freestanding environment provide @code{memcpy}, @code{memmove},
> @code{memset} and @code{memcmp}.
> Finally, if @code{__builtin_trap} is used, and the target does
> not implement the @code{trap} pattern, then GCC emits a call
> to @code{abort}.
> 
> Maybe that would be a place to mention this issue, too?

Will add.

[Bug middle-end/32667] block copy with exact overlap is expanded as memcpy

2023-11-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32667

--- Comment #25 from rguenther at suse dot de  ---
On Tue, 21 Nov 2023, bugdal at aerifal dot cx wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32667
> 
> Rich Felker  changed:
> 
>What|Removed |Added
> 
>  CC||bugdal at aerifal dot cx
> 
> --- Comment #24 from Rich Felker  ---
> If the copy is such that gcc is happy to emit an external call to memcpy for
> it, there is no significant size or performance cost to emitting a branch
> checking for equality before making the call, and performing this branch would
> greatly optimize the (maybe rare in the caller, maybe not) case of
> self-assignment!
> 
> On the other hand, expecting the libc memcpy to make this check greatly
> pessimizes every reasonable small use of memcpy with a gratuitous branch for
> what is undefined behavior and should never appear in any valid program.
> 
> Fix it on the compiler side please.

The only reasonable fix on the compiler side is to never emit memcpy
but always use memmove.  The pessimization of that, compared to the
non-existent "real" issue of calling memcpy with exact overlap,
is the reason for the non-action.
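
For reference, the caller-side check asked for above would amount to
expanding such copies as

  if (dst != src)
memcpy (dst, src, n);

whereas always emitting memmove pushes the overlap handling into the
callee.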

[Bug tree-optimization/111970] [14 regression] SLP for non-IFN gathers result in RISC-V test failure on gather since r14-4745-gbeab5b95c58145

2023-11-20 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111970

--- Comment #15 from rguenther at suse dot de  ---
On Mon, 20 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111970
> 
> --- Comment #11 from JuzheZhong  ---
> Hi, Richard.
> 
> I come back to revisit this bug.
> 
> I found if I do this:
> 
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 4a09b3c2aca..2fd128672b9 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -1434,7 +1434,6 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char
> *swap,
>   && rhs_code != CFN_GATHER_LOAD
>   && rhs_code != CFN_MASK_GATHER_LOAD
>   && rhs_code != CFN_MASK_LEN_GATHER_LOAD
> - && !STMT_VINFO_GATHER_SCATTER_P (stmt_info)
>   /* Not grouped loads are handled as externals for BB
>  vectorization.  For loop vectorization we can handle
>  splats the same we handle single element interleaving.  */
> 
> 
> The bug is fixed. But I am not sure whether it is the correct fix.

That will simply disable SLP recognition for the case in question.

[Bug tree-optimization/109088] GCC does not always vectorize conditional reduction

2023-11-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088

--- Comment #22 from rguenther at suse dot de  ---
On Thu, 16 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
> 
> --- Comment #21 from JuzheZhong  ---
> Thanks Richi.
> 
> Is re-associating (with eliminating exceptions) in if-convert a reasonable
> approach?

Yeah, I don't think we have a much better place at the moment.

[Bug tree-optimization/112374] [14 Regression] Failed bootstrap with `--with-arch=skylake-avx512 --with-cpu=skylake-avx512`, causes a comparison failure since r14-5076-g01c18f58d37865d5f3bbe93e666183b

2023-11-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374

--- Comment #33 from rguenther at suse dot de  ---
On Wed, 15 Nov 2023, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374
> 
> --- Comment #29 from Jakub Jelinek  ---
> --- gcc/tree-vect-loop.cc.jj2023-11-14 10:35:52.0 +0100
> +++ gcc/tree-vect-loop.cc   2023-11-15 22:42:32.782007408 +0100
> @@ -4105,9 +4105,9 @@ pop:
> /* In case of a COND_OP (mask, op1, op2, op1) reduction we might have
>op1 twice (once as definition, once as else) in the same operation.
>Allow this.  */
> - if (cond_fn_p)
> + if (cond_fn_p && op_use_stmt == use_stmt)
> {
> - gcall *call = dyn_cast <gcall *> (use_stmt);
> + gcall *call = as_a <gcall *> (use_stmt);
>   unsigned else_pos
> = internal_fn_else_index (internal_fn (op.code));
> 
> doesn't cure the -fcompare-debug failure though.

Looks like a relevant fix tho.

[Bug tree-optimization/109088] GCC does not always vectorize conditional reduction

2023-11-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088

--- Comment #20 from rguenther at suse dot de  ---
On Wed, 15 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
> 
> --- Comment #19 from JuzheZhong  ---
> I have added:
> 
> +  if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))

For TYPE_OVERFLOW_UNDEFINED you have to convert the ops to unsigned
to avoid spurious undefined overflow.

> + && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
> +  && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
> +  && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
> +  && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))

You should check flag_associative_math which covers one half
and !flag_trapping_math which covers spurious FP exceptions.

> +   return false;
> 
> for floating-point.  I fail to see which situation will cause FP exceptions?

Ops with NaN cause INVALID, but there's also INEXACT which can be
set differently after re-association.
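
Putting the two together, the guard might look roughly like this
(a sketch, not a tested patch):

  tree type = TREE_TYPE (PHI_RESULT (phi));
  /* Integer reductions with undefined overflow would need the ops
 rewritten to unsigned; FP re-association requires -fassociative-math
 and, to avoid spurious FP exceptions, -fno-trapping-math.  */
  if (ANY_INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_UNDEFINED (type))
return false;
  if (FLOAT_TYPE_P (type)
  && !(flag_associative_math && !flag_trapping_math))
return false;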

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

--- Comment #9 from rguenther at suse dot de  ---
On Tue, 14 Nov 2023, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481
> 
> --- Comment #7 from Andrew Stubbs  ---
> Simply changing to OPTAB_WIDEN solves the ICE, but I don't know if it does so
> in a sensible way for RISC-V.
> 
> @@ -7489,7 +7489,7 @@ store_constructor (tree exp, rtx target, int cleared,
> poly_int64 size,
> if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
>   tmp = expand_binop (mode, and_optab, tmp,
>   GEN_INT ((1 << nunits) - 1), target,
> - true, OPTAB_DIRECT);
> + true, OPTAB_WIDEN);
> if (tmp != target)
>   emit_move_insn (target, tmp);
> break;
> 
> Here are the instructions it generates:
> 
> (set (reg:DI 165)
> (and:DI (subreg:DI (reg:SI 164) 0)
> (const_int 1 [0x1])))
> (set (reg:SI 154)
> (subreg:SI (reg:DI 165) 0))
> 
> Should I use that patch? I think it's harmless on targets where OPTAB_DIRECT
> would work.

I think that's sensible.

[Bug c/111811] [14 Regression] ICE with vector float bitfield after error

2023-11-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111811

--- Comment #5 from rguenther at suse dot de  ---
On Mon, 13 Nov 2023, joseph at codesourcery dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111811
> 
> --- Comment #4 from joseph at codesourcery dot com ---
> The checks are in check_bitfield_type_and_width.  I expect the attribute - 
> in this position a declaration attribute - gets applied after that (and 
> while applying it results in a change to the type, and thus in the 
> declaration being laid out again, this check doesn't get repeated).
> 
> In this case, the existing check is correct but not sufficient.  In 
> another case the check is arguably too early:
> 
> struct s { int __attribute__ ((__mode__ (DI))) x : 50; };
> 
> Considering int __attribute__ ((__mode__ (DI))) as a DImode integer type, 
> that bit-field width is valid - but it's rejected because the check is 
> carried out on int, before the attribute gets applied.  Getting that case 
> to work might require extracting early those declaration attributes that 
> actually affect the type, so they can be applied to the type before the 
> declaration gets constructed and such checks are carried out.

Ah, and when applying the vector_size attribute the FIELD_DECL hasn't been
updated to indicate we identified it as a bitfield (so we could reject
the attribute) ...

I'll leave this to frontend folks to sort out.

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

--- Comment #6 from rguenther at suse dot de  ---
On Mon, 13 Nov 2023, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481
> 
> Andrew Stubbs  changed:
> 
>What|Removed |Added
> 
>  Status|UNCONFIRMED |ASSIGNED
>Last reconfirmed||2023-11-13
>  Ever confirmed|0   |1
>Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org
> 
> --- Comment #4 from Andrew Stubbs  ---
> It fails because optab_handler fails to find an instruction for "and_optab" in
> SImode.  I didn't consider handling that case; seems so unlikely.

It sounds odd at least - I don't remember us having fallback for this.
Ah, well, maybe OPTAB_WIDEN - guess that might be a hint how to fix it?

[Bug c/112420] Unexpected vectorization for RISC-V

2023-11-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112420

--- Comment #3 from rguenther at suse dot de  ---
On Tue, 7 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112420
> 
> --- Comment #2 from JuzheZhong  ---
> (In reply to Richard Biener from comment #1)
> > We end up using gathers for the strided load which would be OK and avoid the
> > issue noted in the comment.  Thus we get true from
> > vect_use_strided_gather_scatters_p ().  That has a comment guard:
> > 
> >   /* As a last resort, trying using a gather load or scatter store.
> > 
> >  ??? Although the code can handle all group sizes correctly,
> >  it probably isn't a win to use separate strided accesses based
> >  on nearby locations.  Or, even if it's a win over scalar code,
> >  it might not be a win over vectorizing at a lower VF, if that
> >  allows us to use contiguous accesses.  */
> >   if (*memory_access_type == VMAT_ELEMENTWISE
> >   && single_element_p
> >   && loop_vinfo
> >   && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
> >  masked_p, gs_info))
> 
> Thanks Richi.
> 
> So, do you mean I need to fix the test to fix the FAIL?
> 
> Is the following reasonable?

Yes.  Adding a scan for the "using gather/scatter for strided/grouped 
accesses" for riscv might be good as well, explaining things a bit.

> /* { dg-final { scan-tree-dump-times "vectorized 0 loops in function" 2 "vect"
> { target { ! riscv*-*-* } } } } */

[Bug middle-end/112359] [14 Regression] ICE: in expand_fn_using_insn, at internal-fn.cc:215 with -O -ftree-loop-if-convert -mavx512fp16

2023-11-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112359

--- Comment #3 from rguenther at suse dot de  ---
On Mon, 6 Nov 2023, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112359
> 
> --- Comment #2 from Robin Dapp  ---
> Would something like
> 
> +  bool allow_cond_op = flag_tree_loop_vectorize
> +&& !gimple_bb (phi)->loop_father->dont_vectorize;
> 
> in convert_scalar_cond_reduction be sufficient or are there more conditions to
> check still?  Testing with the above and the reduced testcase worked FWIW.

You want, in tree_if_conversion (), remember whether we versioned the
loop as part of

  if (need_to_lower_bitfields
  || need_to_predicate
  || any_complicated_phi
  || flag_tree_loop_if_convert != 1)
{
...

and pass that as flag to combine_blocks and down to
predicate_all_scalar_phis (or add a new global flag).  Ideally we'd
do the convert_scalar_cond_reduction analysis beforehand and force
versioning for the case we'd generate a .COND_* instead of deciding
at code-generation only (when it's too late to do the required
versioning).
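
Roughly (a sketch; the versioned_p flag is the suggested addition):

  bool versioned_p = false;
  if (need_to_lower_bitfields
  || need_to_predicate
  || any_complicated_phi
  || flag_tree_loop_if_convert != 1)
{
  /* ... version the loop as before ...  */
  versioned_p = true;
}

  /* Pass the flag down so predicate_all_scalar_phis knows whether
 emitting a .COND_* is safe.  */
  combine_blocks (loop, versioned_p);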

[Bug tree-optimization/112361] [14 Regression] avx512f-reduce-op-1.c miscompiled since r14-5076

2023-11-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112361

--- Comment #7 from rguenther at suse dot de  ---
On Mon, 6 Nov 2023, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112361
> 
> --- Comment #6 from Robin Dapp  ---
> So "before" we created
> 
>   vect__3.12_55 = MEM  [(float *)vectp_a.10_53];
>   vect__ifc__43.13_57 = VEC_COND_EXPR  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 }>;
> //  _ifc__43 = _24 ? _3 : 0.0;
>   stmp__44.14_58 = BIT_FIELD_REF ;
>   stmp__44.14_59 = r3_29 + stmp__44.14_58;
>   ...
> 
> in vect_expand_fold_left.

Note that this wasn't correct in all cases (wrt signed zeros and
sign-dependent rounding).

> Now, as intended, there is no VEC_COND anymore and we just create the 
> bit-field
> reduction over the unmasked vector.

That's invalid for a COND_OP.  We either have to emulate that COND_OP
by materializing a VEC_COND_EXPR as before when that's semantically
valid, or refrain from vectorizing (I don't think we want to emit
N compare & jump to scalarize the mask effect).

> We could refrain from creating the COND_OP in the first place as Jakub
> mentioned (I guess we know already in if-conv that we shouldn't), re-insert a
> VEC_COND or create a COND_OP chain (instead of an OP chain) in
> vect_expand_fold_left by passing it the mask (and is_cond_op).
> Having several COND_OPs here might make analysis of subsequent passes more
> difficult?

pass in the mask and is_cond_op and create the VEC_COND_EXPR in
vect_expand_fold_left.  But make sure to disallow vectorizing the
invalid cases.
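
I.e. vect_expand_fold_left would materialize the select itself when
handed the mask (a sketch; mask, vec_def, else_vec and is_cond_op are
the new inputs):

  /* For a conditional reduction blend the masked-out lanes with the
 neutral value first: vec_def = mask ? vec_def : else_vec.  */
  if (is_cond_op)
{
  tree masked = make_ssa_name (vectype);
  gassign *sel = gimple_build_assign (masked, VEC_COND_EXPR,
  mask, vec_def, else_vec);
  gsi_insert_before (gsi, sel, GSI_SAME_STMT);
  vec_def = masked;
}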
