[Bug middle-end/113921] Output register of an "asm volatile goto" is incorrectly clobbered/discarded

2024-02-14 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921

--- Comment #5 from Linus Torvalds  ---
(In reply to Linus Torvalds from comment #2)
>
> So we could make our workaround option be something like
> 
>config GCC_ASM_GOTO_WORKAROUND
>  def_bool y
>  depends on CC_IS_GCC && GCC_VERSION < 120100

Replying to myself some more, since that kernel side workaround is actually in
two parts: it does the extra empty inline asm as an extra barrier, but it
*also* adds the 'volatile' to the asm with outputs to work around the other gcc
bug.

And that other fix ("Mark asm goto with outputs as volatile") is *not* in
gcc-12.1.0. It has only made it into gcc-13.2.0 (and it's pending in the gcc-12
release branch if I read things right).

I suspect that other bug doesn't affect Sean's kvm case, because the outputs
are always used, but it could affect other kernel code.

So we'd want to have at least gcc-13.2 to be safe.

Hmm.

We could make the "add volatile manually" be the default workaround, though,
since it shouldn't matter.

[Bug middle-end/113921] Output register of an "asm volatile goto" is incorrectly clobbered/discarded

2024-02-14 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921

--- Comment #3 from Linus Torvalds  ---
(In reply to Linus Torvalds from comment #2)
>
> So we could make our workaround option be something like
> 
>config GCC_ASM_GOTO_WORKAROUND
>  def_bool y
>  depends on CC_IS_GCC && GCC_VERSION < 120100

Actually, rather than add a new kernel config option, I'll just make the
workaround conditional in our  header file.

But I'll wait for confirmation from gcc people that Jakub's bisection makes
sense, and is the real fix, rather than just a change that just happens to hide
the issue.
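
Something along these lines, presumably (a sketch only; the macro name,
CONFIG_CC_IS_GCC/GCC_VERSION spelling and the 120100 cutoff are assumptions
taken from the Kconfig snippet quoted above):

    /* Sketch: only apply the workaround for gcc releases that still need it. */
    #if defined(CONFIG_CC_IS_GCC) && (GCC_VERSION < 120100)
    # define asm_goto_output(x...) \
            do { asm volatile goto(x); asm (""); } while (0)
    #else
    # define asm_goto_output(x...) asm goto(x)
    #endif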

[Bug middle-end/113921] Output register of an "asm volatile goto" is incorrectly clobbered/discarded

2024-02-14 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921

--- Comment #2 from Linus Torvalds  ---
(In reply to Jakub Jelinek from comment #1)
> Bisection points to r12-5301-g045206450386bcd774db3bde0c696828402361c6
> making the problem go away,

Well, that certainly explains why I can't see the problem with my gcc 13.2.1.

It looks like that commit is in gcc-12.1.0 and later:

   git log --oneline --all --grep="tree-optimization/102880 - improve CD-DCE"

only returns that one commit, and 

   git name-rev --refs '*releases*' 045206450386b

says "gcc-12.1.0~3038".

So we could make our workaround option be something like

   config GCC_ASM_GOTO_WORKAROUND
 def_bool y
 depends on CC_IS_GCC && GCC_VERSION < 120100

but maybe there is some backporting policy with gcc that my quick git check
missed?

[Bug rtl-optimization/111901] Apparently bogus CSE of inline asm with memory clobber

2023-10-20 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111901

--- Comment #3 from Linus Torvalds  ---
(In reply to Andrew Pinski from comment #1)
> I suspect without an input, the cse will happen as there is no other writes
> in the loop.

Yes, it looks to me like the CSE simply didn't think of the memory clobber as a
possible side effect.

Other clobbers (ie registers) presumably don't affect whether something can be
CSE'd, so I do think that a memory clobber is special.

Of course, memory clobbers are special in other ways too (ie they force spills
and reloads of any pseudos around them), but my reaction to this is that it
does look like gcc has simply missed one other implication of a memory clobber.

Do I think it *matters* in practice? Probably not. Hopefully these kinds of asm
uses don't exist.

But I do find the resulting code generation surprising to the point of "that
looks buggy".

[Bug c/111901] New: Apparently bogus CSE of inline asm with memory clobber

2023-10-20 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111901

Bug ID: 111901
   Summary: Apparently bogus CSE of inline asm with memory clobber
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torva...@linux-foundation.org
  Target Milestone: ---

This came up during an unrelated discussion about 'volatile' in the kernel with
Uros Bizjak, both wrt inline asms and just 'normal' volatile memory accesses.
As part of that discussion I wrote a test-program, and it shows unexpected
(arguably very buggy) gcc optimization behavior.

The test is really simple:

int test(void)
{
        unsigned int sum=0;
        for (int i=0 ; i < 4 ; i++){
                unsigned int val;
#if ONE
                asm("magic1 %0":"=r"(val): :"memory");
#else
                asm volatile("magic2 %0":"=r"(val));
#endif
                sum+=val;
        }
        return sum;
}

and if you compile this with

gcc -O2 -S -DONE -funroll-all-loops t.c

the end result seems nonsensical.

The over-eager CSE combines the non-volatile asm despite the fact that it has a
memory clobber, which gcc documentation states means:

 The "memory" clobber tells the compiler that the assembly code
 performs memory reads or writes to items other than those listed in
 the input and output operands (for example, accessing the memory
 pointed to by one of the input parameters).

so combining the four identical asm statements into one seems to be actively
buggy. The inline asm may not be marked volatile, but it does clearly tell the
compiler that it does memory reads OR WRITES to operands other than those
listed. Which would on the face of it make the CSE invalid.

Imagine, for example, that there's some added statistics gathering or similar
going on inside the asm, maybe using the new Intel remote atomics for example. 

The other oddity is how it does not actually multiply the result by four in any
sane way. The above generates:

        magic1 %eax
        movl    %eax, %edx
        addl    %eax, %eax
        addl    %edx, %eax
        addl    %edx, %eax
        ret

but *without* the memory barrier it generates the much more obvious

        magic1 %eax
        sall    $2, %eax
        ret

so the memory barrier does end up affecting the end result, just in a
nonsensical way.

And no, I don't think we have cases of this kind of code in the kernel. Marking
the asm volatile will obviously disable the over-eager CSE, and is certainly
the natural thing to do for anything that actually modifies memory.

But the memory clobber really does seem to make the CSE that gcc does invalid
as per the documentation, and Uros suggested I'd do a gcc bugzilla entry for
this.
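
For reference, the variant that does not get combined is simply the volatile
form of the same statement (as in the #else branch of the test program above):

    /* With volatile, all four asm instances are kept, regardless of whether
     * the optimizer considers their outputs equivalent. */
    asm volatile("magic1 %0" : "=r"(val) : : "memory");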

FYI: clang seems to generate the expected code for all cases (ie both
"volatile" and the memory clobber will disable CSE, and when it does CSE it, it
generates the expected single shift, rather than doing individual additions).
But clang verifies the assembler mnemonic (which I think is a misfeature - one
of the *reasons* to use inline asm is to do things the compiler doesn't
understand), so instead of "magicX", you need to use "strl" or something like
that.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-30 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #47 from Linus Torvalds  ---
(In reply to Richard Biener from comment #45)
> For user code
> 
> volatile long long x;
> void foo () { x++; }
> 
> emitting inc + adc with memory operands is only "incorrect" in re-ordering
> the subword reads with the subword writes, the reads and writes still happen
> architecturally ...

But the thing is, the ordering *is* very much defined for volatile accesses.
"volatile" is not a "the access happens architecturally", it's very much
defined "the access is _visible_ architecturally, and ordering matters".

So with the "volatile long long x" code, I think any language lawyer will say
that generating it as

add $1,mem
adc $0,mem+4

is unquestionably a compiler bug.

It may be what the user *wants* (and it's obviously what the gcov code would
like), but it's simply not a valid volatile access to 'x'.

So the gcov code would really want something slightly weaker than 'volatile'.
Something that just does 'guaranteed access' and disallows combining stores or
doing re-loads, without the ordering constraints.

Side note: we would use such a "weaker volatile" in the kernel too. We already
have that concept in the form of READ_ONCE() and WRITE_ONCE(), and it uses
"volatile" internally, and it works fine for us. But if we had another way to
just describe "guaranteed access", that could be useful.
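
(For readers who don't know the kernel side: roughly, READ_ONCE()/WRITE_ONCE()
boil down to something like the sketch below; the real macros handle more
cases.)

    /* Simplified sketch of the idea, not the actual kernel implementation: */
    #define READ_ONCE(x)      (*(const volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, v)  (*(volatile typeof(x) *)&(x) = (v))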

I suspect the memory ordering primitives would be a better model than
'volatile' for this.  What are the rules for doing it as load/store with
'memory_order_relaxed'? That should at least guarantee that the load is never
re-done (getting two different values for anybody who does a load), but maybe
the stores can be combined?

And gcc should already have all that infrastructure in place. Hmm?
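
As a strawman, the "guaranteed access, no ordering" model in question would
look something like this with the C11 atomics gcc already implements (a sketch
only, assuming a relaxed load plus a relaxed store, i.e. the counter update
itself stays racy by design):

    #include <stdatomic.h>

    static _Atomic unsigned long long counter;

    static void counter_bump(void)
    {
            /* Relaxed load: done exactly once, no ordering implied. */
            unsigned long long v =
                    atomic_load_explicit(&counter, memory_order_relaxed);
            /* Relaxed store: the increment itself is not atomic, which is
             * the intended trade-off. */
            atomic_store_explicit(&counter, v + 1, memory_order_relaxed);
    }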

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-30 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #43 from Linus Torvalds  ---
(In reply to Richard Biener from comment #42)
> 
> I think if we want to avoid doing optimizations on gcov counters we should
> make them volatile. 

Honestly, that sounds like the cleanest and safest option to me.

That said, with the gcov counters apparently also being 64-bit, I suspect it
will create some truly horrid code generation.

Presumably you'd end up getting a lot of load-load-add-adc-store-store
instruction patterns, which is not just six instructions when just two should
do - it also uses up two registers.

So while it sounds like the simplest and safest model, maybe it just makes code
generation too unbearably bad?

Maybe nobody who uses gcov would care. But I suspect it might be quite the big
performance regression, to the point where even people who thought they don't
care will go "that's a bit much".

I wonder if there is some half-way solution that would allow at least a
load-add-store-load-adc-store instruction sequence, which would then mean (a)
one less register wasted and (b) potentially allow some peephole optimization
turning it into just a addmem-adcmem instruction pair.

Turning just one of the memops into a volatile access might be enough (eg
just the load, but not the store?).
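
Roughly this kind of thing, to make the idea concrete (a sketch; whether gcc's
gcov machinery could be taught to emit it is another question):

    /* Sketch: volatile-qualified load so the value cannot be re-read or
     * cached across the loop, but a plain store that a peephole pass could
     * still combine into add/adc with memory operands. */
    static void gcov_bump(unsigned long long *ctr)
    {
            unsigned long long v = *(volatile unsigned long long *)ctr;
            *ctr = v + 1;
    }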

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #32 from Linus Torvalds  ---
Btw, where does the -fprofile-update=single/atomic come from?

The kernel just uses 

  CFLAGS_GCOV := -fprofile-arcs -ftest-coverage

for this case. So I guess 'single' is just the default value?

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #31 from Linus Torvalds  ---
(In reply to Richard Biener from comment #26)
> 
> Now, in principle we should have applied store-motion and not only PRE which
> would have avoided the issue, not tricking the RA into reloading the value
> from where we store it in the loop, but the kernel uses -fno-tree-loop-im,
> preventing that.  If you enable that you'd get

Note that we use -fno-tree-loop-im only for the GCOV case, and because of
another problem with code generation with gcov. See

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702

and the fix for the excessive stack use was to disable that compiler option.
See

 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c87bf431448b404a6ef5fbabd74c0e3e42157a7f

for the kernel commit message.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #30 from Linus Torvalds  ---
(In reply to Richard Biener from comment #26)
> And yes, to IV optimization the gcov counter for the loop body is just
> another IV candidate that can be used, and in this case it allows to elide
> the otherwise
> unused original IV.

Ouch.

So we really don't mind the data race - the gcov data is obviously not primary
- but I don't think anybody expected the data race on the gcov data that isn't
"semantically visible" to then affect actual semantics.

And yeah, atomic updates would be too expensive even on 64-bit architectures,
so we pretty much *depend* on the data race being there. And on 32-bit
architectures (at least i386), atomic 64-bit ones go from "expensive" to
"ludicrously complicated" (ie to get a 64-bit atomic update you'd need to start
doing cmpxchg8b loops or something).
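
(For illustration only: a 64-bit "atomic" increment on i386 ends up as a
compare-and-exchange retry loop along these lines, which is the "ludicrously
complicated" part. Sketch using the gcc builtins; on -m32 this typically
compiles down to a cmpxchg8b loop, or a libatomic call on very old targets.)

    static void atomic64_inc(unsigned long long *p)
    {
            unsigned long long old = *p;
            while (!__atomic_compare_exchange_n(p, &old, old + 1,
                                                0 /* strong */,
                                                __ATOMIC_RELAXED,
                                                __ATOMIC_RELAXED))
                    ;       /* 'old' was refreshed by the failed exchange */
    }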

So I think the data race is not just what we expected, it's fundamental. Just
the "mix it with semantics" ends up being less than optimal. 

Having the gcov data be treated as 'volatile' would be one option, but probably
cause horrendous code generation issues as Jakub says.

Although I have several times hit that "I want to just update a volatile in
memory, I wish gcc would just be happy to combine a 'read-modify-update' to a
single instruction". So in a perfect world, that would be fixed too.

I guess from a kernel perspective, we might need to really document that GCOV
has these issues, and you can't use it for any real work. We have just been
lucky this hasn't hit us (admittedly because it's fairly odd that an expected
end gcov value would end up being used in that secondary way as a loop
variable).

[Bug target/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-26 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #12 from Linus Torvalds  ---
So it might be worth pointing explicitly to Vlastimil's email at

  https://lore.kernel.org/all/2b857e20-5e3a-13ec-a0b0-1f69d2d04...@suse.cz/

which has annotated objdump output and seems to point to the actual bug (or at
least part of it), which seems to show how the page counting (in register %ebx)
is corrupted by the coverage counts (Vlastimil calls the coverage counts "crap"
- it's real data, but from an algorithmic standpoint it obviously has no
bearing on the output).

That would mesh with "on 32-bit x86, the 64-bit coverage counts require a lot
more effort, and we have few registers, and something gets confused and uses
register %rax for two things".

The bug apparently only happens with -O2, and I think it has only been reported
with gcc-11, which is what the Intel test robots happened to use.

[Bug target/106471] Strange code generation for __builtin_ctzl()

2022-07-28 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

--- Comment #6 from Linus Torvalds  ---
Ahh, crossed comments.

(In reply to Andrew Pinski from comment #3)
> The xor is due to X86_TUNE_AVOID_FALSE_DEP_FOR_BMI setting:
> 
> /* X86_TUNE_AVOID_FALSE_DEP_FOR_BMI: Avoid false dependency
>for bit-manipulation instructions.  */

Ahh, ok, so it is indeed for that false dependency for old cpus.

Intel claims that's only for POPCNT on more recent CPUs.

But yes, the same thing definitely happened for BSF for much older cores (ie
pre-TZCNT).

[Bug target/106471] Strange code generation for __builtin_ctzl()

2022-07-28 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

--- Comment #5 from Linus Torvalds  ---
(In reply to Andrew Pinski from comment #2)
> The xor is needed because of an errata in some Intel cores.

The only errata I'm aware of is that tzcnt can act as tzcnt even when cpuid
doesn't enumerate it (so it would be expected to act as just bsf). Errata 010
for Gen 8/9 cores.

And yes, that's an errata, but the xor doesn't really help.

Sure, the xor means that on old machines, where 'rep' is ignored, and tzcnt
will always act as bsf, the result register is now going to be zero if the
input is zero.

But that's

 (a) not what tzcnt does (it sets the result to 64 when the input is zero)

 (b) not what __builtin_ctzl() is documented to do anyway

In particular, wrt (b), the documentation already states

 "If x is 0, the result is undefined"

which is exactly the old legacy 'bsf' behavior.

And the errata I'm aware of is that 'rep bsf' acts as tzcnt (ie "write 64 to
destination instead of leaving it unmodified"), so even with the xor you actually
get undefined behavior (0 or 64 depending on CPU).

So both (a) and (b) argue for that xor being wrong.

Now, of course, there may be some other errata that I'm not aware of. Can
somebody point to it?

(And yes, on old CPUs that don't have tzcnt at all the added xor will break a
false dependency and maybe help performance, but should gcc really care about
old CPUs like that? Particularly when it eats a register and makes it
impossible to have the same source and destination register?)

> There is a different bug already recording the issue with cltq (and should
> be fixed soon or was already committed, there is a patch).

Ok, thanks.

[Bug c/106471] Strange code generation for __builtin_ctzl()

2022-07-28 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

--- Comment #1 from Linus Torvalds  ---
Created attachment 53379
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53379&action=edit
Silly test-case as an attachment too

I expected just

rep bsfq %rdi, %rax
ret

from this, but that's not what gcc generates.

[Bug c/106471] New: Strange code generation for __builtin_ctzl()

2022-07-28 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

Bug ID: 106471
   Summary: Strange code generation for __builtin_ctzl()
   Product: gcc
   Version: 12.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torva...@linux-foundation.org
  Target Milestone: ---

So this actually started out as a clang issue report about bad inline asm input
argument behavior at

   https://github.com/llvm/llvm-project/issues/56789

but as part of that I was looking at __builtin_ctzl() and while gcc DTRT for
the inline asm version we use in the kernel, the builtin acts very oddly
indeed.

IOW, this code:

unsigned long test(unsigned long arg)
{
return __builtin_ctzl(arg);
}

generates this odd result with 'gcc -O2 -S':

        xorl    %eax, %eax
        rep bsfq        %rdi, %rax
        cltq
        ret

where the xor and the cltq both just confuse me.
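
For comparison, the inline-asm version mentioned above that gcc does handle as
expected is essentially this (a sketch, simplified from the kernel's x86 bitops
helpers):

    /* Undefined for word==0, same as __builtin_ctzl(); 'rep bsf' runs as
     * tzcnt on newer CPUs and as plain bsf on older ones. */
    static inline unsigned long kernel_ctzl(unsigned long word)
    {
            asm("rep bsf %1,%0" : "=r" (word) : "rm" (word));
            return word;
    }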

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-07-10 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #28 from Linus Torvalds  ---
(In reply to Roger Sayle from comment #27)
> This should now be fixed on both mainline and the GCC 12 release branch.

Thanks everybody.

Looks like the xchg optimization isn't in the gcc-12 release branch, but the
stack use looks reasonable from my (very limited) testing.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-24 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #24 from Linus Torvalds  ---
(In reply to Linus Torvalds from comment #23)
> 
> And this now brings back my memory of the earlier similar discussion - it
> wasn't about DImode code generation, it was about bitfield code generation
> being horrendous,

Searching around, it's this one from 2011:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

not really related to this issue, apart from the superficially similar issue
with oddly bad code generation.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-24 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #23 from Linus Torvalds  ---
(In reply to Jakub Jelinek from comment #22)
> 
> If the wider registers are narrowed before register allocation, it is just
> a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.

That was more what I was thinking - why is the DImode information being kept so
long?

I realize that you want to do a lot of the early CSE etc operations at that
higher level, but by the time you are actually allocating registers and
thinking about spilling them, why is it still a DImode thing?

And this now brings back my memory of the earlier similar discussion - it
wasn't about DImode code generation, it was about bitfield code generation
being horrendous, where gcc was keeping the whole "this is a bitfield"
information around for a long time and as a result generating truly horrendous
code. When it looked like it instead should just have turned it into a load and
shift early, and then doing all the sane optimizations at that level (ie
rewriting simple bitfield code to just do loads and shifts generated *much*
better code than using bitfields).

But this is just my personal curiosity at this point - it looks like Roger
Sayle's patch has fixed the immediate problem, so the big issue is solved. And
maybe the fact that clang is doing so much better is due to something else
entirely - it just _looks_ like it might be this artificial constraint by gcc
that makes it do bad register and spill choices.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-24 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #21 from Linus Torvalds  ---
(In reply to CVS Commits from comment #20)
>
> One might think
> that splitting early gives the register allocator more freedom to
> use available registers, but in practice the constraint that double
> word values occupy consecutive registers (when ultimately used as a
> DImode value) is the greater constraint. 

Whee.

Why does gcc have that constraint, btw? I tried to look at the clang code
generation once more, and I don't *think* clang has the same constraint, and
maybe that is why it does so much better?

Yes, x86 itself inherently has a couple of forced register pairings (notably
%edx:%eax for 64-bit multiplication and division), and obviously the whole
calling convention requires well-defined pairings, but in the general case it
seems to be a mistake to keep DImode values as DImode values and force them to
be consecutive registers when used.
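
(For the record, the kind of inherent pairing meant here, sketched with inline
asm: the 32-bit widening multiply always produces its result in %edx:%eax.)

    static unsigned long long mul32x32(unsigned int a, unsigned int b)
    {
            unsigned int lo, hi;
            /* mull multiplies %eax by the operand; result lands in %edx:%eax. */
            asm("mull %3" : "=a" (lo), "=d" (hi) : "a" (a), "rm" (b) : "cc");
            return ((unsigned long long)hi << 32) | lo;
    }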

Maybe I misunderstand. But now that this comes up I have this dim memory of
actually having had a discussion like this before on bugzilla, where gcc
generated horrible DImode code.

> GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
> sub esp, 24
> mov ecx, DWORD PTR [esp+40]
> 
> GCC 12 [use %eax to address memory, require a 44-byte stack frame]
> sub esp, 44
> mov eax, DWORD PTR [esp+64]

I just checked the current git tip, and this does seem to fix the original
case too, with the old horrid 2620 bytes of stack frame now being a *much*
improved 404 bytes!

So your patch - or other changes - does fix it for me, unless I did something
wrong in my testing (which is possible).

Thanks. I'm not sure what the gcc policy on closing the bug is (and I don't
even know if I am allowed), so I'm not marking this closed, but it seems to be
fixed as far as I am concerned, and I hope it gets released as a dot-release
for the gcc-12 series.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-13 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #14 from Linus Torvalds  ---
(In reply to Samuel Neves from comment #13)
> Something simple like this -- https://godbolt.org/z/61orYdjK7 -- already
> exhibits the effect. 

Yup.

That's a much better test-case. I think you should attach it to the gcc
bugzilla, I don't know how long godbolt remembers those links.

But godbolt here is good for another thing: select clang-14 as the compiler,
and you realize how even gcc-11 is actually doing a really really bad job.

And I'm not saying that to shame anybody: I think it's more a sign that maybe
the issue is actually much deeper, and that the problem already exists in
gcc-11, but some small perturbation then just makes an already existing problem
much worse in gcc-12.

So that commit I bisected to is probably almost entirely unrelated, and it is
just the thing that turns up the bad behavior to 11.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-13 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #12 from Linus Torvalds  ---
(In reply to Jakub Jelinek from comment #11)
> Anyway, I think we need to understand what makes it spill that much more,
> and unfortunately the testcase is too large to find that out easily, I think
> we should investigate commenting out some rounds.

I think you can comment out all the rounds except for one, and still see the
effects of this.

At least when I try it on godbolt with just ROUND(0) in place and rounds 1-12
deleted, for gcc-11.3 I get a stack frame of 388.

And with gcc-12.1 I get a stack frame of 516. Obviously that's not nearly as
huge, but it's still a fairly significant expansion and is presumably due to
the same basic issue.

I used godbolt just because I no longer have 11.3 installed to compare against,
so it was the easiest way to verify that the stack expansion is still there.

Just to clarify: that's still with

   -O2 -m32 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx

I'm sure it depends on the flags.

The code is still not exactly trivial with just one round in place. It's a
cryptographic hash, after all. But when compiling for 64-bit mode it all looks
fairly straightforward, it really is that DImode stuff that makes it really
ungainly when using -m32.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-12 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #10 from Linus Torvalds  ---
(In reply to Roger Sayle from comment #7)
> Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

So now that I actually tried that, '-mno-stv' does nothing for me. I still see
a frame size of 2620.

I wonder what the difference in our setups is. I tested with plain gcc-12.1.1
from Fedora 36, maybe you tested with current trunk or something?

Anyway, it seems -mno-stv isn't the solution at least for the released version
;(

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-12 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #9 from Linus Torvalds  ---
Looks like STV is "scalar to vector" and it should have been disabled
automatically by the -mno-avx flag anyway.

And the excessive stack usage was perhaps due to GCC preparing all those stack
slots for integer -> vector movement that then never happens exactly because
the kernel uses -mno-avx?

So if I understand this right, the kernel can indeed just add -mno-stv to work
around this.

?

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-12 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #8 from Linus Torvalds  ---
(In reply to Roger Sayle from comment #7)
> Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

Thanks. Using -mno-stv may well be a good workaround for the kernel for now,
except I don't have a clue what it means. I will google it to figure out
whether it's appropriate for our use.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-11 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #5 from Linus Torvalds  ---
(In reply to Linus Torvalds from comment #4)
> 
> I'm  not proud of that hacky thing, but since gcc documentation is written
> in sanskrit, and mere mortals can't figure it out, it's the best I could do.

And by 'sanskrit' I mean 'texi'.

I understand that it's how GNU projects are supposed to work, but markdown (or
rst) is just *so* much more legible and you really can read it like text.

Anyway, that's my excuse for not knowing how to "just generate cc1" for a saner
git bisect run. What I did worked, but was just incredibly ugly. There must be
some better way gcc developers have when they want to bisect cc1 behavior.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-11 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #4 from Linus Torvalds  ---
So hey, since you guys use git now, I thought I might as well just bisect this.

Now, I have no idea what the best and most efficient way is to generate only
"cc1", so my bisection run was this unholy mess of "just run configure, and
then run 'make -j128' for long enough that 'host-x86_64-pc-linux-gnu/gcc/cc1'
gets generated, and test that".

I'm  not proud of that hacky thing, but since gcc documentation is written in
sanskrit, and mere mortals can't figure it out, it's the best I could do.

And the problem bisects down to

  8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48 is the first bad commit
  commit 8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48
  Author: Roger Sayle 
  Date:   Sat Mar 5 08:50:45 2022 +

  PR 104732: Simplify/fix DI mode logic expansion/splitting on -m32.

so yes, this seems to be very much specific to the i386 target.

And yes, I also verified that reverting that commit on the current master
branch solves it for me.

Again: this was just a completely mindless bisection, with a "revert to verify"
on top of the current trunk, which for me happened to be commit cd02f15f1ae
("xtensa: Improve constant synthesis for both integer and floating-point").

I'm attaching the revert patch I used just so that you can see exactly what I
did. I probably shouldn't have actually removed the testsuite entry, but again:
ENTIRELY MINDLESS BISECTION RESULT.

[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86

2022-06-11 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #3 from Linus Torvalds  ---
Created attachment 53123
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53123&action=edit
Mindless revert that fixes things for me

[Bug target/105930] Excessive stack spill generation on 32-bit x86

2022-06-11 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #1 from Linus Torvalds  ---
Side note: it might be best to clarify that this is a regression specific to
gcc-12.

Gcc 11.3 doesn't have the problem, and generates code for this same test-case
with a stack frame of only 428 bytes. That's still bigger than clang, but not
"crazy bigger".

[Bug c/105930] New: Excessive stack spill generation on 32-bit x86

2022-06-11 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Bug ID: 105930
   Summary: Excessive stack spill generation on 32-bit x86
   Product: gcc
   Version: 12.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: torva...@linux-foundation.org
  Target Milestone: ---

Created attachment 53121
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53121&action=edit
Test-case extracted from the generic blake2b kernel code

Gcc-12 seems to generate a huge number of stack spills on this blake2b
test-case, to the point where it overflows the allowable kernel stack on 32-bit
x86.

This crypto thing has two 128-byte buffers, so a stack frame a bit larger than
256 is expected when the dataset doesn't fit in the register set.

Just as an example, on this code, clang-14.0.0 generates a stack frame that is
296 bytes. 

In contrast, gcc-12.1.1 generates a stack frame that is almost an order of
magnitude(!) larger, at 2620 bytes.

The trivial Makefile I used for this test-case is

   # The kernel cannot just randomly use FP/MMX/AVX
   CFLAGS := -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx
   CFLAGS += -m32
   CFLAGS += -O2

   test:
           gcc $(CFLAGS) -Wall -S blake2b.c
           grep "sub.*%[er]sp" blake2b.s

to easily test different flags and the end result, but as can be seen from
above, it really doesn't need any special flags except the ones that disable
MMX/AVX code generation.

And the generated code looks perfectly regular, except for the fact that it
uses almost 3kB of stack space.

Note that "-m32" is required to trigger this - the 64-bit case does much
better, presumably because it has more registers and this needs fewer spills.
It gets worse with some added debug flags we use in the kernel, but not that
kind of "order of magnitude" worse.

Using -O1 or -Os makes no real difference.

This is presumably due to some newly triggered optimization in gcc-12, but I
can't even begin to guess at what we'd need to disable (or enable) to avoid
this horrendous stack growth. Some very aggressive instruction scheduling thing
that spreads out all the calculations and always wants to spill-and-reload the
subexpressions that it CSE'd? I dunno.

Please advise. The excessive stack literally causes build failures due to us using
-Werror=frame-larger-than= to make sure stack use remains sanely bounded. The
kernel stack is a rather limited resource.

[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3

2021-05-03 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #14 from Linus Torvalds  ---
(In reply to Vineet Gupta from comment #13)
> Sorry the workaround proposed by Alexander doesn't seem to cure it (patch
> attached), outcome is the same

Vineet - it's not the ldd/std that is necessarily buggy, it's the earlier tests
of the address that guard that vectorized path. 

So your quoted parts of the code generation aren't necessarily the problematic
ones.

Did you actually test the code and check whether it has the same issue? Maybe
it changed the address limit guards before that ldd/std?

I also sent you a separate patch to test if just upgrading to a newer version
of the zlib code helps. Although that may be buggy for other reasons, it's not
like I actually tested the end result. But it would be interesting to hear if
that one works for you (again, ldd/std might be a valid end result of trying to
vectorize that code assuming the aliasing tests are done correctly in the
vectorized loop headers).

[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3

2021-05-03 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #11 from Linus Torvalds  ---
(In reply to Linus Torvalds from comment #10)
> 
>   This particular code comes
> from some old version of zlib, and I can't test because I don't have the ARC
> background to make any sense of the generated code.

Heh. We upgraded to a "recent version" of zlib back in 2006: 

   "Upgrade the zlib_inflate implementation in the kernel from a patched
version 1.1.3/4 to a patched 1.2.3"

but it turns out that the "do things a 16-bit word at a time" was a
kernel-local optimization for some very slow old PowerPC microcontroller.

The code in upstream zlib actually looks rather better (which is not saying
much, admittedly), doesn't have any 16-bit accesses, and we probably should
just try to synchronize with that instead.

[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3

2021-05-03 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #10 from Linus Torvalds  ---
(In reply to Richard Biener from comment #9)
> 
> Note alignment has nothing to do with strict-aliasing (-fno-strict-aliasing
> you mean btw).

I obviously meant -fno-strict-aliasing, yes.

But I think it's actually essentially the same issue, just in a different
guise:

> One thing we do is (I'm not 50% sure this explains the observed issue) assume
> that if you have two accesses with type 'short' and they are aligned
> according to this type then they will not partly overlap.  Note this has
> nothing to do with C strict aliasing rules but is basic pointer math when
> you know lower zero bits.

Well, the thing is, you have two situations:

 (a) you can statically see that the two do not alias, because the offset
arithmetic is either constant or you have some range logic that can tell that
they are sufficiently far apart.

 (b) you can't.

Now, everybody is ok with the static aliasing situation in (a). If you can tell
that two addresses don't alias, you're done, they are independent, there's no
question about it.

But that's not the situation here. So we're in (b). And what I find personally
so annoying is that gcc has actually *done* that distance check, but apparently
intentionally done it badly based on type information.

And the reason I think this is similar to -fno-strict-aliasing is that it's
that same (b) case, and it looks like a very similar "do a bad job of doing
actual run-time alias analysis based on type information".

It seems to be literally an off-by-one error, not because it generates better
code, but because the compiler has decided to pointlessly make a bad range
comparison based on type.

But I've never worked with the gcc IR dumps, so Andrew Pinski's debug output in
#c5 doesn't actually make me go "ahh, there". Maybe it's that 8 vs 6 that he
pointed out. Did somebody notice that "offset > 8" was off-by-one, and should
have been "offset >= 8"? And then changed it to "offset > 6" which is
off-by-one in the other direction instead?

> I suggest to try the fix suggested in comment#7 and report back if that
> fixes the observed issue.

Vineet?

I still think gcc is doing the wrong thing, exactly because of that
"pointlessly using the wrong range check" issue. This particular code comes
from some old version of zlib, and I can't test because I don't have the ARC
background to make any sense of the generated code.

[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3

2021-05-01 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #8 from Linus Torvalds  ---
(In reply to Alexander Monakov from comment #7)
> 
> Most likely the issue is that sout/sfrom are misaligned at runtime, while
> the vectorized code somewhere relies on them being sufficiently aligned for
> a 'short'.

They absolutely are.

And we build the kernel with -Wno-strict-aliasing exactly to make sure the
compiler doesn't think that "oh, I can make aliasing decisions based on type
information".

Because we have those kinds of issues all over, and we know which architectures
support unaligned loads etc, and all the tricks with "memcpy()" and unions make
for entirely unreadable code.
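
(The "memcpy() and unions" tricks referred to are the usual portable
unaligned-access idiom, something like the sketch below; the kernel instead
centralizes this in helpers such as get_unaligned().)

    #include <string.h>

    /* Sketch of the memcpy idiom: memcpy has no alignment requirement, so
     * the compiler emits whatever unaligned load the target allows. */
    static unsigned short load_u16_unaligned(const void *p)
    {
            unsigned short v;
            memcpy(&v, p, sizeof(v));
            return v;
    }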

So please fix the aliasing logic to not be type-based when people explicitly
tell you not to do that.

Linus

[Bug middle-end/100363] gcc generating wider load/store than warranted at -O3

2021-04-30 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Linus Torvalds  changed:

           What    |Removed |Added
                 CC|        |torvalds@linux-foundation.org

--- Comment #4 from Linus Torvalds  ---
(In reply to Andrew Pinski from comment #1)
> The loop gets vectorized, I don't see the problem really.


See

   
https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/issues/372

and in particular the comment

   "In the first 8-byte copy, src and dst overlap"

so apparently gcc has decided that they can't overlap, despite the two pointers
being literally generated from the same base pointer.

But I don't read ARC assembly, so I'll have to take Vineet's word for it.

Vineet, have you been able to generate a smaller test-case?