[Bug middle-end/113921] Output register of an "asm volatile goto" is incorrectly clobbered/discarded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921

--- Comment #5 from Linus Torvalds ---

(In reply to Linus Torvalds from comment #2)
> So we could make our workaround option be something like
>
>    config GCC_ASM_GOTO_WORKAROUND
>            def_bool y
>            depends on CC_IS_GCC && GCC_VERSION < 120100

Replying to myself some more, since that kernel-side workaround is actually in two parts: it does the extra empty inline asm as an extra barrier, but it *also* adds the 'volatile' to the asm with outputs to work around the other gcc bug.

And that other fix ("Mark asm goto with outputs as volatile") is *not* in gcc-12.1.0. It has only made it into gcc-13.2.0 (and it's pending in the gcc-12 release branch, if I read things right).

I suspect that other bug doesn't affect Sean's kvm case, because the outputs are always used, but it could affect other kernel code. So we'd want to have at least gcc-13.2 to be safe.

Hmm. We could make the "add volatile manually" be the default workaround, though, since it shouldn't matter.
[Bug middle-end/113921] Output register of an "asm volatile goto" is incorrectly clobbered/discarded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921

--- Comment #3 from Linus Torvalds ---

(In reply to Linus Torvalds from comment #2)
> So we could make our workaround option be something like
>
>    config GCC_ASM_GOTO_WORKAROUND
>            def_bool y
>            depends on CC_IS_GCC && GCC_VERSION < 120100

Actually, rather than add a new kernel config option, I'll just make the workaround conditional in our header file.

But I'll wait for confirmation from gcc people that Jakub's bisection makes sense, and is the real fix, rather than just a change that happens to hide the issue.
[Bug middle-end/113921] Output register of an "asm volatile goto" is incorrectly clobbered/discarded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921

--- Comment #2 from Linus Torvalds ---

(In reply to Jakub Jelinek from comment #1)
> Bisection points to r12-5301-g045206450386bcd774db3bde0c696828402361c6
> making the problem go away,

Well, that certainly explains why I can't see the problem with my gcc 13.2.1.

It looks like that commit is in gcc-12.1.0 and later:

    git log --oneline --all --grep="tree-optimization/102880 - improve CD-DCE"

only returns that one commit, and

    git name-rev --refs '*releases*' 045206450386b

says "gcc-12.1.0~3038". So we could make our workaround option be something like

    config GCC_ASM_GOTO_WORKAROUND
            def_bool y
            depends on CC_IS_GCC && GCC_VERSION < 120100

but maybe there is some backporting policy with gcc that my quick git check missed?
[Bug rtl-optimization/111901] Apparently bogus CSE of inline asm with memory clobber
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111901

--- Comment #3 from Linus Torvalds ---

(In reply to Andrew Pinski from comment #1)
> I suspect without an input, the cse will happen as there is no other writes
> in the loop.

Yes, it looks to me like the CSE simply didn't think of the memory clobber as a possible side effect. Other clobbers (ie registers) presumably don't affect whether something can be CSE'd, so I do think that a memory clobber is special.

Of course, memory clobbers are special in other ways too (ie they force spills and reloads of any pseudos around them), but my reaction to this is that it does look like gcc has simply missed one other implication of a memory clobber.

Do I think it *matters* in practice? Probably not. Hopefully these kinds of asm uses don't exist. But I do find the resulting code generation surprising to the point of "that looks buggy".
[Bug c/111901] New: Apparently bogus CSE of inline asm with memory clobber
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111901

Bug ID: 111901
Summary: Apparently bogus CSE of inline asm with memory clobber
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: torva...@linux-foundation.org
Target Milestone: ---

This came up during an unrelated discussion about 'volatile' in the kernel with Uros Bizjak, both wrt inline asms and just 'normal' volatile memory accesses.

As part of that discussion I wrote a test-program, and it shows unexpected (arguably very buggy) gcc optimization behavior.

The test is really simple:

    int test(void)
    {
        unsigned int sum = 0;
        for (int i = 0; i < 4; i++) {
            unsigned int val;
    #if ONE
            asm("magic1 %0" : "=r"(val) : : "memory");
    #else
            asm volatile("magic2 %0" : "=r"(val));
    #endif
            sum += val;
        }
        return sum;
    }

and if you compile this with

    gcc -O2 -S -DONE -funroll-all-loops t.c

the end result seems nonsensical.

The over-eager CSE combines the non-volatile asm despite the fact that it has a memory clobber, which gcc documentation states means:

    The "memory" clobber tells the compiler that the assembly code performs
    memory reads or writes to items other than those listed in the input and
    output operands (for example, accessing the memory pointed to by one of
    the input parameters).

so combining the four identical asm statements into one seems to be actively buggy. The inline asm may not be marked volatile, but it does clearly tell the compiler that it does memory reads OR WRITES to operands other than those listed. Which would on the face of it make the CSE invalid.

Imagine, for example, that there's some added statistics gathering or similar going on inside the asm, maybe using the new Intel remote atomics for example.

The other oddity is how it does not actually multiply the result by four in any sane way.
The above generates:

    magic1 %eax
    movl %eax, %edx
    addl %eax, %eax
    addl %edx, %eax
    addl %edx, %eax
    ret

but *without* the memory barrier it generates the much more obvious

    magic1 %eax
    sall $2, %eax
    ret

so the memory barrier does end up affecting the end result, just in a nonsensical way.

And no, I don't think we have cases of this kind of code in the kernel. Marking the asm volatile will obviously disable the over-eager CSE, and is certainly the natural thing to do for anything that actually modifies memory. But the memory clobber really does seem to make the CSE that gcc does invalid as per the documentation, and Uros suggested I'd do a gcc bugzilla entry for this.

FYI: clang seems to generate the expected code for all cases (ie both "volatile" and the memory clobber will disable CSE, and when it does CSE it, it generates the expected single shift, rather than doing individual additions). But clang verifies the assembler mnemonic (which I think is a misfeature - one of the *reasons* to use inline asm is to do things the compiler doesn't understand), so instead of "magicX", you need to use "strl" or something like that.
[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #47 from Linus Torvalds ---

(In reply to Richard Biener from comment #45)
> For user code
>
>     volatile long long x;
>     void foo () { x++; }
>
> emitting inc + adc with memory operands is only "incorrect" in re-ordering
> the subword reads with the subword writes, the reads and writes still happen
> architecturally ...

But the thing is, the ordering *is* very much defined for volatile accesses. "volatile" is not a "the access happens architecturally", it's very much defined "the access is _visible_ architecturally, and ordering matters".

So with the "volatile long long x" code, I think any language lawyer will say that generating it as

    add $1,mem
    adc $0,mem+4

is unquestionably a compiler bug. It may be what the user *wants* (and it's obviously what the gcov code would like), but it's simply not a valid volatile access to 'x'.

So the gcov code would really want something slightly weaker than 'volatile'. Something that just does 'guaranteed access' and disallows combining stores or doing re-loads, without the ordering constraints.

Side note: we would use such a "weaker volatile" in the kernel too. We already have that concept in the form of READ_ONCE() and WRITE_ONCE(), and it uses "volatile" internally, and it works fine for us. But if we had another way to just describe "guaranteed access", that could be useful.

I suspect the memory ordering primitives would be a better model than 'volatile' for this. What are the rules for doing it as load/store with 'memory_order_relaxed'? That should at least guarantee that the load is never re-done (getting two different values for anybody who does a load), but maybe the stores can be combined? And gcc should already have all that infrastructure in place. Hmm?
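For what it's worth, the "load/store with memory_order_relaxed" idea from the last paragraph can be written with C11 atomics; a minimal sketch, with made-up names for illustration. Note that a separate relaxed load and relaxed store is deliberately *not* an atomic increment - concurrent updates can be lost - which matches the data race gcov already tolerates:

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic unsigned long long counter;

/* Relaxed load + relaxed store: each access happens exactly once and is
 * never torn or re-done, but no ordering is imposed on surrounding code,
 * and the read-modify-write as a whole is not atomic. */
static void bump(void)
{
    unsigned long long v =
        atomic_load_explicit(&counter, memory_order_relaxed);
    atomic_store_explicit(&counter, v + 1, memory_order_relaxed);
}

static unsigned long long read_counter(void)
{
    return atomic_load_explicit(&counter, memory_order_relaxed);
}
```

On targets where a 64-bit atomic load or store needs library support, the compiler may call into libatomic for these, so the code-generation question raised above would still apply.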
[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #43 from Linus Torvalds ---

(In reply to Richard Biener from comment #42)
> I think if we want to avoid doing optimizations on gcov counters we should
> make them volatile.

Honestly, that sounds like the cleanest and safest option to me.

That said, with the gcov counters apparently also being 64-bit, I suspect it will create some truly horrid code generation. Presumably you'd end up getting a lot of load-load-add-adc-store-store instruction patterns, which is not just six instructions when just two should do - it also uses up two registers.

So while it sounds like the simplest and safest model, maybe it just makes code generation too unbearably bad? Maybe nobody who uses gcov would care. But I suspect it might be quite the big performance regression, to the point where even people who thought they didn't care will go "that's a bit much".

I wonder if there is some half-way solution that would allow at least a load-add-store-load-adc-store instruction sequence, which would then mean (a) one less register wasted and (b) potentially allow some peephole optimization turning it into just an addmem-adcmem instruction pair.

Turning just one of the memops into a volatile access might be enough (eg just the load, but not the store?)
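The "make them volatile" variant under discussion is easy to state in C; a minimal sketch of what a gcov-style counter update would look like if the counter were volatile (the names here are illustrative, not gcov's actual internals). On a 32-bit target, each 32-bit half of the load and of the store must then be a separate access, which is where the load-load-add-adc-store-store pattern comes from:

```c
#include <assert.h>

/* A 64-bit counter updated through a volatile-qualified object: the
 * compiler may not combine, elide, or reload these accesses, so on a
 * 32-bit target the increment becomes two loads, add/adc, two stores. */
static volatile unsigned long long arc_counter;

static void count_arc(void)
{
    arc_counter++;
}
```

Compiling this with `-m32 -O2` and inspecting the assembly is an easy way to see the instruction count and register pressure being discussed.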
[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #32 from Linus Torvalds ---

Btw, where does the -fprofile-update=single/atomic come from?

The kernel just uses

    CFLAGS_GCOV := -fprofile-arcs -ftest-coverage

for this case. So I guess 'single' is just the default value?
[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #31 from Linus Torvalds ---

(In reply to Richard Biener from comment #26)
> Now, in principle we should have applied store-motion and not only PRE which
> would have avoided the issue, not tricking the RA into reloading the value
> from where we store it in the loop, but the kernel uses -fno-tree-loop-im,
> preventing that. If you enable that you'd get

Note that we use -fno-tree-loop-im only for the GCOV case, and because of another problem with code generation with gcov. See

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702

and the fix for the excessive stack use was to disable that compiler option. See

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c87bf431448b404a6ef5fbabd74c0e3e42157a7f

for the kernel commit message.
[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #30 from Linus Torvalds ---

(In reply to Richard Biener from comment #26)
> And yes, to IV optimization the gcov counter for the loop body is just
> another IV candidate that can be used, and in this case it allows to elide
> the otherwise unused original IV.

Ouch.

So we really don't mind the data race - the gcov data is obviously not primary - but I don't think anybody expected the data race on the gcov data that isn't "semantically visible" to then affect actual semantics.

And yeah, atomic updates would be too expensive even on 64-bit architectures, so we pretty much *depend* on the data race being there. And on 32-bit architectures (at least i386), atomic 64-bit ones go from "expensive" to "ludicrously complicated" (ie to get a 64-bit atomic update you'd need to start doing cmpxchg8b loops or something).

So I think the data race is not just what we expected, it's fundamental. Just the "mix it with semantics" ends up being less than optimal.

Having the gcov data be treated as 'volatile' would be one option, but probably cause horrendous code generation issues as Jakub says. Although I have several times hit that "I want to just update a volatile in memory, I wish gcc would just be happy to combine a 'read-modify-update' to a single instruction". So in a perfect world, that would be fixed too.

I guess from a kernel perspective, we might need to really document that GCOV has these issues, and you can't use it for any real work. We have just been lucky this hasn't hit us (admittedly because it's fairly odd that an expected end gcov value would end up being used in that secondary way as a loop variable).
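For context on the READ_ONCE()/WRITE_ONCE() machinery mentioned in this bug: it boils down to forcing a single access through a volatile-qualified lvalue. A simplified sketch (the real kernel macros handle non-scalar types and more; this stripped-down form relies on the GNU C __typeof__ extension):

```c
#include <assert.h>

/* Simplified READ_ONCE/WRITE_ONCE: exactly one access of the natural
 * width, done via a volatile cast - no tearing, no re-loads, no store
 * combining on that access, without making the object itself volatile. */
#define READ_ONCE(x) \
        (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
        (*(volatile __typeof__(x) *)&(x) = (val))

static unsigned long shared_flag;

static unsigned long demo(void)
{
    WRITE_ONCE(shared_flag, 42);
    return READ_ONCE(shared_flag);
}
```

Because only the individual access is volatile, ordinary accesses to the same object elsewhere can still be optimized freely - which is the "guaranteed access without the full volatile ordering constraints" idea discussed above.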
[Bug target/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #12 from Linus Torvalds ---

So it might be worth pointing explicitly to Vlastimil's email at

    https://lore.kernel.org/all/2b857e20-5e3a-13ec-a0b0-1f69d2d04...@suse.cz/

which has annotated objdump output and seems to point to the actual bug (or at least part of it), which seems to show how the page counting (in register %ebx) is corrupted by the coverage counts (Vlastimil calls the coverage counts "crap" - it's real data, but from an algorithmic standpoint it obviously has no bearing on the output).

That would mesh with "on 32-bit x86, the 64-bit coverage counts require a lot more effort, and we have few registers, and something gets confused and uses register %eax for two things".

The bug apparently only happens with -O2, and I think has only been reported with gcc-11, which is what the intel test robots happened to use.
[Bug target/106471] Strange code generation for __builtin_ctzl()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

--- Comment #6 from Linus Torvalds ---

Ahh, crossed comments.

(In reply to Andrew Pinski from comment #3)
> The xor is due to X86_TUNE_AVOID_FALSE_DEP_FOR_BMI setting:
>
>     /* X86_TUNE_AVOID_FALSE_DEP_FOR_BMI: Avoid false dependency
>        for bit-manipulation instructions. */

Ahh, ok, so it is indeed for that false dependency for old cpus. Intel claims that's only for POPCNT on more recent CPUs. But yes, the same thing definitely happened for BSF on much older cores (ie pre-TZCNT).
[Bug target/106471] Strange code generation for __builtin_ctzl()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

--- Comment #5 from Linus Torvalds ---

(In reply to Andrew Pinski from comment #2)
> The xor is needed because of an errata in some Intel cores.

The only errata I'm aware of is that tzcnt can act as tzcnt even when cpuid doesn't enumerate it (so it would be expected to act as just bsf). Errata 010 for Gen 8/9 cores.

And yes, that's an errata, but the xor doesn't really help.

Sure, the xor means that on old machines, where 'rep' is ignored, and tzcnt will always act as bsf, the result register is now going to be zero if the input is zero.

But that's

 (a) not what tzcnt does (it sets the result to 64 when the input is zero)

 (b) not what __builtin_ctzl() is documented to do anyway

In particular, wrt (b), the documentation already states "If x is 0, the result is undefined", which is exactly the old legacy 'bsf' behavior. And the errata I'm aware of is that 'rep bsf' acts as tzcnt (ie "write 64 to destination instead of leave unmodified"), so even with the xor you actually get undefined behavior (0 or 64 depending on CPU).

So both (a) and (b) argue for that xor being wrong.

Now, of course, there may be some other errata that I'm not aware of. Can somebody point to it?

(And yes, on old CPUs that don't have tzcnt at all the added xor will break a false dependency and maybe help performance, but should gcc really care about old CPUs like that? Particularly when it eats a register and makes it impossible to have the same source and destination register?)

> There is a different bug already recording the issue with cltq (and should
> be fixed soon or was already committed, there is a patch).

Ok, thanks.
[Bug c/106471] Strange code generation for __builtin_ctzl()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

--- Comment #1 from Linus Torvalds ---

Created attachment 53379
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53379&action=edit
Silly test-case as an attachment too

I expected just

    rep bsfq %rdi, %rax
    ret

from this, but that's not what gcc generates.
[Bug c/106471] New: Strange code generation for __builtin_ctzl()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471

Bug ID: 106471
Summary: Strange code generation for __builtin_ctzl()
Product: gcc
Version: 12.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: torva...@linux-foundation.org
Target Milestone: ---

So this actually started out as a clang issue report about bad inline asm input argument behavior at

    https://github.com/llvm/llvm-project/issues/56789

but as part of that I was looking at __builtin_ctzl(), and while gcc DTRT for the inline asm version we use in the kernel, the builtin acts very oddly indeed.

IOW, this code:

    unsigned long test(unsigned long arg)
    {
        return __builtin_ctzl(arg);
    }

generates this odd result with 'gcc -O2 -S':

    xorl %eax, %eax
    rep bsfq %rdi, %rax
    cltq
    ret

where the xor and the cltq both just confuse me.
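The builtin's defined semantics are worth spelling out, since they come up in comment #5: __builtin_ctzl() returns the number of trailing zero bits, and is documented as undefined for a zero input (which is what makes the xor-zeroing in the generated code doubly odd). A quick check of the defined cases, using the same function as the report:

```c
#include <assert.h>

/* The function from the report: count trailing zeros of a non-zero
 * argument. Behavior for arg == 0 is explicitly undefined. */
unsigned long test(unsigned long arg)
{
    return __builtin_ctzl(arg);
}
```

For example, test(8) is 3 because 8 = 0b1000 has three trailing zero bits; nothing here relies on the tzcnt-vs-bsf distinction, which only matters for the undefined zero-input case.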
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #28 from Linus Torvalds ---

(In reply to Roger Sayle from comment #27)
> This should now be fixed on both mainline and the GCC 12 release branch.

Thanks, everybody.

Looks like the xchg optimization isn't in the gcc-12 release branch, but the stack use looks reasonable from my (very limited) testing.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #24 from Linus Torvalds ---

(In reply to Linus Torvalds from comment #23)
> And this now brings back my memory of the earlier similar discussion - it
> wasn't about DImode code generation, it was about bitfield code generation
> being horrendous,

Searching around, it's this one from 2011:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

not really related to this issue, apart from the superficially similar issue of oddly bad code generation.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #23 from Linus Torvalds ---

(In reply to Jakub Jelinek from comment #22)
> If the wider registers are narrowed before register allocation, it is just
> a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.

That was more what I was thinking - why is the DImode information being kept so long? I realize that you want to do a lot of the early CSE etc operations at that higher level, but by the time you are actually allocating registers and thinking about spilling them, why is it still a DImode thing?

And this now brings back my memory of the earlier similar discussion - it wasn't about DImode code generation, it was about bitfield code generation being horrendous, where gcc was keeping the whole "this is a bitfield" information around for a long time, and as a result generating truly horrendous code. When it looked like it instead should just have turned it into a load and shift early, and then done all the sane optimizations at that level (ie rewriting simple bitfield code to just do loads and shifts generated *much* better code than using bitfields).

But this is just my personal curiosity at this point - it looks like Roger Sayle's patch has fixed the immediate problem, so the big issue is solved. And maybe the fact that clang is doing so much better is due to something else entirely - it just _looks_ like it might be this artificial constraint by gcc that makes it do bad register and spill choices.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #21 from Linus Torvalds ---

(In reply to CVS Commits from comment #20)
> One might think that splitting early gives the register allocator more
> freedom to use available registers, but in practice the constraint that
> double word values occupy consecutive registers (when ultimately used as
> a DImode value) is the greater constraint.

Whee. Why does gcc have that constraint, btw?

I tried to look at the clang code generation once more, and I don't *think* clang has the same constraint, and maybe that is why it does so much better?

Yes, x86 itself inherently has a couple of forced register pairings (notably %edx:%eax for 64-bit multiplication and division), and obviously the whole calling convention requires well-defined pairings, but in the general case it seems to be a mistake to keep DImode values as DImode values and force them to be consecutive registers when used.

Maybe I misunderstand. But now that this comes up I have this dim memory of actually having had a discussion like this before on bugzilla, where gcc generated horrible DImode code.

> GCC 11 [use %ecx to address memory, require a 24-byte stack frame]
>     sub esp, 24
>     mov ecx, DWORD PTR [esp+40]
>
> GCC 12 [use %eax to address memory, require a 44-byte stack frame]
>     sub esp, 44
>     mov eax, DWORD PTR [esp+64]

I just checked the current git -tip, and this does seem to fix the original case too, with the old horrid 2620 bytes of stack frame now being a *much* improved 404 bytes!

So your patch - or other changes - does fix it for me, unless I did something wrong in my testing (which is possible). Thanks.

I'm not sure what the gcc policy on closing the bug is (and I don't even know if I am allowed), so I'm not marking this closed, but it seems to be fixed as far as I am concerned, and I hope it gets released as a dot-release for the gcc-12 series.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #14 from Linus Torvalds ---

(In reply to Samuel Neves from comment #13)
> Something simple like this -- https://godbolt.org/z/61orYdjK7 -- already
> exhibits the effect.

Yup. That's a much better test-case. I think you should attach it to the gcc bugzilla; I don't know how long godbolt remembers those links.

But godbolt here is good for another thing: select clang-14 as the compiler, and you realize how even gcc-11 is actually doing a really really bad job.

And I'm not saying that to shame anybody: I think it's more a sign that maybe the issue is actually much deeper, and that the problem already exists in gcc-11, but some small perturbation then just makes an already existing problem much worse in gcc-12.

So that commit I bisected to is probably almost entirely unrelated, and it is just the thing that turns up the bad behavior to 11.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #12 from Linus Torvalds ---

(In reply to Jakub Jelinek from comment #11)
> Anyway, I think we need to understand what makes it spill that much more,
> and unfortunately the testcase is too large to find that out easily, I think
> we should investigate commenting out some rounds.

I think you can comment out all the rounds except for one, and still see the effects of this.

At least when I try it on godbolt with just ROUND(0) in place and rounds 1-12 deleted, for gcc-11.3 I get a stack frame of 388. And with gcc-12.1 I get a stack frame of 516.

Obviously that's not nearly as huge, but it's still a fairly significant expansion and is presumably due to the same basic issue.

I used godbolt just because I no longer have 11.3 installed to compare against, so it was the easiest way to verify that the stack expansion is still there.

Just to clarify: that's still with

    -O2 -m32 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx

I'm sure it depends on the flags.

The code is still not exactly trivial with just one round in place. It's a cryptographic hash, after all. But when compiling for 64-bit mode it all looks fairly straightforward; it really is that DImode stuff that makes it really ungainly when using -m32.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #10 from Linus Torvalds ---

(In reply to Roger Sayle from comment #7)
> Investigating. Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

So now that I actually tried that, '-mno-stv' does nothing for me. I still see a frame size of 2620.

I wonder what the difference in our setups is. I tested with plain gcc-12.1.1 from Fedora 36; maybe you tested with current trunk or something?

Anyway, it seems -mno-stv isn't the solution, at least for the released version ;(
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #9 from Linus Torvalds ---

Looks like STV is "scalar to vector", and it should have been disabled automatically by the -mno-avx flag anyway.

And the excessive stack usage was perhaps due to GCC preparing all those stack slots for integer -> vector movement that then never happens, exactly because the kernel uses -mno-avx?

So if I understand this right, the kernel can indeed just add -mno-stv to work around this?
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #8 from Linus Torvalds ---

(In reply to Roger Sayle from comment #7)
> Investigating. Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

Thanks. Using -mno-stv may well be a good workaround for the kernel for now, except I don't have a clue what it means. I will google it to figure out whether it's appropriate for our use.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #5 from Linus Torvalds ---

(In reply to Linus Torvalds from comment #4)
> I'm not proud of that hacky thing, but since gcc documentation is written
> in sanskrit, and mere mortals can't figure it out, it's the best I could do.

And by 'sanskrit' I mean 'texi'. I understand that it's how GNU projects are supposed to work, but markdown (or rst) is just *so* much more legible, and you really can read it like text.

Anyway, that's my excuse for not knowing how to "just generate cc1" for a saner git bisect run. What I did worked, but was just incredibly ugly.

There must be some better way gcc developers have when they want to bisect cc1 behavior.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #4 from Linus Torvalds ---

So hey, since you guys use git now, I thought I might as well just bisect this.

Now, I have no idea what the best and most efficient way is to generate only "cc1", so my bisection run was this unholy mess of "just run configure, and then run 'make -j128' for long enough that 'host-x86_64-pc-linux-gnu/gcc/cc1' gets generated, and test that".

I'm not proud of that hacky thing, but since gcc documentation is written in sanskrit, and mere mortals can't figure it out, it's the best I could do.

And the problem bisects down to

    8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48 is the first bad commit
    commit 8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48
    Author: Roger Sayle
    Date:   Sat Mar 5 08:50:45 2022 +

        PR 104732: Simplify/fix DI mode logic expansion/splitting on -m32.

so yes, this seems to be very much specific to the i386 target.

And yes, I also verified that reverting that commit on the current master branch solves it for me.

Again: this was just a completely mindless bisection, with a "revert to verify" on top of the current trunk, which for me happened to be commit cd02f15f1ae ("xtensa: Improve constant synthesis for both integer and floating-point").

I'm attaching the revert patch I used just so that you can see exactly what I did. I probably shouldn't have actually removed the testsuite entry, but again: ENTIRELY MINDLESS BISECTION RESULT.
[Bug target/105930] [12/13 Regression] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

--- Comment #3 from Linus Torvalds ---

Created attachment 53123
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53123&action=edit
Mindless revert that fixes things for me
[Bug target/105930] Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930 --- Comment #1 from Linus Torvalds --- Side note: it might be best to clarify that this is a regression specific to gcc-12. Gcc 11.3 doesn't have the problem, and generates code for this same test-case with a stack frame of only 428 bytes. That's still bigger than clang, but not "crazy bigger".
[Bug c/105930] New: Excessive stack spill generation on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

Bug ID: 105930
Summary: Excessive stack spill generation on 32-bit x86
Product: gcc
Version: 12.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: torva...@linux-foundation.org
Target Milestone: ---

Created attachment 53121
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53121&action=edit
Test-case extracted from the generic blake2b kernel code

Gcc-12 seems to generate a huge number of stack spills on this blake2b test-case, to the point where it overflows the allowable kernel stack on 32-bit x86.

This crypto thing has two 128-byte buffers, so a stack frame a bit larger than 256 is expected when the dataset doesn't fit in the register set.

Just as an example, on this code, clang-14.0.0 generates a stack frame that is 296 bytes.

In contrast, gcc-12.1.1 generates a stack frame that is almost an order of magnitude(!) larger, at 2620 bytes.

The trivial Makefile I used for this test-case is

    # The kernel cannot just randomly use FP/MMX/AVX
    CFLAGS := -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx
    CFLAGS += -m32
    CFLAGS += -O2

    test:
            gcc $(CFLAGS) -Wall -S blake2b.c
            grep "sub.*%[er]sp" blake2b.s

to easily test different flags and the end result, but as can be seen from above, it really doesn't need any special flags except the ones that disable MMX/AVX code generation. And the generated code looks perfectly regular, except for the fact that it uses almost 3kB of stack space.

Note that "-m32" is required to trigger this - the 64-bit case does much better, presumably because it has more registers and this needs fewer spills.

It gets worse with some added debug flags we use in the kernel, but not that kind of "order of magnitude" worse. Using -O1 or -Os makes no real difference.

This is presumably due to some newly triggered optimization in gcc-12, but I can't even begin to guess at what we'd need to disable (or enable) to avoid this horrendous stack growth.
Some very aggressive instruction scheduling thing that spreads out all the calculations and always wants to spill-and-reload the subexpressions that it CSE'd? I dunno. Please advise.

The excessive stack literally causes build failures due to us using -Werror=frame-larger-than= to make sure stack use remains sanely bounded. The kernel stack is a rather limited resource.
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #14 from Linus Torvalds ---

(In reply to Vineet Gupta from comment #13)
> Sorry the workaround proposed by Alexander doesn't seem to cure it (patch
> attached), outcome is the same

Vineet - it's not the ldd/std that is necessarily buggy, it's the earlier tests of the address that guard that vectorized path.

So your quoted parts of the code generation aren't necessarily the problematic ones. Did you actually test the code and check whether it has the same issue? Maybe it changed the address limit guards before that ldd/std?

I also sent you a separate patch to test if just upgrading to a newer version of the zlib code helps. Although that may be buggy for other reasons, it's not like I actually tested the end result.

But it would be interesting to hear if that one works for you (again, ldd/std might be a valid end result of trying to vectorize that code, assuming the aliasing tests are done correctly in the vectorized loop headers).
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #11 from Linus Torvalds ---

(In reply to Linus Torvalds from comment #10)
> This particular code comes from some old version of zlib, and I can't test
> because I don't have the ARC background to make any sense of the generated
> code.

Heh. We upgraded to a "recent version" of zlib back in 2006:

   "Upgrade the zlib_inflate implementation in the kernel from a patched
    version 1.1.3/4 to a patched 1.2.3"

but it turns out that the "do things a 16-bit word at a time" part was a kernel-local optimization for some very slow old PowerPC microcontroller.

The code in upstream zlib actually looks rather better (which is not saying much, admittedly), doesn't have any 16-bit accesses, and we probably should just try to synchronize with that instead.
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #10 from Linus Torvalds ---

(In reply to Richard Biener from comment #9)
> Note alignment has nothing to do with strict-aliasing (-fno-strict-aliasing
> you mean btw).

I obviously meant -fno-strict-aliasing, yes.

But I think it's actually essentially the same issue, just in a different guise:

> One thing we do is (I'm not 50% sure this explains the observed issue) assume
> that if you have two accesses with type 'short' and they are aligned
> according to this type then they will not partly overlap. Note this has
> nothing to do with C strict aliasing rules but is basic pointer math when
> you know lower zero bits.

Well, the thing is, you have two situations:

 (a) you can statically see that the two do not alias, because the offset
     arithmetic is either constant or you have some range logic that can tell
     that they are sufficiently far apart.

 (b) you can't.

Now, everybody is ok with the static aliasing situation in (a). If you can tell that two addresses don't alias, you're done, they are independent, there's no question about it.

But that's not the situation here. So we're in (b).

And what I find personally so annoying is that gcc has actually *done* that distance check, but apparently intentionally done it badly based on type information.

And the reason I think this is similar to -fno-strict-aliasing is that it's that same (b) case, and it looks like a very similar "do a bad job of doing actual run-time alias analysis based on type information".

It seems to be literally an off-by-one error, not because it generates better code, but because the compiler has decided to pointlessly make a bad range comparison based on type.

But I've never worked with the gcc IR dumps, so Andrew Pinski's debug output in #c5 doesn't actually make me go "ahh, there". Maybe it's that 8 vs 6 that he pointed out. Did somebody notice that "offset > 8" was off-by-one, and should have been "offset >= 8"? And then changed it to "offset > 6", which is off-by-one in the other direction instead?

> I suggest to try the fix suggested in comment#7 and report back if that
> fixes the observed issue.

Vineet?

I still think gcc is doing the wrong thing, exactly because of that "pointlessly using the wrong range check" issue.

This particular code comes from some old version of zlib, and I can't test because I don't have the ARC background to make any sense of the generated code.
[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

--- Comment #8 from Linus Torvalds ---

(In reply to Alexander Monakov from comment #7)
> Most likely the issue is that sout/sfrom are misaligned at runtime, while
> the vectorized code somewhere relies on them being sufficiently aligned for
> a 'short'.

They absolutely are.

And we build the kernel with -Wno-strict-aliasing exactly to make sure the compiler doesn't think that "oh, I can make aliasing decisions based on type information".

Because we have those kinds of issues all over, and we know which architectures support unaligned loads etc, and all the tricks with "memcpy()" and unions make for entirely unreadable code.

So please fix the aliasing logic to not be type-based when people explicitly tell you not to do that.

Linus
[Bug middle-end/100363] gcc generating wider load/store than warranted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Linus Torvalds changed:

           What    |Removed                     |Added
 ----------------------------------------------------------------------------
                 CC|                            |torvalds@linux-foundation.org

--- Comment #4 from Linus Torvalds ---

(In reply to Andrew Pinski from comment #1)
> The loop gets vectorized, I don't see the problem really.

See

  https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/issues/372

and in particular the comment

  "In the first 8-byte copy, src and dst overlap"

so apparently gcc has decided that they can't overlap, despite the two pointers being literally generated from the same base pointer.

But I don't read ARC assembly, so I'll have to take Vineet's word for it.

Vineet, have you been able to generate a smaller test-case?