https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106161
Bug ID: 106161
Summary: Dubious choice of optimization strategy
Product: gcc
Version: 9.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vluchits at gmail dot com
Target Milestone: ---

Hello, here's a piece of C code:

...
#define AC_NEWCEILING 16
#define AC_NEWFLOOR 32
...
    if (newclipbounds)
    {
        int newfloorclipx = floorclipx;
        int newceilingclipx = ceilingclipx;
        uint16_t newclip;

        // rewrite clipbounds
        if (actionbits & AC_NEWFLOOR)
            newfloorclipx = low;
        if (actionbits & AC_NEWCEILING)
            newceilingclipx = high;

        newclip = (newceilingclipx << 8) + newfloorclipx;
        clipbounds[x] = newclip;
        newclipbounds[x] = newclip;
    }
...

which is compiled with -Os and results in the following set of SH-2
assembler instructions:

        if (newclipbounds)
 190:   54 fb       mov.l   @(44,r15),r4
 192:   24 48       tst     r4,r4
 194:   8d 11       bt.s    1ba <_R_SegLoop+0x1ba>
 196:   e0 58       mov     #88,r0
        if (actionbits & AC_NEWFLOOR)
 198:   05 fe       mov.l   @(r0,r15),r5
 19a:   25 58       tst     r5,r5
 19c:   8f 01       bf.s    1a2 <_R_SegLoop+0x1a2>
 19e:   e0 5c       mov     #92,r0
        floorclipx = ceilingclipx & 0x00ff;
 1a0:   67 93       mov     r9,r7
        if (actionbits & AC_NEWCEILING)
 1a2:   00 fe       mov.l   @(r0,r15),r0
 1a4:   20 08       tst     r0,r0
 1a6:   8f 01       bf.s    1ac <_R_SegLoop+0x1ac>
 1a8:   e0 40       mov     #64,r0
        int newceilingclipx = ceilingclipx;
 1aa:   66 83       mov     r8,r6
        clipbounds[x] = newclip;
 1ac:   00 fe       mov.l   @(r0,r15),r0
        newclip = (newceilingclipx << 8) + newfloorclipx;
 1ae:   46 18       shll8   r6
 1b0:   37 6c       add     r6,r7
 1b2:   67 7d       extu.w  r7,r7
        clipbounds[x] = newclip;
 1b4:   0c 75       mov.w   r7,@(r0,r12)
        newclipbounds[x] = newclip;
 1b6:   50 fb       mov.l   @(44,r15),r0
 1b8:   0c 75       mov.w   r7,@(r0,r12)

What I find really odd is that gcc opts to cache the results of the
bitwise ANDs on the stack and reload them individually, instead of
simply doing tst #imm1,r0 and tst #imm2,r0. There are more instances
of this behavior further down the same function. Memory reads are
really expensive on the target architecture, and I would like to avoid
them if possible.
I'm not sure whether this behavior is triggered by some optimization setting or is inherent to the architecture, but I'd appreciate any help here.