https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308
--- Comment #20 from wilco at gcc dot gnu.org ---
(In reply to Bernd Edlinger from comment #19)
> I think the problem with anddi iordi and xordi instructions is that
> they obscure the data flow between low and high half words.
> When they are not enabled, we have the low and high parts
> expanded independently, but in the case of the di mode instructions
> it is not clear which of the half words propagate from input to output.
>
> With my new patch, we have 2328 bytes stack for hard float point,
> and only 272 bytes for arm-none-eabi which is a target I care about.
>
>
> This is still not perfect, but certainly a big improvement.
>
> Wilco, where have you seen the additional registers used with my
> previous patch, maybe we can try to fix that somehow?

What happens is that the move of zero causes us to use extra registers in
shifts, as both source and destination are now always live at the same time.
We generate worse code for simple examples like x | (y << 3):

-mfpu=vfp:
        push    {r4, r5}
        lsls    r5, r1, #3
        orr     r5, r5, r0, lsr #29
        lsls    r4, r0, #3
        orr     r0, r4, r2
        orr     r1, r5, r3
        pop     {r4, r5}
        bx      lr

-mfpu=neon:
        lsls    r1, r1, #3
        orr     r1, r1, r0, lsr #29
        lsls    r0, r0, #3
        orrs    r0, r0, r2
        orrs    r1, r1, r3
        bx      lr

So that means this is not a solution.

Note init_regs already does insert moves of zero before expanded shifts (I get
the same code with -mfpu=vfp with or without your previous patch), so it
shouldn't be necessary. Why does it still make a difference? Presumably
init_regs doesn't find all cases or doesn't insert the moves at the right
place, so we should fix that rather than do it in the shift expansion.

However, the underlying issue is that DI mode operations are not all split at
exactly the same time, and that is what needs to be fixed.
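
For reference, here is a minimal C sketch of what splitting x | (y << 3) into 32-bit halves looks like. The function names (or_shift_di, or_shift_split, check_samples) are hypothetical and chosen only for illustration. This is not the testcase from the bug, and it makes no claim about which argument lands in which register. It only shows that once the 64-bit shift and OR are written on the halves, each output half depends on a small, explicit set of input halves, which is the data flow the register allocator has to see.

```c
#include <assert.h>
#include <stdint.h>

/* The 64-bit (DI mode) expression from the comment, left to the compiler. */
uint64_t or_shift_di(uint64_t x, uint64_t y)
{
    return x | (y << 3);
}

/* The same computation written by hand on 32-bit halves.
   Assumption: this mirrors the shape of the split sequence
   (shift, orr with lsr #29, shift, orr, orr), not GCC's actual RTL. */
uint64_t or_shift_split(uint64_t x, uint64_t y)
{
    uint32_t xlo = (uint32_t)x, xhi = (uint32_t)(x >> 32);
    uint32_t ylo = (uint32_t)y, yhi = (uint32_t)(y >> 32);

    /* High half of y << 3: needs the top 3 bits of the low half. */
    uint32_t hi = (yhi << 3) | (ylo >> 29);
    /* Low half of y << 3: depends on the low half only. */
    uint32_t lo = ylo << 3;

    /* The OR is fully independent per half. */
    lo |= xlo;
    hi |= xhi;

    return ((uint64_t)hi << 32) | lo;
}

/* Compare both versions on a few sample values. */
void check_samples(void)
{
    const uint64_t samples[] = {
        0x0ULL, 0x1ULL, 0xE000000080000001ULL, 0xFFFFFFFFFFFFFFFFULL
    };
    const int n = (int)(sizeof samples / sizeof samples[0]);

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            assert(or_shift_split(samples[i], samples[j]) ==
                   or_shift_di(samples[i], samples[j]));
}
```

In the hand-split version the low result never reads a high input half, and the high result reads the low half of y only through the lsr #29 term. A DI mode pattern that is kept whole until late hides exactly this information.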