https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-01-22
                 CC|                            |rguenth at gcc dot gnu.org
            Summary|poor code when using        |vector lowering should
                   |vector_size((32)) for sse2  |split loads and stores
     Ever confirmed|0                           |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, the issue is "simple": veclower splits the registers but not the loads/stores themselves:
<bb 3>:
  # ivtmp.11_24 = PHI <ivtmp.11_23(3), 0(2)>
  _8 = MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B];
  _11 = MEM[base: b_9(D), index: ivtmp.11_24, offset: 0B];
  _17 = BIT_FIELD_REF <_8, 128, 0>;
  _4 = BIT_FIELD_REF <_11, 128, 0>;
  _5 = _4 + _17;
  _29 = BIT_FIELD_REF <_8, 128, 128>;
  _28 = BIT_FIELD_REF <_11, 128, 128>;
  _14 = _28 + _29;
  _12 = {_5, _14};
  MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B] = _12;
  ivtmp.11_23 = ivtmp.11_24 + 32;
  if (ivtmp.11_23 != 8192)
    goto <bb 3>;
  else
    goto <bb 4>;
In this case it would also have a moderately hard time splitting the loads/stores,
as it is already faced with TARGET_MEM_REFs.
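To make the difference concrete, here is a hypothetical C reproducer consistent with the dump (the PR's actual testcase is not quoted here). It assumes a 4 x double generic vector (32 bytes) and a loop over 8192 bytes, i.e. 256 iterations. `add_whole` is the source form: veclower splits its addition into two 128-bit halves but keeps the 256-bit accesses. `add_split` is a hand-written sketch of what splitting the loads and stores as well would amount to: two 16-byte loads per operand at byte offsets 0 and 16 (bit positions 0 and 128 in the BIT_FIELD_REFs), two adds, and two 16-byte stores. All names are illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reproducer types: V4DF and V2DF as GCC generic vectors. */
typedef double v4df __attribute__((vector_size(32)));
typedef double v2df __attribute__((vector_size(16)));

/* 8192 bytes / 32 bytes per v4df = 256 iterations, as in the dump. */
#define N_VEC (8192 / sizeof(v4df))

/* Source form: one 32-byte load per operand and one 32-byte store. */
void add_whole(v4df *a, const v4df *b)
{
    for (size_t i = 0; i < N_VEC; i++)
        a[i] += b[i];
}

/* Hand-split form: every memory access is 16 bytes wide.
   memcpy keeps the reinterpretation of v4df as v2df well defined. */
void add_split(v4df *a, const v4df *b)
{
    for (size_t i = 0; i < N_VEC; i++) {
        v2df a_lo, a_hi, b_lo, b_hi;
        char *pa = (char *)&a[i];
        const char *pb = (const char *)&b[i];
        memcpy(&a_lo, pa, 16);       /* bits 0..127   (offset 0)  */
        memcpy(&a_hi, pa + 16, 16);  /* bits 128..255 (offset 16) */
        memcpy(&b_lo, pb, 16);
        memcpy(&b_hi, pb + 16, 16);
        a_lo += b_lo;                /* corresponds to _5  */
        a_hi += b_hi;                /* corresponds to _14 */
        memcpy(pa, &a_lo, 16);
        memcpy(pa + 16, &a_hi, 16);
    }
}

/* Returns 1 if both versions produce bit-identical results. */
int split_matches_whole(void)
{
    static v4df a1[N_VEC], a2[N_VEC], b[N_VEC];
    for (size_t i = 0; i < N_VEC; i++)
        for (int k = 0; k < 4; k++) {
            a1[i][k] = a2[i][k] = (double)(4 * i + k);
            b[i][k] = 0.5 * k;
        }
    add_whole(a1, b);
    add_split(a2, b);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Compiling both functions with `gcc -O2 -msse2` and comparing the generated assembly is one way to observe the difference; the exact code depends on the GCC version.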
Nothing combines this back into a sane form. I've recently added code that
handles exactly the same situation but only for complex arithmetic
(in tree-ssa-forwprop.c for PR64568).
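A combine of this kind would rewrite `BIT_FIELD_REF <wide load, size, pos>` into a narrower load at `offset + pos / 8`. The sketch below models that on a toy struct whose fields mirror the printed form `MEM[base: ..., index: ..., offset: ...]`. It is a hypothetical illustration, not GCC's internal API and not the PR64568 code. It assumes x86_64's little-endian layout, where bit `pos` of the loaded value lives at byte offset `pos / 8`. A real transform would additionally need to verify that the wide load has no uses other than such extracts (and is not volatile), and would need a mirrored rule for the `_12 = {_5, _14}` store.

```c
/* Toy model (hypothetical) of MEM[base: B, index: I, offset: O]. */
struct mem_ref {
    int base_ssa;        /* SSA version of the base, e.g. 6 for a_6(D)        */
    int index_ssa;       /* SSA version of the index, e.g. 24 for ivtmp.11_24 */
    long offset;         /* constant byte offset ("offset: 0B")               */
    unsigned size_bits;  /* access width in bits; 0 signals "not folded"      */
};

/* BIT_FIELD_REF <load, size_bits, pos_bits> applied to a load result. */
struct bit_field_ref {
    struct mem_ref load;
    unsigned size_bits;
    unsigned pos_bits;
};

/* Fold the extract into a narrower memory access at offset + pos / 8.
   Only byte-aligned, in-bounds extracts are folded; otherwise the
   returned mem_ref has size_bits == 0. Assumes little-endian layout. */
struct mem_ref fold_bfr_of_load(struct bit_field_ref bfr)
{
    struct mem_ref out = bfr.load;
    if (bfr.pos_bits % 8 != 0 || bfr.size_bits % 8 != 0
        || bfr.pos_bits + bfr.size_bits > bfr.load.size_bits) {
        out.size_bits = 0;
        return out;
    }
    out.offset += (long)(bfr.pos_bits / 8);
    out.size_bits = bfr.size_bits;
    return out;
}
```

For `_29 = BIT_FIELD_REF <_8, 128, 128>` this yields a 128-bit access at `offset: 16B` with the same base and index, which is the shape a split load would have.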
I wonder why, with only -msse2, IVOPTs produces TARGET_MEM_REFs for the loads.
Surely x86_64 cannot load a V4DF in one instruction...
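At the instruction level, the constraint looks like this. A V4DF is 4 x 8 = 32 bytes, while an SSE2 `__m128d` register holds 16 bytes, so each 32-byte chunk costs two loads per operand and two stores. The sketch below uses the real SSE2 intrinsics `_mm_loadu_pd`, `_mm_add_pd` and `_mm_storeu_pd` from `<emmintrin.h>`. It falls back to scalar code when `__SSE2__` is not defined. The function names are hypothetical, and the 1024-double (8192-byte) arrays are an assumption matching the loop bound in the dump.

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* One V4DF is four doubles; one __m128d holds two. */
enum { DOUBLES_PER_V4DF = 4, DOUBLES_PER_XMM = 2 };

/* a[0..n) += b[0..n), n a multiple of 4, one 32-byte chunk per iteration. */
void add_v4df_chunks(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i += DOUBLES_PER_V4DF) {
#ifdef __SSE2__
        /* Two 16-byte loads per operand: bytes 0..15 and 16..31. */
        __m128d a_lo = _mm_loadu_pd(a + i);
        __m128d a_hi = _mm_loadu_pd(a + i + DOUBLES_PER_XMM);
        __m128d b_lo = _mm_loadu_pd(b + i);
        __m128d b_hi = _mm_loadu_pd(b + i + DOUBLES_PER_XMM);
        /* Two 128-bit adds and two 16-byte stores. */
        _mm_storeu_pd(a + i, _mm_add_pd(a_lo, b_lo));
        _mm_storeu_pd(a + i + DOUBLES_PER_XMM, _mm_add_pd(a_hi, b_hi));
#else
        for (size_t k = 0; k < DOUBLES_PER_V4DF; k++)
            a[i + k] += b[i + k];
#endif
    }
}

/* Returns 1 if a[i] == 3 * i after adding b[i] = 2 * i to a[i] = i. */
int add_v4df_chunks_ok(void)
{
    static double a[1024], b[1024];   /* 8192 bytes each */
    for (size_t i = 0; i < 1024; i++) {
        a[i] = (double)i;
        b[i] = 2.0 * (double)i;
    }
    add_v4df_chunks(a, b, 1024);
    for (size_t i = 0; i < 1024; i++)
        if (a[i] != 3.0 * (double)i)
            return 0;
    return 1;
}
```

SSE2 is part of the x86_64 baseline, so the intrinsic path is the one normally compiled there; unaligned loads and stores are used so the sketch makes no alignment assumptions about `a` and `b`.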