https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122308
Bug ID: 122308
Summary: Inefficient vectorization on inner loop
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---
Given a simple two-level loop nest, a simple way to vectorize it is to apply
the straightforward transform to the inner loop alone, with no need to
consider the outer loop at all.
short a[1024];
short b[2048];
int c[2048];

void foo(int n)
{
  for (int i = 0; i < n; i++)
    {
      int index = c[i];
      for (int j = 0; j < 1024; ++j)
        a[j] += b[index + j];
    }
}
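For illustration, the inner-loop-only transform could look roughly like the hand-written sketch below. It is not vectorizer output. It assumes GCC's generic vector extension (the vector_size attribute), four 16-bit lanes, a trip count divisible by four, and arrays that do not overlap. The names foo_inner_vectorized, v4hu, INNER_TRIP and LANES are hypothetical, and the arrays are passed as parameters only to keep the sketch self-contained.

```c
#include <string.h>

/* Four 16-bit lanes.  Unsigned lanes keep wrap-around on overflow
   well defined.  */
typedef unsigned short v4hu __attribute__ ((vector_size (8)));

enum { INNER_TRIP = 1024, LANES = 4 };

/* Hypothetical hand-written equivalent of foo(): only the inner loop is
   vectorized, the outer loop stays scalar.  */
void
foo_inner_vectorized (short *a, const short *b, const int *c, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* index is invariant in the inner loop, so b + index is a plain
         base pointer.  */
      const short *bp = b + c[i];

      for (int j = 0; j < INNER_TRIP; j += LANES)
        {
          v4hu va, vb;

          memcpy (&va, a + j, sizeof va);   /* contiguous load of a[j..j+3] */
          memcpy (&vb, bp + j, sizeof vb);  /* contiguous load of b[index+j..] */
          va += vb;                         /* one vector add */
          memcpy (a + j, &va, sizeof va);   /* contiguous store */
        }
    }
}
```

In this form every access to a and b is a contiguous vector load or store, and no per-lane address computation is needed.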
However, the current vectorizer chooses the outer loop as the loop to
vectorize, and applies a complicated and very inefficient transform to the
inner loop. The index vector is built via the induction vectorization
technique, and then this vector is decomposed back into scalar elements.
<bb 5> [local count: 956703966]:
# vect_vec_iv_.30_94 = PHI <{ 0, 1, 2, 3 }(4), _95(5)>
# ivtmp.58_23 = PHI <0(4), ivtmp.58_27(5)>
vect__53.31_99 = vect_vec_iv_.30_94 + vect_cst__98;
vect__1.20_80 = MEM <vector(4) short int> [(short int *)&a + ivtmp.58_23 * 1];
vect__2.21_81 = VIEW_CONVERT_EXPR<vector(4) unsigned short>(vect__1.20_80);
vect__4.24_87 = MEM <vector(4) short int> [(short int *)vectp_b.23_82 + ivtmp.58_23 * 1];
vect__5.25_88 = VIEW_CONVERT_EXPR<vector(4) unsigned short>(vect__4.24_87);
vect__6.26_89 = vect__2.21_81 + vect__5.25_88;
vect__7.27_90 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__6.26_89);
MEM <vector(4) short int> [(short int *)&a + ivtmp.58_23 * 1] = vect__7.27_90;
_101 = BIT_FIELD_REF <vect__53.31_99, 32, 0>;
_103 = _101 w* 2;
_104 = _100 + _103;
_105 = (void *) _104;
_106 = MEM[(short int *)_105];
_107 = BIT_FIELD_REF <vect__53.31_99, 32, 32>;
_109 = _107 w* 2;
_110 = _100 + _109;
_111 = (void *) _110;
_112 = MEM[(short int *)_111];
_113 = BIT_FIELD_REF <vect__53.31_99, 32, 64>;
_115 = _113 w* 2;
_116 = _100 + _115;
_117 = (void *) _116;
_118 = MEM[(short int *)_117];
_119 = BIT_FIELD_REF <vect__53.31_99, 32, 96>;
_121 = _119 w* 2;
_122 = _100 + _121;
_123 = (void *) _122;
_124 = MEM[(short int *)_123];
vect__54.32_125 = {_106, _112, _118, _124};
vect__55.33_126 = VIEW_CONVERT_EXPR<vector(4) unsigned short>(vect__54.32_125);
vect__56.34_127 = vect__6.26_89 + vect__55.33_126;
vect__57.35_128 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__56.34_127);
MEM <vector(4) short int> [(short int *)&a + ivtmp.58_23 * 1] = vect__57.35_128;
_95 = vect_vec_iv_.30_94 + { 4, 4, 4, 4 };
ivtmp.58_27 = ivtmp.58_23 + 8;
if (ivtmp.58_27 != 2048)
goto <bb 5>; [98.99%]
else
goto <bb 6>; [1.01%]
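To make the cost of the decomposed load concrete, here is a small scalar C model of the per-lane sequence in the dump above (BIT_FIELD_REF, widening multiply by 2, add to a base, scalar load, vector constructor). It is only a sketch under assumptions: _100 is taken to be the address of b, and vect_cst__98 is taken to be a splat of the loop-invariant index. Neither definition is visible in the dump. The function names are hypothetical.

```c
#include <string.h>

/* Model of the decomposed load: build the index vector, then extract
   each lane, scale it by sizeof (short), add the base address and do
   one scalar load per lane.  `base` stands in for _100 (assumed to be
   the address of b).  */
void
load_via_index_vector (const short *base, int index, int j, short out[4])
{
  int idx[4];

  /* vect_vec_iv + vect_cst: { j, j+1, j+2, j+3 } + splat (index),
     assuming vect_cst is a splat of index.  */
  for (int k = 0; k < 4; k++)
    idx[k] = (j + k) + index;

  for (int k = 0; k < 4; k++)
    {
      /* BIT_FIELD_REF lane k, then "w* 2", then add to the base.  */
      const char *addr = (const char *) base + (long) idx[k] * 2;
      memcpy (&out[k], addr, sizeof (short));   /* one scalar load */
    }
}

/* The same four elements fetched with a single contiguous access.  */
void
load_contiguous (const short *base, int index, int j, short out[4])
{
  memcpy (out, base + index + j, 4 * sizeof (short));
}
```

Under these assumptions both functions return the same four elements, since the lane indices are consecutive. The four extract, multiply, add and load sequences therefore do the work of one contiguous vector load.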