https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122308

            Bug ID: 122308
           Summary: Inefficient vectorization on inner loop
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given a simple two-level loop nest, a simple way to vectorize it is just to do
the straightforward transform in the context of the inner loop, with no need to
consider the outer one at all.

short a[1024];
short b[2048];
int c[2048];

void foo(int n)
{
  for (int i = 0; i < n; i++)
    {
      int index = c[i];

      for (int j = 0; j < 1024; ++j)
        a[j] += b[index + j];
    }
}
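
For reference, this is roughly what such an inner-loop-only transform could
look like when written by hand with GCC's generic vector extensions (the
8 x 16-bit lane shape and the function name below are illustrative
assumptions, not actual vectorizer output):

/* Hand-written sketch of the straightforward inner-loop vectorization.  */
typedef short v8hi __attribute__((vector_size(16)));

void foo_inner_vectorized(int n)
{
  for (int i = 0; i < n; i++)
    {
      int index = c[i];

      /* b[index + j] is a contiguous access throughout the inner loop, so a
         plain (possibly unaligned) vector load per iteration suffices; no
         per-lane handling of the index is needed.  */
      for (int j = 0; j < 1024; j += 8)
        {
          v8hi va, vb;
          __builtin_memcpy(&va, &a[j], sizeof va);
          __builtin_memcpy(&vb, &b[index + j], sizeof vb);
          va += vb;
          __builtin_memcpy(&a[j], &va, sizeof va);
        }
    }
}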

However, the current vectorizer chooses the outer one as the vectorization
loop, and applies a complicated and very inefficient transform to the inner
loop. The index vector is spliced together via the induction-vectorization
technique, and then this vector is decomposed back into scalar elements.

  <bb 5> [local count: 956703966]:
  # vect_vec_iv_.30_94 = PHI <{ 0, 1, 2, 3 }(4), _95(5)>
  # ivtmp.58_23 = PHI <0(4), ivtmp.58_27(5)>
  vect__53.31_99 = vect_vec_iv_.30_94 + vect_cst__98;
  vect__1.20_80 = MEM <vector(4) short int> [(short int *)&a + ivtmp.58_23 * 1];
  vect__2.21_81 = VIEW_CONVERT_EXPR<vector(4) unsigned short>(vect__1.20_80);
  vect__4.24_87 = MEM <vector(4) short int> [(short int *)vectp_b.23_82 + ivtmp.58_23 * 1];
  vect__5.25_88 = VIEW_CONVERT_EXPR<vector(4) unsigned short>(vect__4.24_87);
  vect__6.26_89 = vect__2.21_81 + vect__5.25_88;
  vect__7.27_90 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__6.26_89);
  MEM <vector(4) short int> [(short int *)&a + ivtmp.58_23 * 1] = vect__7.27_90;
  _101 = BIT_FIELD_REF <vect__53.31_99, 32, 0>;
  _103 = _101 w* 2;
  _104 = _100 + _103;
  _105 = (void *) _104;
  _106 = MEM[(short int *)_105];
  _107 = BIT_FIELD_REF <vect__53.31_99, 32, 32>;
  _109 = _107 w* 2;
  _110 = _100 + _109;
  _111 = (void *) _110;
  _112 = MEM[(short int *)_111];
  _113 = BIT_FIELD_REF <vect__53.31_99, 32, 64>;
  _115 = _113 w* 2;
  _116 = _100 + _115;
  _117 = (void *) _116;
  _118 = MEM[(short int *)_117];
  _119 = BIT_FIELD_REF <vect__53.31_99, 32, 96>;
  _121 = _119 w* 2;
  _122 = _100 + _121;
  _123 = (void *) _122;
  _124 = MEM[(short int *)_123];
  vect__54.32_125 = {_106, _112, _118, _124};
  vect__55.33_126 = VIEW_CONVERT_EXPR<vector(4) unsigned short>(vect__54.32_125);
  vect__56.34_127 = vect__6.26_89 + vect__55.33_126;
  vect__57.35_128 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__56.34_127);
  MEM <vector(4) short int> [(short int *)&a + ivtmp.58_23 * 1] = vect__57.35_128;
  _95 = vect_vec_iv_.30_94 + { 4, 4, 4, 4 };
  ivtmp.58_27 = ivtmp.58_23 + 8;
  if (ivtmp.58_27 != 2048)
    goto <bb 5>; [98.99%]
  else
    goto <bb 6>; [1.01%]
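
In C terms, my reading of the per-lane sequence above (the BIT_FIELD_REFs
feeding the scalar loads _106/_112/_118/_124) is roughly the following; the
function and variable names are made up for illustration:

/* Illustrative rendering of the gather-like access to b in the dump.  */
static void gather_b_lane_by_lane(const short *base, int first, short out[4])
{
  int lane[4] = { first, first + 1, first + 2, first + 3 };  /* extracted lanes */
  for (int k = 0; k < 4; k++)
    out[k] = base[lane[k]];  /* four scalar loads, reassembled into a vector */
}

That is four scalar loads per vector iteration for what is in fact a single
contiguous 8-byte chunk of b.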
