Hi Mei,

See the attached updated and re-tested patch.

For comment 1: Removed redundant WN_MAP_Set.
For comment 2: Removed redundant initialization as the original code took 
advantage of these being initialized to NULL or FALSE according to C standard.
For comment 3: 'vec_unroll_preg_store' corresponds to 'II' in the conceptual 
transformation shown for the fourth test case and its size corresponds to the 
vector length of the vectorized operation. For the example shown, since d3 is 
an array of doubles, we need II = [i, i+1], but if d3 was an array of floats, 
we would need II = [i, i+1, i+2, i+3] and so on. For a vector operation packing 
16 scalar operations into one, we would need a size 16 array (though I don't 
think currently the vectorizer can generate such operations). Also, depending 
on the unrolling factor, the index variable may need to be incremented by 
2,4,6,8,10,12,16. The unrolling is done in batch after vectorization. For any 
statement unrolled several times, each time, an increment is added to the 
"add_to_base" to compute how much increment is needed for the unrolling. 
Therefore, for a single statement (e.g. s1 in the last example) whose unroll 
factor is 2, each increment to the index should be a multiple of 2.

Thanks.
Pallavi

From: Ye, Mei
Sent: Wednesday, May 25, 2011 5:25 PM
To: Mathew, Pallavi; open64-devel@lists.sourceforge.net
Subject: RE: Code review request for vectorizer bugfix patch

-L117, "WN_MAP_Set" seems to be redundant here.
-L31/32, you changed the size of "vec_unroll_preg_created" and 
"vec_unroll_preg_store" but did not change the trip count in the loop that 
initializes these arrays.
-L35, If the index variable can only be updated by factors: 1, 2, 4, 8, 16, 
then we only need to create 5 types of pregs instead of 16.

-Mei

From: Mathew, Pallavi [mailto:pallavi.mat...@amd.com]
Sent: Wednesday, May 25, 2011 1:34 PM
To: open64-devel@lists.sourceforge.net
Subject: [Open64-devel] Code review request for vectorizer bugfix patch

Hi,

Can a gatekeeper please review the attached vectorizer patch?

Sample testcases that fail at -O3:
#include <stdio.h>
int a[1000], b[1000], c[1000];
void init (void) {
    int i;
    for (i = 0; i < 1000; i++) {
        a[i] = 1;
        b[i] = 2;
        c[i] = 3;
    }
}

int
main (int argc, char** argv) {
    int i;
    init ();
    for (i = 2; i < 1000; i++) {
        b[i] = a[i-2];
        a[i] = c[i];
    }
    fprintf (stderr, "%d\n", b[6]);
    return 0;
}
---
#include <stdio.h>

#define N 1000
double d1[N], d2[N], d3[N];
float f1[N], f2[N];

void foo (void) {
    int i;
    for (i = 4; i < N; i++) {
        d1[i] = d2[i];         /* s1 */
        f1[i] = f2[i];         /* s2 */
        d3[i] = d1[i+3];       /* s3 */
    }
}

void
init (void) {
    int i;
    for (i = 0; i < N; i++) {
        d1[i] = 1.0;
        d2[i] = 2.0;
        d3[i] = 3.0;
    }
}

int
main (int argc, char** argv) {
    init ();
    foo ();
    fprintf (stderr, "%f\n", (float)d3[4]);
    return 0;
}

---
#include <stdio.h>

#define N 1000
double d1[N], d2[N], d3[N];
char f1[N], f2[N];

void foo (void) {
    int i;
    for (i = 0; i < N - 4; i++) {
        d1[i] = d2[i] + d3[i+1];         /* s1 */
        f1[i] = f2[i];         /* s2 */
        d3[i] = d1[i+3];       /* s3 */
    }
}

void init (void) {
    int i;
    for (i = 0; i < N; i++) {
        d1[i] = 1.0 * i;
        d2[i] = 2.0 * i;
        d3[i] = 3.0 * i;
    }
}

int main (int argc, char** argv) {
    int i;
    init ();
    foo ();
    for (i=0; i<10; i++){
        fprintf (stderr, "%f ", (float)d3[i]);
    }
    fprintf(stderr, "\n");
    return 0;
}
---
#include <stdio.h>
#define N 1000
double d1[N], d2[N], d3[N];
char f1[N], f2[N];

int main (int argc, char** argv) {
  int i;
  for (i = 0; i < N; i++) {
    d3[i] = 3.0 * i;       /* s1 */
    f1[i] = f2[i];         /* s2 */
  }
  for (i=0; i<10; i++){
    fprintf (stderr, "%f ", (float)d3[i]);
  }
  fprintf(stderr, "\n");
  return 0;
}

---
Problem/Fix-Description:
 The central issue is incorrect ordering of vectorized statements.

 In the first example, the vector length is 4. Therefore, we need to
 re-order the two statements in order to perform vectorization correctly.

 b[i] = a[i-2];
 a[i] = c[i];

 To fix the problem, we topologically reorder the statements right
 before performing vectorization.

 In the other examples the problem is caused by incorrect order of certain
 unrolled statements during vectorization, coupled sometimes (last two examples)
 with incorrect handling of induction variables in loop unrolling during SIMD.
 Before this fix, copies of unrolled statement are inserted right after current 
statement,
 without checking dependencies. This causes problems to the following example 
code:

 d1[i] = d2[i] + d3[i+1]; /* s1 */
 f1[i] = f2[i]; /* s2 */
 d3[i] = d1[i+3]; /* s3 */

 Due to the presence of s2, s1 and s3 need to be unrolled by 2. So,
 the code will become the following:

 d1[i:i+1] = d2[i:i+1] + d3[i+1:i+2];/* s1 */
 d1[i+2:i+3] = d2[i+2:i+3] + d3[i+3:i+4];/* s1' */
 f1[i:i+3] = f2[i:i+3]; /* s2 */
 d3[i:i+1] = d1[i+3:i+4]; /* s3 */
 d3[i+2:i+3] = d1[i+5:i+6]; /* s3' */

 Due to s1', d1[i+3] is written before read which is not so in original
 code. The fix is to insert unrolled copies of statements in the order
 of their appearence in the original loop.

 In the last example, when vectorizing `3.0 * I' (use of index variable in non 
array
 subscript operation), the first loop is changed to (conceptually):

 II = [start:start+4]
 INC= [16, 16, 16, 16]
 for (i = start; i < end; i+=16) {
    d3[i:i+1] = 3.0 * II;      /* vectorized version of s1 */
    f1[i:i+15]= f2[i:i+15];    /* vectorized version of s2 */
    II += INC;                 /* s3, created by SIMD */
 }

 Then, unrolling is performed. Since the loop is unrolled 16 times, we
 need to fill remaining copies of s1. SIMD does this by creating the
 following unrolled copy and then insert it to the loop.

 KK = [2, 2, 2, 2]
 d3[i+2:i+3] = 3.0 * (II + KK);

 In the original code, unrolled copy was inserted after the last
 statement of the loop. However, in this case, it should be s2, not
 s3. The other problem is that, vectorized version of s1 needs to be
 unrolled 8 times, so the increment to base should be 2, 4, 6, 8, 10,
 12, 14. However, the original code only deals with increment being 1,
 2, 4, 8 case (hence the assertion failure).

Thanks.
Pallavi

Attachment: vectorizer2.p
Description: vectorizer2.p

------------------------------------------------------------------------------
Simplify data backup and recovery for your virtual environment with vRanger. 
Installation's a snap, and flexible recovery options mean your data is safe,
secure and there when you need it. Data protection magic?
Nope - It's vRanger. Get your free trial download today. 
http://p.sf.net/sfu/quest-sfdev2dev
_______________________________________________
Open64-devel mailing list
Open64-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/open64-devel

Reply via email to