On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote:
From: Ju-Zhe Zhong <juzhe.zh...@rivai.ai>

This patch is fixing V3 patch:
https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/

Fix issues according to Richard Sandiford && Richard Biener.

1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
2. Support multiple-rgroup for non-SLP auto-vectorization.

    For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total length:

      _36 = MIN_EXPR <ivtmp_34, POLY_INT_CST [8, 8]>;

      First length (MIN (X, VF/N)):
        loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;

      Second length (X - MIN (X, 1 * VF/N)):
        loop_len_16 = _36 - loop_len_15;

      Third length (X - MIN (X, 2 * VF/N)):
        _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
        loop_len_17 = _36 - _38;

      Fourth length (X - MIN (X, 3 * VF/N)):
        _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
        loop_len_18 = _36 - _39;

I use MIN_EXPR instead of SELECT_VL to calculate the total length because
using SELECT_VL to adapt the induction IV consumes more instructions than
just using MIN_EXPR. Also, during testing, I found it hard to adjust the
lengths correctly according to SELECT_VL.

So, in this patch we only use SELECT_VL for single-rgroup with single length control.

3. Fix the documentation of select_vl for Richard Biener (remove mode N).
4. Fix the comments of vect_set_loop_controls_by_select_vl according to Richard Biener.
5. Keep loop_vinfo as the first parameter of "vect_get_loop_len".
6. Move the requirement check out of get_while_len_data_ref_ptr so that it is gated at the call site.
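
To make the arithmetic in item 2 concrete, here is a minimal C++ sketch that evaluates the quoted MIN_EXPR and subtraction statements for one loop iteration. It assumes plain integers stand in for the POLY_INT_CST values (VF = 8, N = 4, so VF/N = 2), and `rgroup_lengths` is a hypothetical helper for illustration, not a function in the patch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the patch): evaluates the quoted
// statements for one iteration using plain integers.
//   remaining : elements still to process (ivtmp_34 in the dump)
//   vf        : elements covered per iteration by the rgroup (8 in the dump)
//   n         : number of length controls in the rgroup (4 in the dump)
std::vector<uint64_t>
rgroup_lengths (uint64_t remaining, uint64_t vf, unsigned n)
{
  const uint64_t total = std::min (remaining, vf);  // _36
  const uint64_t per_vec = vf / n;                  // VF/N

  std::vector<uint64_t> lens;
  // First length: MIN (X, VF/N), i.e. loop_len_15.
  lens.push_back (std::min (total, per_vec));
  // Remaining lengths: X - MIN (X, i * VF/N), transcribed as written.
  for (unsigned i = 1; i < n; ++i)
    lens.push_back (total - std::min (total, i * per_vec));
  return lens;
}
```

With 5 elements remaining, `rgroup_lengths (5, 8, 4)` yields 2, 3, 1, 0. The entries after the first are transcribed exactly as the statements above are written and are not clamped to VF/N; how a consumer treats values larger than VF/N is outside what this sketch shows.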

More comments from Richard Biener:
So it's not actually saturating.  The saturating operation is done by .WHILE_LEN?
I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, vf)),
which ensures that the loop control counter never underflows zero.

I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
the above scheme should also work for single rgroups, no?
As said, it _looks_ like you can progress without .WHILE_LEN and using
.WHILE_LEN is a pure optimization?
Yes, SELECT_VL (WHILE_LEN) is a pure optimization for single-rgroup and allows
the target to choose any length = IN_RANGE (0, min (n, vf)) in each iteration.
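
The exchange above can be sketched as a small C++ model of a length-controlled loop. This is only an illustration under assumptions: `select_vl` here is a hypothetical target policy (it balances the last two iterations, one of many choices the range permits), `run_loop` is a made-up driver, and `vf` is assumed to be nonzero. The property it demonstrates is that, since the chosen length never exceeds the remaining count, the unsigned counter cannot wrap below zero.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical target policy: any result in [0, min (n, vf)] is allowed by
// the definition quoted above.  This one splits the tail evenly when
// vf < n < 2 * vf, and otherwise takes min (n, vf).
uint64_t
select_vl (uint64_t n, uint64_t vf)
{
  if (n > vf && n < 2 * vf)
    return (n + 1) / 2;
  return std::min (n, vf);
}

// Hypothetical driver: runs the loop control and returns the number of
// elements processed; *iterations receives the trip count.
uint64_t
run_loop (uint64_t n, uint64_t vf, unsigned *iterations)
{
  uint64_t remaining = n;
  uint64_t done = 0;
  *iterations = 0;
  while (remaining != 0)
    {
      const uint64_t vl = select_vl (remaining, vf);
      done += vl;       // stands in for processing VL elements and
                        // advancing the address IV by VL
      remaining -= vl;  // vl <= remaining, so this never wraps
      ++*iterations;
    }
  return done;
}
```

For n = 10 and vf = 8 this policy processes 5 + 5 elements in two iterations, whereas a plain MIN-based scheme would process 8 + 2. Both satisfy the stated range, which is why SELECT_VL can be treated as an optimization layered on top of the MIN_EXPR scheme.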

Let me know if I missed something for the V3 patch.

So at a high level this is pretty good. I think there are some improvements we should make in the documentation and comments, but I'm comfortable with most of the implementation details.




diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index cc4a93a8763..99cf0cdbdca 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
  @end smallexample
+@cindex @code{select_vl@var{m}} instruction pattern
+@item @code{select_vl@var{m}}
+Set operand 0 to the number of active elements in vector will be updated value.
This reads rather poorly.  Is this still accurate?

Set operand 0 to the number of active elements in a vector to be updated in a loop iteration based on the total number of elements to be updated, the vectorization factor and vector properties of the target.


+operand 1 is the total elements need to be updated value.
operand 1 is the total elements in the vector to be updated.


+
+The output of this pattern is not only used as IV of loop control counter, but also
+is used as the IV of address calculation with multiply/shift operation. This allow
+us dynamic adjust the number of elements is processed in each iteration of the loop.
This allows dynamic adjustment of the number of elements processed each loop iteration. -- is that still accurate and does it read better?


@@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
     so that we can free them all at once.  */
  static bitmap_obstack loop_renamer_obstack;
-/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
+/* Creates an induction variable with value BASE (+/-) STEP * iteration in LOOP.
+   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
+   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
     It is expected that neither BASE nor STEP are shared with other expressions
     (unless the sharing rules allow this).  Use VAR as a base var_decl for it
     (if NULL, a new temporary will be created).  The increment will occur at
It's been pretty standard to stick with just PLUS_EXPR for this stuff and instead negate the constant to produce the same effect as MINUS_EXPR. Is there a reason we're not continuing that practice? Sorry if you've answered this already -- if you have, you can just point me at the prior discussion and I'll read it.
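
For reference, the equivalence behind the usual practice can be sketched in C++ with unsigned (wrapping) arithmetic. The function names are hypothetical and this says nothing about how convenient the negation is for the steps this patch needs; it only shows that BASE - STEP * iteration and BASE + (-STEP) * iteration agree modulo 2^64.

```cpp
#include <cstdint>

// BASE - STEP * iteration, as a MINUS_EXPR-style IV would compute it.
uint64_t
iv_minus (uint64_t base, uint64_t step, uint64_t iter)
{
  return base - step * iter;
}

// The same value using only addition, with the step negated up front.
// Unsigned negation is modular, so the two forms agree for all inputs.
uint64_t
iv_plus_negated (uint64_t base, uint64_t step, uint64_t iter)
{
  const uint64_t neg_step = 0 - step;
  return base + neg_step * iter;
}
```

With base = 100, step = 8 and iter = 3, both functions return 76, and they also agree when the subtraction wraps (for example base = 5, step = 8, iter = 1).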



diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 44bd5f2c805..d63ded5d4f0 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -385,6 +385,48 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
    return false;
  }
+/* Try to use permutes to define the lens in DEST_RGM using the lens
+   in SRC_RGM, given that the former has twice as many lens as the
+   latter.  Return true on success, adding any new statements to SEQ.  */
I would suggest not using "permute" in this description. When I read permute in the context of vectorization, I think of a vector permute to scramble elements within a vector.

This looks like you're just adjusting how many vector elements you're operating on.

+       {
+         /* For SLP, we can't allow non-VF number of elements to be processed
+            in non-final iteration. We force the number of elements to be
+          processed in each non-final iteration is VF elements. If we allow
+          non-VF elements processing in non-final iteration will make SLP too
+          complicated and produce inferior codegen.
Looks like you may have mixed up spaces and tabs in the above comment. Just a nit, but let's go ahead and get it fixed.

@@ -703,6 +1040,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+  bool use_vl_p = !use_masks_p
+                 && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
+                                                    OPTIMIZE_FOR_SPEED);
When you break a line with a logical operator like this, go ahead and add parentheses and make sure the operator aligns just after the paren, i.e.

bool use_vl_p = (!use_masks_p
                 && direct....


Alternatively, compute the direct_internal_fn_supported_p result into its own boolean and then you don't need as much line wrapping.

In general, don't be afraid to use extra temporaries if doing so improves readability.
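
A sketch of the temporary-variable form, written so that it compiles on its own: the enum, `tree_node`, and the body of `direct_internal_fn_supported_p` below are stand-in stubs (assumptions for illustration only; in GCC the real declarations come from the compiler's own headers), and `compute_use_vl_p` and `select_vl_supported_p` are hypothetical names.

```cpp
// Stand-in declarations so the fragment is self-contained.  These are
// stubs, not GCC's real definitions.
enum internal_fn { IFN_SELECT_VL };
enum optimization_type { OPTIMIZE_FOR_SPEED };
struct tree_node {};
typedef tree_node *tree;

// Stub: pretend the target supports the function for any non-null type.
bool
direct_internal_fn_supported_p (internal_fn, tree type, optimization_type)
{
  return type != nullptr;
}

// Hypothetical wrapper showing the shape of the suggested rewrite.
bool
compute_use_vl_p (bool use_masks_p, tree iv_type)
{
  // Save the query into its own boolean so the condition fits on one line.
  bool select_vl_supported_p
    = direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
                                      OPTIMIZE_FOR_SPEED);
  bool use_vl_p = !use_masks_p && select_vl_supported_p;
  return use_vl_p;
}
```

One difference from the original expression is that the query is now evaluated even when use_masks_p is true. The sketch assumes the query has no side effects, so only readability changes.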





+         else if (loop_lens && loop_lens->length () == 1
+                  && direct_internal_fn_supported_p (
+                    IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
+                    OPTIMIZE_FOR_SPEED)
+                  && memory_access_type != VMAT_INVARIANT)
This looks like a good example of code that would be easier to read if the result of the call to direct_internal_fn_supported_p were saved into a temporary. Similarly for the instance you added in vectorizable_load.


I'd like to get this patch wrapped up soon. But I also want to give both Richards a chance to chime in with their concerns.

Thanks,

Jeff
