Hi Richard,

This is with reference to our discussion at the GNU Tools Cauldron 2015
regarding my talk titled "Improving the effectiveness and generality of GCC
auto-vectorization." We, at Imagination Technologies, have since finalized the
algorithms for the transformations needed to achieve efficient target-aware
reordering-instruction selection. We have implemented a prototype in Python to
demonstrate the capabilities of the algorithm and to verify the claims made in
the presentation.

I am attaching the prototype along with the documented algorithm for review.
We will be starting on the design and implementation of the same in GCC, and
would be glad to receive comments, feedback and suggestions.

- Thanks and regards,
  Sameera Deshpande
Introduction
============
Current methods of auto-vectorization take the vectorization decision through a
generic analysis which is almost target-independent (except for target features
such as vector sizes). The instruction selection mechanism, however, is largely
ad hoc, in the sense that the ideas used for instruction selection on one
target may not carry over seamlessly to another. This has led to a
proliferation of target-specific code in the generic parts of compiler
frameworks such as GCC.

We propose a generic approach to reordering-instruction selection, based on the
insight that effective data reordering for vectorization can be defined in
terms of a small set of well-defined primitive operations. Interestingly, the
same set of operations can be used to describe the reordering instructions of
almost every target we know of. These operations therefore form the basic
building blocks of reorderings for effective vectorization.
Primitive operations
====================
Let V_N denote a vector of N scalar elements. A collection of k such vectors is
denoted V_N x ... x V_N (k times).

We propose the following operations:

- Concatenation: CONCAT_k^N : V_N x ... x V_N --> V_kN. CONCAT_k^N (V_N, ...,
  V_N) concatenates k vectors of size N into a single vector of size kN.

- Split: SPLT_k,i^kN : V_kN --> V_N. Split is the inverse of concatenation.
  SPLT_k,i^kN (V_kN) splits a vector of size kN into k contiguous vectors of
  size N each and returns the ith vector.

- Interleave: ILV_k^N : V_N x ... x V_N --> V_kN. ILV_k^N (V_N, ..., V_N)
  creates a vector of size kN such that the ith element of the output vector is
  the (i/k)th element of the (i%k)th input vector.

- Extract: EXTR_k,i^kN : V_kN --> V_N. Extract is the inverse of interleave.
  EXTR_k,i^kN (V_kN) divides the input vector into N contiguous parts of size k
  and forms the ith output vector by selecting the ith element of each part.

- Permute: PERM^N : V_N x I_M --> V_N. Permute takes a vector of size N and an
  integer permute order of size M. It divides the vector into N/M subvectors of
  size M and permutes each subvector according to the order I_M.
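Since the prototype is written in Python, the semantics of these primitives can
be sketched on plain lists as follows. This is an illustrative reference model
only, not the prototype's actual code:

```python
# Reference semantics of the primitives on plain Python lists.

def concat(*vecs):
    # CONCAT_k^N: append k vectors of size N into one vector of size kN.
    out = []
    for v in vecs:
        out.extend(v)
    return out

def split(v, k, i):
    # SPLT_k,i^kN: cut v into k contiguous parts of size N; return the ith.
    n = len(v) // k
    return v[i * n:(i + 1) * n]

def ilv(*vecs):
    # ILV_k^N: element j of the output is element j//k of input vector j%k.
    k, n = len(vecs), len(vecs[0])
    return [vecs[j % k][j // k] for j in range(k * n)]

def extr(v, k, i):
    # EXTR_k,i^kN: inverse of ILV; every kth element starting at position i.
    return v[i::k]

def perm(v, order):
    # PERM^N: permute each size-M subvector of v by the integer order I_M.
    m = len(order)
    out = []
    for base in range(0, len(v), m):
        out.extend(v[base + p] for p in order)
    return out

# ILV and EXTR are inverses, as are CONCAT and SPLT:
assert extr(ilv([0, 2], [1, 3]), 2, 1) == [1, 3]
assert split(concat([1, 2], [3, 4]), 2, 0) == [1, 2]
```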
Phase structure of the algorithm
================================
The algorithm for target-aware reordering-instruction selection for efficient
auto-vectorization is divided into the following phases:

Phase 1: Create tree for each loop per destination - Implemented by Sameera Deshpande
Phase 2: k-arity reduction/promotion - Implemented by Sameera Deshpande
Phase 3: Top-down distribution of p-node subtree - Implemented by Sameera Deshpande
Phase 4: Optimizations - Implemented by Sameera Deshpande
Phase 5: Bottom-up commoning of p-node subtree - Implemented by Sameera Deshpande
         - Cost computation of p-tree - Implemented by Sameera Deshpande
Phase 6: Vector-size reduction - Implemented by Sameera Deshpande
Phase 4 (again): Optimization - Implemented by Sameera Deshpande
Phase 7: Cost computation - store min_cost_tree - Implemented by Prachi Godbole
Phase 8: Target-specific code generation for min_cost_tree - Implemented by Prachi Godbole
Input loop
==========
To enable auto-vectorization using this algorithm, the loop structure should be
as follows: a loop with k statements reading from sources S_1, S_2, ..., S_m
and writing to destination D, where stmt_j is of the form

  D[k * i + d_j] = op_j (S_1[k_1 * i + c_j1], S_2[k_2 * i + c_j2], ...,
                         S_m[k_m * i + c_jm])

such that
- All statements stmt_1, ..., stmt_k in the loop body are independent and
  vectorizable for a target with vector size VS.
- Iteration count N > VS.
- The induction variable i of the loop is not altered by stmt_1, stmt_2, ...,
  stmt_k.
- All of S_1, ..., S_m and D are of the same type.
- The indices of all S and D have the same multiplicative factor per array. If
  they differ, normalize by taking the LCM and duplicate the statements
  appropriately, adjusting the offsets.
- c_jp < k_p.

FOR (i = 0 ... N - 1)
{
  stmt_1: D[k * i + d_1] = op_1 (S_1[k_1 * i + c_11], S_2[k_2 * i + c_12],
                                 ..., S_m[k_m * i + c_1m])
  stmt_2: D[k * i + d_2] = op_2 (S_1[k_1 * i + c_21], S_2[k_2 * i + c_22],
                                 ..., S_m[k_m * i + c_2m])
  ...
  stmt_k: D[k * i + d_k] = op_k (S_1[k_1 * i + c_k1], S_2[k_2 * i + c_k2],
                                 ..., S_m[k_m * i + c_km])
}
For ease of implementation, the loop information is given as input in a
language defined as follows:
Input language
--------------
loop_begin     -> LOOP '(' NUMBER ')' dest_stmt_list
dest_stmt_list -> dest_stmt_list dest_stmt
                | dest_stmt
dest_stmt      -> VARIABLE '[' NUMBER ',' NUMBER ']' '=' cop_tree
cop_tree       -> VARIABLE '[' NUMBER ',' NUMBER ']'
                | c-op '(' cop_tree_list ')'
cop_tree_list  -> cop_tree_list ',' cop_tree
                | cop_tree
c-op           -> Any order-preserving compute operation.
So, the loop above can be represented as follows in the input language:

LOOP (N)
  D[k, d_1] = op_1 (S_1[k_1, c_11], S_2[k_2, c_12], ..., S_m[k_m, c_1m])
  D[k, d_2] = op_2 (S_1[k_1, c_21], S_2[k_2, c_22], ..., S_m[k_m, c_2m])
  ...
  D[k, d_k] = op_k (S_1[k_1, c_k1], S_2[k_2, c_k2], ..., S_m[k_m, c_km])
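For instance, an add/subtract loop FOR(i = 0 ... N-1) { D[2i] = S_1[i] +
S_2[i]; D[2i+1] = S_1[i] - S_2[i]; } would read as follows, where the
operation names plus and minus are purely illustrative:

```
LOOP (N)
  D[2, 0] = plus (S_1[1, 0], S_2[1, 0])
  D[2, 1] = minus (S_1[1, 0], S_2[1, 0])
```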
Algorithm
=========
Phase 1: Create AST for each loop per destination D.
---------------------------------------------------
Input:
- - -
Program defined using the input language.
Procedure:
- - - - -
- Identify the loop count N.
- Identify the respective multiplicative factors k for the destination and
  source arrays.
- Create an ILV_k^N node of arity k which writes to destination array D. Its
  vector size equals the loop count N.
- Create a vectorized AST for each statement within the loop. Each operation
  op_j in statement j is vectorized to vop_j depending upon the target
  architecture. If statement j accesses memory S_p[k_p * i + c_jp], create an
  EXTR_k_p,c_jp node on source S_p. If statement j writes to location
  D[k * i + d_j] in each iteration, attach its AST as the d_jth operand of
  ILV_k^N.
- Repeat for each destination array D.
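As a sketch, the per-destination construction of phase 1 might look like the
following in Python. The Node class and its fields are hypothetical stand-ins,
not the prototype's actual data structures:

```python
# Illustrative sketch of phase 1 for one destination D (hypothetical Node).

class Node:
    def __init__(self, op, children=(), vec_size=None, index=None, name=None):
        self.op = op                  # 'ILV', 'EXTR', 'MEM', or a vop name
        self.children = list(children)
        self.vec_size = vec_size      # vector size carried by this node
        self.index = index            # i of EXTR_k,i
        self.name = name              # source array name for MEM leaves

def build_dest_tree(n, k, stmts):
    # stmts[j] = (op_j, [(S_p, k_p, c_jp), ...]) describes statement j,
    # which writes D[k*i + j].
    root = Node('ILV', vec_size=n)    # ILV_k^N node writing to D
    for op_j, accesses in stmts:
        operands = [Node('EXTR',
                         [Node('MEM', vec_size=n * k_p, name=src)],
                         vec_size=n, index=c_jp)
                    for src, k_p, c_jp in accesses]
        root.children.append(Node(op_j, operands, vec_size=n))
    return root
```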
Output:
- - - -
Output of phase 1 is an AST which can be defined using the following grammar:
tree -> ptree | ctree
| ITER (tree)
ptree -> p-op_k (tree_1, tree_2, ... , tree_k)
ctree -> c-op_k (tree_1, tree_2, ... , tree_k) | MEM (src-name)
p-op_k -> Permute operation of arity k.
c-op_k -> Compute operation of arity k.
This phase is implemented in
- scanner.lex - Lexical analyzer to scan the input
- parser.y - Syntax analyzer and AST generator
- header.h - Header file
Commands to execute:
To build phase 1:
  flex scanner.lex
  bison parser.y -d -v
  gcc parser.tab.c lex.yy.c -g -o parser.out
To run phase 1:
  ./parser.out < <inputFile> > dmp.txt
Phase 2: k-arity reduction/promotion.
------------------------------------
Input:
- - -
AST constructed from dmp.txt rooted at root_node
Procedure:
- - - - -
IF root_node is a p-op:
- Let m be the machine arity.
- Let k be the arity of root_node.
- IF m < k: arity reduction
  - k % m must be 0, otherwise auto-vectorization is not possible.
  - IF the p-op of root_node is EXTR:

      EXTR_k,i^N (T)
        =>
      EXTR_m,(i/(k/m))%m^(Nm/k) (EXTR_k/m,i%(k/m)^N (T))
                                      ^--- recurse on the newly created
                                           EXTR node
  - ELSE IF the p-op of root_node is ILV:

      ILV_k^N (T1, T2, ..., Tk)
        =>
      ILV_m^(Nk/m) (ILV_k/m^N (T1, Tm+1, ..., Tk-m+1),
                    ILV_k/m^N (T2, Tm+2, ..., Tk-m+2),
                    ...,
                    ILV_k/m^N (Tm, T2m, ..., Tk))
                     ^--- recurse on all newly created ILV nodes
- ELSE IF m == k:
  - Already in the desired form; invoke phase 2 recursively on
    child(root_node).
- ELSE IF m > k: arity promotion
  - m % k must be 0, otherwise auto-vectorization is not possible.
  - The algorithm for ILV-promotion introduces new EXTR nodes, whereas the
    algorithm for EXTR-promotion introduces new ILV nodes. To avoid an
    infinite loop, only one of the two can be implemented. As phase 2 is a
    top-down algorithm, ILV-promotion is implemented, with the expectation
    that propagating the newly created EXTR nodes creates opportunities for
    arity reduction. If that cannot be done, auto-vectorization is not
    possible. So,
  - IF the p-op of root_node is EXTR:
    - Do not perform auto-vectorization.
  - ELSE IF the p-op of root_node is ILV:
      ILV_k^N (T1, T2, ..., Tk)
        =>
      ILV_m^(Nk/m) (EXTR_m/k,0^N (T1), ..., EXTR_m/k,0^N (Tk),
                    EXTR_m/k,1^N (T1), ..., EXTR_m/k,1^N (Tk),
                    ...,
                    EXTR_m/k,m/k-1^N (T1), ..., EXTR_m/k,m/k-1^N (Tk))
ELSE  # root_node is a c-op, ITER, or MEM node
- Invoke phase 2 recursively on child(root_node).
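Both reduction rewrites can be checked numerically against list-based
reference semantics of ILV and EXTR; the following self-contained Python
sketch does so for a small example:

```python
# Numeric check of the phase 2 reduction rewrites (list-based sketch).

def ilv(*vecs):
    # Element j of the output comes from element j//k of vector j%k.
    k, n = len(vecs), len(vecs[0])
    return [vecs[j % k][j // k] for j in range(k * n)]

def extr(v, k, i):
    # Inverse of ILV: every kth element starting at position i.
    return v[i::k]

def reduce_extr(v, k, i, m):
    # EXTR_k,i = EXTR_m,i/(k/m) applied after EXTR_k/m,i%(k/m)  (k % m == 0)
    assert k % m == 0
    return extr(extr(v, k // m, i % (k // m)), m, i // (k // m))

def reduce_ilv(vecs, m):
    # ILV_k(T1..Tk) = ILV_m(ILV_k/m(T1, Tm+1, ...), ..., ILV_k/m(Tm, ..., Tk))
    k = len(vecs)
    assert k % m == 0
    return ilv(*(ilv(*vecs[j::m]) for j in range(m)))

v = list(range(16))
assert reduce_extr(v, 4, 3, 2) == extr(v, 4, 3)   # arity 4 done as two arity-2
vecs = [[0, 4], [1, 5], [2, 6], [3, 7]]
assert reduce_ilv(vecs, 2) == ilv(*vecs)
```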
Output:
- - - -
Output of phase 2 is an AST in which the arity of every p-node is adjusted to
the arity of the target's reordering instructions.
This phase is implemented in Transformations.py; the driver function is
k_arity_reduction_promotion.
Phase 3: Top-down distribution of p-node subtree
------------------------------------------------
Input:
- - -
Input of phase 3 is AST rooted at root_node with p-node arity adjusted to arity
of reordering-instructions in target.
Procedure:
- - - - -
This phase accumulates all reordering operations on the source operands, which
makes effective instruction selection possible.
- Select a subtree T_N of the input expression tree such that all nodes within
  T_N are p-ops and all N children of the subtree are the same c-op node, each
  of arity k:

        T_N (all p-ops)
       /      |      \
      /       |       \
   COP_k     ...     COP_k
    /|\               /|\
   / | \             / | \
  C_0...C_k-1    C_kN-k...C_kN-1

- If such a subtree T_N can be found, construct a tree rooted at COP_k and
  distribute a copy of T_N onto each child of COP_k:

       COP_k
      /  |  \
     /   |   \
   T_N   ...  T_N
   /|\         /|\
  C_0,C_k,     C_k-1,C_2k-1,
  ...,C_kN-k   ...,C_kN-1
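The rewrite is sound because a pure data reordering commutes with any lanewise
(order-preserving) compute operation; a minimal numeric check in Python:

```python
# The p-op/c-op exchange behind phase 3, checked on lists.

def ilv(*vecs):
    k, n = len(vecs), len(vecs[0])
    return [vecs[j % k][j // k] for j in range(k * n)]

def vadd(a, b):
    # Stand-in for any order-preserving compute operation.
    return [x + y for x, y in zip(a, b)]

A, B, C, D = [1, 2], [10, 20], [3, 4], [30, 40]
# Before: the p-op sits above identical c-op nodes.
before = ilv(vadd(A, B), vadd(C, D))
# After: the c-op is hoisted to the root, the p-op copied onto its operands.
after = vadd(ilv(A, C), ilv(B, D))
assert before == after == [11, 33, 22, 44]
```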
Output:
- - - -
AST rooted at the newly created c-op node, or the original tree if no such
subtree exists.
This phase is implemented in Transformations.py and can be invoked using
top_down_distribution.
Phase 4: Optimization
---------------------
Input:
- - -
AST rooted at root_node with accumulated reordering operations whenever
possible.
Procedure:
- - - - -
The distribution of reordering operations (p-ops) can create opportunities for
redundancy elimination. Some identities that can be eliminated are:
1. ILV_m (EXTR_m,0 (S), EXTR_m,1 (S), ..., EXTR_m,m-1 (S)) => S
2. EXTR_m,x (ILV_m (S_1, S_2, ..., S_m)) => S_x
3. CONCAT_m (SPLT_m,0 (S), SPLT_m,1 (S), ..., SPLT_m,m-1 (S)) => S
4. SPLT_m,x (CONCAT_m (S_1, S_2, ..., S_m)) => S_x
This phase processes tree nodes in post order, so that maximum redundancy
elimination can be achieved in a single pass over the tree.
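A post-order pass applying these identities might be sketched as follows; the
tuple encoding (op, index, children...) is a hypothetical simplification that
models only p-op nodes and source leaves:

```python
# Post-order sketch of the phase 4 identity eliminations.

def simplify(node):
    # Leaves (source names) are plain strings; p-op nodes are tuples of the
    # form (op, index, child, ...). Compute ops are omitted from this sketch.
    if not isinstance(node, tuple):
        return node
    op, idx = node[0], node[1]
    kids = [simplify(c) for c in node[2:]]       # post order: children first
    # Identities 2 and 4: EXTR_m,x(ILV_m(S1..Sm)) => Sx, SPLT/CONCAT likewise.
    if op in ('EXTR', 'SPLT') and len(kids) == 1 and isinstance(kids[0], tuple):
        child = kids[0]
        if (op, child[0]) in (('EXTR', 'ILV'), ('SPLT', 'CONCAT')):
            return child[2 + idx]
    # Identities 1 and 3: ILV_m(EXTR_m,0(S), ..., EXTR_m,m-1(S)) => S, etc.
    if op in ('ILV', 'CONCAT') and kids and isinstance(kids[0], tuple):
        inv = 'EXTR' if op == 'ILV' else 'SPLT'
        if all(isinstance(c, tuple) and c[0] == inv and c[1] == j
               and c[2:] == kids[0][2:] for j, c in enumerate(kids)):
            return kids[0][2]
    return (op, idx) + tuple(kids)
```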
Output:
- - - -
AST rooted at root_node with identity elimination performed.
This phase is implemented in Transformations.py and can be invoked using
optimizations.
Phase 5: Bottom-up commoning of p-node subtrees
-----------------------------------------------
Input:
- - -
Optimized tree rooted at root_node.
Procedure:
- - - - -
Each p-tree represents a set of permute instructions to be generated on the
sources of the p-tree. If a single reordering instruction can be generated for
the tree rooted at a c-node, commoning out a subpart of the child p-trees is
always more expensive; commoning out a sub-p-tree pays off only when a single
instruction cannot be generated for each child of the c-node.
For a k-ary c-tree, this commoning gives better performance if
- each child p-tree has cost > 1, and
- each child p-tree has a common sub-p-tree of cost c, so that commoning it
  across all children eliminates k * c instructions: the commoned sub-p-tree
  then operates on the combined result, and the k * c instructions are
  replaced by just c instructions.
- Select a common subtree T_N from all children of the root COP_k node, such
  that all nodes within T_N are p-ops and T_N is not a subtree of a single
  instruction match:

       COP_k
      /  |  \
     /   |   \
   T_N   ...  T_N
   /|\         /|\
  C_0,C_k,     C_k-1,C_2k-1,
  ...,C_kN-k   ...,C_kN-1
- If such a subtree T_N can be found, common it out of all children of COP_k
  and construct a tree rooted at T_N, with each of its children a tree rooted
  at COP_k performing the c-op on the appropriate leaves of the original tree:

        T_N (all p-ops)
       /      |      \
      /       |       \
   COP_k     ...     COP_k
    /|\               /|\
   / | \             / | \
  C_0...C_k-1    C_kN-k...C_kN-1
Output:
- - - -
A tree rooted at the commoned-out p-node, or the original tree if no common
subtree exists.
This phase is implemented in Transformations.py and can be invoked using
bottom_up_commoning.
Phase 6: Vector-size reduction
------------------------------
Input:
- - -
AST rooted at root_node
Procedure:
- - - - -
- Take the GCD of the vec_size fields of all nodes in the tree.
- IF the GCD is divisible by the target's vector size:
  - create an ITER node with iteration count GCD / target_vec_size.
- ELSE:
  - do not vectorize.
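The decision can be sketched in a few lines of Python, assuming each node
carries a vec_size field (the function name is illustrative):

```python
# Sketch of the phase 6 decision.
from functools import reduce
from math import gcd

def vs_reduction_factor(vec_sizes, target_vs):
    # GCD of all vec_size fields in the tree; if the target vector size
    # divides it, wrap the tree in an ITER node of that many repetitions.
    g = reduce(gcd, vec_sizes)
    if g % target_vs == 0:
        return g // target_vs     # iteration count for the ITER node
    return None                   # do not vectorize
```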
Output:
- - - -
AST rooted at ITER node with adjusted vector size in each node.
This phase is implemented in Transformations.py and can be invoked using
vs_reduction.
Phase 7: Cost Computation
-------------------------
This phase covers the entire expression tree with tiles of target
instructions. For each node of the tree, it records the selected instruction
along with its cost in a tuple <instruction, cost for input VS, desired input
VS>. For each target architecture, the reordering instructions of the target
are first expressed in terms of the primitive operations. A finite automaton
is then constructed by creating a transition per p-op node of each
instruction; the final state of each reordering instruction populates the
tuple. Along with the state transitions for each target instruction, we also
generate states in the automaton that match a single ILV or EXTR node, whose
final states give the tuple for a combination of instructions or a
combination of inputs respectively.
Input:
- - -
AST for cost computation rooted at root_node.
Procedure for instruction selection:
- - - - - - - - - - - - - - - - - -
- Traverse the tree top-down, and for each subtree:
  - Transition over the generated automaton.
  - If a final state is reached, record the populated cost tuple at the root
    of the matched subtree.
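The matching step can be illustrated with a plain lookup table standing in for
the generated automaton; the instruction names (vzip, vuzp) and costs below
are invented for illustration only, not a real target description:

```python
# Sketch of the phase 7 annotation with a table in place of the automaton.

INSN_TABLE = {
    # (p-op, arity) -> (instruction, cost for input VS, desired input VS)
    ('ILV', 2): ('vzip', 1, 4),
    ('EXTR', 2): ('vuzp', 1, 4),
}

def annotate(node):
    # node is (op, arity, child, ...); leaves are plain strings.
    op, arity = node[0], node[1]
    kids = tuple(annotate(c) if isinstance(c, tuple) else c for c in node[2:])
    # Reaching a "final state" in the automaton == a hit in the table here.
    return {'op': op,
            'cost_tuple': INSN_TABLE.get((op, arity)),
            'children': kids}
```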
Output:
- - - -
Tree with cost-annotated nodes.
This phase is implemented in CostComputation.py and can be invoked using
compute_cost.
Phase 8: Code generation
------------------------
Input:
- - -
AST rooted at root_node with cost-tuple annotations at each node.
Procedure:
- - - - -
- Create a DAG from the AST.
- Scan the DAG bottom-up so that code for all children is generated before
  their parent node.
- Depending upon the vector size of the tree node and the input vector size
  expected by the selected instruction, vec_size / <desired input VS> copies
  of the instruction are generated per node.
Output:
- - - -
Assembly code or intermediate code.
This phase is implemented in CostComputation.py and can be invoked using
generate_code.
TODOs
=====
- Implementation of ILV <-> CONCAT and EXTR <-> SPLT conversions.
- Implementation of PERM for target instructions of type vec_perm.
Currently, this algorithm can optimize loops which adhere to the input loop
structure described earlier. However, some optimizations and loop
transformations can help transform unacceptable loops into the acceptable
structure:
- Loop interchange - for contiguous element access in multi-dimensional arrays.
- Loop peeling - to get loop count % target VS = 0.
- Loop fusion - to avoid gaps in destinations as well as to reuse sources
  better.
- Loop fission - to separate different destinations.
final_prototype.tgz
Description: application/compressed-tar