Hi Bin,

Thanks for posting the patch. I haven't looked in detail yet but have
couple of quick questions.

1. Shouldn’t the run time alias check for versioning happen only when
vectorisation is enabled? You seems to be using the
IFN_LOOP_DIST_ALIAS when vectoring but seems to be versioning
otherwise too.

2. As I understand, loop distribution will be mostly beneficial when
enabling transformations succeeds.  Like resulting loop being
vectorized.  I am wondering if we should only version when the loop’s
control flow is simple that it can be vectorizable. I know that you
are adding the internal function for this to some extend but for some
cases we should be able to say while in loop distribution itself that
the control flow will not result in loop being vectorized.

Btw, did you run Spec2006 with this? Any notable changes ?

Thanks,
Kugan




On 2 June 2017 at 21:51, Bin Cheng <bin.ch...@arm.com> wrote:
> Hi,
> This is the main patch of the change.  It improves loop distribution by 
> versioning loop under
> runtime alias check conditions, as well as better partition fusion.  As 
> described in comments,
> the patch basically implements distribution in the following steps:
>
>      1) Seed partitions with specific type statements.  For now we support
>         two types seed statements: statement defining variable used outside
>         of loop; statement storing to memory.
>      2) Build reduced dependence graph (RDG) for loop to be distributed.
>         The vertices (RDG:V) model all statements in the loop and the edges
>         (RDG:E) model flow and control dependences between statements.
>      3) Apart from RDG, compute data dependences between memory references.
>      4) Starting from seed statement, build up partition by adding depended
>         statements according to RDG's dependence information.  Partition is
>         classified as parallel type if it can be executed parallelly; or as
>         sequential type if it can't.  Parallel type partition is further
>         classified as different builtin kinds if it can be implemented as
>         builtin function calls.
>      5) Build partition dependence graph (PG) based on data dependences.
>         The vertices (PG:V) model all partitions and the edges (PG:E) model
>         all data dependences between every partitions pair.  In general,
>         data dependence is either compilation time known or unknown.  In C
>         family languages, there exists quite amount compilation time unknown
>         dependences because of possible alias relation of data references.
>         We categorize PG's edge to two types: "true" edge that represents
>         compilation time known data dependences; "alias" edge for all other
>         data dependences.
>      6) Traverse subgraph of PG as if all "alias" edges don't exist.  Merge
>         partitions in each strong connected commponent (SCC) correspondingly.
>         Build new PG for merged partitions.
>      7) Traverse PG again and this time with both "true" and "alias" edges
>         included.  We try to break SCCs by removing some edges.  Because
>         SCCs by "true" edges are all fused in step 6), we can break SCCs
>         by removing some "alias" edges.  It's NP-hard to choose optimal
>         edge set, fortunately simple approximation is good enough for us
>         given the small problem scale.
>      8) Collect all data dependences of the removed "alias" edges.  Create
>         runtime alias checks for collected data dependences.
>      9) Version loop under the condition of runtime alias checks.  Given
>         loop distribution generally introduces additional overhead, it is
>         only useful if vectorization is achieved in distributed loop.  We
>         version loop with internal function call IFN_LOOP_DIST_ALIAS.  If
>         no distributed loop can be vectorized, we simply remove distributed
>         loops and recover to the original one.
>
> Also, there are some more to improve in the future (which shouldn't be 
> difficult):
>    TODO:
>      1) We only distribute innermost loops now.  This pass should handle loop
>         nests in the future.
>      2) We only fuse partitions in SCC now.  A better fusion algorithm is
>         desired to minimize loop overhead, maximize parallelism and maximize
>
> This patch also fixes couple of latent bugs in the original implementation.
>
> After this change, kernel loop in hmmer can be distributed and vectorized as 
> a result.
> This gives obvious performance improvement.  There is still inefficient code 
> generation
> issue which I will try to fix in loop split.  Apart from this, the next 
> opportunity in hmmer
> is to eliminate number of dead stores under proper alias information.
> Bootstrap and test at O2/O3 on x86_64 and AArch64.  is it OK?
>
> Thanks,
> bin
> 2017-05-31  Bin Cheng  <bin.ch...@arm.com>
>
>         * cfgloop.h (struct loop): New field ldist_alias_id.
>         * cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for
>         new internal function.
>         * internal-fn.c (expand_LOOP_DIST_ALIAS): New function.
>         * internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function.
>         * tree-loop-distribution.c: Add general explanantion on the pass.
>         Include header file.
>         (struct ddr_entry, struct ddr_entry_hasher): New structs.
>         (ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions.
>         (bb_top_order_index, bb_top_order_index_size): New static vars.
>         (bb_top_order_cmp): New function.
>         (stmts_from_loop): Get basic blocks in topological order.  Don't
>         free data references.
>         (build_rdg): New parameter pointing to vector of data references.
>         Store data references in it.
>         (enum partition_type): New enum.
>         (enum partition_kind, struct partition): Add comments.  New fields.
>         (partition_alloc, partition_free): Handle new fields of partition.
>         (enum fuse_type, fuse_message): New enum and varaible.
>         (partition_merge_into): New parameter.  Update implementation wrto
>         new fields of partition.
>         (generate_loops_for_partition): Set ldist_alias_id information.
>         (record_ddr, get_ddr, possible_data_dep_cycle_p): New functions.
>         (build_rdg_partition_for_vertex): New parameter.  Compute type info
>         for partition.
>         (classify_partition): New parameter.  Only classify partition as
>         reduction if the reduction is not included by all partitions.
>         Retrieve cached ddr, rather than compute it on the fly.
>         (ref_base_address): Delete.
>         (similar_memory_accesses): Rename to ...
>         (share_memory_accesses): ... this.
>         (rdg_build_partitions): New parameter and update uses.
>         (pg_add_dependence_edges): New parameter.  Store ddr in parameter
>         vector if it could be resolved by runtime alias check.
>         (rdg_compute_data_dependence): New function.
>         (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
>         (init_partition_graph_vertices, add_partition_graph_edge): New.
>         (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
>         (free_partition_graph_vdata, build_partition_graph): New.
>         (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
>         (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
>         (data_ref_segment_size, latch_dominated_by_data_ref): New.
>         (compute_alias_check_pairs, version_loop_by_alias_check): New.
>         (version_for_distribution_p, finalize_partitions): New.
>         (distribute_loop): Update uses.  Break SCC and version loop
>         with runtime alias checks.
>         (pass_loop_distribution::execute): Compute topological order for
>         basic blocks.  Update uses.
>         * tree-vectorizer.c (vect_loop_dist_alias_call): New.
>         (fold_loop_dist_alias_call): New.
>         (vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on
>         successful vectorization or not.
>
> gcc/testsuite/ChangeLog
> 2017-05-31  Bin Cheng  <bin.ch...@arm.com>
>
>         * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
>         * gcc.dg/tree-ssa/ldist-12.c: Ditto.
>         * gcc.dg/tree-ssa/ldist-13.c: Ditto.
>         * gcc.dg/tree-ssa/ldist-14.c: Ditto.

Reply via email to