The following patch adds a heuristic to prefer store/load-lanes over SLP when vectorizing. Compared to the variant attached to the PR, I made the STMT_VINFO_STRIDED_P behavior explicit (matching what you've tested).
It's a heuristic that may end up vectorizing fewer loops, or loops in a less optimal way, thus I'll wait for your ok (it's essentially ARM specific).

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Ok? It will require a bunch of vectorizer tests to be adjusted for ARM, I think.

Thanks,
Richard.

2015-12-16  Richard Biener  <rguent...@suse.de>

	PR tree-optimization/68707
	PR tree-optimization/67323
	* tree-vect-slp.c (vect_analyze_slp_instance): Drop SLP
	instances if they can be vectorized using load/store-lane
	instructions.

Index: gcc/tree-vect-slp.c
===================================================================
*** gcc/tree-vect-slp.c	(revision 231673)
--- gcc/tree-vect-slp.c	(working copy)
*************** vect_analyze_slp_instance (vec_info *vin
*** 1808,1813 ****
--- 1802,1836 ----
 	}
     }
 
+   /* If the loads and stores can be handled with load/store-lane
+      instructions do not generate this SLP instance.  */
+   if (is_a <loop_vec_info> (vinfo)
+       && loads_permuted
+       && dr && vect_store_lanes_supported (vectype, group_size))
+     {
+       slp_tree load_node;
+       FOR_EACH_VEC_ELT (loads, i, load_node)
+ 	{
+ 	  gimple *first_stmt = GROUP_FIRST_ELEMENT
+ 	      (vinfo_for_stmt (SLP_TREE_SCALAR_STMTS (load_node)[0]));
+ 	  stmt_vec_info stmt_vinfo = vinfo_for_stmt (first_stmt);
+ 	  if (! STMT_VINFO_STRIDED_P (stmt_vinfo)
+ 	      && ! vect_load_lanes_supported
+ 		    (STMT_VINFO_VECTYPE (stmt_vinfo),
+ 		     GROUP_SIZE (stmt_vinfo)))
+ 	    break;
+ 	}
+       if (i == loads.length ())
+ 	{
+ 	  if (dump_enabled_p ())
+ 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+ 			     "Built SLP cancelled: can use "
+ 			     "load/store-lanes\n");
+ 	  vect_free_slp_instance (new_instance);
+ 	  return false;
+ 	}
+     }
+ 
    vinfo->slp_instances.safe_push (new_instance);
 
    if (dump_enabled_p ())