On 6 July 2016 at 22:25, Andrew Pinski <pins...@gmail.com> wrote: > On Wed, Jul 6, 2016 at 5:00 AM, Prathamesh Kulkarni > <prathamesh.kulka...@linaro.org> wrote: >> On 4 July 2016 at 13:51, Andrew Pinski <pins...@gmail.com> wrote: >>> On Mon, Jul 4, 2016 at 12:58 AM, Prathamesh Kulkarni >>> <prathamesh.kulka...@linaro.org> wrote: >>>> Hi, >>>> I have attached a "quick and dirty" prototype patch (var-partition-1.diff), >>>> that attempts to partition variables to reduce number of >>>> external references and to increase usage of section-anchors >>>> to CSE address computation of global variables. >>>> >>>> We could put a variable in a partition that has max references for it, >>>> however it doesn't lend itself directly to section anchor optimization. >>>> For instance if a partition has max references for variables 'a' and 'b', >>>> but no function in that partition references both 'a', and 'b' then AFAIU >>>> it doesn't make any difference from section anchors perspective to have >>>> them >>>> in same partition. >>>> >>>> The patch tries to assign a set of variables (>= 2) >>>> to a partition whose functions have maximum references for that set. >>>> Functions within the partition that reference the variables >>>> in the set can take advantage of section-anchors. Functions >>>> referencing the variables in the set outside the partition >>>> would need to load them as external references (using movw/movt), >>>> however since we are placing the set in partition that has maximal >>>> references for it, number of external references should be overall >>>> reduced. >>>> >>>> Partitioning is gated by -flto-var-partition and enabled >>>> only for arm and aarch64. >>> >>> Why only for arm and aarch64? Shouldn't it be enabled for all section >>> anchor targets? >> AFAIK the only targets supporting section anchors are arm, aarch64 and >> powerpc. >> I didn't enable it for ppc64 because I am not sure how much profitable >> it is for that target. >> Honza mentioned to me some time back that effect of partitioning on >> powerpc was nearly zero. > > > No MIPS has section anchors enabled too. Plus MIPS will benefit the > same way as AARCH64 and ARM. PowerPC32 would too. > > I don't think it is correct to enable it only for arm and aarch64. Thanks, I updated the patch to remove -flto-var-partition and gated the partition currently on target_supports_section_anchors_p() (although it doesn't test if -fsection-anchors is passed). Um I am not able to see where mips has section anchors enabled ? mips.c does not seem to override min_anchor_offset and max_anchor_offset hooks. Both these hooks have values 0 for default, and target_supports_section_anchors_p() returns false if both these hooks have value 0.
Thanks, Prathamesh > > Thanks, > Andrew Pinski > >> >> Thanks, >> Prathamesh >>> >>> Thanks, >>> Andrew >>> >>>> As per previous discussion [1], I haven't >>>> touched function partitioning. Does this approach look ok >>>> especially regarding correctness ? >>>> So far, I have cross-tested patch on arm*-*-*, aarch64*-*-*. >>>> >>>> I haven't yet managed to benchmark the patch. >>>> As a cheap measurement, I tried to measure number of external >>>> references with and without patch by writing a small ipa pass >>>> which is run during ltrans and simply walks over varpool nodes >>>> and counts number of varpool_nodes for which DECL_EXTERNAL (vnode->decl) >>>> is true >>>> and vnode->definition is 0. Is that sufficient condition to determine >>>> if variable is externally defined ? I have attached the pass >>>> (count-external-refs.diff) >>>> and the comparison done with it for for SPEC2000 [2]. The entries >>>> in "before" and "after" column contain summation of number of >>>> external refs (total_count) across all partitions before and after applying >>>> the patch. Does the comparison hold any merit ? >>>> I was wondering if we could we use a better way for >>>> measuring statically the effects of variable partitioning ? >>>> I hope also to get done with benchmarking soon. >>>> >>>> I have not yet figured out how to integrate it with existing cost metrics >>>> for >>>> balanced partitioning, I am looking into that. >>>> I would be grateful for suggestions on the patch. >>>> >>>> [1] https://gcc.gnu.org/ml/gcc/2016-04/msg00090.html >>>> >>>> [2] SPEC2000 comparison: >>>> https://docs.google.com/spreadsheets/d/1xnszyw04ksoyBspmCVYesq6KARLw-PA2n3T4aoaKdYw/edit?usp=sharing
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c index 453343a..09b525e 100644 --- a/gcc/lto/lto-partition.c +++ b/gcc/lto/lto-partition.c @@ -34,6 +34,12 @@ along with GCC; see the file COPYING3. If not see #include "ipa-prop.h" #include "ipa-inline.h" #include "lto-partition.h" +#include "toplev.h" /* for target_supports_section_anchors_p() */ +#include <map> +#include <set> +#include <algorithm> +#include <vector> +#include <algorithm> vec<ltrans_partition> ltrans_partitions; @@ -407,6 +413,274 @@ add_sorted_nodes (vec<symtab_node *> &next_nodes, ltrans_partition partition) add_symbol_to_partition (partition, node); } +/* FIXME: Currently I don't care to compute power set if set has more than 4 vars, which + I chose as an arbitrary upper limit and should be analyzed with experimenting. + Or maybe fall back to computing power-set for n in general way, but that can be + very expensive if n is large. */ + +static void +compute_powerset (std::vector<varpool_node *>& set, std::vector< std::vector< varpool_node * > >& pset) +{ + if (set.size () > 4) + { + pset.push_back (set); + return; + } + + std::vector<varpool_node *> v; + +#define add_2_elems(x, y) \ + v.clear (); \ + v.push_back (set[x]); \ + v.push_back (set[y]); \ + +#define add_2_elems_to_pset(x, y) \ + add_2_elems(x, y) \ + pset.push_back (v); + +#define add_3_elems(x, y, z) \ + add_2_elems(x, y) \ + v.push_back (set[z]); \ + +#define add_3_elems_to_pset(x, y, z) \ + add_3_elems(x, y, z) \ + pset.push_back (v); + +#define add_4_elems(x, y, z, w) \ + add_3_elems(x, y, z) \ + v.push_back (set[z]); + +#define add_4_elems_to_pset(x, y, z, w) \ + add_4_elems(x, y, z, w) \ + pset.push_back (v); + + switch (set.size ()) + { + case 4: + add_2_elems_to_pset(0, 3) + add_2_elems_to_pset(1, 3) + add_2_elems_to_pset(2, 3) + add_3_elems_to_pset(0, 1, 3) + add_3_elems_to_pset(1, 2, 3) + add_4_elems_to_pset(0, 1, 2, 3) + /* fallthru. */ + + case 3: + add_2_elems_to_pset(0, 2) + add_2_elems_to_pset(1, 2) + add_3_elems_to_pset(0, 1, 2) + /* fallthru. */ + + case 2: + add_2_elems_to_pset(0, 1) + } + +} + +struct global_map_value { + unsigned partition_index; + unsigned count; +}; + +typedef std::pair< std::vector< varpool_node * >, global_map_value > global_map_pair; +typedef std::map< std::vector< varpool_node * >, global_map_value > global_map_t; + +static void +dump_global_pair (global_map_t::iterator it) +{ + const std::vector< varpool_node * >& v = it->first; + global_map_value gv = it->second; + + fprintf (stderr, "\n{"); + for (unsigned i = 0; i < v.size (); i++) + fprintf (stderr, "%s, ", v[i]->name ()); + fprintf (stderr, "} occuring %u times is assigned to %u partition\n", gv.count, gv.partition_index); +} + +static bool +global_pair_cmp (const global_map_t::iterator& it_1, const global_map_t::iterator& it_2) +{ + // compare for descending order + const std::vector< varpool_node *>& v1 = it_1->first; + global_map_value gv1 = it_1->second; + + const std::vector< varpool_node *>& v2 = it_2->first; + global_map_value gv2 = it_2->second; + + if (gv1.count > gv2.count) + return true; + + else if (gv1.count == gv2.count) + return v1.size () > v2.size (); + + else + return false; +} + +/* The function tries to assign a set of variables to partition that references it most. + The intent is that variables in this set would then be anchored and their address + computations can be CSE'd by using the anchor. + + The function performs 3 steps: + a) Assign set of variables used in a function to partition if that partition has maximum + references for the set. We do not consider set of variables used across functions, + for eg, if a is used in f1() and b in f2() then do not consider {a, b}. Only consider + sets of variables which are referenced in a single function. + So if f() references {a, b, c}, then consider + each combination involving at-least 2 vars: {a, b}, {a, c}, {b, c} and {a, b, c} and count + references for each of these sets in the partition. + global_map[set] == partition that has max references for the set. + If partition 0 hax max references (5) for {a, b} then global_map[{a, b}] = <0, 5> + where 0 is the partition_index and 5 is the number of references. global_map_value + is used to represent the tuple <partition_index, number_of_references>. + + b) Sort global_map in descending order according to count. + This is done so that if sets have overlapping variables then + we choose first to allocate the one which occurs more number of times. + For eg, if set {a, b} occurs 5 times in partition 0 and {a, b, c} occurs + 2 times in partition 1, then we first want to consider allocating + {a, b} to partition 0. + + c) Step 3: Try allocating variables to set in the partition. + This step might not add all variables from the set in the partition since + they could have been allocated in another partition. + + TODO: Integrate with cost-metrics for balanced partitioning. + + Possible improvements: + a) If a set has same number occurence in multiple partitions, choose a partition + whose functions have more repeated number of occurence of that set. + for eg: + partition 0: f1() uses {a, b} 3 times. + partition 1: f2(), f3(), f4() use {a, b} 1 time each. + In this case alocating {a, b} to partition 0 instead of partiton 1 would be more beneficial. + However this might not always be a win. + + b) Currently we choose a set with higher priority, whose occurence is strictly greater. + This might not always be optimal. + For instance if in a partition {a, b} occurs 'x' times and {a, b, c} occurs 'y' times, + and y < x but very close to x, it probably makes more sense to first consider the set {a, b, c}. + Perhaps define a range [x - c, x] where c is constant and choose allocation of set with higher + number of elements, whose number of occurences is y if y falls within [x -c, x]. + + c) hot/cold functions: We might want to ignore occurence of set in cold and noreturn functions and give + a higher priority to set that occurs in a hot function. +*/ + +static void +assign_vars_to_partitions (void) +{ +#define ASSIGN_VARS_TO_PARTITIONS_DEBUG + + global_map_t global_map; + + /* Step 1: Assign set to partition that has maxmium references for the set. */ + + for (unsigned i = 0; i < ltrans_partitions.length (); i++) + { + /* partition_map: map of set -> number of times it's referenced in the partition. */ + typedef std::map< std::vector< varpool_node * >, unsigned > partition_map_t; + partition_map_t partition_map; + + const ltrans_partition& part = ltrans_partitions[i]; + for (unsigned j = 0; j < lto_symtab_encoder_size (part->encoder); j++) + { + cgraph_node *cnode = dyn_cast<cgraph_node *> (lto_symtab_encoder_deref (part->encoder, j)); + if (!cnode) + continue; + + /* var_map: map of variable -> number of times it's referenced in the function. */ + std::map<varpool_node *, unsigned> var_map; + ipa_ref *ref; + + for (unsigned k = 0; cnode->iterate_reference (k, ref); k++) + if (varpool_node *vnode = dyn_cast<varpool_node *> (ref->referred)) + var_map[vnode]++; + + /* vars: set of variables referenced by function. */ + std::vector< varpool_node * > vars; + for (std::map<varpool_node *, unsigned>::const_iterator it = var_map.begin (); it != var_map.end (); ++it) + vars.push_back (it->first); + + /* If a function refers to only one variable, there's no point anchoring it. */ + if (vars.size () == 1) + continue; + + /* Compute the power set for vars and for each set within the power_set, count + occurences of the set in function. + The occurence of set == occurence of variable, which has least occurence + for instance occurence{a, b} == min (nrefs(a), nrefs(b)). + number of references for individual variables are already computed in var_map. */ + + std::vector< std::vector< varpool_node * > > vars_pset; + compute_powerset (vars, vars_pset); + + for (unsigned i = 0; i < vars_pset.size (); ++i) + { + unsigned min = UINT_MAX; + const std::vector< varpool_node * >& v = vars_pset[i]; + + for (unsigned i = 0; i < v.size (); i++) + if (var_map[v[i]] < min) + min = var_map[v[i]]; + partition_map[v] += min; + } + } + + /* Walk each set that we computed occurences for in this partition. If a set S has max occurences + in current partition, then update global_map[S] .count = partition_map[S], .partition_index = i; */ + + for (partition_map_t::iterator it = partition_map.begin (); it != partition_map.end (); ++it) + { + const std::vector<varpool_node *>& v = it->first; + unsigned count = it->second; + + global_map_value& gv = global_map[v]; + if (count > gv.count) + { + gv.partition_index = i; + gv.count = count; + } + } + } + + /* Step 2: We are not really sorting global_map, rather initially store all + global_map_t::iterators in keys vector, and then sort keys in + descending order. global_pair_cmp is used for comparison. */ + + std::vector< global_map_t::iterator > keys; + + for (global_map_t::iterator it = global_map.begin (); it != global_map.end (); ++it) + keys.push_back (it); + + std::sort (keys.begin (), keys.end (), global_pair_cmp); + + /* Step 3: Walk each iterator in keys, and attempt to put variables from set + in the respective partition. */ + + for (unsigned i = 0; i < keys.size (); i++) + { + global_map_t::iterator it = keys[i]; + const std::vector< varpool_node * >& v = it->first; + global_map_value gv = it->second; + + for (unsigned j = 0; j < v.size (); j++) + { + varpool_node *vnode = v[j]; + + if (!vnode->definition) + continue; + + if (!symbol_partitioned_p (vnode) && flag_toplevel_reorder + && !vnode->no_reorder + && vnode->get_partitioning_class () == SYMBOL_PARTITION) + { + // fprintf (stderr, "%s goes to partition %u\n", vnode->name (), gv.partition_index); + add_symbol_to_partition (ltrans_partitions[gv.partition_index], vnode); + } + } + } +} /* Group cgraph nodes into equally-sized partitions. @@ -467,6 +741,7 @@ lto_balanced_map (int n_lto_partitions, int max_partition_size) int npartitions; int current_order = -1; int noreorder_pos = 0; + bool section_anchor_supported_p = target_supports_section_anchors_p (); FOR_EACH_VARIABLE (vnode) gcc_assert (!vnode->aux); @@ -628,6 +903,9 @@ lto_balanced_map (int n_lto_partitions, int max_partition_size) for (j = 0; refs_node->iterate_reference (j, ref); j++) if (is_a <varpool_node *> (ref->referred)) { + if (section_anchor_supported_p) + continue; + int index; vnode = dyn_cast <varpool_node *> (ref->referred); @@ -663,6 +941,9 @@ lto_balanced_map (int n_lto_partitions, int max_partition_size) for (j = 0; refs_node->iterate_referring (j, ref); j++) if (is_a <varpool_node *> (ref->referring)) { + if (section_anchor_supported_p) + continue; + int index; vnode = dyn_cast <varpool_node *> (ref->referring); @@ -763,6 +1044,8 @@ lto_balanced_map (int n_lto_partitions, int max_partition_size) } next_nodes.truncate (0); + if (section_anchor_supported_p) + assign_vars_to_partitions (); /* Varables that are not reachable from the code go into last partition. */ if (flag_toplevel_reorder) diff --git a/gcc/toplev.c b/gcc/toplev.c index 543b8a3..0fd0d54 100644 --- a/gcc/toplev.c +++ b/gcc/toplev.c @@ -1165,7 +1165,7 @@ general_init (const char *argv0, bool init_signals) /* Return true if the current target supports -fsection-anchors. */ -static bool +bool target_supports_section_anchors_p (void) { if (targetm.min_anchor_offset == 0 && targetm.max_anchor_offset == 0) diff --git a/gcc/toplev.h b/gcc/toplev.h index 06923cf..d8b9f77 100644 --- a/gcc/toplev.h +++ b/gcc/toplev.h @@ -100,4 +100,5 @@ extern const char *set_random_seed (const char *); extern void initialize_rtl (void); +extern bool target_supports_section_anchors_p (void); #endif /* ! GCC_TOPLEV_H */