On 6 July 2016 at 22:25, Andrew Pinski <[email protected]> wrote:
> On Wed, Jul 6, 2016 at 5:00 AM, Prathamesh Kulkarni
> <[email protected]> wrote:
>> On 4 July 2016 at 13:51, Andrew Pinski <[email protected]> wrote:
>>> On Mon, Jul 4, 2016 at 12:58 AM, Prathamesh Kulkarni
>>> <[email protected]> wrote:
>>>> Hi,
>>>> I have attached a "quick and dirty" prototype patch (var-partition-1.diff),
>>>> that attempts to partition variables to reduce number of
>>>> external references and to increase usage of section-anchors
>>>> to CSE address computation of global variables.
>>>>
>>>> We could put a variable in a partition that has max references for it,
>>>> however it doesn't lend itself directly to section anchor optimization.
>>>> For instance if a partition has max references for variables 'a' and 'b',
>>>> but no function in that partition references both 'a', and 'b' then AFAIU
>>>> it doesn't make any difference from section anchors perspective to have
>>>> them
>>>> in same partition.
>>>>
>>>> The patch tries to assign a set of variables (>= 2)
>>>> to a partition whose functions have maximum references for that set.
>>>> Functions within the partition that reference the variables
>>>> in the set can take advantage of section-anchors. Functions
>>>> referencing the variables in the set outside the partition
>>>> would need to load them as external references (using movw/movt),
>>>> however since we are placing the set in partition that has maximal
>>>> references for it, number of external references should be overall
>>>> reduced.
>>>>
>>>> Partitioning is gated by -flto-var-partition and enabled
>>>> only for arm and aarch64.
>>>
>>> Why only for arm and aarch64? Shouldn't it be enabled for all section
>>> anchor targets?
>> AFAIK the only targets supporting section anchors are arm, aarch64 and
>> powerpc.
>> I didn't enable it for ppc64 because I am not sure how much profitable
>> it is for that target.
>> Honza mentioned to me some time back that effect of partitioning on
>> powerpc was nearly zero.
>
>
> No MIPS has section anchors enabled too. Plus MIPS will benefit the
> same way as AARCH64 and ARM. PowerPC32 would too.
>
> I don't think it is correct to enable it only for arm and aarch64.
Thanks, I updated the patch to remove -flto-var-partition
and gated the partition currently on target_supports_section_anchors_p()
(although it doesn't test if -fsection-anchors is passed).
Um I am not able to see where mips has section anchors enabled ?
mips.c does not seem to override min_anchor_offset and max_anchor_offset
hooks. Both these hooks have values 0 for default, and
target_supports_section_anchors_p()
returns false if both these hooks have value 0.
Thanks,
Prathamesh
>
> Thanks,
> Andrew Pinski
>
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Andrew
>>>
>>>> As per previous discussion [1], I haven't
>>>> touched function partitioning. Does this approach look ok
>>>> especially regarding correctness ?
>>>> So far, I have cross-tested patch on arm*-*-*, aarch64*-*-*.
>>>>
>>>> I haven't yet managed to benchmark the patch.
>>>> As a cheap measurement, I tried to measure number of external
>>>> references with and without patch by writing a small ipa pass
>>>> which is run during ltrans and simply walks over varpool nodes
>>>> and counts number of varpool_nodes for which DECL_EXTERNAL (vnode->decl)
>>>> is true
>>>> and vnode->definition is 0. Is that sufficient condition to determine
>>>> if variable is externally defined ? I have attached the pass
>>>> (count-external-refs.diff)
>>>> and the comparison done with it for for SPEC2000 [2]. The entries
>>>> in "before" and "after" column contain summation of number of
>>>> external refs (total_count) across all partitions before and after applying
>>>> the patch. Does the comparison hold any merit ?
>>>> I was wondering if we could we use a better way for
>>>> measuring statically the effects of variable partitioning ?
>>>> I hope also to get done with benchmarking soon.
>>>>
>>>> I have not yet figured out how to integrate it with existing cost metrics
>>>> for
>>>> balanced partitioning, I am looking into that.
>>>> I would be grateful for suggestions on the patch.
>>>>
>>>> [1] https://gcc.gnu.org/ml/gcc/2016-04/msg00090.html
>>>>
>>>> [2] SPEC2000 comparison:
>>>> https://docs.google.com/spreadsheets/d/1xnszyw04ksoyBspmCVYesq6KARLw-PA2n3T4aoaKdYw/edit?usp=sharing
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
index 453343a..09b525e 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto/lto-partition.c
@@ -34,6 +34,12 @@ along with GCC; see the file COPYING3. If not see
#include "ipa-prop.h"
#include "ipa-inline.h"
#include "lto-partition.h"
+#include "toplev.h" /* for target_supports_section_anchors_p() */
+#include <map>
+#include <set>
+#include <algorithm>
+#include <vector>
+#include <algorithm>
vec<ltrans_partition> ltrans_partitions;
@@ -407,6 +413,274 @@ add_sorted_nodes (vec<symtab_node *> &next_nodes,
ltrans_partition partition)
add_symbol_to_partition (partition, node);
}
+/* FIXME: Currently I don't care to compute power set if set has more than 4
vars, which
+ I chose as an arbitrary upper limit and should be analyzed with
experimenting.
+ Or maybe fall back to computing power-set for n in general way, but that
can be
+ very expensive if n is large. */
+
+static void
+compute_powerset (std::vector<varpool_node *>& set, std::vector< std::vector<
varpool_node * > >& pset)
+{
+ if (set.size () > 4)
+ {
+ pset.push_back (set);
+ return;
+ }
+
+ std::vector<varpool_node *> v;
+
+#define add_2_elems(x, y) \
+ v.clear (); \
+ v.push_back (set[x]); \
+ v.push_back (set[y]); \
+
+#define add_2_elems_to_pset(x, y) \
+ add_2_elems(x, y) \
+ pset.push_back (v);
+
+#define add_3_elems(x, y, z) \
+ add_2_elems(x, y) \
+ v.push_back (set[z]); \
+
+#define add_3_elems_to_pset(x, y, z) \
+ add_3_elems(x, y, z) \
+ pset.push_back (v);
+
+#define add_4_elems(x, y, z, w) \
+ add_3_elems(x, y, z) \
+ v.push_back (set[z]);
+
+#define add_4_elems_to_pset(x, y, z, w) \
+ add_4_elems(x, y, z, w) \
+ pset.push_back (v);
+
+ switch (set.size ())
+ {
+ case 4:
+ add_2_elems_to_pset(0, 3)
+ add_2_elems_to_pset(1, 3)
+ add_2_elems_to_pset(2, 3)
+ add_3_elems_to_pset(0, 1, 3)
+ add_3_elems_to_pset(1, 2, 3)
+ add_4_elems_to_pset(0, 1, 2, 3)
+ /* fallthru. */
+
+ case 3:
+ add_2_elems_to_pset(0, 2)
+ add_2_elems_to_pset(1, 2)
+ add_3_elems_to_pset(0, 1, 2)
+ /* fallthru. */
+
+ case 2:
+ add_2_elems_to_pset(0, 1)
+ }
+
+}
+
+struct global_map_value {
+ unsigned partition_index;
+ unsigned count;
+};
+
+typedef std::pair< std::vector< varpool_node * >, global_map_value >
global_map_pair;
+typedef std::map< std::vector< varpool_node * >, global_map_value >
global_map_t;
+
+static void
+dump_global_pair (global_map_t::iterator it)
+{
+ const std::vector< varpool_node * >& v = it->first;
+ global_map_value gv = it->second;
+
+ fprintf (stderr, "\n{");
+ for (unsigned i = 0; i < v.size (); i++)
+ fprintf (stderr, "%s, ", v[i]->name ());
+ fprintf (stderr, "} occuring %u times is assigned to %u partition\n",
gv.count, gv.partition_index);
+}
+
+static bool
+global_pair_cmp (const global_map_t::iterator& it_1, const
global_map_t::iterator& it_2)
+{
+ // compare for descending order
+ const std::vector< varpool_node *>& v1 = it_1->first;
+ global_map_value gv1 = it_1->second;
+
+ const std::vector< varpool_node *>& v2 = it_2->first;
+ global_map_value gv2 = it_2->second;
+
+ if (gv1.count > gv2.count)
+ return true;
+
+ else if (gv1.count == gv2.count)
+ return v1.size () > v2.size ();
+
+ else
+ return false;
+}
+
+/* The function tries to assign a set of variables to partition that
references it most.
+ The intent is that variables in this set would then be anchored and their
address
+ computations can be CSE'd by using the anchor.
+
+ The function performs 3 steps:
+ a) Assign set of variables used in a function to partition if that
partition has maximum
+ references for the set. We do not consider set of variables used across
functions,
+ for eg, if a is used in f1() and b in f2() then do not consider {a, b}.
Only consider
+ sets of variables which are referenced in a single function.
+ So if f() references {a, b, c}, then consider
+ each combination involving at-least 2 vars: {a, b}, {a, c}, {b, c} and
{a, b, c} and count
+ references for each of these sets in the partition.
+ global_map[set] == partition that has max references for the set.
+ If partition 0 hax max references (5) for {a, b} then global_map[{a,
b}] = <0, 5>
+ where 0 is the partition_index and 5 is the number of references.
global_map_value
+ is used to represent the tuple <partition_index, number_of_references>.
+
+ b) Sort global_map in descending order according to count.
+ This is done so that if sets have overlapping variables then
+ we choose first to allocate the one which occurs more number of times.
+ For eg, if set {a, b} occurs 5 times in partition 0 and {a, b, c} occurs
+ 2 times in partition 1, then we first want to consider allocating
+ {a, b} to partition 0.
+
+ c) Step 3: Try allocating variables to set in the partition.
+ This step might not add all variables from the set in the partition
since
+ they could have been allocated in another partition.
+
+ TODO: Integrate with cost-metrics for balanced partitioning.
+
+ Possible improvements:
+ a) If a set has same number occurence in multiple partitions, choose a
partition
+ whose functions have more repeated number of occurence of that set.
+ for eg:
+ partition 0: f1() uses {a, b} 3 times.
+ partition 1: f2(), f3(), f4() use {a, b} 1 time each.
+ In this case alocating {a, b} to partition 0 instead of partiton 1 would
be more beneficial.
+ However this might not always be a win.
+
+ b) Currently we choose a set with higher priority, whose occurence is
strictly greater.
+ This might not always be optimal.
+ For instance if in a partition {a, b} occurs 'x' times and {a, b, c}
occurs 'y' times,
+ and y < x but very close to x, it probably makes more sense to first
consider the set {a, b, c}.
+ Perhaps define a range [x - c, x] where c is constant and choose
allocation of set with higher
+ number of elements, whose number of occurences is y if y falls within [x
-c, x].
+
+ c) hot/cold functions: We might want to ignore occurence of set in cold and
noreturn functions and give
+ a higher priority to set that occurs in a hot function.
+*/
+
+static void
+assign_vars_to_partitions (void)
+{
+#define ASSIGN_VARS_TO_PARTITIONS_DEBUG
+
+ global_map_t global_map;
+
+ /* Step 1: Assign set to partition that has maxmium references for the set.
*/
+
+ for (unsigned i = 0; i < ltrans_partitions.length (); i++)
+ {
+ /* partition_map: map of set -> number of times it's referenced in the
partition. */
+ typedef std::map< std::vector< varpool_node * >, unsigned >
partition_map_t;
+ partition_map_t partition_map;
+
+ const ltrans_partition& part = ltrans_partitions[i];
+ for (unsigned j = 0; j < lto_symtab_encoder_size (part->encoder); j++)
+ {
+ cgraph_node *cnode = dyn_cast<cgraph_node *>
(lto_symtab_encoder_deref (part->encoder, j));
+ if (!cnode)
+ continue;
+
+ /* var_map: map of variable -> number of times it's referenced in the
function. */
+ std::map<varpool_node *, unsigned> var_map;
+ ipa_ref *ref;
+
+ for (unsigned k = 0; cnode->iterate_reference (k, ref); k++)
+ if (varpool_node *vnode = dyn_cast<varpool_node *> (ref->referred))
+ var_map[vnode]++;
+
+ /* vars: set of variables referenced by function. */
+ std::vector< varpool_node * > vars;
+ for (std::map<varpool_node *, unsigned>::const_iterator it =
var_map.begin (); it != var_map.end (); ++it)
+ vars.push_back (it->first);
+
+ /* If a function refers to only one variable, there's no point
anchoring it. */
+ if (vars.size () == 1)
+ continue;
+
+ /* Compute the power set for vars and for each set within the
power_set, count
+ occurences of the set in function.
+ The occurence of set == occurence of variable, which has least
occurence
+ for instance occurence{a, b} == min (nrefs(a), nrefs(b)).
+ number of references for individual variables are already computed
in var_map. */
+
+ std::vector< std::vector< varpool_node * > > vars_pset;
+ compute_powerset (vars, vars_pset);
+
+ for (unsigned i = 0; i < vars_pset.size (); ++i)
+ {
+ unsigned min = UINT_MAX;
+ const std::vector< varpool_node * >& v = vars_pset[i];
+
+ for (unsigned i = 0; i < v.size (); i++)
+ if (var_map[v[i]] < min)
+ min = var_map[v[i]];
+ partition_map[v] += min;
+ }
+ }
+
+ /* Walk each set that we computed occurences for in this partition. If a
set S has max occurences
+ in current partition, then update global_map[S] .count =
partition_map[S], .partition_index = i; */
+
+ for (partition_map_t::iterator it = partition_map.begin (); it !=
partition_map.end (); ++it)
+ {
+ const std::vector<varpool_node *>& v = it->first;
+ unsigned count = it->second;
+
+ global_map_value& gv = global_map[v];
+ if (count > gv.count)
+ {
+ gv.partition_index = i;
+ gv.count = count;
+ }
+ }
+ }
+
+ /* Step 2: We are not really sorting global_map, rather initially store all
+ global_map_t::iterators in keys vector, and then sort keys in
+ descending order. global_pair_cmp is used for comparison. */
+
+ std::vector< global_map_t::iterator > keys;
+
+ for (global_map_t::iterator it = global_map.begin (); it != global_map.end
(); ++it)
+ keys.push_back (it);
+
+ std::sort (keys.begin (), keys.end (), global_pair_cmp);
+
+ /* Step 3: Walk each iterator in keys, and attempt to put variables from set
+ in the respective partition. */
+
+ for (unsigned i = 0; i < keys.size (); i++)
+ {
+ global_map_t::iterator it = keys[i];
+ const std::vector< varpool_node * >& v = it->first;
+ global_map_value gv = it->second;
+
+ for (unsigned j = 0; j < v.size (); j++)
+ {
+ varpool_node *vnode = v[j];
+
+ if (!vnode->definition)
+ continue;
+
+ if (!symbol_partitioned_p (vnode) && flag_toplevel_reorder
+ && !vnode->no_reorder
+ && vnode->get_partitioning_class () == SYMBOL_PARTITION)
+ {
+ // fprintf (stderr, "%s goes to partition %u\n", vnode->name (),
gv.partition_index);
+ add_symbol_to_partition (ltrans_partitions[gv.partition_index],
vnode);
+ }
+ }
+ }
+}
/* Group cgraph nodes into equally-sized partitions.
@@ -467,6 +741,7 @@ lto_balanced_map (int n_lto_partitions, int
max_partition_size)
int npartitions;
int current_order = -1;
int noreorder_pos = 0;
+ bool section_anchor_supported_p = target_supports_section_anchors_p ();
FOR_EACH_VARIABLE (vnode)
gcc_assert (!vnode->aux);
@@ -628,6 +903,9 @@ lto_balanced_map (int n_lto_partitions, int
max_partition_size)
for (j = 0; refs_node->iterate_reference (j, ref); j++)
if (is_a <varpool_node *> (ref->referred))
{
+ if (section_anchor_supported_p)
+ continue;
+
int index;
vnode = dyn_cast <varpool_node *> (ref->referred);
@@ -663,6 +941,9 @@ lto_balanced_map (int n_lto_partitions, int
max_partition_size)
for (j = 0; refs_node->iterate_referring (j, ref); j++)
if (is_a <varpool_node *> (ref->referring))
{
+ if (section_anchor_supported_p)
+ continue;
+
int index;
vnode = dyn_cast <varpool_node *> (ref->referring);
@@ -763,6 +1044,8 @@ lto_balanced_map (int n_lto_partitions, int
max_partition_size)
}
next_nodes.truncate (0);
+ if (section_anchor_supported_p)
+ assign_vars_to_partitions ();
/* Varables that are not reachable from the code go into last partition. */
if (flag_toplevel_reorder)
diff --git a/gcc/toplev.c b/gcc/toplev.c
index 543b8a3..0fd0d54 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -1165,7 +1165,7 @@ general_init (const char *argv0, bool init_signals)
/* Return true if the current target supports -fsection-anchors. */
-static bool
+bool
target_supports_section_anchors_p (void)
{
if (targetm.min_anchor_offset == 0 && targetm.max_anchor_offset == 0)
diff --git a/gcc/toplev.h b/gcc/toplev.h
index 06923cf..d8b9f77 100644
--- a/gcc/toplev.h
+++ b/gcc/toplev.h
@@ -100,4 +100,5 @@ extern const char *set_random_seed (const char *);
extern void initialize_rtl (void);
+extern bool target_supports_section_anchors_p (void);
#endif /* ! GCC_TOPLEV_H */