On 6 July 2016 at 22:25, Andrew Pinski <pins...@gmail.com> wrote:
> On Wed, Jul 6, 2016 at 5:00 AM, Prathamesh Kulkarni
> <prathamesh.kulka...@linaro.org> wrote:
>> On 4 July 2016 at 13:51, Andrew Pinski <pins...@gmail.com> wrote:
>>> On Mon, Jul 4, 2016 at 12:58 AM, Prathamesh Kulkarni
>>> <prathamesh.kulka...@linaro.org> wrote:
>>>> Hi,
>>>> I have attached a "quick and dirty" prototype patch (var-partition-1.diff),
>>>> that attempts to partition variables to reduce number of
>>>> external references and to increase usage of section-anchors
>>>> to CSE address computation of global variables.
>>>>
>>>> We could put a variable in a partition that has max references for it,
>>>> however it doesn't lend itself directly to section anchor optimization.
>>>> For instance if a partition has max references for variables 'a' and 'b',
>>>> but no function in that partition references both 'a', and 'b' then AFAIU
>>>> it doesn't make any difference from section anchors perspective to have 
>>>> them
>>>> in same partition.
>>>>
>>>> The patch tries to assign a set of variables (>= 2)
>>>> to a partition whose functions have maximum references for that set.
>>>> Functions within the partition that reference the variables
>>>> in the set can take advantage of section-anchors. Functions
>>>> referencing the variables in the set outside the partition
>>>> would need to load them as external references (using movw/movt),
>>>> however since we are placing the set in partition that has maximal
>>>> references for it, number of external references should be overall
>>>> reduced.
>>>>
>>>> Partitioning is gated by -flto-var-partition and enabled
>>>> only for arm and aarch64.
>>>
>>> Why only for arm and aarch64?  Shouldn't it be enabled for all section
>>> anchor targets?
>> AFAIK the only targets supporting section anchors are arm, aarch64 and 
>> powerpc.
>> I didn't enable it for ppc64 because I am not sure how much profitable
>> it is for that target.
>> Honza mentioned to me some time back that effect of partitioning on
>> powerpc was nearly zero.
>
>
> No MIPS has section anchors enabled too.  Plus MIPS will benefit the
> same way as AARCH64 and ARM.  PowerPC32 would too.
>
> I don't think it is correct to enable it only for arm and aarch64.
Thanks, I updated the patch to remove -flto-var-partition
and gated the partition currently on target_supports_section_anchors_p()
(although it doesn't test if -fsection-anchors is passed).
Um I am not able to see where mips has section anchors enabled ?
mips.c does not seem to override min_anchor_offset and max_anchor_offset
hooks. Both these hooks have values 0 for default, and
target_supports_section_anchors_p()
returns false if both these hooks have value 0.

Thanks,
Prathamesh
>
> Thanks,
> Andrew Pinski
>
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Andrew
>>>
>>>> As per previous discussion [1], I haven't
>>>> touched function partitioning. Does this approach look ok
>>>> especially regarding correctness ?
>>>> So far, I have cross-tested patch on arm*-*-*, aarch64*-*-*.
>>>>
>>>> I haven't yet managed to benchmark the patch.
>>>> As a cheap measurement, I tried to measure number of external
>>>> references with and without patch by writing a small ipa pass
>>>> which is run during ltrans and simply walks over varpool nodes
>>>> and counts number of varpool_nodes for which DECL_EXTERNAL (vnode->decl) 
>>>> is true
>>>> and vnode->definition is 0. Is that sufficient condition to determine
>>>> if variable is externally defined ? I have attached the pass
>>>> (count-external-refs.diff)
>>>> and the comparison done with it for for SPEC2000 [2]. The entries
>>>> in "before" and "after" column contain summation of number of
>>>> external refs (total_count) across all partitions before and after applying
>>>> the patch. Does the comparison hold any merit ?
>>>> I was wondering if we could we use a better way for
>>>> measuring statically the effects of variable partitioning ?
>>>> I hope also to get done with benchmarking soon.
>>>>
>>>> I have not yet figured out how to integrate it with existing cost metrics 
>>>> for
>>>> balanced partitioning, I am looking into that.
>>>> I would be grateful for suggestions on the patch.
>>>>
>>>> [1] https://gcc.gnu.org/ml/gcc/2016-04/msg00090.html
>>>>
>>>> [2] SPEC2000 comparison:
>>>> https://docs.google.com/spreadsheets/d/1xnszyw04ksoyBspmCVYesq6KARLw-PA2n3T4aoaKdYw/edit?usp=sharing
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
index 453343a..09b525e 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto/lto-partition.c
@@ -34,6 +34,12 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-prop.h"
 #include "ipa-inline.h"
 #include "lto-partition.h"
+#include "toplev.h" /* for target_supports_section_anchors_p()  */
+#include <map>
+#include <set>
+#include <algorithm>
+#include <vector>
+#include <algorithm>
 
 vec<ltrans_partition> ltrans_partitions;
 
@@ -407,6 +413,274 @@ add_sorted_nodes (vec<symtab_node *> &next_nodes, 
ltrans_partition partition)
       add_symbol_to_partition (partition, node);
 }
 
+/* FIXME: Currently I don't care to compute power set if set has more than 4 
vars, which
+   I chose as an arbitrary upper limit and should be analyzed with 
experimenting.
+   Or maybe fall back to computing power-set for n in general way, but that 
can be
+   very expensive if n is large.  */
+
+static void
+compute_powerset (std::vector<varpool_node *>& set, std::vector< std::vector< 
varpool_node * > >& pset)
+{
+  if (set.size () > 4)
+    {
+      pset.push_back (set);
+      return;
+    }
+  
+  std::vector<varpool_node *> v;
+
+#define add_2_elems(x, y)  \
+  v.clear ();  \
+  v.push_back (set[x]);    \
+  v.push_back (set[y]);    \
+
+#define add_2_elems_to_pset(x, y) \
+  add_2_elems(x, y) \
+  pset.push_back (v);
+
+#define add_3_elems(x, y, z)  \
+  add_2_elems(x, y)  \
+  v.push_back (set[z]); \
+
+#define add_3_elems_to_pset(x, y, z) \
+  add_3_elems(x, y, z) \
+  pset.push_back (v);
+
+#define add_4_elems(x, y, z, w)  \
+  add_3_elems(x, y, z) \
+  v.push_back (set[z]); 
+
+#define add_4_elems_to_pset(x, y, z, w) \
+  add_4_elems(x, y, z, w) \
+  pset.push_back (v);
+
+  switch (set.size ()) 
+    {
+      case 4:
+       add_2_elems_to_pset(0, 3)
+       add_2_elems_to_pset(1, 3)
+       add_2_elems_to_pset(2, 3)
+       add_3_elems_to_pset(0, 1, 3)
+       add_3_elems_to_pset(1, 2, 3)
+       add_4_elems_to_pset(0, 1, 2, 3)
+       /* fallthru.  */
+      
+      case 3:
+       add_2_elems_to_pset(0, 2)
+       add_2_elems_to_pset(1, 2)
+       add_3_elems_to_pset(0, 1, 2)
+       /* fallthru.  */
+
+      case 2:
+        add_2_elems_to_pset(0, 1)      
+    }
+
+}
+
+struct global_map_value {
+  unsigned partition_index;
+  unsigned count;
+};
+
+typedef std::pair< std::vector< varpool_node * >, global_map_value > 
global_map_pair;
+typedef std::map< std::vector< varpool_node * >, global_map_value > 
global_map_t;
+
+static void
+dump_global_pair (global_map_t::iterator it)
+{
+  const std::vector< varpool_node * >& v = it->first;
+  global_map_value gv = it->second;
+
+  fprintf (stderr, "\n{");
+  for (unsigned i = 0; i < v.size (); i++)
+    fprintf (stderr, "%s, ", v[i]->name ());
+  fprintf (stderr, "} occuring %u times is assigned to %u partition\n", 
gv.count, gv.partition_index);
+}
+
+static bool
+global_pair_cmp (const global_map_t::iterator& it_1, const 
global_map_t::iterator& it_2)
+{
+  // compare for descending order
+  const std::vector< varpool_node *>& v1 = it_1->first;
+  global_map_value gv1 = it_1->second;
+
+  const std::vector< varpool_node *>& v2 = it_2->first;
+  global_map_value gv2 = it_2->second;
+
+  if (gv1.count > gv2.count)
+    return true;
+
+  else if (gv1.count == gv2.count)
+    return v1.size () > v2.size (); 
+
+  else
+    return false;
+}
+
+/*  The function tries to assign a set of variables to partition that 
references it most.
+    The intent is that variables in this set would then be anchored and their 
address
+    computations can be CSE'd by using the anchor.
+
+    The function performs 3 steps:
+    a) Assign set of variables used in a function to partition if that 
partition has maximum
+       references for the set. We do not consider set of variables used across 
functions,
+       for eg, if a is used in f1() and b in f2() then do not consider {a, b}. 
Only consider
+       sets of variables which are referenced in a single function.
+       So if f() references {a, b, c}, then consider
+       each combination involving at-least 2 vars: {a, b}, {a, c}, {b, c} and 
{a, b, c} and count
+       references for each of these sets in the partition.
+       global_map[set] == partition that has max references for the set.
+       If partition 0 hax max references (5) for {a, b} then global_map[{a, 
b}] = <0, 5>
+       where 0 is the partition_index and 5 is the number of references. 
global_map_value
+       is used to represent the tuple <partition_index, number_of_references>.
+
+    b) Sort global_map in descending order according to  count.
+       This is done so that if sets have overlapping variables then
+       we choose first to allocate the one which occurs more number of times.
+       For eg, if set {a, b} occurs 5 times in partition 0 and {a, b, c} occurs
+       2 times in partition 1, then we first want to consider allocating
+       {a, b} to partition 0.
+
+    c) Step 3: Try allocating variables to set in the partition.
+       This step might not add all variables from the set in the partition 
since
+       they could have been allocated in another partition. 
+
+   TODO: Integrate with cost-metrics for balanced partitioning.
+
+   Possible improvements:
+   a) If a set has same number occurence in multiple partitions, choose a 
partition
+      whose functions have more repeated number of occurence of that set. 
+      for eg:
+      partition 0: f1() uses {a, b} 3 times.
+      partition 1: f2(), f3(), f4() use {a, b} 1 time each.
+      In this case alocating {a, b} to partition 0 instead of partiton 1 would 
be more beneficial.
+      However this might not always be a win.
+
+   b) Currently we choose a set with higher priority, whose occurence is 
strictly greater.
+      This might not always be optimal.
+      For instance if in a  partition {a, b} occurs 'x' times and {a, b, c} 
occurs 'y' times,
+      and y < x but very close to x, it probably makes more sense to first 
consider the set {a, b, c}.
+      Perhaps define a range [x - c, x] where c is constant and choose 
allocation of set with higher
+      number of elements, whose number of occurences is y if y falls within [x 
-c, x].
+
+   c) hot/cold functions: We might want to ignore occurence of set in cold and 
noreturn functions and give
+      a higher priority to set that occurs in a hot function.
+*/
+
+static void
+assign_vars_to_partitions (void)
+{
+#define ASSIGN_VARS_TO_PARTITIONS_DEBUG
+
+  global_map_t global_map;
+
+  /* Step 1: Assign set to partition that has maxmium references for the set.  
*/
+
+  for (unsigned i = 0; i < ltrans_partitions.length (); i++)
+    {
+      /* partition_map: map of set -> number of times it's referenced in the 
partition.  */ 
+      typedef std::map< std::vector< varpool_node * >, unsigned > 
partition_map_t;
+      partition_map_t partition_map;
+
+      const ltrans_partition& part = ltrans_partitions[i];
+      for (unsigned j = 0; j < lto_symtab_encoder_size (part->encoder); j++)
+       {
+         cgraph_node *cnode = dyn_cast<cgraph_node *> 
(lto_symtab_encoder_deref (part->encoder, j));
+         if (!cnode)
+           continue;
+
+         /* var_map: map of variable -> number of times it's referenced in the 
function.  */
+         std::map<varpool_node *, unsigned> var_map;
+         ipa_ref *ref;   
+
+         for (unsigned k = 0; cnode->iterate_reference (k, ref); k++)
+           if (varpool_node *vnode = dyn_cast<varpool_node *> (ref->referred))
+             var_map[vnode]++;
+
+         /* vars: set of variables referenced by function.  */
+         std::vector< varpool_node * > vars; 
+         for (std::map<varpool_node *, unsigned>::const_iterator it = 
var_map.begin (); it != var_map.end (); ++it)
+           vars.push_back (it->first);
+
+         /* If a function refers to only one variable, there's no point 
anchoring it.  */ 
+         if (vars.size () == 1)
+           continue;
+
+         /* Compute the power set for vars and for each set within the 
power_set, count
+            occurences of the set in function.
+             The occurence of set == occurence of variable, which has least 
occurence
+             for instance occurence{a, b} == min (nrefs(a), nrefs(b)).
+            number of references for individual variables are already computed 
in var_map.  */
+
+         std::vector< std::vector< varpool_node * > > vars_pset; 
+         compute_powerset (vars, vars_pset);
+
+         for (unsigned i = 0; i < vars_pset.size (); ++i)
+           {
+             unsigned min = UINT_MAX;
+             const std::vector< varpool_node * >& v = vars_pset[i];
+
+             for (unsigned i = 0; i < v.size (); i++)
+               if (var_map[v[i]] < min)
+                 min = var_map[v[i]];
+             partition_map[v] += min;
+           }
+       }
+
+      /* Walk each set that we computed occurences for in this partition. If a 
set S has max occurences
+        in current partition, then update global_map[S] .count = 
partition_map[S], .partition_index = i;  */
+    
+      for (partition_map_t::iterator it = partition_map.begin (); it != 
partition_map.end (); ++it)
+       {
+         const std::vector<varpool_node *>& v = it->first;
+         unsigned count = it->second;
+
+         global_map_value& gv = global_map[v];
+         if (count > gv.count)
+           {
+             gv.partition_index = i; 
+             gv.count = count;
+           }
+       }
+    }
+
+  /* Step 2: We are not really sorting global_map, rather initially store all
+     global_map_t::iterators in keys vector, and then sort keys in
+     descending order. global_pair_cmp is used for comparison.  */ 
+
+  std::vector< global_map_t::iterator > keys;
+
+  for (global_map_t::iterator it = global_map.begin (); it != global_map.end 
(); ++it)
+    keys.push_back (it);
+  
+  std::sort (keys.begin (), keys.end (), global_pair_cmp);
+  
+  /* Step 3: Walk each iterator in keys, and attempt to put variables from set
+     in the respective partition.  */
+
+  for (unsigned i = 0; i < keys.size (); i++)
+    {
+      global_map_t::iterator it = keys[i];
+      const std::vector< varpool_node * >& v = it->first;
+      global_map_value gv = it->second;
+
+      for (unsigned j = 0; j < v.size (); j++)
+       {
+         varpool_node *vnode = v[j];
+
+         if (!vnode->definition)
+           continue;
+
+         if (!symbol_partitioned_p (vnode) && flag_toplevel_reorder
+             && !vnode->no_reorder
+             && vnode->get_partitioning_class () == SYMBOL_PARTITION)
+           {
+         //    fprintf (stderr, "%s goes to partition %u\n", vnode->name (), 
gv.partition_index);
+             add_symbol_to_partition (ltrans_partitions[gv.partition_index], 
vnode);
+           }
+       }
+    }
+}
 
 /* Group cgraph nodes into equally-sized partitions.
 
@@ -467,6 +741,7 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
   int npartitions;
   int current_order = -1;
   int noreorder_pos = 0;
+  bool section_anchor_supported_p = target_supports_section_anchors_p ();
 
   FOR_EACH_VARIABLE (vnode)
     gcc_assert (!vnode->aux);
@@ -628,6 +903,9 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
          for (j = 0; refs_node->iterate_reference (j, ref); j++)
            if (is_a <varpool_node *> (ref->referred))
              {
+               if (section_anchor_supported_p)
+                 continue;
+
                int index;
 
                vnode = dyn_cast <varpool_node *> (ref->referred);
@@ -663,6 +941,9 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
          for (j = 0; refs_node->iterate_referring (j, ref); j++)
            if (is_a <varpool_node *> (ref->referring))
              {
+               if (section_anchor_supported_p)
+                 continue;
+
                int index;
 
                vnode = dyn_cast <varpool_node *> (ref->referring);
@@ -763,6 +1044,8 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
     }
 
   next_nodes.truncate (0);
+  if (section_anchor_supported_p)
+    assign_vars_to_partitions ();
 
   /* Varables that are not reachable from the code go into last partition.  */
   if (flag_toplevel_reorder)
diff --git a/gcc/toplev.c b/gcc/toplev.c
index 543b8a3..0fd0d54 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -1165,7 +1165,7 @@ general_init (const char *argv0, bool init_signals)
 
 /* Return true if the current target supports -fsection-anchors.  */
 
-static bool
+bool
 target_supports_section_anchors_p (void)
 {
   if (targetm.min_anchor_offset == 0 && targetm.max_anchor_offset == 0)
diff --git a/gcc/toplev.h b/gcc/toplev.h
index 06923cf..d8b9f77 100644
--- a/gcc/toplev.h
+++ b/gcc/toplev.h
@@ -100,4 +100,5 @@ extern const char *set_random_seed (const char *);
 
 extern void initialize_rtl (void);
 
+extern bool target_supports_section_anchors_p (void);
 #endif /* ! GCC_TOPLEV_H */

Reply via email to