On Wed, 27 Mar 2024, Michael Matz wrote:

> Hey,
> 
> On Wed, 27 Mar 2024, Jakub Jelinek wrote:
> 
> > > @@ -1712,12 +1711,9 @@ compute_idf (bitmap def_blocks, bitmap_head *dfs)
> > >        gcc_checking_assert (bb_index
> > >                      < (unsigned) last_basic_block_for_fn (cfun));
> > >  
> > > -	  EXECUTE_IF_AND_COMPL_IN_BITMAP (&dfs[bb_index], phi_insertion_points,
> > > -					  0, i, bi)
> > > -	    {
> > > +	  EXECUTE_IF_SET_IN_BITMAP (&dfs[bb_index], 0, i, bi)
> > > +	    if (bitmap_set_bit (phi_insertion_points, i))
> > >  	      bitmap_set_bit (work_set, i);
> > > -	      bitmap_set_bit (phi_insertion_points, i);
> > > -	    }
> > >      }
> > 
> > I don't understand why the above is better.
> > Wouldn't it be best to do
> >       bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
> >                              phi_insertion_points);
> >       bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);
> > ?
> 
> I had the same hunch, but:
> 
> 1) One would have to make work_set be non-tree-view again (which with the 
> current structure is a wash anyway, and that makes sense as accesses to 
> work_set aren't heavily random here).

The tree-view is a wash indeed (I tried many things).

> 2) But doing that and using bitmap_ior.._into is still measurably slower: 
> on a reduced testcase with -O0 -fno-checking, proposed structure 
> (tree-view or not-tree-view workset doesn't matter):
> 
>  tree SSA rewrite                   :  14.93 ( 12%)   0.01 (  2%)  14.95 ( 12%)    27M (  8%)
> 
> with non-tree-view, and your suggestion:
> 
>  tree SSA rewrite                   :  20.68 ( 12%)   0.02 (  4%)  20.75 ( 12%)    27M (  8%)
> 
> I can only speculate that the usually extreme sparsity of the bitmaps in 
> question makes the setup costs of the two bitmap_ior calls more 
> expensive than the often-skipped second call to bitmap_set_bit in Richi's 
> proposed structure.  (That, or cache effects.)

So slightly "better" than Jakub's variant would be

      if (bitmap_ior_and_compl_into (work_set, &dfs[bb_index],
                                     phi_insertion_points))
        bitmap_ior_into (phi_insertion_points, &dfs[bb_index]);

since phi_insertion_points grows, that IOR becomes more expensive over
time.

The above for me (today Zen2, yesterday Zen4) is

 tree SSA rewrite                   : 181.02 ( 37%) 

with the unconditional ior_into:

 tree SSA rewrite                   : 180.93 ( 36%)

while my patch is

 tree SSA rewrite                   :  22.04 (  6%)

Not sure what uarch Micha tested on.  I think the testcase simply has
many variables we write into SSA (so many compute_idf calls) and many BBs
but very low popcount in DFS[], so iterating over DFS[] only is very
beneficial here as opposed to also walking phi_insertion_points
and work_set.  I think low popcount in DFS[] is quite typical
for a CFG - and for sure the popcount of DFS[] is going to be lower
than the popcount of the IDF (phi_insertion_points).

Btw, with my patch compute_idf is completely off the profile, so it's
hard to improve further (we do possibly process blocks twice, for
example, but that doesn't make a difference here).

Indeed, gathering statistics shows the maximum popcount of a dominance
frontier is 8, but 99% have just a single block.  The popcount of the
final IDF, however, is more than 10000 half of the time and more than
1000 90% of the time.

I have pushed the patch now.

Richard.
