Re: [PATCH] Implement hierarchical discriminators for AutoFDO

Jan Hubicka Mon, 10 Nov 2025 15:59:17 -0800

> diff --git a/gcc/hierarchical_discriminator.h 
> b/gcc/hierarchical_discriminator.h
> new file mode 100644
> index 00000000000..dd3cb1b0ae7
> --- /dev/null
> +++ b/gcc/hierarchical_discriminator.h
> @@ -0,0 +1,75 @@
> +/* Copyright The GNU Toolchain Authors
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#ifndef GCC_HIERARCHICAL_DISCRIMINATOR_H
> +#define GCC_HIERARCHICAL_DISCRIMINATOR_H
> +
> +#include "gimple.h"
> +#include "tree.h"
> +#include "basic-block.h"
> +#include "input.h"
> +
> +/* Hierarchical discriminator layout (32 bits total):
> +   - Base: bits 0-11 (12 bits, 0-4095)
> +   - Pass1: bits 12-23 (12 bits, 0-4095)
> +   - Pass2: bits 24-31 (8 bits, 0-255)
> +
> +   Base discriminator: Used by front-end and early passes to distinguish
> +                    different statements on the same source line.
> +
> +   Pass1 discriminator: Used by middle-end optimizations to distinguish
> +                     different versions/contexts of the same code:
> +                     - Loop versioning (vectorized vs scalar)
> +                     - Inlining contexts (different callsites)
> +                     - Function cloning
> +
> +   Pass2 discriminator: Used by late optimizations to distinguish
> +                     iterations or variants:
> +                     - Loop unrolling iterations
> +                     - Vectorization variants
> + */
> +
> +/* Loop versioning discriminators.  */
> +#define DISCRIMINATOR_LOOP_VERSION_VECTORIZED  1  /* Vectorized version.  */
> +#define DISCRIMINATOR_LOOP_VERSION_SCALAR      2  /* Scalar version.  */
> +#define DISCRIMINATOR_LOOP_VERSION_ALIGNED     3  /* Aligned version.  */
> +#define DISCRIMINATOR_LOOP_VERSION_UNALIGNED   4  /* Unaligned version.  */
> +
> +/* Loop transformation discriminators.  */
> +#define DISCRIMINATOR_LOOP_UNROLLED         5  /* Unrolled loop.  */
> +#define DISCRIMINATOR_LOOP_PEELED           6  /* Peeled loop.  */


I think we want to solve two problems
 1) Handle situation when statement is duplicated (unrolling, loop
    header copying etc.) and thus different locations in program
    corresponds to different copies.

    Current code simply picks address with maximal count (expecting that
    the statemnt perhaps spans multiple BBs) while we should really sum
    counts of execution of different copies.
 1) Handle situation where single address implies multiple executions
    of the same statement (i.e. after vectorization)

I wonder whether we really need to use predefined bits for different
tranformations we have.

For example in
1  int test()
2  {
3   int a[3];
4   for (int i = 0; i < 3; i++)
5           a[i]=0;
6   for (int i = 0; i < 3; i++)
7           a[i]++;
8   int s = 0;
9   for (int i = 0; i < 3; i++)
10          s+=a[i];
11 return s;
}

we eventually unroll all the loops to figure CSE through them and then
remove all references to a.  After the unrolling of first loop we have:

  [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
  [t.c:5:4] # DEBUG BEGIN_STMT
  [t.c:5:8] [t.c:5:5] a[0] = 0;
  [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
  [t.c:4:27 discrim 3] i_54 = 1;
  [t.c:4:27 discrim 3] # DEBUG i => i_54
  # DEBUG i => i_54
  [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
  [t.c:5:4] # DEBUG BEGIN_STMT
  [t.c:5:8] [t.c:5:5] a[i_54] = 0;
  [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
  [t.c:4:27 discrim 3] i_58 = i_54 + 1;
  [t.c:4:27 discrim 3] # DEBUG i => i_58
  # DEBUG i => i_58
  [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
  [t.c:5:4] # DEBUG BEGIN_STMT 
  [t.c:5:8] [t.c:5:5] a[i_58] = 0;
  [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
  [t.c:4:27 discrim 3] i_62 = i_58 + 1;
  [t.c:4:27 discrim 3] # DEBUG i => i_62

so each statement exists 3 times which, if it survived to final code,
would already result in wrong AFDO data since it will be counted just
once.

Here we only need to assign copy ids:

  [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
  [t.c:5:4] # DEBUG BEGIN_STMT
  [t.c:5:8] [t.c:5:5] a[0] = 0;
  [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
  [t.c:4:27 discrim 3] i_54 = 1;
  [t.c:4:27 discrim 3] # DEBUG i => i_54
  # DEBUG i => i_54
  [t.c:4:21 discrim 1 copy 1] # DEBUG BEGIN_STMT
  [t.c:5:4 copy 1] # DEBUG BEGIN_STMT
  [t.c:5:8 copy 1] [t.c:5:5] a[i_54] = 0;
  [t.c:4:27 discrim 3 copy 1] # DEBUG BEGIN_STMT
  [t.c:4:27 discrim 3 copy 1] i_58 = i_54 + 1;
  [t.c:4:27 discrim 3 copy 1] # DEBUG i => i_58
  # DEBUG i => i_58
  [t.c:4:21 discrim 1 copy 2] # DEBUG BEGIN_STMT
  [t.c:5:4 copy 2] # DEBUG BEGIN_STMT 
  [t.c:5:8 copy 2] [t.c:5:5] a[i_58] = 0;
  [t.c:4:27 discrim 3 copy 2] # DEBUG BEGIN_STMT
  [t.c:4:27 discrim 3 copy 2] i_62 = i_58 + 1;
  [t.c:4:27 discrim 3 copy 2] # DEBUG i => i_62

Later we optimize out the stores end end up with:

  [t.c:3:3] # DEBUG BEGIN_STMT
  [t.c:4:3] # DEBUG BEGIN_STMT
  [t.c:4:8] # DEBUG BEGIN_STMT
  [t.c:5:4] # DEBUG BEGIN_STMT
  [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
  # DEBUG i => 3
  [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
  [t.c:7:4] # DEBUG BEGIN_STMT
  [t.c:6:27 discrim 3] # DEBUG BEGIN_STMT
  # DEBUG i => 3
  [t.c:6:21 discrim 1] # DEBUG BEGIN_STMT
  [t.c:10:4] # DEBUG BEGIN_STMT
  [t.c:9:27 discrim 3] # DEBUG BEGIN_STMT
  # DEBUG i => 3
  # DEBUG s => 3
  [t.c:9:21 discrim 1] # DEBUG BEGIN_STMT
  [t.c:11:3] # DEBUG BEGIN_STMT
  [t.c:11:10] return 3;

Which means that we know all the line numbers of now optimized out code
but we do not know that some lines was originally executed multiple
times.  If copy ID was added we should get i.e.
  [t.c:5:4] # DEBUG BEGIN_STMT
  [t.c:5:4 copy 1] # DEBUG BEGIN_STMT
  [t.c:5:4 copy 2] # DEBUG BEGIN_STMT

We we possibly can reduce to

  [t.c:5:4 multiplicity 3] # DEBUG BEGIN_STMT

To make this work, I think we only should watch for places we duplicate
statements and be sure new copy_ids are assigned (i.e. in copy_id). Either from 
counter in
struct_function or we can have hash remembering last copy of each base
location.

Then we need to watch for places where we remove duplicated BEGIN_STMTs
i.e. in:
          else if (gimple_debug_begin_stmt_p (stmt))
            {
              /* We are only keeping the last debug-begin in a series of
                 debug-begin stmts.  */
              if (locs_seen.add (gimple_location (stmt)))
                remove_dead_stmt (&gsi, bb, to_remove_edges);
              continue;
            }

And we need to adjust vectorizer, to behave sanely to BEGIN_STMT, not
throw them away and increase multiplicities.
Or do I miss something important here?

Honza

Re: [PATCH] Implement hierarchical discriminators for AutoFDO

Reply via email to