Hi Honza,

Thanks for the review.
> On 11 Nov 2025, at 10:59 am, Jan Hubicka <[email protected]> wrote:
>
>> diff --git a/gcc/hierarchical_discriminator.h b/gcc/hierarchical_discriminator.h
>> new file mode 100644
>> index 00000000000..dd3cb1b0ae7
>> --- /dev/null
>> +++ b/gcc/hierarchical_discriminator.h
>> @@ -0,0 +1,75 @@
>> +/* Copyright The GNU Toolchain Authors
>> +
>> +This file is part of GCC.
>> +
>> +GCC is free software; you can redistribute it and/or modify it under
>> +the terms of the GNU General Public License as published by the Free
>> +Software Foundation; either version 3, or (at your option) any later
>> +version.
>> +
>> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
>> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>> +for more details.
>> +
>> +You should have received a copy of the GNU General Public License
>> +along with GCC; see the file COPYING3.  If not see
>> +<http://www.gnu.org/licenses/>.  */
>> +
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#ifndef GCC_HIERARCHICAL_DISCRIMINATOR_H
>> +#define GCC_HIERARCHICAL_DISCRIMINATOR_H
>> +
>> +#include "gimple.h"
>> +#include "tree.h"
>> +#include "basic-block.h"
>> +#include "input.h"
>> +
>> +/* Hierarchical discriminator layout (32 bits total):
>> +   - Base:  bits 0-11  (12 bits, 0-4095)
>> +   - Pass1: bits 12-23 (12 bits, 0-4095)
>> +   - Pass2: bits 24-31 (8 bits, 0-255)
>> +
>> +   Base discriminator: Used by front-end and early passes to distinguish
>> +   different statements on the same source line.
>> +
>> +   Pass1 discriminator: Used by middle-end optimizations to distinguish
>> +   different versions/contexts of the same code:
>> +   - Loop versioning (vectorized vs scalar)
>> +   - Inlining contexts (different callsites)
>> +   - Function cloning
>> +
>> +   Pass2 discriminator: Used by late optimizations to distinguish
>> +   iterations or variants:
>> +   - Loop unrolling iterations
>> +   - Vectorization variants  */
>> +
>> +/* Loop versioning discriminators.  */
>> +#define DISCRIMINATOR_LOOP_VERSION_VECTORIZED 1 /* Vectorized version.  */
>> +#define DISCRIMINATOR_LOOP_VERSION_SCALAR     2 /* Scalar version.  */
>> +#define DISCRIMINATOR_LOOP_VERSION_ALIGNED    3 /* Aligned version.  */
>> +#define DISCRIMINATOR_LOOP_VERSION_UNALIGNED  4 /* Unaligned version.  */
>> +
>> +/* Loop transformation discriminators.  */
>> +#define DISCRIMINATOR_LOOP_UNROLLED 5 /* Unrolled loop.  */
>> +#define DISCRIMINATOR_LOOP_PEELED   6 /* Peeled loop.  */
>
> I think we want to solve two problems:
>
> 1) Handle the situation when a statement is duplicated (unrolling, loop
>    header copying etc.) and thus different locations in the program
>    correspond to different copies.
>
>    Current code simply picks the address with maximal count (expecting that
>    the statement perhaps spans multiple BBs) while we should really sum
>    counts of execution of the different copies.
>
> 2) Handle the situation where a single address implies multiple executions
>    of the same statement (i.e. after vectorization).

This makes sense. So, do you suggest we have just copy_id and multiplicity
fields in addition to the base discriminator?

Should we also have a bit dedicated to instructions that are combined by
optimisations such as CSE? The reason is that we have to be careful when
propagating sample counts from these instructions. We could also try to
infer their counts from the surrounding instructions where possible.
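To make sure we are on the same page, below is a rough sketch of the kind of
encoding I have in mind. The field widths and the LOC_* names are only
placeholders I made up for illustration, not a concrete proposal:

  /* Hypothetical split of the 32-bit discriminator word:
     - base discriminator: bits 0-11  (what the front end assigns today)
     - copy_id:            bits 12-21 (which duplicate of the statement)
     - multiplicity:       bits 22-29 (how many executions one copy stands for)
     - combined flag:      bit 30     (statement merged by CSE or similar).  */

  #define LOC_BASE_BITS         12
  #define LOC_COPY_ID_BITS      10
  #define LOC_MULTIPLICITY_BITS 8

  /* Extract the individual fields from a discriminator word D.  */
  #define LOC_BASE(D) ((D) & ((1u << LOC_BASE_BITS) - 1))
  #define LOC_COPY_ID(D) \
    (((D) >> LOC_BASE_BITS) & ((1u << LOC_COPY_ID_BITS) - 1))
  #define LOC_MULTIPLICITY(D) \
    (((D) >> (LOC_BASE_BITS + LOC_COPY_ID_BITS)) \
     & ((1u << LOC_MULTIPLICITY_BITS) - 1))
  #define LOC_COMBINED_P(D) \
    (((D) >> (LOC_BASE_BITS + LOC_COPY_ID_BITS + LOC_MULTIPLICITY_BITS)) & 1u)

With something like this, afdo would sum the counts over all copy_ids of a
base location, scale each copy by its multiplicity, and treat counts of
combined statements as unreliable (or infer them from neighbouring
instructions).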
I will split the patch based on the response to this.

Thanks,
Kugan

> I wonder whether we really need to use predefined bits for the different
> transformations we have.
>
> For example in
>
> 1 int test()
> 2 {
> 3   int a[3];
> 4   for (int i = 0; i < 3; i++)
> 5     a[i]=0;
> 6   for (int i = 0; i < 3; i++)
> 7     a[i]++;
> 8   int s = 0;
> 9   for (int i = 0; i < 3; i++)
> 10    s+=a[i];
> 11  return s;
> }
>
> we eventually unroll all the loops to figure CSE through them and then
> remove all references to a.  After the unrolling of the first loop we have:
>
> [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
> [t.c:5:4] # DEBUG BEGIN_STMT
> [t.c:5:8] [t.c:5:5] a[0] = 0;
> [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
> [t.c:4:27 discrim 3] i_54 = 1;
> [t.c:4:27 discrim 3] # DEBUG i => i_54
> # DEBUG i => i_54
> [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
> [t.c:5:4] # DEBUG BEGIN_STMT
> [t.c:5:8] [t.c:5:5] a[i_54] = 0;
> [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
> [t.c:4:27 discrim 3] i_58 = i_54 + 1;
> [t.c:4:27 discrim 3] # DEBUG i => i_58
> # DEBUG i => i_58
> [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
> [t.c:5:4] # DEBUG BEGIN_STMT
> [t.c:5:8] [t.c:5:5] a[i_58] = 0;
> [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
> [t.c:4:27 discrim 3] i_62 = i_58 + 1;
> [t.c:4:27 discrim 3] # DEBUG i => i_62
>
> so each statement exists 3 times which, if it survived to final code,
> would already result in wrong AFDO data since it will be counted just
> once.
>
> Here we only need to assign copy ids:
>
> [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
> [t.c:5:4] # DEBUG BEGIN_STMT
> [t.c:5:8] [t.c:5:5] a[0] = 0;
> [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
> [t.c:4:27 discrim 3] i_54 = 1;
> [t.c:4:27 discrim 3] # DEBUG i => i_54
> # DEBUG i => i_54
> [t.c:4:21 discrim 1 copy 1] # DEBUG BEGIN_STMT
> [t.c:5:4 copy 1] # DEBUG BEGIN_STMT
> [t.c:5:8 copy 1] [t.c:5:5] a[i_54] = 0;
> [t.c:4:27 discrim 3 copy 1] # DEBUG BEGIN_STMT
> [t.c:4:27 discrim 3 copy 1] i_58 = i_54 + 1;
> [t.c:4:27 discrim 3 copy 1] # DEBUG i => i_58
> # DEBUG i => i_58
> [t.c:4:21 discrim 1 copy 2] # DEBUG BEGIN_STMT
> [t.c:5:4 copy 2] # DEBUG BEGIN_STMT
> [t.c:5:8 copy 2] [t.c:5:5] a[i_58] = 0;
> [t.c:4:27 discrim 3 copy 2] # DEBUG BEGIN_STMT
> [t.c:4:27 discrim 3 copy 2] i_62 = i_58 + 1;
> [t.c:4:27 discrim 3 copy 2] # DEBUG i => i_62
>
> Later we optimize out the stores and end up with:
>
> [t.c:3:3] # DEBUG BEGIN_STMT
> [t.c:4:3] # DEBUG BEGIN_STMT
> [t.c:4:8] # DEBUG BEGIN_STMT
> [t.c:5:4] # DEBUG BEGIN_STMT
> [t.c:4:27 discrim 3] # DEBUG BEGIN_STMT
> # DEBUG i => 3
> [t.c:4:21 discrim 1] # DEBUG BEGIN_STMT
> [t.c:7:4] # DEBUG BEGIN_STMT
> [t.c:6:27 discrim 3] # DEBUG BEGIN_STMT
> # DEBUG i => 3
> [t.c:6:21 discrim 1] # DEBUG BEGIN_STMT
> [t.c:10:4] # DEBUG BEGIN_STMT
> [t.c:9:27 discrim 3] # DEBUG BEGIN_STMT
> # DEBUG i => 3
> # DEBUG s => 3
> [t.c:9:21 discrim 1] # DEBUG BEGIN_STMT
> [t.c:11:3] # DEBUG BEGIN_STMT
> [t.c:11:10] return 3;
>
> Which means that we know all the line numbers of the now optimized-out code,
> but we do not know that some lines were originally executed multiple times.
> If a copy ID was added we should get i.e.
>
> [t.c:5:4] # DEBUG BEGIN_STMT
> [t.c:5:4 copy 1] # DEBUG BEGIN_STMT
> [t.c:5:4 copy 2] # DEBUG BEGIN_STMT
>
> which we possibly can reduce to
>
> [t.c:5:4 multiplicity 3] # DEBUG BEGIN_STMT
>
> To make this work, I think we only need to watch for places where we
> duplicate statements and be sure new copy_ids are assigned (i.e. in copy_id),
> either from a counter in struct_function or from a hash remembering the last
> copy of each base location.
>
> Then we need to watch for places where we remove duplicated BEGIN_STMTs,
> i.e. in:
>
>   else if (gimple_debug_begin_stmt_p (stmt))
>     {
>       /* We are only keeping the last debug-begin in a series of
>          debug-begin stmts.  */
>       if (locs_seen.add (gimple_location (stmt)))
>         remove_dead_stmt (&gsi, bb, to_remove_edges);
>       continue;
>     }
>
> And we need to adjust the vectorizer to handle BEGIN_STMTs sanely: not throw
> them away, and increase multiplicities instead.
>
> Or do I miss something important here?
>
> Honza
