Hi Richard, Thanks for the review. On Tue, 30 Oct 2018 at 01:25, Richard Biener <richard.guent...@gmail.com> wrote: > > On Mon, Oct 29, 2018 at 2:06 AM Kugan Vivekanandarajah > <kugan.vivekanandara...@linaro.org> wrote: > > > > Hi Richard and Jeff, > > > > Thanks for your comments. > > > > On Fri, 26 Oct 2018 at 19:40, Richard Biener <richard.guent...@gmail.com> > > wrote: > > > > > > On Fri, Oct 26, 2018 at 4:55 AM Jeff Law <l...@redhat.com> wrote: > > > > > > > > On 10/25/18 4:33 PM, Kugan Vivekanandarajah wrote: > > > > > Hi, > > > > > > > > > > PR87528 showed a case where libgcc generated popcount is causing > > > > > regression for Skylake. > > > > > We also have PR86677 where kernel build is failing because the kernel > > > > > does not use the libgcc (when backend is not defining popcount > > > > > pattern). While I agree that the kernel should implement its own > > > > > functionality when it is not using the libgcc, I am afraid that the > > > > > implementation can have the same performance issues reported for > > > > > Skylake in PR87528. > > > > > > > > > > Therefore, I would like to propose that we disable popcount detection > > > > > when we don't have a pattern for that. The attached patch (based on > > > > > previous discussions) does this. > > > > > > > > > > Bootstrapped and regression tested on x86_64-linux-gnu with no new > > > > > regressions. We need to disable the popcount* testcases. I will have > > > > > to define a effective_target_with_popcount in > > > > > gcc/testsuite/lib/target-supports.exp if this patch is OK? > > > > > Thanks, > > > > > Kugan > > > > > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > 2018-10-25 Kugan Vivekanandarajah <kug...@linaro.org> > > > > > > > > > > * tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN > > > > > POPCOUNT > > > > > as expensive when backend does not define it. > > > > > > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > 2018-10-25 Kugan Vivekanandarajah <kug...@linaro.org> > > > > > > > > > > * gcc.target/aarch64/popcount4.c: New test. > > > > > > > > > FWIW, I've been disabling by checking direct_optab_handler elsewhere > > > > (number_of_iterations_popcount) in my tester. It may in fact be an old > > > > patch from you. > > > > > > > > Richi argued that it's the kernel team's responsibility to provide a > > > > popcount since they don't link with libgcc. And I'm generally in > > > > agreement with that position, though it does tend to generate some > > > > friction with the kernel developers. We also run the real risk of GCC 9 > > > > not being able to build the kernel which, IMHO, would be a disaster from > > > > a PR standpoint. > > > > > > > > I'd like to hear from others here. I fully realize we're beyond the > > > > realm of what is strictly technically correct here from a review > > > > standpoint. > > > > > > As said final value replacement to a library call is probably not wanted > > > for optimization purpose, so adjusting expression_expensive_p is OK with > > > me. It might not fully solve the (non-)issue in case another > > > optimization pass > > > chooses to materialize niter computation result. > > > > > > Few comments on the patch: > > > > > > + tree fndecl = get_callee_fndecl (expr); > > > + > > > + if (fndecl && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL) > > > + { > > > + combined_fn cfn = as_combined_fn (DECL_FUNCTION_CODE (fndecl)); > > > > > > combined_fn cfn = gimple_call_combined_fn (expr); > > > switch (cfn) > > > { > > > > Did you mean: > > combined_fn cfn = get_call_combined_fn (expr); > > Yes. > > > > ... > > > > > > cfn will be CFN_LAST for a non-builtin/internal call. I know Richard is > > > mostly > > > offline but eventually he knows whether there is a better way to query > > > > > > + CASE_CFN_POPCOUNT: > > > + /* Check if opcode for popcount is available. */ > > > + if (optab_handler (popcount_optab, > > > + TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG > > > (expr, 0)))) > > > + == CODE_FOR_nothing) > > > + return true; > > > > > > note that we currently generate builtin calls rather than IFN calls > > > (when a direct > > > optab is supported). > > > > > > Another comment on the patch is that you probably have to adjust existing > > > popcount testcases to add architecture specific flags enabling suport for > > > the instructions, otherwise you won't see loop replacement. > > Indeed. > > In lib/target-supports.exp, I will try to add support for > > check_effective_target_popcount_long. > > When I grep for the popcount pattern in md files, I see it is defined for: > > > > tilegx > > tilepro > > alpha > > aarch64 when TARGET_SIMD > > ia64 > > rs6000 > > i386 when TARGET_POPCOUNT > > popwerpcspce when TARGET_POPCNTB || TARGET_POPCNTD > > s390 when TARGET_Z916 && TARGET_64BIT > > sparc when TARGET_POPC > > arm when TARGET_NEON > > mips when ISA_HAS_POP > > spu > > avr > > > > I can check these targets with the condition. > > Another possibility is to check with a sample code and see if we are > > getting a libcall in the asm. Not sure if that is straightforward. Are > > there any example for such. > > You could try linking w/o libgcc ... I realized that there are some examples already and I have based it on that. Attached patch 0001-fix-kernel-build-v3.patch does this. Bootstrapped and regression tested on aarch64-linux-gnu with no new regression. Is this OK?
> > > We could also move these test to a primary target that is tested often > > tested which also defines popcount pattern. I dont think these tests > > change for targets and if we can test in one target that could be > > enough, > > > > I am happy to implement what is appropriate. > > How about the -Os idea? Attached patch 0002-allow-builtin-popcount-for-size_p.patch attempts this. As you mentioned, this will again break the kernel. While I am trying to provide the popcount implementation for the newer kernel, it will still be a problem for existing kernel versions. May I request that we provide a flag to disable this if we decide to go with the patch please? Thanks, Kugan > > Richard. > > > Thanks, > > Kugan > > > > > > > > > > > > Also I think that the expression is only expensive (for final value > > > replacement!) > > > if you consider optimizing for speed. When optimizing for size getting > > > rid of > > > the loop is probably beneificial unconditionally. That would leave the > > > possibility to switch said testcases to -Os. It would require adding a > > > bool size_p flag to expression_expensive and passing down > > > optimize_loop_for_size_p (). > > > > > > _NOTE_ that expression_expensive_p is also used by IVOPTs and there > > > replacing sth with an expression based on the niter analysis result > > > doesn't > > > mean we get rid of the loop (but only of an IV), so maybe that reasoning > > > doesn't apply there. > > > > > > Richard. > > > > > > > Jeff > > > >
From 085d75c277bab60e8f71caff0c396a6495e38bcb Mon Sep 17 00:00:00 2001 From: Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org> Date: Wed, 24 Oct 2018 20:33:50 +1100 Subject: [PATCH 1/2] fix kernel build [v3] Change-Id: I4df1ccb6bb671610072dfbc60afc5724e17609bb --- gcc/testsuite/g++.dg/tree-ssa/pr86544.C | 1 + gcc/testsuite/gcc.dg/tree-ssa/popcount.c | 1 + gcc/testsuite/gcc.dg/tree-ssa/popcount2.c | 1 + gcc/testsuite/gcc.dg/tree-ssa/popcount3.c | 1 + gcc/testsuite/gcc.target/aarch64/popcount4.c | 14 ++++++++++++++ gcc/testsuite/lib/target-supports.exp | 12 ++++++++++++ gcc/tree-scalar-evolution.c | 16 ++++++++++++++++ 7 files changed, 46 insertions(+) create mode 100644 gcc/testsuite/gcc.target/aarch64/popcount4.c diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr86544.C b/gcc/testsuite/g++.dg/tree-ssa/pr86544.C index fd844b4..ef43891 100644 --- a/gcc/testsuite/g++.dg/tree-ssa/pr86544.C +++ b/gcc/testsuite/g++.dg/tree-ssa/pr86544.C @@ -1,4 +1,5 @@ /* { dg-do compile } */ +/* { dg-require-effective-target popcountl } */ /* { dg-options "-O2 -fdump-tree-phiopt4 -fdump-tree-optimized" } */ int PopCount (long b) { diff --git a/gcc/testsuite/gcc.dg/tree-ssa/popcount.c b/gcc/testsuite/gcc.dg/tree-ssa/popcount.c index a5ec3b3..b469410 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/popcount.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/popcount.c @@ -1,4 +1,5 @@ /* { dg-do compile } */ +/* { dg-require-effective-target popcountl } */ /* { dg-options "-O3 -fdump-tree-optimized -fno-tree-ch" } */ extern int foo (int); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c b/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c index 9096c6b..ef73e34 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/popcount2.c @@ -1,4 +1,5 @@ /* { dg-do run } */ +/* { dg-require-effective-target popcountl } */ /* { dg-options "-O2 -fno-tree-ch -fdump-tree-optimized" } */ int diff --git a/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c b/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c index fd844b4..ef43891 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c @@ -1,4 +1,5 @@ /* { dg-do compile } */ +/* { dg-require-effective-target popcountl } */ /* { dg-options "-O2 -fdump-tree-phiopt4 -fdump-tree-optimized" } */ int PopCount (long b) { diff --git a/gcc/testsuite/gcc.target/aarch64/popcount4.c b/gcc/testsuite/gcc.target/aarch64/popcount4.c new file mode 100644 index 0000000..ee55b2e --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/popcount4.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized -mgeneral-regs-only" } */ + +int PopCount (long b) { + int c = 0; + + while (b) { + b &= b - 1; + c++; + } + return c; +} + +/* { dg-final { scan-tree-dump-times "__builtin_popcount" 0 "optimized" } } */ diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index fd74c04..214582f 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -6500,6 +6500,18 @@ proc check_effective_target_sync_long_long { } { } } +# Return 1 if the target supports popcount on long. + +proc check_effective_target_popcountl { } { + return [check_no_messages_and_pattern popcountl "!__popcount" assembly { + int foo () + { + long volatile b = 3; + return __builtin_popcountl (b); + } + } "" ] +} + # Return 1 if the target supports atomic operations on "long long" # and can execute them. # diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c index 6475743..01e29e3 100644 --- a/gcc/tree-scalar-evolution.c +++ b/gcc/tree-scalar-evolution.c @@ -257,7 +257,9 @@ along with GCC; see the file COPYING3. If not see #include "system.h" #include "coretypes.h" #include "backend.h" +#include "target.h" #include "rtl.h" +#include "optabs-query.h" #include "tree.h" #include "gimple.h" #include "ssa.h" @@ -282,6 +284,7 @@ along with GCC; see the file COPYING3. If not see #include "gimple-fold.h" #include "tree-into-ssa.h" #include "builtins.h" +#include "case-cfn-macros.h" static tree analyze_scalar_evolution_1 (struct loop *, tree); static tree analyze_scalar_evolution_for_address_of (struct loop *loop, @@ -3501,6 +3504,19 @@ expression_expensive_p (tree expr) tree arg; call_expr_arg_iterator iter; + combined_fn cfn = get_call_combined_fn (expr); + switch (cfn) + { + CASE_CFN_POPCOUNT: + /* Check if opcode for popcount is available. */ + if (optab_handler (popcount_optab, + TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG (expr, 0)))) + == CODE_FOR_nothing) + return true; + default: + break; + } + if (!is_inexpensive_builtin (get_callee_fndecl (expr))) return true; FOR_EACH_CALL_EXPR_ARG (arg, iter, expr) -- 2.7.4
From 714c0fff95c3cd4bfde14f7649deb829962ec50b Mon Sep 17 00:00:00 2001 From: Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org> Date: Fri, 2 Nov 2018 19:46:06 +1100 Subject: [PATCH 2/2] allow builtin popcount for size_p Change-Id: I3843088970b9cad482b2d480f7aa3e1d5fa11614 --- gcc/tree-scalar-evolution.c | 24 +++++++++++++----------- gcc/tree-scalar-evolution.h | 2 +- gcc/tree-ssa-loop-ivopts.c | 8 ++++++-- 3 files changed, 20 insertions(+), 14 deletions(-) diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c index 01e29e3..bace52c 100644 --- a/gcc/tree-scalar-evolution.c +++ b/gcc/tree-scalar-evolution.c @@ -285,6 +285,7 @@ along with GCC; see the file COPYING3. If not see #include "tree-into-ssa.h" #include "builtins.h" #include "case-cfn-macros.h" +#include "predict.h" static tree analyze_scalar_evolution_1 (struct loop *, tree); static tree analyze_scalar_evolution_for_address_of (struct loop *loop, @@ -3472,10 +3473,10 @@ scev_finalize (void) } /* Returns true if the expression EXPR is considered to be too expensive - for scev_const_prop. */ + for scev_const_prop. If SIZE_P is true, it it optimized for size. */ bool -expression_expensive_p (tree expr) +expression_expensive_p (tree expr, bool size_p) { enum tree_code code; @@ -3509,8 +3510,9 @@ expression_expensive_p (tree expr) { CASE_CFN_POPCOUNT: /* Check if opcode for popcount is available. */ - if (optab_handler (popcount_optab, - TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG (expr, 0)))) + if (!size_p + && optab_handler (popcount_optab, + TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG (expr, 0)))) == CODE_FOR_nothing) return true; default: @@ -3520,13 +3522,13 @@ expression_expensive_p (tree expr) if (!is_inexpensive_builtin (get_callee_fndecl (expr))) return true; FOR_EACH_CALL_EXPR_ARG (arg, iter, expr) - if (expression_expensive_p (arg)) + if (expression_expensive_p (arg, size_p)) return true; return false; } if (code == COND_EXPR) - return (expression_expensive_p (TREE_OPERAND (expr, 0)) + return (expression_expensive_p (TREE_OPERAND (expr, 0), size_p) || (EXPR_P (TREE_OPERAND (expr, 1)) && EXPR_P (TREE_OPERAND (expr, 2))) /* If either branch has side effects or could trap. */ @@ -3534,19 +3536,19 @@ expression_expensive_p (tree expr) || generic_expr_could_trap_p (TREE_OPERAND (expr, 1)) || TREE_SIDE_EFFECTS (TREE_OPERAND (expr, 0)) || generic_expr_could_trap_p (TREE_OPERAND (expr, 0)) - || expression_expensive_p (TREE_OPERAND (expr, 1)) - || expression_expensive_p (TREE_OPERAND (expr, 2))); + || expression_expensive_p (TREE_OPERAND (expr, 1), size_p) + || expression_expensive_p (TREE_OPERAND (expr, 2), size_p)); switch (TREE_CODE_CLASS (code)) { case tcc_binary: case tcc_comparison: - if (expression_expensive_p (TREE_OPERAND (expr, 1))) + if (expression_expensive_p (TREE_OPERAND (expr, 1), size_p)) return true; /* Fallthru. */ case tcc_unary: - return expression_expensive_p (TREE_OPERAND (expr, 0)); + return expression_expensive_p (TREE_OPERAND (expr, 0), size_p); default: return true; @@ -3615,7 +3617,7 @@ final_value_replacement_loop (struct loop *loop) he probably knows that n is not large, and does not want it to be turned into n %= 45. */ - || expression_expensive_p (def)) + || expression_expensive_p (def, optimize_loop_for_size_p (loop))) { if (dump_file && (dump_flags & TDF_DETAILS)) { diff --git a/gcc/tree-scalar-evolution.h b/gcc/tree-scalar-evolution.h index f58ce66..1f19736 100644 --- a/gcc/tree-scalar-evolution.h +++ b/gcc/tree-scalar-evolution.h @@ -35,7 +35,7 @@ extern tree resolve_mixers (struct loop *, tree, bool *); extern void gather_stats_on_scev_database (void); extern void final_value_replacement_loop (struct loop *); extern unsigned int scev_const_prop (void); -extern bool expression_expensive_p (tree); +extern bool expression_expensive_p (tree, bool); extern bool simple_iv_with_niters (struct loop *, struct loop *, tree, struct affine_iv *, tree *, bool); extern bool simple_iv (struct loop *, struct loop *, tree, struct affine_iv *, diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c index b42f32d..038eabd 100644 --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -5149,8 +5149,12 @@ may_eliminate_iv (struct ivopts_data *data, *comp = iv_elimination_compare (data, use); /* It is unlikely that computing the number of iterations using division - would be more profitable than keeping the original induction variable. */ - if (expression_expensive_p (*bound)) + would be more profitable than keeping the original induction variable. + + When replacing a loop, second argument to expression_expensive_p is used + in deciding if size should be considered. In this case since this is + only concerning IV, we pass false. */ + if (expression_expensive_p (*bound, false)) return false; /* Sometimes, it is possible to handle the situation that the number of -- 2.7.4