On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > > I guess the problem here is that the floating-point compare
> > > instruction is much more costly than other instructions, but this
> > > fact is not correctly modeled yet.  Could you try
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > > where I've raised the fp_add cost (which is used for estimating
> > > floating-point compare cost) to 5 instructions, and see if it solves
> > > your problem without LOGICAL_OP_NON_SHORT_CIRCUIT?
> >
> > I think this is not the same issue as the cost of floating-point
> > comparison instructions.  The definition of
> > LOGICAL_OP_NON_SHORT_CIRCUIT affects how a short-circuit operation
> > such as (A AND-IF B) is executed; it is not directly related to the
> > cost of floating-point comparison instructions.  I will try to test
> > it using SPEC CPU 2017.
>
> The point is that if the cost of a floating-point comparison is very
> high, the middle end *should* short-circuit floating-point comparisons
> even if LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
>
> I've created https://gcc.gnu.org/PR112985.
>
> Another factor regressing the code is that we haven't modeled the
> movcf2gr instruction yet, so we are not really eliding the branches as
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.
I made up this:

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
   [(set_attr "type" "fcmp")
    (set_attr "mode" "FCC")])
 
+(define_insn "movcf2gr<GPR:mode>"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+	(if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+			      (const_int 0))
+			  (const_int 1)
+			  (const_int 0)))]
+  "TARGET_HARD_FLOAT"
+  "movcf2gr\t%0,%1"
+  [(set_attr "type" "move")
+   (set_attr "mode" "FCC")])
+
+(define_expand "cstore<ANYF:mode>4"
+  [(set (match_operand:SI 0 "register_operand")
+	(match_operator:SI 1 "loongarch_fcmp_operator"
+	  [(match_operand:ANYF 2 "register_operand")
+	   (match_operand:ANYF 3 "register_operand")]))]
+  ""
+  {
+    rtx fcc = gen_reg_rtx (FCCmode);
+    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+			      operands[2], operands[3]);
+
+    emit_insn (gen_rtx_SET (fcc, cmp));
+
+    if (TARGET_64BIT)
+      {
+	rtx gpr = gen_reg_rtx (DImode);
+	emit_insn (gen_movcf2grdi (gpr, fcc));
+	emit_insn (gen_rtx_SET (operands[0],
+				lowpart_subreg (SImode, gpr, DImode)));
+      }
+    else
+      emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+    DONE;
+  })
+
 ;;
 ;;  ....................
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
 (define_predicate "loongarch_cstore_operator"
   (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
 
+(define_predicate "loongarch_fcmp_operator"
+  (match_code
+    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
 (define_predicate "small_data_pattern"
   (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
        (match_test "loongarch_small_data_pattern_p (op)")))

and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT = 1):

	fld.s	$f1,$r4,0
	fld.s	$f0,$r4,4
	fld.s	$f3,$r4,8
	fld.s	$f2,$r4,12
	fcmp.slt.s	$fcc1,$f0,$f3
	fcmp.sgt.s	$fcc0,$f1,$f2
	movcf2gr	$r13,$fcc1
	movcf2gr	$r12,$fcc0
	or	$r12,$r12,$r13
	bnez	$r12,.L3
	fld.s	$f4,$r4,16
	fld.s	$f5,$r4,20
	or	$r4,$r0,$r0
	fcmp.sgt.s	$fcc1,$f1,$f5
	fcmp.slt.s	$fcc0,$f0,$f4
	movcf2gr	$r12,$fcc1
	movcf2gr	$r13,$fcc0
	or	$r12,$r12,$r13
	bnez	$r12,.L2
	fcmp.sgt.s	$fcc1,$f3,$f5
	fcmp.slt.s	$fcc0,$f2,$f4
	movcf2gr	$r4,$fcc1
	movcf2gr	$r12,$fcc0
	or	$r4,$r4,$r12
	xori	$r4,$r4,1
	slli.w	$r4,$r4,0
	jr	$r1
	.align	4
.L3:
	or	$r4,$r0,$r0
	.align	4
.L2:
	jr	$r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code,
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle this
via the ext_dce pass [1] in the future.

[1]: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html

-- 
Xi Ruoyao <xry...@xry111.site>
School of Aerospace Science and Technology, Xidian University