On Thu, May 7, 2026 at 4:14 PM Hongtao Liu <[email protected]> wrote: > > On Thu, May 7, 2026 at 3:38 PM H.J. Lu <[email protected]> wrote: > > > > On Thu, May 7, 2026 at 3:16 PM Hongtao Liu <[email protected]> wrote: > > > > > > On Thu, May 7, 2026 at 3:07 PM H.J. Lu <[email protected]> wrote: > > > > > > > > On Thu, May 7, 2026 at 3:00 PM Hongtao Liu <[email protected]> wrote: > > > > > > > > > > On Thu, May 7, 2026 at 2:04 PM H.J. Lu <[email protected]> wrote: > > > > > > > > > > > > On Wed, May 6, 2026 at 10:17 PM Hongtao Liu <[email protected]> > > > > > > wrote: > > > > > > > > > > > > > > On Mon, May 4, 2026 at 9:08 PM H.J. Lu <[email protected]> > > > > > > > wrote: > > > > > > > > > > > > > > > > On Mon, May 4, 2026 at 8:36 PM Liu, Hongtao > > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > > > > > Could you rebase the patch to latest trunk, I failed to git > > > > > > > > > am your patch. > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > >- /* NB: CONST_VECTOR load is generated and handled in > > > > > > > > > >x86_cse. */ > > > > > > > > > >- if (load > > > > > > > > > >- && !CONST_VECTOR_P (src) > > > > > > > > > >- && load->kind == X86_CSE_VEC_DUP) > > > > > > > > > >+ if (load && load->kind == X86_CSE_VEC_DUP) > > > > > > > > > > > > > > > > > > It looks like the base is without your former patch[1], but > > > > > > > > > the patch should be incremental to that. > > > > > > > > > > > > > > > > Here is the rebased patch. > > > > > > > > > > > > > > > > > [1] > > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2026-April/715240.html > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: H.J. Lu <[email protected]> > > > > > > > > > > Sent: Thursday, April 30, 2026 10:07 PM > > > > > > > > > > To: GCC Patches <[email protected]>; Uros Bizjak > > > > > > > > > > <[email protected]>; Liu, Hongtao <[email protected]> > > > > > > > > > > Subject: [PATCH] x86_cse: Add X86_CSE_CONST_VECTOR > > > > > > > > > > > > > > > > > > > > Add X86_CSE_CONST_VECTOR for native CONST_VECTOR: > > > > > > > > > > > > > > > > > > > > (insn 25 23 234 4 (set (reg:V16QI 135) > > > > > > > > > > (const_vector:V16QI [ > > > > > > > > > > (const_int -1 [0xffffffffffffffff]) > > > > > > > > > > repeated x16 > > > > > > > > > > ])) "bar-2.c":10:16 discrim 67584 2453 > > > > > > > > > > {movv16qi_internal} > > > > > > > > > > (nil)) > > > > > > > > > > > > > > > > > > > > and constant integer load: > > > > > > > > > > > > > > > > > > > > (insn 280 8 279 2 (set (subreg:HI (reg:V2QI 172) 0) > > > > > > > > > > (const_int -1 [0xffffffffffffffff])) -1 > > > > > > > > > > (nil)) > > > > > > > > > > ... > > > > > > > > > > (insn 110 39 194 9 (set (reg:V2QI 147) > > > > > > > > > > (reg:V2QI 172)) 2089 {*movv2qi_internal} > > > > > > > > > > (expr_list:REG_EQUAL (const_vector:V2QI [ > > > > > > > > > > (const_int -1 [0xffffffffffffffff]) > > > > > > > > > > repeated x2 > > > > > > > > > > ]) > > > > > > > > > > (nil))) > > > > > > > > > > > > > > > > > > > > converted from > > > > > > > > > > > > > > > > > > > > (insn 111 87 121 18 (set (reg:V2QI 147) > > > > > > > > > > (mem/u/c:V2QI (symbol_ref/u:DI ("*.LC0") [flags > > > > > > > > > > 0x2]) [0 S2 > > > > > > > > > > A16])) 2089 {*movv2qi_internal} > > > > > > > > > > (expr_list:REG_EQUAL (const_vector:V2QI [ > > > > > > > > > > (const_int -1 [0xffffffffffffffff]) > > > > > > > > > > repeated x2]) > > > > > > > > > > (nil))) > > > > > > > > > > > > > > > > > > > > Keep redundant constant integer load when crossing a > > > > > > > > > > function call since it is > > > > > > > > > > faster than save and restore an integer register. > > > > > > > > > > > > > > > > > > > > Convert CONST_VECTOR load no larger than integer register > > > > > > > > > > to constant > > > > > > > > > > integer load even if there is no redundant CONST_VECTOR > > > > > > > > > > load. > > > > > > > > > > > > > > > > > > > > Tested on Linux/x86-64 and Linux/i686. > > > > > > > > > > > > > > > > > > > > gcc/ > > > > > > > > > > > > > > > > > > > > PR target/125100 > > > > > > > > > > * config/i386/i386-features.cc (x86_cse_kind): Add > > > > > > > > > > X86_CSE_CONST_VECTOR. > > > > > > > > > > (redundant_pattern): Add dest_mode. > > > > > > > > > > (ix86_place_single_vector_set): Handle X86_CSE_CONST_VECTOR. > > > > > > > > > > Generate SUBREG for constant integer source. > > > > > > > > > > (ix86_broadcast_inner): Add an INSN argument. Check > > > > > > > > > > REG_EQUAL notes for > > > > > > > > > > CONST_VECTOR. Set load kind to X86_CSE_CONST_VECTOR for > > > > > > > > > > native and > > > > > > > > > > converted CONST_VECTORs. Return CONST_VECTOR if it can be > > > > > > > > > > converted to > > > > > > > > > > constant integer load. > > > > > > > > > > (pass_x86_cse::candidate_vector_p): Add an INSN argument > > > > > > > > > > and pass the > > > > > > > > > > insn to ix86_broadcast_inner. > > > > > > > > > > (pass_x86_cse::x86_cse): Add a basic block bitmap for calls. > > > > > > > > > > Pass the insn to candidate_vector_p. Handle > > > > > > > > > > X86_CSE_CONST_VECTOR. > > > > > > > > > > Set dest_mode. Keep redundant constant integer load when > > > > > > > > > > crossing a > > > > > > > > > > function call. Convert CONST_VECTOR load no larger than > > > > > > > > > > integer register to > > > > > > > > > > constant integer load even if there is no redundant > > > > > > > > > > CONST_VECTOR load. > > > > > > > > > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > > > > > > > > > PR target/125100 > > > > > > > > > > * gcc.target/i386/pr125100-1.c: New test. > > > > > > > > > > * gcc.target/i386/pr125100-2.c: Likewise. > > > > > > > > > > * gcc.target/i386/pr125100-3.c: Likewise. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > H.J. > > > > > > > > > > > > > > > > > > > > > > > >- else if (CONST_VECTOR_P (op)) > > > > > > > >+ else > > > > > > > > { > > > > > > > >- rtx first = XVECEXP (op, 0, 0); > > > > > > > >- for (int i = 1; i < nunits; ++i) > > > > > > > >+ rtx equal; > > > > > > > >+ bool int_load_p = false; > > > > > > > >+ if (CONST_VECTOR_P (op)) > > > > > > > > { > > > > > > > >- rtx tmp = XVECEXP (op, 0, i); > > > > > > > >- /* Vector duplicate value. */ > > > > > > > >- if (!rtx_equal_p (tmp, first)) > > > > > > > >- return nullptr; > > > > > > > >+ /* CONST_VECTOR is supported natively. */ > > > > > > > >+ *kind_p = X86_CSE_CONST_VECTOR; > > > > > > > >+ int_load_p = GET_MODE_SIZE (mode) <= UNITS_PER_WORD; > > > > > > > >+ equal = op; > > > > > > > > } > > > > > > > >- /* Use the inner mode to handle > > > > > > > >- (const_vector:V2QI [(const_int 0 [0]) repeated x2]) > > > > > > > >- */ > > > > > > > >- *scalar_mode_p = GET_MODE_INNER (mode); > > > > > > > >- *insn_p = nullptr; > > > > > > > >- return first; > > > > > > > >+ else > > > > > > > >+ { > > > > > > > >+ /* Check CONST_VECTOR load which can be converted to > > > > > > > >constant > > > > > > > >+ integer load. */ > > > > > > > >+ equal = find_reg_equal_equiv_note (*insn_p); > > > > > > > > > > > > > > Could you funnel non-CONST_VECTOR cases through REG_EQUAL, then > > > > > > > run > > > > > > > one shared CONST_VECTOR path for int_load_p or const vec > > > > > > > duplicate. > > > > > > > Also add some comments to ix86_broadcast_inner that the > > > > > > > const_vector > > > > > > > will be returned if GET_MODE_SIZE (mode) <= UNITS_PER_WORD. > > > > > > > > > > > > Changed in the v2 patch to check REG_EQUAL note first. > > > > > > > > > > > > > ... > > > > > > > > > > > > > > >- if (!set && !CALL_P (insn)) > > > > > > > >- continue; > > > > > > > >+ if (!set) > > > > > > > >+ { > > > > > > > >+ if (CALL_P (insn)) > > > > > > > >+ bitmap_set_bit (call_bbs, BLOCK_FOR_INSN > > > > > > > >(insn)->index); > > > > > > > > > > > > > > There could be call_p (insn) but still a single set? and that > > > > > > > call_insn will be missed by upper code. > > > > > > > > > > > > Changed to > > > > > > > > > > > > bool call_p = CALL_P (insn); > > > > > > rtx set = single_set (insn); > > > > > > if (!set && !call_p) > > > > > > continue; > > > > > > > > > > > > tlsdesc_val = nullptr; > > > > > > > > > > > > attr_tls64 tls64 = get_attr_tls64 (insn); > > > > > > > > > > > > /* NB: TLS calls preserve all registers. */ > > > > > > if (call_p && tls64 == TLS64_NONE) > > > > > > bitmap_set_bit (call_bbs, BLOCK_FOR_INSN (insn)->index); > > > > > > > > > > > > > >+ else > > > > > > > >+ continue; > > > > > > > >+ } > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > >+ bool keep_redundant_load = false; > > > > > > > > > > > > > > When load->count == 1, I think we also want replace const_vector > > > > > > > with > > > > > > > integer load, so better with > > > > > > > keep_redundant_load = load->count == 1; > > > > > > > > > > > > The variable name may be confusing since 1 constant integer load > > > > > > is also converted: > > > > > > > > > > For load->count == 1, it converted with vector mode, not directly > > > > > integer load from imm > > > > > > > > > > Replace: > > > > > > > > > > (insn 58 40 64 9 (set (reg:V2QI 120) > > > > > (mem/u/c:V2QI (symbol_ref/u:SI ("*.LC1") [flags 0x2]) [0 S2 > > > > > A16])) 2081 {*movv2qi_internal} > > > > > (expr_list:REG_EQUAL (const_vector:V2QI [ > > > > > (const_int 1 [0x1]) repeated x2 > > > > > ]) > > > > > (nil))) > > > > > > > > > > with: > > > > > > > > > > (insn 58 40 64 9 (set (reg:V2QI 120) > > > > > (reg:V2QI 131)) 2081 {*movv2qi_internal} > > > > > (expr_list:REG_EQUAL (const_vector:V2QI [ > > > > > (const_int 1 [0x1]) repeated x2 > > > > > ]) > > > > > (nil))) > > > > > > > > > > deferring rescan insn with uid = 58. > > > > > > > > > > deferring rescan insn with uid = 141. > > > > > > > > > > Place: > > > > > > > > > > (insn 141 40 58 9 (set (subreg:HI (reg:V2QI 131) 0) > > > > > (const_int 257 [0x101])) -1 > > > > > (nil)) > > > > > > > > > > after: > > > > > > > > > > (note 40 39 141 9 [bb 9] NOTE_INSN_BASIC_BLOCK) > > > > > > > > > > > > > > > Why shouldn't we directly replace it with > > > > > (insn 58 40 64 9 (set (subreg: HI(reg:V2QI 120) 0) > > > > > (const_int 257 [0x101])) > > > > > > > > This is related to > > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125102 > > > > > > > > We avoid generating imm16 stores in the x86_cse pass > > > > so that we don't need to deal with TARGET_LCP_STALL. > > > > > > But we still have imm16 stores when keep_redundant_load == true? > > > > It is constant imm6 store vs V2QI vector store: > > For testcase > ---------------------------- testcase start---------------- > struct desc { > char c1; > char c2; > }; > > extern int bar (); > extern int qqq(); > int > foo (struct desc *list, int j) > { > int x = bar (); > if (x) > { > int x = qqq (); > list[j].c1 = 1; > list[j].c2 = 1; > } > > return x; > } > ----------------------- testcase end--------------------------- > > > With your patch, the dump is as below > > > ---------------------------dump start----------------------------- > Replace: > > (insn 13 12 14 3 (set (reg:V2QI 104) > (mem/u/c:V2QI (symbol_ref/u:SI ("*.LC0") [flags 0x2]) [0 S2 > A16])) "test2.c":15:18 2081 {*movv2qi_internal} > (expr_list:REG_EQUAL (const_vector:V2QI [ > (const_int 1 [0x1]) repeated x2 > ]) > (nil))) > > with: > > (insn 13 12 14 3 (set (subreg:HI (reg:V2QI 104) 0) > (const_int 257 [0x101])) "test2.c":15:18 101 {*movhi_internal} > (expr_list:REG_EQUAL (const_vector:V2QI [ > (const_int 1 [0x1]) repeated x2 > ]) > (nil))) > -------------------dump end---------------------- > > There's no imm16 store, I assume load_count == 1 should be also > similar, just directly replace the mem with const_int, instead of > > ------------------------------------dump start------------------------------ > Replace: > > (insn 7 4 8 2 (set (reg:V2QI 103) > (mem/u/c:V2QI (symbol_ref/u:SI ("*.LC0") [flags 0x2]) [0 S2 > A16])) "test2.c":11:18 2081 {*movv2qi_internal} > (expr_list:REG_EQUAL (const_vector:V2QI [ > (const_int 1 [0x1]) repeated x2 > ]) > (nil))) > > with: > > (insn 7 4 8 2 (set (reg:V2QI 103) > (reg:V2QI 104)) "test2.c":11:18 2081 {*movv2qi_internal} > (expr_list:REG_EQUAL (const_vector:V2QI [ > (const_int 1 [0x1]) repeated x2 > ]) > (nil))) > > deferring rescan insn with uid = 7. > ;; 1 loops found > ;; > ;; Loop 0 > ;; header 0, latch 1 > ;; depth 0, outer -1 > ;; nodes: 0 1 2 > ;; 2 succs { 1 } > deferring rescan insn with uid = 11. > > Place: > > (insn 11 5 2 2 (set (subreg:HI (reg:V2QI 104) 0) > (const_int 257 [0x101])) -1 > (nil)) > > after: > -----------------------------------dump end --------------------------------- > > > Or am I misunderstanding something?
Dump looks fine. The x86_cse pass won't generate movw imm16, mem16 which will be splitted to movl imm16, reg32 movw reg16, mem16 if TARGET_LCP_STALL is true. After https://gcc.gnu.org/g:f7a08d53ab3074d1661d74a438932da6ed4879cc movl imm16, reg32 movw reg16, mem16 will be optimized to movw imm16, mem16 if TARGET_LCP_STALL is false. The final assembly outputs look reasonable. -- H.J.
