[RS6000] Fix PR53040
I'm applying this to mainline as obvious.  If we save fprs inline for
any reason, we must also restore them inline.  (The same goes for
gprs, and that is handled later in this function.)  Bootstrapped and
regression tested powerpc-linux.

	PR target/53040
	* config/rs6000/rs6000.c (rs6000_savres_strategy): When using
	static chain, set REST_INLINE_FPRS too.

diff -urp gcc-alan4/gcc/config/rs6000/rs6000.c gcc-alan5a/gcc/config/rs6000/rs6000.c
--- gcc-alan4/gcc/config/rs6000/rs6000.c	2012-04-19 18:33:20.171843404 +0930
+++ gcc-alan5a/gcc/config/rs6000/rs6000.c	2012-04-19 21:24:46.643632761 +0930
@@ -17456,8 +17456,9 @@ rs6000_savres_strategy (rs6000_stack_t *
      static chain is rarely used anyway.  FPRs are saved w.r.t the stack
      pointer on Darwin.  */
   if (using_static_chain_p)
-    strategy |= (DEFAULT_ABI == ABI_DARWIN ? 0 : SAVE_INLINE_FPRS)
-		| SAVE_INLINE_GPRS;
+    strategy |= ((DEFAULT_ABI == ABI_DARWIN
+		  ? 0 : SAVE_INLINE_FPRS | REST_INLINE_FPRS)
+		 | SAVE_INLINE_GPRS);

   /* If we are going to use store multiple, then don't even bother
      with the out-of-line routines, since the store-multiple

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] Fix PR53038, cfa_restore out of order
		       );
+      restore_saved_cr (cr_save_reg, using_mfcr_multiple, exit_func);

   /* If this is V.4, unwind the stack pointer after all of the loads
      have been done.  */
@@ -20877,11 +20940,8 @@
       int i;
       rtx sym;

-      if ((DEFAULT_ABI == ABI_V4 || flag_shrink_wrap)
-	  && lr)
-	cfa_restores = alloc_reg_note (REG_CFA_RESTORE,
-				       gen_rtx_REG (Pmode, LR_REGNO),
-				       cfa_restores);
+      if (flag_shrink_wrap)
+	cfa_restores = add_crlr_cfa_restore (info, cfa_restores);

       sym = rs6000_savres_routine_sym (info,
				       /*savep=*/false,
@@ -20902,7 +20962,7 @@
	  reg = gen_rtx_REG (DFmode, info->first_fp_reg_save + i);
	  RTVEC_ELT (p, i + 4) = gen_rtx_SET (VOIDmode, reg, mem);
-	  if (DEFAULT_ABI == ABI_V4 || flag_shrink_wrap)
+	  if (flag_shrink_wrap)
	    cfa_restores = alloc_reg_note (REG_CFA_RESTORE, reg, cfa_restores);
	}

-- 
Alan Modra
Australia Development Lab, IBM
PowerPC prologue and epilogue 6
This patch adds out-of-line vector saves and restores.  To do this I
made some infrastructure changes to various functions like
rs6000_emit_savres_rtx that currently take boolean parameters (savep,
gpr, and lr).  Rather than add yet another boolean to specify vector
regs, I chose to lump them all together in a bitmask.  This made the
patch a little larger but overall is a better interface, I think.

I also revert a change I made in
http://gcc.gnu.org/ml/gcc-patches/2012-04/msg01014.html to always use
r11 as a frame reg whenever abiv4 emits out-of-line saves.  Code
quality in functions with small frames is better without that
particular change.  This however meant some changes are required later
when setting up pointer regs for gpr and fpr out-of-line saves.

What else is here?  Improved register selection when saving vrsave in
the prologue and when restoring cr in the epilogue, allowing better
scheduling.  A fix to rs6000_output_function_prologue to output the
correct .extern for ELF, then deciding we don't need such things
anyway.  And various other little code cleanups.  Bootstrapped and
regression tested powerpc-linux.

gcc/
	* config/rs6000/rs6000.c (SAVE_INLINE_VRS, REST_INLINE_VRS,
	V_SAVE_INLINE, SAVRES_LR, SAVRES_SAVE, SAVRES_REG, SAVRES_GPR,
	SAVRES_FPR, SAVRES_VR): Define.
	(no_global_regs_above): Delete.
	(no_global_regs): New function.
	(rs6000_savres_strategy): Handle vector regs.  Use proper
	lr_save_p value for load multiple test.
	(savres_routine_syms): Increase size.
	(rs6000_savres_routine_name, rs6000_savres_routine_sym,
	ptr_regno_for_savres, rs6000_emit_savres_rtx): Pass in int
	selector rather than a number of boolean flags.  Update all
	callers.
	(rs6000_savres_routine_name): Generate vector save/restore names.
	(rs6000_savres_routine_sym): Handle vector regs.  Delete forward
	decl.
	(ptr_regno_for_savres, rs6000_emit_savres_rtx): Likewise.
	(rs6000_emit_prologue): Delete saving_FPRs_inline,
	saving_GPRs_inline and using_store_multiple.  Expand uses.
	Don't always use r11 as frame reg when needed for out-of-line
	saves.  Set up initial offset for out-of-line vector saves when
	buying stack frame.  Handle pointer reg setup for out-of-line fp
	save.  Emit call to out-of-line vector save function.  Choose r11
	or r12 for vrsave reg when available for better scheduling.
	(rs6000_output_function_prologue): Don't emit .extern for ELF.
	(rs6000_emit_epilogue): Choose a better frame reg when restoring
	from back-chain to suit out-of-line vector restore functions.
	Emit call to out-of-line vector restore function.  Adjust
	register used for cr restore.  Tweak pointer register setup for
	gpr restore.
	* config/rs6000/rs6000.h (FIRST_SAVED_GP_REGNO): Take into
	account FIXED_R13.
	* config/rs6000/sysv4.h (FP_SAVE_INLINE, GP_SAVE_INLINE):
	Simplify.
	(V_SAVE_INLINE): Define.
	* config/rs6000/altivec.md (save_vregs_*, restore_vregs_*): New
	insns.
libgcc/
	* config/rs6000/crtsavevr.S: New file.
	* config/rs6000/crtrestvr.S: New file.
	* config/rs6000/t-savresfgpr: Build the above.
	* config/rs6000/t-netbsd: Likewise.
diff -urpN gcc-alan5/gcc/config/rs6000/rs6000.c gcc-alan6/gcc/config/rs6000/rs6000.c
--- gcc-alan5/gcc/config/rs6000/rs6000.c	2012-04-19 20:55:02.214727782 +0930
+++ gcc-alan6/gcc/config/rs6000/rs6000.c	2012-04-21 15:47:44.193462791 +0930
@@ -937,7 +937,6 @@ static bool legitimate_small_data_p (enu
 static bool legitimate_lo_sum_address_p (enum machine_mode, rtx, int);
 static struct machine_function * rs6000_init_machine_status (void);
 static bool rs6000_assemble_integer (rtx, unsigned int, int);
-static bool no_global_regs_above (int, bool);
 #if defined (HAVE_GAS_HIDDEN) && !TARGET_MACHO
 static void rs6000_assemble_visibility (tree, int);
 #endif
@@ -950,7 +949,6 @@ static tree rs6000_handle_struct_attribu
 static void rs6000_eliminate_indexed_memrefs (rtx operands[2]);
 static const char *rs6000_mangle_type (const_tree);
 static void rs6000_set_default_type_attributes (tree);
-static rtx rs6000_savres_routine_sym (rs6000_stack_t *, bool, bool, bool);
 static bool rs6000_reg_live_or_pic_offset_p (int);
 static tree rs6000_builtin_vectorized_libmass (tree, tree, tree);
 static tree rs6000_builtin_vectorized_function (tree, tree, tree);
@@ -17405,6 +17403,21 @@ is_altivec_return_reg (rtx reg, void *xy
 }

+/* Look for user-defined global regs in the range FIRST to LAST-1.
+   We should not restore these, and so cannot use lmw or out-of-line
+   restore functions if there are any.  We also can't save them
+   (well, emit frame notes for them), because frame unwinding during
+   exception handling will restore saved registers.  */
+
+static bool
+global_regs_p (unsigned first, unsigned last)
+{
+  while (first < last)
+    if
Re: PowerPC prologue and epilogue 6
(void)
+{
+  char b[10];
+  void ns_0 (void)
+  {
+    char a[33];
+    __asm __volatile ("#%0 %1" : "=m" (a), "=m" (b));
+  }
+  ns_0 ();
+}
+
+int main (void)
+{
+  INIT_REGS;
+  USE_ALL_CR;
+#ifdef __ALTIVEC__
+  USE_ALL_VR;
+#ifndef __NO_FPRS__
+  USE_ALL_FPR;
+#endif
+#endif
+  USE_ALL_GPR;
+#ifdef __ALTIVEC__
+#ifndef __NO_FPRS__
+  b_all ();
+  VERIFY_REGS;
+  b_cvfr ();
+  VERIFY_REGS;
+  b_vfr ();
+  VERIFY_REGS;
+  b_cvf ();
+  VERIFY_REGS;
+  b_vf ();
+  VERIFY_REGS;
+#endif
+  b_cvr ();
+  VERIFY_REGS;
+  b_vr ();
+  VERIFY_REGS;
+  b_cv ();
+  VERIFY_REGS;
+  b_v ();
+  VERIFY_REGS;
+#endif
+#ifndef __NO_FPRS__
+  b_cfr ();
+  VERIFY_REGS;
+  b_fr ();
+  VERIFY_REGS;
+  b_cf ();
+  VERIFY_REGS;
+  b_f ();
+  VERIFY_REGS;
+#endif
+  b_cr ();
+  VERIFY_REGS;
+  b_r ();
+  VERIFY_REGS;
+  b_c ();
+  VERIFY_REGS;
+  b_0 ();
+  VERIFY_REGS;
+#ifdef __ALTIVEC__
+#ifndef __NO_FPRS__
+  s_all ();
+  VERIFY_REGS;
+  s_cvfr ();
+  VERIFY_REGS;
+  s_vfr ();
+  VERIFY_REGS;
+  s_cvf ();
+  VERIFY_REGS;
+  s_vf ();
+  VERIFY_REGS;
+#endif
+  s_cvr ();
+  VERIFY_REGS;
+  s_vr ();
+  VERIFY_REGS;
+  s_cv ();
+  VERIFY_REGS;
+  s_v ();
+  VERIFY_REGS;
+#endif
+#ifndef __NO_FPRS__
+  s_cfr ();
+  VERIFY_REGS;
+  s_fr ();
+  VERIFY_REGS;
+  s_cf ();
+  VERIFY_REGS;
+  s_f ();
+  VERIFY_REGS;
+#endif
+  s_cr ();
+  VERIFY_REGS;
+  s_r ();
+  VERIFY_REGS;
+  s_c ();
+  VERIFY_REGS;
+  s_0 ();
+  VERIFY_REGS;
+#ifdef __ALTIVEC__
+#ifndef __NO_FPRS__
+  wb_all ();
+  VERIFY_REGS;
+  wb_cvfr ();
+  VERIFY_REGS;
+  wb_vfr ();
+  VERIFY_REGS;
+  wb_cvf ();
+  VERIFY_REGS;
+  wb_vf ();
+  VERIFY_REGS;
+#endif
+  wb_cvr ();
+  VERIFY_REGS;
+  wb_vr ();
+  VERIFY_REGS;
+  wb_cv ();
+  VERIFY_REGS;
+  wb_v ();
+  VERIFY_REGS;
+#endif
+#ifndef __NO_FPRS__
+  wb_cfr ();
+  VERIFY_REGS;
+  wb_fr ();
+  VERIFY_REGS;
+  wb_cf ();
+  VERIFY_REGS;
+  wb_f ();
+  VERIFY_REGS;
+#endif
+  wb_cr ();
+  VERIFY_REGS;
+  wb_r ();
+  VERIFY_REGS;
+  wb_c ();
+  VERIFY_REGS;
+  wb_0 ();
+  VERIFY_REGS;
+#ifdef __ALTIVEC__
+#ifndef __NO_FPRS__
+  ws_all ();
+  VERIFY_REGS;
+  ws_cvfr ();
+  VERIFY_REGS;
+  ws_vfr ();
+  VERIFY_REGS;
+  ws_cvf ();
+  VERIFY_REGS;
+  ws_vf ();
+  VERIFY_REGS;
+#endif
+  ws_cvr ();
+  VERIFY_REGS;
+  ws_vr ();
+  VERIFY_REGS;
+  ws_cv ();
+  VERIFY_REGS;
+  ws_v ();
+  VERIFY_REGS;
+#endif
+#ifndef __NO_FPRS__
+  ws_cfr ();
+  VERIFY_REGS;
+  ws_fr ();
+  VERIFY_REGS;
+  ws_cf ();
+  VERIFY_REGS;
+  ws_f ();
+  VERIFY_REGS;
+#endif
+  ws_cr ();
+  VERIFY_REGS;
+  ws_r ();
+  VERIFY_REGS;
+  ws_c ();
+  VERIFY_REGS;
+  ws_0 ();
+  VERIFY_REGS;
+  return 0;
+}

-- 
Alan Modra
Australia Development Lab, IBM
Re: [RS6000] Fix PR53038, cfa_restore out of order
On Wed, Apr 25, 2012 at 02:00:48PM -0700, Richard Henderson wrote:
> On 04/20/12 18:58, Alan Modra wrote:
>> +      add_reg_note (insn, REG_FRAME_RELATED_EXPR, set);
>> +      RTX_FRAME_RELATED_P (insn) = 1;
>
> You're probably better off using REG_CFA_REGISTER here, rather than
> making dwarf2cfi.c guess what you meant.  The guessing is rather
> tuned to prologues...

It worked!  I take your point though, and thanks for reviewing the
patch!  The new version is just s/FRAME_RELATED_EXPR/CFA_REGISTER/ but
I'll post the entire patch again since my other prologue/epilogue
patches mean the original pr53038 fix no longer applies cleanly.
Bootstrapped and regression tested powerpc-linux.

	PR target/53038
	* config/rs6000/rs6000.c (load_lr_save, restore_saved_lr,
	load_cr_save, add_crlr_cfa_restore): New functions.
	(rs6000_restore_saved_cr): Rename to..
	(restore_saved_cr): ..this.  Add cfa_restore notes for cr.
	(rs6000_emit_epilogue): Use new functions.  Adjust condition
	for emitting lr and cr cfa_restore.  Emit cfa_restores for fp
	regs when using out-of-line restore only when shrink wrapping.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 186854)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -952,7 +952,6 @@
 static bool rs6000_reg_live_or_pic_offset_p (int);
 static tree rs6000_builtin_vectorized_libmass (tree, tree, tree);
 static tree rs6000_builtin_vectorized_function (tree, tree, tree);
-static void rs6000_restore_saved_cr (rtx, int);
 static bool rs6000_output_addr_const_extra (FILE *, rtx);
 static void rs6000_output_function_prologue (FILE *, HOST_WIDE_INT);
 static void rs6000_output_function_epilogue (FILE *, HOST_WIDE_INT);
@@ -20276,10 +20275,37 @@
    we restore after the pop when possible.  */
 #define ALWAYS_RESTORE_ALTIVEC_BEFORE_POP 0

+/* Restoring cr is a two step process: loading a reg from the frame
+   save, then moving the reg to cr.  For ABI_V4 we must let the
+   unwinder know that the stack location is no longer valid at or
+   before the stack deallocation, but we can't emit a cfa_restore for
+   cr at the stack deallocation like we do for other registers.
+   The trouble is that it is possible for the move to cr to be
+   scheduled after the stack deallocation.  So say exactly where cr
+   is located on each of the two insns.  */
+
+static rtx
+load_cr_save (int regno, rtx frame_reg_rtx, int offset, bool exit_func)
+{
+  rtx mem = gen_frame_mem_offset (SImode, frame_reg_rtx, offset);
+  rtx reg = gen_rtx_REG (SImode, regno);
+  rtx insn = emit_move_insn (reg, mem);
+
+  if (!exit_func && DEFAULT_ABI == ABI_V4)
+    {
+      rtx cr = gen_rtx_REG (SImode, CR2_REGNO);
+      rtx set = gen_rtx_SET (VOIDmode, reg, cr);
+
+      add_reg_note (insn, REG_CFA_REGISTER, set);
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
+  return reg;
+}
+
 /* Reload CR from REG.  */

 static void
-rs6000_restore_saved_cr (rtx reg, int using_mfcr_multiple)
+restore_saved_cr (rtx reg, int using_mfcr_multiple, bool exit_func)
 {
   int count = 0;
   int i;
@@ -20317,13 +20343,61 @@
   else
     for (i = 0; i < 8; i++)
       if (df_regs_ever_live_p (CR0_REGNO+i)
	  && ! call_used_regs[CR0_REGNO+i])
-	{
-	  emit_insn (gen_movsi_to_cr_one (gen_rtx_REG (CCmode,
-						       CR0_REGNO+i),
-					  reg));
-	}
+	emit_insn (gen_movsi_to_cr_one (gen_rtx_REG (CCmode, CR0_REGNO+i),
+					reg));
+
+  if (!exit_func && (DEFAULT_ABI == ABI_V4 || flag_shrink_wrap))
+    {
+      rtx insn = get_last_insn ();
+      rtx cr = gen_rtx_REG (SImode, CR2_REGNO);
+
+      add_reg_note (insn, REG_CFA_RESTORE, cr);
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
 }

+/* Like cr, the move to lr instruction can be scheduled after the
+   stack deallocation, but unlike cr, its stack frame save is still
+   valid.  So we only need to emit the cfa_restore on the correct
+   instruction.  */
+
+static void
+load_lr_save (int regno, rtx frame_reg_rtx, int offset)
+{
+  rtx mem = gen_frame_mem_offset (Pmode, frame_reg_rtx, offset);
+  rtx reg = gen_rtx_REG (Pmode, regno);
+
+  emit_move_insn (reg, mem);
+}
+
+static void
+restore_saved_lr (int regno, bool exit_func)
+{
+  rtx reg = gen_rtx_REG (Pmode, regno);
+  rtx lr = gen_rtx_REG (Pmode, LR_REGNO);
+  rtx insn = emit_move_insn (lr, reg);
+
+  if (!exit_func && flag_shrink_wrap)
+    {
+      add_reg_note (insn, REG_CFA_RESTORE, lr);
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
+}
+
+static rtx
+add_crlr_cfa_restore (const rs6000_stack_t *info, rtx cfa_restores)
+{
+  if (info->cr_save_p)
+    cfa_restores = alloc_reg_note (REG_CFA_RESTORE,
+				   gen_rtx_REG (SImode, CR2_REGNO),
+				   cfa_restores);
+  if (info->lr_save_p)
+    cfa_restores = alloc_reg_note
Re: rs6000 toc reference rtl again
	(match_dup 0)
	(lo_sum:P (match_dup 0) (match_dup 1)))])
+
 ;; Elf specific ways of loading addresses for non-PIC code.
 ;; The output of this could be r0, but we make a very strong
 ;; preference for a base register because it will usually
@@ -12251,22 +12274,6 @@
   "@
    {cal|la} %0,%2@l(%1)
    {ai|addic} %0,%1,%K2")
-
-;; Largetoc support
-(define_insn "largetoc_high"
-  [(set (match_operand:DI 0 "gpc_reg_operand" "=b")
-	(const:DI
-	  (plus:DI (match_operand:DI 1 "gpc_reg_operand" "b")
-		   (high:DI (match_operand:DI 2 "" "")))))]
-  "TARGET_ELF && TARGET_CMODEL != CMODEL_SMALL"
-  "{cau|addis} %0,%1,%2@ha")
-
-(define_insn "largetoc_low"
-  [(set (match_operand:DI 0 "gpc_reg_operand" "=r")
-	(lo_sum:DI (match_operand:DI 1 "gpc_reg_operand" "b")
-		   (match_operand:DI 2 "" "")))]
-  "TARGET_ELF && TARGET_CMODEL != CMODEL_SMALL"
-  "{cal %0,%2@l(%1)|addi %0,%1,%2@l}")

 ;; Call and call_value insns
 (define_expand "call"
Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 187010)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -38,7 +38,7 @@ extern bool macho_lo_sum_memory_operand (rtx, enum
 extern int num_insns_constant (rtx, enum machine_mode);
 extern int num_insns_constant_wide (HOST_WIDE_INT);
 extern int small_data_operand (rtx, enum machine_mode);
-extern bool toc_relative_expr_p (rtx);
+extern bool toc_relative_expr_p (const_rtx, bool);
 extern bool invalid_e500_subreg (rtx, enum machine_mode);
 extern void validate_condition_mode (enum rtx_code, enum machine_mode);
 extern bool legitimate_constant_pool_address_p (const_rtx, enum machine_mode,

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] Fix PR53271 powerpc ports
(p, i + 4) = gen_rtx_SET (VOIDmode, reg, mem);
+	  RTVEC_ELT (p, i + 4)
+	    = gen_frame_load (reg, sp_reg_rtx, info->fp_save_offset + 8 * i);
	  if (flag_shrink_wrap)
	    cfa_restores = alloc_reg_note (REG_CFA_RESTORE, reg, cfa_restores);

-- 
Alan Modra
Australia Development Lab, IBM
Re: rs6000 toc reference rtl again
On Tue, May 01, 2012 at 01:47:51PM +0930, Alan Modra wrote:
> This revision splits the medium/large code model toc reference
> after reload.

Ping http://gcc.gnu.org/ml/gcc-patches/2012-05/msg6.html

-- 
Alan Modra
Australia Development Lab, IBM
rs6000.c forward declaration cleanup
) ATTRIBUTE_UNUSED;
 static void
 rs6000_output_dwarf_dtprel (FILE *file, int size, rtx x)
 {
@@ -6419,7 +6205,7 @@
    32-bit DImode, TImode, TFmode, TDmode), indexed addressing cannot be used
    because adjacent memory cells are accessed by adding word-sized offsets
    during assembly output.  */
-bool
+static bool
 rs6000_legitimate_address_p (enum machine_mode mode, rtx x, bool reg_ok_strict)
 {
   bool reg_offset_p = reg_offset_addressing_ok_p (mode);
@@ -9207,7 +8993,7 @@

 /* Implement va_arg.  */
-tree
+static tree
 rs6000_gimplify_va_arg (tree valist, tree type, gimple_seq *pre_p,
			gimple_seq *post_p)
 {
@@ -24740,6 +24526,7 @@
    function descriptor name.  This saves a lot of overriding code to
    read the prefixes.  */

+static void rs6000_elf_encode_section_info (tree, rtx, int) ATTRIBUTE_UNUSED;
 static void
 rs6000_elf_encode_section_info (tree decl, rtx rtl, int first)
 {
@@ -25252,6 +25039,7 @@
    This differs from default_named_section_asm_out_constructor in
    that we have special handling for -mrelocatable.  */

+static void rs6000_elf_asm_out_constructor (rtx, int) ATTRIBUTE_UNUSED;
 static void
 rs6000_elf_asm_out_constructor (rtx symbol, int priority)
 {
@@ -25281,6 +25069,7 @@
   assemble_integer (symbol, POINTER_SIZE / BITS_PER_UNIT, POINTER_SIZE, 1);
 }

+static void rs6000_elf_asm_out_destructor (rtx, int) ATTRIBUTE_UNUSED;
 static void
 rs6000_elf_asm_out_destructor (rtx symbol, int priority)
 {
@@ -25386,6 +25175,7 @@
   ASM_OUTPUT_LABEL (file, name);
 }

+static void rs6000_elf_file_end (void) ATTRIBUTE_UNUSED;
 static void
 rs6000_elf_file_end (void)
 {
@@ -26931,7 +26721,7 @@
    On RS/6000 an integer value is in r3 and a floating-point value is
    in fp1, unless -msoft-float.  */
-rtx
+static rtx
 rs6000_function_value (const_tree valtype,
		       const_tree fn_decl_or_type ATTRIBUTE_UNUSED,
		       bool outgoing ATTRIBUTE_UNUSED)
@@ -27078,7 +26868,7 @@
    We need r30 if -mminimal-toc was specified, and there are constant
    pool references.  */
-bool
+static bool
 rs6000_can_eliminate (const int from, const int to)
 {
   return (from == ARG_POINTER_REGNUM && to == STACK_POINTER_REGNUM
@@ -28205,6 +27995,7 @@
 /* This function emits the simple thunk routine that is used to
    preserve the link stack on the 476 cpu.  */

+static void rs6000_code_end (void) ATTRIBUTE_UNUSED;
 static void
 rs6000_code_end (void)
 {

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] save/restore reg tidy
On Tue, May 08, 2012 at 08:02:39PM +0930, Alan Modra wrote:
> I also make use of gen_frame_store and siblings that I invented for
> generating the eh info, elsewhere in rs6000.c where doing so is
> blindingly obvious.  We could probably use them in other places too,
> but I'll leave that for later.

Like so.  The part that isn't completely obvious is removing calls to
emit_move_insn, which can transform the rtl (rs6000_emit_move).
However, we're past reload, the insns emitted are always one set
involving a hard reg and mem, and we don't want any addressing mode
substitutions going on that avoid rs6000_emit_prologue tracking of r0,
r11 and r12 usage.

This patch also fixes a couple of places that call df_regs_ever_live_p
without checking call_used_regs to test for global asm regs.  Not
serious bugs as they just result in larger stack frames.

Bootstrapped and regression tested powerpc-linux.  OK to apply?

	* config/rs6000/rs6000.c (save_reg_p): New function.
	(first_reg_to_save, first_fp_reg_to_save): Use it here.
	(first_altivec_reg_to_save, restore_saved_cr): Likewise.
	(emit_frame_save): Use gen_frame_store.
	(gen_frame_mem_offset): Correct SPE condition requiring reg+reg.
	(rs6000_emit_prologue): Use save_reg_p.  Use gen_frame_store for
	vrsave and toc.
	(rs6000_emit_epilogue): Use save_reg_p.  Use gen_frame_load for
	vrsave, toc, gp and fp restores.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 187699)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -17236,6 +17079,12 @@
 /* This page contains routines that are used to determine what the
    function prologue and epilogue code will do and write them out.  */

+static inline bool
+save_reg_p (int r)
+{
+  return !call_used_regs[r] && df_regs_ever_live_p (r);
+}
+
 /* Return the first fixed-point register that is required to be
    saved. 32 if none.  */

@@ -17246,14 +17095,16 @@
   /* Find lowest numbered live register.  */
   for (first_reg = 13; first_reg <= 31; first_reg++)
-    if (df_regs_ever_live_p (first_reg)
-	&& (! call_used_regs[first_reg]
-	    || (first_reg == RS6000_PIC_OFFSET_TABLE_REGNUM
-		&& ((DEFAULT_ABI == ABI_V4 && flag_pic != 0)
-		    || (DEFAULT_ABI == ABI_DARWIN && flag_pic)
-		    || (TARGET_TOC && TARGET_MINIMAL_TOC)))))
+    if (save_reg_p (first_reg))
       break;

+  if (first_reg > RS6000_PIC_OFFSET_TABLE_REGNUM
+      && ((DEFAULT_ABI == ABI_V4 && flag_pic != 0)
+	  || (DEFAULT_ABI == ABI_DARWIN && flag_pic)
+	  || (TARGET_TOC && TARGET_MINIMAL_TOC))
+      && df_regs_ever_live_p (RS6000_PIC_OFFSET_TABLE_REGNUM))
+    first_reg = RS6000_PIC_OFFSET_TABLE_REGNUM;
+
 #if TARGET_MACHO
   if (flag_pic
       && crtl->uses_pic_offset_table
@@ -17273,7 +17124,7 @@
   /* Find lowest numbered live register.  */
   for (first_reg = 14 + 32; first_reg <= 63; first_reg++)
-    if (df_regs_ever_live_p (first_reg))
+    if (save_reg_p (first_reg))
       break;

   return first_reg;
@@ -17299,7 +17150,7 @@
   /* Find lowest numbered live register.  */
   for (i = FIRST_ALTIVEC_REGNO + 20; i <= LAST_ALTIVEC_REGNO; ++i)
-    if (df_regs_ever_live_p (i))
+    if (save_reg_p (i))
       break;

   return i;
@@ -18995,7 +18904,7 @@
 emit_frame_save (rtx frame_reg, enum machine_mode mode,
		  unsigned int regno, int offset, HOST_WIDE_INT frame_reg_to_sp)
 {
-  rtx reg, insn, mem, addr;
+  rtx reg, insn;

   /* Some cases that need register indexed addressing.  */
   gcc_checking_assert (!((TARGET_ALTIVEC_ABI && ALTIVEC_VECTOR_MODE (mode))
@@ -19006,9 +18915,7 @@
			  && !SPE_CONST_OFFSET_OK (offset)));

   reg = gen_rtx_REG (mode, regno);
-  addr = gen_rtx_PLUS (Pmode, frame_reg, GEN_INT (offset));
-  mem = gen_frame_mem (mode, addr);
-  insn = emit_move_insn (mem, reg);
+  insn = emit_insn (gen_frame_store (reg, frame_reg, offset));
   return rs6000_frame_related (insn, frame_reg, frame_reg_to_sp,
			       NULL_RTX, NULL_RTX);
 }
@@ -19023,7 +18930,7 @@
   int_rtx = GEN_INT (offset);
-  if ((TARGET_SPE_ABI && SPE_VECTOR_MODE (mode))
+  if ((TARGET_SPE_ABI && SPE_VECTOR_MODE (mode) && !SPE_CONST_OFFSET_OK (offset))
      || (TARGET_E500_DOUBLE && mode == DFmode))
    {
      offset_rtx = gen_rtx_REG (Pmode, FIXED_SCRATCH);
@@ -19652,8 +19559,7 @@
    {
      int i;
      for (i = 0; i < 64 - info->first_fp_reg_save; i++)
-	if (df_regs_ever_live_p (info->first_fp_reg_save + i)
-	    && ! call_used_regs[info->first_fp_reg_save + i])
+	if (save_reg_p (info->first_fp_reg_save + i))
	  emit_frame_save (frame_reg_rtx,
			   (TARGET_HARD_FLOAT && TARGET_DOUBLE_FLOAT
			    ? DFmode : SFmode),
@@ -20103,7 +20009,7 @@
      && TARGET_ALTIVEC_VRSAVE
      && info->vrsave_mask != 0)
    {
-      rtx reg, mem, vrsave;
+      rtx reg
[RS6000] out-of-line save/restore conditions
			  REST_INLINE_FPRS))
    strategy |= REST_NOINLINE_FPRS_DOESNT_RESTORE_LR;
-#endif
+  if (TARGET_MACHO && !(strategy & SAVE_INLINE_FPRS))
    strategy |= SAVE_NOINLINE_FPRS_SAVES_LR;

-- 
Alan Modra
Australia Development Lab, IBM
Re: [PATCH] Add powerpc64-linux configuration options
On Thu, May 24, 2012 at 01:45:41PM -0400, Michael Meissner wrote:
> This alternative patch just disables building the 32-bit softfloat
> multilib, and removes the -mstrict-align in the powerpc64-linux
> case.  I have bootstrapped it and had no regressions.  David, which
> patch do you prefer?
>
> 2012-05-24  Michael Meissner  meiss...@linux.vnet.ibm.com
>
> 	* config/rs6000/t-linux64: Delete the 32-bit multilib that
> 	uses software floating point emulation.  No longer build the
> 	multilibs with -mstrict-align.

I like this patch.  I like it because all uses of MULTILIB_EXTRA_OPTS
are suspect.  Why?  The option is not applied to the default multilib!
That means that prior to this patch you get different libraries
depending on the compiler default.  For example, a powerpc-linux gcc
defaulting to -m32 builds the 32-bit libraries without -mstrict-align,
while a powerpc64-linux gcc defaulting to -m64 builds the 32-bit
libraries with -mstrict-align.  This is confusing to say the least,
and I recall seeing libstdc++ test failures due to the difference.
I've had MULTILIB_EXTRA_OPTS = in my t-linux64 for quite some time.

-- 
Alan Modra
Australia Development Lab, IBM
Re: PR53914, rs6000 constraints and reload queries
-	(match_operand:DI 1 "input_operand" "r,m,r,I,L,nF,d,m,d,*h,r,0,O"))]
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=Y,r,r,r,r,r,?m,?*d,?*d,r,*h,*h,?wa")
+	(match_operand:DI 1 "input_operand" "r,Y,r,I,L,nF,d,m,d,*h,r,0,O"))]
   "TARGET_POWERPC64 && (!TARGET_MFPGPR || !TARGET_HARD_FLOAT || !TARGET_FPRS)
    && (gpc_reg_operand (operands[0], DImode)
       || gpc_reg_operand (operands[1], DImode))"
   "@
+   std%U0%X0 %1,%0
+   ld%U1%X1 %0,%1
    mr %0,%1
-   ld%U1%X1 %0,%1
-   std%U0%X0 %1,%0
    li %0,%1
    lis %0,%v1
    #
+   stfd%U0%X0 %1,%0
+   lfd%U1%X1 %0,%1
    fmr %0,%1
-   lfd%U1%X1 %0,%1
-   stfd%U0%X0 %1,%0
    mf%1 %0
    mt%0 %1
    {cror 0,0,0|nop}
    xxlxor %x0,%x0,%x0"
-  [(set_attr "type" "*,load,store,*,*,*,fp,fpload,fpstore,mfjmpr,mtjmpr,*,vecsimple")
+  [(set_attr "type" "store,load,*,*,*,*,fpstore,fpload,fp,mfjmpr,mtjmpr,*,vecsimple")
   (set_attr "length" "4,4,4,4,4,20,4,4,4,4,4,4,4")])

 ;; immediate value valid for a single instruction hiding in a const_double
@@ -10313,8 +10313,8 @@
 ;; giving the SCRATCH mq.

 (define_insn "*movti_power"
-  [(set (match_operand:TI 0 "reg_or_mem_operand" "=Q,m,r,r,r,r")
-	(match_operand:TI 1 "input_operand" "r,r,r,Q,m,n"))
+  [(set (match_operand:TI 0 "reg_or_mem_operand" "=Q,Y,r,r,r,r")
+	(match_operand:TI 1 "input_operand" "r,r,Q,Y,r,n"))
    (clobber (match_scratch:SI 2 "=q,q#X,X,X,X,X"))]
   "TARGET_POWER && ! TARGET_POWERPC64
    && (gpc_reg_operand (operands[0], TImode)
       || gpc_reg_operand (operands[1], TImode))"
@@ -10329,25 +10329,25 @@
       if (TARGET_STRING)
	return \"{stsi|stswi} %1,%P0,16\";
     case 1:
+      return \"#\";
     case 2:
-      return \"#\";
-    case 3:
       /* If the address is not used in the output, we can use lsi.
	 Otherwise, fall through to generating four loads.  */
       if (TARGET_STRING
	  && ! reg_overlap_mentioned_p (operands[0], operands[1]))
	return \"{lsi|lswi} %0,%P1,16\";
       /* ... fall through ...  */
+    case 3:
     case 4:
     case 5:
       return \"#\";
     }
 }"
-  [(set_attr "type" "store,store,*,load,load,*")])
+  [(set_attr "type" "store,store,load,load,*,*")])

 (define_insn "*movti_string"
-  [(set (match_operand:TI 0 "reg_or_mem_operand" "=Q,o,r,r,r,r")
-	(match_operand:TI 1 "input_operand" "r,r,r,Q,m,n"))]
+  [(set (match_operand:TI 0 "reg_or_mem_operand" "=Q,Y,r,r,r,r")
+	(match_operand:TI 1 "input_operand" "r,r,Q,Y,r,n"))]
   "! TARGET_POWER && ! TARGET_POWERPC64
    && (gpc_reg_operand (operands[0], TImode)
       || gpc_reg_operand (operands[1], TImode))"
   "*
@@ -10360,33 +10360,33 @@
       if (TARGET_STRING)
	return \"{stsi|stswi} %1,%P0,16\";
     case 1:
+      return \"#\";
     case 2:
-      return \"#\";
-    case 3:
       /* If the address is not used in the output, we can use lsi.
	 Otherwise, fall through to generating four loads.  */
       if (TARGET_STRING
	  && ! reg_overlap_mentioned_p (operands[0], operands[1]))
	return \"{lsi|lswi} %0,%P1,16\";
       /* ... fall through ...  */
+    case 3:
     case 4:
     case 5:
       return \"#\";
     }
 }"
-  [(set_attr "type" "store_ux,store_ux,*,load_ux,load_ux,*")
+  [(set_attr "type" "store_ux,store_ux,load_ux,load_ux,*,*")
   (set (attr "cell_micro") (if_then_else (match_test "TARGET_STRING")
					  (const_string "always")
					  (const_string "conditional")))])

 (define_insn "*movti_ppc64"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o,r")
-	(match_operand:TI 1 "input_operand" "r,r,m"))]
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=Y,r,r")
+	(match_operand:TI 1 "input_operand" "r,Y,r"))]
   "(TARGET_POWERPC64 && (gpc_reg_operand (operands[0], TImode)
    || gpc_reg_operand (operands[1], TImode)))
    && VECTOR_MEM_NONE_P (TImode)"
   "#"
-  [(set_attr "type" "*,store,load")])
+  [(set_attr "type" "store,load,*")])

 (define_split
   [(set (match_operand:TI 0 "gpc_reg_operand" "")
@@ -13215,8 +13215,8 @@
	(set_attr "length" "12")])

 (define_insn "stack_protect_setdi"
-  [(set (match_operand:DI 0 "memory_operand" "=m")
-	(unspec:DI [(match_operand:DI 1 "memory_operand" "m")] UNSPEC_SP_SET))
+  [(set (match_operand:DI 0 "memory_operand" "=Y")
+	(unspec:DI [(match_operand:DI 1 "memory_operand" "Y")] UNSPEC_SP_SET))
    (set (match_scratch:DI 2 "=r") (const_int 0))]
   "TARGET_64BIT"
   "ld%U1%X1 %2,%1\;std%U0%X0 %2,%0\;{lil|li} %2,0"
@@ -13257,8 +13257,8 @@

 (define_insn "stack_protect_testdi"
   [(set (match_operand:CCEQ 0 "cc_reg_operand" "=x,?y")
-	(unspec:CCEQ [(match_operand:DI 1 "memory_operand" "m,m")
-		      (match_operand:DI 2 "memory_operand" "m,m")]
+	(unspec:CCEQ [(match_operand:DI 1 "memory_operand" "Y,Y")
+		      (match_operand:DI 2 "memory_operand" "Y,Y")]
		     UNSPEC_SP_TEST))
    (set (match_scratch:DI 4 "=r,r") (const_int 0))
    (clobber (match_scratch:DI 3 "=r,r"))]

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] Fix PR54093, ICE due to 07-24 changes
This fixes a thinko and typo in
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01168.html that shows up
as an ICE on e500.  Two issues really.  One is that the secondary
reload insn emitted by rs6000_secondary_reload_gpr had better be
valid.  The other is that we only need these reloads when the insn
predicate says that the address is good.  Fixing the second problem
avoids the first.

Bootstrapped and regression tested powerpc-linux, and fixes the
testcases in the PR on e500.  OK to apply?

	* config/rs6000/rs6000.c (rs6000_secondary_reload): Limit
	32-bit multi-gpr reload to cases where predicate passes.  Do
	the same for 64-bit multi-gpr reload.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 189801)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -13643,8 +13643,11 @@ rs6000_secondary_reload (bool in_p,
	   && GET_MODE_SIZE (GET_MODE (x)) >= UNITS_PER_WORD)
	{
	  rtx off = address_offset (XEXP (x, 0));
+	  unsigned int extra = GET_MODE_SIZE (GET_MODE (x)) - UNITS_PER_WORD;

-	  if (off != NULL_RTX && (INTVAL (off) & 3) != 0)
+	  if (off != NULL_RTX
+	      && (INTVAL (off) & 3) != 0
+	      && (unsigned HOST_WIDE_INT) INTVAL (off) + 0x8000 < 0x10000 - extra)
	    {
	      if (in_p)
		sri->icode = CODE_FOR_reload_di_load;
@@ -13662,10 +13665,17 @@ rs6000_secondary_reload (bool in_p,
	   && GET_MODE_SIZE (GET_MODE (x)) > UNITS_PER_WORD)
	{
	  rtx off = address_offset (XEXP (x, 0));
+	  unsigned int extra = GET_MODE_SIZE (GET_MODE (x)) - UNITS_PER_WORD;

+	  /* We need a secondary reload only when our legitimate_address_p
+	     says the address is good (as otherwise the entire address
+	     will be reloaded).  So for mode sizes of 8 and 16 this will
+	     be when the offset is in the ranges [0x7ffc,0x7fff] and
+	     [0x7ff4,0x7ff7] respectively.  Note that the address we see
+	     here may have been manipulated by legitimize_reload_address.  */
	  if (off != NULL_RTX
-	      && ((unsigned HOST_WIDE_INT) INTVAL (off) + 0x8000
-		  >= 0x10000u - (GET_MODE_SIZE (GET_MODE (x)) - UNITS_PER_WORD)))
+	      && ((unsigned HOST_WIDE_INT) INTVAL (off) - (0x8000 - extra)
+		  < UNITS_PER_WORD))
	    {
	      if (in_p)
		sri->icode = CODE_FOR_reload_si_load;

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] Fix PR54131, ICE building 416.gamess
This cures the 'Y' constraint of being overly restrictive with lo_sum
offsets.  I've added a comment that explains why it is wrong to limit
the range of lo_sum offsets.  Bootstrapped and regression tested
powerpc-linux.  OK to apply?

	PR target/54131
	* config/rs6000/rs6000.c (mem_operand_gpr): Don't limit range
	of lo_sum offsets.  Comment.  Assert mode at least word size
	rather than bypassing powerpc64 word offset check.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 189996)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -5008,24 +5008,38 @@

    Offsetting a lo_sum should not be allowed, except where we know by
    alignment that a 32k boundary is not crossed, but see the ???
-   comment in rs6000_legitimize_reload_address.  */
+   comment in rs6000_legitimize_reload_address.  Note that by
+   offsetting here we mean a further offset to access parts of the
+   MEM.  It's fine to have a lo_sum where the inner address is offset
+   from a sym, since the same sym+offset will appear in the high part
+   of the address calculation.  */

 bool
 mem_operand_gpr (rtx op, enum machine_mode mode)
 {
   unsigned HOST_WIDE_INT offset;
   int extra;
+  rtx addr = XEXP (op, 0);

-  op = address_offset (XEXP (op, 0));
+  op = address_offset (addr);
   if (op == NULL_RTX)
     return true;

   offset = INTVAL (op);
+  if (TARGET_POWERPC64 && (offset & 3) != 0)
+    return false;
+
+  if (GET_CODE (addr) == LO_SUM)
+    /* We know by alignment that ABI_AIX medium/large model toc refs
+       will not cross a 32k boundary, since all entries in the
+       constant pool are naturally aligned and we check alignment for
+       other medium model toc-relative addresses.  For ABI_V4 and
+       ABI_DARWIN lo_sum addresses, we just check that 64-bit
+       offsets are 4-byte aligned.  */
+    return true;
+
   extra = GET_MODE_SIZE (mode) - UNITS_PER_WORD;
-  if (extra < 0)
-    extra = 0;
-  else if (TARGET_POWERPC64 && (offset & 3) != 0)
-    return false;
+  gcc_assert (extra >= 0);
   return offset + 0x8000 < 0x10000u - extra;
 }

-- 
Alan Modra
Australia Development Lab, IBM
Re: PR53914, rs6000 constraints and reload queries
*vsx_mov<mode>
-  [(set (match_operand:VSX_M 0 "nonimmediate_operand" "=Z,<VSr>,<VSr>,?Z,?wa,?wa,*o,*r,*r,<VSr>,?wa,v,wZ,v")
-	(match_operand:VSX_M 1 "input_operand" "<VSr>,Z,<VSr>,wa,Z,wa,r,o,r,j,j,W,v,wZ"))]
+  [(set (match_operand:VSX_M 0 "nonimmediate_operand" "=Z,<VSr>,<VSr>,?Z,?wa,?wa,*Y,*r,*r,<VSr>,?wa,v,wZ,v")
+	(match_operand:VSX_M 1 "input_operand" "<VSr>,Z,<VSr>,wa,Z,wa,r,Y,r,j,j,W,v,wZ"))]
   "VECTOR_MEM_VSX_P (<MODE>mode)
    && (register_operand (operands[0], <MODE>mode)
        || register_operand (operands[1], <MODE>mode))"
@@ -272,8 +272,8 @@
 ;; Unlike other VSX moves, allow the GPRs, since a normal use of TImode
 ;; is for unions.  However for plain data movement, slightly favor the
 ;; vector loads
 (define_insn "*vsx_movti"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=Z,wa,wa,?o,?r,?r,wa,v,v,wZ")
-	(match_operand:TI 1 "input_operand" "wa,Z,wa,r,o,r,j,W,wZ,v"))]
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=Z,wa,wa,?Y,?r,?r,wa,v,v,wZ")
+	(match_operand:TI 1 "input_operand" "wa,Z,wa,r,Y,r,j,W,wZ,v"))]
   "VECTOR_MEM_VSX_P (TImode)
    && (register_operand (operands[0], TImode)
        || register_operand (operands[1], TImode))"
Index: gcc/config/rs6000/altivec.md
===================================================================
--- gcc/config/rs6000/altivec.md	(revision 190035)
+++ gcc/config/rs6000/altivec.md	(working copy)
@@ -165,8 +165,8 @@
 ;; Vector move instructions.
 (define_insn "*altivec_mov<mode>"
-  [(set (match_operand:VM2 0 "nonimmediate_operand" "=Z,v,v,*o,*r,*r,v,v")
-	(match_operand:VM2 1 "input_operand" "v,Z,v,r,o,r,j,W"))]
+  [(set (match_operand:VM2 0 "nonimmediate_operand" "=Z,v,v,*Y,*r,*r,v,v")
+	(match_operand:VM2 1 "input_operand" "v,Z,v,r,Y,r,j,W"))]
   "VECTOR_MEM_ALTIVEC_P (<MODE>mode)
    && (register_operand (operands[0], <MODE>mode)
        || register_operand (operands[1], <MODE>mode))"
@@ -190,8 +190,8 @@
 ;; is for unions.  However for plain data movement, slightly favor the
 ;; vector loads
 (define_insn "*altivec_movti"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=Z,v,v,?o,?r,?r,v,v")
-	(match_operand:TI 1 "input_operand" "v,Z,v,r,o,r,j,W"))]
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=Z,v,v,?Y,?r,?r,v,v")
+	(match_operand:TI 1 "input_operand" "v,Z,v,r,Y,r,j,W"))]
   "VECTOR_MEM_ALTIVEC_P (TImode)
    && (register_operand (operands[0], TImode)
        || register_operand (operands[1], TImode))"
Index: gcc/config/rs6000/dfp.md
===================================================================
--- gcc/config/rs6000/dfp.md	(revision 190035)
+++ gcc/config/rs6000/dfp.md	(working copy)
@@ -426,12 +426,12 @@
   "TARGET_HARD_FLOAT && TARGET_FPRS"
   "{ rs6000_emit_move (operands[0], operands[1], TDmode); DONE; }")

-; It's important to list the o->f and f->o moves before f->f because
-; otherwise reload, given m->f, will try to pick f->f and reload it,
-; which doesn't make progress.  Likewise r->Y must be before r->r.
+; It's important to list the Y->r and r->Y moves before r->r because
+; otherwise reload, given m->r, will try to pick r->r and reload it,
+; which doesn't make progress.
 (define_insn_and_split "*movtd_internal"
-  [(set (match_operand:TD 0 "nonimmediate_operand" "=o,d,d,r,Y,r")
-	(match_operand:TD 1 "input_operand" "d,o,d,YGHF,r,r"))]
+  [(set (match_operand:TD 0 "nonimmediate_operand" "=m,d,d,Y,r,r")
+	(match_operand:TD 1 "input_operand" "d,m,d,r,YGHF,r"))]
   "TARGET_HARD_FLOAT && TARGET_FPRS
    && (gpc_reg_operand (operands[0], TDmode)
        || gpc_reg_operand (operands[1], TDmode))"
Index: gcc/config/rs6000/paired.md
===================================================================
--- gcc/config/rs6000/paired.md	(revision 190035)
+++ gcc/config/rs6000/paired.md	(working copy)
@@ -201,8 +201,8 @@
   [(set_attr "type" "fp")])

 (define_insn "*movv2sf_paired"
-  [(set (match_operand:V2SF 0 "nonimmediate_operand" "=Z,f,f,o,r,r,f")
-	(match_operand:V2SF 1 "input_operand" "f,Z,f,r,o,r,W"))]
+  [(set (match_operand:V2SF 0 "nonimmediate_operand" "=Z,f,f,Y,r,r,f")
+	(match_operand:V2SF 1 "input_operand" "f,Z,f,r,Y,r,W"))]
   "TARGET_PAIRED_FLOAT
    && (register_operand (operands[0], V2SFmode)
        || register_operand (operands[1], V2SFmode))"
-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] asynch exceptions and unwind info
Hi David, I've been looking into what we need to do to support unwinding from async signal handlers. I've implemented unwind info generation for .glink in the linker, but to keep the ppc64 .glink unwind info simple I've assumed that frob_update_context is still used. We still have some difficulties related to r2 tracking on ppc64.

frob_update_context doesn't quite do the right thing for async unwinding. A typical (no-r11) plt call stub looks like

	addis r12,2,off@ha
	std 2,40(1)
	ld 11,off@l(12)
	mtctr 11
	ld 2,off+8@l(12)
	bctr

or, when the offset from r2 to the function descriptor is small

	std 2,40(1)
	ld 11,off(2)
	mtctr 11
	ld 2,off+8(2)
	bctr

Now if we're stopped before the save of r2 we obviously don't want the unwinder to restore r2 from 40(1), but that's exactly what the current unwinder does. Also, there is a one insn window where frob_update_context may do the wrong thing for gcc generated calls via function pointer, which typically look like

	ld 0,0(r)
	std 2,40(1)
	mtctr 0
	ld 2,8(r)
	bctrl
	ld 2,40(1)

Here, if we are stopped after the ld 2,8(r) then r2 needs to be restored from 40(1). The following patch fixes these two issues.

Ideally what I'd like to do is have ld and gcc emit accurate r2 tracking unwind info and dispense with hacks like frob_update_context. If ld did emit accurate unwind info for .glink, then the justification for frob_update_context disappears. The difficulty then is backwards compatibility. You'd need a way for the gcc unwinder to handle a mix of old code (that needs frob_update_context) with new code (that doesn't). One way to accomplish this would be to set a dummy reg with initial CIE dwarf instructions, then test this reg in frob_update_context.

Bootstrapped and regression tested powerpc64-linux.

	* config/rs6000/linux-unwind.h (frob_update_context <__powerpc64__>):
	Leave r2 REG_UNSAVED if stopped on the instruction that saves r2
	in a plt call stub.  Do restore r2 if stopped on bctrl.

Index: libgcc/config/rs6000/linux-unwind.h
===================================================================
--- libgcc/config/rs6000/linux-unwind.h	(revision 176780)
+++ libgcc/config/rs6000/linux-unwind.h	(working copy)
@@ -346,10 +346,28 @@ frob_update_context (struct _Unwind_Cont
	 figure out if it was saved.  The big problem here is that the
	 code that does the save/restore is generated by the linker, so
	 we have no good way to determine at compile time what to do.  */
-      unsigned int *insn
-	= (unsigned int *) _Unwind_GetGR (context, R_LR);
-      if (insn && *insn == 0xE8410028)
-	_Unwind_SetGRPtr (context, 2, context->cfa + 40);
+      if (pc[0] == 0xF8410028
+	  || ((pc[0] & 0xFFFF0000) == 0x3D820000
+	      && pc[1] == 0xF8410028))
+	{
+	  /* We are in a plt call stub or r2 adjusting long branch stub,
+	     before r2 has been saved.  Keep REG_UNSAVED.  */
+	}
+      else if (pc[0] == 0x4E800421
+	       && pc[1] == 0xE8410028)
+	{
+	  /* We are at the bctrl instruction in a call via function
+	     pointer.  gcc always emits the load of the new r2 just
+	     before the bctrl.  */
+	  _Unwind_SetGRPtr (context, 2, context->cfa + 40);
+	}
+      else
+	{
+	  unsigned int *insn
+	    = (unsigned int *) _Unwind_GetGR (context, R_LR);
+	  if (insn && *insn == 0xE8410028)
+	    _Unwind_SetGRPtr (context, 2, context->cfa + 40);
+	}
     }
 #endif
 }
-- 
Alan Modra
Australia Development Lab, IBM
Re: [RS6000] asynch exceptions and unwind info
On Wed, Jul 27, 2011 at 03:00:45PM +0930, Alan Modra wrote: Ideally what I'd like to do is have ld and gcc emit accurate r2 tracking unwind info and dispense with hacks like frob_update_context. If ld did emit accurate unwind info for .glink, then the justification for frob_update_context disappears. For the record, this statement of mine doesn't make sense. A .glink stub doesn't make a frame, so a backtrace won't normally pass through a stub, thus having accurate unwind info for .glink doesn't help at all. ld would need to insert unwind info for r2 on the call, but that involves editing .eh_frame and in any case isn't accurate since the r2 save doesn't happen until one or two instructions after the call, in the stub. I think we are stuck with frob_update_context. -- Alan Modra Australia Development Lab, IBM
Re: [RS6000] asynch exceptions and unwind info
On Thu, Jul 28, 2011 at 11:49:16AM -0700, Richard Henderson wrote:
> On 07/28/2011 12:27 AM, Alan Modra wrote:
>> On Wed, Jul 27, 2011 at 03:00:45PM +0930, Alan Modra wrote:
>>> Ideally what I'd like to do is have ld and gcc emit accurate r2
>>> tracking unwind info and dispense with hacks like
>>> frob_update_context.  If ld did emit accurate unwind info for
>>> .glink, then the justification for frob_update_context disappears.
>>
>> For the record, this statement of mine doesn't make sense.  A .glink
>> stub doesn't make a frame, so a backtrace won't normally pass through
>> a stub, thus having accurate unwind info for .glink doesn't help at all.
>
> It does, for the duration of the stub.

Right, but I was talking about the normal case, where the unwinder won't even look at .glink unwind info.

> The whole problem is that toc pointer copy in 40(1) is only valid
> during indirect call sequences, and iff ld inserted a stub?  I.e.
> direct calls between functions that share toc pointers never save
> the copy?

Yes.

> Would it make sense, if a function has any indirect call, to move the
> toc pointer save into the prologue?  You'd get to avoid that store all
> the time.  Of course you'd not be able to sink the load after the
> call, but it might still be a win.  And in that special case you can
> annotate the r2 save slot just once, correctly.

Except that any info about r2 in an indirect call sequence really belongs to the *called* function frame, not the caller. I woke up this morning with the realization that what I'd done in frob_update_context for indirect call sequences was wrong. Ditto for the r2 store that Michael moved into the prologue. The only time we want the unwinder to restore from that particular save is if r2 isn't saved in the current frame. Untested patch follows.

libgcc/
	* config/rs6000/linux-unwind.h (frob_update_context <__powerpc64__>):
	Restore for indirect call bctrl from correct stack slot, and only
	if cfa+40 isn't valid.
gcc/
	* config/rs6000/rs6000-protos.h (rs6000_save_toc_in_prologue_p): Delete.
	* config/rs6000/rs6000.c (rs6000_save_toc_in_prologue_p): Make static.
	(rs6000_emit_prologue): Don't emit eh_frame info in
	save_toc_in_prologue case.
	(rs6000_call_indirect_aix): Formatting.

Index: libgcc/config/rs6000/linux-unwind.h
===================================================================
--- libgcc/config/rs6000/linux-unwind.h	(revision 176905)
+++ libgcc/config/rs6000/linux-unwind.h	(working copy)
@@ -354,20 +354,22 @@ frob_update_context (struct _Unwind_Cont
	  /* We are in a plt call stub or r2 adjusting long branch stub,
	     before r2 has been saved.  Keep REG_UNSAVED.  */
	}
-      else if (pc[0] == 0x4E800421
-	       && pc[1] == 0xE8410028)
-	{
-	  /* We are at the bctrl instruction in a call via function
-	     pointer.  gcc always emits the load of the new r2 just
-	     before the bctrl.  */
-	  _Unwind_SetGRPtr (context, 2, context->cfa + 40);
-	}
       else
	{
	  unsigned int *insn
	    = (unsigned int *) _Unwind_GetGR (context, R_LR);
	  if (insn && *insn == 0xE8410028)
	    _Unwind_SetGRPtr (context, 2, context->cfa + 40);
+	  else if (pc[0] == 0x4E800421
+		   && pc[1] == 0xE8410028)
+	    {
+	      /* We are at the bctrl instruction in a call via function
+		 pointer.  gcc always emits the load of the new R2 just
+		 before the bctrl so this is the first and only place
+		 we need to use the stored R2.  */
+	      _Unwind_Word sp = _Unwind_GetGR (context, 1);
+	      _Unwind_SetGRPtr (context, 2, sp + 40);
+	    }
	}
     }
 #endif
Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 176905)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -172,8 +172,6 @@
 extern void rs6000_emit_epilogue (int);
 extern void rs6000_emit_eh_reg_restore (rtx, rtx);
 extern const char * output_isel (rtx *);
 extern void rs6000_call_indirect_aix (rtx, rtx, rtx);
-extern bool rs6000_save_toc_in_prologue_p (void);
-
 extern void rs6000_aix_asm_output_dwarf_table_ref (char *);

 /* Declare functions in rs6000-c.c */
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 176905)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -1178,6 +1178,7 @@ static void rs6000_conditional_register_
 static void rs6000_trampoline_init (rtx, tree, rtx);
 static bool rs6000_cannot_force_const_mem (enum machine_mode, rtx);
 static bool rs6000_legitimate_constant_p (enum machine_mode, rtx);
+static bool rs6000_save_toc_in_prologue_p (void);

 /* Hash table stuff for keeping track of TOC entries.  */

@@ -20536,8 +20562,11 @@ rs6000_emit_prologue (void)
   /* If we
Re: [RS6000] asynch exceptions and unwind info
On Thu, Jul 28, 2011 at 12:09:51PM -0700, Richard Henderson wrote: Well, even if we're not able to hoist the R2 store, we may be able to simply add REG_CFA_OFFSET and REG_CFA_RESTORE notes to the insns in the stream. You'd need to mark every non-local call with something that says R2 may be saved, effectively duplicating md_frob_update in dwarf. I guess that is possible even without extending our eh encoding, but each call would have at least 6 bytes added to eh_frame: DW_CFA_expression, 2, 3, DW_OP_skip, offset_to_r2_prog and you'd need to emit multiple copies of r2_prog for functions that have a lot of calls, since the offset is limited to +/-32k. I think that would inflate the size of .eh_frame too much, and slow down handling of exceptions dramatically. -- Alan Modra Australia Development Lab, IBM
Re: [RS6000] asynch exceptions and unwind info
On Fri, Jul 29, 2011 at 10:57:48AM +0930, Alan Modra wrote:
> Except that any info about r2 in an indirect call sequence really
> belongs to the *called* function frame, not the caller.  I woke up
> this morning with the realization that what I'd done in
> frob_update_context for indirect call sequences was wrong.  Ditto for
> the r2 store that Michael moved into the prologue.  The only time we
> want the unwinder to restore from that particular save is if r2 isn't
> saved in the current frame.  Untested patch follows.

Here's a tested patch that fixes an issue with TARGET_SINGLE_PIC_BASE and enables Michael's save_toc_in_prologue optimization for all functions except those that make dynamic stack adjustments. Incidentally, the rs6000_emit_prologue comment I added below suggests another solution. Since all we need is the toc pointer for the frame, it would be possible to tell the unwinder to simply load r2 from the .opd entry. I think..

libgcc/
	* config/rs6000/linux-unwind.h (frob_update_context <__powerpc64__>):
	Restore for indirect call bctrl from correct stack slot, and only
	if cfa+40 isn't valid.
gcc/
	* config/rs6000/rs6000-protos.h (rs6000_save_toc_in_prologue_p): Delete.
	* config/rs6000/rs6000.c (rs6000_save_toc_in_prologue_p): Make static.
	(rs6000_emit_prologue): Don't prematurely return when
	TARGET_SINGLE_PIC_BASE.  Don't emit eh_frame info in
	save_toc_in_prologue case.
	(rs6000_call_indirect_aix): Only disallow save_toc_in_prologue for
	calls_alloca.

Index: libgcc/config/rs6000/linux-unwind.h
===================================================================
--- libgcc/config/rs6000/linux-unwind.h	(revision 176905)
+++ libgcc/config/rs6000/linux-unwind.h	(working copy)
@@ -354,20 +354,22 @@ frob_update_context (struct _Unwind_Cont
	  /* We are in a plt call stub or r2 adjusting long branch stub,
	     before r2 has been saved.  Keep REG_UNSAVED.  */
	}
-      else if (pc[0] == 0x4E800421
-	       && pc[1] == 0xE8410028)
-	{
-	  /* We are at the bctrl instruction in a call via function
-	     pointer.  gcc always emits the load of the new r2 just
-	     before the bctrl.  */
-	  _Unwind_SetGRPtr (context, 2, context->cfa + 40);
-	}
       else
	{
	  unsigned int *insn
	    = (unsigned int *) _Unwind_GetGR (context, R_LR);
	  if (insn && *insn == 0xE8410028)
	    _Unwind_SetGRPtr (context, 2, context->cfa + 40);
+	  else if (pc[0] == 0x4E800421
+		   && pc[1] == 0xE8410028)
+	    {
+	      /* We are at the bctrl instruction in a call via function
+		 pointer.  gcc always emits the load of the new R2 just
+		 before the bctrl so this is the first and only place
+		 we need to use the stored R2.  */
+	      _Unwind_Word sp = _Unwind_GetGR (context, 1);
+	      _Unwind_SetGRPtr (context, 2, sp + 40);
+	    }
	}
     }
 #endif
Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 176905)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -172,8 +172,6 @@
 extern void rs6000_emit_epilogue (int);
 extern void rs6000_emit_eh_reg_restore (rtx, rtx);
 extern const char * output_isel (rtx *);
 extern void rs6000_call_indirect_aix (rtx, rtx, rtx);
-extern bool rs6000_save_toc_in_prologue_p (void);
-
 extern void rs6000_aix_asm_output_dwarf_table_ref (char *);

 /* Declare functions in rs6000-c.c */
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 176905)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -1178,6 +1178,7 @@ static void rs6000_conditional_register_
 static void rs6000_trampoline_init (rtx, tree, rtx);
 static bool rs6000_cannot_force_const_mem (enum machine_mode, rtx);
 static bool rs6000_legitimate_constant_p (enum machine_mode, rtx);
+static bool rs6000_save_toc_in_prologue_p (void);

 /* Hash table stuff for keeping track of TOC entries.  */

@@ -20478,14 +20504,12 @@ rs6000_emit_prologue (void)
       insn = emit_insn (generate_set_vrsave (reg, info, 0));
     }

-  if (TARGET_SINGLE_PIC_BASE)
-    return; /* Do not set PIC register */
-
   /* If we are using RS6000_PIC_OFFSET_TABLE_REGNUM, we need to set it up.  */
-  if ((TARGET_TOC && TARGET_MINIMAL_TOC && get_pool_size () != 0)
-      || (DEFAULT_ABI == ABI_V4
-	  && (flag_pic == 1 || (flag_pic && TARGET_SECURE_PLT))
-	  && df_regs_ever_live_p (RS6000_PIC_OFFSET_TABLE_REGNUM)))
+  if (!TARGET_SINGLE_PIC_BASE
+      && ((TARGET_TOC && TARGET_MINIMAL_TOC && get_pool_size () != 0)
+	  || (DEFAULT_ABI == ABI_V4
+	      && (flag_pic == 1 || (flag_pic && TARGET_SECURE_PLT))
+	      && df_regs_ever_live_p (RS6000_PIC_OFFSET_TABLE_REGNUM
Re: [RS6000] asynch exceptions and unwind info
On Fri, Jul 29, 2011 at 09:16:09AM -0400, David Edelsohn wrote: Which has the problem? Which are you trying to solve? And how is your change solving it? Michael's save_toc_in_prologue emit_frame_save writes unwind info for the wrong frame. That r2 save is the current r2. What we need is info about the previous r2, so we can restore when unwinding. I made a similar mistake in frob_update_context in that the value saved by an indirect function call sequence is the r2 for the current function. I also restored from the wrong location. -- Alan Modra Australia Development Lab, IBM
Re: [RS6000] asynch exceptions and unwind info
On Fri, Jul 29, 2011 at 10:28:28PM +0930, Alan Modra wrote:
> libgcc/
> 	* config/rs6000/linux-unwind.h (frob_update_context <__powerpc64__>):
> 	Restore for indirect call bctrl from correct stack slot, and only
> 	if cfa+40 isn't valid.
> gcc/
> 	* config/rs6000/rs6000-protos.h (rs6000_save_toc_in_prologue_p): Delete.
> 	* config/rs6000/rs6000.c (rs6000_save_toc_in_prologue_p): Make static.
> 	(rs6000_emit_prologue): Don't prematurely return when
> 	TARGET_SINGLE_PIC_BASE.  Don't emit eh_frame info in
> 	save_toc_in_prologue case.
> 	(rs6000_call_indirect_aix): Only disallow save_toc_in_prologue
> 	for calls_alloca.

Approved offline and applied with a comment change.

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] Fix ICE in reg_save
Seen when running binutils testsuite on powerpc-linux. .../ld-elfvers/vers24b.c: In function 'foo': .../ld-elfvers/vers24b.c:5:1: internal compiler error: in reg_save, at dwarf2cfi.c:827 Bootstrapped and regression tested powerpc-linux. Committed as obvious. * config/rs6000/rs6000.c (rs6000_emit_prologue): Add REG_CFA_RESTORE note for save_LR_around_toc_setup sequence. Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 177041) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -20505,6 +20505,7 @@ rs6000_emit_prologue (void) rs6000_emit_load_toc_table (TRUE); insn = emit_move_insn (lr, frame_ptr_rtx); + add_reg_note (insn, REG_CFA_RESTORE, lr); RTX_FRAME_RELATED_P (insn) = 1; } else -- Alan Modra Australia Development Lab, IBM
Re: [PATCH 4/6] Shrink-wrapping
On Thu, Jul 28, 2011 at 12:35:46PM +0200, Bernd Schmidt wrote:
> [snip]
> 	* rtl.h (ANY_RETURN_P): New macro.
> [snip]

This patch makes rebuild_jump_labels set JUMP_LABEL appropriately for return jumps, and fixes sharing for RETURN. Since ANY_RETURN_P(X) is defined as ((X) == ret_rtx), RETURNs need to stay shared. Bootstrapped and regression tested powerpc-linux and powerpc64-linux. OK to apply?

	PR rtl-optimization/49941
	* jump.c (mark_jump_label): Comment.
	(mark_jump_label_1): Set JUMP_LABEL for return jumps.
	* emit-rtl.c (copy_rtx_if_shared_1, copy_insn_1): Leave RETURN shared.
	(mark_used_flags): Don't mark RETURN.

Index: gcc/jump.c
===================================================================
--- gcc/jump.c	(revision 177084)
+++ gcc/jump.c	(working copy)
@@ -1039,6 +1039,7 @@ sets_cc0_p (const_rtx x)
    notes.  If INSN is an INSN or a CALL_INSN or non-target operands of
    a JUMP_INSN, and there is at least one CODE_LABEL referenced in
    INSN, add a REG_LABEL_OPERAND note containing that label to INSN.
+   For returnjumps, the JUMP_LABEL will also be set as appropriate.

    Note that two labels separated by a loop-beginning note
    must be kept distinct if we have not yet done loop-optimization,
@@ -1081,6 +1082,14 @@ mark_jump_label_1 (rtx x, rtx insn, bool
     case CALL:
       return;

+    case RETURN:
+      if (is_target)
+	{
+	  gcc_assert (JUMP_LABEL (insn) == NULL || JUMP_LABEL (insn) == x);
+	  JUMP_LABEL (insn) = x;
+	}
+      return;
+
     case MEM:
       in_mem = true;
       break;

Index: gcc/emit-rtl.c
===================================================================
--- gcc/emit-rtl.c	(revision 177084)
+++ gcc/emit-rtl.c	(working copy)
@@ -2724,6 +2724,7 @@ repeat:
     case CODE_LABEL:
     case PC:
     case CC0:
+    case RETURN:
     case SCRATCH:
       /* SCRATCH must be shared because they represent distinct values.  */
       return;
@@ -2843,6 +2844,7 @@ repeat:
     case CODE_LABEL:
     case PC:
     case CC0:
+    case RETURN:
       return;

     case DEBUG_INSN:
@@ -5257,6 +5259,7 @@ copy_insn_1 (rtx orig)
     case CODE_LABEL:
     case PC:
     case CC0:
+    case RETURN:
       return orig;

     case CLOBBER:
       if (REG_P (XEXP (orig, 0)) && REGNO (XEXP (orig, 0)) < FIRST_PSEUDO_REGISTER)
-- 
Alan Modra
Australia Development Lab, IBM
simplify some range tests in rs6000.c
This patch came about because I noticed a comment about the second condition in rs6000_emit_epilogue no longer matched the code. I was going to fix the comment, then noticed the code could be simplified. In rs6000_emit_epilogue, lr_save_offset is a small positive value (4, 8 or 16), or zero when !lr_save_p, and total_size a HOST_WIDE_INT. In rs6000_legitimate_offset_address_p, offset is an unsigned HOST_WIDE_INT, extra a value between 0 and 12. Unless my numerical analysis has totally deserted me, the new code here is exactly equivalent to the old. Bootstrapped and regression tested powerpc-linux. Committing as obvious.

	* config/rs6000/rs6000.c (rs6000_emit_epilogue): Simplify
	use_backchain_to_restore_sp initialisation.
	(rs6000_legitimate_offset_address_p): Simplify offset test.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 177267)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -5391,9 +5391,9 @@ rs6000_legitimate_offset_address_p (enum
	  break;
	}
       offset += 0x8000;
-      return (offset < 0x10000) && (offset + extra < 0x10000);
+      return offset < 0x10000 - extra;
 }

 bool
 legitimate_indexed_address_p (rtx x, int strict)
@@ -20711,12 +20711,9 @@ rs6000_emit_epilogue (int sibcall)
      is more efficient than an addis, addi pair.  The second condition
      here will not trigger at the moment;  We don't actually need a
      frame pointer for alloca, but the generic parts of the compiler
      give us one anyway.  */
-  use_backchain_to_restore_sp = (info->total_size > 32767
-				 || info->total_size
-				    + (info->lr_save_p ? info->lr_save_offset : 0)
-				      > 32767
+  use_backchain_to_restore_sp = (info->total_size > 32767 - info->lr_save_offset
				 || (cfun->calls_alloca && !frame_pointer_needed));

   restore_lr = (info->lr_save_p
		 && (restoring_FPRs_inline
-- 
Alan Modra
Australia Development Lab, IBM
Re: PowerPC prologue and epilogue 6
On Wed, May 30, 2012 at 03:21:28PM +0200, Dominique Dhumieres wrote: I get an ICE of the form /opt/gcc/work/gcc/testsuite/gcc.target/powerpc/savres.c: In function 'nb_all': /opt/gcc/work/gcc/testsuite/gcc.target/powerpc/savres.c:473:3: internal compiler error: in rs6000_emit_prologue, at config/rs6000/rs6000.c:19850 Is the test intended to work on PIC targets? No, but see rs6000/darwin.h CC1_SPEC. -static makes you non-PIC. I've just built a darwin cc1 to reproduce the problem. The ICE is on START_USE (ptr_regno); when setting up a reg to use for altivec saves. The reg clashes with the static chain pointer (nb_all is a nested function), so this is a real bug that the register checks have uncovered. I haven't determined whether this is a new bug introduced with my prologue changes, or whether it's a long-standing bug. I suspect the latter. -- Alan Modra Australia Development Lab, IBM
Re: PowerPC prologue and epilogue 6
On Thu, May 31, 2012 at 09:43:09AM +0930, Alan Modra wrote: real bug that the register checks have uncovered. I haven't determined whether this is a new bug introduced with my prologue changes, or whether it's a long-standing bug. I suspect the latter. Looks like it is one I introduced. gcc-4.6 uses r12 to save altivec regs, my new code tries to use r11. Will fix. -- Alan Modra Australia Development Lab, IBM
Re: PowerPC prologue and epilogue 6
On Thu, May 31, 2012 at 10:41:26AM +0930, Alan Modra wrote:
> Looks like it is one I introduced.  gcc-4.6 uses r12 to save altivec
> regs, my new code tries to use r11.  Will fix.

Please try out this patch on Darwin. Bootstrapped and regression tested powerpc-linux.

gcc/
	* config/rs6000/rs6000.c (ptr_regno_for_savres): Comment.
	(rs6000_emit_prologue): Ensure register used for inline saves of
	vector regs is not the static chain register.  Revise comment.
gcc/testsuite/
	* gcc.target/powerpc/savres.c: Add -static to dg-options.  Check
	static chain in nested funcs.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 187999)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -19108,6 +19161,9 @@ rs6000_emit_stack_reset (rs6000_stack_t *info,
   return NULL_RTX;
 }

+/* Return the register number used as a pointer by out-of-line
+   save/restore functions.  */
+
 static inline unsigned
 ptr_regno_for_savres (int sel)
 {
@@ -19845,6 +19901,9 @@ rs6000_emit_prologue (void)
       int sel = SAVRES_SAVE | SAVRES_VR;
       unsigned ptr_regno = ptr_regno_for_savres (sel);

+      if (using_static_chain_p
+	  && ptr_regno == STATIC_CHAIN_REGNUM)
+	ptr_regno = 12;
       if (REGNO (frame_reg_rtx) != ptr_regno)
	START_USE (ptr_regno);
       ptr_reg = gen_rtx_REG (Pmode, ptr_regno);
@@ -19953,9 +20012,9 @@ rs6000_emit_prologue (void)
	  int offset;
	  int save_regno;

-	  /* Get VRSAVE onto a GPR.  Note that ABI_V4 might be using r12
-	     as frame_reg_rtx and r11 as the static chain pointer for
-	     nested functions.  */
+	  /* Get VRSAVE onto a GPR.  Note that ABI_V4 and ABI_DARWIN might
+	     be using r12 as frame_reg_rtx and r11 as the static chain
+	     pointer for nested functions.
 */
	  save_regno = 12;
	  if (DEFAULT_ABI == ABI_AIX && !using_static_chain_p)
	    save_regno = 11;
Index: gcc/testsuite/gcc.target/powerpc/savres.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/savres.c	(revision 187999)
+++ gcc/testsuite/gcc.target/powerpc/savres.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-fno-inline -fomit-frame-pointer" } */
+/* { dg-options "-fno-inline -fomit-frame-pointer -static" } */
 /* -fno-inline -maltivec -m32/-m64 -mmultiple/no-multiple -Os/-O2.  */

 #ifndef NO_BODY
@@ -73,6 +73,7 @@ __attribute__ ((vector_size (16))) int val31 = {-3
 #else /* NO_BODY */
 /* For looking at prologue and epilogue code without distractions.  */
+#define abort()
 #define TRASH_ALL_CR
 #define TRASH_ALL_VR
 #define TRASH_ALL_FPR
@@ -458,7 +459,7 @@ void s_0 (void)
 void wb_all (void)
 {
   char b[10];
-  void nb_all (void)
+  char *nb_all (void)
   {
     char a[33000];
     TRASH_ALL_CR;
@@ -470,14 +471,16 @@ void wb_all (void)
     USE_ALL_FPR;
     USE_ALL_GPR;
     __asm __volatile ("#%0 %1" : "=m" (a), "=m" (b) : :
		      "cr2", "cr3", "cr4",
		      "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
		      "v28", "v29", "v30", "v31",
		      "fr14", "fr15", "fr16", "fr17", "fr18", "fr19", "fr20",
		      "fr21", "fr22", "fr23", "fr24", "fr25", "fr26", "fr27",
		      "fr28", "fr29", "fr30", "fr31",
		      "r14", "r15", "r16", "r17", "r18", "r19", "r20", "r21",
		      "r22", "r23", "r24", "r25", "r26", "r27", "r28", "r29",
		      "r30", "r31");
+    return b;
   }
-  nb_all();
+  if (nb_all() != b)
+    abort ();
 }

 void wb_cvfr (void)
 {
   char b[10];
-  void nb_cvfr (void)
+  char *nb_cvfr (void)
   {
     char a[33000];
     TRASH_SOME_CR;
@@ -489,14 +492,16 @@ void wb_cvfr (void)
     USE_SOME_FPR;
     USE_SOME_GPR;
     __asm __volatile ("#%0 %1" : "=m" (a), "=m" (b) : :
		      "cr2", "v26", "v27", "v31", "fr28", "fr31", "r30", "r31");
+    return b;
   }
-  nb_cvfr ();
+  if (nb_cvfr () != b)
+    abort ();
 }

 void wb_vfr (void)
 {
   char b[10];
-  void nb_vfr (void)
+  char *nb_vfr (void)
   {
     char a[33000];
     TRASH_SOME_VR;
@@ -506,14 +511,16 @@ void wb_vfr (void)
     USE_SOME_FPR;
     USE_SOME_GPR;
     __asm __volatile ("#%0 %1" : "=m" (a), "=m" (b) : :
		      "v26", "v27", "v31", "fr28", "fr31", "r30", "r31");
+    return b;
   }
-  nb_vfr ();
+  if (nb_vfr () != b)
+    abort ();
 }

 void wb_cvf (void)
 {
   char b[10];
-  void nb_cvf (void)
+  char *nb_cvf (void)
   {
     char a[33000];
     TRASH_SOME_CR;
@@ -523,14 +530,16 @@ void wb_cvf (void)
     USE_SOME_VR;
     USE_SOME_FPR;
     __asm __volatile ("#%0 %1" : "=m" (a), "=m" (b) : :
		      "cr2", "v26", "v27", "v31", "fr28", "fr31");
+    return b;
   }
-  nb_cvf ();
+  if (nb_cvf () != b)
+    abort ();
 }

 void wb_vf (void)
 {
   char b[10];
-  void nb_vf (void)
+  char *nb_vf (void)
   {
     char a[33000];
     TRASH_SOME_VR;
@@ -538,15 +547,17 @@ void wb_vf (void)
     USE_SOME_VR;
     USE_SOME_FPR;
     __asm __volatile ("#%0 %1" : "=m" (a), "=m" (b) : :
		      "v26", "v27", "v31", "fr28", "fr31");
+    return b;
   }
-  nb_vf ();
+  if (nb_vf () != b)
+    abort ();
 }
 #endif

 void wb_cvr (void)
 {
   char b[10];
-  void nb_cvr (void)
+  char *nb_cvr (void)
   {
     char
Re: PowerPC prologue and epilogue 6
On Thu, May 31, 2012 at 02:16:32PM +0200, Dominique Dhumieres wrote:
> (the final patch will require the suitable dg directives ;-).

This is really stretching my testsuite knowledge. Maybe add

/* { dg-additional-options "-mdynamic-no-pic" { target *-*-darwin* } } */

-- 
Alan Modra
Australia Development Lab, IBM
Re: [Target maintainers]: Please update libjava/sysdep/*/locks.h with new atomic builtins
On Wed, Jun 20, 2012 at 09:10:44AM -0400, David Edelsohn wrote:
>  inline static void
>  release_set (volatile obj_addr_t *addr, obj_addr_t new_val)
>  {
> -  __asm__ __volatile__ ("sync" : : : "memory");
> -  *addr = new_val;
> +  __atomic_store_n(addr, val, __ATOMIC_RELEASE);

A typo seems to have crept in here.  s/val/new_val/

-- 
Alan Modra
Australia Development Lab, IBM
Re: [RS6000] atomic tweaks
On Thu, Jun 21, 2012 at 12:08:42PM -0400, David Edelsohn wrote:
> 4) Later loads and stores to which a dependency is carried from the
> initial load (see 1.10p9 and 1.10p10 in the standard) cannot be
> reordered to precede the load(memory_order_consume).  Other loads and
> stores following the load(memory_order_consume) may be freely
> reordered to precede the load(memory_order_consume).
>
> A relaxed load is subject to constraints #1-#3 above, but not #4.  In
> contrast, a memory_order_acquire load would be subject to #1-3 above,
> but would be subject to a stronger version of #4 that prohibited -any-
> load or store following the load(memory_order_consume) to be reordered
> prior to that load(memory_order_consume).
>
> The isync is not technically required, but removing it requires subtle
> load-store tracking conformance from the compiler and library.  We
> left it in to be safe.  I'm curious about Richard and Jakub's current
> thoughts.

Hmm, OK. So, let me see if I understand. We know

	p = atomic_load_n (addr, __ATOMIC_CONSUME);
	i = *p;

needs no barriers on powerpc. But if we had something like

	p = atomic_load_n (addr, __ATOMIC_CONSUME);
	if (p == foo)
	  {
	    i = *p;
	    ...
	  }

and the compiler optimised i = *p to i = foo, then you'd be in trouble because the hardware won't see the load from foo having any dependency on p.

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] offset addressing
Hi David, I've been auditing rs6000.c after looking at https://bugzilla.redhat.com/show_bug.cgi?id=837630 We have problems all over the place. :-( For starters, this code in rs6000_legitimate_offset_address_p. case TFmode: ... case TDmode: case TImode: if (mode == TFmode || mode == TDmode || !TARGET_POWERPC64) extra = 12; else if (offset 3) return false; else extra = 8; break; Notice how a TFmode when TARGET_POWERPC64 gets extra==12. How can that be corrent? Both general and floating point registers are 8 bytes. However, this is not the main problem I see here and in many other places. The problem is that the mode is only a hint as to the register used in the insn. You can easily have a floating point mode in gprs or an integer mode in fprs, as shown by the following examples. These are all quite contrived, but a similar situation may arise in real code due to register pressure. double a, b; void f1 (void) { double tmp = a; asm (# : +r (tmp) : : fr0, fr1, fr2,fr3, fr4, fr5, fr6, fr7, fr8, fr9, fr10, fr11, fr12, fr13, fr14, fr15, fr16, fr17, fr18, fr19, fr20, fr21, fr22, fr23, fr24, fr25, fr26, fr27, fr28, fr29, fr30, fr31); b = tmp; } long long x, y; void f2 (void) { long long tmp = x; asm (# : +f (tmp) : : r0, r1, r2,r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31); y = tmp; } long double la, lb; void f3 (void) { long double tmp = la; asm (# : +r (tmp) : : fr0, fr1, fr2,fr3, fr4, fr5, fr6, fr7, fr8, fr9, fr10, fr11, fr12, fr13, fr14, fr15, fr16, fr17, fr18, fr19, fr20, fr21, fr22, fr23, fr24, fr25, fr26, fr27, fr28, fr29, fr30, fr31); lb = tmp; } So I think that rs6000_legitimate_offset_address_p code snippet ought to be case TFmode: ... 
    case TDmode:
    case TImode:
      if (!TARGET_POWERPC64)
	extra = 12;
      else if (offset & 3)
	return false;
      else
	extra = 8;
      break;

We can't assume that TFmode or TDmode are in fprs, so for 32-bit that means a possible addition of 12 to the offset and for 64-bit it means the offset must be a multiple of 4. If I go ahead and fix the many places in rs6000.c we will see some code degradation for oddball cases. For example, loading a mis-aligned double (well, not 4-byte aligned) on ppc64 to a fpr will use direct addressing rather than offset addressing. Maybe we could do something about that with a peephole.. I know the above isn't anything new. When this sort of problem came up in the past we didn't want to lose code quality in the common case, and tried to avoid the correctness issue with hacks that made incorrect code unlikely to occur. The thing is that with section anchors we are much more likely to see offset addresses near 32k boundaries. So is it worth trying to fix this properly? If it is, here's the first baby step toward that aim. It pegs the negative limit at -32768 instead of -32780, and increases the positive limit a little for powerpc64. This patch is a prerequisite to the rs6000_legitimate_offset_address_p fixes, since this one affects insn operand constraints, while the other places affect predicates; you can't have insn constraints not covering all operands allowed by the predicate, but the converse is fine. Bootstrapped etc. powerpc64-linux and powerpc-linux.

	* config/rs6000/rs6000.c (rs6000_mode_dependent_address): Correct
	offset addressing limits.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 189345)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -6409,7 +6409,7 @@ rs6000_mode_dependent_address (const_rtx addr)
 	  && GET_CODE (XEXP (addr, 1)) == CONST_INT)
 	{
 	  unsigned HOST_WIDE_INT val = INTVAL (XEXP (addr, 1));
-	  return val + 12 + 0x8000 >= 0x10000;
+	  return val + 0x8000 >= 0x10000 - (TARGET_POWERPC64 ? 8 : 12);
 	}
       break;
-- Alan Modra Australia Development Lab, IBM
Re: PR53914, rs6000 constraints and reload queries
(gpc_reg_operand (operands[0], TImode) || gpc_reg_operand (operands[1], TImode)) * @@ -10360,33 +10360,33 @@ if (TARGET_STRING) return \{stsi|stswi} %1,%P0,16\; case 1: + return \#\; case 2: - return \#\; -case 3: /* If the address is not used in the output, we can use lsi. Otherwise, fall through to generating four loads. */ if (TARGET_STRING ! reg_overlap_mentioned_p (operands[0], operands[1])) return \{lsi|lswi} %0,%P1,16\; /* ... fall through ... */ +case 3: case 4: case 5: return \#\; } } - [(set_attr type store_ux,store_ux,*,load_ux,load_ux,*) + [(set_attr type store_ux,store_ux,load_ux,load_ux,*,*) (set (attr cell_micro) (if_then_else (match_test TARGET_STRING) (const_string always) (const_string conditional)))]) (define_insn *movti_ppc64 - [(set (match_operand:TI 0 nonimmediate_operand =r,o,r) - (match_operand:TI 1 input_operand r,r,m))] + [(set (match_operand:TI 0 nonimmediate_operand =Y,r,r) + (match_operand:TI 1 input_operand r,Y,r))] (TARGET_POWERPC64 (gpc_reg_operand (operands[0], TImode) || gpc_reg_operand (operands[1], TImode))) VECTOR_MEM_NONE_P (TImode) # - [(set_attr type *,store,load)]) + [(set_attr type store,load,*)]) (define_split [(set (match_operand:TI 0 gpc_reg_operand ) @@ -13215,8 +13215,8 @@ (set_attr length 12)]) (define_insn stack_protect_setdi - [(set (match_operand:DI 0 memory_operand =m) - (unspec:DI [(match_operand:DI 1 memory_operand m)] UNSPEC_SP_SET)) + [(set (match_operand:DI 0 memory_operand =Y) + (unspec:DI [(match_operand:DI 1 memory_operand Y)] UNSPEC_SP_SET)) (set (match_scratch:DI 2 =r) (const_int 0))] TARGET_64BIT ld%U1%X1 %2,%1\;std%U0%X0 %2,%0\;{lil|li} %2,0 @@ -13257,8 +13257,8 @@ (define_insn stack_protect_testdi [(set (match_operand:CCEQ 0 cc_reg_operand =x,?y) -(unspec:CCEQ [(match_operand:DI 1 memory_operand m,m) - (match_operand:DI 2 memory_operand m,m)] +(unspec:CCEQ [(match_operand:DI 1 memory_operand Y,Y) + (match_operand:DI 2 memory_operand Y,Y)] UNSPEC_SP_TEST)) (set (match_scratch:DI 4 =r,r) (const_int 0)) 
(clobber (match_scratch:DI 3 =r,r))] -- Alan Modra Australia Development Lab, IBM
Re: PR53914, rs6000 constraints and reload queries
On Thu, Jul 19, 2012 at 12:34:25PM +0930, Alan Modra wrote: and fixes pr54009. David, in looking over this today, I realised that this bug isn't completely fixed. I stopped gcc emitting an offset of 32768, but that isn't enough. lo_sum addresses can't be offset, except when you know the alignment of the object being addressed guarantees that no part crosses a 32k boundary. For example, given lis 9,x@ha; lwz 3,x@l(9); lwz 4,x+4@l(9); we run into trouble if x happens to reside at n*64k + 32764. The final address matters, not just the offset in the insn. So I have some changes to mem_operand_gpr and rs6000_secondary_reload. I'll post a patch on top of the previous one. Another thing. I reckon we can do without the 'wY' constraint. I implemented it because movtf_internal currently uses an o constraint, but it seems to me that rs6000_legitimate_address_p already prevents non-offsettable TFmode addresses for mems accessed by fprs. Geoff introduced the o on movtf here: http://gcc.gnu.org/ml/gcc-patches/2003-12/msg00803.html Thoughts? -- Alan Modra Australia Development Lab, IBM
Re: PR53914, rs6000 constraints and reload queries
On Fri, Jul 20, 2012 at 12:05:28AM +0930, Alan Modra wrote: On Thu, Jul 19, 2012 at 12:34:25PM +0930, Alan Modra wrote: and fixes pr54009. David, in looking over this today, I realised that this bug isn't completely fixed. I stopped gcc emitting an offset of 32768, but that isn't enough. lo_sum addresses can't be offset, except when you know the alignment of the object being addressed guarantees that no part crosses a 32k boundary. For example, given lis 9,x@ha; lwz 3,x@l(9); lwz 4,x+4@l(9); we run into trouble if x happens to reside at n*64k + 32764. The final address matters, not just the offset in the insn. So I have some changes to mem_operand_gpr and rs6000_secondary_reload. I'll post a patch on top of the previous one. Actually, maybe I won't. When I disabled offsetting LO_SUMs, except for tocrefs (which we know are aligned), I ran into an "insn does not satisfy its constraints" ICE on gcc.dg/pr28795-2. The cause: code in rs6000_legitimize_reload_address near the comment starting /* Don't do this for TFmode. There we have a list of modes from which we make assumptions about the regs used, and get it wrong for ppc32. Reload decided to use 2 gprs for a DFmode. We've been doing this forever, and almost always the mem is aligned. So fixing an esoteric testcase like the one in pr54009 would come at the expense of normal code quality. Unless I can reliably figure out regs used and alignment of syms in rs6000_legitimize_reload_address, and as far as I can tell that isn't possible. Another thing. I reckon we can do without the 'wY' constraint. I implemented it because movtf_internal currently uses an 'o' constraint, but it seems to me that rs6000_legitimate_address_p already prevents non-offsettable TFmode addresses for mems accessed by fprs. Geoff introduced the 'o' on movtf here: http://gcc.gnu.org/ml/gcc-patches/2003-12/msg00803.html Thoughts? This on the other hand works.
Please consider the patch amended to remove mem_operand_fpr, the 'wY' constraint, and uses thereof replaced with m. -- Alan Modra Australia Development Lab, IBM
Re: PowerPC shrink-wrap support 3 of 3
On Thu, Oct 27, 2011 at 12:24:46AM +1030, Alan Modra wrote: more code than duplicating epilogues. From what I've seen, the duplicate tails are generally very small. I guess I should dump out some info so we can get a better idea. There were 545 occurrences of shrink-wrap in the gcc/ dir for a -O2 powerpc-linux bootstrap. I counted active insns in duplicated blocks. 244 had zero insns (all the ones I looked at were just uses of r3), 96 had one insn (all the ones I looked at were setting r3), with the remainder being: cfgrtl.c.199r.pro_and_epilogue:Duplicating bb 6, 2 insns. tail call function.c.199r.pro_and_epilogue:Duplicating bb 17, 2 insns. setting a value via pointer (*poffset in instantiate_new_reg) insn-automata.c.199r.pro_and_epilogue:Duplicating bb 229, 2 insns. tail call varasm.c.199r.pro_and_epilogue:Duplicating bb 48, 2 insns. setting a global var rs6000.c.199r.pro_and_epilogue:Duplicating bb 300, 3 insns. tail call rs6000.c.199r.pro_and_epilogue:Duplicating bb 8, 3 insns. tail call toplev.c.199r.pro_and_epilogue:Duplicating bb 5, 4 insns. loading two word return value from random_seed var. reginfo.c.199r.pro_and_epilogue:Duplicating bb 4, 7 insns. setting a two word global var and r3 dbxout.c.199r.pro_and_epilogue:Duplicating bb 4, 8 insns. tail call dbxout.c.199r.pro_and_epilogue:Duplicating bb 4, 8 insns. tail call Note that having 350 duplicated blocks doesn't mean we had 350 extra cases of shrink-wrapping, because some tails were two blocks, and I recall seeing a case when developing my code where one function had two duplicated tails. Certainly many more shrink wrapped functions though. :-) -- Alan Modra Australia Development Lab, IBM
Re: PowerPC shrink-wrap support 3 of 3
; ! else ! pending_edge_cold = e; ! } ! ! FOR_EACH_VEC_ELT (edge, unconverted_simple_returns, i, e) { basic_block *pdest_bb; + edge pending; ! if (BB_PARTITION (e-src) == BB_HOT_PARTITION) ! { ! pdest_bb = simple_return_block_hot; ! pending = pending_edge_hot; ! } else ! { ! pdest_bb = simple_return_block_cold; ! pending = pending_edge_cold; ! } ! ! if (*pdest_bb == NULL pending != NULL) ! { ! emit_return_into_block (true, pending-src); ! pending-flags = ~(EDGE_FALLTHRU | EDGE_FAKE); ! *pdest_bb = pending-src; ! } ! else if (*pdest_bb == NULL) { basic_block bb; rtx start; *** epilogue_done: *** 6219,6225 } redirect_edge_and_branch_force (e, *pdest_bb); } ! VEC_free (rtx, heap, unconverted_simple_returns); } #endif --- 6414,6432 } redirect_edge_and_branch_force (e, *pdest_bb); } ! VEC_free (edge, heap, unconverted_simple_returns); ! } ! ! if (entry_edge != orig_entry_edge) ! { ! FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR-preds) ! if (EDGE_COUNT (e-src-preds) != 0 !(e-flags EDGE_FAKE) != 0 !!bitmap_bit_p (bb_flags, e-src-index)) ! { ! emit_return_into_block (true, e-src); ! e-flags = ~(EDGE_FALLTHRU | EDGE_FAKE); ! } } #endif *** epilogue_done: *** 6233,6240 if (!CALL_P (insn) || ! SIBLING_CALL_P (insn) || (entry_edge != orig_entry_edge ! !bitmap_bit_p (bb_flags, bb-index))) { ei_next (ei); continue; --- 6440,6450 if (!CALL_P (insn) || ! SIBLING_CALL_P (insn) + #ifdef HAVE_simple_return || (entry_edge != orig_entry_edge ! !bitmap_bit_p (bb_flags, bb-index)) ! #endif ! ) { ei_next (ei); continue; *** epilogue_done: *** 6281,6287 --- 6491,6499 } #endif + #ifdef HAVE_simple_return bitmap_clear (bb_flags); + #endif /* Threading the prologue and epilogue changes the artificial refs in the entry and exit blocks. */ -- Alan Modra Australia Development Lab, IBM
Re: PowerPC shrink-wrap support 3 of 3
On Tue, Nov 01, 2011 at 12:57:22AM +1030, Alan Modra wrote: Bits left to do - limit size of duplicated tails Done here. Also fixes a hole in that I took no notice of targetm.cannot_copy_insn_p when duplicating tails. One interesting result is that the tail duplication actually reduces the text size of libstdc++.so from 1074042 to 1073478 bytes on powerpc-linux. The reason being that a shrink-wrapped function that needs a prologue only on paths ending in a sibling call will lose one copy of the epilogue. That must happen enough to more than make up for duplicated tails. Bootstrapped and regression tested powerpc-linux. OK to apply? (And I won't be posting any more versions of the patch until this is reviewed. Please excuse me for spamming the list.) * function.c (bb_active_p): Delete. (dup_block_and_redirect, active_insn_between): New functions. (convert_jumps_to_returns, emit_return_for_exit): New functions, split out from.. (thread_prologue_and_epilogue_insns): ..here. Delete shadowing variables. Don't do prologue register clobber tests when shrink wrapping already failed. Delete all last_bb_active code. Instead compute tail block candidates for duplicating exit path. Remove these from antic set. Duplicate tails when reached from both blocks needing a prologue/epilogue and blocks not needing such. * ifcvt.c (dead_or_predicable): Test both flag_shrink_wrap and HAVE_simple_return. * bb-reorder.c (get_uncond_jump_length): Make global. * bb-reorder.h (get_uncond_jump_length): Declare. * cfgrtl.c (rtl_create_basic_block): Comment typo fix. (rtl_split_edge): Likewise. Warning fix. (rtl_duplicate_bb): New function. (rtl_cfg_hooks): Enable can_duplicate_block_p and duplicate_block. Index: gcc/function.c === --- gcc/function.c (revision 180588) +++ gcc/function.c (working copy) @@ -65,6 +65,8 @@ along with GCC; see the file COPYING3. #include df.h #include timevar.h #include vecprim.h +#include params.h +#include bb-reorder.h /* So we can assign to cfun in this file. 
*/ #undef cfun @@ -5290,8 +5292,6 @@ requires_stack_frame_p (rtx insn, HARD_R HARD_REG_SET hardregs; unsigned regno; - if (!INSN_P (insn) || DEBUG_INSN_P (insn)) -return false; if (CALL_P (insn)) return !SIBLING_CALL_P (insn); @@ -5514,23 +5514,186 @@ set_return_jump_label (rtx returnjump) JUMP_LABEL (returnjump) = ret_rtx; } -/* Return true if BB has any active insns. */ +#ifdef HAVE_simple_return +/* Create a copy of BB instructions and insert at BEFORE. Redirect + preds of BB to COPY_BB if they don't appear in NEED_PROLOGUE. */ +static void +dup_block_and_redirect (basic_block bb, basic_block copy_bb, rtx before, + bitmap_head *need_prologue) +{ + edge_iterator ei; + edge e; + rtx insn = BB_END (bb); + + /* We know BB has a single successor, so there is no need to copy a + simple jump at the end of BB. */ + if (simplejump_p (insn)) +insn = PREV_INSN (insn); + + start_sequence (); + duplicate_insn_chain (BB_HEAD (bb), insn); + if (dump_file) +{ + unsigned count = 0; + for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) + if (active_insn_p (insn)) + ++count; + fprintf (dump_file, Duplicating bb %d to bb %d, %u active insns.\n, + bb-index, copy_bb-index, count); +} + insn = get_insns (); + end_sequence (); + emit_insn_before (insn, before); + + /* Redirect all the paths that need no prologue into copy_bb. */ + for (ei = ei_start (bb-preds); (e = ei_safe_edge (ei)); ) +if (!bitmap_bit_p (need_prologue, e-src-index)) + { + redirect_edge_and_branch_force (e, copy_bb); + continue; + } +else + ei_next (ei); +} +#endif + +#if defined (HAVE_return) || defined (HAVE_simple_return) +/* Return true if there are any active insns between HEAD and TAIL. */ static bool -bb_active_p (basic_block bb) +active_insn_between (rtx head, rtx tail) { + while (tail) +{ + if (active_insn_p (tail)) + return true; + if (tail == head) + return false; + tail = PREV_INSN (tail); +} + return false; +} + +/* LAST_BB is a block that exits, and empty of active instructions. 
+ Examine its predecessors for jumps that can be converted to + (conditional) returns. */ +static VEC (edge, heap) * +convert_jumps_to_returns (basic_block last_bb, bool simple_p, + VEC (edge, heap) *unconverted ATTRIBUTE_UNUSED) +{ + int i; + basic_block bb; rtx label; + edge_iterator ei; + edge e; + VEC(basic_block,heap) *src_bbs; + + src_bbs = VEC_alloc (basic_block, heap, EDGE_COUNT (last_bb-preds)); + FOR_EACH_EDGE (e, ei, last_bb-preds) +if (e-src != ENTRY_BLOCK_PTR) + VEC_quick_push (basic_block
Re: [PATCH, rs6000] Preserve link stack for 476 cpus
On Tue, Nov 01, 2011 at 02:00:25PM -0500, Peter Bergner wrote: (get_ppc476_thunk_name): New function. (rs6000_code_end): Likewise. rs6000.c:27968:1: error: 'void rs6000_code_end()' defined but not used [-Werror=unused-function] cc1plus: all warnings being treated as errors Bootstrapped and committed as obvious, revision 180761. * config/rs6000/rs6000.c (rs6000_code_end): Declare ATTRIBUTE_UNUSED. Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 180754) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -1176,6 +1176,7 @@ static void rs6000_trampoline_init (rtx, static bool rs6000_cannot_force_const_mem (enum machine_mode, rtx); static bool rs6000_legitimate_constant_p (enum machine_mode, rtx); static bool rs6000_save_toc_in_prologue_p (void); +static void rs6000_code_end (void) ATTRIBUTE_UNUSED; /* Hash table stuff for keeping track of TOC entries. */ -- Alan Modra Australia Development Lab, IBM
[rs6000] Fix PR 50906, eh_frame and other woes
, can_use_exit);
+				 sp_offset, can_use_exit);
 
       if (DEFAULT_ABI == ABI_DARWIN)
 	/* we only need a copy, no fprs were saved.  */
-	emit_move_insn (gen_rtx_REG (reg_mode, 11), frame_reg_rtx);
+	emit_move_insn (gen_rtx_REG (Pmode, 11), frame_reg_rtx);
 
       if (info->cr_save_p)
 	rs6000_restore_saved_cr (cr_save_reg, using_mtcr_multiple);
     }
   else
     {
-      emit_insn (gen_add3_insn (gen_rtx_REG (Pmode, DEFAULT_ABI == ABI_AIX
-					     ? 12 : 11),
-				frame_reg_rtx,
+      rtx src_reg = gen_rtx_REG (Pmode, DEFAULT_ABI == ABI_AIX ? 12 : 11);
+
+      emit_insn (gen_add3_insn (src_reg, frame_reg_rtx,
 				GEN_INT (sp_offset - info->fp_size)));
-      if (REGNO (frame_reg_rtx) == 11)
-	sp_offset += info->fp_size;
+      if (REGNO (frame_reg_rtx) == REGNO (src_reg))
+	sp_offset = info->fp_size;
     }
 
   rs6000_emit_savres_rtx (info, frame_reg_rtx,
-- Alan Modra Australia Development Lab, IBM
Re: Reload related segfaults
Ping http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02429.html Eric fixed the bootstrap breakage by another patch, but there is a fairly obvious bug in gen_reload fixed by my patch. I gave enough context in the diff that you don't even need to look at the file. :) -- Alan Modra Australia Development Lab, IBM
[rs6000] fix PR 30282, load from stack moved past stack update
This patch fixes PR30282, caused by instructions being scheduled over the stack reset. Note that only for ABI_V4 do we currently have frame_reg_rtx != sp_reg_rtx in rs6000_emit_stack_reset, so the patch doesn't emit *fewer* blockages. I did benchmark this change and saw nothing but the usual benchmark noise. Bootstrapped etc. powerpc-linux. OK to apply?

	* config/rs6000/rs6000.c (rs6000_emit_stack_reset): Always emit
	blockage for ABI_V4.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 177309)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -19689,7 +19689,7 @@ rs6000_emit_stack_reset (rs6000_stack_t
     {
       /* This blockage is needed so that sched doesn't decide to move
 	 the sp change before the register restores.  */
-      if (frame_reg_rtx != sp_reg_rtx
+      if (DEFAULT_ABI == ABI_V4
 	  || (TARGET_SPE_ABI
 	      && info->spe_64bit_regs_used != 0
 	      && info->first_gp_reg_save != 32))
-- Alan Modra Australia Development Lab, IBM
Re: [rs6000] Fix PR 50906, eh_frame and other woes
On Fri, Nov 04, 2011 at 06:19:07PM +, Iain Sandoe wrote: I did wonder, when implementing the out-of-line saves for Darwin, whether the setting in sysv4.h is a bit optimistic in firing for r31 when -Os is on. whether it saves space, in reality, would presumably depend on whether branch islands were inserted to reach the common code. Yes, but that doesn't happen on powerpc-linux until code size goes above 33M. We get a (small) gain otherwise when able to use the exit versions of the out-of-line restores. Bootstrapped this, Thanks! -- Alan Modra Australia Development Lab, IBM
Re: [rs6000] Fix PR 50906, eh_frame and other woes
On Mon, Nov 07, 2011 at 10:53:51AM +0100, Olivier Hainque wrote: Another bug we're running into here is an unwarranted move of the sp restore prior to register fetches despite an attempt at preventing that with a stack_tie instruction (VxWorks target). http://gcc.gnu.org/ml/gcc/2011-03/msg00123.html The failure can still be exposed on mainline with a minor adjustment to the C testcase quoted in the msg. Even after revision 181056? -- Alan Modra Australia Development Lab, IBM
Re: [rs6000] Fix PR 50906, eh_frame and other woes
On Mon, Nov 07, 2011 at 04:10:38PM +0000, Joseph S. Myers wrote: FWIW, when I encountered such a problem in 4.4-based tools I found I needed the following change to stack_tie patterns to fix it (applied to and rediffed against trunk, but not tested there), in addition to a DEFAULT_ABI == ABI_V4 check such as you added. Given all the confusion about this class of bugs and whether they should be addressed in each back end or in a target-independent way, I didn't try to send this upstream at the time.

2011-04-11  Joseph Myers  <jos...@codesourcery.com>

	* config/rs6000/rs6000.md (stack_tie): Use (mem:BLK (scratch))
	as output of SET.
	(restore_stack_block, restore_stack_nonlocal): Update UNSPEC_TIE
	patterns.

OK, so you made the stack ties conflict with all memory. I wondered about that too, but didn't find a testcase where it mattered. It would certainly be safer to use the big hammer, but on the other hand it can restrict scheduling unnecessarily. For example

void byref (int *a)
{
  int x;
  asm ("#%0 set here" : "=r" (x) : : "r29", "r30");
  *a = x;
}

with the existing stack ties, the write to *a happens after stack deallocation. -- Alan Modra Australia Development Lab, IBM
Re: [rs6000] Fix PR 50906, eh_frame and other woes
On Tue, Nov 08, 2011 at 12:30:36AM +0000, Joseph S. Myers wrote: On Tue, 8 Nov 2011, Alan Modra wrote: OK, so you made the stack ties conflict with all memory. I wondered about that too, but didn't find a testcase where it mattered. The test I had was one of those from the various PRs for this issue.

int find_num (int i)
{
  const int arr[5] = {0, 1, 2, 3, 4};
  return arr[i];
}

(-O2 -mcpu=603). But I don't know if it reproduces with FSF 4.4 or trunk; there were plenty of other local changes in the 4.4-based tree where I found this to be necessary. The testcase generates good code on current gcc-4.7, gcc-4.6, and gcc-4.5, but fails on current gcc-4.4. I thought I had this PR fixed on all active branches. :-( It's been a while since I looked at what was happening with this testcase, but from memory what stops scheduling over the stack_tie is alias.c:base_alias_check, which relies on tracking registers to see whether two memory accesses using different regs could alias. Quite possibly gcc-4.4 is deficient in this area. -- Alan Modra Australia Development Lab, IBM
Re: [rs6000] Fix PR 50906, eh_frame and other woes
On Tue, Nov 08, 2011 at 11:37:57AM +0100, Olivier Hainque wrote: Joseph resorted to mem:scratch to impose a strong barrier. That's certainly safe and I don't think the performance impact can be significant, so this looks like a good way out. I agree. Our fancy powerpc stack deallocation barriers rely too much on alias analysis, which as you've shown (thanks!) isn't reliable in this instance. Other targets just use the mem:scratch barrier. I tried an approach in between here: replace the stack_tie insn by a frame_tie insn with frame_base_rtx as the second base. In the case above, with frame_base_rtx==r11, this is something like (set (mem/c:BLK (reg:SI 11 11) [0 A8]) (unspec:BLK [ (mem/c:BLK (reg:SI 11 11) [0 A8]) (mem/c:BLK (reg/f:SI 1 1) [0 A8]) Looks similar to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30282#c15 -- Alan Modra Australia Development Lab, IBM
Re: CFG review needed for fix of PowerPC shrink-wrap support 3 of 3
On Mon, Nov 14, 2011 at 07:48:03AM -1000, Richard Henderson wrote: On 11/14/2011 04:10 AM, Hans-Peter Nilsson wrote: Looks like all we need is a positive review of http://gcc.gnu.org/ml/gcc-patches/2011-11/msg01409.html and a ChangeLog entry to unbreak three or more targets. Someone with approval rights: pretty please? That patch is ok. Sorry for the bootstrap problems. I believe the cause is my extraction of convert_jumps_to_returns and emit_return_for_exit. The new HAVE_return code misses a single_succ_p test (covered by the bitmap_bit_p (bb_flags, ...) test in the HAVE_simple_return case). I haven't really looked into what Bernd's fix does. I know this one fixes what I broke..

	* function.c (thread_prologue_and_epilogue_insns): Guard emitting
	return with single_succ_p test.

Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 181188)
+++ gcc/function.c	(working copy)
@@ -6230,7 +6230,8 @@ thread_prologue_and_epilogue_insns (void
 	      && !active_insn_between (BB_HEAD (last_bb), BB_END (last_bb)))
 	    convert_jumps_to_returns (last_bb, false, NULL);
 
-	  if (EDGE_COUNT (exit_fallthru_edge->src->preds) != 0)
+	  if (EDGE_COUNT (last_bb->preds) != 0
+	      && single_succ_p (last_bb))
 	    {
 	      last_bb = emit_return_for_exit (exit_fallthru_edge, false);
 	      epilogue_end = returnjump = BB_END (last_bb);
-- Alan Modra Australia Development Lab, IBM
Re: [PATCH, testsuite]: Fix timeout in simulate-thread.exp
On Thu, Nov 17, 2011 at 10:54:23AM +0100, Uros Bizjak wrote: The solution for this problem is to search for simulate_thread_done in gdb log. Thanks! I like this for another reason too: I've seen these tests fail due to gdb never reporting simulate_thread_main on any instruction in that function. -- Alan Modra Australia Development Lab, IBM
powerpc lack of memory_barrier
Lacking this pattern means the builtin __sync_synchronize() on powerpc is just an asm with a memory clobber (see builtins.c), which is hardly a full memory barrier as extend.texi says it should be. This patch fixes multiple libgomp testsuite failures. Bootstrapped and regression tested powerpc-linux. OK for mainline?

	* config/rs6000/sync.md: Duplicate hwsync as memory_barrier.

Index: gcc/config/rs6000/sync.md
===================================================================
--- gcc/config/rs6000/sync.md	(revision 181425)
+++ gcc/config/rs6000/sync.md	(working copy)
@@ -53,6 +53,15 @@ (define_expand "mem_thread_fence"
   DONE;
 })
 
+(define_expand "memory_barrier"
+  [(set (match_dup 0)
+	(unspec:BLK [(match_dup 0)] UNSPEC_SYNC))]
+  ""
+{
+  operands[0] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
+  MEM_VOLATILE_P (operands[0]) = 1;
+})
+
 (define_expand "hwsync"
   [(set (match_dup 0)
 	(unspec:BLK [(match_dup 0)] UNSPEC_SYNC))]
-- Alan Modra Australia Development Lab, IBM
RFC: powerpc libgomp locking
This is part of work in progress getting locking back into shape on powerpc. (If we ever were in shape, which is doubtful.) Using the ia64 version means we have a needless sync at the start of gomp_mutex_lock (courtesy of __sync_val_compare_and_swap) and a needless isync at the end of gomp_mutex_unlock (slightly odd use of __sync_lock_test_and_set to release a lock). It would be nice to get rid of the __sync_val_compare_and_swap in config/linux/mutex.c too. * config/linux/powerpc/mutex.h: Rewrite. Index: libgomp/config/linux/powerpc/mutex.h === --- libgomp/config/linux/powerpc/mutex.h(revision 181425) +++ libgomp/config/linux/powerpc/mutex.h(working copy) @@ -1,2 +1,75 @@ -/* On PowerPC __sync_lock_test_and_set isn't a full barrier. */ -#include config/linux/ia64/mutex.h +/* Copyright (C) 2011 Free Software Foundation, Inc. + Contributed by Alan Modra amo...@au1.ibm.com + + This file is part of the GNU OpenMP Library (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + http://www.gnu.org/licenses/. */ + +/* This is a Linux specific implementation of a mutex synchronization + mechanism for libgomp. This type is private to the library. 
This
+   implementation uses atomic instructions and the futex syscall.  */
+
+#ifndef GOMP_MUTEX_H
+#define GOMP_MUTEX_H 1
+
+typedef int gomp_mutex_t;
+extern void gomp_mutex_lock_slow (gomp_mutex_t *mutex, int);
+extern void gomp_mutex_unlock_slow (gomp_mutex_t *mutex);
+
+#define GOMP_MUTEX_INIT 0
+
+enum memmodel
+{
+  MEMMODEL_RELAXED = 0,
+  MEMMODEL_CONSUME = 1,
+  MEMMODEL_ACQUIRE = 2,
+  MEMMODEL_RELEASE = 3,
+  MEMMODEL_ACQ_REL = 4,
+  MEMMODEL_SEQ_CST = 5,
+  MEMMODEL_LAST = 6
+};
+
+static inline void gomp_mutex_init (gomp_mutex_t *mutex)
+{
+  *mutex = 0;
+}
+
+static inline void gomp_mutex_destroy (gomp_mutex_t *mutex)
+{
+}
+
+static inline void gomp_mutex_lock (gomp_mutex_t *mutex)
+{
+  int oldval = 0;
+
+  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
+			       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+  if (__builtin_expect (oldval != 0, 0))
+    gomp_mutex_lock_slow (mutex, oldval);
+}
+
+static inline void gomp_mutex_unlock (gomp_mutex_t *mutex)
+{
+  int oldval = __atomic_exchange_4 (mutex, 0, MEMMODEL_RELEASE);
+
+  if (__builtin_expect (oldval > 1, 0))
+    gomp_mutex_unlock_slow (mutex);
+}
+#endif /* GOMP_MUTEX_H */
-- Alan Modra Australia Development Lab, IBM
Re: RFC: powerpc libgomp locking
On Thu, Nov 17, 2011 at 02:56:31PM -1000, Richard Henderson wrote: On 11/17/2011 01:56 AM, Alan Modra wrote: This is part of work in progress getting locking back into shape on powerpc. (If we ever were in shape, which is doubtful.) Using the ia64 version means we have a needless sync at the start of gomp_mutex_lock (courtesy of __sync_val_compare_and_swap) and a needless isync at the end of gomp_mutex_unlock (slightly odd use of __sync_lock_test_and_set to release a lock). It would be nice to get rid of the __sync_val_compare_and_swap in config/linux/mutex.c too. We should be able to use this new file as a replacement for all targets, not just PPC. And, yes, updating all of the __sync uses in libgomp would be a good thing.

+  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
+			       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+  if (__builtin_expect (oldval != 0, 0))
+    gomp_mutex_lock_slow (mutex, oldval);

__atomic_compare_exchange_4 returns a boolean success:

  if (__builtin_expect (!__atomic_compare_exchange_4 (mutex, &oldval, 1, false,
						      MEMMODEL_ACQUIRE,
						      MEMMODEL_RELAXED), 0))

In theory that should avoid an extra comparison on some targets. That was the way I wrote it at first. This way gives three fewer insns on powerpc, because insns setting up the boolean are elided, as is the comparison of the boolean against zero. An oldval != 0 comparison insn is also not needed because __atomic_compare_exchange already did that for us. BTW, we're left with dead stores into oldval's stack slot. Is it legal for the first two parameters of __atomic_compare_exchange to alias? I'm guessing that the stores might disappear if we tell gcc that they can't alias. -- Alan Modra Australia Development Lab, IBM
Re: powerpc compare_and_swap fails
On Thu, Nov 17, 2011 at 12:24:08PM -1000, Richard Henderson wrote: On 11/17/2011 10:25 AM, Richard Henderson wrote: so that we don't write back into the subreg. Like so. Tested on ppc64-linux and committed. Thanks! This, along with my mutex.h rewrite plus adding a memory_barrier insn gets me down to just one libgomp failure. -- Alan Modra Australia Development Lab, IBM
Re: [brobec...@adacore.com: Re: [PATCH 18/348] Fix -Wsahdow warnings]
On Wed, Nov 23, 2011 at 12:05:56PM -0800, Joel Brobecker wrote: We were looking at -Wshadow warnings that we thought were counter productive, and found that you had written a patch that would help us. The patch was okayed, but never applied. Is there a reason for it? I must have missed seeing the OK. Committed revision 181684. -- Alan Modra Australia Development Lab, IBM
Fix libgomp semaphores
This fixes PR51249, a failure caused by insufficient care in waking threads on sem_post. It's quite a tricky business to get right, but I believe this rewrite is now correct. I've also converted over lock.c, mutex.h and mutex.c to use the new atomic builtins. This means no target should need target-specific versions of mutex.h. mutex-lock.h came about because at one stage I was trying out semaphore and lock implementations that used a lock/count int and an nwaiters int to track the number of threads waiting. That turned out to be a really bad idea. You need barriers all over the place to ensure you don't see an inconsistent lock/semaphore state, and extra barriers are costly. It's much better to pay the price of an extra futex_wake system call when transitioning from contended to uncontended. Anyway, I left the lock abstraction as a good thing. I also changed the mutex implementation to use -1 to mean locked and (possibly) some thread waiting on the lock, rather than using 2. The very minor benefit is that some targets may use one less instruction testing 0 compared to testing 1. It's also just that little bit more like the semaphore implementation. Bootstrapped and regression tested powerpc64-linux. I do still see the occasional testsuite failures on power7 (see pr51298), but the semaphore failures are gone.

	PR libgomp/51249
	* config/linux/mutex-lock.h: New.
	* config/linux/omp-lock.h: Include above.
	(omp_lock_t, omp_nest_lock_t, omp_lock_25_t): Use gomp_mutex_t.
	* config/linux/lock.c: Use atomic rather than sync builtins.
	* config/linux/mutex.h: Likewise.
	* config/linux/mutex.c: Likewise.  Comment.  Use -1 for waiting state.
	* config/linux/sem.h: Rewrite.
	* config/linux/sem.c: Rewrite.
	* config/linux/arm/mutex.h: Delete.
	* config/linux/powerpc/mutex.h: Delete.
	* config/linux/ia64/mutex.h: Delete.
	* config/linux/mips/mutex.h: Delete.
Index: libgomp/config/linux/mutex-lock.h
===================================================================
--- libgomp/config/linux/mutex-lock.h	(revision 0)
+++ libgomp/config/linux/mutex-lock.h	(revision 0)
@@ -0,0 +1,35 @@
+/* Copyright (C) 2011 Free Software Foundation, Inc.
+   Contributed by Alan Modra, IBM.
+
+   This file is part of the GNU OpenMP Library (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This type is private to the library.  */
+
+#ifndef MUTEX_LOCK_H
+#define MUTEX_LOCK_H 1
+
+typedef struct linux_gomp_mutex
+{
+  /* 0 unlocked, 1 locked, -1 locked and possibly someone waiting.  */
+  int flag;
+} gomp_mutex_t;
+#endif
Index: libgomp/config/linux/omp-lock.h
===================================================================
--- libgomp/config/linux/omp-lock.h	(revision 181683)
+++ libgomp/config/linux/omp-lock.h	(working copy)
@@ -3,10 +3,12 @@
    structures without polluting the namespace.
 
    When using the Linux futex primitive, non-recursive locks require
-   only one int.  Recursive locks require we identify the owning task
-   and so require one int and a pointer.  */
+   one int.  Recursive locks require we identify the owning task
+   and so require in addition one int and a pointer.
 */
 
-typedef int omp_lock_t;
-typedef struct { int lock, count; void *owner; } omp_nest_lock_t;
-typedef int omp_lock_25_t;
+#include "mutex-lock.h"
+
+typedef gomp_mutex_t omp_lock_t;
+typedef struct { omp_lock_t lock; int count; void *owner; } omp_nest_lock_t;
+typedef gomp_mutex_t omp_lock_25_t;
 typedef struct { int owner, count; } omp_nest_lock_25_t;
Index: libgomp/config/linux/mutex.h
===================================================================
--- libgomp/config/linux/mutex.h	(revision 181683)
+++ libgomp/config/linux/mutex.h	(working copy)
@@ -1,4 +1,4 @@
-/* Copyright (C) 2005, 2009, 2011 Free Software Foundation, Inc.
+/* Copyright (C) 2011 Free Software Foundation, Inc.
    Contributed by Richard Henderson r...@redhat.com.
 
    This file is part of the GNU OpenMP Library (libgomp).
@@ -29,43 +29,39 @@
 #ifndef
Re: Fix libgomp semaphores
On Fri, Nov 25, 2011 at 08:38:39AM +0100, Jakub Jelinek wrote:
My preference would be to avoid the abstraction changes though, both because it is additional clutter in the changeset and because omp_lock and nested lock are part of public ABIs, so if the struct is laid out differently on some weird architecture, it would be an ABI change.

OK, fair enough. I didn't consider that structs may be laid out differently.

So, if you could keep gomp_mutex_t, omp_lock_t and gomp_sem_t as integers, it would be appreciated. Furthermore, I'd prefer if the patch could be split into smaller parts, e.g. for bisecting purposes. One patch would do the mutex changes to use new atomics, remove extra mutex.h headers and start using 0/1/-1 instead of 0/1/2. And another patch would rewrite the semaphores.

OK. I need to do this anyway as I just discovered a regression when looping on one of the tests. I suspect the acquire/release mutex locking may have exposed bugs elsewhere in libgomp that were covered by the heavyweight locking used by the __sync builtins.
Re: Fix libgomp semaphores
/config/linux/sem.c	(revision 181718)
+++ libgomp/config/linux/sem.c	(working copy)
@@ -28,34 +28,69 @@
 #include "wait.h"
 
 void
-gomp_sem_wait_slow (gomp_sem_t *sem)
+gomp_sem_wait_slow (gomp_sem_t *sem, int count)
 {
-  while (1)
+  int oldval, newval;
+
+  /* First loop spins a while.  */
+  while (count == 0)
     {
-      int val = __sync_val_compare_and_swap (sem, 0, -1);
-      if (val > 0)
+      if (do_spin (sem, 0))
+	{
+	  /* Spin timeout, nothing changed.  Set waiting flag.  */
+	  oldval = 0;
+	  __atomic_compare_exchange_4 (sem, &oldval, 0x80000000, false,
+				       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+	  count = oldval;
+	  if (oldval == 0)
+	    {
+	      futex_wait (sem, 0x80000000);
+	      count = *sem;
+	    }
+	  break;
+	}
+      /* Something changed.  If positive, we're good to go.  */
+      else if ((count = *sem) > 0)
	{
-	  if (__sync_bool_compare_and_swap (sem, val, val - 1))
+	  oldval = count;
+	  __atomic_compare_exchange_4 (sem, &oldval, count - 1, false,
+				       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+	  if (oldval == count)
	    return;
+	  count = oldval;
	}
-      do_wait (sem, -1);
+    }
+
+  /* Second loop waits until semaphore is posted.  We always exit this
+     loop with wait flag set, so next post will awaken a thread.  */
+  while (1)
+    {
+      oldval = count;
+      newval = 0x80000000;
+      if ((count & 0x7fffffff) != 0)
+	newval |= count - 1;
+      __atomic_compare_exchange_4 (sem, &oldval, newval, false,
+				   MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+      if (oldval == count)
+	{
+	  if ((count & 0x7fffffff) != 0)
+	    {
+	      /* If we can wake more threads, do so now.  */
+	      if ((count & 0x7fffffff) > 1)
+		gomp_sem_post_slow (sem);
+	      break;
+	    }
+	  do_wait (sem, 0x80000000);
+	  count = *sem;
+	}
+      else
+	count = oldval;
     }
 }
 
 void
 gomp_sem_post_slow (gomp_sem_t *sem)
 {
-  int old, tmp = *sem, wake;
-
-  do
-    {
-      old = tmp;
-      wake = old > 0 ? old + 1 : 1;
-      tmp = __sync_val_compare_and_swap (sem, old, wake);
-    }
-  while (old != tmp);
-
-  futex_wake (sem, wake);
+  futex_wake (sem, 1);
 }
-- 
Alan Modra
Australia Development Lab, IBM
Use atomics in libgomp mutex
that it will be useful, but WITHOUT ANY
-   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
-   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
-   more details.
-
-   Under Section 7 of GPL version 3, you are granted additional
-   permissions described in the GCC Runtime Library Exception, version
-   3.1, as published by the Free Software Foundation.
-
-   You should have received a copy of the GNU General Public License and
-   a copy of the GCC Runtime Library Exception along with this program;
-   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-/* MIPS needs the same correct usage of __sync_synchronize and
-   __sync_lock_test_and_set as ia64.  So we just use its mutex.h.  */
-
-#include "config/linux/ia64/mutex.h"
-- 
Alan Modra
Australia Development Lab, IBM
Fix PR51298, libgomp barrier failure
-generation != state + 4);
+  while (gen != state + 4);
 }
 
 void
-- 
Alan Modra
Australia Development Lab, IBM
Re: Use atomics in libgomp mutex
On Mon, Nov 28, 2011 at 09:27:20AM +1030, Alan Modra wrote:
This is the mutex part. Depends on http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02454.html for MEMMODEL_* values.

Arrgh, I posted the wrong patch. I know it needs work elsewhere in libgomp to comply with the OMP spec, which says of flush:

  A flush region without a list is implied at the following locations:
  • During a barrier region.
  • At entry to and exit from parallel, critical, and ordered regions.
  [snip]

Yet GOMP_critical_start simply calls gomp_mutex_lock, relying on the lock acquisition to do the flush. This won't necessarily happen with MEMMODEL_ACQUIRE. This bit in gomp_mutex_lock

+  __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
+			       MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);

should be

   __atomic_compare_exchange_4 (mutex, &oldval, 1, false,
			        MEMMODEL_ACQ_REL, MEMMODEL_ACQ_REL);

and once you do that it's hard to justify the patch for stage3 as fixing a bug (unnecessary sync instructions emitted by __sync builtins).
Re: [build] Cleanup rs6000/t-ppccomm configurations (PR other/51022)
On Sun, Nov 27, 2011 at 08:36:38PM -0500, David Edelsohn wrote: I was hoping that other targets affected by the patch would test it as well. powerpc-linux and powerpc64-linux look good. -- Alan Modra Australia Development Lab, IBM
Re: Use atomics in libgomp mutex
On Mon, Nov 28, 2011 at 09:15:02AM +0100, Jakub Jelinek wrote: On Mon, Nov 28, 2011 at 10:37:09AM +1030, Alan Modra wrote: 14 A flush region without a list is implied at the following locations: 15 • During a barrier region. 16 • At entry to and exit from parallel, critical, and ordered regions. [snip] If it is just GOMP_critical_start, couldn't it use an extra barrier before resp. after it calls mutex_lock? I believe gomp_ordered_sync needs a barrier, and I'm unsure about GOMP_parallel_start but I think one is needed there too. Yes, putting the barriers where they are actually needed is the best solution, but I'm far from an OpenMP expert so am unsure whether these are the only places requiring a barrier. -- Alan Modra Australia Development Lab, IBM
Re: Fix PR51298, libgomp barrier failure
On Mon, Nov 28, 2011 at 09:13:27AM +0100, Jakub Jelinek wrote:
On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:

--- libgomp/config/linux/bar.c	(revision 181718)
+++ libgomp/config/linux/bar.c	(working copy)
@@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
   if (__builtin_expect ((state & 1) != 0, 0))
     {
       /* Next time we'll be awaiting TOTAL threads again.  */
-      bar->awaited = bar->total;
-      atomic_write_barrier ();
-      bar->generation += 4;
+      __atomic_store_4 (&bar->awaited, bar->total, MEMMODEL_RELEASE);
+      __atomic_add_fetch (&bar->generation, 4, MEMMODEL_ACQ_REL);
       futex_wake ((int *) &bar->generation, INT_MAX);
     }

The above is unfortunate, bar->generation should be only modified from a single thread at this point, but the __atomic_add_fetch will enforce there an atomic instruction on it rather than just some kind of barrier.

I will try without the atomic add and see what happens.
Re: Fix PR51298, libgomp barrier failure
On Mon, Nov 28, 2011 at 08:09:01PM +1030, Alan Modra wrote:
On Mon, Nov 28, 2011 at 09:13:27AM +0100, Jakub Jelinek wrote:
On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:

--- libgomp/config/linux/bar.c	(revision 181718)
+++ libgomp/config/linux/bar.c	(working copy)
@@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
   if (__builtin_expect ((state & 1) != 0, 0))
     {
       /* Next time we'll be awaiting TOTAL threads again.  */
-      bar->awaited = bar->total;
-      atomic_write_barrier ();
-      bar->generation += 4;
+      __atomic_store_4 (&bar->awaited, bar->total, MEMMODEL_RELEASE);
+      __atomic_add_fetch (&bar->generation, 4, MEMMODEL_ACQ_REL);
       futex_wake ((int *) &bar->generation, INT_MAX);
     }

The above is unfortunate, bar->generation should be only modified from a single thread at this point, but the __atomic_add_fetch will enforce there an atomic instruction on it rather than just some kind of barrier.

I will try without the atomic add and see what happens.

Seems to be fine. I took out the extra write barriers too, so we now just have two MEMMODEL_RELEASE atomic store barriers replacing the two old atomic_write_barriers, and a number of new acquire barriers.

	PR libgomp/51298
	* config/linux/bar.h: Use atomic rather than sync builtins.
	* config/linux/bar.c: Likewise.  Add missing acquire
	synchronisation on generation field.
	* task.c (gomp_barrier_handle_tasks): Regain lock so as to not
	double unlock.
Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 181718)
+++ libgomp/task.c	(working copy)
@@ -273,6 +273,7 @@ gomp_barrier_handle_tasks (gomp_barrier_
 	      gomp_team_barrier_done (&team->barrier, state);
 	      gomp_mutex_unlock (&team->task_lock);
 	      gomp_team_barrier_wake (&team->barrier, 0);
+	      gomp_mutex_lock (&team->task_lock);
 	    }
 	}
     }
Index: libgomp/config/linux/bar.h
===================================================================
--- libgomp/config/linux/bar.h	(revision 181770)
+++ libgomp/config/linux/bar.h	(working copy)
@@ -50,7 +50,7 @@ static inline void gomp_barrier_init (go
 static inline void
 gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
 {
-  __sync_fetch_and_add (&bar->awaited, count - bar->total);
+  __atomic_add_fetch (&bar->awaited, count - bar->total, MEMMODEL_ACQ_REL);
   bar->total = count;
 }
@@ -69,10 +69,8 @@ extern void gomp_team_barrier_wake (gomp
 static inline gomp_barrier_state_t
 gomp_barrier_wait_start (gomp_barrier_t *bar)
 {
-  unsigned int ret = bar->generation & ~3;
-  /* Do we need any barrier here or is __sync_add_and_fetch acting
-     as the needed LoadLoad barrier already?  */
-  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
+  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
+  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;
   return ret;
 }
Index: libgomp/config/linux/bar.c
===================================================================
--- libgomp/config/linux/bar.c	(revision 181770)
+++ libgomp/config/linux/bar.c	(working copy)
@@ -37,17 +37,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
     {
       /* Next time we'll be awaiting TOTAL threads again.
 */
       bar->awaited = bar->total;
-      atomic_write_barrier ();
-      bar->generation += 4;
+      __atomic_store_4 (&bar->generation, bar->generation + 4,
+			MEMMODEL_RELEASE);
       futex_wake ((int *) &bar->generation, INT_MAX);
     }
   else
     {
-      unsigned int generation = state;
       do
-	do_wait ((int *) &bar->generation, generation);
-      while (bar->generation == generation);
+	do_wait ((int *) &bar->generation, state);
+      while (__atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) == state);
     }
 }
@@ -81,15 +79,15 @@ gomp_team_barrier_wake (gomp_barrier_t *
 void
 gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  unsigned int generation;
+  unsigned int generation, gen;
 
   if (__builtin_expect ((state & 1) != 0, 0))
     {
       /* Next time we'll be awaiting TOTAL threads again.  */
       struct gomp_thread *thr = gomp_thread ();
       struct gomp_team *team = thr->ts.team;
+
       bar->awaited = bar->total;
-      atomic_write_barrier ();
       if (__builtin_expect (team->task_count, 0))
	{
	  gomp_barrier_handle_tasks (state);
@@ -97,7 +95,7 @@ gomp_team_barrier_wait_end (gomp_barrier
	}
       else
	{
-	  bar->generation = state + 3;
+	  __atomic_store_4 (&bar->generation, state + 3, MEMMODEL_RELEASE);
	  futex_wake ((int *) &bar->generation, INT_MAX);
	  return;
	}
@@ -107,12 +105,16 @@ gomp_team_barrier_wait_end (gomp_barrier
   do
     {
       do_wait ((int *) &bar->generation, generation
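As a rough illustration of the generation-counter scheme in bar.c, here is a hypothetical, stripped-down barrier: it spins where libgomp futex-waits, ignores the low two generation bits that the real code reserves for task handling, and the names are illustrative rather than libgomp's.

```c
typedef struct
{
  unsigned awaited;     /* threads still to arrive this round */
  unsigned total;       /* threads participating in the barrier */
  unsigned generation;  /* bumped by 4 each completed round */
} gomp_barrier_t;

static void
barrier_init (gomp_barrier_t *bar, unsigned count)
{
  bar->awaited = bar->total = count;
  bar->generation = 0;
}

/* The last thread to arrive resets the count and bumps the generation
   with a release store; the others wait with acquire loads until the
   generation changes, so stores before the barrier are visible after.  */
static void
barrier_wait (gomp_barrier_t *bar)
{
  unsigned gen = __atomic_load_n (&bar->generation, __ATOMIC_ACQUIRE);
  if (__atomic_add_fetch (&bar->awaited, -1, __ATOMIC_ACQ_REL) == 0)
    {
      __atomic_store_n (&bar->awaited, bar->total, __ATOMIC_RELAXED);
      __atomic_store_n (&bar->generation, gen + 4, __ATOMIC_RELEASE);
    }
  else
    while (__atomic_load_n (&bar->generation, __ATOMIC_ACQUIRE) == gen)
      ;  /* spin; libgomp calls do_wait/futex_wait here */
}
```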
Re: Fix libgomp semaphores
On Mon, Nov 28, 2011 at 05:23:37PM +0100, Jakub Jelinek wrote:
On Mon, Nov 28, 2011 at 09:23:43AM +1030, Alan Modra wrote:

+  int count = *sem;
+
+  while ((count & 0x7fffffff) != 0)
+    {
+      int oldval = count;
+      __atomic_compare_exchange_4 (sem, &oldval, count - 1,
+				   false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+      if (__builtin_expect (oldval == count, 1))
+	return;

Why aren't you instead testing the return value of __atomic_compare_exchange_4 (here and in other cases)?

If you use the return value on powerpc, you find that requires two load immediate instructions (loading zero and one), and a compare against zero. That makes three fewer instructions as written, because the oldval == count comparison has already been done inside the atomic sequence. I'd expect fewer on most other architectures unless they happen to have a compare and exchange instruction that sets condition bits (ie. Intel). Even on Intel the way I've written the code shouldn't take more instructions with a properly written cmpxchg rtl description. Does it? Hmm, I suppose you could argue that powerpc and others ought to not generate those three extra instructions when using the return value. I'll see about fixing powerpc.
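The two coding styles under discussion can be contrasted in isolation. These helpers are hypothetical (not libgomp code), using the __ATOMIC_* spellings of the MEMMODEL_* values:

```c
#include <stdbool.h>

/* Style A, as in the patch: CAS, then compare the expected-value slot
   against the value we started with.  On success the builtin leaves
   "oldval" untouched (it only rewrites it on failure), so oldval == count
   exactly when the CAS succeeded, reusing a comparison result already
   computed inside the atomic sequence on powerpc.  */
static bool
try_dec_a (int *sem, int count)
{
  int oldval = count;
  __atomic_compare_exchange_n (sem, &oldval, count - 1, false,
                               __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
  return oldval == count;
}

/* Style B, as Jakub suggests: use the builtin's boolean result.  On
   targets without a flag-setting compare-and-exchange this can cost
   extra instructions to materialise the 0/1 value.  */
static bool
try_dec_b (int *sem, int count)
{
  return __atomic_compare_exchange_n (sem, &count, count - 1, false,
                                      __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
```

Both report success identically; the argument in the thread is purely about the instruction sequences each form produces.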
Re: Fix libgomp semaphores
On Mon, Nov 28, 2011 at 04:00:10PM -0800, Richard Henderson wrote:
On 11/27/2011 02:53 PM, Alan Modra wrote:

+enum memmodel
+{
+  MEMMODEL_RELAXED = 0,
+  MEMMODEL_CONSUME = 1,
+  MEMMODEL_ACQUIRE = 2,
+  MEMMODEL_RELEASE = 3,
+  MEMMODEL_ACQ_REL = 4,
+  MEMMODEL_SEQ_CST = 5,
+  MEMMODEL_LAST = 6
+};

This should probably go to libgomp.h.

I wondered about that.

 /* This is a Linux specific implementation of a semaphore synchronization
    mechanism for libgomp.  This type is private to the library.  This
+   counting semaphore implementation uses atomic instructions and the
+   futex syscall, and a single 32-bit int to store semaphore state.
+   The low 31 bits are the count, the top bit is a flag set when some
+   threads may be waiting.  */

I think we'll get better code generation on a number of targets if we make the low bit the waiting bit, and the higher bits the count. That way we do (count & 1) and (count + 2) instead of larger constants needing to be generated. Not changed below...

Funny, that's the way I wrote this code at first, then went for the wait bit as the sign bit because you can test > 0 in one place where you want not waiting and count non-zero.

+static inline void
+gomp_sem_wait (gomp_sem_t *sem)
 {
+  int count = *sem;
+
+  while ((count & 0x7fffffff) != 0)
+    {
+      int oldval = count;
+      __atomic_compare_exchange_4 (sem, &oldval, count - 1,
+				   false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED);
+      if (__builtin_expect (oldval == count, 1))
+	return;
+      count = oldval;
+    }

I'd really prefer not to hard-code any sizes at all. Let's change the explicit _4 to _n everywhere.

OK.

The loop above doesn't actually make sense to me. If the compare-and-swap succeeds, then oldval == count - 1. Which doesn't seem to be what you're testing at all.

Huh? If it succeeds, oldval == count and we return.
Perhaps this entire function is better written as

  {
    int count = *sem;
    do
      {
	if (__builtin_expect (count & 0x80000000u, 0))
	  {
	    gomp_sem_wait_slow (sem, count);
	    return;
	  }
      }
    while (!__atomic_compare_exchange_n (sem, &count, count - 1, true,
					 MEMMODEL_ACQUIRE, MEMMODEL_RELAXED));
  }

No, we don't need the slow path if we have *sem & 0x7fffffff non-zero.

+gomp_sem_post (gomp_sem_t *sem)
 {
+  int count = *sem;
+
+  while (1)
+    {
+      int oldval = count;
+      __atomic_compare_exchange_4 (sem, &oldval, (count + 1) & 0x7fffffff,
+				   false, MEMMODEL_RELEASE, MEMMODEL_RELAXED);
+      if (__builtin_expect (oldval == count, 1))
+	break;

Again, this equality doesn't make sense. Further, you aren't testing for the high bit set inside the loop, nor are you preserving the high bit.

See above about the equality. We don't want to preserve the wait bit here. Otherwise we'd never go back to the uncontended state, which answers ...

All that said, I don't see how we can ever clear the wait bit once it's set?

... this question.

Are we better off splitting the 32-bit value into two 16-bit fields for value+waiters? Or similarly splitting a 64-bit value? That way we can at least update both fields atomically, and we ought never lose a waiter.

Other splits of the field are certainly possible, but of course restrict the max number, and you'd need the fancy futex op_wait. Some targets don't have 64-bit atomics, so we can't really go that way. Yes, if I did keep track of the number of waiters we'd save one unnecessary futex_wake call, but I'm quite confident no waiters will be lost just using a flag.
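The sign-bit encoding Alan defends can be sketched on its own, fast paths only. This is an illustrative reconstruction with hypothetical names, no futex slow path, and the __ATOMIC_* builtin spellings:

```c
#include <stdbool.h>

typedef unsigned int gomp_sem_t;

#define SEM_WAIT_FLAG  0x80000000u  /* top bit: some thread may be waiting */
#define SEM_COUNT_MASK 0x7fffffffu  /* low 31 bits: the count */

/* Decrement the count if non-zero.  Returns 0 on success, -1 when the
   caller would have to fall back to the futex-based slow path.  */
static int
sem_trywait (gomp_sem_t *sem)
{
  unsigned int count = *sem;
  while ((count & SEM_COUNT_MASK) != 0)
    {
      unsigned int oldval = count;
      __atomic_compare_exchange_n (sem, &oldval, count - 1, false,
                                   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
      if (oldval == count)
        return 0;  /* expected value unchanged: the CAS succeeded */
      count = oldval;
    }
  return -1;
}

/* Increment the count, deliberately dropping the wait flag so the
   semaphore can return to the uncontended state; a real post would then
   futex-wake a waiter if the flag had been set.  */
static void
sem_post_fast (gomp_sem_t *sem)
{
  unsigned int count = *sem;
  for (;;)
    {
      unsigned int oldval = count;
      __atomic_compare_exchange_n (sem, &oldval,
                                   (count + 1) & SEM_COUNT_MASK, false,
                                   __ATOMIC_RELEASE, __ATOMIC_RELAXED);
      if (oldval == count)
        return;
      count = oldval;
    }
}
```

With the wait flag in the sign bit, a plain `(int) *sem > 0` test covers "no waiters and count non-zero" in one comparison, which is the advantage argued for above.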
Re: [rs6000] Streamline compare-and-swap success return value computation
On Mon, Nov 28, 2011 at 04:33:58PM -0800, Richard Henderson wrote: On 11/28/2011 04:26 PM, Richard Henderson wrote: On 11/28/2011 03:05 PM, Richard Henderson wrote: On 11/28/2011 02:16 PM, Alan Modra wrote: Hmm, I suppose you could argue that powerpc and others ought to not generate those three extra instructions when using the return value. I'll see about fixing powerpc. However, we can do better by considering the value to be stored in CR0... Try this and see if it generates the sort of code you want. Untested. ... actually, this one. There's no reason to differentiate between strong and weak compare-and-swap when computing boolval. Looks very similar to what I wrote, except + else if (reg_overlap_mentioned_p (retval, oldval)) +oldval = copy_to_reg (oldval); I missed this bit - x = gen_rtx_NE (VOIDmode, x, oldval); - x = rs6000_generate_compare (x, mode); + cond = gen_reg_rtx (CCmode); + x = gen_rtx_COMPARE (CCmode, x, oldval); + emit_insn (gen_rtx_SET (VOIDmode, cond, x)); + + x = gen_rtx_NE (VOIDmode, cond, const0_rtx); and instead pulled cond out of the return from rs6000_generate_compare. Results look good. -- Alan Modra Australia Development Lab, IBM
Re: [build] Cleanup rs6000/t-ppccomm configurations (PR other/51022)
On Mon, Nov 28, 2011 at 05:15:08PM +0100, Rainer Orth wrote: Alan Modra amo...@gmail.com writes: On Sun, Nov 27, 2011 at 08:36:38PM -0500, David Edelsohn wrote: I was hoping that other targets affected by the patch would test it as well. powerpc-linux and powerpc64-linux look good. Thanks for the confirmation. However, Andreas Tobler pointed out in private mail that I'd missed to reference rs6000/t-linux64 during the libgcc migration. Strange that this seems to have no ill effect. Hmm, yeah, we should have that back in. Lack of t-linux64 means we get all the soft-fp functions that we never use, bloating libgcc. Also, -mno-minimal-toc gives better code. Even when -mminimal-toc wouldn't be in force anyway the option selects -mcmodel=small. -- Alan Modra Australia Development Lab, IBM
Re: Fix PR51298, libgomp barrier failure
On Mon, Nov 28, 2011 at 05:42:15PM -0800, Richard Henderson wrote:
On 11/28/2011 06:02 AM, Alan Modra wrote:

-  unsigned int ret = bar->generation & ~3;
-  /* Do we need any barrier here or is __sync_add_and_fetch acting
-     as the needed LoadLoad barrier already?  */
-  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
+  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
+  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;

Given that the read from bar->generation is ACQ, we don't need a duplicate barrier from the REL on the atomic add. I believe both can be MEMMODEL_ACQUIRE, both in order to force the ordering of these two memops, as well as force these to happen before anything subsequent.

I tried with MEMMODEL_ACQUIRE and ran force-parallel-6.exe, the test that seems most sensitive to barrier problems, many times, and it hangs occasionally in futex_wait called via gomp_team_barrier_wait_end. I believe that threads can't be allowed to exit from gomp_{,team_}barrier_wait without hitting a release barrier, and perhaps from gomp_barrier_wait_last too. gomp_barrier_wait_start is a convenient point to insert the barrier, and a minimal change from the old code using __sync_add_and_fetch. I can add a comment. ;-)
Re: Use atomics in libgomp mutex
On Mon, Nov 28, 2011 at 05:33:17PM -0800, Richard Henderson wrote:
On 11/27/2011 02:57 PM, Alan Modra wrote:
This is the mutex part. Depends on http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02454.html for MEMMODEL_* values.

	* config/linux/mutex.h: Use atomic rather than sync builtins.
	* config/linux/mutex.c: Likewise.  Comment.  Use -1 for waiting state.
	* config/linux/omp-lock.h: Comment fix.
	* config/linux/arm/mutex.h: Delete.
	* config/linux/powerpc/mutex.h: Delete.
	* config/linux/ia64/mutex.h: Delete.
	* config/linux/mips/mutex.h: Delete.

Looks good modulo _4/_n and using the success return of __atomic_compare_exchange.

Did you see my comment about the OpenMP spec requirement that critical, ordered and parallel regions have a flush on entry and on exit? So should I commit a version with

static inline void
gomp_mutex_lock (gomp_mutex_t *mutex)
{
  int oldval = 0;
  /* FIXME: This should just be MEMMODEL_ACQUIRE, MEMMODEL_RELAXED
     but GOMP_critical_start and other functions rely on the lock
     acquisition doing a flush.  See OpenMP API version 3.1 section
     2.8.6 flush Construct.  */
  if (__builtin_expect (!__atomic_compare_exchange_n (mutex, &oldval, 1,
						      false, MEMMODEL_ACQ_REL,
						      MEMMODEL_ACQUIRE),
			0))
    gomp_mutex_lock_slow (mutex, oldval);
}

or as posted in http://gcc.gnu.org/ml/gcc-patches/2011-11/msg02455.html? Perhaps with the following as well

	* ordered.c (gomp_ordered_sync): Add MEMMODEL_RELEASE fence.
	* critical.c (GOMP_critical_start): Likewise.

I think we're OK on parallel regions by virtue of gomp_barrier_wait, but I may not know what I'm talking about. This stuff fries my brains.

Index: ordered.c
===================================================================
--- ordered.c	(revision 181770)
+++ ordered.c	(working copy)
@@ -199,6 +199,9 @@ gomp_ordered_sync (void)
   if (team == NULL || team->nthreads == 1)
     return;
 
+  /* There is an implicit flush on entry to an ordered region.  */
+  __atomic_thread_fence (MEMMODEL_RELEASE);
+
   /* ??? I believe it to be safe to access this data without taking the
      ws->lock.
The only presumed race condition is with the previous thread on the queue incrementing ordered_cur such that it points
Index: critical.c
===================================================================
--- critical.c	(revision 181770)
+++ critical.c	(working copy)
@@ -33,6 +33,8 @@ static gomp_mutex_t default_lock;
 void
 GOMP_critical_start (void)
 {
+  /* There is an implicit flush on entry to a critical region.  */
+  __atomic_thread_fence (MEMMODEL_RELEASE);
   gomp_mutex_lock (&default_lock);
 }
-- 
Alan Modra
Australia Development Lab, IBM
Re: Fix libgomp semaphores
)
     {
-      int val = __sync_val_compare_and_swap (sem, 0, -1);
-      if (val > 0)
+      unsigned int wake = count & ~SEM_WAIT;
+      int newval = SEM_WAIT;
+
+      if (wake != 0)
+	newval |= wake - SEM_INC;
+      if (likely_compare_exchange (sem, &count, newval,
+				   false, MEMMODEL_ACQUIRE, MEMMODEL_RELAXED))
	{
-	  if (__sync_bool_compare_and_swap (sem, val, val - 1))
-	    return;
+	  if (wake != 0)
+	    {
+	      /* If we can wake more threads, do so now.  */
+	      if (wake > SEM_INC)
+		gomp_sem_post_slow (sem);
+	      break;
+	    }
+	  do_wait (sem, SEM_WAIT);
+	  count = *sem;
	}
-      do_wait (sem, -1);
     }
 }
 
 void
 gomp_sem_post_slow (gomp_sem_t *sem)
 {
-  int old, tmp = *sem, wake;
-
-  do
-    {
-      old = tmp;
-      wake = old > 0 ? old + 1 : 1;
-      tmp = __sync_val_compare_and_swap (sem, old, wake);
-    }
-  while (old != tmp);
-
-  futex_wake (sem, wake);
+  futex_wake (sem, 1);
 }
-- 
Alan Modra
Australia Development Lab, IBM
Re: Fix libgomp semaphores
On Tue, Nov 29, 2011 at 08:09:31AM -0800, Richard Henderson wrote: Tight loops like this should use the weak compare-exchange so that we don't get two nested loops without need. Done. I do think we need more commentary here so that next month I can still follow it. Added this in gomp_sem_post /* Clear SEM_WAIT here so that if there are no more waiting threads we transition back to the uncontended state that does not make futex syscalls. If there are waiting threads then when one is awoken it will set SEM_WAIT again, so other waiting threads are woken on a future gomp_sem_post. Furthermore, the awoken thread will wake other threads in case gomp_sem_post was called again before it had time to set SEM_WAIT. */ and committed revision 181831 without likely_compare_exchange as discussed. -- Alan Modra Australia Development Lab, IBM
Re: Use atomics in libgomp mutex
On Tue, Nov 29, 2011 at 08:37:07AM -0800, Richard Henderson wrote: How about MEMMODEL_ACQ_REL Right. Committed with that change revision 181832. -- Alan Modra Australia Development Lab, IBM
Re: Fix PR51298, libgomp barrier failure
On Tue, Nov 29, 2011 at 04:15:36PM +1030, Alan Modra wrote:
On Mon, Nov 28, 2011 at 05:42:15PM -0800, Richard Henderson wrote:
On 11/28/2011 06:02 AM, Alan Modra wrote:

-  unsigned int ret = bar->generation & ~3;
-  /* Do we need any barrier here or is __sync_add_and_fetch acting
-     as the needed LoadLoad barrier already?  */
-  ret += __sync_add_and_fetch (&bar->awaited, -1) == 0;
+  unsigned int ret = __atomic_load_4 (&bar->generation, MEMMODEL_ACQUIRE) & ~3;
+  ret += __atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0;

Given that the read from bar->generation is ACQ, we don't need a duplicate barrier from the REL on the atomic add. I believe both can be MEMMODEL_ACQUIRE, both in order to force the ordering of these two memops, as well as force these to happen before anything subsequent.

I tried with MEMMODEL_ACQUIRE and ran force-parallel-6.exe, the test that seems most sensitive to barrier problems, many times, and it hangs occasionally in futex_wait called via gomp_team_barrier_wait_end. I believe that threads can't be allowed to exit from gomp_{,team_}barrier_wait without hitting a release barrier, and perhaps from gomp_barrier_wait_last too. gomp_barrier_wait_start is a convenient point to insert the barrier, and a minimal change from the old code using __sync_add_and_fetch. I can add a comment. ;-)

Committed rev 181833.
Use atomics in remaining libgomp/config/linux sources
@@ cpu_relax (void)
   __asm volatile ("" : : : "memory");
 #endif
 }
-
-static inline void
-atomic_write_barrier (void)
-{
-#if defined __arch64__ || defined __sparc_v9__
-  __asm volatile ("membar #StoreStore" : : : "memory");
-#else
-  __sync_synchronize ();
-#endif
-}
Index: libgomp/config/linux/x86/futex.h
===================================================================
--- libgomp/config/linux/x86/futex.h	(revision 181830)
+++ libgomp/config/linux/x86/futex.h	(working copy)
@@ -145,9 +145,3 @@ cpu_relax (void)
 {
   __asm volatile ("rep; nop" : : : "memory");
 }
-
-static inline void
-atomic_write_barrier (void)
-{
-  __sync_synchronize ();
-}
Index: libgomp/config/linux/futex.h
===================================================================
--- libgomp/config/linux/futex.h	(revision 181830)
+++ libgomp/config/linux/futex.h	(working copy)
@@ -67,9 +67,3 @@ cpu_relax (void)
 {
   __asm volatile ("" : : : "memory");
 }
-
-static inline void
-atomic_write_barrier (void)
-{
-  __sync_synchronize ();
-}
-- 
Alan Modra
Australia Development Lab, IBM
PR libgomp/51376 fix
The simple one-line fix in GOMP_taskwait took many hours to find. Shared memory problems are a pain to debug, especially when adding code to dump some state turns a testcase that fails every hundred or so runs into one that takes thousands of times longer to fail.

What happens here is that GOMP_taskwait is called in the parent thread some time after gomp_barrier_handle_tasks has run in the child to the point of clearing the parent's children field. However, since there is no acquire barrier in the parent and the child may or may not have reached the release barrier in the mutex unlock, the memory stores in the child are not guaranteed to be seen in order in the parent thread. Thus the parent can see task->children clear but not yet see stores done as part of the real work of the child, ie. to a and n in the testcase. The GOMP_task change fixes a similar potential problem.

Bootstrapped and regression tested powerpc-linux. OK to apply?

	PR libgomp/51376
	* task.c (GOMP_taskwait): Don't access task->children outside of
	task_lock mutex region.
	(GOMP_task): Likewise.

Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 181833)
+++ libgomp/task.c	(working copy)
@@ -116,12 +116,12 @@ GOMP_task (void (*fn) (void *), void *da
	}
       else
	fn (data);
-      if (task.children)
-	{
-	  gomp_mutex_lock (&team->task_lock);
-	  gomp_clear_parent (task.children);
-	  gomp_mutex_unlock (&team->task_lock);
-	}
+      if (team != NULL)
+	gomp_mutex_lock (&team->task_lock);
+      if (task.children != NULL)
+	gomp_clear_parent (task.children);
+      if (team != NULL)
+	gomp_mutex_unlock (&team->task_lock);
       gomp_end_task ();
     }
   else
@@ -290,8 +290,9 @@ GOMP_taskwait (void)
   struct gomp_task *child_task = NULL;
   struct gomp_task *to_free = NULL;
 
-  if (task == NULL || task->children == NULL)
+  if (task == NULL)
     return;
+
   gomp_mutex_lock (&team->task_lock);
   while (1)
     {
-- 
Alan Modra
Australia Development Lab, IBM
Re: PR libgomp/51376 fix
On Thu, Dec 01, 2011 at 12:36:08PM +0100, Jakub Jelinek wrote:
On Thu, Dec 01, 2011 at 09:58:08PM +1030, Alan Modra wrote:
The GOMP_task change fixes a similar potential problem. Bootstrapped and regression tested powerpc-linux. OK to apply?

	PR libgomp/51376
	* task.c (GOMP_taskwait): Don't access task->children outside of
	task_lock mutex region.
	(GOMP_task): Likewise.

Can't this be solved just by adding a barrier? The access to the var outside of the lock has been quite intentional, to avoid locking in the common case where there are no children.

No, I tried that and the task-6.C testcase still failed, although not quite as readily. I was using

  if (task == NULL
      || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == 0)

You need a release in the child as well as the acquire to ensure proper synchronisation, and there's a window for failure between the child clearing task->children and performing a release as part of the mutex unlock.

Oops, on looking at this email I see I attached an old patch. This one avoids a segfault on trying to lock team->task_lock when there is no team. This one really has been bootstrapped and regression tested successfully.

	PR libgomp/51376
	* task.c (GOMP_taskwait): Don't access task->children outside of
	task_lock mutex region.
	(GOMP_task): Likewise.

Index: libgomp/task.c
===================================================================
--- libgomp/task.c	(revision 181902)
+++ libgomp/task.c	(working copy)
@@ -116,10 +116,11 @@ GOMP_task (void (*fn) (void *), void *da
	}
       else
	fn (data);
-      if (task.children)
+      if (team != NULL)
	{
	  gomp_mutex_lock (&team->task_lock);
-	  gomp_clear_parent (task.children);
+	  if (task.children != NULL)
+	    gomp_clear_parent (task.children);
	  gomp_mutex_unlock (&team->task_lock);
	}
       gomp_end_task ();
@@ -290,8 +291,9 @@ GOMP_taskwait (void)
   struct gomp_task *child_task = NULL;
   struct gomp_task *to_free = NULL;
 
-  if (task == NULL || task->children == NULL)
+  if (task == NULL || team == NULL)
     return;
+
   gomp_mutex_lock (&team->task_lock);
   while (1)
     {
-- 
Alan Modra
Australia Development Lab, IBM
Re: PR libgomp/51376 fix
On Fri, Dec 02, 2011 at 08:10:11PM +1030, Alan Modra wrote:
> On Thu, Dec 01, 2011 at 12:36:08PM +0100, Jakub Jelinek wrote:
>> On Thu, Dec 01, 2011 at 09:58:08PM +1030, Alan Modra wrote:
>>> The GOMP_task change fixes a similar potential problem.  Bootstrapped
>>> and regression tested powerpc-linux.  OK to apply?
>>>
>>> 	PR libgomp/51376
>>> 	* task.c (GOMP_taskwait): Don't access task->children outside of
>>> 	task_lock mutex region.
>>> 	(GOMP_task): Likewise.
>>
>> Can't this be solved just by adding a barrier?  The access to the var
>> outside of the lock has been quite intentional, to avoid locking in the
>> common case where there are no children.
>
> No, I tried that and the task-6.C testcase still failed, although not
> quite as readily.  I was using
>
>   if (task == NULL
>       || __atomic_load_n (&task->children, MEMMODEL_ACQUIRE) == 0)
>
> You need a release in the child as well as the acquire to ensure proper
> synchronisation, and there's a window for failure between the child
> clearing task->children and performing a release as part of the mutex
> unlock.

Perhaps I misunderstood your question.  I suppose you could solve the problem by adding a barrier, if by that you meant the acquire barrier in GOMP_taskwait as I tried before I fully understood the problem, plus the corresponding release barrier in gomp_barrier_handle_tasks, ie. replacing

  parent->children = NULL;

with

  __atomic_store_n (&parent->children, NULL, MEMMODEL_RELEASE);

That should work, but of course it will slow down what I imagine is a common multi-thread case where child threads don't complete before the parent gets around to a taskwait.  Do you really want to do that?

-- 
Alan Modra
Australia Development Lab, IBM
Re: Allow more PowerPC sibling calls
On Mon, Apr 18, 2011 at 10:29:17AM -0700, Nathan Froyd wrote:
> Could I request that you use FOREACH_FUNCTION_ARGS in these two cases?
> The conversion is trivial, and avoiding more exposed TYPE_ARG_TYPES
> calls is a good thing.

Committed revision 172855.

-- 
Alan Modra
Australia Development Lab, IBM
Re: Allow more PowerPC sibling calls
On Fri, Apr 22, 2011 at 01:29:14PM +0930, Alan Modra wrote:
> On Mon, Apr 18, 2011 at 10:29:17AM -0700, Nathan Froyd wrote:
>> Could I request that you use FOREACH_FUNCTION_ARGS in these two cases?
>> The conversion is trivial, and avoiding more exposed TYPE_ARG_TYPES
>> calls is a good thing.
>
> Committed revision 172855.

Oops, I missed the following hunk, which I talked about here:

  I also fixed a minor problem with CALL_LIBCALL in the call cookie,
  which could cause various call insns to match an "n" constraint rather
  than a "0" constraint and so give the wrong insn length.

and mentioned in the changelog:

	* config/rs6000/rs6000.c (rs6000_function_arg): Remove CALL_LIBCALL
	when returning call_cookie.

but failed to include in my post.  Committed as obvious, revision 172856.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 172855)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -9007,7 +9007,7 @@ rs6000_function_arg (CUMULATIVE_ARGS *cu
			 : CALL_V4_CLEAR_FP_ARGS));
	}

-      return GEN_INT (cum->call_cookie);
+      return GEN_INT (cum->call_cookie & ~CALL_LIBCALL);
     }

   if (TARGET_MACHO && rs6000_darwin64_struct_check_p (mode, type))

-- 
Alan Modra
Australia Development Lab, IBM
PowerPC64 non-delegitimized UNSPEC_TOCREL
This patch fixes the following warnings seen during a powerpc64 bootstrap.

libgomp/config/linux/sem.c: In function ‘gomp_sem_wait_slow’:
libgomp/config/linux/sem.c:33:1: note: non-delegitimized UNSPEC UNSPEC_TOCREL (44) found in variable location
libgomp/config/linux/bar.c: In function ‘gomp_barrier_wait_end’:
libgomp/config/linux/bar.c:34:1: note: non-delegitimized UNSPEC UNSPEC_TOCREL (44) found in variable location

What's happening is that we are getting an address that looked like

  lo_sum ((reg 31) (const (plus (unspec [symbol_ref (some_var) tocrel]) 4)))

but only expect to handle something like

  lo_sum ((reg 31) (const (unspec [symbol_ref (some_var) tocrel])))

I also tidied the macho code, which used a mix of orig_x and x, making the code fragile wrt. some future change that assigns x earlier in the function.  (If orig_x is a lo_sum, then x == orig_x currently.)

Bootstrapped and regression tested powerpc64-linux.  OK to apply?

	* config/rs6000/rs6000.c (rs6000_delegitimize_address): Handle
	unspec plus offset.  Tidy macho code.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 173064)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -6380,7 +6380,16 @@ rs6000_delegitimize_address (rtx orig_x)
   if (GET_CODE (x) == (TARGET_CMODEL != CMODEL_SMALL ? LO_SUM : PLUS)
       && GET_CODE (XEXP (x, 1)) == CONST)
     {
+      rtx offset = NULL_RTX;
+
       y = XEXP (XEXP (x, 1), 0);
+      if (GET_CODE (y) == PLUS
+	  && GET_MODE (y) == Pmode
+	  && CONST_INT_P (XEXP (y, 1)))
+	{
+	  offset = XEXP (y, 1);
+	  y = XEXP (y, 0);
+	}
       if (GET_CODE (y) == UNSPEC
	  && XINT (y, 1) == UNSPEC_TOCREL
	  && ((GET_CODE (XEXP (x, 0)) == REG
@@ -6396,6 +6405,8 @@ rs6000_delegitimize_address (rtx orig_x)
	      && XEXP (XEXP (XEXP (x, 0), 1), 0))
	{
	  y = XVECEXP (y, 0, 0);
+	  if (offset != NULL_RTX)
+	    y = gen_rtx_PLUS (Pmode, y, offset);
	  if (!MEM_P (orig_x))
	    return y;
	  else
@@ -6405,9 +6416,9 @@ rs6000_delegitimize_address (rtx orig_x)

   if (TARGET_MACHO
       && GET_CODE (orig_x) == LO_SUM
-      && GET_CODE (XEXP (x, 1)) == CONST)
+      && GET_CODE (XEXP (orig_x, 1)) == CONST)
     {
-      y = XEXP (XEXP (x, 1), 0);
+      y = XEXP (XEXP (orig_x, 1), 0);
       if (GET_CODE (y) == UNSPEC
	  && XINT (y, 1) == UNSPEC_MACHOPIC_OFFSET)
	return XVECEXP (y, 0, 0);

-- 
Alan Modra
Australia Development Lab, IBM
Re: PowerPC64 non-delegitimized UNSPEC_TOCREL
On Thu, Apr 28, 2011 at 01:58:24PM -0400, David Edelsohn wrote:
> On Thu, Apr 28, 2011 at 11:22 AM, Alan Modra amo...@gmail.com wrote:
>> 	* config/rs6000/rs6000.c (rs6000_delegitimize_address): Handle
>> 	unspec plus offset.  Tidy macho code.
>
> Looks good.

Committed mainline revision 173141.  I meant to ask about 4.6 as we have the same problem there.  OK for 4.6 too?

-- 
Alan Modra
Australia Development Lab, IBM
Re: rs6000_handle_option global state avoidance, part 1
On Thu, May 05, 2011 at 11:29:13AM -0400, David Edelsohn wrote:
> Alan, is -mcmodel supposed to set -m64?

No, that was an accident.

-- 
Alan Modra
Australia Development Lab, IBM
Fix PR48900, powerpc duplicate __tls_get_addr calls
My fix for PR44266, using the libcall machinery to ensure we had a proper stack frame allocated for __tls_get_addr calls, sloppily used r3 as the arg to the dummy libcall.  This made the call seem to depend on whatever was in r3 previously, at least until we get to the first split pass when the real arg is exposed.  So DCE couldn't merge calls.  Even for a simple testcase like

  extern __thread int i;
  void foo (void) { i++; }

we get two __tls_get_addr calls if using the global-dynamic tls model.  Easily fixed by giving the dummy libcall an arg of zero.  An alternative giving slightly better -O0 code would be to say that the libcall doesn't have any args.  I chose to leave the libcall with one arg since this is closest to the real __tls_get_addr call, and the whole point of faking up a libcall here is to have the generic code do whatever is necessary when making function calls.  It's not totally impossible to imagine some future ABI change that treats zero-arg calls differently from other calls.  Bootstrapped and regression tested powerpc64-linux.  OK to apply mainline, 4.6 and 4.5?

	PR target/48900
	* config/rs6000/rs6000.c (rs6000_legitimize_tls_address): Use
	const0_rtx as the arg to the dummy __tls_get_addr libcall.
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 173464)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -6412,10 +6412,11 @@ rs6000_legitimize_tls_address (rtx addr,

   if (model == TLS_MODEL_GLOBAL_DYNAMIC)
     {
-      r3 = gen_rtx_REG (Pmode, 3);
       tga = rs6000_tls_get_addr ();
-      emit_library_call_value (tga, dest, LCT_CONST, Pmode, 1, r3, Pmode);
+      emit_library_call_value (tga, dest, LCT_CONST, Pmode,
+			       1, const0_rtx, Pmode);

+      r3 = gen_rtx_REG (Pmode, 3);
       if (DEFAULT_ABI == ABI_AIX && TARGET_64BIT)
	insn = gen_tls_gd_aix64 (r3, got, addr, tga, const0_rtx);
       else if (DEFAULT_ABI == ABI_AIX && !TARGET_64BIT)
@@ -6432,11 +6433,12 @@ rs6000_legitimize_tls_address (rtx addr,
     }
   else if (model == TLS_MODEL_LOCAL_DYNAMIC)
     {
-      r3 = gen_rtx_REG (Pmode, 3);
       tga = rs6000_tls_get_addr ();
       tmp1 = gen_reg_rtx (Pmode);
-      emit_library_call_value (tga, tmp1, LCT_CONST, Pmode, 1, r3, Pmode);
+      emit_library_call_value (tga, tmp1, LCT_CONST, Pmode,
+			       1, const0_rtx, Pmode);

+      r3 = gen_rtx_REG (Pmode, 3);
       if (DEFAULT_ABI == ABI_AIX && TARGET_64BIT)
	insn = gen_tls_ld_aix64 (r3, got, tga, const0_rtx);
       else if (DEFAULT_ABI == ABI_AIX && !TARGET_64BIT)

-- 
Alan Modra
Australia Development Lab, IBM
-Wshadow warning
Wouldn't -Wshadow be more useful if it obeyed -Wno-system-headers?  For code like

  #include <stdlib.h>
  int foo (int atof);
  int foo (int atof) { return atof; }

we currently do not warn on the prototype, but do on the function definition, leading to reports such as http://sourceware.org/bugzilla/show_bug.cgi?id=13121

The following has been bootstrapped and regression tested powerpc-linux.  OK to apply?

	* c-decl.c (warn_if_shadowing): Don't warn if shadowed
	identifier is from system header.

Index: gcc/c-decl.c
===================================================================
--- gcc/c-decl.c	(revision 178035)
+++ gcc/c-decl.c	(working copy)
@@ -2516,7 +2516,10 @@ warn_if_shadowing (tree new_decl)

   /* Is anything being shadowed?  Invisible decls do not count.  */
   for (b = I_SYMBOL_BINDING (DECL_NAME (new_decl)); b; b = b->shadowed)
-    if (b->decl && b->decl != new_decl && !b->invisible)
+    if (b->decl && b->decl != new_decl && !b->invisible
+	&& (b->decl == error_mark_node
+	    || diagnostic_report_warnings_p (global_dc,
+					     DECL_SOURCE_LOCATION (b->decl))))
       {
	tree old_decl = b->decl;

-- 
Alan Modra
Australia Development Lab, IBM
rs6000 toc reference rtl
						gpc_reg_operand "=r")
-	(lo_sum:DI (match_operand:DI 1 "gpc_reg_operand" "b")
-		   (match_operand:DI 2 "" "")))]
-  "TARGET_ELF && TARGET_CMODEL != CMODEL_SMALL"
-  "{cal %0,%2@l(%1)|addi %0,%1,%2@l}")

 ;; Call and call_value insns

 (define_expand "call"

-- 
Alan Modra
Australia Development Lab, IBM
Re: rs6000 toc reference rtl
On Tue, Sep 06, 2011 at 01:11:26AM +0930, Alan Modra wrote:
> Consequently, Mike's change to split rtl for indirect calls sometimes
> sees the scheduler moving the r2 load in the indirect call sequence
> before a toc reference.

Actually, this isn't correct.  Mike's change adding rs6000.c rs6000_call_indirect_aix just made it more likely.  Even before this, post-reload scheduling could move the r2 load around, since rs6000.md call_indirect_aix patterns were (and still are) split post-reload.  Here's an example I was shown today of such damage (qemu compiled with gcc-4.6-redhat).

.LVL57151:
	ld 0,0(31)		# load opd+0, function addr
	addis 4,2,.LC4758@toc@ha
	ld 11,16(31)
	mr 7,3
	std 2,40(1)		# save r2
	mr 5,25
	addi 4,4,.LC4758@toc@l
	mtctr 0			#
	mr 6,26
	ld 2,8(31)		# load opd+8, new toc ptr in r2
	mr 3,28
.LBB255670:
.LBB255668:
	.loc 8 98 0
	addis 27,2,.LC4761@toc@ha	# oops, should be using old r2
.LVL57152:
	addi 27,27,.LC4761@toc@l
.LBE255668:
.LBE255670:
	.loc 3 9212 0
	addis 25,2,.LC4762@toc@ha	# oops again
	.loc 3 9198 0
	bctrl			# make the call
	ld 2,40(1)		# restore r2

r27 and r25, set up here for later use, now contain bogus values.  The blame rests on my 2011-06-20 change.

-- 
Alan Modra
Australia Development Lab, IBM
PowerPC shrink-wrap support 0 of 3
This patch series adds shrink-wrap support for PowerPC.  The patches are on top of Bernd's "Initial shrink-wrapping" patch: http://gcc.gnu.org/ml/gcc-patches/2011-08/msg02557.html, but with the tm.texi patch applied to tm.texi.in.  Bootstrapped and regression tested powerpc64-linux, all langs except ada, and spec CPU2006 tested.  The spec results were a little disappointing as I expected to see some gains, but my baseline was a -O3 run and I suppose most of the shrink-wrap opportunities were lost to inlining.

I deliberately omitted defining RETURN_ADDR_REGNUM for PowerPC as I don't see why it is necessary to treat LR any differently to other regs saved by the prologue.  I believe the code in requires_stack_frame_p

  CLEAR_HARD_REG_SET (hardregs);
  note_stores (PATTERN (insn), record_hard_reg_sets, &hardregs);
  if (hard_reg_set_intersect_p (hardregs, prologue_used))
    return true;

will correctly pick up any set of LR.  If code in a function body sets LR without saving and restoring around the use, then the prologue must save LR, so LR will be in prologue_used.

-- 
Alan Modra
Australia Development Lab, IBM
PowerPC shrink-wrap support 1 of 3
This obvious patch extends my 2011-08-03 fix to simple_return.  Necessary for the same reason as the original patch, namely that rs6000.md has a peephole2 that matches and recreates a conditional jump.  With the third patch in this series applied, rs6000 will start to use simple_return in conditional jumps.  Even though I think this patch meets the "obvious" requirement, I won't apply it without the other two since the assert may trigger on mips without the second patch in this series.

	PR rtl-optimization/49941
	* jump.c (mark_jump_label_1): Set JUMP_LABEL for simple_return jumps.

diff -urp -x.svn -x'*~' -x'*.orig' gcc-bernd/gcc/jump.c gcc-current/gcc/jump.c
--- gcc-bernd/gcc/jump.c	2011-09-16 11:52:14.0 +0930
+++ gcc-current/gcc/jump.c	2011-09-16 09:59:39.0 +0930
@@ -1086,6 +1086,7 @@ mark_jump_label_1 (rtx x, rtx insn, bool
       return;

     case RETURN:
+    case SIMPLE_RETURN:
       if (is_target)
	{
	  gcc_assert (JUMP_LABEL (insn) == NULL || JUMP_LABEL (insn) == x);
@@ -1408,7 +1409,7 @@ redirect_exp_1 (rtx *loc, rtx olabel, rt
   int i;
   const char *fmt;

-  if (code == LABEL_REF && XEXP (x, 0) == olabel)
+  if ((code == LABEL_REF && XEXP (x, 0) == olabel) || x == olabel)
     {
       x = redirect_target (nlabel);

-- 
Alan Modra
Australia Development Lab, IBM
PowerPC shrink-wrap support 3 of 3
     )])]
{
@@ -13002,7 +13019,7 @@
	      (match_operand 2 "" "")))
     (use (match_operand:SI 3 "immediate_operand" "O,n,O,n"))
     (use (reg:SI LR_REGNO))
-    (return)]
+    (simple_return)]
  "(DEFAULT_ABI == ABI_DARWIN
    || DEFAULT_ABI == ABI_V4)
   && (INTVAL (operands[3]) & CALL_LONG) == 0"
@@ -15328,9 +15345,9 @@
			[(match_operand 1 "cc_reg_operand" "y")
			 (const_int 0)])
-		       (return)
+		       (any_return)
		       (pc)))]
-  "direct_return ()"
+  "<return_pred>"
  "*
{
  return output_cbranch (operands[0], NULL, 0, insn);
@@ -15360,8 +15377,8 @@
			 "cc_reg_operand" "y")
			 (const_int 0)])
			(pc)
-			(return)))]
-  "direct_return ()"
+			(any_return)))]
+  "<return_pred>"
  "*
{
  return output_cbranch (operands[0], NULL, 1, insn);
@@ -15491,9 +15508,9 @@
  "b %l0"
  [(set_attr "type" "branch")])

-(define_insn "return"
-  [(return)]
-  "direct_return ()"
+(define_insn "<return_str>return"
+  [(any_return)]
+  "<return_pred>"
  "{br|blr}"
  [(set_attr "type" "jmpreg")])
@@ -16015,7 +16032,7 @@
   (set_attr "cell_micro" "always")])

 (define_insn "*return_internal_<mode>"
-  [(return)
+  [(simple_return)
    (use (match_operand:P 0 "register_operand" "lc"))]
  ""
  "b%T0"
@@ -16077,7 +16094,7 @@

 (define_insn "*return_and_restore_gpregs_<mode>_r11"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (clobber (match_operand:P 1 "register_operand" "=l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 11))
@@ -16090,7 +16107,7 @@

 (define_insn "*return_and_restore_gpregs_<mode>_r12"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (clobber (match_operand:P 1 "register_operand" "=l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 12))
@@ -16103,7 +16120,7 @@

 (define_insn "*return_and_restore_gpregs_<mode>_r1"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (clobber (match_operand:P 1 "register_operand" "=l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 1))
@@ -16116,7 +16133,7 @@

 (define_insn "*return_and_restore_fpregs_<mode>_r11"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (clobber (match_operand:P 1 "register_operand" "=l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 11))
@@ -16129,7 +16146,7 @@

 (define_insn "*return_and_restore_fpregs_<mode>_r12"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (clobber (match_operand:P 1 "register_operand" "=l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 12))
@@ -16142,7 +16159,7 @@

 (define_insn "*return_and_restore_fpregs_<mode>_r1"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (clobber (match_operand:P 1 "register_operand" "=l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 1))
@@ -16155,7 +16172,7 @@

 (define_insn "*return_and_restore_fpregs_aix_<mode>_r11"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (use (match_operand:P 1 "register_operand" "l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 11))
@@ -16168,7 +16185,7 @@

 (define_insn "*return_and_restore_fpregs_aix_<mode>_r1"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (use (match_operand:P 1 "register_operand" "l"))
		   (use (match_operand:P 2 "symbol_ref_operand" "s"))
		   (use (reg:P 1))

diff -urp -x.svn -x'*~' -x'*.orig' gcc-bernd/gcc/config/rs6000/spe.md gcc-current/gcc/config/rs6000/spe.md
--- gcc-bernd/gcc/config/rs6000/spe.md	2011-09-16 11:52:15.0 +0930
+++ gcc-current/gcc/config/rs6000/spe.md	2011-09-15 18:28:43.0 +0930
@@ -3178,7 +3178,7 @@

 (define_insn "*return_and_restore_gpregs_spe"
  [(match_parallel 0 "any_parallel_operand"
-		  [(return)
+		  [(simple_return)
		   (clobber (reg:P 65))
		   (use (match_operand:P 1 "symbol_ref_operand" "s"))
		   (use (reg:P 11))

-- 
Alan Modra
Australia Development Lab, IBM
Re: PowerPC shrink-wrap support 0 of 3
On Sat, Sep 17, 2011 at 03:26:21PM +0200, Bernd Schmidt wrote:
> On 09/17/11 09:16, Alan Modra wrote:
>> This patch series adds shrink-wrap support for PowerPC.  The patches
>> are on top of Bernd's "Initial shrink-wrapping" patch:
>> http://gcc.gnu.org/ml/gcc-patches/2011-08/msg02557.html, but with the
>> tm.texi patch applied to tm.texi.in.  Bootstrapped and regression
>> tested powerpc64-linux all langs except ada, and spec CPU2006 tested.
>> The spec results were a little disappointing as I expected to see some
>> gains, but my baseline was a -O3 run and I suppose most of the
>> shrink-wrap opportunities were lost to inlining.
>
> The last posted version had a bug that crept in during the review
> cycle, and which made it quite ineffective.

I wasn't complaining!  My disappointment really stemmed from having unrealistically high expectations.  I still think this optimization is a great feature.  Thanks for contributing it!

-- 
Alan Modra
Australia Development Lab, IBM
Re: [PATCH] Bug fix in store_bit_field_1 for big endian targets (issue 51893)
On Wed, Mar 21, 2012 at 10:53:07PM +0100, Eric Botcazou wrote:
>> This patch (for 4.6) fixes a wrong subword index computation in
>> store_bit_field_1 for big endian targets when value is at least 4 times
>> bigger than a word (DI REG value with HI words).  It fixes a regression
>> on gcc.c-torture/execute/bitfld-3.c for my current backend port.
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51893
>> OK to commit?
>
> It is OK for mainline on principle but there is no ChangeLog entry and
> the patch doesn't follow the GNU Coding Style: TABs instead of spaces,
> spaces before parentheses, etc.  See the equivalent code in
> extract_bit_field_1.  Moreover you need to properly test it on a
> mainstream big-endian platform.

Passes bootstrap and regression test powerpc64-linux.

-- 
Alan Modra
Australia Development Lab, IBM
rs6000 toc reference rtl again
	       )
-	   (high:TLSmode
-	     (unspec:TLSmode [(match_operand:TLSmode 2 "rs6000_tls_symbol_ref" "")]
-			     UNSPEC_TLSGOTDTPREL))]
+	(high:TLSmode
+	  (unspec:TLSmode [(match_operand:TLSmode 1 "gpc_reg_operand" "b")
+			   (match_operand:TLSmode 2 "rs6000_tls_symbol_ref" "")]
+			  UNSPEC_TLSGOTDTPREL)))]
  "HAVE_AS_TLS && TARGET_CMODEL != CMODEL_SMALL"
  "addis %0,%1,%2@got@dtprel@ha"
  [(set_attr "length" "4")])
@@ -11855,10 +11847,8 @@
  "l<TLSmode:tls_insn_suffix> %0,%2@got@tprel(%1)"
  "&& TARGET_CMODEL != CMODEL_SMALL"
  [(set (match_dup 3)
-	(const:TLSmode
-	  (plus:TLSmode (match_dup 1)
-	    (high:TLSmode
-	      (unspec:TLSmode [(match_dup 2)] UNSPEC_TLSGOTTPREL)))))
+	(high:TLSmode
+	  (unspec:TLSmode [(match_dup 1) (match_dup 2)] UNSPEC_TLSGOTTPREL)))
   (set (match_dup 0)
	(lo_sum:TLSmode (match_dup 3)
	    (unspec:TLSmode [(match_dup 2)] UNSPEC_TLSGOTTPREL)))]
@@ -11873,11 +11863,10 @@

 (define_insn "*tls_got_tprel_high<TLSmode:tls_abi_suffix>"
  [(set (match_operand:TLSmode 0 "gpc_reg_operand" "=b")
-	(const:TLSmode
-	  (plus:TLSmode (match_operand:TLSmode 1 "gpc_reg_operand" "b")
-	    (high:TLSmode
-	      (unspec:TLSmode [(match_operand:TLSmode 2 "rs6000_tls_symbol_ref" "")]
-			      UNSPEC_TLSGOTTPREL)))))]
+	(high:TLSmode
+	  (unspec:TLSmode [(match_operand:TLSmode 1 "gpc_reg_operand" "b")
+			   (match_operand:TLSmode 2 "rs6000_tls_symbol_ref" "")]
+			  UNSPEC_TLSGOTTPREL)))]
  "HAVE_AS_TLS && TARGET_CMODEL != CMODEL_SMALL"
  "addis %0,%1,%2@got@tprel@ha"
  [(set_attr "length" "4")])
@@ -12223,6 +12212,25 @@
   DONE;
 })

+;; Largetoc support
+(define_insn "largetoc_high"
+  [(set (match_operand:DI 0 "gpc_reg_operand" "=b*r")
+	(high:DI
+	  (unspec [(match_operand:DI 1 "" "")
+		   (match_operand:DI 2 "gpc_reg_operand" "b")]
+		  UNSPEC_TOCREL)))]
+  "TARGET_ELF && TARGET_CMODEL != CMODEL_SMALL"
+  "{cau|addis} %0,%2,%1@toc@ha")
+
+(define_insn "largetoc_low"
+  [(set (match_operand:DI 0 "gpc_reg_operand" "=r,r")
+	(lo_sum:DI (match_operand:DI 1 "gpc_reg_operand" "b,!*r")
+		   (match_operand:DI 2 "" "")))]
+  "TARGET_ELF && TARGET_CMODEL != CMODEL_SMALL"
+  "@
+   {cal %0,%2@l(%1)|addi %0,%1,%2@l}
+   {ai|addic} %0,%1,%2@l")
+
 ;; Elf specific ways of loading addresses for non-PIC code.
 ;; The output of this could be r0, but we make a very strong
 ;; preference for a base register because it will usually
@@ -12241,22 +12249,6 @@
  "@
   {cal|la} %0,%2@l(%1)
   {ai|addic} %0,%1,%K2")
-
-;; Largetoc support
-(define_insn "largetoc_high"
-  [(set (match_operand:DI 0 "gpc_reg_operand" "=b")
-	(const:DI
-	  (plus:DI (match_operand:DI 1 "gpc_reg_operand" "b")
-		   (high:DI (match_operand:DI 2 "" "")))))]
-  "TARGET_ELF && TARGET_CMODEL != CMODEL_SMALL"
-  "{cau|addis} %0,%1,%2@ha")
-
-(define_insn "largetoc_low"
-  [(set (match_operand:DI 0 "gpc_reg_operand" "=r")
-	(lo_sum:DI (match_operand:DI 1 "gpc_reg_operand" "b")
-		   (match_operand:DI 2 "" "")))]
-  "TARGET_ELF && TARGET_CMODEL != CMODEL_SMALL"
-  "{cal %0,%2@l(%1)|addi %0,%1,%2@l}")

 ;; Call and call_value insns

 (define_expand "call"

Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 185830)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -38,7 +38,7 @@ extern bool macho_lo_sum_memory_operand (rtx, enum
 extern int num_insns_constant (rtx, enum machine_mode);
 extern int num_insns_constant_wide (HOST_WIDE_INT);
 extern int small_data_operand (rtx, enum machine_mode);
-extern bool toc_relative_expr_p (rtx);
+extern bool toc_relative_expr_p (const_rtx, bool);
 extern bool invalid_e500_subreg (rtx, enum machine_mode);
 extern void validate_condition_mode (enum rtx_code, enum machine_mode);
 extern bool legitimate_constant_pool_address_p (const_rtx, enum machine_mode,

-- 
Alan Modra
Australia Development Lab, IBM
[RS6000] Stack tie fix.
	  ;
-	  emit_insn (gen_frame_tie (mem1, mem2));
-	}
+	rs6000_emit_stack_tie (frame_reg_rtx, true);

       insn = emit_insn (gen_add3_insn (frame_reg_rtx, hard_frame_pointer_rtx,
					GEN_INT (info->total_size)));
@@ -20456,11 +20465,7 @@ rs6000_emit_epilogue (int sibcall)
       /* Prevent reordering memory accesses against stack pointer restore.  */
       if (cfun->calls_alloca
	  || offset_below_red_zone_p (-info->total_size))
-	{
-	  rtx mem = gen_rtx_MEM (BLKmode, sp_reg_rtx);
-	  MEM_NOTRAP_P (mem) = 1;
-	  emit_insn (gen_stack_tie (mem));
-	}
+	rs6000_emit_stack_tie (frame_reg_rtx, false);

       insn = emit_insn (gen_add3_insn (sp_reg_rtx, sp_reg_rtx,
					GEN_INT (info->total_size)));
       sp_offset = 0;

Index: gcc/config/rs6000/predicates.md
===================================================================
--- gcc/config/rs6000/predicates.md	(revision 185830)
+++ gcc/config/rs6000/predicates.md	(working copy)
@@ -1451,3 +1451,11 @@
   return 1;
 })
+
+;; Return 1 if OP is a stack tie operand.
+(define_predicate "tie_operand"
+  (match_code "mem")
+{
+  return (GET_CODE (XEXP (op, 0)) == UNSPEC
+	  && XINT (XEXP (op, 0), 1) == UNSPEC_TIE);
+})

Index: gcc/config/rs6000/rs6000.md
===================================================================
--- gcc/config/rs6000/rs6000.md	(revision 185830)
+++ gcc/config/rs6000/rs6000.md	(working copy)
@@ -11989,17 +11989,20 @@

 (define_expand "restore_stack_block"
  [(set (match_dup 2) (match_dup 3))
   (set (match_dup 4) (match_dup 2))
-  (set (match_dup 5) (unspec:BLK [(match_dup 5)] UNSPEC_TIE))
+  (set (match_dup 5) (const_int 0))
   (set (match_operand 0 "register_operand" "")
	(match_operand 1 "register_operand" ""))]
  ""
 {
+  rtx u;
+
   operands[1] = force_reg (Pmode, operands[1]);
   operands[2] = gen_reg_rtx (Pmode);
   operands[3] = gen_frame_mem (Pmode, operands[0]);
   operands[4] = gen_frame_mem (Pmode, operands[1]);
-  operands[5] = gen_frame_mem (BLKmode, operands[0]);
+  u = gen_rtx_UNSPEC (Pmode, gen_rtvec (1, operands[0]), UNSPEC_TIE);
+  operands[5] = gen_frame_mem (BLKmode, u);
 })

 (define_expand "save_stack_nonlocal"
@@ -12022,12 +12025,13 @@
  [(set (match_dup 2) (match_operand 1 "memory_operand" ""))
   (set (match_dup 3) (match_dup 4))
   (set (match_dup 5) (match_dup 2))
-  (set (match_dup 6) (unspec:BLK [(match_dup 6)] UNSPEC_TIE))
+  (set (match_dup 6) (const_int 0))
   (set (match_operand 0 "register_operand" "") (match_dup 3))]
  ""
 {
   int units_per_word = (TARGET_32BIT) ? 4 : 8;
+  rtx u;

   /* Restore the backchain from the first word, sp from the second.  */
   operands[2] = gen_reg_rtx (Pmode);
@@ -12035,7 +12039,8 @@
   operands[1] = adjust_address_nv (operands[1], Pmode, 0);
   operands[4] = adjust_address_nv (operands[1], Pmode, units_per_word);
   operands[5] = gen_frame_mem (Pmode, operands[3]);
-  operands[6] = gen_frame_mem (BLKmode, operands[0]);
+  u = gen_rtx_UNSPEC (Pmode, gen_rtvec (1, operands[0]), UNSPEC_TIE);
+  operands[6] = gen_frame_mem (BLKmode, u);
 })

 ;; TOC register handling.
@@ -15899,25 +15904,15 @@
  [(set_attr "type" "branch")
   (set_attr "length" "4")])

-; These are to explain that changes to the stack pointer should
-; not be moved over stores to stack memory.
+; This is to explain that changes to the stack pointer should
+; not be moved over loads from or stores to stack memory.
 (define_insn "stack_tie"
-  [(set (match_operand:BLK 0 "memory_operand" "+m")
-	(unspec:BLK [(match_dup 0)] UNSPEC_TIE))]
+  [(set (match_operand:BLK 0 "tie_operand" "")
+	(const_int 0))]
  ""
  ""
  [(set_attr "length" "0")])

-; Like stack_tie, but depend on both fp and sp based memory.
-(define_insn "frame_tie"
-  [(set (match_operand:BLK 0 "memory_operand" "+m")
-	(unspec:BLK [(match_dup 0)
-		     (match_operand:BLK 1 "memory_operand" "m")] UNSPEC_TIE))]
-  ""
-  ""
-  [(set_attr "length" "0")])
-
-
 (define_expand "epilogue"
  [(use (const_int 0))]

-- 
Alan Modra
Australia Development Lab, IBM
Re: [rs6000] Fix PR 50906, eh_frame and other woes
Ping http://gcc.gnu.org/ml/gcc-patches/2011-11/msg00543.html -- Alan Modra Australia Development Lab, IBM
Re: PR libgomp/51376 fix
On Fri, Dec 02, 2011 at 08:10:11PM +1030, Alan Modra wrote:
> 	PR libgomp/51376
> 	* task.c (GOMP_taskwait): Don't access task->children outside of
> 	task_lock mutex region.
> 	(GOMP_task): Likewise.

Committed revision 182151 with rth's IRC OK.

-- 
Alan Modra
Australia Development Lab, IBM
FIX PR51926, libstdc++ iterator store bigendian bitfield related
This patch fixes a wrong-code bug in store_bit_field.  When storing a multi-word BLKmode value (such as from a RECORD_TYPE), the last word is trashed on big-endian targets.  See the PR for the gory details.  Bootstrapped and regression tested powerpc64-linux.  Fixes 74 testsuite failures.  OK for mainline?  And earlier versions of gcc?

	PR middle-end/51926
	* expmed.c (store_bit_field_1): Properly handle last word of
	BLKmode value when bigendian.

Index: gcc/expmed.c
===================================================================
--- gcc/expmed.c	(revision 183228)
+++ gcc/expmed.c	(working copy)
@@ -551,19 +551,27 @@ store_bit_field_1 (rtx str_rtx, unsigned
	    /* If I is 0, use the low-order word in both field and target;
	       if I is 1, use the next to lowest word; and so on.  */
	    unsigned int wordnum = (backwards ? nwords - i - 1 : i);
+	    unsigned int nbits = (i == nwords - 1
+				  ? bitsize - i * BITS_PER_WORD
+				  : BITS_PER_WORD);
	    unsigned int bit_offset = (backwards
-				       ? MAX ((int) bitsize - ((int) i + 1)
-					      * BITS_PER_WORD,
-					      0)
-				       : (int) i * BITS_PER_WORD);
+				       ? bitsize - i * BITS_PER_WORD - nbits
+				       : i * BITS_PER_WORD);
	    rtx value_word = operand_subword_force (value, wordnum, fieldmode);

-	    if (!store_bit_field_1 (op0, MIN (BITS_PER_WORD,
-					      bitsize - i * BITS_PER_WORD),
-				    bitnum + bit_offset,
+	    if (WORDS_BIG_ENDIAN && !backwards && nbits != BITS_PER_WORD)
+	      {
+		/* Original fieldmode was BLKmode.  The value in this
+		   case consists of bits in memory order, so any unused
+		   bits in the last word are in the least significant
+		   position.  */
+		value_word = expand_shift (RSHIFT_EXPR, word_mode, value_word,
+					   BITS_PER_WORD - nbits,
+					   value_word, 1);
+	      }
+	    if (!store_bit_field_1 (op0, nbits, bitnum + bit_offset,
				    bitregion_start, bitregion_end,
-				    word_mode,
-				    value_word, fallback_p))
+				    word_mode, value_word, fallback_p))
	      {
		delete_insns_since (last);
		return false;

-- 
Alan Modra
Australia Development Lab, IBM
Re: Bug store_bit_field_1 + patch
On Wed, Jan 18, 2012 at 12:48:45PM +0100, Aurelien Buhrig wrote:
> The problem occurs when value is a REG and bitsize > BITS_PER_WORD.
> This is because wordnum, which is used to get the subword of value, is
> incorrectly computed, in BIG_ENDIAN, wrt the number of words needed by
> bitsize instead of the number of words needed by the integer mode of
> the field, which is the mode used to compute subwords.

That doesn't sound right.  wordnum is passed to operand_subword as its offset param, which counts in words of word_mode size.  I think you have a problem elsewhere.

-- 
Alan Modra
Australia Development Lab, IBM
Re: FIX PR51926, libstdc++ iterator store bigendian bitfield related
On Sat, Jan 21, 2012 at 11:07:10AM +1030, Alan Modra wrote:
> 	PR middle-end/51926
> 	* expmed.c (store_bit_field_1): Properly handle last word of
> 	BLKmode value when bigendian.

Blah, I was working on an old tree.  The bug is already fixed.

-- 
Alan Modra
Australia Development Lab, IBM