Author: fengzhou Date: 2011-05-10 12:27:46 -0400 (Tue, 10 May 2011) New Revision: 3592
Modified: trunk/osprey/be/cg/calls.cxx trunk/osprey/common/com/x8664/targ_sim.cxx Log: Background ---------- The problem is related to gcc's extension, namely the stdcall function attributes. It is exposed while using gnu m4 as the test case. According to gcc, stdcall On the Intel 386, the stdcall attribute causes the compiler to assume that the called function will pop off the stack space used to pass arguments, unless it takes a variable number of arguments. For example, the following code int __attribute__ ((stdcall)) foo(int a, int b) { int j; j = a + b; return j; } is compiled into (with gcc -m32 -O0): foo: pushl %ebp movl %esp, %ebp subl $16, %esp movl 12(%ebp), %eax addl 8(%ebp), %eax movl %eax, -4(%ebp) movl -4(%ebp), %eax leave ret $8 <== note $8 here, which is the space needed for passing actual parameters, callee adjusts sp and the callsite is compiled into: call foo subl $8, %esp <== generated only when __attribute__ ((stdcall)) is declared on `foo' Bug 1: m4 runtime failure when compiled with `-O2' or `-Os' (-m32) ------------------------------------------------------------------ This bug is caused by tail call optimization in CG. Basically, in some cases, CG decides if it can just transform a call of B --> C into a jump instruction. In m4, we have the following kind of code: static reg_errcode_t __attribute__ ((regparm (3), stdcall)) get_subexp_sub (re_match_context_t *mctx, const re_sub_match_top_t *sub_top, re_sub_match_last_t *sub_last, Idx bkref_node, Idx bkref_str) { ... return clean_state_log_if_needed (mctx, to_idx); } static reg_errcode_t __attribute__ ((regparm (3), stdcall)) clean_state_log_if_needed (re_match_context_t *mctx, Idx next_state_log_idx) For function `get_subexp_sub', `regparm(3)' causes the first three arguments to be passed via registers and the last two via stack. So, the stack size needed for the actual parameters is 8. Therefore, you will see a `ret $8' instruction at the end of this function. On the other hand, all the parameters for `clean_state_log_if_needed' are passed via registers because of the `regparm(3)' declaration. So, you will see a `ret' instruction at the end instead (no sp adjustment needed). Because the `sp' adjustment is different for these two functions, the call to `clean_state_log_if_needed' cannot be converted into a `jump' instruction in this case. The original code fails to check to see if the stack adjustments are same. Bug 2: m4 runtime failure when compiled with `-Ofast' (-m32) ------------------------------------------------------------ This bug is also caused by `stdcall'. In function `Initialize_Stack_Frame', `Calc_Formal_Area' is called to compute the stack area from types of the formal parameters. On the other hand, when processing a callsite, `Calc_Actual_Area' is called to compute the stack area from types of the *actual* parameters. Ideally, the results should be same for the same function. The compiler also assumes so, since it uses `arg_area_size_array' to store the computation result (regardless whether it is from `Calc_Formal_Area' or `Calc_Actual_Area'), so that only one such computation is needed for each function type. In m4 code, we have the following two functions: static reg_errcode_t __attribute__ ((regparm (3), stdcall)) re_search_internal (const regex_t *preg, const char *string, Idx length, Idx start, Idx last_start, Idx stop, size_t nmatch, regmatch_t pmatch[], int eflags) //calls set_regs static reg_errcode_t __attribute__ ((regparm (3), stdcall)) set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch, regmatch_t *pmatch, _Bool fl_backtrack) For `set_regs' function, because of `regparm(3)', the first 3 arguments are passed via registers. So, for the call to `set_regs', only the last two arguments need to be passed via stack. The last parameter of `set_regs' is of _Bool type, which is of size 1 byte. However, the actual parameter needs to be at least 4 byte (I4) even for _Bool. So, the stack size computed from `Calc_Actual_Area' is 8 while it is 5 from `Calc_Formal_Area'. So, now what you have in the generated code looks like: re_search_internal: call set_regs sub $8, sp set_regs: ... ret $5 Again, this is caused inconsistency in sp adjustment. The fix is to take account of padding in `Calc_Formal_Area' as well because of stack alignment. Additional Notes ---------------- 1. Bug 2 does not happen in `-O3' because the compiler sees calls to `set_regs' before compiling `set_regs'. Therefore, the stack size computed is 8. Then, when `set_regs' function is being compiled, because the `arg_area_size_array' has already been filled for this function type, `Calc_Formal_Area' does not re-compute. So you get `ret $8' at the end of `set_regs' function. Code Reviewed by Jian-Xin Modified: trunk/osprey/be/cg/calls.cxx =================================================================== --- trunk/osprey/be/cg/calls.cxx 2011-05-08 16:05:12 UTC (rev 3591) +++ trunk/osprey/be/cg/calls.cxx 2011-05-10 16:27:46 UTC (rev 3592) @@ -1182,6 +1182,50 @@ } } + /* Don't perform tail call optimization if the caller and callee have different stack + adjustment size + see: Generate_Exit + */ + if ( Is_Target_32bit() && call_st ){ + // call_st might be NULL in indirect calls; + const TY_IDX caller_ty = ST_pu_type(pu_st); + const BOOL caller_ff2c_abi = PU_ff2c_abi(Pu_Table[ST_pu(pu_st)]); + const RETURN_INFO caller_return_info = + Get_Return_Info( TY_ret_type(caller_ty), No_Simulated, caller_ff2c_abi ); + + const TY_IDX callee_ty = func_type; + const BOOL callee_ff2c_abi = PU_ff2c_abi(Pu_Table[ST_pu(call_st)]); + const RETURN_INFO callee_return_info = + Get_Return_Info( TY_ret_type(callee_ty), No_Simulated, callee_ff2c_abi ); + + int caller_sp_adjust = 0; + int callee_sp_adjust = 0; + + if( RETURN_INFO_return_via_first_arg(caller_return_info) || + TY_return_to_param( caller_ty ) ){ + caller_sp_adjust = Pointer_Size; + } + + // callee adjust SP for stdcall/fastcall at return time + if (TY_has_stdcall(caller_ty) || TY_has_fastcall(caller_ty)) { + caller_sp_adjust += Get_PU_arg_area_size(caller_ty); + } + + if( RETURN_INFO_return_via_first_arg(callee_return_info) || + TY_return_to_param( callee_ty ) ){ + callee_sp_adjust = Pointer_Size; + } + + // callee adjust SP for stdcall/fastcall at return time + if (TY_has_stdcall(callee_ty) || TY_has_fastcall(callee_ty)) { + callee_sp_adjust += Get_PU_arg_area_size(callee_ty); + } + + if ( caller_sp_adjust != callee_sp_adjust ) { + return NULL; + } + } + /* Under -m32 -fpic, don't do tail call optimization, because the caller needs to restore GOT before return. */ Modified: trunk/osprey/common/com/x8664/targ_sim.cxx =================================================================== --- trunk/osprey/common/com/x8664/targ_sim.cxx 2011-05-08 16:05:12 UTC (rev 3591) +++ trunk/osprey/common/com/x8664/targ_sim.cxx 2011-05-10 16:27:46 UTC (rev 3592) @@ -822,6 +822,13 @@ if (Is_Target_32bit() && ploc.size == 8) { ++Current_Int_Param_Num; } + // for IA-32, stack needs to be aligned. So, if the last formal argument + // needs pading, the total stack size computed from (just) function decl + // is different from the stack size computed from actual callsite + if (Is_Target_32bit() && PLOC_on_stack(ploc)) { + ploc.size += rpad; + rpad = 0; + } } break; ------------------------------------------------------------------------------ Achieve unprecedented app performance and reliability What every C/C++ and Fortran developer should know. Learn how Intel has extended the reach of its next-generation tools to help boost performance applications - inlcuding clusters. http://p.sf.net/sfu/intel-dev2devmay _______________________________________________ Open64-devel mailing list Open64-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/open64-devel