Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-02-02 Thread Qing Zhao via Gcc-patches
Hi,

With the following patch:

[qinzhao@localhost gcc]$ git diff tree-ssa-structalias.c
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index cf653be..bd18841 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -4851,6 +4851,30 @@ find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
   return false;
 }
 
+static void
+find_func_aliases_for_deferred_init (gcall *t)
+{
+
+  tree lhsop = gimple_call_lhs (t);
+  enum auto_init_type init_type
+    = (enum auto_init_type) TREE_INT_CST_LOW (gimple_call_arg (t, 1));
+  auto_vec<ce_s> lhsc;
+  auto_vec<ce_s> rhsc;
+  struct constraint_expr temp;
+
+  get_constraint_for (lhsop, &lhsc);
+  if (init_type == AUTO_INIT_ZERO && flag_delete_null_pointer_checks)
+    temp.var = nothing_id;
+  else
+    temp.var = nonlocal_id;
+  temp.type = ADDRESSOF;
+  temp.offset = 0;
+  rhsc.safe_push (temp);
+
+  process_all_all_constraints (lhsc, rhsc);
+  return;
+}
+
 /* Create constraints for the call T.  */
 
 static void
@@ -4864,6 +4888,12 @@ find_func_aliases_for_call (struct function *fn, gcall *t)
       && find_func_aliases_for_builtin_call (fn, t))
     return;
 
+  if (gimple_call_internal_p (t, IFN_DEFERRED_INIT))
+    {
+      find_func_aliases_for_deferred_init (t);
+      return;
+    }
+

The *.ealias dumps for the routine “bump_map” are now exactly the same for
approaches A and D.
However, the stack size for D is still bigger than for A.

Any suggestions?

Qing


Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-02-02 Thread Qing Zhao via Gcc-patches



> On Feb 2, 2021, at 1:43 AM, Richard Biener  wrote:
> 
> On Mon, 1 Feb 2021, Qing Zhao wrote:
> 
>> My question:
>> 
>> Is it possible to adjust alias analysis to resolve this issue?
> 
> You probably want to handle .DEFERRED_INIT in tree-ssa-structalias.c
> find_func_aliases_for_call (it's not a builtin but you can look in
> the respective subroutine for examples).  Specifically you want to
> avoid making anything escaped or clobbered.

Okay, thanks.

Will check on that.

Qing
>> 
> 
> -- 
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)



Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-02-01 Thread Richard Biener
On Mon, 1 Feb 2021, Qing Zhao wrote:

> My question:
> 
> Is it possible to adjust alias analysis to resolve this issue?

You probably want to handle .DEFERRED_INIT in tree-ssa-structalias.c
find_func_aliases_for_call (it's not a builtin but you can look in
the respective subroutine for examples).  Specifically you want to
avoid making anything escaped or clobbered.


Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-02-01 Thread Qing Zhao via Gcc-patches
Hi, Richard,

I have adjusted SRA phase to split calls to DEFERRED_INIT per you suggestion.

And now the routine “bump_map” in 511.povray is like following:
...

 # DEBUG BEGIN_STMT
  xcoor = 0.0;
  ycoor = 0.0;
  # DEBUG BEGIN_STMT
  index = .DEFERRED_INIT (index, 2);
  index2 = .DEFERRED_INIT (index2, 2);
  index3 = .DEFERRED_INIT (index3, 2);
  # DEBUG BEGIN_STMT
  colour1 = .DEFERRED_INIT (colour1, 2);
  colour2 = .DEFERRED_INIT (colour2, 2);
  colour3 = .DEFERRED_INIT (colour3, 2);
  # DEBUG BEGIN_STMT
  p1$0_181 = .DEFERRED_INIT (p1$0_195(D), 2);
  # DEBUG p1$0 => p1$0_181
  p1$1_184 = .DEFERRED_INIT (p1$1_182(D), 2);
  # DEBUG p1$1 => p1$1_184
  p1$2_172 = .DEFERRED_INIT (p1$2_185(D), 2);
  # DEBUG p1$2 => p1$2_172
  p2$0_177 = .DEFERRED_INIT (p2$0_173(D), 2);
  # DEBUG p2$0 => p2$0_177
  p2$1_135 = .DEFERRED_INIT (p2$1_178(D), 2);
  # DEBUG p2$1 => p2$1_135
  p2$2_137 = .DEFERRED_INIT (p2$2_136(D), 2);
  # DEBUG p2$2 => p2$2_137
  p3$0_377 = .DEFERRED_INIT (p3$0_376(D), 2);
  # DEBUG p3$0 => p3$0_377
  p3$1_379 = .DEFERRED_INIT (p3$1_378(D), 2);
  # DEBUG p3$1 => p3$1_379
  p3$2_381 = .DEFERRED_INIT (p3$2_380(D), 2);
  # DEBUG p3$2 => p3$2_381


In the above, p1, p2, and p3 have each been split into calls to .DEFERRED_INIT
for their components.

With this change, the stack usage numbers with -fstack-usage for approach A, 
old approach D and new D with the splitting in SRA are:

  Approach A    Approach D-old    Approach D-new
     272             624               368

From the above, we can see that splitting the call to .DEFERRED_INIT in SRA
reduces the stack usage increase dramatically.

However, it looks like the stack size for D is still bigger than for A.

I checked the IR again and found that the alias analysis might be responsible
for this (by comparing the image.cpp.026t.ealias dumps for both A and D):

(Due to the call to:

  colour1 = .DEFERRED_INIT (colour1, 2);
)

**Approach A:

Points_to analysis:

Constraints:
…
colour1 = 
…
colour1 = 
colour1 = 
colour1 = 
colour1 = 
colour1 = 
...
callarg(53) = 
...
_53 = colour1

Points_to sets:
…
colour1 = { NULL ESCAPED NONLOCAL } same as _53
...
CALLUSED(48) = { NULL ESCAPED NONLOCAL index colour1 }
CALLCLOBBERED(49) = { NULL ESCAPED NONLOCAL index colour1 } same as CALLUSED(48)
...
callarg(53) = { NULL ESCAPED NONLOCAL colour1 }

**Approach D:

Points_to analysis:

Constraints:
…
callarg(19) = colour1
callarg(19) = 
colour1 = callarg(19) + UNKNOWN
colour1 = 
…
colour1 = 
colour1 = 
colour1 = 
colour1 = 
colour1 = 
…
callarg(74) = 
callarg(74) = callarg(74) + UNKNOWN
callarg(74) = *callarg(74) + UNKNOWN
…
_53 = colour1
_54 = _53
_55 = _54 + UNKNOWN
_55 = 
_56 = colour1
_57 = _56
_58 = _57 + UNKNOWN
_58 = 
_59 = _55 + UNKNOWN
_59 = _58 + UNKNOWN
_60 = colour1
_61 = _60
_62 = _61 + UNKNOWN
_62 = 
_63 = _59 + UNKNOWN
_63 = _62 + UNKNOWN
_64 = _63 + UNKNOWN
..
Points_to set:
…
colour1 = { ESCAPED NONLOCAL } same as callarg(19)
…
CALLUSED(69) = { ESCAPED NONLOCAL index colour1 }
CALLCLOBBERED(70) = { ESCAPED NONLOCAL index colour1 } same as CALLUSED(69)
callarg(71) = { ESCAPED NONLOCAL }
callarg(72) = { ESCAPED NONLOCAL }
callarg(73) = { ESCAPED NONLOCAL }
callarg(74) = { ESCAPED NONLOCAL colour1 }

My question:

Is it possible to adjust alias analysis to resolve this issue?

thanks.

Qing


Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-18 Thread Qing Zhao via Gcc-patches



> On Jan 18, 2021, at 7:09 AM, Richard Sandiford wrote:
> 
> 
> I realise no-one was suggesting otherwise, but FWIW: SRA could easily
> be extended to handle .DEFERRED_INIT if that's the main source of
> excess stack usage.  A single .DEFERRED_INIT of an aggregate can
> be split into .DEFERRED_INITs of individual components.

Thanks a lot for the suggestion.
I will study the SRA code to see how to do this and then see whether it
resolves the issue.
> 
> In other words, the investigation you're doing looks like the right way
> of deciding which passes are worth extending to handle .DEFERRED_INIT.
Yes, from the study so far, it looks like the major issue with the
.DEFERRED_INIT approach is the stack size increase.
Hopefully after resolving this issue, we will be done.

Qing

> 
> Thanks,
> Richard



Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-18 Thread Richard Sandiford via Gcc-patches
Qing Zhao  writes:
 D will keep all initialized aggregates as aggregates and live which
 means stack will be allocated for it.  With A the usual optimizations
 to reduce stack usage can be applied.
>>> 
>>> I checked the routine “poverties::bump_map” in 511.povray_r since it
>>> has a large stack increase due to implementation D, by examining the
>>> IR immediately before the RTL expansion phase
>>> (image.cpp.244t.optimized). I found that we have the following
>>> additional statements for the array elements:
>>> 
>>> void  pov::bump_map (double * EPoint, struct TNORMAL * Tnormal, double
>>> * normal)
>>> {
>>> …
>>> double p3[3];
>>> double p2[3];
>>> double p1[3];
>>> float colour3[5];
>>> float colour2[5];
>>> float colour1[5];
>>> …
>>>  # DEBUG BEGIN_STMT
>>> colour1 = .DEFERRED_INIT (colour1, 2);
>>> colour2 = .DEFERRED_INIT (colour2, 2);
>>> colour3 = .DEFERRED_INIT (colour3, 2);
>>> # DEBUG BEGIN_STMT
>>> MEM <double> [(double[3] *)&p1] = p1$0_144(D);
>>> MEM <double> [(double[3] *)&p1 + 8B] = p1$1_135(D);
>>> MEM <double> [(double[3] *)&p1 + 16B] = p1$2_138(D);
>>> p1 = .DEFERRED_INIT (p1, 2);
>>> # DEBUG D#12 => MEM <double> [(double[3] *)&p1]
>>> # DEBUG p1$0 => D#12
>>> # DEBUG D#11 => MEM <double> [(double[3] *)&p1 + 8B]
>>> # DEBUG p1$1 => D#11
>>> # DEBUG D#10 => MEM <double> [(double[3] *)&p1 + 16B]
>>> # DEBUG p1$2 => D#10
>>> MEM <double> [(double[3] *)&p2] = p2$0_109(D);
>>> MEM <double> [(double[3] *)&p2 + 8B] = p2$1_111(D);
>>> MEM <double> [(double[3] *)&p2 + 16B] = p2$2_254(D);
>>> p2 = .DEFERRED_INIT (p2, 2);
>>> # DEBUG D#9 => MEM <double> [(double[3] *)&p2]
>>> # DEBUG p2$0 => D#9
>>> # DEBUG D#8 => MEM <double> [(double[3] *)&p2 + 8B]
>>> # DEBUG p2$1 => D#8
>>> # DEBUG D#7 => MEM <double> [(double[3] *)&p2 + 16B]
>>> # DEBUG p2$2 => D#7
>>> MEM <double> [(double[3] *)&p3] = p3$0_256(D);
>>> MEM <double> [(double[3] *)&p3 + 8B] = p3$1_258(D);
>>> MEM <double> [(double[3] *)&p3 + 16B] = p3$2_260(D);
>>> p3 = .DEFERRED_INIT (p3, 2);
>>> ….
>>> }
>>> 
>>> I guess that the above “MEM ….. = …” stores are the ones that make the
>>> difference. Which phase introduced them?
>> 
>> Looks like SRA. But you can just dump all and grep for the first occurrence. 
>
> Yes, it looks like SRA is the one:
>
> image.cpp.035t.esra:  MEM <double> [(double[3] *)&p1] = p1$0_195(D);
> image.cpp.035t.esra:  MEM <double> [(double[3] *)&p1 + 8B] = p1$1_182(D);
> image.cpp.035t.esra:  MEM <double> [(double[3] *)&p1 + 16B] = p1$2_185(D);

I realise no-one was suggesting otherwise, but FWIW: SRA could easily
be extended to handle .DEFERRED_INIT if that's the main source of
excess stack usage.  A single .DEFERRED_INIT of an aggregate can
be split into .DEFERRED_INITs of individual components.
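For example (sketched in the same dump notation as the listings above; the SSA version numbers are illustrative, not taken from a real dump):

```
  /* before SRA: one call covering the whole aggregate */
  p1 = .DEFERRED_INIT (p1, 2);

  /* after SRA splits p1 into scalar replacements p1$0, p1$1, p1$2 */
  p1$0_1 = .DEFERRED_INIT (p1$0_0(D), 2);
  p1$1_1 = .DEFERRED_INIT (p1$1_0(D), 2);
  p1$2_1 = .DEFERRED_INIT (p1$2_0(D), 2);
```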

In other words, the investigation you're doing looks like the right way
of deciding which passes are worth extending to handle .DEFERRED_INIT.

Thanks,
Richard


Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Qing Zhao via Gcc-patches



> On Jan 15, 2021, at 11:22 AM, Richard Biener  wrote:
> 
> On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao wrote:
>> 
>> 
>>> On Jan 15, 2021, at 2:11 AM, Richard Biener 
>> wrote:
>>> 
>>> 
>>> 
>>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>>> 
 Hi, 
 More data on code size and compilation time with CPU2017:
 Compilation time data: the numbers are the slowdown against the
 default “no”:
 benchmarks  A/no D/no
 
 500.perlbench_r 5.19% 1.95%
 502.gcc_r 0.46% -0.23%
 505.mcf_r 0.00% 0.00%
 520.omnetpp_r 0.85% 0.00%
 523.xalancbmk_r 0.79% -0.40%
 525.x264_r -4.48% 0.00%
 531.deepsjeng_r 16.67% 16.67%
 541.leela_r  0.00%  0.00%
 557.xz_r 0.00%  0.00%
 
 507.cactuBSSN_r 1.16% 0.58%
 508.namd_r 9.62% 8.65%
 510.parest_r 0.48% 1.19%
 511.povray_r 3.70% 3.70%
 519.lbm_r 0.00% 0.00%
 521.wrf_r 0.05% 0.02%
 526.blender_r 0.33% 1.32%
 527.cam4_r -0.93% -0.93%
 538.imagick_r 1.32% 3.95%
 544.nab_r  0.00% 0.00%
 From the above data, looks like that the compilation time impact
 from implementation A and D are almost the same.
 ***code size data: the numbers are the code size increase against the
 default “no”:
 benchmarks A/no D/no
 
 500.perlbench_r 2.84% 0.34%
 502.gcc_r 2.59% 0.35%
 505.mcf_r 3.55% 0.39%
 520.omnetpp_r 0.54% 0.03%
 523.xalancbmk_r 0.36%  0.39%
 525.x264_r 1.39% 0.13%
 531.deepsjeng_r 2.15% -1.12%
 541.leela_r 0.50% -0.20%
 557.xz_r 0.31% 0.13%
 
 507.cactuBSSN_r 5.00% -0.01%
 508.namd_r 3.64% -0.07%
 510.parest_r 1.12% 0.33%
 511.povray_r 4.18% 1.16%
 519.lbm_r 8.83% 6.44%
 521.wrf_r 0.08% 0.02%
 526.blender_r 1.63% 0.45%
 527.cam4_r  0.16% 0.06%
 538.imagick_r 3.18% -0.80%
 544.nab_r 5.76% -1.11%
 Avg 2.52% 0.36%
 From the above data, the implementation D is always better than A; it’s
 surprising to me, not sure what’s the reason for this.
>>> 
>>> D probably inhibits most interesting loop transforms (check SPEC FP
>>> performance).
>> 
>> The call to .DEFERRED_INIT is marked as ECF_CONST:
>> 
>> /* A function to represent an artificial initialization to an
>> uninitialized
>>  automatic variable. The first argument is the variable itself, the
>>  second argument is the initialization type.  */
>> DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW,
>> NULL)
>> 
>> So, I assume that such const call should minimize the impact to loop
>> optimizations. But yes, it will still inhibit some of the loop
>> transformations.
>> 
>>> It will also most definitely disallow SRA which, when
>>> an aggregate is not completely elided, tends to grow code.
>> 
>> Make sense to me. 
>> 
>> The run-time performance data for D and A are actually very similar as
>> I posted in the previous email (I listed it here for convenience)
>> 
>> Run-time performance overhead with A and D:
>> 
>> benchmarks   A / no  D /no
>> 
>> 500.perlbench_r  1.25%   1.25%
>> 502.gcc_r0.68%   1.80%
>> 505.mcf_r0.68%   0.14%
>> 520.omnetpp_r4.83%   4.68%
>> 523.xalancbmk_r  0.18%   1.96%
>> 525.x264_r   1.55%   2.07%
>> 531.deepsjeng_r  11.57%  11.85%
>> 541.leela_r  0.64%   0.80%
>> 557.xz_r  -0.41%  -0.41%
>> 
>> 507.cactuBSSN_r  0.44%   0.44%
>> 508.namd_r   0.34%   0.34%
>> 510.parest_r 0.17%   0.25%
>> 511.povray_r 56.57%  57.27%
>> 519.lbm_r0.00%   0.00%
>> 521.wrf_r -0.28% -0.37%
>> 526.blender_r16.96%  17.71%
>> 527.cam4_r   0.70%   0.53%
>> 538.imagick_r2.40%   2.40%
>> 544.nab_r0.00%   -0.65%
>> 
>> avg  5.17%   5.37%
>> 
>> Especially for the SPEC FP benchmarks, I didn’t see too much
>> performance difference between A and D. 
>> I guess that the RTL optimizations might be enough to get rid of most
>> of the overhead introduced by the additional initialization. 
>> 
>>> 
 stack usage data: I added -fstack-usage to the compilation line when
 compiling CPU2017 benchmarks, and all the *.su files were generated for each
 of the modules.
 Since there are a lot of such files, and the stack size information is
 embedded in each of the files, I just picked up one benchmark, 511.povray,
 to check, which is the one that
 has the most runtime overhead when adding initialization (both A and D).
 I identified all the *.su files that are different between A and D and did a
 diff on those *.su files, and it looks like the stack size is much higher
 with D than with A, for example:
 $ diff build_base_auto_init.D./bbox.su build_base_auto_init.A./bbox.su
 5c5
 < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Richard Biener
On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao wrote:
>
>
>> On Jan 15, 2021, at 2:11 AM, Richard Biener 
>wrote:
>> 
>> 
>> 
>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>> 
>>> From the above data, the implementation D is always better than A, it’s
>>> surprising to me, not sure what’s the reason for this.
>> 
>> D probably inhibits most interesting loop transforms (check SPEC FP
>> performance).
>
>The call to .DEFERRED_INIT is marked as ECF_CONST:
>
>/* A function to represent an artificial initialization to an
>uninitialized
>   automatic variable. The first argument is the variable itself, the
>   second argument is the initialization type.  */
>DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW,
>NULL)
>
>So, I assume that such a const call should minimize the impact on loop
>optimizations. But yes, it will still inhibit some of the loop
>transformations.
>
>>  It will also most definitely disallow SRA which, when
>> an aggregate is not completely elided, tends to grow code.
>
>Makes sense to me. 
>
>The run-time performance data for D and A are actually very similar as
>I posted in the previous email (I listed it here for convenience)
>
>Run-time performance overhead with A and D:
>
>benchmarks A / no  D /no
>
>500.perlbench_r1.25%   1.25%
>502.gcc_r  0.68%   1.80%
>505.mcf_r  0.68%   0.14%
>520.omnetpp_r  4.83%   4.68%
>523.xalancbmk_r0.18%   1.96%
>525.x264_r 1.55%   2.07%
>531.deepsjeng_r 11.57%  11.85%
>541.leela_r0.64%   0.80%
>557.xz_r -0.41% -0.41%
>
>507.cactuBSSN_r0.44%   0.44%
>508.namd_r 0.34%   0.34%
>510.parest_r   0.17%   0.25%
>511.povray_r   56.57%  57.27%
>519.lbm_r  0.00%   0.00%
>521.wrf_r   -0.28% -0.37%
>526.blender_r  16.96%  17.71%
>527.cam4_r 0.70%   0.53%
>538.imagick_r  2.40%   2.40%
>544.nab_r  0.00%   -0.65%
>
>avg5.17%   5.37%
>
>Especially for the SPEC FP benchmarks, I didn’t see much
>performance difference between A and D.
>I guess that the RTL optimizations might be enough to get rid of most
>of the overhead introduced by the additional initialization.
>
>> 
>>> stack usage data: I added -fstack-usage to the compilation line when
>>> compiling CPU2017 benchmarks, and all the *.su files were generated
>>> for each of the modules.
>>> Since there are a lot of such files, and the stack size information is
>>> embedded in each of them, I just picked one benchmark, 511.povray, to
>>> check, which is the one that has the most runtime overhead when adding
>>> initialization (both A and D).
>>> I identified all the *.su files that differ between A and D and did a
>>> diff on those *.su files; it looks like the stack size is much higher
>>> with D than with A, for example:
>>> $ diff build_base_auto_init.D./bbox.su build_base_auto_init.A./bbox.su
>>> 5c5
>>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>>> pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>>> ---
>>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>>> pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>>> $ diff 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Qing Zhao via Gcc-patches



> On Jan 15, 2021, at 2:11 AM, Richard Biener  wrote:
> 
> 
> 
> On Thu, 14 Jan 2021, Qing Zhao wrote:
> 
>> Hi, 
>> More data on code size and compilation time with CPU2017:
>> Compilation time data:   the numbers are the slowdown against the
>> default “no”:
>> benchmarks  A/no D/no
>> 
>> 500.perlbench_r 5.19% 1.95%
>> 502.gcc_r 0.46% -0.23%
>> 505.mcf_r 0.00% 0.00%
>> 520.omnetpp_r 0.85% 0.00%
>> 523.xalancbmk_r 0.79% -0.40%
>> 525.x264_r -4.48% 0.00%
>> 531.deepsjeng_r 16.67% 16.67%
>> 541.leela_r  0.00%  0.00%
>> 557.xz_r 0.00%  0.00%
>> 
>> 507.cactuBSSN_r 1.16% 0.58%
>> 508.namd_r 9.62% 8.65%
>> 510.parest_r 0.48% 1.19%
>> 511.povray_r 3.70% 3.70%
>> 519.lbm_r 0.00% 0.00%
>> 521.wrf_r 0.05% 0.02%
>> 526.blender_r 0.33% 1.32%
>> 527.cam4_r -0.93% -0.93%
>> 538.imagick_r 1.32% 3.95%
>> 544.nab_r  0.00% 0.00%
>> From the above data, it looks like the compilation time impact
>> from implementations A and D is almost the same.
>> ***code size data: the numbers are the code size increase against the
>> default “no”:
>> benchmarks A/no D/no
>> 
>> 500.perlbench_r 2.84% 0.34%
>> 502.gcc_r 2.59% 0.35%
>> 505.mcf_r 3.55% 0.39%
>> 520.omnetpp_r 0.54% 0.03%
>> 523.xalancbmk_r 0.36%  0.39%
>> 525.x264_r 1.39% 0.13%
>> 531.deepsjeng_r 2.15% -1.12%
>> 541.leela_r 0.50% -0.20%
>> 557.xz_r 0.31% 0.13%
>> 
>> 507.cactuBSSN_r 5.00% -0.01%
>> 508.namd_r 3.64% -0.07%
>> 510.parest_r 1.12% 0.33%
>> 511.povray_r 4.18% 1.16%
>> 519.lbm_r 8.83% 6.44%
>> 521.wrf_r 0.08% 0.02%
>> 526.blender_r 1.63% 0.45%
>> 527.cam4_r  0.16% 0.06%
>> 538.imagick_r 3.18% -0.80%
>> 544.nab_r 5.76% -1.11%
>> Avg 2.52% 0.36%
>> From the above data, implementation D is always better than A, which is
>> surprising to me; I am not sure what the reason for this is.
> 
> D probably inhibits most interesting loop transforms (check SPEC FP
> performance).

The call to .DEFERRED_INIT is marked as ECF_CONST:

/* A function to represent an artificial initialization to an uninitialized
   automatic variable. The first argument is the variable itself, the
   second argument is the initialization type.  */
DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)

So, I assume that such a const call should minimize the impact on loop 
optimizations. But yes, it will still inhibit some of the loop transformations.
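To make this concrete, the net effect of -ftrivial-auto-var-init=zero under approach D can be modeled in plain C: the .DEFERRED_INIT call emitted at gimplification is lowered at RTL expansion into a real whole-variable store, roughly the memset below. This is only a sketch of the observable behavior (the struct and function names are made up for illustration), not the actual lowering code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical aggregate standing in for any automatic variable.  */
struct state { double coord[3]; int flags; };

/* With -ftrivial-auto-var-init=zero, a declaration like "struct state s;"
   behaves as if the whole object were zeroed on entry to its scope; the
   memset models the store that the expanded .DEFERRED_INIT performs.  */
static struct state make_state(void)
{
    struct state s;
    memset(&s, 0, sizeof s);   /* stand-in for the expanded .DEFERRED_INIT */
    return s;
}
```

Because .DEFERRED_INIT is ECF_CONST, the middle-end optimizers see a call with no side effects rather than this store until expansion.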

>  It will also most definitely disallow SRA which, when
> an aggregate is not completely elided, tends to grow code.

Makes sense to me. 

The run-time performance data for D and A are actually very similar as I posted 
in the previous email (I listed it here for convenience)

Run-time performance overhead with A and D:

benchmarks  A / no  D /no

500.perlbench_r 1.25%   1.25%
502.gcc_r   0.68%   1.80%
505.mcf_r   0.68%   0.14%
520.omnetpp_r   4.83%   4.68%
523.xalancbmk_r 0.18%   1.96%
525.x264_r  1.55%   2.07%
531.deepsjeng_r 11.57%  11.85%
541.leela_r 0.64%   0.80%
557.xz_r -0.41% -0.41%

507.cactuBSSN_r 0.44%   0.44%
508.namd_r  0.34%   0.34%
510.parest_r0.17%   0.25%
511.povray_r56.57%  57.27%
519.lbm_r   0.00%   0.00%
521.wrf_r   -0.28% -0.37%
526.blender_r   16.96%  17.71%
527.cam4_r  0.70%   0.53%
538.imagick_r   2.40%   2.40%
544.nab_r   0.00%   -0.65%

avg 5.17%   5.37%

Especially for the SPEC FP benchmarks, I didn’t see much performance 
difference between A and D. 
I guess that the RTL optimizations might be enough to get rid of most of the 
overhead introduced by the additional initialization. 

> 
>> stack usage data, I added -fstack-usage to the compilation line when
>> compiling CPU2017 benchmarks. And all the *.su files were generated for each
>> of the modules.
>> Since there are a lot of such files, and the stack size information is
>> embedded in each of them, I just picked one benchmark, 511.povray, to check,
>> which is the one that has the most runtime overhead when adding
>> initialization (both A and D).
>> I identified all the *.su files that differ between A and D and did a
>> diff on those *.su files; it looks like the stack size is much higher
>> with D than with A, for example:
>> $ diff build_base_auto_init.D./bbox.su build_base_auto_init.A./bbox.su
>> 5c5
>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>> pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>> ---
>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>> pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>> $ diff build_base_auto_init.D./image.su
>> build_base_auto_init.A./image.su
>> 9c9
>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624
>> static
>> ---
>> > image.cpp:240:6:void pov::bump_map(double*, 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Richard Biener




On Thu, 14 Jan 2021, Qing Zhao wrote:


Hi, 
More data on code size and compilation time with CPU2017:

Compilation time data:   the numbers are the slowdown against the
default “no”:

benchmarks  A/no D/no
                        
500.perlbench_r 5.19% 1.95%
502.gcc_r 0.46% -0.23%
505.mcf_r 0.00% 0.00%
520.omnetpp_r 0.85% 0.00%
523.xalancbmk_r 0.79% -0.40%
525.x264_r -4.48% 0.00%
531.deepsjeng_r 16.67% 16.67%
541.leela_r  0.00%  0.00%
557.xz_r 0.00%  0.00%
                        
507.cactuBSSN_r 1.16% 0.58%
508.namd_r 9.62% 8.65%
510.parest_r 0.48% 1.19%
511.povray_r 3.70% 3.70%
519.lbm_r 0.00% 0.00%
521.wrf_r 0.05% 0.02%
526.blender_r 0.33% 1.32%
527.cam4_r -0.93% -0.93%
538.imagick_r 1.32% 3.95%
544.nab_r  0.00% 0.00%

From the above data, it looks like the compilation time impact
from implementations A and D is almost the same.
***code size data: the numbers are the code size increase against the
default “no”:
benchmarks A/no D/no
                        
500.perlbench_r 2.84% 0.34%
502.gcc_r 2.59% 0.35%
505.mcf_r 3.55% 0.39%
520.omnetpp_r 0.54% 0.03%
523.xalancbmk_r 0.36%  0.39%
525.x264_r 1.39% 0.13%
531.deepsjeng_r 2.15% -1.12%
541.leela_r 0.50% -0.20%
557.xz_r 0.31% 0.13%
                        
507.cactuBSSN_r 5.00% -0.01%
508.namd_r 3.64% -0.07%
510.parest_r 1.12% 0.33%
511.povray_r 4.18% 1.16%
519.lbm_r 8.83% 6.44%
521.wrf_r 0.08% 0.02%
526.blender_r 1.63% 0.45%
527.cam4_r  0.16% 0.06%
538.imagick_r 3.18% -0.80%
544.nab_r 5.76% -1.11%
Avg 2.52% 0.36%

From the above data, implementation D is always better than A, which is
surprising to me; I am not sure what the reason for this is.


D probably inhibits most interesting loop transforms (check SPEC FP
performance).  It will also most definitely disallow SRA which, when
an aggregate is not completely elided, tends to grow code.


stack usage data, I added -fstack-usage to the compilation line when
compiling CPU2017 benchmarks. And all the *.su files were generated for each
of the modules.
Since there are a lot of such files, and the stack size information is
embedded in each of them, I just picked one benchmark, 511.povray, to check,
which is the one that has the most runtime overhead when adding
initialization (both A and D).

I identified all the *.su files that differ between A and D and did a
diff on those *.su files; it looks like the stack size is much higher
with D than with A, for example:

$ diff build_base_auto_init.D./bbox.su build_base_auto_init.A./bbox.su
5c5
< bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
pov::BBOX_TREE**&, long int*, long int, long int) 160 static
---
> bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
pov::BBOX_TREE**&, long int*, long int, long int) 96 static

$ diff build_base_auto_init.D./image.su
build_base_auto_init.A./image.su
9c9
< image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624
static
---
> image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272
static
….
It looks like implementation D has more stack-size impact than A. 

Do you have any insight on what the reason for this?


D will keep all initialized aggregates as aggregates and live, which
means stack will be allocated for them.  With A the usual optimizations
to reduce stack usage can be applied.


Let me know if you have any comments and suggestions.


First of all I would check whether the prototype implementations
work as expected.

Richard.



thanks.

Qing
  On Jan 13, 2021, at 1:39 AM, Richard Biener 
  wrote:

  On Tue, 12 Jan 2021, Qing Zhao wrote:

Hi, 

Just check in to see whether you have any comments
and suggestions on this:

FYI, I have been continuing with the Approach D
implementation since last week:

D. Adding calls to .DEFERRED_INIT during
gimplification, expanding the .DEFERRED_INIT calls
during expand to real initialization, and adjusting
the uninitialized pass to handle the new refs with
“.DEFERRED_INIT”.

For the remaining work of Approach D:

** complete the implementation of
-ftrivial-auto-var-init=pattern;
** complete the implementation of uninitialized
warnings maintenance work for D. 

I have completed the uninitialized warnings
maintenance work for D.
And finished part of the
-ftrivial-auto-var-init=pattern implementation. 

The following are remaining work of Approach D:

  ** -ftrivial-auto-var-init=pattern for VLA;
  ** add a new attribute for variables:
__attribute__((uninitialized))
the marked variable is intentionally left
uninitialized for performance reasons.
  ** adding complete testing cases;
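The proposed attribute could look roughly like the sketch below. The __has_attribute guard is my addition so the example still compiles on compilers without support; the exact spelling and semantics are whatever the final patch defines.

```c
#include <assert.h>

/* Opt a single variable out of -ftrivial-auto-var-init when the
   programmer knows it is fully written before any read.  */
#if defined(__has_attribute)
# if __has_attribute(uninitialized)
#  define UNINIT __attribute__((uninitialized))
# endif
#endif
#ifndef UNINIT
# define UNINIT   /* attribute not supported: expands to nothing */
#endif

static int fill_and_sum(int n)          /* n must be <= 256 */
{
    int buf[256] UNINIT;                /* intentionally not auto-initialized */
    for (int i = 0; i < n; i++)
        buf[i] = i;                     /* fully written before any read */
    int total = 0;
    for (int i = 0; i < n; i++)
        total += buf[i];
    return total;
}
```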


Please let me know if you have any objection to my
current decision on implementing approach 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-14 Thread Qing Zhao via Gcc-patches
Hi, 

More data on code size and compilation time with CPU2017:

Compilation time data:   the numbers are the slowdown against the 
default “no”:

benchmarks   A/no   D/no

500.perlbench_r 5.19%   1.95%
502.gcc_r   0.46%   -0.23%
505.mcf_r   0.00%   0.00%
520.omnetpp_r   0.85%   0.00%
523.xalancbmk_r 0.79%   -0.40%
525.x264_r  -4.48%  0.00%
531.deepsjeng_r 16.67%  16.67%
541.leela_r  0.00%   0.00%
557.xz_r    0.00%    0.00%

507.cactuBSSN_r 1.16%   0.58%
508.namd_r  9.62%   8.65%
510.parest_r0.48%   1.19%
511.povray_r3.70%   3.70%
519.lbm_r   0.00%   0.00%
521.wrf_r   0.05%   0.02%
526.blender_r   0.33%   1.32%
527.cam4_r  -0.93%  -0.93%
538.imagick_r   1.32%   3.95%
544.nab_r   0.00%   0.00%

From the above data, it looks like the compilation time impact from 
implementations A and D is almost the same.

***code size data: the numbers are the code size increase against the 
default “no”:
benchmarks  A/no    D/no

500.perlbench_r 2.84%   0.34%
502.gcc_r   2.59%   0.35%
505.mcf_r   3.55%   0.39%
520.omnetpp_r   0.54%   0.03%
523.xalancbmk_r 0.36%   0.39%
525.x264_r  1.39%   0.13%
531.deepsjeng_r 2.15%   -1.12%
541.leela_r 0.50%   -0.20%
557.xz_r0.31%   0.13%

507.cactuBSSN_r 5.00%   -0.01%
508.namd_r  3.64%   -0.07%
510.parest_r1.12%   0.33%
511.povray_r4.18%   1.16%
519.lbm_r   8.83%   6.44%
521.wrf_r   0.08%   0.02%
526.blender_r   1.63%   0.45%
527.cam4_r   0.16%  0.06%
538.imagick_r   3.18%   -0.80%
544.nab_r   5.76%   -1.11%
Avg 2.52%   0.36%

From the above data, implementation D is always better than A, which is 
surprising to me; I am not sure what the reason for this is.

stack usage data: I added -fstack-usage to the compilation line when 
compiling CPU2017 benchmarks, and all the *.su files were generated for each of 
the modules.
Since there are a lot of such files, and the stack size information is embedded 
in each of them, I just picked one benchmark, 511.povray, to check, which 
is the one that has the most runtime overhead when adding initialization 
(both A and D). 

I identified all the *.su files that differ between A and D and did a 
diff on those *.su files; it looks like the stack size is much higher 
with D than with A, for example:

$ diff build_base_auto_init.D./bbox.su build_base_auto_init.A./bbox.su
5c5
< bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, 
long int*, long int, long int)  160 static
---
> bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, 
> long int*, long int, long int)  96  static

$ diff build_base_auto_init.D./image.su build_base_auto_init.A./image.su
9c9
< image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)   624 
static
---
> image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)   272 
> static
….
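Each .su record compared above has a fixed format, so the per-function comparison can be automated. The helper below is hypothetical (written for this thread, not part of GCC) and assumes GCC's tab-separated -fstack-usage output:

```c
#include <assert.h>
#include <stdio.h>

/* A -fstack-usage record looks like
     <file>:<line>:<col>:<function>\t<bytes>\t<qualifier>
   Extract the byte count so stack growth per function can be diffed
   between two builds.  Returns -1 on malformed input.  */
static long su_bytes(const char *record)
{
    long bytes;
    /* skip the location/function field (everything before the first tab),
       then read the stack size that follows  */
    if (sscanf(record, "%*[^\t]\t%ld", &bytes) != 1)
        return -1;
    return bytes;
}
```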

It looks like implementation D has more stack-size impact than A. 

Do you have any insight on what the reason for this?

Let me know if you have any comments and suggestions.

thanks.

Qing
> On Jan 13, 2021, at 1:39 AM, Richard Biener  wrote:
> 
> On Tue, 12 Jan 2021, Qing Zhao wrote:
> 
>> Hi, 
>> 
>> Just check in to see whether you have any comments and suggestions on this:
>> 
>> FYI, I have been continuing with the Approach D implementation since last week:
>> 
>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
>> .DEFERRED_INIT calls during expand to real initialization, and adjusting 
>> the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
>> 
>> For the remaining work of Approach D:
>> 
>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>> ** complete the implementation of uninitialized warnings maintenance work 
>> for D. 
>> 
>> I have completed the uninitialized warnings maintenance work for D.
>> And finished part of the -ftrivial-auto-var-init=pattern implementation. 
>> 
>> The following are remaining work of Approach D:
>> 
>>   ** -ftrivial-auto-var-init=pattern for VLA;
>>   ** add a new attribute for variables:
>> __attribute__((uninitialized))
>> the marked variable is intentionally left uninitialized for performance reasons.
>>   ** adding complete testing cases;
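For reference, the pattern mode still being implemented can be modeled as a simple byte fill. This is a sketch: 0xFE is only an illustrative pattern byte, not necessarily what the final implementation emits.

```c
#include <assert.h>
#include <string.h>

/* -ftrivial-auto-var-init=pattern fills automatic variables with a
   repeated non-zero byte so that accidental use of an "uninitialized"
   value is likely to fault or misbehave loudly instead of silently
   reading zeros.  */
static void pattern_init(void *p, size_t len)
{
    memset(p, 0xFE, len);   /* 0xFE: illustrative pattern byte */
}
```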
>> 
>> 
>> Please let me know if you have any objection to my current decision on 
>> implementing approach D. 
> 
> Did you do any analysis on how stack usage and code size are changed 
> with approach D?  How does compile-time behave (we could gobble up
> lots of .DEFERRED_INIT calls I guess)?
> 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-13 Thread Richard Biener
On Wed, 13 Jan 2021, Qing Zhao wrote:

> 
> 
> > On Jan 13, 2021, at 9:10 AM, Richard Biener  wrote:
> > 
> > On Wed, 13 Jan 2021, Qing Zhao wrote:
> > 
> >> 
> >> 
> >>> On Jan 13, 2021, at 1:39 AM, Richard Biener  wrote:
> >>> 
> >>> On Tue, 12 Jan 2021, Qing Zhao wrote:
> >>> 
>  Hi, 
>  
>  Just check in to see whether you have any comments and suggestions on 
>  this:
>  
>  FYI, I have been continuing with the Approach D implementation since last week:
>  
>  D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
>  .DEFERRED_INIT calls during expand to real initialization, and adjusting 
>  the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
>  
>  For the remaining work of Approach D:
>  
>  ** complete the implementation of -ftrivial-auto-var-init=pattern;
>  ** complete the implementation of uninitialized warnings maintenance 
>  work for D. 
>  
>  I have completed the uninitialized warnings maintenance work for D.
>  And finished part of the -ftrivial-auto-var-init=pattern 
>  implementation. 
>  
>  The following are remaining work of Approach D:
>  
>   ** -ftrivial-auto-var-init=pattern for VLA;
>   ** add a new attribute for variables:
>  __attribute__((uninitialized))
>  the marked variable is intentionally left uninitialized for performance 
>  reasons.
>   ** adding complete testing cases;
>  
>  
>  Please let me know if you have any objection to my current decision on 
>  implementing approach D. 
> >>> 
> >>> Did you do any analysis on how stack usage and code size are changed 
> >>> with approach D?
> >> 
> >> I did the code size change comparison (I will provide the data in another 
> >> email). And with this data, D works better than A in general. (This is 
> >> a surprise to me, actually.)
> >> 
> >> But not the stack usage.  Not sure how to collect the stack usage data, 
> >> do you have any suggestion on this?
> > 
> > There is -fstack-usage you could use, then of course watching
> > the stack segment at runtime.
> 
> I can do this for CPU2017 to collect the stack usage data and report back.
> 
> >  I'm mostly concerned about
> > stack-limited "processes" such as the linux kernel which I think
> > is a primary target of your work.
> 
> I don’t have any experience building the Linux kernel. 
> Do we have to collect data for the Linux kernel at this time? Is CPU2017 data 
> not enough?

Well, it depends on the desired target.  The Linux kernel has an
8 KB hard stack limit for kernel threads on x86_64 (IIRC).  You
don't have to do anything, it was just a suggestion.  For normal
program stack usage is probably the least important problem.

Richard.

> Qing
> > 
> > Richard.
> > 
> >> 
> >>> How does compile-time behave (we could gobble up
> >>> lots of .DEFERRED_INIT calls I guess)?
> >> I can collect this data too and report it later.
> >> 
> >> Thanks.
> >> 
> >> Qing
> >>> 
> >>> Richard.
> >>> 
>  Thanks a lot for your help.
>  
>  Qing
>  
>  
> > On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches 
> >  wrote:
> > 
> > Hi,
> > 
> > This is an update for our previous discussion. 
> > 
> > 1. I implemented the following two different implementations in the 
> > latest upstream gcc:
> > 
> > A. Adding real initialization during gimplification, without maintaining 
> > the uninitialized warnings.
> > 
> > D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
> > .DEFERRED_INIT calls during expand to real initialization, and adjusting 
> > the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
> > 
> > Note, in this initial implementation,
> > ** I ONLY implement -ftrivial-auto-var-init=zero, the 
> > implementation of -ftrivial-auto-var-init=pattern 
> > is not done yet.  Therefore, the performance data is only 
> > about -ftrivial-auto-var-init=zero. 
> > 
> > ** I added a temporary option 
> > -fauto-var-init-approach=A|B|C|D  to choose implementation A or D for 
> > runtime performance study.
> > ** I didn’t finish the uninitialized warnings maintenance work 
> > for D. (That might take more time than I expected). 
> > 
> > 2. I collected runtime data for CPU2017 on a x86 machine with this new 
> > gcc for the following 3 cases:
> > 
> > no: default. (-g -O2 -march=native )
> > A:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=A 
> > D:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=D 
> > 
> > And then compute the slowdown data for both A and D as following:
> > 
> > benchmarks  A / no  D /no
> > 
> > 500.perlbench_r 1.25%   1.25%
> > 502.gcc_r   0.68%   1.80%
> > 505.mcf_r   0.68%   0.14%
> > 520.omnetpp_r   4.83%   4.68%
> 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-13 Thread Qing Zhao via Gcc-patches



> On Jan 13, 2021, at 9:10 AM, Richard Biener  wrote:
> 
> On Wed, 13 Jan 2021, Qing Zhao wrote:
> 
>> 
>> 
>>> On Jan 13, 2021, at 1:39 AM, Richard Biener  wrote:
>>> 
>>> On Tue, 12 Jan 2021, Qing Zhao wrote:
>>> 
 Hi, 
 
 Just check in to see whether you have any comments and suggestions on this:
 
 FYI, I have been continuing with the Approach D implementation since last week:
 
 D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
 .DEFERRED_INIT calls during expand to real initialization, and adjusting 
 the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
 
 For the remaining work of Approach D:
 
 ** complete the implementation of -ftrivial-auto-var-init=pattern;
 ** complete the implementation of uninitialized warnings maintenance work 
 for D. 
 
 I have completed the uninitialized warnings maintenance work for D.
 And finished part of the -ftrivial-auto-var-init=pattern 
 implementation. 
 
 The following are remaining work of Approach D:
 
  ** -ftrivial-auto-var-init=pattern for VLA;
  ** add a new attribute for variables:
 __attribute__((uninitialized))
 the marked variable is intentionally left uninitialized for performance reasons.
  ** adding complete testing cases;
 
 
 Please let me know if you have any objection to my current decision on 
 implementing approach D. 
>>> 
>>> Did you do any analysis on how stack usage and code size are changed 
>>> with approach D?
>> 
>> I did the code size change comparison (I will provide the data in another 
>> email). And with this data, D works better than A in general. (This is 
>> a surprise to me, actually.)
>> 
>> But not the stack usage.  Not sure how to collect the stack usage data, 
>> do you have any suggestion on this?
> 
> There is -fstack-usage you could use, then of course watching
> the stack segment at runtime.

I can do this for CPU2017 to collect the stack usage data and report back.

>  I'm mostly concerned about
> stack-limited "processes" such as the linux kernel which I think
> is a primary target of your work.

I don’t have any experience building the Linux kernel. 
Do we have to collect data for the Linux kernel at this time? Is CPU2017 data 
not enough?

Qing
> 
> Richard.
> 
>> 
>>> How does compile-time behave (we could gobble up
>>> lots of .DEFERRED_INIT calls I guess)?
>> I can collect this data too and report it later.
>> 
>> Thanks.
>> 
>> Qing
>>> 
>>> Richard.
>>> 
 Thanks a lot for your help.
 
 Qing
 
 
> On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> Hi,
> 
> This is an update for our previous discussion. 
> 
> 1. I implemented the following two different implementations in the 
> latest upstream gcc:
> 
> A. Adding real initialization during gimplification, without maintaining 
> the uninitialized warnings.
> 
> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
> .DEFERRED_INIT calls during expand to real initialization, and adjusting 
> the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
> 
> Note, in this initial implementation,
>   ** I ONLY implement -ftrivial-auto-var-init=zero, the implementation of 
> -ftrivial-auto-var-init=pattern 
>  is not done yet.  Therefore, the performance data is only about 
> -ftrivial-auto-var-init=zero. 
> 
>   ** I added a temporary option -fauto-var-init-approach=A|B|C|D  to 
> choose implementation A or D for 
>  runtime performance study.
>   ** I didn’t finish the uninitialized warnings maintenance work for D. 
> (That might take more time than I expected). 
> 
> 2. I collected runtime data for CPU2017 on a x86 machine with this new 
> gcc for the following 3 cases:
> 
> no: default. (-g -O2 -march=native )
> A:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=A 
> D:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=D 
> 
> And then compute the slowdown data for both A and D as following:
> 
> benchmarksA / no  D /no
> 
> 500.perlbench_r   1.25%   1.25%
> 502.gcc_r 0.68%   1.80%
> 505.mcf_r 0.68%   0.14%
> 520.omnetpp_r 4.83%   4.68%
> 523.xalancbmk_r   0.18%   1.96%
> 525.x264_r1.55%   2.07%
> 531.deepsjeng_r   11.57%  11.85%
> 541.leela_r   0.64%   0.80%
> 557.xz_r  -0.41% -0.41%
> 
> 507.cactuBSSN_r   0.44%   0.44%
> 508.namd_r0.34%   0.34%
> 510.parest_r  0.17%   0.25%
> 511.povray_r  56.57%  57.27%
> 519.lbm_r 0.00%   0.00%
> 521.wrf_r  -0.28% -0.37%
> 526.blender_r 16.96%  17.71%
> 527.cam4_r0.70%   0.53%
> 538.imagick_r 2.40%   

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-13 Thread Richard Biener
On Wed, 13 Jan 2021, Qing Zhao wrote:

> 
> 
> > On Jan 13, 2021, at 1:39 AM, Richard Biener  wrote:
> > 
> > On Tue, 12 Jan 2021, Qing Zhao wrote:
> > 
> >> Hi, 
> >> 
> >> Just check in to see whether you have any comments and suggestions on this:
> >> 
> >> FYI, I have been continuing with the Approach D implementation since last week:
> >> 
> >> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
> >> .DEFERRED_INIT calls during expand to real initialization, and adjusting 
> >> the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
> >> 
> >> For the remaining work of Approach D:
> >> 
> >> ** complete the implementation of -ftrivial-auto-var-init=pattern;
> >> ** complete the implementation of uninitialized warnings maintenance work 
> >> for D. 
> >> 
> >> I have completed the uninitialized warnings maintenance work for D.
> >> And finished part of the -ftrivial-auto-var-init=pattern 
> >> implementation. 
> >> 
> >> The following are remaining work of Approach D:
> >> 
> >>   ** -ftrivial-auto-var-init=pattern for VLA;
> >>   **add a new attribute for variable:
> >> __attribute((uninitialized)
> >> the marked variable is uninitialized intentionaly for performance purpose.
> >>   ** adding complete testing cases;
> >> 
> >> 
> >> Please let me know if you have any objection to my current decision on 
> >> implementing approach D. 
> > 
> > Did you do any analysis on how stack usage and code size are changed 
> > with approach D?
> 
> I did the code size change comparison (I will provide the data in another 
> email). And with this data, D works better than A in general. (This is 
> a surprise to me, actually.)
> 
> But not the stack usage.  Not sure how to collect the stack usage data, 
> do you have any suggestion on this?

There is -fstack-usage you could use, then of course watching
the stack segment at runtime.  I'm mostly concerned about
stack-limited "processes" such as the linux kernel which I think
is a primary target of your work.

Richard.

> 
> > How does compile-time behave (we could gobble up
> > lots of .DEFERRED_INIT calls I guess)?
> I can collect this data too and report it later.
> 
> Thanks.
> 
> Qing
> > 
> > Richard.
> > 
> >> Thanks a lot for your help.
> >> 
> >> Qing
> >> 
> >> 
> >>> On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches 
> >>>  wrote:
> >>> 
> >>> Hi,
> >>> 
> >>> This is an update for our previous discussion. 
> >>> 
> >>> 1. I implemented the following two different implementations in the 
> >>> latest upstream gcc:
> >>> 
> >>> A. Adding real initialization during gimplification, without maintaining 
> >>> the uninitialized warnings.
> >>> 
> >>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
> >>> .DEFERRED_INIT calls during expand to real initialization, and adjusting 
> >>> the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
> >>> 
> >>> Note, in this initial implementation,
> >>>   ** I ONLY implement -ftrivial-auto-var-init=zero, the implementation of 
> >>> -ftrivial-auto-var-init=pattern 
> >>>  is not done yet.  Therefore, the performance data is only about 
> >>> -ftrivial-auto-var-init=zero. 
> >>> 
> >>>   ** I added a temporary option -fauto-var-init-approach=A|B|C|D  to 
> >>> choose implementation A or D for 
> >>>  runtime performance study.
> >>>   ** I didn’t finish the uninitialized warnings maintenance work for D. 
> >>> (That might take more time than I expected). 
> >>> 
> >>> 2. I collected runtime data for CPU2017 on a x86 machine with this new 
> >>> gcc for the following 3 cases:
> >>> 
> >>> no: default. (-g -O2 -march=native )
> >>> A:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=A 
> >>> D:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=D 
> >>> 
> >>> And then compute the slowdown data for both A and D as following:
> >>> 
> >>> benchmarksA / no  D /no
> >>> 
> >>> 500.perlbench_r   1.25%   1.25%
> >>> 502.gcc_r 0.68%   1.80%
> >>> 505.mcf_r 0.68%   0.14%
> >>> 520.omnetpp_r 4.83%   4.68%
> >>> 523.xalancbmk_r   0.18%   1.96%
> >>> 525.x264_r1.55%   2.07%
> >>> 531.deepsjeng_r   11.57%  11.85%
> >>> 541.leela_r   0.64%   0.80%
> >>> 557.xz_r  -0.41% -0.41%
> >>> 
> >>> 507.cactuBSSN_r   0.44%   0.44%
> >>> 508.namd_r0.34%   0.34%
> >>> 510.parest_r  0.17%   0.25%
> >>> 511.povray_r  56.57%  57.27%
> >>> 519.lbm_r 0.00%   0.00%
> >>> 521.wrf_r  -0.28% -0.37%
> >>> 526.blender_r 16.96%  17.71%
> >>> 527.cam4_r0.70%   0.53%
> >>> 538.imagick_r 2.40%   2.40%
> >>> 544.nab_r 0.00%   -0.65%
> >>> 
> >>> avg   5.17%   5.37%
> >>> 
> >>> From the above data, we can see that in general, the runtime performance 
> >>> slowdown for 
> >>> implementation A and D are similar for individual benchmarks.
> >>> 
> >>> There are several 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-13 Thread Qing Zhao via Gcc-patches



> On Jan 13, 2021, at 1:39 AM, Richard Biener  wrote:
> 
> On Tue, 12 Jan 2021, Qing Zhao wrote:
> 
>> Hi, 
>> 
>> Just check in to see whether you have any comments and suggestions on this:
>> 
>> FYI, I have been continuing with the Approach D implementation since last week:
>> 
>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the 
>> .DEFERRED_INIT calls during expand to real initialization, and adjusting 
>> the uninitialized pass to handle the new refs with “.DEFERRED_INIT”.
>> 
>> For the remaining work of Approach D:
>> 
>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>> ** complete the implementation of uninitialized warnings maintenance work 
>> for D. 
>> 
>> I have completed the uninitialized warnings maintenance work for D.
>> And finished part of the -ftrivial-auto-var-init=pattern implementation. 
>> 
>> The following are remaining work of Approach D:
>> 
>>   ** -ftrivial-auto-var-init=pattern for VLA;
>>   ** add a new attribute for variables:
>> __attribute__((uninitialized))
>> to mark a variable as intentionally uninitialized for performance reasons.
>>   ** adding complete test cases;
>> 
>> 
>> Please let me know if you have any objection to my current decision to
>> implement approach D.
> 
> Did you do any analysis on how stack usage and code size are changed 
> with approach D?

I did the code-size comparison (I will provide the data in another email),
and by that measure D works better than A in general.  (This is actually a
surprise to me.)

But I have not measured stack usage.  I am not sure how to collect stack
usage data; do you have any suggestions?


> How does compile-time behave (we could gobble up
> lots of .DEFERRED_INIT calls I guess)?
I can collect this data too and report it later.

Thanks.

Qing
> 
> Richard.
> 
>> Thanks a lot for your help.
>> 
>> Qing
>> 
>> 
>>> On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches 
>>>  wrote:
>>> 
>>> Hi,
>>> 
>>> This is an update for our previous discussion. 
>>> 
>>> 1. I implemented the following two different implementations in the latest 
>>> upstream gcc:
>>> 
>>> A. Add real initializations during gimplification, without maintaining
>>> the uninitialized warnings.
>>> 
>>> D. Add calls to .DEFERRED_INIT during gimplification, then expand the
>>> .DEFERRED_INIT calls into real initializations during expand.  Adjust the
>>> uninitialized pass to handle the new refs to “.DEFERRED_INIT”.
>>> 
>>> Note, in this initial implementation,
>>> ** I ONLY implemented -ftrivial-auto-var-init=zero; the implementation of
>>> -ftrivial-auto-var-init=pattern is not done yet.  Therefore, the
>>> performance data is only for -ftrivial-auto-var-init=zero.
>>> 
>>> ** I added a temporary option -fauto-var-init-approach=A|B|C|D to choose
>>> implementation A or D for the runtime performance study.
>>> ** I didn’t finish the uninitialized warnings maintenance work for D.
>>> (That might take more time than I expected.)
>>> 
>>> 2. I collected runtime data for CPU2017 on a x86 machine with this new gcc 
>>> for the following 3 cases:
>>> 
>>> no: default. (-g -O2 -march=native )
>>> A:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=A 
>>> D:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=D 
>>> 
>>> And then compute the slowdown data for both A and D as follows:
>>> 
>>> benchmarks         A / no   D / no
>>> 
>>> 500.perlbench_r     1.25%    1.25%
>>> 502.gcc_r           0.68%    1.80%
>>> 505.mcf_r           0.68%    0.14%
>>> 520.omnetpp_r       4.83%    4.68%
>>> 523.xalancbmk_r     0.18%    1.96%
>>> 525.x264_r          1.55%    2.07%
>>> 531.deepsjeng_r    11.57%   11.85%
>>> 541.leela_r         0.64%    0.80%
>>> 557.xz_r           -0.41%   -0.41%
>>> 
>>> 507.cactuBSSN_r     0.44%    0.44%
>>> 508.namd_r          0.34%    0.34%
>>> 510.parest_r        0.17%    0.25%
>>> 511.povray_r       56.57%   57.27%
>>> 519.lbm_r           0.00%    0.00%
>>> 521.wrf_r          -0.28%   -0.37%
>>> 526.blender_r      16.96%   17.71%
>>> 527.cam4_r          0.70%    0.53%
>>> 538.imagick_r       2.40%    2.40%
>>> 544.nab_r           0.00%   -0.65%
>>> 
>>> avg                 5.17%    5.37%
>>> 
>>> From the above data, we can see that in general, the runtime performance
>>> slowdowns for implementations A and D are similar for individual benchmarks.
>>> 
>>> There are several benchmarks that have significant slowdown with the newly
>>> added initialization for both A and D, for example 511.povray_r,
>>> 526.blender_r, and 531.deepsjeng_r.  I will try to study a little more
>>> what kind of new initializations introduced such slowdown.
>>> 
>>> From the current study so far, I think that approach D should be good 
>>> enough for our final implementation. 
>>> So, I will try to finish approach D with the following remaining work
>>> 
>>> ** complete the implementation of 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-12 Thread Richard Biener
On Tue, 12 Jan 2021, Qing Zhao wrote:

> Hi, 
> 
> Just checking in to see whether you have any comments or suggestions on this:
> 
> FYI, I have been continuing with the Approach D implementation since last week:
> 
> D. Add calls to .DEFERRED_INIT during gimplification, then expand the
> .DEFERRED_INIT calls into real initializations during expand.  Adjust the
> uninitialized pass to handle the new refs to “.DEFERRED_INIT”.
> 
> For the remaining work of Approach D:
> 
>  ** complete the implementation of -ftrivial-auto-var-init=pattern;
>  ** complete the implementation of uninitialized warnings maintenance work 
> for D. 
> 
> I have completed the uninitialized warnings maintenance work for D,
> and partially finished the -ftrivial-auto-var-init=pattern implementation.
> 
> The following are remaining work of Approach D:
> 
>  ** -ftrivial-auto-var-init=pattern for VLA;
>  ** add a new attribute for variables:
> __attribute__((uninitialized))
> to mark a variable as intentionally uninitialized for performance reasons.
>  ** adding complete test cases;
>   
> 
> Please let me know if you have any objection to my current decision to
> implement approach D.

Did you do any analysis on how stack usage and code size are changed 
with approach D?  How does compile-time behave (we could gobble up
lots of .DEFERRED_INIT calls I guess)?

Richard.

> Thanks a lot for your help.
> 
> Qing
> 
> 
> > On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches 
> >  wrote:
> > 
> > Hi,
> > 
> > This is an update for our previous discussion. 
> > 
> > 1. I implemented the following two different implementations in the latest 
> > upstream gcc:
> > 
> > A. Add real initializations during gimplification, without maintaining
> > the uninitialized warnings.
> > 
> > D. Add calls to .DEFERRED_INIT during gimplification, then expand the
> > .DEFERRED_INIT calls into real initializations during expand.  Adjust the
> > uninitialized pass to handle the new refs to “.DEFERRED_INIT”.
> > 
> > Note, in this initial implementation,
> > ** I ONLY implement -ftrivial-auto-var-init=zero, the implementation of 
> > -ftrivial-auto-var-init=pattern 
> >is not done yet.  Therefore, the performance data is only about 
> > -ftrivial-auto-var-init=zero. 
> > 
> > ** I added a temporary option -fauto-var-init-approach=A|B|C|D to choose
> > implementation A or D for the runtime performance study.
> > ** I didn’t finish the uninitialized warnings maintenance work for D.
> > (That might take more time than I expected.)
> > 
> > 2. I collected runtime data for CPU2017 on a x86 machine with this new gcc 
> > for the following 3 cases:
> > 
> > no: default. (-g -O2 -march=native )
> > A:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=A 
> > D:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=D 
> > 
> > And then compute the slowdown data for both A and D as follows:
> > 
> > benchmarks         A / no   D / no
> > 
> > 500.perlbench_r     1.25%    1.25%
> > 502.gcc_r           0.68%    1.80%
> > 505.mcf_r           0.68%    0.14%
> > 520.omnetpp_r       4.83%    4.68%
> > 523.xalancbmk_r     0.18%    1.96%
> > 525.x264_r          1.55%    2.07%
> > 531.deepsjeng_r    11.57%   11.85%
> > 541.leela_r         0.64%    0.80%
> > 557.xz_r           -0.41%   -0.41%
> > 
> > 507.cactuBSSN_r     0.44%    0.44%
> > 508.namd_r          0.34%    0.34%
> > 510.parest_r        0.17%    0.25%
> > 511.povray_r       56.57%   57.27%
> > 519.lbm_r           0.00%    0.00%
> > 521.wrf_r          -0.28%   -0.37%
> > 526.blender_r      16.96%   17.71%
> > 527.cam4_r          0.70%    0.53%
> > 538.imagick_r       2.40%    2.40%
> > 544.nab_r           0.00%   -0.65%
> > 
> > avg                 5.17%    5.37%
> > 
> > From the above data, we can see that in general, the runtime performance
> > slowdowns for implementations A and D are similar for individual benchmarks.
> > 
> > There are several benchmarks that have significant slowdown with the newly
> > added initialization for both A and D, for example 511.povray_r,
> > 526.blender_r, and 531.deepsjeng_r.  I will try to study a little more
> > what kind of new initializations introduced such slowdown.
> > 
> > From the current study so far, I think that approach D should be good 
> > enough for our final implementation. 
> > So, I will try to finish approach D with the following remaining work
> > 
> >  ** complete the implementation of -ftrivial-auto-var-init=pattern;
> >  ** complete the implementation of uninitialized warnings maintenance 
> > work for D. 
> > 
> > 
> > Let me know if you have any comments and suggestions on my current and 
> > future work.
> > 
> > Thanks a lot for your help.
> > 
> > Qing
> > 
> >> On Dec 9, 2020, at 10:18 AM, Qing Zhao via Gcc-patches 
> >>  wrote:
> >> 
> >> The following are the approaches I will implement and compare:
> >> 
> >> Our final goal is to keep the 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-12 Thread Qing Zhao via Gcc-patches
Hi, 

Just checking in to see whether you have any comments or suggestions on this:

FYI, I have been continuing with the Approach D implementation since last week:

D. Add calls to .DEFERRED_INIT during gimplification, then expand the
.DEFERRED_INIT calls into real initializations during expand.  Adjust the
uninitialized pass to handle the new refs to “.DEFERRED_INIT”.

For the remaining work of Approach D:

 ** complete the implementation of -ftrivial-auto-var-init=pattern;
 ** complete the uninitialized warnings maintenance work for D.

I have completed the uninitialized warnings maintenance work for D,
and partially finished the -ftrivial-auto-var-init=pattern implementation.

The following are remaining work of Approach D:

   ** -ftrivial-auto-var-init=pattern for VLA;
   ** add a new attribute for variables:
__attribute__((uninitialized))
to mark a variable as intentionally uninitialized for performance reasons.
   ** adding complete test cases;
  

Please let me know if you have any objection to my current decision to
implement approach D.

Thanks a lot for your help.

Qing


> On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> Hi,
> 
> This is an update for our previous discussion. 
> 
> 1. I implemented the following two different implementations in the latest 
> upstream gcc:
> 
> A. Add real initializations during gimplification, without maintaining the
> uninitialized warnings.
> 
> D. Add calls to .DEFERRED_INIT during gimplification, then expand the
> .DEFERRED_INIT calls into real initializations during expand.  Adjust the
> uninitialized pass to handle the new refs to “.DEFERRED_INIT”.
> 
> Note, in this initial implementation,
>   ** I ONLY implement -ftrivial-auto-var-init=zero, the implementation of 
> -ftrivial-auto-var-init=pattern 
>  is not done yet.  Therefore, the performance data is only about 
> -ftrivial-auto-var-init=zero. 
> 
>   ** I added a temporary option -fauto-var-init-approach=A|B|C|D to choose
> implementation A or D for the runtime performance study.
>   ** I didn’t finish the uninitialized warnings maintenance work for D.
> (That might take more time than I expected.)
> 
> 2. I collected runtime data for CPU2017 on a x86 machine with this new gcc 
> for the following 3 cases:
> 
> no: default. (-g -O2 -march=native )
> A:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=A 
> D:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=D 
> 
> And then compute the slowdown data for both A and D as follows:
> 
> benchmarks         A / no   D / no
> 
> 500.perlbench_r     1.25%    1.25%
> 502.gcc_r           0.68%    1.80%
> 505.mcf_r           0.68%    0.14%
> 520.omnetpp_r       4.83%    4.68%
> 523.xalancbmk_r     0.18%    1.96%
> 525.x264_r          1.55%    2.07%
> 531.deepsjeng_r    11.57%   11.85%
> 541.leela_r         0.64%    0.80%
> 557.xz_r           -0.41%   -0.41%
> 
> 507.cactuBSSN_r     0.44%    0.44%
> 508.namd_r          0.34%    0.34%
> 510.parest_r        0.17%    0.25%
> 511.povray_r       56.57%   57.27%
> 519.lbm_r           0.00%    0.00%
> 521.wrf_r          -0.28%   -0.37%
> 526.blender_r      16.96%   17.71%
> 527.cam4_r          0.70%    0.53%
> 538.imagick_r       2.40%    2.40%
> 544.nab_r           0.00%   -0.65%
> 
> avg                 5.17%    5.37%
> 
> From the above data, we can see that in general, the runtime performance
> slowdowns for implementations A and D are similar for individual benchmarks.
> 
> There are several benchmarks that have significant slowdown with the newly
> added initialization for both A and D, for example 511.povray_r,
> 526.blender_r, and 531.deepsjeng_r.  I will try to study a little more
> what kind of new initializations introduced such slowdown.
> 
> From the current study so far, I think that approach D should be good enough 
> for our final implementation. 
> So, I will try to finish approach D with the following remaining work
> 
>  ** complete the implementation of -ftrivial-auto-var-init=pattern;
>  ** complete the implementation of uninitialized warnings maintenance 
> work for D. 
> 
> 
> Let me know if you have any comments and suggestions on my current and future 
> work.
> 
> Thanks a lot for your help.
> 
> Qing
> 
>> On Dec 9, 2020, at 10:18 AM, Qing Zhao via Gcc-patches 
>>  wrote:
>> 
>> The following are the approaches I will implement and compare:
>> 
>> Our final goal is to keep the uninitialized warning and minimize the 
>> run-time performance cost.
>> 
>> A. Add real initializations during gimplification, without maintaining
>> the uninitialized warnings.
>> B. Add real initializations during gimplification, marking them with
>> “artificial_init”.  Adjust the uninitialized pass to maintain the
>> annotation, making sure the real init is not deleted from the fake init.
>> C.  Marking the DECL for an uninitialized auto variable as 
>> 

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-05 Thread Qing Zhao via Gcc-patches
I am attaching my current (incomplete) patch to gcc for your reference.

From a71eb73bee5857440c4ff67c4c82be115e0675cb Mon Sep 17 00:00:00 2001
From: qing zhao 
Date: Sat, 12 Dec 2020 00:02:28 +0100
Subject: [PATCH] First version of -ftrivial-auto-var-init

---
 gcc/common.opt| 35 ++
 gcc/flag-types.h  | 14 
 gcc/gimple-pretty-print.c |  2 +-
 gcc/gimplify.c| 90 +++
 gcc/internal-fn.c | 20 +++
 gcc/internal-fn.def   |  5 +++
 gcc/tree-cfg.c|  3 ++
 gcc/tree-ssa-uninit.c |  3 ++
 gcc/tree-ssa.c|  5 +++
 9 files changed, 176 insertions(+), 1 deletion(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 6645539f5e5..c4c4fc28ef7 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3053,6 +3053,41 @@ ftree-scev-cprop
 Common Report Var(flag_tree_scev_cprop) Init(1) Optimization
 Enable copy propagation of scalar-evolution information.
 
+ftrivial-auto-var-init=
+Common Joined RejectNegative Enum(auto_init_type) Var(flag_trivial_auto_var_init) Init(AUTO_INIT_UNINITIALIZED)
+-ftrivial-auto-var-init=[uninitialized|pattern|zero]	Add initializations to automatic variables.
+
+Enum
+Name(auto_init_type) Type(enum auto_init_type) UnknownError(unrecognized automatic variable initialization type %qs)
+
+EnumValue
+Enum(auto_init_type) String(uninitialized) Value(AUTO_INIT_UNINITIALIZED)
+
+EnumValue
+Enum(auto_init_type) String(pattern) Value(AUTO_INIT_PATTERN)
+
+EnumValue
+Enum(auto_init_type) String(zero) Value(AUTO_INIT_ZERO)
+
+fauto-var-init-approach=
+Common Joined RejectNegative Enum(auto_init_approach) Var(flag_auto_init_approach) Init(AUTO_INIT_A)
+-fauto-var-init-approach=[A|B|C|D]	Choose the approach to initialize automatic variables.
+
+Enum
+Name(auto_init_approach) Type(enum auto_init_approach) UnknownError(unrecognized automatic variable initialization approach %qs)
+
+EnumValue
+Enum(auto_init_approach) String(A) Value(AUTO_INIT_A)
+
+EnumValue
+Enum(auto_init_approach) String(B) Value(AUTO_INIT_B)
+
+EnumValue
+Enum(auto_init_approach) String(C) Value(AUTO_INIT_C)
+
+EnumValue
+Enum(auto_init_approach) String(D) Value(AUTO_INIT_D)
+
 ; -fverbose-asm causes extra commentary information to be produced in
 ; the generated assembly code (to make it more readable).  This option
 ; is generally only of use to those who actually need to read the
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index 9342bd87be3..bfd0692b82c 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -242,6 +242,20 @@ enum vect_cost_model {
   VECT_COST_MODEL_DEFAULT = 1
 };
 
+/* Automatic variable initialization type.  */
+enum auto_init_type {
+  AUTO_INIT_UNINITIALIZED = 0,
+  AUTO_INIT_PATTERN = 1,
+  AUTO_INIT_ZERO = 2
+};
+
+enum auto_init_approach {
+  AUTO_INIT_A = 0,
+  AUTO_INIT_B = 1,
+  AUTO_INIT_C = 2,
+  AUTO_INIT_D = 3
+};
+
 /* Different instrumentation modes.  */
 enum sanitize_code {
   /* AddressSanitizer.  */
diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index 075d6e5208a..1044d54e8d3 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -81,7 +81,7 @@ newline_and_indent (pretty_printer *buffer, int spc)
 DEBUG_FUNCTION void
 debug_gimple_stmt (gimple *gs)
 {
-  print_gimple_stmt (stderr, gs, 0, TDF_VOPS|TDF_MEMSYMS);
+  print_gimple_stmt (stderr, gs, 0, TDF_VOPS|TDF_MEMSYMS|TDF_LINENO|TDF_ALIAS);
 }
 
 
diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 54cb66bd1dd..1eb0747ea2f 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -1674,6 +1674,16 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p)
   return GS_ALL_DONE;
 }
 
+/* Return the value that is used to initialize the VLA DECL based
+   on INIT_TYPE.  */
+
+static tree
+memset_init_node (enum auto_init_type init_type)
+{
+  if (init_type == AUTO_INIT_ZERO)
+    return integer_zero_node;
+  gcc_assert (0);
+  return NULL_TREE;
+}
+
 /* Gimplify a variable-length array DECL.  */
 
 static void
@@ -1712,6 +1722,19 @@ gimplify_vla_decl (tree decl, gimple_seq *seq_p)
 
   gimplify_and_add (t, seq_p);
 
+  /* Add a call to memset to initialize this VLA when the user requested.  */
+  if (flag_trivial_auto_var_init > AUTO_INIT_UNINITIALIZED
+      && !DECL_ARTIFICIAL (decl)
+      && VAR_P (decl)
+      && !DECL_EXTERNAL (decl)
+      && !TREE_STATIC (decl))
+    {
+      t = builtin_decl_implicit (BUILT_IN_MEMSET);
+      tree init_node = memset_init_node (flag_trivial_auto_var_init);
+      t = build_call_expr (t, 3, addr, init_node, DECL_SIZE_UNIT (decl));
+      gimplify_and_add (t, seq_p);
+    }
+
   /* Record the dynamic allocation associated with DECL if requested.  */
   if (flag_callgraph_info & CALLGRAPH_INFO_DYNAMIC_ALLOC)
 record_dynamic_alloc (decl);
@@ -1734,6 +1757,63 @@ force_labels_r (tree *tp, int *walk_subtrees, void *data 
ATTRIBUTE_UNUSED)
   return NULL_TREE;
 }
 
+
+/* Build a call to internal const function DEFERRED_INIT,
+   1st argument: DECL;
+  

The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-05 Thread Qing Zhao via Gcc-patches
Hi,

This is an update for our previous discussion. 

1. I implemented the following two different implementations in the latest 
upstream gcc:

A. Add real initializations during gimplification, without maintaining the
uninitialized warnings.

D. Add calls to .DEFERRED_INIT during gimplification, then expand the
.DEFERRED_INIT calls into real initializations during expand.  Adjust the
uninitialized pass to handle the new refs to “.DEFERRED_INIT”.

Note, in this initial implementation,
** I ONLY implemented -ftrivial-auto-var-init=zero; the implementation of
-ftrivial-auto-var-init=pattern is not done yet.  Therefore, the performance
data below is only for -ftrivial-auto-var-init=zero.

** I added a temporary option -fauto-var-init-approach=A|B|C|D to choose
implementation A or D for the runtime performance study.
** I didn’t finish the uninitialized warnings maintenance work for D.
(That might take more time than I expected.)

2. I collected runtime data for CPU2017 on a x86 machine with this new gcc for 
the following 3 cases:

no: default. (-g -O2 -march=native )
A:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=A 
D:  default +  -ftrivial-auto-var-init=zero -fauto-var-init-approach=D 

And then compute the slowdown data for both A and D as follows:

benchmarks         A / no   D / no

500.perlbench_r     1.25%    1.25%
502.gcc_r           0.68%    1.80%
505.mcf_r           0.68%    0.14%
520.omnetpp_r       4.83%    4.68%
523.xalancbmk_r     0.18%    1.96%
525.x264_r          1.55%    2.07%
531.deepsjeng_r    11.57%   11.85%
541.leela_r         0.64%    0.80%
557.xz_r           -0.41%   -0.41%

507.cactuBSSN_r     0.44%    0.44%
508.namd_r          0.34%    0.34%
510.parest_r        0.17%    0.25%
511.povray_r       56.57%   57.27%
519.lbm_r           0.00%    0.00%
521.wrf_r          -0.28%   -0.37%
526.blender_r      16.96%   17.71%
527.cam4_r          0.70%    0.53%
538.imagick_r       2.40%    2.40%
544.nab_r           0.00%   -0.65%

avg                 5.17%    5.37%
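The percentages above are presumably computed as (optioned time / baseline
time - 1) x 100; a quick sketch, with made-up sample times for illustration:

```shell
# slowdown OPT_TIME BASE_TIME -> percent slowdown relative to baseline
slowdown () {
  awk -v opt="$1" -v base="$2" 'BEGIN { printf "%.2f%%\n", (opt / base - 1) * 100 }'
}
slowdown 156.57 100.00   # prints 56.57%
```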

From the above data, we can see that in general, the runtime performance
slowdowns for implementations A and D are similar for individual benchmarks.

There are several benchmarks that have significant slowdown with the newly
added initialization for both A and D, for example 511.povray_r,
526.blender_r, and 531.deepsjeng_r.  I will try to study a little more what
kind of new initializations introduced such slowdown.

From the current study so far, I think that approach D should be good enough 
for our final implementation. 
So, I will try to finish approach D with the following remaining work

  ** complete the implementation of -ftrivial-auto-var-init=pattern;
  ** complete the implementation of uninitialized warnings maintenance work 
for D. 


Let me know if you have any comments and suggestions on my current and future 
work.

Thanks a lot for your help.

Qing

> On Dec 9, 2020, at 10:18 AM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> The following are the approaches I will implement and compare:
> 
> Our final goal is to keep the uninitialized warning and minimize the run-time 
> performance cost.
> 
> A. Add real initializations during gimplification, without maintaining the
> uninitialized warnings.
> B. Add real initializations during gimplification, marking them with
> “artificial_init”.  Adjust the uninitialized pass to maintain the
> annotation, making sure the real init is not deleted from the fake init.
> C.  Marking the DECL for an uninitialized auto variable as “no_explicit_init” 
> during gimplification,
>  maintain this “no_explicit_init” bit till after 
> pass_late_warn_uninitialized, or till pass_expand, 
>  add real initialization for all DECLs that are marked with 
> “no_explicit_init”.
> D. Add .DEFERRED_INIT calls during gimplification, then expand the
> .DEFERRED_INIT calls into real initializations during expand.  Adjust the
> uninitialized pass to handle the new refs to “.DEFERRED_INIT”.
> 
> 
> In the above, approach A will be the one that has the minimum run-time cost,
> and will be the baseline for the performance comparison.
> 
> I will implement approach D next; it is expected to have the most run-time
> overhead among the approaches above, but its implementation should be the
> cleanest among B, C, and D.  Let’s see how much performance overhead this
> approach adds.  If the data is good, maybe we can avoid the effort of
> implementing B and C.
> 
> If the performance of D is not good, I will implement B or C at that time.
> 
> Let me know if you have any comment or suggestions.
> 
> Thanks.
> 
> Qing