Re: [RFA] [PR tree-optimization/33562] [PATCH 1/4] Byte tracking in DSE - v3

2017-01-11 Thread Jeff Law

On 01/04/2017 06:23 AM, Richard Biener wrote:

On Wed, Jan 4, 2017 at 2:22 PM, Richard Biener
 wrote:

On Thu, Dec 22, 2016 at 7:26 AM, Jeff Law  wrote:

This is the first of the 4 part patchkit to address deficiencies in our DSE
implementation.

This patch addresses the P2 regression 33562 which has been a low priority
regression since gcc-4.3.  To summarize, DSE no longer has the ability to
detect an aggregate store as dead if subsequent stores are done in a
piecemeal fashion.

I originally tackled this by changing how we lower complex objects. That was
sufficient to address 33562, but was reasonably rejected.

This version attacks the problem by improving DSE to track stores to memory
at a byte level.  That allows us to determine if a series of stores
completely covers an earlier store (thus making the earlier store dead).

A useful side effect of this is we can detect when parts of a store are dead
and potentially rewrite the store.  This patch implements that for complex
object initializations.  While not strictly part of 33562, it's so closely
related that I felt it belongs as part of this patch.

This originally limited the size of the tracked memory space to 64 bytes.  I
bumped the limit after working through the CONSTRUCTOR and mem* trimming
patches.  The 256 byte limit is still fairly arbitrary and I wouldn't lose
sleep if we throttled back to 64 or 128 bytes.

Later patches in the kit will build upon this patch.  So if pieces look like
skeleton code, that's because it is.

The changes since the V2 patch are:

1. Using sbitmaps rather than bitmaps.
2. Returning a tri-state from dse_classify_store (renamed from
dse_possible_dead_store_p)
3. More efficient trim computation
4. Moving trimming code out of dse_classify_store
5. Refactoring code to delete dead calls/assignments
6. dse_optimize_stmt moves into the dse_dom_walker class

Not surprisingly, this patch has most of the changes based on prior feedback
as it includes the raw infrastructure.

Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?


New functions in sbitmap.c lack function comments.

bitmap_count_bits fails to guard against GCC_VERSION >= 3400 (the version
is not important, but non-GCC host compilers are).  See bitmap.c for a
fallback.

Both bitmap_clear_range and bitmap_set_range look rather inefficient...
(it's not likely anybody will clean this up after you)

I'd say split out the sbitmap.[ch] changes.

+DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
+"dse-max-object-size",
+"Maximum size (in bytes) of objects tracked by dead store
elimination.",
+256, 0, 0)

the docs suggest that DSE doesn't handle larger stores but it does (just in
the original limited way).  Maybe "tracked bytewise" is better.


Oh, and new --params need documeting in invoke.texi.

Fixed.

jeff



Re: [RFA] [PR tree-optimization/33562] [PATCH 1/4] Byte tracking in DSE - v3

2017-01-11 Thread Jeff Law

On 01/04/2017 06:22 AM, Richard Biener wrote:



Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?


New functions in sbitmap.c lack function comments.

Bah.  Sophomoric on my part.  Fixed.



bitmap_count_bits fails to guard against GCC_VERSION >= 3400 (the version
is not important, but non-GCC host compilers are).  See bitmap.c for a
fallback.
Mistake on my part.  I keep thinking we support starting the bootstrap 
process with the most recently released GCC, but we support 3.4 as well 
as other C++98/C++03 compilers.  Fixed in the next update (and tested by 
forcing the fallback method).




Both bitmap_clear_range and bitmap_set_range look rather inefficient...
(it's not likely anybody will clean this up after you)
They were, but not anymore.  Now they build a mask to deal with any 
partial clearing/setting in the first word, then a single memset for any 
whole words in the middle, then another masking operation on residuals 
in the last word.  Verified behavior by by keeping two bitmaps, one with 
the old slow approach and one with the faster implementation and 
checking for equality.  Obviously those verification bits won't be in 
the final patch.



I'd say split out the sbitmap.[ch] changes.

Sure.  That's easy enough.



+DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
+"dse-max-object-size",
+"Maximum size (in bytes) of objects tracked by dead store
elimination.",
+256, 0, 0)

the docs suggest that DSE doesn't handle larger stores but it does (just in
the original limited way).  Maybe "tracked bytewise" is better.

Agreed and fixed.




+static bool
+valid_ao_ref_for_dse (ao_ref *ref)
+{
+  return (ao_ref_base (ref)
+ && ref->max_size != -1
+ && (ref->offset % BITS_PER_UNIT) == 0
+ && (ref->size % BITS_PER_UNIT) == 0
+ && (ref->size / BITS_PER_UNIT) > 0);

I think the last test is better written as ref->size != -1.

Seems reasonable.  Fixed.




Btw, seeing you discount non-byte size/offset stores this somehow asks
for store-merging being done before the last DSE (it currently runs after).
Sth to keep in mind for GCC 8.
Yea, probably.  Of course it may also be the case that DSE enables store 
merging.  Worth some experimentation.





+/* Delete a dead store STMT, which is mem* call of some kind.  */

call STMT

Fixed.



+static void
+delete_dead_call (gimple *stmt)
+{
+
excess vertical space

Likewise.


..
+  if (lhs)
+{
+  tree ptr = gimple_call_arg (stmt, 0);
+  gimple *new_stmt = gimple_build_assign (lhs, ptr);
+  unlink_stmt_vdef (stmt);
+  if (gsi_replace (&gsi, new_stmt, true))
+bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);

  release_ssa_name (gimple_vdef (stmt));


+  { m_live_bytes = sbitmap_alloc (PARAM_VALUE
(PARAM_DSE_MAX_OBJECT_SIZE));m_byte_tracking_enabled = false; }

formatting.

Yea. Fixed.



The DSE parts look good to me with the nits above fixed.  Just re-spin
the sbitmap.[ch] part please.

Will repost the sbitmap.c bits after retesting the series.

jeff




Re: [RFA] [PR tree-optimization/33562] [PATCH 1/4] Byte tracking in DSE - v3

2017-01-04 Thread Richard Biener
On Wed, Jan 4, 2017 at 2:22 PM, Richard Biener
 wrote:
> On Thu, Dec 22, 2016 at 7:26 AM, Jeff Law  wrote:
>> This is the first of the 4 part patchkit to address deficiencies in our DSE
>> implementation.
>>
>> This patch addresses the P2 regression 33562 which has been a low priority
>> regression since gcc-4.3.  To summarize, DSE no longer has the ability to
>> detect an aggregate store as dead if subsequent stores are done in a
>> piecemeal fashion.
>>
>> I originally tackled this by changing how we lower complex objects. That was
>> sufficient to address 33562, but was reasonably rejected.
>>
>> This version attacks the problem by improving DSE to track stores to memory
>> at a byte level.  That allows us to determine if a series of stores
>> completely covers an earlier store (thus making the earlier store dead).
>>
>> A useful side effect of this is we can detect when parts of a store are dead
>> and potentially rewrite the store.  This patch implements that for complex
>> object initializations.  While not strictly part of 33562, it's so closely
>> related that I felt it belongs as part of this patch.
>>
>> This originally limited the size of the tracked memory space to 64 bytes.  I
>> bumped the limit after working through the CONSTRUCTOR and mem* trimming
>> patches.  The 256 byte limit is still fairly arbitrary and I wouldn't lose
>> sleep if we throttled back to 64 or 128 bytes.
>>
>> Later patches in the kit will build upon this patch.  So if pieces look like
>> skeleton code, that's because it is.
>>
>> The changes since the V2 patch are:
>>
>> 1. Using sbitmaps rather than bitmaps.
>> 2. Returning a tri-state from dse_classify_store (renamed from
>> dse_possible_dead_store_p)
>> 3. More efficient trim computation
>> 4. Moving trimming code out of dse_classify_store
>> 5. Refactoring code to delete dead calls/assignments
>> 6. dse_optimize_stmt moves into the dse_dom_walker class
>>
>> Not surprisingly, this patch has most of the changes based on prior feedback
>> as it includes the raw infrastructure.
>>
>> Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?
>
> New functions in sbitmap.c lack function comments.
>
> bitmap_count_bits fails to guard against GCC_VERSION >= 3400 (the version
> is not important, but non-GCC host compilers are).  See bitmap.c for a
> fallback.
>
> Both bitmap_clear_range and bitmap_set_range look rather inefficient...
> (it's not likely anybody will clean this up after you)
>
> I'd say split out the sbitmap.[ch] changes.
>
> +DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
> +"dse-max-object-size",
> +"Maximum size (in bytes) of objects tracked by dead store
> elimination.",
> +256, 0, 0)
>
> the docs suggest that DSE doesn't handle larger stores but it does (just in
> the original limited way).  Maybe "tracked bytewise" is better.

Oh, and new --params need documeting in invoke.texi.

Richard.

> +static bool
> +valid_ao_ref_for_dse (ao_ref *ref)
> +{
> +  return (ao_ref_base (ref)
> + && ref->max_size != -1
> + && (ref->offset % BITS_PER_UNIT) == 0
> + && (ref->size % BITS_PER_UNIT) == 0
> + && (ref->size / BITS_PER_UNIT) > 0);
>
> I think the last test is better written as ref->size != -1.
>
> Btw, seeing you discount non-byte size/offset stores this somehow asks
> for store-merging being done before the last DSE (it currently runs after).
> Sth to keep in mind for GCC 8.
>
> +/* Delete a dead store STMT, which is mem* call of some kind.  */
>
> call STMT
>
> +static void
> +delete_dead_call (gimple *stmt)
> +{
> +
> excess vertical space
> ..
> +  if (lhs)
> +{
> +  tree ptr = gimple_call_arg (stmt, 0);
> +  gimple *new_stmt = gimple_build_assign (lhs, ptr);
> +  unlink_stmt_vdef (stmt);
> +  if (gsi_replace (&gsi, new_stmt, true))
> +bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
>
>   release_ssa_name (gimple_vdef (stmt));
>
>
> +  { m_live_bytes = sbitmap_alloc (PARAM_VALUE
> (PARAM_DSE_MAX_OBJECT_SIZE));m_byte_tracking_enabled = false; }
>
> formatting.
>
> The DSE parts look good to me with the nits above fixed.  Just re-spin
> the sbitmap.[ch] part please.
>
> Thanks,
> Richard.
>
>>
>> PR tree-optimization/33562
>> * params.def (PARM_DSE_MAX_OBJECT_SIZE): New PARAM.
>> * sbitmap.h (bitmap_clear_range, bitmap_set_range): Prototype new
>> functions.
>> (bitmap_count_bits): Likewise.
>> * sbitmap.c (bitmap_clear_range, bitmap_set_range): New functions.
>> (bitmap_count_bits): Likewise.
>> * tree-ssa-dse.c: Include params.h.
>> (dse_store_status): New enum.
>> (initialize_ao_ref_for_dse): New, partially extracted from
>> dse_optimize_stmt.
>> (valid_ao_ref_for_dse, normalize_ref): New.
>> (setup_live_bytes_from_ref, compute_trims): Likewise.
>> (clear_bytes_written_by, trim_complex_store): Likewise.
>> (maybe_trim_partially_dead_s

Re: [RFA] [PR tree-optimization/33562] [PATCH 1/4] Byte tracking in DSE - v3

2017-01-04 Thread Richard Biener
On Thu, Dec 22, 2016 at 7:26 AM, Jeff Law  wrote:
> This is the first of the 4 part patchkit to address deficiencies in our DSE
> implementation.
>
> This patch addresses the P2 regression 33562 which has been a low priority
> regression since gcc-4.3.  To summarize, DSE no longer has the ability to
> detect an aggregate store as dead if subsequent stores are done in a
> piecemeal fashion.
>
> I originally tackled this by changing how we lower complex objects. That was
> sufficient to address 33562, but was reasonably rejected.
>
> This version attacks the problem by improving DSE to track stores to memory
> at a byte level.  That allows us to determine if a series of stores
> completely covers an earlier store (thus making the earlier store dead).
>
> A useful side effect of this is we can detect when parts of a store are dead
> and potentially rewrite the store.  This patch implements that for complex
> object initializations.  While not strictly part of 33562, it's so closely
> related that I felt it belongs as part of this patch.
>
> This originally limited the size of the tracked memory space to 64 bytes.  I
> bumped the limit after working through the CONSTRUCTOR and mem* trimming
> patches.  The 256 byte limit is still fairly arbitrary and I wouldn't lose
> sleep if we throttled back to 64 or 128 bytes.
>
> Later patches in the kit will build upon this patch.  So if pieces look like
> skeleton code, that's because it is.
>
> The changes since the V2 patch are:
>
> 1. Using sbitmaps rather than bitmaps.
> 2. Returning a tri-state from dse_classify_store (renamed from
> dse_possible_dead_store_p)
> 3. More efficient trim computation
> 4. Moving trimming code out of dse_classify_store
> 5. Refactoring code to delete dead calls/assignments
> 6. dse_optimize_stmt moves into the dse_dom_walker class
>
> Not surprisingly, this patch has most of the changes based on prior feedback
> as it includes the raw infrastructure.
>
> Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?

New functions in sbitmap.c lack function comments.

bitmap_count_bits fails to guard against GCC_VERSION >= 3400 (the version
is not important, but non-GCC host compilers are).  See bitmap.c for a
fallback.

Both bitmap_clear_range and bitmap_set_range look rather inefficient...
(it's not likely anybody will clean this up after you)

I'd say split out the sbitmap.[ch] changes.

+DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
+"dse-max-object-size",
+"Maximum size (in bytes) of objects tracked by dead store
elimination.",
+256, 0, 0)

the docs suggest that DSE doesn't handle larger stores but it does (just in
the original limited way).  Maybe "tracked bytewise" is better.

+static bool
+valid_ao_ref_for_dse (ao_ref *ref)
+{
+  return (ao_ref_base (ref)
+ && ref->max_size != -1
+ && (ref->offset % BITS_PER_UNIT) == 0
+ && (ref->size % BITS_PER_UNIT) == 0
+ && (ref->size / BITS_PER_UNIT) > 0);

I think the last test is better written as ref->size != -1.

Btw, seeing you discount non-byte size/offset stores this somehow asks
for store-merging being done before the last DSE (it currently runs after).
Sth to keep in mind for GCC 8.

+/* Delete a dead store STMT, which is mem* call of some kind.  */

call STMT

+static void
+delete_dead_call (gimple *stmt)
+{
+
excess vertical space
..
+  if (lhs)
+{
+  tree ptr = gimple_call_arg (stmt, 0);
+  gimple *new_stmt = gimple_build_assign (lhs, ptr);
+  unlink_stmt_vdef (stmt);
+  if (gsi_replace (&gsi, new_stmt, true))
+bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);

  release_ssa_name (gimple_vdef (stmt));


+  { m_live_bytes = sbitmap_alloc (PARAM_VALUE
(PARAM_DSE_MAX_OBJECT_SIZE));m_byte_tracking_enabled = false; }

formatting.

The DSE parts look good to me with the nits above fixed.  Just re-spin
the sbitmap.[ch] part please.

Thanks,
Richard.

>
> PR tree-optimization/33562
> * params.def (PARM_DSE_MAX_OBJECT_SIZE): New PARAM.
> * sbitmap.h (bitmap_clear_range, bitmap_set_range): Prototype new
> functions.
> (bitmap_count_bits): Likewise.
> * sbitmap.c (bitmap_clear_range, bitmap_set_range): New functions.
> (bitmap_count_bits): Likewise.
> * tree-ssa-dse.c: Include params.h.
> (dse_store_status): New enum.
> (initialize_ao_ref_for_dse): New, partially extracted from
> dse_optimize_stmt.
> (valid_ao_ref_for_dse, normalize_ref): New.
> (setup_live_bytes_from_ref, compute_trims): Likewise.
> (clear_bytes_written_by, trim_complex_store): Likewise.
> (maybe_trim_partially_dead_store): Likewise.
> (maybe_trim_complex_store): Likewise.
> (dse_classify_store): Renamed from dse_possibly_dead_store_p.
> Track what bytes live from the original store.  Return tri-state
> for dead, partially dead or live.
> (dse_dom_walker): Add constru

[RFA] [PR tree-optimization/33562] [PATCH 1/4] Byte tracking in DSE - v3

2016-12-21 Thread Jeff Law
This is the first of the 4 part patchkit to address deficiencies in our 
DSE implementation.


This patch addresses the P2 regression 33562 which has been a low 
priority regression since gcc-4.3.  To summarize, DSE no longer has the 
ability to detect an aggregate store as dead if subsequent stores are 
done in a piecemeal fashion.


I originally tackled this by changing how we lower complex objects. That 
was sufficient to address 33562, but was reasonably rejected.


This version attacks the problem by improving DSE to track stores to 
memory at a byte level.  That allows us to determine if a series of 
stores completely covers an earlier store (thus making the earlier store 
dead).


A useful side effect of this is we can detect when parts of a store are 
dead and potentially rewrite the store.  This patch implements that for 
complex object initializations.  While not strictly part of 33562, it's 
so closely related that I felt it belongs as part of this patch.


This originally limited the size of the tracked memory space to 64 
bytes.  I bumped the limit after working through the CONSTRUCTOR and 
mem* trimming patches.  The 256 byte limit is still fairly arbitrary and 
I wouldn't lose sleep if we throttled back to 64 or 128 bytes.


Later patches in the kit will build upon this patch.  So if pieces look 
like skeleton code, that's because it is.


The changes since the V2 patch are:

1. Using sbitmaps rather than bitmaps.
2. Returning a tri-state from dse_classify_store (renamed from 
dse_possible_dead_store_p)

3. More efficient trim computation
4. Moving trimming code out of dse_classify_store
5. Refactoring code to delete dead calls/assignments
6. dse_optimize_stmt moves into the dse_dom_walker class

Not surprisingly, this patch has most of the changes based on prior 
feedback as it includes the raw infrastructure.


Bootstrapped and regression tested on x86_64-linux-gnu.  OK for the trunk?

PR tree-optimization/33562
* params.def (PARM_DSE_MAX_OBJECT_SIZE): New PARAM.
* sbitmap.h (bitmap_clear_range, bitmap_set_range): Prototype new
functions.
(bitmap_count_bits): Likewise.
* sbitmap.c (bitmap_clear_range, bitmap_set_range): New functions.
(bitmap_count_bits): Likewise.
* tree-ssa-dse.c: Include params.h.
(dse_store_status): New enum.
(initialize_ao_ref_for_dse): New, partially extracted from
dse_optimize_stmt.
(valid_ao_ref_for_dse, normalize_ref): New.
(setup_live_bytes_from_ref, compute_trims): Likewise.
(clear_bytes_written_by, trim_complex_store): Likewise.
(maybe_trim_partially_dead_store): Likewise.
(maybe_trim_complex_store): Likewise.
(dse_classify_store): Renamed from dse_possibly_dead_store_p.
Track what bytes live from the original store.  Return tri-state
for dead, partially dead or live.
(dse_dom_walker): Add constructor, destructor and new private members.
(delete_dead_call, delete_dead_assignment): New extracted from
dse_optimize_stmt.
(dse_optimize_stmt): Make a member of dse_dom_walker.
Use initialize_ao_ref_for_dse.


* gcc.dg/tree-ssa/complex-4.c: No longer xfailed.
* gcc.dg/tree-ssa/complex-5.c: Likewise.
* gcc.dg/tree-ssa/ssa-dse-9.c: Likewise.
* gcc.dg/tree-ssa/ssa-dse-18.c: New test.
* gcc.dg/tree-ssa/ssa-dse-19.c: Likewise.
* gcc.dg/tree-ssa/ssa-dse-20.c: Likewise.
* gcc.dg/tree-ssa/ssa-dse-21.c: Likewise.

diff --git a/gcc/params.def b/gcc/params.def
index 50f75a7..f367c1d 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -532,6 +532,11 @@ DEFPARAM(PARAM_AVG_LOOP_NITER,
 "Average number of iterations of a loop.",
 10, 1, 0)
 
+DEFPARAM(PARAM_DSE_MAX_OBJECT_SIZE,
+"dse-max-object-size",
+"Maximum size (in bytes) of objects tracked by dead store 
elimination.",
+256, 0, 0)
+
 DEFPARAM(PARAM_SCEV_MAX_EXPR_SIZE,
 "scev-max-expr-size",
 "Bound on size of expressions used in the scalar evolutions analyzer.",
diff --git a/gcc/sbitmap.c b/gcc/sbitmap.c
index 10b4347..2b66a6c 100644
--- a/gcc/sbitmap.c
+++ b/gcc/sbitmap.c
@@ -202,6 +202,39 @@ bitmap_empty_p (const_sbitmap bmap)
   return true;
 }
 
+void
+bitmap_clear_range (sbitmap bmap, unsigned int start, unsigned int count)
+{
+  for (unsigned int i = start; i < start + count; i++)
+bitmap_clear_bit (bmap, i);
+}
+
+void
+bitmap_set_range (sbitmap bmap, unsigned int start, unsigned int count)
+{
+  for (unsigned int i = start; i < start + count; i++)
+bitmap_set_bit (bmap, i);
+}
+
+
+unsigned int
+bitmap_count_bits (const_sbitmap bmap)
+{
+  unsigned int count = 0;
+  for (unsigned int i = 0; i < bmap->size; i++)
+if (bmap->elms[i])
+  {
+# if HOST_BITS_PER_WIDEST_FAST_INT == HOST_BITS_PER_LONG
+   count += __builtin_popcountl (bmap->elms[i]);
+# elif HOST_BITS_PER_WIDEST_FAST_INT == HOST_BITS_PER_LONGL