Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-08-04 Thread Richard Biener
On Wed, Aug 3, 2016 at 5:15 PM, Kyrill Tkachov
 wrote:
> Hi Richard,
>
> On 18/07/16 13:22, Richard Biener wrote:
>>
>> On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
>>  wrote:
>
>
> 
>
>
>> +  /* Record the original statements so that we can keep track of
>> +statements emitted in this pass and not re-process new
>> +statements.  */
>> +  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> ())
>> +   {
>> + gimple *stmt = gsi_stmt (gsi);
>> + if (!is_gimple_debug (stmt))
>> +   orig_stmts.add (stmt);
>> + num_statements++;
>> +   }
>>
>> please use gimple_set_visited () instead, that should be cheaper.
>>
>>
>> +  do
>> +   {
>> + changes_made = false;
>> + for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> ())
>> +   {
>> ...
>> +   }
>> +  while (changes_made);
>>
>> looks pretty quadratic to me.  Instead of tracking things with
>> m_curr_base_expr
>> why not use a hash-map to track stores related to a base?
>>
>> +  /* Don't handle bitfields that are not a
>> multiple
>> + of BITS_PER_UNIT for now.  Can be extended
>> + later.  */
>> +  && (bitsize % BITS_PER_UNIT == 0)
>>
>> :(
>>
>> +  && !stmt_interferes_with_mem_accesses_p (lhs))
>>
>> given this function loops over all accesses this is quadratic as well.
>>
>> I think alias queries could be simplified if you reduce them to alias
>> checks on the base object (and allow overlapping constant stores
>> which should be easy to handle during merging).
>
>
> I've implemented that and it simplified the detection code and reduced its
> complexity, but I'm now missing some cases that were being caught before.
> An example is:
> struct bar {
>   int a;
>   char b;
>   char c;
>   char d;
>   char e;
>   char g;
> };
>
> void
> foo1 (struct bar *p, char tmp)
> {
>   p->a = 0;
>   p->b = 0;
>   p->g = tmp;
>   p->c = 0;
>   p->d = 0;
>   p->e = 0;
> }
>
> The store to 'g' doesn't interfere with the contiguous stores to the earlier
> fields, but because we perform the aliasing checks on the base object 'p'
> rather than the full LHS of the assignments, it is deemed to alias the
> constant stores, so only the first two and the last three constant stores
> are merged instead of all five.
>
> I'll experiment with some solutions for this involving recording the
> non-constant stores and performing
> some trickery during the merging phase.

Not sure how/where exactly you perform the alias checks, but alias
checks within a group sharing the same base should use the full
reference so that offsets/sizes are factored in.  Only cross-group
would I resort to base-only alias checks.
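
For illustration, a rough sketch of such a two-level check could look like the
following (store_group and its fields are made-up names, not from the posted
patch; refs_may_alias_p is GCC's tree-level alias query):

/* Hypothetical group of constant stores sharing one base object.  */
struct store_group
{
  tree base;            /* common base object of the group */
  vec<tree> lhs_refs;   /* LHS of each recorded constant store */
};

/* Return true if the access LHS (with base object BASE) interferes
   with GROUP: a precise full-reference check when the bases match,
   a conservative base-only check across groups.  */
static bool
store_interferes_p (struct store_group *group, tree lhs, tree base)
{
  if (!operand_equal_p (group->base, base, 0))
    /* Different base: conservative base-only check.  */
    return refs_may_alias_p (group->base, base);

  unsigned i;
  tree recorded;
  FOR_EACH_VEC_ELT (group->lhs_refs, i, recorded)
    /* Offsets and sizes are factored in here, so a store to an
       unrelated field (e.g. p->g) does not break up the group.  */
    if (refs_may_alias_p (recorded, lhs))
      return true;
  return false;
}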

Richard.

> Thanks,
> Kyrill
>
>
>> The VUSE/VDEF handling is somewhat odd.  All stores have both
>> a VDEF and a VUSE; if you merge a set of them you can simply
>> re-use the VDEF/VUSE of one of them, effectively replacing the
>> stmt.  For all other stores you remove you miss a
>>unlink_stmt_vdef (stmt);
>>release_defs (stmt);
>> to update virtual SSA form and properly release SSA names.
>>
>> As you use TBAA in your alias checks you may only _sink_
>> stores.  Not sure if your merged store placement ensures this.
>>
>> I think this kind of transform is useful in early optimizations, not only
>> very late as you perform it.  Of course it should be really cheap there.
>>
>> Note that options like -ftree-store-widening are discouraged
>> ("tree" means nothing to our users and store widening isn't
>> what is done - it's store merging).  Simply name it -fstore-merging.
>> Also the file name should be gimple-ssa-store-merging.c
>>
>> Looking forward to an algorithmically enhanced version.
>>
>> Richard.
>>
>>
>>> Thanks,
>>> Kyrill
>>>
>>> N.B. I'm going on vacation until August so I won't be able to respond to
>>> any
>>> feedback until I get back.
>>>
>>> [1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
>>>
>>> 2016-07-15  Kyrylo Tkachov  
>>>
>>>  PR middle-end/22141
>>>  * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
>>>  * common.opt (ftree-store-widening): New Optimization option.
>>>  * opts.c (default_options_table): Add entry for
>>>  OPT_ftree_store_widening.
>>>  * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
>>>  * passes.def: Insert pass_tree_store_widening.
>>>  * tree-pass.h (make_pass_tree_store_widening): Declare extern
>>>  prototype.
>>>  * tree-ssa-store-widening.c: New file.
>>>  * doc/invoke.texi (Optimization Options): Document
>>>  -ftree-store-widening.
>>>
>>> 2016-07-15  Kyrylo Tkachov  
>>>  Jakub Jelinek  
>>>
>>>  PR middle-end/22141
>>>  * gcc.c-torture/execute/pr22141-1.c: New test.

Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-08-03 Thread Kyrill Tkachov

Hi Richard,

On 18/07/16 13:22, Richard Biener wrote:

On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
 wrote:





+  /* Record the original statements so that we can keep track of
+statements emitted in this pass and not re-process new
+statements.  */
+  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next ())
+   {
+ gimple *stmt = gsi_stmt (gsi);
+ if (!is_gimple_debug (stmt))
+   orig_stmts.add (stmt);
+ num_statements++;
+   }

please use gimple_set_visited () instead, that should be cheaper.
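
A minimal sketch of that alternative (not part of the posted patch;
mark_original_statements is a hypothetical helper) could be:

/* Mark the statements already present in BB so the pass can later test
   gimple_visited_p () to skip statements it emitted itself, instead of
   recording them in a hash_set.  */
static void
mark_original_statements (basic_block bb)
{
  for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
       !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple *stmt = gsi_stmt (gsi);
      if (!is_gimple_debug (stmt))
        gimple_set_visited (stmt, true);
    }
}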


+  do
+   {
+ changes_made = false;
+ for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next ())
+   {
...
+   }
+  while (changes_made);

looks pretty quadratic to me.  Instead of tracking things with m_curr_base_expr
why not use a hash-map to track stores related to a base?

+  /* Don't handle bitfields that are not a multiple
+ of BITS_PER_UNIT for now.  Can be extended
+ later.  */
+  && (bitsize % BITS_PER_UNIT == 0)

:(

+  && !stmt_interferes_with_mem_accesses_p (lhs))

given this function loops over all accesses this is quadratic as well.

I think alias queries could be simplified if you reduce them to alias
checks on the base object (and allow overlapping constant stores
which should be easy to handle during merging).


I've implemented that and it simplified the detection code and reduced its
complexity, but I'm now missing some cases that were being caught before.
An example is:
struct bar {
  int a;
  char b;
  char c;
  char d;
  char e;
  char g;
};

void
foo1 (struct bar *p, char tmp)
{
  p->a = 0;
  p->b = 0;
  p->g = tmp;
  p->c = 0;
  p->d = 0;
  p->e = 0;
}

The store to 'g' doesn't interfere with the contiguous stores to the earlier
fields, but because we perform the aliasing checks on the base object 'p'
rather than the full LHS of the assignments, it is deemed to alias the
constant stores, so only the first two and the last three constant stores
are merged instead of all five.

I'll experiment with some solutions for this involving recording the 
non-constant stores and performing
some trickery during the merging phase.

Thanks,
Kyrill


The VUSE/VDEF handling is somewhat odd.  All stores have both
a VDEF and a VUSE; if you merge a set of them you can simply
re-use the VDEF/VUSE of one of them, effectively replacing the
stmt.  For all other stores you remove you miss a
   unlink_stmt_vdef (stmt);
   release_defs (stmt);
to update virtual SSA form and properly release SSA names.
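
A hedged sketch of that cleanup, with hypothetical names (merged_stores for the
statements of one group, new_stmt for the wide store replacing them; this is
not the posted patch):

/* Replace the first statement of MERGED_STORES with NEW_STMT, reusing
   its virtual operands, and remove the remaining statements while
   keeping virtual SSA form up to date.  */
static void
replace_merged_stores (vec<gimple *> &merged_stores, gimple *new_stmt)
{
  gimple *first = merged_stores[0];
  gimple_set_vuse (new_stmt, gimple_vuse (first));
  gimple_set_vdef (new_stmt, gimple_vdef (first));
  if (gimple_vdef (new_stmt))
    SSA_NAME_DEF_STMT (gimple_vdef (new_stmt)) = new_stmt;
  gimple_stmt_iterator gsi = gsi_for_stmt (first);
  gsi_replace (&gsi, new_stmt, false);

  for (unsigned i = 1; i < merged_stores.length (); i++)
    {
      gimple *stmt = merged_stores[i];
      gimple_stmt_iterator rsi = gsi_for_stmt (stmt);
      /* Rewire virtual uses around STMT's VDEF, then release its
         SSA names.  */
      unlink_stmt_vdef (stmt);
      gsi_remove (&rsi, true);
      release_defs (stmt);
    }
}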

As you use TBAA in your alias checks you may only _sink_
stores.  Not sure if your merged store placement ensures this.

I think this kind of transform is useful in early optimizations, not only
very late as you perform it.  Of course it should be really cheap there.

Note that options like -ftree-store-widening are discouraged
("tree" means nothing to our users and store widening isn't
what is done - it's store merging).  Simply name it -fstore-merging.
Also the file name should be gimple-ssa-store-merging.c

Looking forward to an algorithmically enhanced version.

Richard.



Thanks,
Kyrill

N.B. I'm going on vacation until August so I won't be able to respond to any
feedback until I get back.

[1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html

2016-07-15  Kyrylo Tkachov  

 PR middle-end/22141
 * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
 * common.opt (ftree-store-widening): New Optimization option.
 * opts.c (default_options_table): Add entry for
 OPT_ftree_store_widening.
 * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
 * passes.def: Insert pass_tree_store_widening.
 * tree-pass.h (make_pass_tree_store_widening): Declare extern
 prototype.
 * tree-ssa-store-widening.c: New file.
 * doc/invoke.texi (Optimization Options): Document
 -ftree-store-widening.

2016-07-15  Kyrylo Tkachov  
 Jakub Jelinek  

 PR middle-end/22141
 * gcc.c-torture/execute/pr22141-1.c: New test.
 * gcc.c-torture/execute/pr22141-2.c: Likewise.
 * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
 * gcc.dg/store_widening_1.c: New test.
 * gcc.dg/store_widening_2.c: Likewise.
 * gcc.dg/store_widening_3.c: Likewise.
 * gcc.dg/store_widening_4.c: Likewise.
 * gcc.target/i386/pr22141.c: Likewise.




Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-08-03 Thread Kyrill Tkachov


On 03/08/16 14:50, Richard Biener wrote:

On Wed, Aug 3, 2016 at 11:59 AM, Kyrill Tkachov
 wrote:

Hi Richard,

On 18/07/16 13:22, Richard Biener wrote:




+  /* Record the original statements so that we can keep track of
+statements emitted in this pass and not re-process new
+statements.  */
+  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
())
+   {
+ gimple *stmt = gsi_stmt (gsi);
+ if (!is_gimple_debug (stmt))
+   orig_stmts.add (stmt);
+ num_statements++;
+   }

please use gimple_set_visited () instead, that should be cheaper.


+  do
+   {
+ changes_made = false;
+ for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
())
+   {
...
+   }
+  while (changes_made);

looks pretty quadratic to me.  Instead of tracking things with
m_curr_base_expr
why not use a hash-map to track stores related to a base?


I've implemented this scheme but I'm having trouble making it work.
In particular I have a hash_map keyed on a 'tree' that is the base
object (as extracted by get_inner_reference) but I can't get the hash_map
to properly extract the already recorded stores to the same base.
For example for the simple code:
struct bar {
   int a;
   char b;
   char c;
   char d;
   char e;
   char f;
   char g;
};

void
foo1 (struct bar *p)
{
   p->b = 0;
   p->a = 0;
   p->c = 0;
   p->d = 0;
   p->e = 0;
}

As we can see, the stores are all to the same object and should
be recognised as such.

The base of the first store is recorded as:

and for the second store as 
where the dumps of the two mem_refs are identical except for that first
hex number (their address in memory?)
In my first version of the patch I compare these with operand_equal_p and that
detects that they are the same, but in the hash_map they are not detected
as equal. Is there some special hashing function I must specify?

If you just use hash_map  then it will hash on the pointer value.
I think you need to use tree_operand_hash.


Ah, thanks. That did the trick.

Kyrill


Richard.


Thanks,
Kyrill




Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-08-03 Thread Richard Biener
On Wed, Aug 3, 2016 at 11:59 AM, Kyrill Tkachov
 wrote:
> Hi Richard,
>
> On 18/07/16 13:22, Richard Biener wrote:
>
> 
>
>> +  /* Record the original statements so that we can keep track of
>> +statements emitted in this pass and not re-process new
>> +statements.  */
>> +  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> ())
>> +   {
>> + gimple *stmt = gsi_stmt (gsi);
>> + if (!is_gimple_debug (stmt))
>> +   orig_stmts.add (stmt);
>> + num_statements++;
>> +   }
>>
>> please use gimple_set_visited () instead, that should be cheaper.
>>
>>
>> +  do
>> +   {
>> + changes_made = false;
>> + for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next
>> ())
>> +   {
>> ...
>> +   }
>> +  while (changes_made);
>>
>> looks pretty quadratic to me.  Instead of tracking things with
>> m_curr_base_expr
>> why not use a hash-map to track stores related to a base?
>
>
> I've implemented this scheme but I'm having trouble making it work.
> In particular I have a hash_map keyed on a 'tree' that is the base
> object (as extracted by get_inner_reference) but I can't get the hash_map
> to properly extract the already recorded stores to the same base.
> For example for the simple code:
> struct bar {
>   int a;
>   char b;
>   char c;
>   char d;
>   char e;
>   char f;
>   char g;
> };
>
> void
> foo1 (struct bar *p)
> {
>   p->b = 0;
>   p->a = 0;
>   p->c = 0;
>   p->d = 0;
>   p->e = 0;
> }
>
> As we can see, the stores are all to the same object and should
> be recognised as such.
>
> The base of the first store is recorded as:
> 
> and for the second store as 
> where the dumps of the two mem_refs are identical except for that first
> hex number (their address in memory?)
> In my first version of the patch I compare these with operand_equal_p and that
> detects that they are the same, but in the hash_map they are not detected
> as equal. Is there some special hashing function I must specify?

If you just use hash_map  then it will hash on the pointer value.
I think you need to use tree_operand_hash.
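
As an illustration of that suggestion (the struct and helper names below are
hypothetical, not from the patch), tree_operand_hash from tree-hash-traits.h
hashes and compares keys with operand_equal_p, so two distinct MEM_REF trees
for the same base land in the same slot:

/* Hypothetical container keyed on the structural value of the base.  */
struct store_chain
{
  auto_vec<gimple *> stores;   /* constant stores seen for this base */
};

typedef hash_map<tree_operand_hash, store_chain *> base_map_t;

/* Record STMT, a constant store whose base object is BASE.  */
static void
record_store_for_base (base_map_t &map, tree base, gimple *stmt)
{
  store_chain **slot = map.get (base);
  store_chain *chain;
  if (slot)
    chain = *slot;
  else
    {
      chain = new store_chain ();
      map.put (base, chain);
    }
  chain->stores.safe_push (stmt);
}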

Richard.

> Thanks,
> Kyrill


Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-08-03 Thread Kyrill Tkachov

Hi Richard,

On 18/07/16 13:22, Richard Biener wrote:




+  /* Record the original statements so that we can keep track of
+statements emitted in this pass and not re-process new
+statements.  */
+  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next ())
+   {
+ gimple *stmt = gsi_stmt (gsi);
+ if (!is_gimple_debug (stmt))
+   orig_stmts.add (stmt);
+ num_statements++;
+   }

please use gimple_set_visited () instead, that should be cheaper.


+  do
+   {
+ changes_made = false;
+ for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next ())
+   {
...
+   }
+  while (changes_made);

looks pretty quadratic to me.  Instead of tracking things with m_curr_base_expr
why not use a hash-map to track stores related to a base?


I've implemented this scheme but I'm having trouble making it work.
In particular I have a hash_map keyed on a 'tree' that is the base
object (as extracted by get_inner_reference) but I can't get the hash_map
to properly extract the already recorded stores to the same base.
For example for the simple code:
struct bar {
  int a;
  char b;
  char c;
  char d;
  char e;
  char f;
  char g;
};

void
foo1 (struct bar *p)
{
  p->b = 0;
  p->a = 0;
  p->c = 0;
  p->d = 0;
  p->e = 0;
}

As we can see, the stores are all to the same object and should
be recognised as such.

The base of the first store is recorded as:

and for the second store as 
where the dumps of the two mem_refs are identical except for that first
hex number (their address in memory?)
In my first version of the patch I compare these with operand_equal_p and that
detects that they are the same, but in the hash_map they are not detected
as equal. Is there some special hashing function I must specify?

Thanks,
Kyrill


Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-08-01 Thread Jeff Law

On 08/01/2016 03:15 AM, Kyrill Tkachov wrote:


On 18/07/16 13:22, Richard Biener wrote:

On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
 wrote:

Hi all,

This is a GIMPLE pass to implement PR middle-end/22141, that is, merge narrow
stores of constants into fewer wider stores.  A 2009 patch from Jakub [1]
contains many testcases but a simple motivating

[ ... ]
Given your work on 22141, you might want to pick up my work on 33562. 
They're different issues, but they touch on similar concepts.  IIRC the 
major thing left on my plate for 33562 was rewriting existing stores 
into smaller pieces when parts of the original store are found to be dead.


jeff



Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-08-01 Thread Kyrill Tkachov


On 18/07/16 13:22, Richard Biener wrote:

On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
 wrote:

Hi all,

This is a GIMPLE pass to implement PR middle-end/22141, that is, merge narrow
stores of constants into fewer wider stores.  A 2009 patch from Jakub [1]
contains many testcases but a simple motivating
case can be:

struct bar {
   int a;
   char b;
   char c;
   char d;
   char e;
}; // packed 64-bit structure

void bar (struct bar *);

void
foo (struct bar *p)
{
   p->b = 0;
   p->a = 0;
   p->c = 0;
   p->d = 1;
   p->e = 0;
}

Currently on aarch64 this will generate:
foo:
 mov w1, 1
 str wzr, [x0]
 strb wzr, [x0, 4]
 strb wzr, [x0, 5]
 strb w1, [x0, 6]
 strb wzr, [x0, 7]
 ret

With this patch this can be improved into a single unaligned store:
foo:
 mov x1, 0x1
 str x1, [x0]
 ret

or, if compiled with -mstrict-align:
foo:
 mov w1, 0x1
 stp wzr, w1, [x0]
 ret

The pass is a tree-ssa pass that runs fairly late in the pipeline, after
pass_optimize_widening_mul.
I explain the approach taken in the comments in the new
tree-ssa-store-widening.c file but essentially
it has 3 phases applied to each basic block:

1) Scan through the statements recording assignments of constants to
destinations like ARRAY_REF,
COMPONENT_REF, MEM_REF which are determined to write to an ultimate common
destination. get_inner_reference
is used to decompose these destinations. Continue recording these until we
encounter a statement that may
interfere with the stores we've been recording (load or store that may
alias, volatile access etc).
These assignments of interest are recorded as store_immediate_info objects
in the m_store_info vector.

2) Analyse the stores recorded in phase one (they all write to a destination
offset from a common base)
and merge them into wider assignments up to BITS_PER_WORD bits wide. These
widened assignments are represented
as merged_store_group objects and they are recorded in the
m_merged_store_groups vector. This is the
coalesce_immediate_stores function. It sorts the stores by the bitposition
they write to and iterates through
them, merging consecutive stores (it fails the transformation on overlapping
stores, I don't think that case
appears often enough to warrant extra logic) up to BITS_PER_WORD-wide
accesses.

3) Go through the merged stores recorded in m_merged_store_groups and output
each widened store. Widened stores
that are not of a bitsize that is a power of two (for example 48 bits wide)
are output as multiple stores of decreasing
power-of-two width. So, for a widened store 48-bits wide this phase would
emit a 32-bit store followed by a
16-bit store. The new sequence is only emitted if it contains fewer
statements than the original sequence that it
will replace.  This phase also avoids outputting unaligned stores for
STRICT_ALIGNMENT targets or targets where
SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want
to avoid generation of unaligned
stores even when it is legal I've added the new param
PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
to disallow unaligned store generation.  Its default setting is to allow
them (assuming that STRICT_ALIGNMENT
and SLOW_UNALIGNED_ACCESS allows it).

This is my first GIMPLE-level pass so please do point out places where I'm
not using the interfaces correctly.
This patch handles bitfields as well, but only if they are a multiple of
BITS_PER_UNIT. It should be easily
extensible to handle other bitfields as well, but I'm not entirely sure of
the rules for laying out such bitfields
and in particular the byteswap logic that needs to be applied for big-endian
targets. If someone can shed some light
on how they should be handled I'll be happy to try it out, but I believe this
patch is an improvement as it is.

This has been bootstrapped and tested on aarch64-none-linux-gnu,
arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
I've also tested it on the big-endian targets: armeb-none-eabi,
aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.

I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no
regressions, the overall score improved a bit
(about 0.1%). The interesting improvements were:
458.sjeng (+0.8%)
483.xalancbmk (+1.1%)
416.gamess    (+1.0%)
454.calculix  (+1.1%)

An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it
transformed a long sequence of constant
byte stores into a much shorter sequence of word-size stores (from ~550
instructions to ~190).

On x86_64 SPECINT there was no change in the overall score. The code size at
-Ofast is consistently smaller
with this patch but the performance differences on sub-benchmarks are in the
noise.

I've included the testcases from Jakub's patch [1] and added a few of my
own.

Is this direction acceptable for the problem this is trying to solve?

Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-07-18 Thread Richard Biener
On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
 wrote:
> Hi all,
>
> This is a GIMPLE pass to implement PR middle-end/22141, that is, merge narrow
> stores of constants into fewer wider stores.  A 2009 patch from Jakub [1]
> contains many testcases but a simple motivating
> case can be:
>
> struct bar {
>   int a;
>   char b;
>   char c;
>   char d;
>   char e;
> }; // packed 64-bit structure
>
> void bar (struct bar *);
>
> void
> foo (struct bar *p)
> {
>   p->b = 0;
>   p->a = 0;
>   p->c = 0;
>   p->d = 1;
>   p->e = 0;
> }
>
> Currently on aarch64 this will generate:
> foo:
> mov w1, 1
> str wzr, [x0]
> strb wzr, [x0, 4]
> strb wzr, [x0, 5]
> strb w1, [x0, 6]
> strb wzr, [x0, 7]
> ret
>
> With this patch this can be improved into a single unaligned store:
> foo:
> mov x1, 0x1
> str x1, [x0]
> ret
>
> or, if compiled with -mstrict-align:
> foo:
> mov w1, 0x1
> stp wzr, w1, [x0]
> ret
>
> The pass is a tree-ssa pass that runs fairly late in the pipeline, after
> pass_optimize_widening_mul.
> I explain the approach taken in the comments in the new
> tree-ssa-store-widening.c file but essentially
> it has 3 phases applied to each basic block:
>
> 1) Scan through the statements recording assignments of constants to
> destinations like ARRAY_REF,
> COMPONENT_REF, MEM_REF which are determined to write to an ultimate common
> destination. get_inner_reference
> is used to decompose these destinations. Continue recording these until we
> encounter a statement that may
> interfere with the stores we've been recording (load or store that may
> alias, volatile access etc).
> These assignments of interest are recorded as store_immediate_info objects
> in the m_store_info vector.
>
> 2) Analyse the stores recorded in phase one (they all write to a destination
> offset from a common base)
> and merge them into wider assignments up to BITS_PER_WORD bits wide. These
> widened assignments are represented
> as merged_store_group objects and they are recorded in the
> m_merged_store_groups vector. This is the
> coalesce_immediate_stores function. It sorts the stores by the bitposition
> they write to and iterates through
> them, merging consecutive stores (it fails the transformation on overlapping
> stores, I don't think that case
> appears often enough to warrant extra logic) up to BITS_PER_WORD-wide
> accesses.
>
> 3) Go through the merged stores recorded in m_merged_store_groups and output
> each widened store. Widened stores
> that are not of a bitsize that is a power of two (for example 48 bits wide)
> are output as multiple stores of decreasing
> power-of-two width. So, for a widened store 48-bits wide this phase would
> emit a 32-bit store followed by a
> 16-bit store. The new sequence is only emitted if it contains fewer
> statements than the original sequence that it
> will replace.  This phase also avoids outputting unaligned stores for
> STRICT_ALIGNMENT targets or targets where
> SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want
> to avoid generation of unaligned
> stores even when it is legal I've added the new param
> PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
> to disallow unaligned store generation.  Its default setting is to allow
> them (assuming that STRICT_ALIGNMENT
> and SLOW_UNALIGNED_ACCESS allows it).
>
> This is my first GIMPLE-level pass so please do point out places where I'm
> not using the interfaces correctly.
> This patch handles bitfields as well, but only if they are a multiple of
> BITS_PER_UNIT. It should be easily
> extensible to handle other bitfields as well, but I'm not entirely sure of
> the rules for laying out such bitfields
> and in particular the byteswap logic that needs to be applied for big-endian
> targets. If someone can shed some light
> on how they should be handled I'll be happy to try it out, but I believe this
> patch is an improvement as it is.
>
> This has been bootstrapped and tested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
> I've also tested it on the big-endian targets: armeb-none-eabi,
> aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.
>
> I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no
> regressions, the overall score improved a bit
> (about 0.1%). The interesting improvements were:
> 458.sjeng (+0.8%)
> 483.xalancbmk (+1.1%)
> 416.gamess    (+1.0%)
> 454.calculix  (+1.1%)
>
> An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it
> transformed a long sequence of constant
> byte stores into a much shorter sequence of word-size stores (from ~550
> instructions to ~190).
>
> On x86_64 SPECINT there was no change in the overall score. The code size at
> -Ofast is consistently smaller
> with this patch but the performance differences 

[PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

2016-07-15 Thread Kyrill Tkachov

Hi all,

This is a GIMPLE pass to implement PR middle-end/22141, that is, merge narrow
stores of constants into fewer wider stores.  A 2009 patch from Jakub [1]
contains many testcases but a simple motivating
case can be:

struct bar {
  int a;
  char b;
  char c;
  char d;
  char e;
}; // packed 64-bit structure

void bar (struct bar *);

void
foo (struct bar *p)
{
  p->b = 0;
  p->a = 0;
  p->c = 0;
  p->d = 1;
  p->e = 0;
}

Currently on aarch64 this will generate:
foo:
mov w1, 1
str wzr, [x0]
strb wzr, [x0, 4]
strb wzr, [x0, 5]
strb w1, [x0, 6]
strb wzr, [x0, 7]
ret

With this patch this can be improved into a single unaligned store:
foo:
mov x1, 0x1
str x1, [x0]
ret

or, if compiled with -mstrict-align:
foo:
mov w1, 0x1
stp wzr, w1, [x0]
ret

The pass is a tree-ssa pass that runs fairly late in the pipeline, after 
pass_optimize_widening_mul.
I explain the approach taken in the comments in the new 
tree-ssa-store-widening.c file but essentially
it has 3 phases applied to each basic block:

1) Scan through the statements recording assignments of constants to 
destinations like ARRAY_REF,
COMPONENT_REF, MEM_REF which are determined to write to an ultimate common 
destination. get_inner_reference
is used to decompose these destinations. Continue recording these until we 
encounter a statement that may
interfere with the stores we've been recording (load or store that may alias, 
volatile access etc).
These assignments of interest are recorded as store_immediate_info objects in 
the m_store_info vector.

2) Analyse the stores recorded in phase one (they all write to a destination 
offset from a common base)
and merge them into wider assignments up to BITS_PER_WORD bits wide. These 
widened assignments are represented
as merged_store_group objects and they are recorded in the 
m_merged_store_groups vector. This is the
coalesce_immediate_stores function. It sorts the stores by the bitposition they 
write to and iterates through
them, merging consecutive stores (it fails the transformation on overlapping 
stores, I don't think that case
appears often enough to warrant extra logic) up to BITS_PER_WORD-wide accesses.
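
As an illustration of the phase-2 grouping idea only (this is not the patch's
coalesce_immediate_stores; simple_store is a made-up stand-in for
store_immediate_info), grouping sorted contiguous stores within one word could
be sketched as:

struct simple_store { unsigned bitpos; unsigned bitsize; };

static unsigned
count_merged_groups (const struct simple_store *stores, unsigned n,
                     unsigned bits_per_word)
{
  if (n == 0)
    return 0;
  unsigned groups = 1;
  unsigned start = stores[0].bitpos;
  unsigned end = start + stores[0].bitsize;
  for (unsigned i = 1; i < n; i++)
    {
      if (stores[i].bitpos == end
          && end + stores[i].bitsize - start <= bits_per_word)
        /* Contiguous and still within one word: extend the group.  */
        end += stores[i].bitsize;
      else
        {
          /* Not contiguous or the group would exceed a word: start a
             new group (the real pass has extra rules, e.g. it gives up
             on overlapping stores).  */
          groups++;
          start = stores[i].bitpos;
          end = start + stores[i].bitsize;
        }
    }
  return groups;
}

/* For the foo () example above the sorted stores are
   (0,32) (32,8) (40,8) (48,8) (56,8): all contiguous and 64 bits in
   total, so on a 64-bit target this reports a single group.  */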

3) Go through the merged stores recorded in m_merged_store_groups and output 
each widened store. Widened stores
that are not of a bitsize that is a power of two (for example 48 bits wide) are 
output as multiple stores of decreasing
power-of-two width. So, for a widened store 48-bits wide this phase would
emit a 32-bit store followed by a
16-bit store. The new sequence is only emitted if it contains fewer statements 
than the original sequence that it
will replace.  This phase also avoids outputting unaligned stores for 
STRICT_ALIGNMENT targets or targets where
SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want to 
avoid generation of unaligned
stores even when it is legal I've added the new param 
PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
to disallow unaligned store generation.  Its default setting is to allow them 
(assuming that STRICT_ALIGNMENT
and SLOW_UNALIGNED_ACCESS allows it).
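
To illustrate just the power-of-two splitting step described above, here is a
standalone toy (not the pass's output code) showing how a 48-bit merged region
decomposes into a 32-bit and a 16-bit store:

#include <stdio.h>

static void
split_into_pow2_stores (unsigned width_bits)
{
  unsigned pos = 0;
  while (width_bits > 0)
    {
      /* Largest power of two not exceeding the remaining width,
         capped at a 64-bit word.  */
      unsigned chunk = 1;
      while (chunk * 2 <= width_bits && chunk * 2 <= 64)
        chunk *= 2;
      printf ("emit %u-bit store at bit offset %u\n", chunk, pos);
      pos += chunk;
      width_bits -= chunk;
    }
}

int
main (void)
{
  split_into_pow2_stores (48);  /* prints 32 then 16 */
  return 0;
}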

This is my first GIMPLE-level pass so please do point out places where I'm not 
using the interfaces correctly.
This patch handles bitfields as well, but only if they are a multiple of 
BITS_PER_UNIT. It should be easily
extensible to handle other bitfields as well, but I'm not entirely sure of the 
rules for laying out such bitfields
and in particular the byteswap logic that needs to be applied for big-endian 
targets. If someone can shed some light
on how they should be handled I'll be happy to try it out, but I believe this 
patch is an improvement as it is.

This has been bootstrapped and tested on aarch64-none-linux-gnu, 
arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
I've also tested it on the big-endian targets: armeb-none-eabi, 
aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.

I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no 
regressions, the overall score improved a bit
(about 0.1%). The interesting improvements were:
458.sjeng (+0.8%)
483.xalancbmk (+1.1%)
416.gamess    (+1.0%)
454.calculix  (+1.1%)

An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it 
transformed a long sequence of constant
byte stores into a much shorter sequence of word-size stores (from ~550 
instructions to ~190).

On x86_64 SPECINT there was no change in the overall score. The code size at 
-Ofast is consistently smaller
with this patch but the performance differences on sub-benchmarks are in the 
noise.

I've included the testcases from Jakub's patch [1] and added a few of my own.

Is this direction acceptable for the problem this is trying to solve?

Thanks,
Kyrill

N.B. I'm going on vacation until August so I won't be able to respond to any 
feedback until I get back.