On 21/10/16 19:37, Richard Biener wrote:
On October 21, 2016 3:29:15 PM GMT+02:00, Kyrill Tkachov 
<kyrylo.tkac...@foss.arm.com> wrote:
Hi Richard,

On 21/10/16 13:37, Richard Biener wrote:
On Tue, 18 Oct 2016, Kyrill Tkachov wrote:

Hi Richard,

This patch is a merge of [1] and [2] and implements the manual
merging of
bitfields
as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN
too.
It caused me a lot of headeache because the bit offset is counted
>from the
most significant bit
in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN
looks
irrelevant for store merging
anyway as it's just used to described RTL extract operations).
I've included ASCII diagrams of the steps in the merging algorithm.
Heh, thanks.

Bootstrapped and tested on arm-none-linux-gnueabihf,
aarch64-none-linux-gnu,
x86_64-unknown-linux-gnu.
Also tested on aarch64_be-none-elf.

How does this version look now?
Mostly good.  For

+bool
+pass_store_merging::terminate_all_aliasing_chains (tree dest, tree
base,
+                                                  gimple *stmt)
+{
...
+  /* Check if the assignment destination (BASE) is part of a store
chain.
+     This is to catch non-constant stores to destinations that may
be
part
+     of a chain.  */
+  if (base)
+    {
+      chain_info = m_stores.get (base);
+      if (chain_info)
+       {
+         struct store_immediate_info *info;
+         unsigned int i;
+         FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
+           {
+             if (refs_may_alias_p (info->dest, dest))
+               {

I suppose the chain is not yet sorted in any way?

At least for 'dest' which do not have a known constant offset we
could do

     if (base)
       terminate_and_release_chain (base);
Do you mean when get_inner_reference returns non-NULL for POFFSET?
Yes.

Or do you think we should try to look into dest in this function?

to speed things up?  IIRC we do not terminate chains early in
this phase when we have enough stores to form a group, so
writing a testcase that triggers quadraticness would be as simple
as having

char a[1000000];

void foo ()
{
   a[0] = 1;
   a[1] = 2;
   ....
   a[9999999] = 3;
}

?

so I think you probably want to limit the number of stores you
ever put onto a chain and if you reach that limit, terminate
and release it?  Like just choose 16 or 64?  (and experiment
with the above kind of testcases)
I was initially thinking of imposing such a limit as well but
later I thought we'd want to extend the output code to be able to emit
a memcpy (or memset) call for large regions, so detecting the largest
possible
regions would be needed.
But then we need a better data structure here to avoid the quadraticness.

For the testcase:

char a[1000000];

void foo ()
{
  a[0] = 1;
  a[1] = 2;
  ....
  a[9999999] = 3;
}

and similar we don't have quadratic behaviour because constant stores to a 
constant offset
from base (i.e. stores valid for merging) get automatically inserted into the 
chain without
doing the alias checking (we keep track of their program order so that 
overlapping stores
are merged properly). Since pushing into a vec is not a linear operation we're 
fine.

Once I teach it to quickly terminate chains when encountering stores to base 
with a variable
offset (like you suggest above) there is only one case of quadraticness:
We walk the stores in the chain (causing quadratic behaviour) when among
the valid mergeable stores we have interspersed stores to a constant offset 
from base but that
do not store a constant value e.g. a[5] = x;
So the input needs to be something like:

char a[16000];

void
foo (char x)
{
  a[0] = 1; a[8000 + 0] = x;
  a[1] = 1; a[8000 + 1] = x;
  a[2] = 1; a[8000 + 2] = x;
  a[3] = 1; a[8000 + 3] = x;
  a[4] = 1; a[8000 + 4] = x;
  ...
  a[7999] = 1; a[8000 + 7999] = x;
}

I've reproduced a slowdown (with store merging taking a disproportional
time in -ftime-report) with this case and I've verified that limiting the
maximum number of statements in the chain fixes it.



  But that is not implemented yet (though I have
experimented
with it) so I can add a limit here. Should I just hardcode a limit or
should I make it
into a --param (MAX_STMTS_IN_STORE_MERGING_CHAIN or something)?
A param is preferred.

Thanks. I'll introduce the limit but, as mentioned above, the simple case
of inserting multiple stores is not quadratic.

Kyrill


+                 bit_off = byte_off << LOG2_BITS_PER_UNIT;
+                 if (!wi::neg_p (bit_off) && wi::fits_shwi_p
(bit_off))
+                   {
+                     bitpos += bit_off.to_shwi ();
+

I think you want bit_off += bitpos before the fits_shwi check
otherwise this add may still overflow.

+                     base_addr = copy_node (base_addr);
+                     TREE_OPERAND (base_addr, 1)
+                       = build_zero_cst (TREE_TYPE (TREE_OPERAND (
+                                                      base_addr,
1)));
I'd prefer

        base_addr = build2 (MEM_REF, ...);

here.
Thanks for the feedback,
Kyrill

Thanks,
Richard.

Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00573.html
[2] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00572.html

2016-10-18  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

      PR middle-end/22141
      * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
      * common.opt (fstore-merging): New Optimization option.
      * opts.c (default_options_table): Add entry for
      OPT_ftree_store_merging.
      * fold-const.h (can_native_encode_type_p): Declare prototype.
      * fold-const.c (can_native_encode_type_p): Define.
      * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
      * passes.def: Insert pass_tree_store_merging.
      * tree-pass.h (make_pass_store_merging): Declare extern
      prototype.
      * gimple-ssa-store-merging.c: New file.
      * doc/invoke.texi (Optimization Options): Document
      -fstore-merging.

2016-10-18  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
              Jakub Jelinek  <ja...@redhat.com>
              Andrew Pinski  <pins...@gmail.com>

      PR middle-end/22141
      PR rtl-optimization/23684
      * gcc.c-torture/execute/pr22141-1.c: New test.
      * gcc.c-torture/execute/pr22141-2.c: Likewise.
      * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
      * gcc.target/aarch64/ldp_stp_4.c: Likewise.
      * gcc.dg/store_merging_1.c: New test.
      * gcc.dg/store_merging_2.c: Likewise.
      * gcc.dg/store_merging_3.c: Likewise.
      * gcc.dg/store_merging_4.c: Likewise.
      * gcc.dg/store_merging_5.c: Likewise.
      * gcc.dg/store_merging_6.c: Likewise.
      * gcc.dg/store_merging_7.c: Likewise.
      * gcc.target/i386/pr22141.c: Likewise.
      * gcc.target/i386/pr34012.c: Add -fno-store-merging to
dg-options.
      * g++.dg/init/new17.C: Likewise.



Reply via email to