Re: [RFC, PR 80689] Copy small aggregates element-wise

Martin Jambor Fri, 03 Nov 2017 09:39:07 -0700

Hi,

On Thu, Oct 26, 2017 at 02:43:02PM +0200, Richard Biener wrote:
> On Thu, Oct 26, 2017 at 2:18 PM, Martin Jambor <mjam...@suse.cz> wrote:
> >
> > Nevertheless, I still intend to experiment with the limit, I sent out
> > this RFC exactly so that I don't spend a lot of time benchmarking
> > something that is eventually not deemed acceptable on principle.
> 
> I think the limit should be on the number of generated copies and not
> the overall size of the structure...  If the struct were composed of
> 32 individual chars we wouldn't want to emit 32 loads and 32 stores...


I have added another parameter to also limit the number of generated
element copies.  I have kept the size limit so that we don't even
attempt to count them for large structures.
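The interplay of the two limits can be sketched in plain C.  This is only an illustration and not code from the patch: struct toy_type and the toy_* functions are hypothetical stand-ins for GCC's tree types, and the two limits play the role of the max-size-for-elementwise-copy and max-insns-for-elementwise-copy parameters described later in this mail.  The size check comes first so that large structures are never walked, and the count bails out as soon as the limit is exceeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy type descriptor; GCC itself walks tree nodes instead.  */
enum toy_kind { TOY_SCALAR, TOY_RECORD, TOY_ARRAY };

struct toy_type
{
  enum toy_kind kind;
  size_t size;                          /* Total size in bytes.  */
  size_t n_children;                    /* Field count or array length.  */
  const struct toy_type *const *fields; /* TOY_RECORD: one entry per field.  */
  const struct toy_type *elem;          /* TOY_ARRAY: the element type.  */
};

/* Add the number of scalar copies needed for T to *COUNT.  Return false as
   soon as *COUNT exceeds MAX_COPIES, so the walk stops early.  */
static bool
toy_count_copies (const struct toy_type *t, size_t max_copies, size_t *count)
{
  assert (t != NULL && count != NULL);
  switch (t->kind)
    {
    case TOY_SCALAR:
      return ++*count <= max_copies;
    case TOY_RECORD:
      for (size_t i = 0; i < t->n_children; i++)
        if (!toy_count_copies (t->fields[i], max_copies, count))
          return false;
      return true;
    case TOY_ARRAY:
      for (size_t i = 0; i < t->n_children; i++)
        if (!toy_count_copies (t->elem, max_copies, count))
          return false;
      return true;
    }
  return false;
}

/* Cheap size check first, then the element count.  */
static bool
toy_use_elementwise_copy (const struct toy_type *t, size_t max_size,
                          size_t max_copies)
{
  size_t count = 0;
  if (t->size > max_size)
    return false;
  return toy_count_copies (t, max_copies, &count);
}
```

With limits of 35 bytes and 6 copies, a structure of four 8-byte fields passes both checks, while an array of 32 chars passes the size check but is rejected by the count, which is the case raised above.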

> Given that load bandwidth is usually higher than store bandwidth it
> might make sense to do the store combining in our copying sequence,
> like for the 8 byte entry case use sth like
> 
>   movq 0(%eax), %xmm0
>   movhps 8(%eax), %xmm0 // or vpinsert
>   mov[au]ps %xmm0, 0(%ebx)

I would be concerned about the cost of GPR->XMM moves when the value
being stored is in a GPR, especially with generic tuning which (with
-O2) is the main thing I am targeting here.  Wouldn't we actually pass
it through stack with all the associated penalties?

Also, while such store combining might work for ImageMagick, if a
programmer did:

region1->x = x1;
region2->x = x2;
region1->y = 0;
region2->y = 20;
...
SetPixelCacheNexusPixels(cache_info, ReadMode, region1, ...)

The transformation would not work unless it could prove region1 and
region2 are not the same thing.
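A small self-contained C sketch of the aliasing hazard follows.  struct toy_region and both functions are hypothetical, and the placement of the combined store (at the position of the last store into region1) is just one plausible choice made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the region structure; field names are assumed.  */
struct toy_region { long x, y; };

/* Stores in the order the programmer wrote them.  */
static void
toy_fill_in_order (struct toy_region *region1, struct toy_region *region2,
                   long x1, long x2)
{
  assert (region1 != NULL && region2 != NULL);
  region1->x = x1;
  region2->x = x2;
  region1->y = 0;
  region2->y = 20;
}

/* The same stores, with both stores into region1 sunk to the position of
   the last one so they could be emitted as one wide store.  This is only
   equivalent when region1 and region2 do not alias.  */
static void
toy_fill_combined (struct toy_region *region1, struct toy_region *region2,
                   long x1, long x2)
{
  assert (region1 != NULL && region2 != NULL);
  region2->x = x2;
  region1->x = x1;   /* Sunk below the store to region2->x.  */
  region1->y = 0;
  region2->y = 20;
}
```

For distinct objects both functions produce the same contents.  When the same pointer is passed twice with x1 = 1 and x2 = 2, the in-order version leaves x == 2 whereas the combined version leaves x == 1.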

> As said a general concern was you not copying padding.  If you
> put this into an even more common place you surely will break
> stuff, no?

I don't understand, what even more common place do you mean?

I have been testing the patch also on a bunch of other architectures
and those have tests in their testsuite that check that padding is
copied, for example some tests in gcc.target/aarch64/aapcs64/ check
whether a structure passed to a function is binary the same as the
original, and the tests fail because of padding.  That is the only
"breakage" I know about, but I believe that the assumption that padding
must always be copied is wrong (if it is not, then we need to make SRA
quite a bit more conservative).
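To make the padding question concrete, here is a hedged C sketch.  struct toy_padded and the toy_* helpers are hypothetical and unrelated to the aarch64 tests mentioned above.  A field-wise copy is only required to transfer the named members; the C standard leaves padding bytes with unspecified values after a store to a member, so a byte-wise comparison of source and destination may legitimately fail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure; on typical ABIs there is padding after C.  */
struct toy_padded
{
  char c;
  long l;
};

/* Copy only the named fields, the way an element-wise copy would.  */
static void
toy_copy_fieldwise (struct toy_padded *dst, const struct toy_padded *src)
{
  assert (dst != NULL && src != NULL);
  dst->c = src->c;
  dst->l = src->l;
}

/* Count the bytes of the destination that still hold FILL (nonzero) after
   a field-wise copy from an all-zero source, i.e. the bytes the copy left
   alone.  */
static size_t
toy_untouched_bytes (unsigned char fill)
{
  struct toy_padded src, dst;
  unsigned char bytes[sizeof dst];
  size_t n = 0;

  memset (&src, 0, sizeof src);
  memset (&dst, fill, sizeof dst);
  toy_copy_fieldwise (&dst, &src);
  memcpy (bytes, &dst, sizeof dst);
  for (size_t i = 0; i < sizeof bytes; i++)
    if (bytes[i] == fill)
      n++;
  return n;
}
```

On common ABIs toy_untouched_bytes (0xAA) will usually report the padding bytes as untouched, in which case a memcmp of source and destination would report a difference even though every field was copied correctly.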


On Thu, Oct 26, 2017 at 05:09:42PM +0200, Richard Biener wrote:
> Also if we do the stores in smaller chunks we are more
> likely hitting the same store-to-load-forwarding issue
> elsewhere.  Like in case the destination is memcpy'ed
> away.
> 
> So the proposed change isn't necessarily a win without
> a possible similar regression that it tries to fix.
>

With some encouragement from Honza, I have done some benchmarking anyway
and I did not see anything of that kind.

> Whole-program analysis of accesses might allow
> marking affected objects.

Attempting to save access patterns before IPA and then tracking them
and keeping them in sync across inlining and all late gimple passes
seems like a nightmarish task.  If this approach is indeed rejected I
might attempt to do the store combining, but a WPA analysis seems just
too complex.

Anyway, here are the numbers.  They were taken on two different
Zen-based machines.  I am also in the process of measuring at least
something on a Haswell machine, but I started later and the machine is
quite a bit slower, so I will not have the numbers until next week (and
they will not cover everything measured here in any case).  I found out
I do not have access to any more modern .*Lake Intel CPU.

trunk is pristine trunk revision 254205.  All benchmarks were run
three times and the median was chosen.

s or strict means the patch with the strictest possible settings to
speed-up ImageMagick, i.e. --param max-size-for-elementwise-copy=32
--param max-insns-for-elementwise-copy=4.  Also run three times.

x1 is patched trunk with the parameters having the default values I was
going to propose, i.e. --param max-size-for-elementwise-copy=35
--param max-insns-for-elementwise-copy=6.  Also run three times.

I then increased the parameters, in search of further missed
opportunities and to see what will start to regress and how soon.
x2 is roughly twice that, --param max-size-for-elementwise-copy=67
--param max-insns-for-elementwise-copy=12.  Run twice, outliers
manually checked.

x4 is roughly four times x1, namely --param max-size-for-elementwise-copy=143
--param max-insns-for-elementwise-copy=24.  Run only once.

The times below are of course "non-reportable," for a whole bunch of
reasons.
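For reference, the % columns in the tables below appear to be the plain relative change against trunk.  A minimal C sketch of that arithmetic and of picking the median of three runs follows; the toy_* helper names are hypothetical, and the example values are taken from the 429.mcf row of the first table.

```c
#include <assert.h>

/* Median of three run times.  */
static double
toy_median3 (double a, double b, double c)
{
  if ((a <= b && b <= c) || (c <= b && b <= a))
    return b;
  if ((b <= a && a <= c) || (c <= a && a <= b))
    return a;
  return c;
}

/* Relative change of PATCHED against TRUNK in percent.  Negative means
   faster (run-time) or smaller (text size) than trunk.  */
static double
toy_percent_change (double trunk, double patched)
{
  assert (trunk > 0);
  return (patched - trunk) / trunk * 100.0;
}
```

For example, toy_percent_change (224, 218) is about -2.68, which matches the strict column of 429.mcf at -O2.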


Zen SPECINT 2006  -O2 generic tuning
====================================

 Run-time
 --------
 
| Benchmark      | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
|----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
| 400.perlbench  |   237 | 236 | -0.42 | 236 | -0.42 | 238 | +0.42 | 237 | +0.00 |
| 401.bzip2      |   341 | 342 | +0.29 | 341 | +0.00 | 341 | +0.00 | 341 | +0.00 |
| 403.gcc        |   217 | 217 | +0.00 | 217 | +0.00 | 216 | -0.46 | 217 | +0.00 |
| 429.mcf        |   224 | 218 | -2.68 | 223 | -0.45 | 221 | -1.34 | 226 | +0.89 |
| 445.gobmk      |   361 | 361 | +0.00 | 361 | +0.00 | 360 | -0.28 | 363 | +0.55 |
| 456.hmmer      |   296 | 296 | +0.00 | 296 | +0.00 | 297 | +0.34 | 296 | +0.00 |
| 458.sjeng      |   453 | 452 | -0.22 | 454 | +0.22 | 454 | +0.22 | 460 | +1.55 |
| 462.libquantum |   289 | 289 | +0.00 | 291 | +0.69 | 289 | +0.00 | 291 | +0.69 |
| 464.h264ref    |   391 | 391 | +0.00 | 385 | -1.53 | 385 | -1.53 | 385 | -1.53 |
| 471.omnetpp    |   269 | 255 | -5.20 | 250 | -7.06 | 247 | -8.18 | 268 | -0.37 |
| 473.astar      |   320 | 321 | +0.31 | 317 | -0.94 | 320 | +0.00 | 320 | +0.00 |
| 483.xalancbmk  |   187 | 188 | +0.53 | 188 | +0.53 | 187 | +0.00 | 187 | +0.00 |

Although the omnetpp result looks like a sizeable improvement, I should
warn that this is one of the few slightly jumpy benchmarks.  However, I
re-ran it a few more times and it seems to be jumping around a lower
value when compiled with the patched compiler.  It might not be the
5-8% though.

 Text size
 ---------

| Benchmark      |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
|----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
| 400.perlbench  |  875874 |  875954 | +0.01 |  875954 | +0.01 |  876018 | +0.02 |  876146 | +0.03 |
| 401.bzip2      |   44754 |   44754 | +0.00 |   44754 | +0.00 |   44754 | +0.00 |   44754 | +0.00 |
| 403.gcc        | 2294466 | 2294930 | +0.02 | 2296098 | +0.07 | 2296306 | +0.08 | 2296466 | +0.09 |
| 429.mcf        |    8226 |    8226 | +0.00 |    8226 | +0.00 |    8258 | +0.39 |    8258 | +0.39 |
| 445.gobmk      |  579778 |  579778 | +0.00 |  579826 | +0.01 |  579826 | +0.01 |  580402 | +0.11 |
| 456.hmmer      |  221058 |  221058 | +0.00 |  221058 | +0.00 |  221058 | +0.00 |  221058 | +0.00 |
| 458.sjeng      |   93362 |   93362 | +0.00 |   94882 | +1.63 |   94882 | +1.63 |   96066 | +2.90 |
| 462.libquantum |   28314 |   28314 | +0.00 |   28362 | +0.17 |   28362 | +0.17 |   28362 | +0.17 |
| 464.h264ref    |  393874 |  393874 | +0.00 |  393922 | +0.01 |  393922 | +0.01 |  394226 | +0.09 |
| 471.omnetpp    |  430306 |  430306 | +0.00 |  430418 | +0.03 |  430418 | +0.03 |  430418 | +0.03 |
| 473.astar      |   29362 |   29538 | +0.60 |   29538 | +0.60 |   29554 | +0.65 |   29554 | +0.65 |
| 483.xalancbmk  | 2361298 | 2361506 | +0.01 | 2361506 | +0.01 | 2361506 | +0.01 | 2361506 | +0.01 |



Zen SPECINT 2006  -Ofast native tuning
======================================

 Run-time
 --------

| Benchmark      | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
|----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
| 400.perlbench  |   240 | 239 | -0.42 | 239 | -0.42 | 241 | +0.42 | 238 | -0.83 |
| 401.bzip2      |   341 | 341 | +0.00 | 341 | +0.00 | 341 | +0.00 | 340 | -0.29 |
| 403.gcc        |   210 | 208 | -0.95 | 207 | -1.43 | 209 | -0.48 | 208 | -0.95 |
| 429.mcf        |   225 | 225 | +0.00 | 225 | +0.00 | 228 | +1.33 | 226 | +0.44 |
| 445.gobmk      |   352 | 352 | +0.00 | 352 | +0.00 | 351 | -0.28 | 352 | +0.00 |
| 456.hmmer      |   131 | 131 | +0.00 | 131 | +0.00 | 131 | +0.00 | 131 | +0.00 |
| 458.sjeng      |   442 | 442 | +0.00 | 438 | -0.90 | 438 | -0.90 | 437 | -1.13 |
| 462.libquantum |   291 | 292 | +0.34 | 286 | -1.72 | 287 | -1.37 | 287 | -1.37 |
| 464.h264ref    |   364 | 365 | +0.27 | 364 | +0.00 | 364 | +0.00 | 363 | -0.27 |
| 471.omnetpp    |   266 | 266 | +0.00 | 265 | -0.38 | 265 | -0.38 | 265 | -0.38 |
| 473.astar      |   306 | 307 | +0.33 | 306 | +0.00 | 306 | +0.00 | 309 | +0.98 |
| 483.xalancbmk  |   177 | 173 | -2.26 | 170 | -3.95 | 170 | -3.95 | 170 | -3.95 |

 Text size
 ---------
 
| Benchmark      |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
|----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
| 400.perlbench  | 1161762 | 1161874 | +0.01 | 1161874 | +0.01 | 1162226 | +0.04 | 1162338 | +0.05 |
| 401.bzip2      |   80834 |   80834 | +0.00 |   80834 | +0.00 |   80834 | +0.00 |   80834 | +0.00 |
| 403.gcc        | 3170946 | 3171394 | +0.01 | 3172914 | +0.06 | 3173170 | +0.07 | 3174818 | +0.12 |
| 429.mcf        |   10418 |   10418 | +0.00 |   10418 | +0.00 |   10450 | +0.31 |   10450 | +0.31 |
| 445.gobmk      |  779778 |  779778 | +0.00 |  779842 | +0.01 |  779842 | +0.01 |  780418 | +0.08 |
| 456.hmmer      |  328258 |  328258 | +0.00 |  328258 | +0.00 |  328258 | +0.00 |  328258 | +0.00 |
| 458.sjeng      |  146386 |  146386 | +0.00 |  148162 | +1.21 |  148162 | +1.21 |  149330 | +2.01 |
| 462.libquantum |   30666 |   30666 | +0.00 |   30730 | +0.21 |   30730 | +0.21 |   30730 | +0.21 |
| 464.h264ref    |  737826 |  737826 | +0.00 |  737890 | +0.01 |  737890 | +0.01 |  739186 | +0.18 |
| 471.omnetpp    |  561570 |  561570 | +0.00 |  561826 | +0.05 |  561826 | +0.05 |  561826 | +0.05 |
| 473.astar      |   39314 |   39522 | +0.53 |   39522 | +0.53 |   39538 | +0.57 |   39538 | +0.57 |
| 483.xalancbmk  | 3319682 | 3319842 | +0.00 | 3319842 | +0.00 | 3319842 | +0.00 | 3319842 | +0.00 |



Zen SPECFP 2006 -O2 generic tuning
==================================

 Run-time
 --------
 
| Benchmark     | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
|---------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
| 410.bwaves    |   214 | 213 | -0.47 | 214 | +0.00 | 214 | +0.00 | 214 | +0.00 |
| 433.milc      |   290 | 291 | +0.34 | 290 | +0.00 | 295 | +1.72 | 289 | -0.34 |
| 434.zeusmp    |   182 | 182 | +0.00 | 182 | +0.00 | 184 | +1.10 | 182 | +0.00 |
| 435.gromacs   |   218 | 218 | +0.00 | 217 | -0.46 | 216 | -0.92 | 220 | +0.92 |
| 436.cactusADM |   350 | 349 | -0.29 | 349 | -0.29 | 343 | -2.00 | 349 | -0.29 |
| 437.leslie3d  |   196 | 195 | -0.51 | 196 | +0.00 | 194 | -1.02 | 196 | +0.00 |
| 444.namd      |   273 | 273 | +0.00 | 273 | +0.00 | 273 | +0.00 | 273 | +0.00 |
| 447.dealII    |   211 | 211 | +0.00 | 210 | -0.47 | 210 | -0.47 | 211 | +0.00 |
| 450.soplex    |   187 | 188 | +0.53 | 188 | +0.53 | 187 | +0.00 | 187 | +0.00 |
| 453.povray    |   119 | 118 | -0.84 | 119 | +0.00 | 119 | +0.00 | 118 | -0.84 |
| 454.calculix  |   534 | 533 | -0.19 | 531 | -0.56 | 531 | -0.56 | 532 | -0.37 |
| 459.GemsFDTD  |   236 | 235 | -0.42 | 235 | -0.42 | 242 | +2.54 | 237 | +0.42 |
| 465.tonto     |   366 | 365 | -0.27 | 365 | -0.27 | 364 | -0.55 | 365 | -0.27 |
| 470.lbm       |   181 | 180 | -0.55 | 180 | -0.55 | 180 | -0.55 | 180 | -0.55 |
| 481.wrf       |   303 | 303 | +0.00 | 302 | -0.33 | 304 | +0.33 | 304 | +0.33 |
| 482.sphinx3   |   362 | 362 | +0.00 | 360 | -0.55 | 361 | -0.28 | 363 | +0.28 |

 Text size
 ---------

| Benchmark     |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
|---------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
| 410.bwaves    |   25954 |   25954 | +0.00 |   25954 | +0.00 |   25954 | +0.00 |   25954 | +0.00 |
| 433.milc      |   87922 |   87922 | +0.00 |   87922 | +0.00 |   88610 | +0.78 |   89042 | +1.27 |
| 434.zeusmp    |  212034 |  212034 | +0.00 |  212034 | +0.00 |  212034 | +0.00 |  212034 | +0.00 |
| 435.gromacs   |  747026 |  747026 | +0.00 |  747026 | +0.00 |  747026 | +0.00 |  747026 | +0.00 |
| 436.cactusADM |  526178 |  526178 | +0.00 |  526178 | +0.00 |  526274 | +0.02 |  526274 | +0.02 |
| 437.leslie3d  |   83234 |   83234 | +0.00 |   83234 | +0.00 |   83234 | +0.00 |   83234 | +0.00 |
| 444.namd      |  297234 |  297266 | +0.01 |  297266 | +0.01 |  297266 | +0.01 |  297266 | +0.01 |
| 447.dealII    | 2165282 | 2167650 | +0.11 | 2172290 | +0.32 | 2174034 | +0.40 | 2174082 | +0.41 |
| 450.soplex    |  347122 |  347122 | +0.00 |  347122 | +0.00 |  347122 | +0.00 |  347122 | +0.00 |
| 453.povray    |  800914 |  800962 | +0.01 |  801570 | +0.08 |  802002 | +0.14 |  803138 | +0.28 |
| 454.calculix  | 1342802 | 1342802 | +0.00 | 1342802 | +0.00 | 1342802 | +0.00 | 1342802 | +0.00 |
| 459.GemsFDTD  |  353410 |  354050 | +0.18 |  354050 | +0.18 |  354050 | +0.18 |  354098 | +0.19 |
| 465.tonto     | 3464210 | 3465058 | +0.02 | 3465058 | +0.02 | 3468434 | +0.12 | 3476594 | +0.36 |
| 470.lbm       |    9202 |    9202 | +0.00 |    9202 | +0.00 |    9202 | +0.00 |    9202 | +0.00 |
| 481.wrf       | 3345170 | 3345170 | +0.00 | 3345170 | +0.00 | 3351586 | +0.19 | 3351586 | +0.19 |
| 482.sphinx3   |  125026 |  125026 | +0.00 |  125026 | +0.00 |  125026 | +0.00 |  125026 | +0.00 |



Zen SPECFP 2006 -Ofast native tuning
====================================

 Run-time
 --------

| Benchmark     | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
|---------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
| 410.bwaves    |   151 | 150 | -0.66 | 151 | +0.00 | 151 | +0.00 | 151 | +0.00 |
| 433.milc      |   197 | 197 | +0.00 | 197 | +0.00 | 194 | -1.52 | 186 | -5.58 |
| 434.zeusmp    |   128 | 128 | +0.00 | 128 | +0.00 | 128 | +0.00 | 128 | +0.00 |
| 435.gromacs   |   181 | 181 | +0.00 | 180 | -0.55 | 180 | -0.55 | 181 | +0.00 |
| 436.cactusADM |   139 | 139 | +0.00 | 139 | +0.00 | 132 | -5.04 | 139 | +0.00 |
| 437.leslie3d  |   159 | 160 | +0.63 | 160 | +0.63 | 159 | +0.00 | 159 | +0.00 |
| 444.namd      |   256 | 256 | +0.00 | 255 | -0.39 | 255 | -0.39 | 256 | +0.00 |
| 447.dealII    |   200 | 200 | +0.00 | 199 | -0.50 | 201 | +0.50 | 201 | +0.50 |
| 450.soplex    |   184 | 184 | +0.00 | 185 | +0.54 | 184 | +0.00 | 184 | +0.00 |
| 453.povray    |   124 | 122 | -1.61 | 123 | -0.81 | 124 | +0.00 | 122 | -1.61 |
| 454.calculix  |   192 | 192 | +0.00 | 192 | +0.00 | 193 | +0.52 | 193 | +0.52 |
| 459.GemsFDTD  |   208 | 208 | +0.00 | 208 | +0.00 | 214 | +2.88 | 208 | +0.00 |
| 465.tonto     |   320 | 320 | +0.00 | 320 | +0.00 | 320 | +0.00 | 320 | +0.00 |
| 470.lbm       |   142 | 142 | +0.00 | 142 | +0.00 | 142 | +0.00 | 142 | +0.00 |
| 481.wrf       |   195 | 195 | +0.00 | 195 | +0.00 | 195 | +0.00 | 195 | +0.00 |
| 482.sphinx3   |   256 | 258 | +0.78 | 256 | +0.00 | 256 | +0.00 | 257 | +0.39 |

 Text size
 ---------

| Benchmark     |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
|---------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
| 410.bwaves    |   27490 |   27490 | +0.00 |   27490 | +0.00 |   27490 | +0.00 |   27490 | +0.00 |
| 433.milc      |  118178 |  118178 | +0.00 |  118178 | +0.00 |  118962 | +0.66 |  119634 | +1.23 |
| 434.zeusmp    |  411106 |  411106 | +0.00 |  411106 | +0.00 |  411106 | +0.00 |  411106 | +0.00 |
| 435.gromacs   |  935970 |  935970 | +0.00 |  935970 | +0.00 |  935970 | +0.00 |  936162 | +0.02 |
| 436.cactusADM |  750546 |  750546 | +0.00 |  750546 | +0.00 |  750626 | +0.01 |  750626 | +0.01 |
| 437.leslie3d  |  123410 |  123410 | +0.00 |  123410 | +0.00 |  123410 | +0.00 |  123410 | +0.00 |
| 444.namd      |  284082 |  284114 | +0.01 |  284114 | +0.01 |  284114 | +0.01 |  284114 | +0.01 |
| 447.dealII    | 2438610 | 2440946 | +0.10 | 2444978 | +0.26 | 2446882 | +0.34 | 2446930 | +0.34 |
| 450.soplex    |  443218 |  443218 | +0.00 |  443218 | +0.00 |  443218 | +0.00 |  443218 | +0.00 |
| 453.povray    | 1077778 | 1077890 | +0.01 | 1078658 | +0.08 | 1079026 | +0.12 | 1080370 | +0.24 |
| 454.calculix  | 1639138 | 1639138 | +0.00 | 1639138 | +0.00 | 1639474 | +0.02 | 1639474 | +0.02 |
| 459.GemsFDTD  |  451202 |  451234 | +0.01 |  451234 | +0.01 |  451234 | +0.01 |  451282 | +0.02 |
| 465.tonto     | 4584690 | 4585250 | +0.01 | 4585250 | +0.01 | 4588130 | +0.08 | 4595442 | +0.23 |
| 470.lbm       |    9858 |    9858 | +0.00 |    9858 | +0.00 |    9858 | +0.00 |    9858 | +0.00 |
| 481.wrf       | 4588002 | 4588002 | +0.00 | 4588290 | +0.01 | 4621010 | +0.72 | 4621922 | +0.74 |
| 482.sphinx3   |  179602 |  179602 | +0.00 |  179602 | +0.00 |  179602 | +0.00 |  179602 | +0.00 |



Zen SPEC INT 2017 -O2 generic tuning
====================================

 Run-time
 --------

| Benchmark       | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
|-----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
| 500.perlbench_r |   529 | 529 | +0.00 | 531 | +0.38 | 530 | +0.19 | 534 | +0.95 |
| 502.gcc_r       |   338 | 333 | -1.48 | 334 | -1.18 | 339 | +0.30 | 339 | +0.30 |
| 505.mcf_r       |   382 | 381 | -0.26 | 382 | +0.00 | 382 | +0.00 | 381 | -0.26 |
| 520.omnetpp_r   |   511 | 503 | -1.57 | 497 | -2.74 | 497 | -2.74 | 497 | -2.74 |
| 523.xalancbmk_r |   391 | 388 | -0.77 | 389 | -0.51 | 390 | -0.26 | 391 | +0.00 |
| 525.x264_r      |   590 | 590 | +0.00 | 591 | +0.17 | 592 | +0.34 | 593 | +0.51 |
| 531.deepsjeng_r |   427 | 427 | +0.00 | 427 | +0.00 | 428 | +0.23 | 427 | +0.00 |
| 541.leela_r     |   716 | 716 | +0.00 | 716 | +0.00 | 719 | +0.42 | 719 | +0.42 |
| 548.exchange2_r |   593 | 593 | +0.00 | 593 | +0.00 | 593 | +0.00 | 593 | +0.00 |
| 557.xz_r        |   452 | 452 | +0.00 | 453 | +0.22 | 454 | +0.44 | 452 | +0.00 |

 Text size
 ---------

| Benchmark       |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
|-----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
| 500.perlbench_r | 1599442 | 1599522 | +0.01 | 1599522 | +0.01 | 1599522 | +0.01 | 1600082 | +0.04 |
| 502.gcc_r       | 6757602 | 6758978 | +0.02 | 6759090 | +0.02 | 6759842 | +0.03 | 6760306 | +0.04 |
| 505.mcf_r       |   16098 |   16098 | +0.00 |   16098 | +0.00 |   16098 | +0.00 |   16306 | +1.29 |
| 520.omnetpp_r   | 1262498 | 1262562 | +0.01 | 1264034 | +0.12 | 1264034 | +0.12 | 1264034 | +0.12 |
| 523.xalancbmk_r | 3989026 | 3989202 | +0.00 | 3989202 | +0.00 | 3989202 | +0.00 | 3989202 | +0.00 |
| 525.x264_r      |  414130 |  414194 | +0.02 |  414194 | +0.02 |  414738 | +0.15 |  415122 | +0.24 |
| 531.deepsjeng_r |   67426 |   67426 | +0.00 |   67458 | +0.05 |   67458 | +0.05 |   67458 | +0.05 |
| 541.leela_r     |  219378 |  219378 | +0.00 |  219378 | +0.00 |  224082 | +2.14 |  237026 | +8.04 |
| 548.exchange2_r |   61234 |   61234 | +0.00 |   61234 | +0.00 |   61234 | +0.00 |   61234 | +0.00 |
| 557.xz_r        |  111490 |  111490 | +0.00 |  111490 | +0.00 |  111506 | +0.01 |  111890 | +0.36 |



Zen SPEC INT 2017 -Ofast native tuning
======================================

 Run-time
 --------

| Benchmark       | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
|-----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
| 500.perlbench_r |   525 | 524 | -0.19 | 525 | +0.00 | 525 | +0.00 | 534 | +1.71 |
| 502.gcc_r       |   331 | 329 | -0.60 | 324 | -2.11 | 330 | -0.30 | 324 | -2.11 |
| 505.mcf_r       |   380 | 380 | +0.00 | 381 | +0.26 | 380 | +0.00 | 379 | -0.26 |
| 520.omnetpp_r   |   487 | 486 | -0.21 | 488 | +0.21 | 489 | +0.41 | 488 | +0.21 |
| 523.xalancbmk_r |   373 | 369 | -1.07 | 367 | -1.61 | 370 | -0.80 | 368 | -1.34 |
| 525.x264_r      |   319 | 319 | +0.00 | 320 | +0.31 | 321 | +0.63 | 322 | +0.94 |
| 531.deepsjeng_r |   418 | 418 | +0.00 | 418 | +0.00 | 418 | +0.00 | 419 | +0.24 |
| 541.leela_r     |   674 | 674 | +0.00 | 674 | +0.00 | 672 | -0.30 | 672 | -0.30 |
| 548.exchange2_r |   466 | 466 | +0.00 | 466 | +0.00 | 466 | +0.00 | 466 | +0.00 |
| 557.xz_r        |   443 | 443 | +0.00 | 443 | +0.00 | 449 | +1.35 | 449 | +1.35 |
  
 Text size
 ---------

| Benchmark       |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
|-----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
| 500.perlbench_r | 2122882 | 2122962 | +0.00 | 2122962 | +0.00 | 2122962 | +0.00 | 2122514 | -0.02 |
| 502.gcc_r       | 8566290 | 8567794 | +0.02 | 8569138 | +0.03 | 8570066 | +0.04 | 8570642 | +0.05 |
| 505.mcf_r       |   26770 |   26770 | +0.00 |   26770 | +0.00 |   26770 | +0.00 |   26962 | +0.72 |
| 520.omnetpp_r   | 1713938 | 1713954 | +0.00 | 1714754 | +0.05 | 1714754 | +0.05 | 1714754 | +0.05 |
| 523.xalancbmk_r | 4881890 | 4882114 | +0.00 | 4882114 | +0.00 | 4882114 | +0.00 | 4882114 | +0.00 |
| 525.x264_r      |  601522 |  601602 | +0.01 |  601602 | +0.01 |  602130 | +0.10 |  602834 | +0.22 |
| 531.deepsjeng_r |   90306 |   90306 | +0.00 |   90338 | +0.04 |   90338 | +0.04 |   90338 | +0.04 |
| 541.leela_r     |  277634 |  277650 | +0.01 |  277650 | +0.01 |  282386 | +1.71 |  295778 | +6.54 |
| 548.exchange2_r |  109058 |  109058 | +0.00 |  109058 | +0.00 |  109058 | +0.00 |  109058 | +0.00 |
| 557.xz_r        |  154594 |  154594 | +0.00 |  154594 | +0.00 |  154610 | +0.01 |  154930 | +0.22 |



Zen SPEC 2017 FP -O2 generic tuning
===================================

 Run-time
 --------

| Benchmark       | trunk |   s |      % |  x1 |      % |  x2 |      % |  x4 |      % |
|-----------------+-------+-----+--------+-----+--------+-----+--------+-----+--------|
| 503.bwaves_r    |   801 | 801 |  +0.00 | 801 |  +0.00 | 801 |  +0.00 | 801 |  +0.00 |
| 507.cactuBSSN_r |   303 | 302 |  -0.33 | 299 |  -1.32 | 302 |  -0.33 | 307 |  +1.32 |
| 508.namd_r      |   306 | 306 |  +0.00 | 307 |  +0.33 | 306 |  +0.00 | 306 |  +0.00 |
| 510.parest_r    |   558 | 553 |  -0.90 | 561 |  +0.54 | 554 |  -0.72 | 562 |  +0.72 |
| 511.povray_r    |   679 | 672 |  -1.03 | 673 |  -0.88 | 680 |  +0.15 | 644 |  -5.15 |
| 519.lbm_r       |   240 | 240 |  +0.00 | 240 |  +0.00 | 240 |  +0.00 | 240 |  +0.00 |
| 521.wrf_r       |   851 | 827 |  -2.82 | 827 |  -2.82 | 827 |  -2.82 | 828 |  -2.70 |
| 526.blender_r   |   376 | 376 |  +0.00 | 379 |  +0.80 | 377 |  +0.27 | 376 |  +0.00 |
| 527.cam4_r      |   529 | 527 |  -0.38 | 533 |  +0.76 | 536 |  +1.32 | 528 |  -0.19 |
| 538.imagick_r   |   646 | 570 | -11.76 | 570 | -11.76 | 569 | -11.92 | 570 | -11.76 |
| 544.nab_r       |   467 | 467 |  +0.00 | 467 |  +0.00 | 467 |  +0.00 | 467 |  +0.00 |
| 549.fotonik3d_r |   413 | 413 |  +0.00 | 414 |  +0.24 | 415 |  +0.48 | 413 |  +0.00 |
| 554.roms_r      |   459 | 455 |  -0.87 | 456 |  -0.65 | 456 |  -0.65 | 456 |  -0.65 |

 Text size
 ---------

| Benchmark       |    trunk |   strict |     % |       x1 |     % |       x2 |     % |       x4 |     % |
|-----------------+----------+----------+-------+----------+-------+----------+-------+----------+-------|
| 503.bwaves_r    |    32034 |    32034 | +0.00 |    32034 | +0.00 |    32034 | +0.00 |    32034 | +0.00 |
| 507.cactuBSSN_r |  2951634 |  2951634 | +0.00 |  2951634 | +0.00 |  2951698 | +0.00 |  2951730 | +0.00 |
| 508.namd_r      |   837458 |   837490 | +0.00 |   837490 | +0.00 |   837490 | +0.00 |   837490 | +0.00 |
| 510.parest_r    |  6540866 |  6545618 | +0.07 |  6546754 | +0.09 |  6561426 | +0.31 |  6569426 | +0.44 |
| 511.povray_r    |   803618 |   803666 | +0.01 |   804274 | +0.08 |   804706 | +0.14 |   805842 | +0.28 |
| 519.lbm_r       |    12018 |    12018 | +0.00 |    12018 | +0.00 |    12018 | +0.00 |    12018 | +0.00 |
| 521.wrf_r       | 16292962 | 16296786 | +0.02 | 16296978 | +0.02 | 16302594 | +0.06 | 16419842 | +0.78 |
| 526.blender_r   |  7268224 |  7281264 | +0.18 |  7282608 | +0.20 |  7289168 | +0.29 |  7295296 | +0.37 |
| 527.cam4_r      |  5063666 |  5063922 | +0.01 |  5065010 | +0.03 |  5068114 | +0.09 |  5072946 | +0.18 |
| 538.imagick_r   |  1608178 |  1609282 | +0.07 |  1609282 | +0.07 |  1613458 | +0.33 |  1613970 | +0.36 |
| 544.nab_r       |   156242 |   156242 | +0.00 |   156242 | +0.00 |   156242 | +0.00 |   156242 | +0.00 |
| 549.fotonik3d_r |   326738 |   326738 | +0.00 |   326738 | +0.00 |   326738 | +0.00 |   326738 | +0.00 |
| 554.roms_r      |   728546 |   728546 | +0.00 |   728546 | +0.00 |   728546 | +0.00 |   728546 | +0.00 |



Zen SPEC 2017 FP -Ofast native tuning
=====================================

 Run-time
 --------

| Benchmark       | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
|-----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
| 503.bwaves_r    |   310 | 310 | +0.00 | 310 | +0.00 | 310 | +0.00 | 309 | -0.32 |
| 507.cactuBSSN_r |   269 | 266 | -1.12 | 266 | -1.12 | 268 | -0.37 | 270 | +0.37 |
| 508.namd_r      |   270 | 269 | -0.37 | 269 | -0.37 | 268 | -0.74 | 268 | -0.74 |
| 510.parest_r    |   607 | 601 | -0.99 | 599 | -1.32 | 599 | -1.32 | 604 | -0.49 |
| 511.povray_r    |   662 | 664 | +0.30 | 671 | +1.36 | 680 | +2.72 | 675 | +1.96 |
| 519.lbm_r       |   186 | 186 | +0.00 | 186 | +0.00 | 186 | +0.00 | 186 | +0.00 |
| 521.wrf_r       |   550 | 554 | +0.73 | 550 | +0.00 | 550 | +0.00 | 549 | -0.18 |
| 526.blender_r   |   355 | 354 | -0.28 | 355 | +0.00 | 354 | -0.28 | 354 | -0.28 |
| 527.cam4_r      |   434 | 437 | +0.69 | 435 | +0.23 | 437 | +0.69 | 435 | +0.23 |
| 538.imagick_r   |   433 | 420 | -3.00 | 420 | -3.00 | 420 | -3.00 | 419 | -3.23 |
| 544.nab_r       |   424 | 425 | +0.24 | 425 | +0.24 | 425 | +0.24 | 425 | +0.24 |
| 549.fotonik3d_r |   421 | 422 | +0.24 | 422 | +0.24 | 422 | +0.24 | 422 | +0.24 |
| 554.roms_r      |   360 | 361 | +0.28 | 361 | +0.28 | 361 | +0.28 | 361 | +0.28 |

+1.36% for 511.povray_r is the worst regression for the proposed x1
defaults, by the way.  I have not investigated it further, however.

 Text size
 ---------

| Benchmark       |    trunk |   strict |     % |       x1 |     % |       x2 |     % |       x4 |     % |
|-----------------+----------+----------+-------+----------+-------+----------+-------+----------+-------|
| 503.bwaves_r    |    34562 |    34562 | +0.00 |    34562 | +0.00 |    34562 | +0.00 |    34562 | +0.00 |
| 507.cactuBSSN_r |  3978402 |  3978402 | +0.00 |  3978402 | +0.00 |  3978514 | +0.00 |  3978546 | +0.00 |
| 508.namd_r      |   869106 |   869154 | +0.01 |   869154 | +0.01 |   869154 | +0.01 |   869154 | +0.01 |
| 510.parest_r    |  7186258 |  7189298 | +0.04 |  7190370 | +0.06 |  7203890 | +0.25 |  7211202 | +0.35 |
| 511.povray_r    |  1063314 |  1063410 | +0.01 |  1064178 | +0.08 |  1064546 | +0.12 |  1065890 | +0.24 |
| 519.lbm_r       |    12178 |    12178 | +0.00 |    12178 | +0.00 |    12178 | +0.00 |    12178 | +0.00 |
| 521.wrf_r       | 19480946 | 19484146 | +0.02 | 19484466 | +0.02 | 19607538 | +0.65 | 19716178 | +1.21 |
| 526.blender_r   |  9708752 |  9719952 | +0.12 |  9722768 | +0.14 |  9730224 | +0.22 |  9737760 | +0.30 |
| 527.cam4_r      |  6217970 |  6218162 | +0.00 |  6219570 | +0.03 |  6223362 | +0.09 |  6227762 | +0.16 |
| 538.imagick_r   |  2255682 |  2256162 | +0.02 |  2256162 | +0.02 |  2261346 | +0.25 |  2261938 | +0.28 |
| 544.nab_r       |   212418 |   212418 | +0.00 |   212418 | +0.00 |   212418 | +0.00 |   212578 | +0.08 |
| 549.fotonik3d_r |   454738 |   454738 | +0.00 |   454738 | +0.00 |   454738 | +0.00 |   454738 | +0.00 |
| 554.roms_r      |   910978 |   910978 | +0.00 |   910978 | +0.00 |   910978 | +0.00 |   910978 | +0.00 |


I believe the numbers are good and thus I would like to ask for
reconsideration of the objection and for approval to commit the patch
below.  Needless to say, it has passed bootstrap and testing on
x86_64-linux.

Thanks

Martin


2017-10-27  Martin Jambor  <mjam...@suse.cz>

        PR target/80689
        * tree-sra.h: New file.
        * ipa-prop.h: Moved declaration of build_ref_for_offset to
        tree-sra.h.
        * expr.c: Include params.h and tree-sra.h.
        (emit_move_elementwise): New function.
        (store_expr_with_bounds): Optionally use it.
        * ipa-cp.c: Include tree-sra.h.
        * params.def (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY): New.
        (PARAM_MAX_INSNS_FOR_ELEMENTWISE_COPY): Likewise.
        * config/i386/i386.c (ix86_option_override_internal): Set
        PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY to 35.
        * tree-sra.c: Include tree-sra.h.
        (scalarizable_type_p): Renamed to
        simple_mix_of_records_and_arrays_p, made public, renamed the
        second parameter to allow_char_arrays, added count_p parameter.
        (extract_min_max_idx_from_array): New function.
        (completely_scalarize): Moved bits of the function to
        extract_min_max_idx_from_array.

        testsuite/
        * gcc.target/i386/pr80689-1.c: New test.

Added insns count param limit
---
 gcc/config/i386/i386.c                    |   4 +
 gcc/expr.c                                | 106 ++++++++++++++++++++++-
 gcc/ipa-cp.c                              |   1 +
 gcc/ipa-prop.h                            |   4 -
 gcc/params.def                            |  12 +++
 gcc/testsuite/gcc.target/i386/pr80689-1.c |  38 +++++++++
 gcc/tree-sra.c                            | 134 +++++++++++++++++++++---------
 gcc/tree-sra.h                            |  34 ++++++++
 8 files changed, 288 insertions(+), 45 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr80689-1.c
 create mode 100644 gcc/tree-sra.h

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 80c8ce7ecb9..0bff2da72dd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -4580,6 +4580,10 @@ ix86_option_override_internal (bool main_args_p,
                         ix86_tune_cost->l2_cache_size,
                         opts->x_param_values,
                         opts_set->x_param_values);
+  maybe_set_param_value (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
+                        35,
+                        opts->x_param_values,
+                        opts_set->x_param_values);
 
   /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
   if (opts->x_flag_prefetch_loop_arrays < 0
diff --git a/gcc/expr.c b/gcc/expr.c
index 496d492c9fa..971880b635d 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -61,7 +61,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-chkp.h"
 #include "rtl-chkp.h"
 #include "ccmp.h"
-
+#include "params.h"
+#include "tree-sra.h"
 
 /* If this is nonzero, we do not bother generating VOLATILE
    around volatile memory references, and we are willing to
@@ -5340,6 +5341,80 @@ emit_storent_insn (rtx to, rtx from)
   return maybe_expand_insn (code, 2, ops);
 }
 
+/* Generate code for copying data of type TYPE at SOURCE plus OFFSET to TARGET
+   plus OFFSET, but do so element-wise and/or field-wise for each record and
+   array within TYPE.  TYPE must either be a register type or an aggregate
+   complying with simple_mix_of_records_and_arrays_p.
+
+   If CALL_PARAM_P is nonzero, this is a store into a call param on the
+   stack, and block moves may need to be treated specially.  */
+
+static void
+emit_move_elementwise (tree type, rtx target, rtx source, HOST_WIDE_INT offset,
+                      int call_param_p)
+{
+  switch (TREE_CODE (type))
+    {
+    case RECORD_TYPE:
+      for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
+       if (TREE_CODE (fld) == FIELD_DECL)
+         {
+           HOST_WIDE_INT fld_offset = offset + int_bit_position (fld);
+           tree ft = TREE_TYPE (fld);
+           emit_move_elementwise (ft, target, source, fld_offset,
+                                  call_param_p);
+         }
+      break;
+
+    case ARRAY_TYPE:
+      {
+       tree elem_type = TREE_TYPE (type);
+       HOST_WIDE_INT el_size = tree_to_shwi (TYPE_SIZE (elem_type));
+       gcc_assert (el_size > 0);
+
+       offset_int idx, max;
+       /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
+       if (extract_min_max_idx_from_array (type, &idx, &max))
+         {
+           HOST_WIDE_INT el_offset = offset;
+           for (; idx <= max; ++idx)
+             {
+               emit_move_elementwise (elem_type, target, source, el_offset,
+                                      call_param_p);
+               el_offset += el_size;
+             }
+         }
+      }
+      break;
+    default:
+      machine_mode mode = TYPE_MODE (type);
+
+      rtx ntgt = adjust_address (target, mode, offset / BITS_PER_UNIT);
+      rtx nsrc = adjust_address (source, mode, offset / BITS_PER_UNIT);
+
+      /* TODO: Figure out whether the following is actually necessary.  */
+      if (target == ntgt)
+       ntgt = copy_rtx (target);
+      if (source == nsrc)
+       nsrc = copy_rtx (source);
+
+      gcc_assert (mode != VOIDmode);
+      if (mode != BLKmode)
+       emit_move_insn (ntgt, nsrc);
+      else
+       {
+         /* For example vector gimple registers can end up here.  */
+         rtx size = expand_expr (TYPE_SIZE_UNIT (type), NULL_RTX,
+                                 TYPE_MODE (sizetype), EXPAND_NORMAL);
+         emit_block_move (ntgt, nsrc, size,
+                          (call_param_p
+                           ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
+       }
+      break;
+    }
+  return;
+}
+
 /* Generate code for computing expression EXP,
    and storing the value into TARGET.
 
@@ -5713,9 +5788,32 @@ store_expr_with_bounds (tree exp, rtx target, int call_param_p,
        emit_group_store (target, temp, TREE_TYPE (exp),
                          int_size_in_bytes (TREE_TYPE (exp)));
       else if (GET_MODE (temp) == BLKmode)
-       emit_block_move (target, temp, expr_size (exp),
-                        (call_param_p
-                         ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
+       {
+         /* Copying smallish BLKmode structures with emit_block_move and thus
+            by-pieces can result in store-to-load forwarding stalls.  So copy
+            some simple small aggregates element-wise or field-wise.  */
+         int count = 0;
+         if (GET_MODE (target) == BLKmode
+             && AGGREGATE_TYPE_P (TREE_TYPE (exp))
+             && !TREE_ADDRESSABLE (TREE_TYPE (exp))
+             && tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (exp)))
+             && (tree_to_shwi (TYPE_SIZE (TREE_TYPE (exp)))
+                 <= (PARAM_VALUE (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY)
+                     * BITS_PER_UNIT))
+             && simple_mix_of_records_and_arrays_p (TREE_TYPE (exp), false,
+                                                    &count)
+             && (count <= PARAM_VALUE (PARAM_MAX_INSNS_FOR_ELEMENTWISE_COPY)))
+           {
+             /* FIXME:  Can this happen?  What would it mean?  */
+             gcc_assert (!reverse);
+             emit_move_elementwise (TREE_TYPE (exp), target, temp, 0,
+                                    call_param_p);
+           }
+         else
+           emit_block_move (target, temp, expr_size (exp),
+                            (call_param_p
+                             ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
+       }
       /* If we emit a nontemporal store, there is nothing else to do.  */
       else if (nontemporal && emit_storent_insn (target, temp))
        ;
diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index d23c1d8ba3e..30f91e70c22 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -124,6 +124,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-ccp.h"
 #include "stringpool.h"
 #include "attribs.h"
+#include "tree-sra.h"
 
 template <typename valtype> class ipcp_value;
 
diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
index fa5bed49ee0..2313cc884ed 100644
--- a/gcc/ipa-prop.h
+++ b/gcc/ipa-prop.h
@@ -877,10 +877,6 @@ ipa_parm_adjustment *ipa_get_adjustment_candidate (tree **, bool *,
 void ipa_release_body_info (struct ipa_func_body_info *);
 tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
 
-/* From tree-sra.c:  */
-tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, bool, tree,
-                          gimple_stmt_iterator *, bool);
-
 /* In ipa-cp.c  */
 void ipa_cp_c_finalize (void);
 
diff --git a/gcc/params.def b/gcc/params.def
index 8881f4c403a..9c778f9540a 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1287,6 +1287,18 @@ DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
          "Enable loop epilogue vectorization using smaller vector size.",
          0, 0, 1)
 
+DEFPARAM (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
+         "max-size-for-elementwise-copy",
+         "Maximum size in bytes of a structure or an array to be considered "
+         "for copying by its individual fields or elements",
+         0, 0, 512)
+
+DEFPARAM (PARAM_MAX_INSNS_FOR_ELEMENTWISE_COPY,
+         "max-insns-for-elementwise-copy",
+         "Maximum number of instructions needed to consider copying "
+         "a structure or an array by its individual fields or elements",
+         6, 0, 64)
+
 /*
 
 Local variables:
diff --git a/gcc/testsuite/gcc.target/i386/pr80689-1.c b/gcc/testsuite/gcc.target/i386/pr80689-1.c
new file mode 100644
index 00000000000..4156d4fba45
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr80689-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+typedef struct st1
+{
+        long unsigned int a,b;
+        long int c,d;
+}R;
+
+typedef struct st2
+{
+        int  t;
+        R  reg;
+}N;
+
+void Set (const R *region,  N *n_info );
+
+void test(N  *n_obj ,const long unsigned int a, const long unsigned int b,  const long int c,const long int d)
+{
+        R reg;
+
+        reg.a=a;
+        reg.b=b;
+        reg.c=c;
+        reg.d=d;
+        Set (&reg, n_obj);
+
+}
+
+void Set (const R *reg,  N *n_obj )
+{
+        n_obj->reg=(*reg);
+}
+
+
+/* { dg-final { scan-assembler-not "%(x|y|z)mm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "movdqu" } } */
+/* { dg-final { scan-assembler-not "movups" } } */
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index bac593951e7..d06463ce21c 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -104,6 +104,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-fnsummary.h"
 #include "ipa-utils.h"
 #include "builtins.h"
+#include "tree-sra.h"
 
 /* Enumeration of all aggregate reductions we can do.  */
 enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
@@ -952,14 +953,15 @@ create_access (tree expr, gimple *stmt, bool write)
 }
 
 
-/* Return true iff TYPE is scalarizable - i.e. a RECORD_TYPE or fixed-length
-   ARRAY_TYPE with fields that are either of gimple register types (excluding
-   bit-fields) or (recursively) scalarizable types.  CONST_DECL must be true if
-   we are considering a decl from constant pool.  If it is false, char arrays
-   will be refused.  */
+/* Return true if TYPE is a RECORD_TYPE or fixed-length ARRAY_TYPE whose
+   fields/elements are not bit-fields and are either register types or
+   recursively satisfy this predicate.  If ALLOW_CHAR_ARRAYS is false, also
+   reject arrays of one-byte elements.  If COUNT_P is non-NULL, add the number
+   of register-type fields/elements found to *COUNT_P.  */
 
-static bool
-scalarizable_type_p (tree type, bool const_decl)
+bool
+simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays,
+                                   int *count_p)
 {
   gcc_assert (!is_gimple_reg_type (type));
   if (type_contains_placeholder_p (type))
@@ -976,8 +978,13 @@ scalarizable_type_p (tree type, bool const_decl)
          if (DECL_BIT_FIELD (fld))
            return false;
 
-         if (!is_gimple_reg_type (ft)
-             && !scalarizable_type_p (ft, const_decl))
+         if (is_gimple_reg_type (ft))
+           {
+             if (count_p)
+               (*count_p)++;
+           }
+         else if (!simple_mix_of_records_and_arrays_p (ft, allow_char_arrays,
+                                                      count_p))
            return false;
        }
 
@@ -986,7 +993,7 @@ scalarizable_type_p (tree type, bool const_decl)
   case ARRAY_TYPE:
     {
       HOST_WIDE_INT min_elem_size;
-      if (const_decl)
+      if (allow_char_arrays)
        min_elem_size = 0;
       else
        min_elem_size = BITS_PER_UNIT;
@@ -1007,9 +1014,45 @@ scalarizable_type_p (tree type, bool const_decl)
        return false;
 
       tree elem = TREE_TYPE (type);
-      if (!is_gimple_reg_type (elem)
-         && !scalarizable_type_p (elem, const_decl))
-       return false;
+      if (!count_p)
+       {
+         if (!is_gimple_reg_type (elem)
+             && !simple_mix_of_records_and_arrays_p (elem, allow_char_arrays,
+                                                     NULL))
+           return false;
+         else
+           return true;
+       }
+
+      offset_int min, max;
+      HOST_WIDE_INT ds;
+      bool nonzero = extract_min_max_idx_from_array (type, &min, &max);
+
+      if (nonzero && (min <= max))
+       {
+         offset_int d = max - min + 1;
+         if (!wi::fits_shwi_p (d))
+           return false;
+         ds = d.to_shwi ();
+         if (ds > INT_MAX)
+           return false;
+       }
+      else
+       ds = 0;
+
+      if (is_gimple_reg_type (elem))
+       *count_p += (int) ds;
+      else
+       {
+         int elc = 0;
+         if (!simple_mix_of_records_and_arrays_p (elem, allow_char_arrays,
+                                                  &elc))
+           return false;
+         ds *= elc;
+         if (ds > INT_MAX)
+           return false;
+         *count_p += (int) ds;
+       }
       return true;
     }
   default:
@@ -1017,10 +1060,38 @@ scalarizable_type_p (tree type, bool const_decl)
   }
 }
 
-static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree, tree);
+static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree,
+                           tree);
+
+/* For a given array TYPE, return false if its domain does not have any maximum
+   value.  Otherwise calculate MIN and MAX indices of the first and the last
+   element.  */
+
+bool
+extract_min_max_idx_from_array (tree type, offset_int *min, offset_int *max)
+{
+  tree domain = TYPE_DOMAIN (type);
+  tree minidx = TYPE_MIN_VALUE (domain);
+  gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
+  tree maxidx = TYPE_MAX_VALUE (domain);
+  if (!maxidx)
+    return false;
+  gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
+
+  /* MINIDX and MAXIDX are inclusive, and must be interpreted in
+     DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
+  *min = wi::to_offset (minidx);
+  *max = wi::to_offset (maxidx);
+  if (!TYPE_UNSIGNED (domain))
+    {
+      *min = wi::sext (*min, TYPE_PRECISION (domain));
+      *max = wi::sext (*max, TYPE_PRECISION (domain));
+    }
+  return true;
+}
 
 /* Create total_scalarization accesses for all scalar fields of a member
-   of type DECL_TYPE conforming to scalarizable_type_p.  BASE
+   of type DECL_TYPE conforming to simple_mix_of_records_and_arrays_p.  BASE
    must be the top-most VAR_DECL representing the variable; within that,
    OFFSET locates the member and REF must be the memory reference expression for
    the member.  */
@@ -1047,27 +1118,14 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
       {
        tree elemtype = TREE_TYPE (decl_type);
        tree elem_size = TYPE_SIZE (elemtype);
-       gcc_assert (elem_size && tree_fits_shwi_p (elem_size));
        HOST_WIDE_INT el_size = tree_to_shwi (elem_size);
        gcc_assert (el_size > 0);
 
-       tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (decl_type));
-       gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
-       tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (decl_type));
+       offset_int idx, max;
        /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
-       if (maxidx)
+       if (extract_min_max_idx_from_array (decl_type, &idx, &max))
          {
-           gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
            tree domain = TYPE_DOMAIN (decl_type);
-           /* MINIDX and MAXIDX are inclusive, and must be interpreted in
-              DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
-           offset_int idx = wi::to_offset (minidx);
-           offset_int max = wi::to_offset (maxidx);
-           if (!TYPE_UNSIGNED (domain))
-             {
-               idx = wi::sext (idx, TYPE_PRECISION (domain));
-               max = wi::sext (max, TYPE_PRECISION (domain));
-             }
            for (int el_off = offset; idx <= max; ++idx)
              {
                tree nref = build4 (ARRAY_REF, elemtype,
@@ -1088,10 +1146,10 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
 }
 
 /* Create total_scalarization accesses for a member of type TYPE, which must
-   satisfy either is_gimple_reg_type or scalarizable_type_p.  BASE must be the
-   top-most VAR_DECL representing the variable; within that, POS and SIZE locate
-   the member, REVERSE gives its torage order. and REF must be the reference
-   expression for it.  */
+   satisfy either is_gimple_reg_type or simple_mix_of_records_and_arrays_p.
+   BASE must be the top-most VAR_DECL representing the variable; within that,
+   POS and SIZE locate the member, REVERSE gives its storage order, and REF must
+   be the reference expression for it.  */
 
 static void
 scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
@@ -1111,7 +1169,8 @@ scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
 }
 
 /* Create a total_scalarization access for VAR as a whole.  VAR must be of a
-   RECORD_TYPE or ARRAY_TYPE conforming to scalarizable_type_p.  */
+   RECORD_TYPE or ARRAY_TYPE conforming to
+   simple_mix_of_records_and_arrays_p.  */
 
 static void
 create_total_scalarization_access (tree var)
@@ -2803,8 +2862,9 @@ analyze_all_variable_accesses (void)
       {
        tree var = candidate (i);
 
-       if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var),
-                                               constant_decl_p (var)))
+       if (VAR_P (var)
+           && simple_mix_of_records_and_arrays_p (TREE_TYPE (var),
+                                                  constant_decl_p (var), NULL))
          {
            if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
                <= max_scalarization_size)
diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
new file mode 100644
index 00000000000..2857688b21e
--- /dev/null
+++ b/gcc/tree-sra.h
@@ -0,0 +1,34 @@
+/* tree-sra.h - Declarations of functions exported by tree-sra.c.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef TREE_SRA_H
+#define TREE_SRA_H
+
+
+bool simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays,
+                                        int *count_p);
+bool extract_min_max_idx_from_array (tree type, offset_int *min,
+                                    offset_int *max);
+tree build_ref_for_offset (location_t loc, tree base, HOST_WIDE_INT offset,
+                          bool reverse, tree exp_type,
+                          gimple_stmt_iterator *gsi, bool insert_after);
+
+
+
+#endif /* TREE_SRA_H */
-- 
2.14.2
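
As a rough source-level picture of what the expansion change does for the
assignment n_obj->reg = *reg in the new testcase, the sketch below contrasts
a block copy with a field-wise copy of R.  This is only an analogy: the patch
works on RTL during expansion, and the function names copy_blockwise and
copy_fieldwise are made up for this note.

```c
#include <assert.h>
#include <string.h>

/* Same layout as struct st1 in pr80689-1.c.  */
typedef struct st1
{
  unsigned long a, b;
  long c, d;
} R;

/* Roughly what emit_block_move amounts to: one opaque copy of the whole
   object, which the target may expand by pieces or with wide moves.  */
static void
copy_blockwise (R *dst, const R *src)
{
  assert (dst != NULL && src != NULL);
  memcpy (dst, src, sizeof *dst);
}

/* Roughly what the element-wise path amounts to: one move per scalar
   field, each with the same width as the store that wrote the field.  */
static void
copy_fieldwise (R *dst, const R *src)
{
  assert (dst != NULL && src != NULL);
  dst->a = src->a;
  dst->b = src->b;
  dst->c = src->c;
  dst->d = src->d;
}
```

R needs four moves, which is within the default max-insns-for-elementwise-copy
of 6 set in params.def above.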
