RE: [PATCH][AARCH64] Fix for PR86901

2020-04-30 Thread Modi Mo via Gcc-patches
Hey all, apologies for the long delay. I haven't had time until recently to look
into this further.
>>> The zero extract now matching against other modes would generate a
>>> test + branch rather than the combined instruction which led to the
>>> code size regression. I've updated the patch so that tbnz etc. matches GPI
>>> and that brings code size down to <0.2% in spec2017 and <0.4% in spec2006.
>>
>>That's looking better indeed. I notice there are still differences, e.g.
>>tbz/tbnz counts are significantly different in perlbench, with ~350 missed
>>cases overall (mostly tbz reg, #7).
>>
>>There are also more uses of uxtw, ubfiz, sbfiz - for example I see cases like 
>>this
>>in namd:
>>
>>  42c7dc:       13007400        sbfx    w0, w0, #0, #30
>>  42c7e0:       937c7c00        sbfiz   x0, x0, #4, #32
>>
>>So it would be a good idea to check any benchmarks where there is still a non-
>>trivial codesize difference. You can get a quick idea what is happening by
>>grepping for instructions like this:
>>
>>grep -c sbfiz out1.txt out2.txt
>>out1.txt:872
>>out2.txt:934
>>
>>grep -c tbnz out1.txt out2.txt
>>out1.txt:5189
>>out2.txt:4989

That's really good insight, Wilco! I took a look at the tbnz/tbz case in perlbench,
and we lose the match because allowing SImode on extv/extzv causes subst in
combine.c to generate:

(lshiftrt:SI (reg:SI 107 [ _16 ])
(const_int 7 [0x7]))
(nil)

Instead of:

(and:DI (lshiftrt:DI (subreg:DI (reg:SI 107 [ _16 ]) 0)
(const_int 7 [0x7]))
(const_int 1 [0x1]))

The latter case is picked up in make_compound_operation_int and transformed into
a zero_extract, while the new case is left alone. A lshiftrt generally can't be
reduced to a bit test, but here it can because we have nonzero_bits information
on the operand. Given that, looking around try_combine, the best place to detect
this pattern seems to be the second-chance code after the first failure of
recog_for_combine, which is what this patch does. I think this is the right place
for the fix, since changing subst/make_compound_operation_int instead leads to
significantly more diffs.

After this change the total number of tbnz/tbz instructions lines up nearly
identically with the baseline, which is good, and overall size is within 0.1% on
SPEC 2017 and SPEC 2006. However, looking further at ubfiz there's a pretty large
increase in certain benchmarks. I looked into SPEC 2017/blender, where we fail to
combine this pattern:

Trying 653 -> 654:
  653: r512:SI=r94:SI 0>>0x8
  REG_DEAD r94:SI
  654: r513:DI=zero_extend(r512:SI)
  REG_DEAD r512:SI
Failed to match this instruction:
(set (reg:DI 513)
(zero_extract:DI (reg:SI 94 [ bswapdst_4 ])
(const_int 8 [0x8])
(const_int 8 [0x8])))

Where previously we combined it like this:

Trying 653 -> 654:
  653: r512:SI=r94:SI 0>>0x8
  REG_DEAD r94:SI
  654: r513:DI=zero_extend(r512:SI)
  REG_DEAD r512:SI
Successfully matched this instruction:
(set (reg:DI 513)
(zero_extract:DI (subreg:DI (reg:SI 94 [ bswapdst_4 ]) 0) // subreg used
(const_int 8 [0x8])
(const_int 8 [0x8])))

Here's where I'm at an impasse. Now that extzvsi is valid, the code that
generates the modes in get_best_reg_extraction_insn looks at the SImode inner
mode and generates a non-subreg use. However, the MD pattern expects all modes to
be DI or SI, not a mix. A fix could canonicalize these extracts to a single mode,
but I'm unsure whether a mode-mismatched extract RTX is valid in general, which
would make this a fairly large change.

The latest patch with the tbnz/tbz fix is attached, alongside the SPEC codesize
numbers and the SPEC 2017 instruction counts for reference.

>>> Can you send me the necessary documents to make that happen? Thanks!
>>
>>That's something you need to sort out with the fsf. There is a mailing list 
>>for this:
>>mailto:ass...@gnu.org.

I haven't had any response to my previous mail there. Should I add one of you to
the CC, or mail someone specifically to get traction?

Best,
Modi


pr86901.patch
Description: pr86901.patch
text     data    bss     total    filename                                                            % increase
1038264  243802  12472   1294538  base/benchspec/CPU2006/400.perlbench/exe/perlbench_base.gcc9-base
1038344  243714  12472   1294530  diff/benchspec/CPU2006/400.perlbench/exe/perlbench_base.gcc9-diff   0.01%
72024    16030   4328    92382    base/benchspec/CPU2006/401.bzip2/exe/bzip2_base.gcc9-base
72016    16030   4328    92374    diff/benchspec/CPU2006/401.bzip2/exe/bzip2_base.gcc9-diff          -0.01%
2651976  816398  749792  4218166  base/benchspec/CPU2006/403.gcc/exe/gcc_base.gcc9-base
2652096  816558  749792  4218446
RE: [PATCH][AARCH64] Fix for PR86901

2020-02-21 Thread Modi Mo via gcc-patches
> > Sounds good. I'll get those setup and running and will report back on
> > findings. What's the preferred way to measure codesize? I'm assuming
> > by default the code pages are aligned so smaller differences would need to 
> > trip
> over the boundary to actually show up.
> 
> You can use the size command on the binaries:
> 
> >size /bin/ls
>    text    data     bss     dec     hex filename
>  107271    2024    3472  112767   1b87f /bin/ls
> 
> As you can see it shows the text size in bytes. It is not rounded up to a 
> page, so it
> is an accurate measure of the codesize. Generally -O2 size is most useful to
> check (since that is what most applications build with), but -Ofast -flto can 
> be
> useful as well (the global inlining means you get instruction combinations 
> which
> appear less often with -O2).
> 
> Cheers,
> Wilco
Alrighty, I've got SPEC 2017 and SPEC 2006 set up and building, using default
configurations, so -O2 in spec2006 and -O3 in spec2017. Testing the patch as last
sent showed a 1% code size regression in SPEC 2017 perlbench, which turned out to
be due to the pattern for tbnz and all its variants no longer matching:

(define_insn "*tb<optab><mode>1"
  [(set (pc) (if_then_else
               (EQL (zero_extract:DI (match_operand:GPI 0 "register_operand"
"r")  <--- only matches against zero_extract:DI
                    (const_int 1)
                    (match_operand 1
                      "aarch64_simd_shift_imm_<mode>" "n"))

The zero extract now matching against other modes would generate a test + branch
rather than the combined instruction, which led to the code size regression. I've
updated the patch so that tbnz etc. matches GPI, and that brings code size down
to <0.2% in spec2017 and <0.4% in spec2006.

Spec results are attached for reference.

@Wilco I've gotten instructions on my side to set up an individual contributor's
license for the time being. Can you send me the necessary documents to make that
happen? Thanks!

ChangeLog:
2020-02-21  Di Mo  

gcc/
* config/aarch64/aarch64.md: Add GPI modes to extv/extzv patterns.
Allow tb<optab><mode>1 pattern to match against zero_extract:GPI.
* expmed.c (extract_bit_field_using_extv): Change gen_lowpart to 
gen_lowpart_if_possible to avoid compiler assert building libgcc.
testsuite/
* gcc.target/aarch64/pr86901.c: Add new test.

Modi


pr86901.patch
Description: pr86901.patch


spec2006
Description: spec2006


spec2017
Description: spec2017


RE: [PATCH][AARCH64] Fix for PR86901

2020-02-06 Thread Modi Mo via gcc-patches
> I did a quick bootstrap, this shows several failures like:
> 
> gcc/builtins.c:9427:1: error: unrecognizable insn:
>  9427 | }
>   | ^
> (insn 212 211 213 24 (set (reg:SI 207)
> (zero_extract:SI (reg:SI 206)
> (const_int 26 [0x1a])
> (const_int 6 [0x6]))) "/gcc/builtins.c":9413:44 -1
>  (nil))
> 
> The issue here is that 26+6 = 32 and that's not a valid ubfx encoding.
> Currently cases like this are split into a right shift in aarch64.md around 
> line
> 5569:

Appreciate you taking a look and the validation. I've gotten access to an
aarch64 server, and the bootstrap reproduced the issue you saw. It was caused by
my redefinition of the pattern to:
+  if (width == 0 || (pos + width) > GET_MODE_BITSIZE (mode))
+    FAIL;

This meant that for SImode only a sum of more than 32 bits actually triggers the
FAIL condition in the define_expand, whereas the existing define_insn fails on a
sum of 32 bits or more. I looked into the Architecture Reference Manual: the bits
are available in the ubfx/sbfx encoding for that case, and the documentation says
[lsb, 32-lsb] is a legal pair for SImode. Checking with the GNU assembler, it
does accept a sum of 32 but transforms the instruction into an LSR:

Assembly file:

        ubfx    w0, w0, 24, 8

Disassembly of section .text:

   0:   53187c00        lsr     w0, w0, #24

Similarly, the 64-bit version becomes a 64-bit LSR. Other assemblers could
certainly trip over this; I've attached a new patch that allows this encoding,
and bootstrap plus C/C++ testsuite runs look good. I'll defer to you on whether
it's better to do the transformation explicitly in GCC.

> ;; When the bit position and width add up to 32 we can use a W-reg LSR ;;
> instruction taking advantage of the implicit zero-extension of the X-reg.
> (define_split
>   [(set (match_operand:DI 0 "register_operand")
> (zero_extract:DI (match_operand:DI 1 "register_operand")
>  (match_operand 2
>"aarch64_simd_shift_imm_offset_di")
>  (match_operand 3
>"aarch64_simd_shift_imm_di")))]
>   "IN_RANGE (INTVAL (operands[2]) + INTVAL (operands[3]), 1,
>  GET_MODE_BITSIZE (DImode) - 1)
>&& (INTVAL (operands[2]) + INTVAL (operands[3]))
>== GET_MODE_BITSIZE (SImode)"
>   [(set (match_dup 0)
> (zero_extend:DI (lshiftrt:SI (match_dup 4) (match_dup 3]
>   {
> operands[4] = gen_lowpart (SImode, operands[1]);
>   }
> 
> However that only supports DImode, not SImode, so it needs to be changed
> to be more general using GPI.
> 
> Your new extv patterns should replace the magic patterns above it:

With the previous discovery that a sum of 32/64 triggers an LSR in the assembler,
I was curious what would happen if I removed this pattern. It turns out you end
up with a UBFX x0, x0, 24, 8 instead of an LSR w0, w0, 24 in the test case
associated with this change (gcc.target/aarch64/ubfx_lsr_1.c), because the
assembler doesn't transform it into an LSR when it's in 64-bit mode. So this
pattern still has value, but I don't think it's necessary to extend it to DI,
since that case automatically gets turned into an LSR by the assembler as I
previously mentioned.


> ;; ---
> ;; Bitfields
> ;; ---
> 
> (define_expand ""
> 
> These are the current extv/extzv patterns, but without a mode. They are no
> longer used when we start using the new ones.
> 
> Note you can write <optab> to combine the extzv and extv patterns.
> But please add a comment mentioning the pattern names so they are easy to
> find!

Good call here; I made this change in the new patch. I did see the define_insn
versions of these below, but somehow missed that they were expanded just above. I
believe, from my understanding of GCC, that the matching pattern below the first
line is what constrains <optab> to just extv/extzv out of the long list of
iterators it belongs to. Still, I see that there are constrained iterators
elsewhere, like:

;; Optab prefix for sign/zero-extending operations
(define_code_attr su_optab [(sign_extend "") (zero_extend "u")

I added a comment before the pattern in this patch. Thoughts on defining another
constrained version to make it clearer (in addition to, or in lieu of, the
comment)?

> Besides a bootstrap it is always useful to compile a large body of code with
> your change (say SPEC2006/2017) and check for differences in at least
> codesize. If there is an increase in instruction count then there may be more
> issues that need to be resolved.

Sounds good. I'll get those setup and running and will report back on findings. 
What's the preferred way to measure codesize? I'm assuming by default the code 
pages are aligned so smaller differences would need to trip over the boundary 
to actually show up. 
 
> I find it easiest to develop on a many-core AArch64 

[PATCH][AARCH64] Fix for PR86901

2020-02-04 Thread Modi Mo via gcc-patches
Hi,

Adding support for extv and extzv on aarch64 as described in PR86901. I also
changed extract_bit_field_using_extv to use gen_lowpart_if_possible instead of
gen_lowpart directly. Using gen_lowpart directly ICEs while building libgcc when
the compiler fails to generate the lowpart, whereas gen_lowpart_if_possible bails
out of matching this pattern gracefully.

I'm looking through past mails and https://gcc.gnu.org/contribute.html, which
details bootstrap testing. I'm building a cross-compiler (x64_aarch64) and the
instructions don't address that scenario. The GCC cross-build is green and there
are no regressions in the C/C++ tests (Go/Fortran etc. look like they need
additional infrastructure built on my side to work). Is there a workflow for
cross-builds, or should I aim to get an aarch64 native machine for full
validation?


ChangeLog:
2020-02-03  Di Mo  

gcc/
* config/aarch64/aarch64.md: Add define_expand for extv and 
extzv.
* expmed.c (extract_bit_field_using_extv): Change gen_lowpart to 
gen_lowpart_if_possible to avoid compiler assert building libgcc.
testsuite/
* gcc.target/aarch64/pr86901.c: Add new test.


Best,
Modi


pr86901.patch
Description: pr86901.patch