Re: [AArch64] Backporting -moutline-atomics to gcc 9.x and 8.x

2020-02-27 Thread Kyrill Tkachov

Hi Sebastian,

On 2/27/20 4:53 PM, Pop, Sebastian wrote:


Hi,

Is somebody already working on backporting -moutline-atomics to the GCC
8.x and 9.x branches?



I'm not aware of such work going on.

Thanks,

Kyrill


Thanks,

Sebastian



Re: Kyrylo Tkachov and Richard Sandiford appointed AArch64 maintainers.

2019-09-26 Thread Kyrill Tkachov


On 9/26/19 8:02 AM, Ramana Radhakrishnan wrote:

Hi,

I'm pleased to announce that the GCC steering committee has appointed
Kyrylo Tkachov and Richard Sandiford as AArch64 maintainers.

Please join me in congratulating them both on their additional roles
in the community. Kyrill / Richard, please update your listings in the
MAINTAINERS file.


Thanks!

Committed the attached with r276142.

Kyrill

2019-09-26  Kyrylo Tkachov  

    * MAINTAINERS: Add myself as aarch64 maintainer.


Thanks,
Ramana
diff --git a/MAINTAINERS b/MAINTAINERS
index 948d56d8346ba2df42142955910d4e8a74f568e5..4bbedb4e5c06ac341abc0e2be3720376893a17f4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -46,6 +46,7 @@ docs, and the testsuite related to that.
 aarch64 port		Richard Earnshaw	
 aarch64 port		James Greenhalgh	
 aarch64 port		Marcus Shawcroft	
+aarch64 port		Kyrylo Tkachov		
 aarch64 SVE port	Richard Sandiford	
 alpha port		Richard Henderson	
 amdgcn port		Julian Brown		


Re: Can we vectorize the code below ?

2019-06-12 Thread Kyrill Tkachov

Hi Lijia,

On 6/12/19 4:22 AM, Li Jia He wrote:

Hi,

I recently did some analysis of the automatic vectorization in GCC, and I
found that signed char cannot be vectorized in the following code.

---
#define ITERATIONS 100

#if defined(do_reduce_signed_char)
#define TYPE signed char
#elif defined(do_reduce_unsigned_char)
#define TYPE unsigned char
#else
#error bad define
#endif

#define SIZE (16384/sizeof(TYPE))

static TYPE x[SIZE] __attribute__ ((aligned (16)));

void obfuscate(void *a, ...);

static void __attribute__((noinline)) do_one(void)
{
 unsigned long i;
 TYPE a = 0;

 obfuscate(x);

 for (i = 0; i < SIZE; i++)
   a += x[i];

 obfuscate(x, a);
}

int main(void)
{
 unsigned long i;

 for (i = 0; i < ITERATIONS; i++)
   do_one();

 return 0;
}
---
If we use the following command line

gcc reduce.c -Ddo_reduce_unsigned_char -Ofast -c -S -fdump-tree-vect-details


We can see that this code can be vectorized under the unsigned char data
type.
If we use the following command

gcc reduce.c -Ddo_reduce_signed_char -Ofast -c -S -fdump-tree-vect-details

We can see that this code cannot be vectorized under the signed char
data type.
I found that for signed char, the statement below
---
a += x[i];
---
will undergo a conversion like the following.
---
a = (signed char) ((unsigned char) x[i] + (unsigned char) a);
---
As a result, the reduction in the code cannot be effectively identified.
Can we vectorize code like the above when the data type is signed
char?


This looks like the known limitation 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930


Thanks,

Kyrill




Thanks,
Lijia He



Re: autovectorization in gcc

2019-01-09 Thread Kyrill Tkachov

Hi Kay,

On 09/01/19 08:29, Kay F. Jahnke wrote:

Hi there!

I am developing software which tries to deliberately exploit the
compiler's autovectorization facilities by feeding data in
autovectorization-friendly loops. I'm currently using both g++ and
clang++ to see how well this approach works. Using simple arithmetic, I
often get good results. To widen the scope of my work, I was looking for
documentation on which constructs would be recognized by the
autovectorization stage, and found

https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html



Yeah, that page hasn't been updated in ages AFAIK.


By the looks of it, this document has not seen any changes for several
years. Has development on the autovectorization stage stopped, or is
there simply no documentation?



There's plenty of work being done on auto-vectorisation in GCC.
Auto-vectorisation is a performance optimisation and as such is not really
a user-visible feature that absolutely requires user documentation.


In my experience, vectorization is essential to speed up arithmetic on
the CPU, and reliable recognition of vectorization opportunities by the
compiler can provide vectorization to programs which don't bother to
code it explicitly. I feel the topic is being neglected - at least the
documentation I found suggests this. To demonstrate what I mean, I have
two concrete scenarios which I'd like to be handled by the
autovectorization stage:

- gather/scatter with arbitrary indexes

In C, this would be loops like

// gather from B to A using gather indexes

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = B [ indexes [ i ] ] ;

 From the AVX2 ISA onwards, there are hardware gather/scatter
operations, which can speed things up a good deal.

- repeated use of vectorizable functions

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = sqrt ( B [ i ] ) ;

Here, replacing the repeated call of sqrt with the vectorized equivalent
gives a dramatic speedup (ca. 4X).



I believe GCC will do some of that already, given a high enough optimisation level
and suitable floating-point options.
Do you have examples where it doesn't? Testcases with self-contained source code
and compiler flags would be useful to analyse.
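
For instance, a self-contained sketch of the sqrt case (assuming single-precision
data and flags such as -Ofast -fdump-tree-vect-details, so the libm call can be
replaced by a vector square root) would be enough to look at:

#include <math.h>

#define VSZ 1024

float A[VSZ], B[VSZ];

void
vec_sqrt (void)
{
  /* Each iteration is independent, so the sqrtf call can be replaced
     by a vector square root where the target provides one.  */
  for (int i = 0; i < VSZ; i++)
    A[i] = sqrtf (B[i]);
}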


If the compiler were to provide the autovectorization facilities, and if
the patterns it recognizes were well-documented, users could rely on
certain code patterns being recognized and autovectorized - sort of a
contract between the user and the compiler. With a well-chosen spectrum
of patterns, this would make it unnecessary to have to rely on explicit
vectorization in many cases. My hope is that such an interface would
help vectorization to become more frequently used - as I understand the
status quo, this is still a niche topic, even though many processors
provide suitable hardware nowadays.



I wouldn't say it's a niche topic :)
From my monitoring of GCC development over the last few years there have been lots
of improvements in auto-vectorisation in compilers (at least in GCC).

The thing is, auto-vectorisation is not always profitable for performance.
Sometimes the runtime loop iteration count is so low that setting up the 
vectorised loop
(alignment checks, loads/permutes) is slower than just doing the scalar form,
especially since SIMD performance varies from CPU to CPU.
So we would want the compiler to have the freedom to make its own judgement on 
when
to auto-vectorise rather than enforce a "contract". If the user really only 
wants
vector code, they should use one of the explicit programming paradigms.

HTH,
Kyrill


Can you point me to where 'the action is' in this regard?

With regards

Kay F. Jahnke






Re: Testing compiler reliability using Csmith

2018-12-07 Thread Kyrill Tkachov

Hi Radu,

On 06/12/18 22:10, Radu Ometita wrote:

Hello everyone!

We are working on writing a paper about testing the reliability of C compilers 
by using Csmith (a random C99 program generator).

A previous testing effort, using Csmith, found 79 GCC bugs, and 25 of those have been
marked by developers as P1 (https://www.flux.utah.edu/download?uid=114).
However, after this paper was published we are unaware of any further testing using
Csmith, and we would like to ask you if you are aware of any such efforts or further
results.



We've had a large number of really good bug reports come out of the research at:
https://people.inf.ethz.ch/suz/emi/index.html

If I understand the research correctly, the seed programs are either 
csmith-generated or taken from the GCC testsuite
and the mutations applied to them expose compiler bugs.

Thanks,
Kyrill


Best regards,
Radu Ometita,
Functional compilers engineer @IOHK





Re: Semantics of SAD_EXPR and usad/ssad optabs

2018-05-10 Thread Kyrill Tkachov

Hi Richard,

On 09/05/18 19:37, Richard Biener wrote:

On May 9, 2018 6:19:47 PM GMT+02:00, Kyrill Tkachov wrote:

Hi all,

I'm looking into implementing the usad/ssad optabs for aarch64 to catch
code like in PR 85693
and I'm a bit lost with what the midend expects the optabs to produce.
The documentation for them says that the addend operand (op 3) is of
mode equal or wider than
the mode of the product (and consequently of operands 1 and 2) with the
result operand 0 being
the same mode as operand 3.

The x86 implementation for usadv16qi (for example) takes a V16QI vector
and returns a V4SI vector.
I'm confused as to what is the reduction logic expected by the midend?
The PSADBW instruction that x86 uses in that case accumulates the two
V8QI halves of the input into
two 16-bit values (you don't need any more bits to represent a sum of 8
byte differences I believe):
one placed at bit 0, and the other placed at bit 64. The bit ranges [16
- 63] and [80 - 127] are left as zeroes.
So it produces a V2DI result in essence.

If the input V16QI vectors look like:
{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15
}
{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15
}

then the result V4SI view (before being added into operand 3) is:
{ SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }
(1)

whereas a normal widening reduction of V16QI -> V4SI to me would look
more like:

{ SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])), SUM (ABS
(a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }  (2)

My question is, does the vectoriser depend on the semantics of [us]sad
producing the result in (1)?

No, it doesn't. It is required that any association of the embedded reduction
is correct and thus this requires appropriate -ffast-math flags. Note it's
also the reason why we do not implement constant folding of SAD.


At the moment I'm looking at the integer modes, so I guess reassociation and 
-ffast-math doesn't come into play, but I'll keep that in mind.


If so, do you think it's worth clarifying in the documentation?

Probably yes - but I'm not sure the current state of affairs is best... Do
other targets implement the same reduction order as x86? Other similar
reduction ops have high/low or even/odd variants. But they also do not reduce
the outputs.


AFAICS only x86 and powerpc implement this so far. The powerpc implementation 
synthesises the V16QI -> V4SI reduction using multiple instructions.
The result it produces is variant (2) in my original post. So the two ports 
differ.

From a purely target implementation perspective it is convenient to not impose 
any particular reduction strategy.
If we say that the only requirement from the [us]sad optabs is that the result 
vector should be suitable for a full V4SI -> SI reduction
but not rely on any particular approach, then each target can provide its 
optimal sequence.

For example, an aarch64 implementation I'm experimenting with now would compute 
the V16QI -> V16QI absolute differences vector,
reduce that into a single HImode value (there is a full widening reduction 
instruction in aarch64 for that) and then do a widening add of
that value into element zero of the result V4SI vector. Following the notation 
above, this would produce from:

{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }

the V4SI result:

{ SUM (ABS (a[0-15] - b[0-15])), 0, 0, 0 }

Matching the x86 or powerpc strategy would require a more costly sequence on 
aarch64, but of course this would only be
safe if we had some guarantees that the midend won't rely on any particular 
reduction strategy and just treat it as a vector
on which to perform a full reduction at the end of a loop.
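
For reference, the shape of loop these optabs are meant to catch is roughly the
following (an illustrative sketch, not the exact testcase from the PR): a
sum of absolute differences over bytes accumulated in a wider type.

unsigned int
sad (const unsigned char *a, const unsigned char *b, int n)
{
  unsigned int sum = 0;
  for (int i = 0; i < n; i++)
    {
      /* Absolute difference of two bytes, accumulated into a wider type.  */
      int diff = a[i] - b[i];
      sum += diff >= 0 ? diff : -diff;
    }
  return sum;
}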

Thanks,
Kyrill


Note DOT_PROD has the very same issue.

Richard.


Thanks,
Kyrill




Semantics of SAD_EXPR and usad/ssad optabs

2018-05-09 Thread Kyrill Tkachov

Hi all,

I'm looking into implementing the usad/ssad optabs for aarch64 to catch code 
like in PR 85693
and I'm a bit lost with what the midend expects the optabs to produce.
The documentation for them says that the addend operand (op 3) is of mode equal 
or wider than
the mode of the product (and consequently of operands 1 and 2) with the result 
operand 0 being
the same mode as operand 3.

The x86 implementation for usadv16qi (for example) takes a V16QI vector and 
returns a V4SI vector.
I'm confused as to what is the reduction logic expected by the midend?
The PSADBW instruction that x86 uses in that case accumulates the two V8QI 
halves of the input into
two 16-bit values (you don't need any more bits to represent a sum of 8 byte 
differences I believe):
one placed at bit 0, and the other placed at bit 64. The bit ranges [16 - 63] 
and [80 - 127] are left as zeroes.
So it produces a V2DI result in essence.

If the input V16QI vectors look like:
{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }

then the result V4SI view (before being added into operand 3) is:
{ SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }   (1)

whereas a normal widening reduction of V16QI -> V4SI to me would look more like:

{ SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])), SUM (ABS (a[8-11] - 
b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }  (2)

My question is, does the vectoriser depend on the semantics of [us]sad 
producing the result in (1)?
If so, do you think it's worth clarifying in the documentation?

Thanks,
Kyrill


Re: Load and parse RTL from textual dump files

2017-12-18 Thread Kyrill Tkachov

Hi,

On 18/12/17 10:30, HEBBAL Yacine wrote:

Hello,
In one of my projects, I need to determine automatically what are the 
names
and types of data fields manipulated by functions in binary code of a 
given

program (e.g. Linux kernel).
I found that RTL dumps contains most of information I need in a form very
close to the one of the binary code.
For this end, I need to parse generated RTL expressions in order to 
extract

operands properties.
Is it possible to load and parse RTL expressions from textual dump files
using existing code in GCC ? Thanks


Since version 7, GCC contains functionality to parse RTL input.
This is mostly used for writing more targeted unit tests for the RTL passes.
Have a look at https://gcc.gnu.org/onlinedocs/gccint/RTL-Tests.html or 
grep for "__RTL"
in the gcc testsuite (gcc/testsuite in the source tree) to see examples 
of how it's used.


Not sure how useful it will be for large-scale analysis of source code 
though.

You may want to write/use a GCC plugin for that.

Kyrill

P.S. This list is used for discussions about the development of GCC itself.
For help using GCC please use the gcc-help list in the future.


Re: Announcing ARM and AArch64 port maintainers.

2017-09-11 Thread Kyrill Tkachov


On 09/09/17 12:44, Ramana Radhakrishnan wrote:

I'm pleased to announce that the steering committee has appointed

-  James Greenhalgh as a full maintainer for the AArch64 port

and

-  Kyrylo Tkachov as a full maintainer for the ARM port.

James & Kyrylo, if you could update your entries in the MAINTAINERS
file to reflect these roles, it would be appreciated.


Thank you for your trust. I look forward to continuing contributing to GCC!
I've committed this patch to trunk as r251979.

Kyrill

2017-09-11  Kyrylo Tkachov  

* MAINTAINERS (Reviewers): Move myself from here...
(CPU Port Maintainers): ... to here.

commit ec06d430aba1698fa4e653b8dc93bcad10852fb3
Author: Kyrylo Tkachov 
Date:   Mon Sep 11 10:27:00 2017 +0100

[MAINTAINERS] Add myself as ARM maintainer

diff --git a/MAINTAINERS b/MAINTAINERS
index 2ed1ef9..e5b9bc1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -47,6 +47,7 @@ arc port		Joern Rennecke		
 arm port		Nick Clifton		
 arm port		Richard Earnshaw	
 arm port		Ramana Radhakrishnan	
+arm port		Kyrylo Tkachov		
 avr port		Denis Chertykov		
 bfin port		Jie Zhang		
 c6x port		Bernd Schmidt		
@@ -252,7 +253,6 @@ check in changes outside of the parts of the compiler they maintain.
 
 arc port		Andrew Burgess		
 arc port		Claudiu Zissulescu	
-arm port		Kyrylo Tkachov		
 C front end		Marek Polacek		
 dataflow		Paolo Bonzini		
 dataflow		Seongbae Park		


Re: 'make check' questions

2017-05-11 Thread Kyrill Tkachov


On 11/05/17 11:43, Simon Wright wrote:

I see from https://gcc.gnu.org/install/test.html that it's possible to run 
tests in parallel. I get the impression from gcc/Makefile that the check 
concerned has to be set up in the Makefile (in my build tree, configured with 
--target=x86_64-apple-darwin16 
--enable-languages=c,c++,ada,fortran,objc,obj-c++ , I see both lang_checks and 
lang_checks_parallelized set empty). So, is it necessary for check-ada or 
check-acats to cope with being run in parallel (i.e., will they ever see 
GCC_RUNTEST_PARALLELIZE_DIR set?)


I don't usually build Ada, but testing with "make -jN check" works for me, where N
is the parallelism I want.


Also in https://gcc.gnu.org/install/test.html, under what circumstances would a 
test report ERROR (the testsuite detected an error) or WARNING (the testsuite 
detected a possible problem)? For example, if a particular test that should compile 
& run has a build error, is that a FAIL or an ERROR?


ERROR results are usually problems with the testsuite infrastructure, like
malformed DejaGnu directives. They don't usually appear in a clean test run.
If a test fails to build due to a compiler problem, i.e. an ICE or other bug, it
will be a FAIL. If the test harness has a problem with the testsuite directive
syntax, I think it will be reported as an ERROR.

I usually get WARNINGs if a runtime test times out. It can happen when testing 
against simulators or if the test was miscompiled into an infinite loop.

Kyrill




Re: Right way to represent flag-setting arithmetic instructions in MD files

2017-03-13 Thread Kyrill Tkachov


On 10/03/17 23:56, Segher Boessenkool wrote:

On Fri, Mar 10, 2017 at 10:30:33AM +, Kyrill Tkachov wrote:

I'm trying to improve the cases where the result of the arithmetic
operation is used in multiple places besides the comparison.
For example:
 add w0, w0, w1
 add w1, w0, 2
 cmp w0, 0

Combine will not attempt to merge the first ADD and CMP because W0 is used
in the second ADD.

So the LOG_LINK from the first instruction will point at the second,
and combining the first and the third isn't considered (because the
first and second insns don't combine).


Yeah, that's what I'm seeing.


combine doesn't try to combine all producer-consumer pairs, only
producer with first consumer, because it would not often help and
could easily take much more time.  On the other hand I'd love to get
rid of the LOG_LINKS and use DF directly.


Are there any significant barriers to moving to DF?
I heard some other passes (e.g. LRA) are reluctant to use
DF because it's too slow.



Note that combining the first and third insns in your example requires
to put the combined insns in the place of the first insn, where
normally it would put it at the third insn.  Maybe we could treat
compares specially?


We'd also need to do a SELECT_CC_MODE if we're replacing the operands of the 
comparison
and update the users of the CC reg from that comparison I believe.



Do such cases happen a lot, btw?


I hacked together an extension to the cmpelim pass that merges such comparisons
(that previous passes like combine don't catch) and on aarch64 for SPEC2006 it 
merged
about 480 compares. If combine were extended as you described above I think it 
could catch
it, but it's a more complex and fragile pass than I feel comfortable hacking 
for this :)

Thanks,
Kyrill


Segher




Re: Right way to represent flag-setting arithmetic instructions in MD files

2017-03-10 Thread Kyrill Tkachov



On 10/03/17 10:23, Eric Botcazou wrote:

My understanding was that the order of the two in this pattern here doesn't
matter because there is an implicit PARALLEL around them, but I found that
the compare-elimination pass (compare-elim.c) assumes that the COMPARE set
must be in the second position for it to do the transformations it wants.

Why do you want to use the compare-elimination pass exactly if the flags are
exposed before reload, as is the case on Aarch64 I think?  The combiner is
supposed to do the same job instead for these targets.



I'm trying to improve the cases where the result of the arithmetic
operation is used in multiple places besides the comparison.
For example:
add w0, w0, w1
add w1, w0, 2
cmp w0, 0

Combine will not attempt to merge the first ADD and CMP because W0 is used
in the second ADD. The compare-elimination pass so far looks like a far simpler
place to implement this transformation than combine.
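
A small self-contained function that produces this shape of code (illustrative
only, not the exact case I'm looking at) is:

int
f (int x, int y, int *out)
{
  int t = x + y;   /* first ADD; its result is also compared below */
  *out = t + 2;    /* second use of t keeps it live past the compare */
  return t == 0;   /* CMP of the first ADD's result against zero */
}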

Thanks,
Kyrill


Right way to represent flag-setting arithmetic instructions in MD files

2017-03-10 Thread Kyrill Tkachov

Hi all,

Some (many?) targets have instructions that perform an arithmetic operation and 
set the condition flags based on the result.
For example, on aarch64, we have instructions like ADDS, SUBS, ANDS etc.
In the machine description we represent them as a PARALLEL pattern of a COMPARE 
and the arithmetic operation.
For example, the ADDS instruction is represented as:

(define_insn "add3_compare0"
  [(set (reg:CC_NZ CC_REGNUM)
(compare:CC_NZ
 (plus:GPI (match_operand:GPI 1 "register_operand" "%r,r,r")
   (match_operand:GPI 2 "aarch64_plus_operand" "r,I,J"))
 (const_int 0)))
   (set (match_operand:GPI 0 "register_operand" "=r,r,r")
(plus:GPI (match_dup 1) (match_dup 2)))]

My understanding was that the order of the two in this pattern here doesn't 
matter because there is
an implicit PARALLEL around them, but I found that the compare-elimination pass 
(compare-elim.c)
assumes that the COMPARE set must be in the second position for it to do the 
transformations it wants.

Is there a recommended order for specifying the compare and the arithmetic 
operation in the MD files?
(in which case we should go through the aarch64 MD files and make sure the 
patterns are written the right
way round). Or is the compare-elimination pass just not robust enough? (In 
which case we should teach it
to look into both SETs of the pattern).

Thanks,
Kyrill


Re: ICE on using -floop-nest-optimize

2017-01-06 Thread Kyrill Tkachov


On 06/01/17 14:22, Toon Moene wrote:

On the attached (Fortran) source, the following version of gfortran draws an 
ICE:

$ gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.2.1-5' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --disable-browser-plugin 
--enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64 
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc=auto --enable-multiarch --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic 
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu

Thread model: posix
gcc version 6.2.1 20161124 (Debian 6.2.1-5)

using the following command line arguments:

gfortran -S -g -Ofast -fprotect-parens -fbacktrace -march=native -mtune=native 
-floop-nest-optimize corr_to_spec_2D.F

The error message is:

corr_to_spec_2D.F:3:0:

   subroutine corr_to_spec_2D(nx_local,ny_local,

internal compiler error: in create_pw_aff_from_tree, at 
graphite-sese-to-poly.c:445
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.

I will retry this with trunk gfortran as soon as my automatic builds have 
constructed that compiler.

In the mean time - anyone has a clue ?



Looks like PR 69823?

Kyrill


Thanks,





Setting alias set and vuse/vdef on gimple statements

2016-06-17 Thread Kyrill Tkachov

Hi all,

I'm working on a tree-ssa pass to implement PR 22141, a pass that merges 
adjacent stores.
I've gotten to the point where I can identify the adjacent accesses, merge them 
into a single value
and am now working on emitting the new statements but, as I don't have a lot of 
experience with the gimple
machinery, am not sure what to do about alias sets and other bookkeeping.
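
For illustration, the kind of code the pass is meant to handle is something like
the following (a sketch with made-up field names):

struct s
{
  unsigned char a, b, c, d;
};

void
init_s (struct s *p)
{
  /* Four adjacent byte stores that should become a single 32-bit store.  */
  p->a = 1;
  p->b = 2;
  p->c = 3;
  p->d = 4;
}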

At the point where I'm emitting the single wide store to replace a number of 
narrow consecutive stores
I construct a MEM_REF that I assign the wide merged value to.
I think I need to also set its alias info but am not sure how to construct it.

Conceptually I need the disjunction of all the alias sets of the stores that 
the new store replaces
but I'm not sure how to get that. I can get the alias set of a single gimple 
statement through
get_alias_set of the LHS of each gimple assignment but how do I merge them?
I don't see a helper function for that that springs to mind...

Also, from what I understand gimple statements that write to memory have these 
vdef operands but I'm
not sure what the vdef operand for the new store that replaces the series of 
adjacent stores should be
set to (or how to construct it).

Any guidance on this would be very appreciated.

Thanks,
Kyrill


Re: Implementing atomic load as compare-and-swap for read-only memory

2016-06-03 Thread Kyrill Tkachov

Hi Jakub, Torvald,

On 03/06/16 13:32, Jakub Jelinek wrote:

On Fri, Jun 03, 2016 at 02:26:09PM +0200, Torvald Riegel wrote:

And that would be fine, IMO.  If you can't even load atomically, doing
something useful with this type will be hard except in special cases.
Also, doing a CAS (compare-and-swap) and thus potentially bringing in
the cache line in exclusive mode can be a lot more costly than what
users might expect from a load.  A short critical section might not be
much slower.

If you only have a CAS as base of the atomic operations on a type, then
a CAS operation exposed to the user will still be a just a single HW
CAS.  But any other operation besides the CAS and a load will need *two*
CAS operations; even an atomic store has to be implemented as a CAS
loop.

Would we just stop expanding all those __atomic_*/__sync_* builtins inline
then (which would IMHO break tons of stuff), or just some predicate that
atomic.h/atomic headers use?


But doesn't that mean you should fall back to locked operation also for any
other atomic operation on such types, because otherwise if you atomic_store
or any other kind of atomic operation, it wouldn't use the locking, while
for atomic load it would?

I suppose you mean that one must fall back to using locking for all
operations?  If load isn't atomic, then it can't be made atomic using
external locks if the other operations don't use the locks.


That would be an ABI change and quite significant
pessimization in many cases.

A change from wide CAS to locking would be an ABI change I suppose, but
it could also be considered a necessary bugfix if we don't want to write
to read-only memory.  Does this affect anything but i686?

Also x86_64 (for 128-bit atomics), clearly also either arm or aarch64
(judging from who initiated this thread), I bet there are many others.


I'm looking at pre-LPAE ARMv7-A targets for which the
ARM Architecture Reference Manual (rev C.c) section A3.5.3 recommends:
"The way to atomically load two 32-bit quantities is to perform a
LDREXD/STREXD sequence, reading and writing the same value, for which the
STREXD succeeds, and use the read values."

Currently we emit just a single load-doubleword-exclusive which, according to 
the above,
would not be enough on such targets.

On aarch64 doubleword (128 bit) atomic loads are done through locks (PR 70814).

Kyrill


Jakub




Implementing atomic load as compare-and-swap for read-only memory

2016-06-03 Thread Kyrill Tkachov

Hi all,

expand_atomic_load in optabs.c tries to expand a wide atomic load using an 
atomic_compare_and_swap
with the comment saying that sometimes a redundant harmless store may be 
performed.
Is the store really valid if the memory is read-only?

I've been looking at implementing a similar compare-and-swap strategy for 
atomic_loaddi for some
arm targets and this concern came up. I don't think GCC can statically prove 
that a particular
piece of memory is guaranteed to be writeable at runtime in all cases, so 
emitting a spurious
store would not always be valid.
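
As an illustration (a sketch, not taken from a particular bug report), an atomic
doubleword object can legitimately be const and may well end up in read-only
memory, so a compare-and-swap expansion of its load could fault:

#include <stdatomic.h>
#include <stdint.h>

/* May be placed in a read-only section; a CAS-based load would perform
   a (redundant but real) store to it.  */
static const _Atomic uint64_t ro_value = 42;

uint64_t
read_value (void)
{
  return atomic_load_explicit (&ro_value, memory_order_seq_cst);
}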

I see this concern was already raised in 
https://gcc.gnu.org/ml/gcc-patches/2011-11/msg00278.html
but that doesn't seem to have gone anywhere.

Any thoughts? Should we remove the assumption that atomic loads always access 
writeable memory?

Thanks,
Kyrill


Re: GCC 5.4 Release Candidate available from gcc.gnu.org

2016-05-31 Thread Kyrill Tkachov


On 31/05/16 12:17, Richard Biener wrote:

On Tue, 31 May 2016, Kyrill Tkachov wrote:


Hi Richard,

On 27/05/16 12:43, Richard Biener wrote:

The first release candidate for GCC 5.4 is available from

   ftp://gcc.gnu.org/pub/gcc/snapshots/5.4.0-RC-20160527

and shortly its mirrors.  It has been generated from SVN revision 236809.

I have so far bootstrapped the release candidate on x86_64-suse-linux-gnu.

Please test the release candidate and report any issues to bugzilla.

If all goes well I'd like to release GCC 5.4 at the beginning of next
week.


Bootstrap and test on arm-none-linux-gnueabihf looks fine.
Unfortunately, on aarch64-none-linux-gnu I noticed a regression compared to
GCC 5.3:
FAIL: gcc.target/aarch64/vbslq_u64_1.c scan-assembler-times bif\\tv 1

This is PR 68696 that has been triggered on that branch.
The patch fixing that is at:
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg00826.html

I have bootstrapped and tested it on aarch64-none-linux-gnu on top of the 5.4
RC
and it fixes the regression. It applies cleanly to that branch.

Is it okay to backport it to the branch?

While it doesn't look like it has the same cause (r231178), the patch
looks safe to me to backport.


Indeed, the bug is a missing pattern in the backend that ends up
with us relying on xor+and sequences being emitted in a certain order
after the tree passes, so it can be triggered by any of a number of tree-level
changes.


Thus, ok.


Thanks,
I'll commit it shortly.

Kyrill



Thanks,
Richard.


Thanks,
Kyrill






Re: GCC 5.4 Release Candidate available from gcc.gnu.org

2016-05-31 Thread Kyrill Tkachov

Hi Richard,

On 27/05/16 12:43, Richard Biener wrote:

The first release candidate for GCC 5.4 is available from

  ftp://gcc.gnu.org/pub/gcc/snapshots/5.4.0-RC-20160527

and shortly its mirrors.  It has been generated from SVN revision 236809.

I have so far bootstrapped the release candidate on x86_64-suse-linux-gnu.

Please test the release candidate and report any issues to bugzilla.

If all goes well I'd like to release GCC 5.4 at the beginning of next
week.



Bootstrap and test on arm-none-linux-gnueabihf looks fine.
Unfortunately, on aarch64-none-linux-gnu I noticed a regression compared to GCC 
5.3:
FAIL: gcc.target/aarch64/vbslq_u64_1.c scan-assembler-times bif\\tv 1

This is PR 68696 that has been triggered on that branch.
The patch fixing that is at:
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg00826.html

I have bootstrapped and tested it on aarch64-none-linux-gnu on top of the 5.4 RC
and it fixes the regression. It applies cleanly to that branch.

Is it okay to backport it to the branch?

Thanks,
Kyrill


Re: Handling aliasing memory accesses in gimple/ssa

2016-05-10 Thread Kyrill Tkachov


On 10/05/16 14:46, Richard Biener wrote:

On Tue, May 10, 2016 at 3:41 PM, Kyrill Tkachov wrote:

Hi all,

I'm taking a stab at fixing PR 22141 by merging adjacent stores into wider
stores in a late gimple pass.

My current plan is to go through all the assignments in a basic block and
keep track of
LHS expressions that are COMPONENT_REF, BIT_FIELD_REF, ARRAY_REF or
ARRAY_RANGE_REF until
we encounter an expression that affects the recorded stores or a mergeable
store that aliases with
any of the preceding loads. My question is how to express that check in
gimple/tree-ssa?

Is taking get_alias_set of the two expressions that I want to compare and
checking alias_sets_conflict_p enough?

That's too conservative.  You want to use stmt_may_clobber_ref_p /
ref_maybe_used_by_stmt_p
and likely feed both with the "base" you want to perform the merging with.


Thanks!
I think that's exactly what I need.


Richard.


Thanks,
Kyrill





Handling aliasing memory accesses in gimple/ssa

2016-05-10 Thread Kyrill Tkachov

Hi all,

I'm taking a stab at fixing PR 22141 by merging adjacent stores into wider 
stores in a late gimple pass.

My current plan is to go through all the assignments in a basic block and keep 
track of
LHS expressions that are COMPONENT_REF, BIT_FIELD_REF, ARRAY_REF or 
ARRAY_RANGE_REF until
we encounter an expression that affects the recorded stores or a mergeable 
store that aliases with
any of the preceding loads. My question is how to express that check in 
gimple/tree-ssa?

Is taking get_alias_set of the two expressions that I want to compare and
checking alias_sets_conflict_p enough?
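
For example (an illustrative sketch, not one of the actual testcases), an
intervening store through a possibly-aliasing pointer is exactly what should
stop the merging:

struct s
{
  unsigned char a, b, c, d;
};

void
f (struct s *p, unsigned char *q)
{
  p->a = 1;
  p->b = 2;
  *q = 0;     /* may alias p->c or p->d, so merging across it needs a check */
  p->c = 3;
  p->d = 4;
}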

Thanks,
Kyrill



Canonical forms of edges and fallthroughs

2016-04-29 Thread Kyrill Tkachov

Hi all,

I'm looking at an issue in RTL ifcvt and I'm trying to understand the way edges 
between
basic blocks are treated and in particular what is the canonical use of 
EDGE_FALLTHRU.
Is it governed by the conditional jump condition?

In find_if_header in ifcvt.c there is a comment that says:
  /* The THEN edge is canonically the one that falls through.  */

But I'm encountering a case where the jump expression from the test block is:
(set (pc)
     (if_then_else (eq (reg:CC 66 cc)
                       (const_int 0 [0]))
                   (label_ref:DI 22)
                   (pc)))

that is, the fallthrough happens when the condition is false. Does that make 
this
basic block sequence non-canonical?

Thanks,
Kyrill


Re: Validity of SUBREG+AND-imm transformations

2016-03-11 Thread Kyrill Tkachov


On 08/03/16 19:11, Jeff Law wrote:

On 03/08/2016 11:49 AM, Richard Henderson wrote:

On 03/07/2016 02:49 PM, Jeff Law wrote:

On 03/07/2016 03:44 AM, Kyrill Tkachov wrote:




The RTL documentation for ASHIFT and friends says that the shift amount
must be:
"a fixed-point mode or be a constant with mode @code{VOIDmode}; which
mode is determined by the mode called for in the machine description
entry for the left-shift instruction". For example, on the VAX, the mode
of @var{c} is @code{QImode} regardless of @var{m}.

Use QImode in the named pattern/expander and use the other modes in an
unnamed/anonymous pattern.


I thought the same thing you did.  But I tried it out on the aarch64
port and it didn't work.  Combine kept coming back to the QImode pattern.

I didn't want to look into it any farther than that, lest I fall down a
rabbit hole, but there's more to it than just tweaking the patterns.

Strange.  Probably worth some investigating for gcc-7.



FYI I've attached an example patch that I had implemented for my original 
proposal to PR 70119.

Kyrill


jeff




Re: Validity of SUBREG+AND-imm transformations

2016-03-07 Thread Kyrill Tkachov


On 05/03/16 05:52, Jeff Law wrote:

On 03/04/2016 09:33 AM, Kyrill Tkachov wrote:


On 04/03/16 16:21, Jeff Law wrote:

On 03/04/2016 08:05 AM, Richard Biener wrote:

does that mean that the shift amount should be DImode?
Seems like a more flexible approach would be for the midend to be able
to handle these things...


Or macroize for all integer modes?

That's probably worth exploring.  I wouldn't be at all surprised if it
that turns out to be better than any individual mode,  not just for
arm & aarch64, but would help a variety of targets.



What do you mean by 'macroize' here? Do you mean use iterators to create
multple variants of patterns with different
modes on the shift amount?
I believe we'd still run into the issue at
https://gcc.gnu.org/ml/gcc/2016-03/msg00036.html.

We might, but I would expect the the number of incidences to be fewer.

Essentially we're giving the compiler multiple options when it comes to representation of the shift amount -- allowing the compiler (combine in particular) to use the shift amount in whatever mode is most natural. ie, if the count is 
sitting in a QI, HI, SI or possibly even a DI register, then it can be used as-is.  No subregs, no zero/sign extensions, or and-imm masking.




The RTL documentation for ASHIFT and friends says that the shift amount must be:
"a fixed-point mode or be a constant with mode @code{VOIDmode}; which
mode is determined by the mode called for in the machine description
entry for the left-shift instruction". For example, on the VAX, the mode
of @var{c} is @code{QImode} regardless of @var{m}.

From what I understand the "ashl" standard name should expand to an ASHIFT with 
a particular mode for the shift amount.
Currently that's QImode for aarch64.
So whenever combine tries to propagate anything into the shift amount it has to 
force it into QImode.
I don't see how specifying multiple matching patterns for different modes will 
help as combine propagates
the and-immediate operation into the shift amounts, creates the awkward subreg 
and tries to match that.
It won't try different modes on the shift amount to help matching (and from 
previous discussions I understand
that's not the direction we want combine to take).

I've filed PR 70119 to make this easier to track (the sourceware archives cut off the
thread across months :( ), with example code.

Thanks for the ideas,
Kyrill


jeff




Re: Validity of SUBREG+AND-imm transformations

2016-03-04 Thread Kyrill Tkachov


On 04/03/16 16:21, Jeff Law wrote:

On 03/04/2016 08:05 AM, Richard Biener wrote:

does that mean that the shift amount should be DImode?
Seems like a more flexible approach would be for the midend to be able
to handle these things...


Or macroize for all integer modes?

That's probably worth exploring.  I wouldn't be at all surprised if it that turns 
out to be better than any individual mode,  not just for arm & aarch64, but 
would help a variety of targets.



What do you mean by 'macroize' here? Do you mean use iterators to create 
multple variants of patterns with different
modes on the shift amount?
I believe we'd still run into the issue at 
https://gcc.gnu.org/ml/gcc/2016-03/msg00036.html.

My worry is that such a change will bleeds out beyond just the standard shifting and shadd style patterns in each port.  I guess that would largely depend on how many combiner patterns a port has which combine a shift with some other 
operation.




I see. For my purposes restricting this transformation to the cases where the 
shifted value is a REG seems to work ok,
so maybe we can avoid most side effects.

Kyrill


jeff





Re: Validity of SUBREG+AND-imm transformations

2016-03-04 Thread Kyrill Tkachov


On 04/03/16 15:12, Kyrill Tkachov wrote:


On 04/03/16 15:07, Segher Boessenkool wrote:

On Fri, Mar 04, 2016 at 02:48:21PM +, Kyrill Tkachov wrote:

Although there are cases where we hit the same problem:
unsigned long
f3 (unsigned long bit_addr)
{
   unsigned long bitnumb = bit_addr & 63;
   return (1L << bitnumb);
}

combine will try to match:
(set (reg:DI 78)
 (ashift:DI (reg:DI 80)
 (subreg:SI (and:DI (reg:DI 0 x0 [ bit_addr ])
 (const_int 63 [0x3f])) 0)))

does that mean that the shift amount should be DImode?

Heh.  Maybe?  Try it out, see which works best.

My point was that you do not have QI anywhere else.  Registers are
always SI or DI (I think?  Not totally familiar with the aarch64 code).


Yeah, registers can be accessed either in SImode (w-form) or DImode (x-form)
and if an instruction writes the SImode form the top 32 bits are implicitly
zeroed out.



Making the shift amount DImode performs worst of all...
For code:
unsigned long
foo (unsigned long a, unsigned int b)
{
  return a << b;
}

If we represent the shift amount as DImode then the SImode register for 'b' is 
wrapped
into a zero-extend to DImode during expand, so we'd need all of the arith+shift 
patterns
to match a zero_extend form. When the shift amount is QImode we don't have any 
of that
because we just take the QImode subreg in the ashift RTX. I suspect that's the 
original
reason why it was not made DImode.


Seems like a more flexible approach would be for the midend to be able to
handle these things...

Of course.  OTOH, there is no reason to make it harder than necessary
for the compiler to do a reasonable job ;-)


I'll do some more investigations. Last time I looked at changing the mode I 
recall
having various zero_extends being introduced that hurt matching...

Kyrill




Segher






Re: Validity of SUBREG+AND-imm transformations

2016-03-04 Thread Kyrill Tkachov


On 04/03/16 15:07, Segher Boessenkool wrote:

On Fri, Mar 04, 2016 at 02:48:21PM +, Kyrill Tkachov wrote:

Although there are cases where we hit the same problem:
unsigned long
f3 (unsigned long bit_addr)
{
   unsigned long bitnumb = bit_addr & 63;
   return (1L << bitnumb);
}

combine will try to match:
(set (reg:DI 78)
 (ashift:DI (reg:DI 80)
 (subreg:SI (and:DI (reg:DI 0 x0 [ bit_addr ])
 (const_int 63 [0x3f])) 0)))

does that mean that the shift amount should be DImode?

Heh.  Maybe?  Try it out, see which works best.

My point was that you do not have QI anywhere else.  Registers are
always SI or DI (I think?  Not totally familiar with the aarch64 code).


Yeah, registers can be accessed either in SImode (w-form) or DImode (x-form)
and if an instruction writes the SImode form the top 32 bits are implicitly
zeroed out.


Seems like a more flexible approach would be for the midend to be able to
handle these things...

Of course.  OTOH, there is no reason to make it harder than necessary
for the compiler to do a reasonable job ;-)


I'll do some more investigations. Last time I looked at changing the mode I 
recall
having various zero_extends being introduced that hurt matching...

Kyrill




Segher




Re: Validity of SUBREG+AND-imm transformations

2016-03-04 Thread Kyrill Tkachov


On 04/03/16 14:41, Kyrill Tkachov wrote:


On 04/03/16 11:59, Segher Boessenkool wrote:

On Mon, Feb 29, 2016 at 10:51:24AM +, Kyrill Tkachov wrote:

So I'm trying to create a define_insn to match something like:
  [(set (match_operand:SI 0 "register_operand" "=r")
        (ashift:SI
          (match_operand:SI 1 "register_operand" "r")
          (and:QI
            (match_operand:QI 2 "register_operand" "r")
            (match_operand:QI 3 "const_int_operand" "n"))))]


where operands[3] is 31 for SImode. The 'and' expression has to be in
QImode because our shift expanders
expand the shift amount to QImode.

Is there any reason for that?  Why not SImode?


It's been that way since the beginning. I don't know the reason.
I tried changing it to SImode. It requires a lot of mechanical changes to
all the shift+arithmetic patterns, but it does work.
It catches a few more cases than my original approach (not sure why yet,
would have to dig in the dumps) but it also widens a few memory accesses
i.e. we'll now be loading a 32-bit value from memory instead of a byte.
Overall it could be a better approach, though it would be a larger patch
and would need more investigation of potential side effects.



Although there are cases where we hit the same problem:
unsigned long
f3 (unsigned long bit_addr)
{
  unsigned long bitnumb = bit_addr & 63;
  return (1L << bitnumb);
}

combine will try to match:
(set (reg:DI 78)
(ashift:DI (reg:DI 80)
(subreg:SI (and:DI (reg:DI 0 x0 [ bit_addr ])
(const_int 63 [0x3f])) 0)))

does that mean that the shift amount should be DImode?
Seems like a more flexible approach would be for the midend to be able to 
handle these things...

Kyrill



Thanks,
Kyrill



Segher






Re: Validity of SUBREG+AND-imm transformations

2016-03-04 Thread Kyrill Tkachov


On 04/03/16 11:59, Segher Boessenkool wrote:

On Mon, Feb 29, 2016 at 10:51:24AM +, Kyrill Tkachov wrote:

So I'm trying to create a define_insn to match something like:
  [(set (match_operand:SI 0 "register_operand" "=r")
        (ashift:SI
          (match_operand:SI 1 "register_operand" "r")
          (and:QI
            (match_operand:QI 2 "register_operand" "r")
            (match_operand:QI 3 "const_int_operand" "n"))))]


where operands[3] is 31 for SImode. The 'and' expression has to be in
QImode because our shift expanders
expand the shift amount to QImode.

Is there any reason for that?  Why not SImode?


It's been that way since the beginning. I don't know the reason.
I tried changing it to SImode. It requires a lot of mechanical changes to
all the shift+arithmetic patterns, but it does work.
It catches a few more cases than my original approach (not sure why yet,
would have to dig in the dumps) but it also widens a few memory accesses
i.e. we'll now be loading a 32-bit value from memory instead of a byte.
Overall it could be a better approach, though it would be a larger patch
and would need more investigation of potential side effects.

Thanks,
Kyrill



Segher




Re: [WWWDocs] Deprecate support for non-thumb ARM devices

2016-02-29 Thread Kyrill Tkachov


On 28/02/16 21:34, Joel Sherrill wrote:


On February 28, 2016 3:20:24 PM CST, Gerald Pfeifer  wrote:

On Wed, 24 Feb 2016, Richard Earnshaw (lists) wrote:

I propose to commit this patch later this week.

+   Support for revisions of the ARM architecture prior to ARMv4t
has
+   been deprecated and will be removed in a future GCC release.
+   This affects ARM6, ARM7 (but not ARM7TDMI), ARM8, StrongARM,
and
+   Faraday fa526 and fa626 devices, which do not have support for
+   the Thumb execution state.

I am wondering whether this may be confusing for those not
intricately familiar with the older history of ARM platforms.

ARMv8 is pretty new, googling for it has
  http://www.arm.com/products/processors/armv8-architecture.php
as first hit, for example, and the only difference versus ARM8
is that little lower-case "v".

I assume this means a number of values for the various -mXXX arguments will be 
removed. Would it be more helpful to list those values?

I have to agree with Gerald. I think this will obsolete a few older RTEMS BSPs 
but based on that wording, I don't know which.


ARM8 is a processor, whereas ARMv8-A is an architecture.
I think Richard's link earlier in the thread:

https://community.arm.com/groups/processors/blog/2011/11/02/arm-fundamentals-introduction-to-understanding-arm-processors

gives a good explanation of the naming schemes.
The -mcpu/-mtune arguments that would be deprecated can be found by looking at 
the
file config/arm/arm-cores.def and finding all the ARM_CORE entries that have 
'4' or lower in their
4th field. These would be:
arm2,arm250,arm3,arm6,arm60,arm600,arm610,arm620,arm7,arm7d,arm7di,arm70,arm700,arm700i,arm710,
arm720,arm710c,arm7100,arm7500,arm7500fe,arm7m,arm7dm,arm7dmi,arm8,arm810,strongarm,strongarm110,
strongarm1100,strongarm1110,fa526,fa626.

The arguments to -march that would be deprecated are:
armv2,armv2a,armv3,armv3m,armv4.

I personally think that list is a bit too long for changes.html.
Do you think it would add more clarity for people who are not familiar with the 
situation?

Thanks,
Kyrill


Gerald

--joel





Re: Validity of SUBREG+AND-imm transformations

2016-02-29 Thread Kyrill Tkachov

Hi Jeff,

On 26/02/16 21:24, Jeff Law wrote:

On 02/26/2016 06:40 AM, Kyrill Tkachov wrote:

Hi all,

I'm looking at a case where some RTL passes create an RTL expression of
the form:
(subreg:QI (and:SI (reg:SI x1)
 (const_int 31)) 0)

which I'd like to simplify to:
(and:QI (subreg:QI (reg:SI x1) 0)
 (const_int 31))

I can think of cases where the first is better and other cases where the second 
is better -- a lot depends on context.  I don't have a good sense for which is 
better in general.

Note that as-written these don't trigger the subtle issues in what happens with 
upper bits.  That's more for extensions.

(subreg:SI (whatever:QI))

vs

{zero,sign}_extend:SI (whatever:QI))

vs

(and:SI (subreg:SI (whatever:QI) (const_int 0x255)))


The first leaves the bits beyond QI as "undefined" and sometimes (but I doubt 
all that often in practice) the compiler will use the undefined nature of those bits to 
enable optimizations.


The second & 3rd variants crisply define the upper bits.



Thanks for the explanation.





It's easy enough to express in RTL but I'm trying to convince myself on
its validity.
I know there are some subtle points in this area. combine_simplify_rtx
in combine.c
has a comment:
   /* Note that we cannot do any narrowing for non-constants since
  we might have been counting on using the fact that some bits were
  zero.  We now do this in the SET.  */

That comment makes no sense.  Unfortunately it goes back to a change from 
Kenner in 1994 -- which predates having patch discussions here and consistently 
adding tests to the testsuite.

The code used to do this:


-  if (GET_MODE_CLASS (mode) == MODE_INT
- && GET_MODE_CLASS (GET_MODE (SUBREG_REG (x))) == MODE_INT
- && GET_MODE_SIZE (mode) < GET_MODE_SIZE (GET_MODE (SUBREG_REG (x)))
- && subreg_lowpart_p (x))
-   return force_to_mode (SUBREG_REG (x), mode, GET_MODE_MASK (mode),
- NULL_RTX, 0);

Which appears to check that we've got a narrowing subreg expression, and if we 
do try to force the SUBREG_REG into the right mode using force_to_mode.

But if we had a narrowing SUBREG_REG, then I can't see how anything would have 
been depending on the upper bits being zero.




and if I try to implement this transformation in simplify_subreg from
simplify-rtx.c
I get some cases where combine goes into an infinite recursion in
simplify_comparison
because it tries to do:

   /* If this is (and:M1 (subreg:M1 X:M2 0) (const_int C1)) where C1
  fits in both M1 and M2 and the SUBREG is either paradoxical
  or represents the low part, permute the SUBREG and the AND
  and try again.  */
Right.  I think you just end up ping-ponging between the two equivalent representations.  Which may indeed argue that the existing representation is preferred and we should look deeper into why the existing representation isn't being 
handled as well as it should be.




After a bit more experimentation I found it's easy enough to avoid looping on 
this spot.
Just check if the gen_lowpart part of that transformation returned something 
different from the original
expression...



Performing this transformation would help a lot with recognition of some
patterns that
I'm working on, so would it be acceptable to teach combine or
simplify-rtx to do this?

How does it help recognition?   What kinds of patterns are you trying to 
recognize?



On aarch64 the normal integer-side variable shift/rotate instructions truncate 
their shift amount.
However, we can't enable SHIFT_COUNT_TRUNCATED by default on aarch64 because 
some of the alternatives
in our shift patterns use the vector shift instructions, which don't truncate 
the shift amount.
So I'm trying to add a combine pattern to eliminate redundant truncations where 
possible when using
the normal shift instructions.

So I'm trying to create a define_insn to match something like:
  [(set (match_operand:SI 0 "register_operand" "=r")
        (ashift:SI
          (match_operand:SI 1 "register_operand" "r")
          (and:QI
            (match_operand:QI 2 "register_operand" "r")
            (match_operand:QI 3 "const_int_operand" "n"))))]


where operands[3] is 31 for SImode. The 'and' expression has to be in QImode 
because our shift expanders
expand the shift amount to QImode.
However, when combine tries to combine the and-mask operation with the shift, 
for example from the code:
unsigned f1(unsigned x, int y) { return x << (y & 31); }

the expression it tries to match is:
(set (reg/i:SI 0 x0)
(ashift:SI (reg:SI 0 x0 [ x ])
(subreg:QI (and:SI (reg:SI 1 x1 [ y ])
(const_int 31 [0x1f])) 0)))

That is, the subreg is not propagated inside. I think the define_insn pattern 
for that would look
somewhat unnatural, so I'm trying to get the subreg to be propagated to the reg 
inside.

Thanks,
Kyrill



Validity of SUBREG+AND-imm transformations

2016-02-26 Thread Kyrill Tkachov

Hi all,

I'm looking at a case where some RTL passes create an RTL expression of the 
form:
(subreg:QI (and:SI (reg:SI x1)
(const_int 31)) 0)

which I'd like to simplify to:
(and:QI (subreg:QI (reg:SI x1) 0)
(const_int 31))

Because (const_int 31) masks out the upper bits after the 5th one, we should be 
able
to safely perform the operation in QImode.

It's easy enough to express in RTL but I'm trying to convince myself of its
validity.
I know there are some subtle points in this area. combine_simplify_rtx in 
combine.c
has a comment:
  /* Note that we cannot do any narrowing for non-constants since
 we might have been counting on using the fact that some bits were
 zero.  We now do this in the SET.  */

and if I try to implement this transformation in simplify_subreg from 
simplify-rtx.c
I get some cases where combine goes into an infinite recursion in 
simplify_comparison
because it tries to do:

  /* If this is (and:M1 (subreg:M1 X:M2 0) (const_int C1)) where C1
 fits in both M1 and M2 and the SUBREG is either paradoxical
 or represents the low part, permute the SUBREG and the AND
 and try again.  */

I think the transformation is valid in general because in the original case we
care only about the bits within QImode which are well defined by the wider
inner operation, and in the transformed case the same bits are also well-defined
because of the narrow bitmask.

Performing this transformation would help a lot with recognition of some 
patterns that
I'm working on, so would it be acceptable to teach combine or simplify-rtx to 
do this?

Thanks,
Kyrill


Re: PIE/PIC issue ...w.r.t linker variable

2016-02-12 Thread Kyrill Tkachov


On 12/02/16 09:58, Umesh Kalappa wrote:

Hi Kyrill ,
Thank you for the info. Before I file a bug, I need to confirm whether it's a bug or not.


I'm not familiar with linker scripts and the details of -fpie,
so unfortunately I can't judge.

Do file a bug report.
If it is not a bug, it will be closed as such by someone who does
the analysis.

Kyrill


Thank you
~Umesh

On Fri, Feb 12, 2016 at 3:00 PM, Kyrill Tkachov wrote:

Hi,


On 12/02/16 09:19, Umesh Kalappa wrote:

Hi Guys ,

We do have an issue with the code below when we enable the PIE (-fpie/-pie)
option, i.e.:

main.c
extern int *my_ptr ;

int main()
{
 return *my_ptr;
}

foo.s
   .syntax unified
   .cpu cortex-m0
   .fpu softvfp
.thumb
.global my_ptr
 .global my_var
  .data
  .align  2
 .type   my_ptr, %object
 .size   my_ptr, 4
my_ptr:
.word   my_var   //where my_var is the linker variable

custom.ld  (linker script)
/* Set stack top to end of RAM, and stack limit move down by
  190  * size of stack_dummy section */
  191 my_var = 20;
  192 __StackTop = ORIGIN(RAM) + LENGTH(RAM);
  193 __StackLimit = __StackTop - SIZEOF(.stack_dummy);
  194 PROVIDE(__stack = __StackTop);


command used

3  arm-none-eabi-gcc -c -fPIC main.c -mthumb -mcpu=cortex-m0
4
5 arm-none-eabi-gcc -c -fPIC foo.S -mthumb -mcpu=cortex-m0
6 arm-none-eabi-gcc -c -fPIC

/home/egoumal/Downloads/gcc-arm-none-eabi-5_2-2015q4/share/gcc-arm-none-eabi/samples/startup/startu
 p_ARMCM0.S -mthumb -mcpu=cortex-m0 -D__STARTUP_CLEAR_BSS
-D__START=main
7
8 arm-none-eabi-ld -pie main.o  foo.o startup_ARMCM0.o -L.
-L/home/egoumal/Downloads/gcc-arm-none-eabi-5_2-2015q4/share/gcc-ar
  m-none-eabi/samples/ldscripts -T nokeep.ld -Map=test.map -o test

We expect my_ptr's value to be 20, but we see the value 0; without the
PIE option, my_ptr has the value 20.

Are we missing something here, or is the value 0 expected (which would be incorrect)?


note that gcc-bugs is the list where the automatic bug tracker sends all the
emails
logging almost all activity, so your email would be lost there...
Please file a bug report in bugzilla according to https://gcc.gnu.org/bugs/

Thanks,
Kyrill



Thank you and appreciate any lights on this
~Umesh





Re: PIE/PIC issue ...w.r.t linker variable

2016-02-12 Thread Kyrill Tkachov

Hi,

On 12/02/16 09:19, Umesh Kalappa wrote:

Hi Guys ,

We do have an issue with the code below when we enable the PIE (-fpie/-pie) option,
i.e.:

main.c
extern int *my_ptr ;

int main()
{
return *my_ptr;
}

foo.s
  .syntax unified
  .cpu cortex-m0
  .fpu softvfp
   .thumb
   .global my_ptr
.global my_var
 .data
 .align  2
.type   my_ptr, %object
.size   my_ptr, 4
my_ptr:
   .word   my_var   //where my_var is the linker variable

custom.ld  (linker script)
/* Set stack top to end of RAM, and stack limit move down by
 190  * size of stack_dummy section */
 191 my_var = 20;
 192 __StackTop = ORIGIN(RAM) + LENGTH(RAM);
 193 __StackLimit = __StackTop - SIZEOF(.stack_dummy);
 194 PROVIDE(__stack = __StackTop);


command used

   3  arm-none-eabi-gcc -c -fPIC main.c -mthumb -mcpu=cortex-m0
   4
   5 arm-none-eabi-gcc -c -fPIC foo.S -mthumb -mcpu=cortex-m0
   6 arm-none-eabi-gcc -c -fPIC
/home/egoumal/Downloads/gcc-arm-none-eabi-5_2-2015q4/share/gcc-arm-none-eabi/samples/startup/startu
p_ARMCM0.S -mthumb -mcpu=cortex-m0 -D__STARTUP_CLEAR_BSS
-D__START=main
   7
   8 arm-none-eabi-ld -pie main.o  foo.o startup_ARMCM0.o -L.
-L/home/egoumal/Downloads/gcc-arm-none-eabi-5_2-2015q4/share/gcc-ar
 m-none-eabi/samples/ldscripts -T nokeep.ld -Map=test.map -o test

we expect the my_ptr value to be 20, but we see the value 0; without the
pie option, my_ptr has the value 20.

Are we missing something here, or is the value 0 expected (which would be incorrect)?


note that gcc-bugs is the list where the automatic bug tracker sends all the 
emails
logging almost all activity, so your email would be lost there...
Please file a bug report in bugzilla according to https://gcc.gnu.org/bugs/

Thanks,
Kyrill



Thank you and appreciate any lights on this
~Umesh





Re: RTL CSE picking simpler but more expensive expressions (re: PR 65932)

2016-01-14 Thread Kyrill Tkachov

Hi Jeff,

On 13/01/16 19:28, Jeff Law wrote:

On 01/13/2016 04:33 AM, Kyrill Tkachov wrote:


I've been able to get it to do the right thing by changing the line
where it initially folds the source of the SET. That is line 4639 in
cse.c: /* Simplify and foldable subexpressions in SRC.  Then get the
fully- simplified result, which may not necessarily be valid. */
src_folded = fold_rtx (src, insn);

In this instance SRC is: (plus:SI (mult:SI (sign_extend:SI (subreg:HI
(reg:SI 136) 0)) (sign_extend:SI (subreg:HI (reg:SI 138) 0))) (reg:SI
141))

and the resulting src_folded is: (plus:SI (mult:SI (reg:SI 136)
(reg:SI 138)) (reg:SI 141))

However, fold_rtx also modifies src itself, so after the call to
fold_rtx both src and src_folded contain the plus+mult form without
the extends. So further down in cse_insn where it does the cost
analysis of src, src_folded and other expressions it never considers
the original form of src. Changing that call to fold_rtx to not
modify its argument like so: src_folded = fold_rtx (src, 0); // An
argument of 0 means "make a copy of src before modifying"

fixes the testcase and allows CSE to properly select the cheaper
multiply-extend-add form and doesn't seem to regress codegen in any
way on SPEC2006 for arm. Archeology says this line has been that way
since forever, so does anyone know of the rationale for passing insn
to fold_rtx in that callsite?
That callsite was added during gcc-2 development, in an era where we didn't even have public lists where such a change might have been discussed.  I didn't try to dig into the old, private gcc2 archives as it's unlikely there's going to 
be any rationale there.


ISTM the code clearly expects that SRC and SRC_FOLDED could be different.  I think you could make a case that INSN should just be replaced with NULL based on that alone.  Verifying across a reasonable body of code that there aren't any 
undesirable effects would be wise.




There are cases where calls to fold_rtx with a non-NULL insn don't end up
modifying src, so src is not always equal to src_folded (I added an assert
to that effect and saw it trigger).
It seems that fold_rtx is not *guaranteed* to modify src if insn is non-NULL, 
but it just does in many cases.


Alternately you could compute stuff for "SRC" prior to the call to fold_rtx. 
It's less likely to have unexpected side effects.

My inclination would be to go with changing INSN to NULL though. It seems to 
match the overall intent here better.



I agree and, as I said, I saw no impact on codegen for all of SPEC2006 on arm. 
It just improved the testcase
I described above. I'll evaluate this patch on aarch64 and x86_64 and hopefully 
it has a low impact there as well.

Thanks,
Kyrill


Jeff




RTL CSE picking simpler but more expensive expressions (re: PR 65932)

2016-01-13 Thread Kyrill Tkachov

Hi all,

I'm looking at mitigating some of the codegen regressions on arm that come from 
a proposed fix
to PR 65932 and I believe RTL CSE is getting in the way.

The problematic sequences involve sign-extending loads and sign-extending 
multiplies.
From the gcc.target/arm/wmul-1.c case we have:
(set (reg:SI 136)
 (sign_extend:SI (mem:HI (reg:SI 135))))
(set (reg:SI 138)
 (sign_extend:SI (mem:HI (reg:SI 144))))

(set (reg:SI 141)
 (plus:SI
   (mult:SI (sign_extend:SI (subreg:HI (reg:SI 136) 0))
(sign_extend:SI (subreg:HI (reg:SI 138) 0)))
   (reg:SI 141)))

And cse1 transforms this into:
(set (reg:SI 136)
 (sign_extend:SI (mem:HI (reg:SI 135))))
(set (reg:SI 138)
 (sign_extend:SI (mem:HI (reg:SI 144))))
(set (reg:SI 141)
 (plus:SI
   (mult:SI (reg:SI 136)
(reg:SI 138))
   (reg:SI 141)))

Now, for some arm targets the second sequence is more expensive because a 
sign-extending multiply-add (smlabb)
is cheaper than a full 32-bit multiply-add (mla). This is reflected in rtx 
costs.
That is, the second form is simpler from an rtx structure point of view but is 
more expensive.
I see that we have some costing logic in the cse_insn function in cse.c that is 
supposed to guard against this,
but it doesn't seem to be doing the right thing in this case.

I've been able to get it to do the right thing by changing the line where it 
initially folds the source
of the SET. That is line 4639 in cse.c:
  /* Simplify and foldable subexpressions in SRC.  Then get the fully-
 simplified result, which may not necessarily be valid.  */
  src_folded = fold_rtx (src, insn);

In this instance SRC is:
 (plus:SI
   (mult:SI (sign_extend:SI (subreg:HI (reg:SI 136) 0))
(sign_extend:SI (subreg:HI (reg:SI 138) 0)))
   (reg:SI 141))

and the resulting src_folded is:
 (plus:SI
   (mult:SI (reg:SI 136)
(reg:SI 138))
   (reg:SI 141))

However, fold_rtx also modifies src itself, so after the call to fold_rtx both 
src and src_folded contain
the plus+mult form without the extends.
So further down in cse_insn where it does the cost analysis of src, src_folded 
and other expressions
it never considers the original form of src.
Changing that call to fold_rtx to not modify its argument like so:
  src_folded = fold_rtx (src, 0); // An argument of 0 means "make a copy of src 
before modifying"

fixes the testcase and allows CSE to properly select the cheaper 
multiply-extend-add form and doesn't
seem to regress codegen in any way on SPEC2006 for arm.
Archeology says this line has been that way since forever, so does anyone know 
of the rationale for passing
insn to fold_rtx in that callsite?

Thanks,
Kyrill



Re: Test case mis-categorized as UNSUPPORTED?

2016-01-07 Thread Kyrill Tkachov

Hi Bin,

On 07/01/16 14:15, Bin.Cheng wrote:

Hi,
Below test is supposed to be compiled and run, but we failed to link
the binary with tiny memory model.

spawn 
/data/work/build-aarch64_be-none-elf/obj/gcc2/gcc/testsuite/g++14/../../xg++
-B/data/work/build-aarch64_be-none-elf/obj/gcc2/gcc/testsuite/g++14/../../
/data/work/src/gcc/gcc/testsuite/g++.dg/torture/pr67600.C
-fno-diagnostics-show-caret -fdiagnostics-color=never -nostdinc++
-I/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/libstdc++-v3/include/aarch64_be-none-elf
-I/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/libstdc++-v3/include
-I/data/work/src/gcc/libstdc++-v3/libsupc++
-I/data/work/src/gcc/libstdc++-v3/include/backward
-I/data/work/src/gcc/libstdc++-v3/testsuite/util -fmessage-length=0
-O2 -flto -fuse-linker-plugin -fno-fat-lto-objects
-specs=aem-validation.specs
-L/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs
-B/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs
-lm -mcmodel=tiny -o ./pr67600.exe
/tmp/ccd32hub.ltrans0.ltrans.o: In function `main':
:(.text.startup+0x68): relocation truncated to fit:
R_AARCH64_ADR_PREL_LO21 against symbol `std::cout' defined in
.bss._ZSt4cout section in
/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs/libstdc++.a(globals_io.o)
collect2: error: ld returned 1 exit status
compiler exited with status 1
output is:
/tmp/ccd32hub.ltrans0.ltrans.o: In function `main':
:(.text.startup+0x68): relocation truncated to fit:
R_AARCH64_ADR_PREL_LO21 against symbol `std::cout' defined in
.bss._ZSt4cout section in
/data/work/build-aarch64_be-none-elf/obj/gcc2/aarch64_be-none-elf/./libstdc++-v3/src/.libs/libstdc++.a(globals_io.o)
collect2: error: ld returned 1 exit status

UNSUPPORTED: g++.dg/torture/pr67600.C   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects : memory full

In my understanding, a dg-do run test case should be marked as
FAIL&UNRESOLVED if the binary file can't be generated.  But here it's
categorized as an UNSUPPORTED test.  This could be misleading
sometimes since unsupported tests could be ignored.


The problem is that many of these libstdc++ tests got too big for the tiny 
memory model
and the whole testsuite got very noisy due to these relocation truncation 
errors.
That's why we try to mark them as unsupported. I tried doing it in the past and
Szabolcs fixed it properly with 
https://gcc.gnu.org/ml/libstdc++/2015-10/msg00037.html

Thanks,
Kyrill


Any idea why it acts in this way?

Thanks,
bin





Re: Finding insns to reorder using dataflow

2015-08-14 Thread Kyrill Tkachov


On 14/08/15 16:31, Jeff Law wrote:

On 08/14/2015 03:05 AM, Kyrill Tkachov wrote:

The problem I'm trying to solve can be expressed in this way: "An
insn that satisfies predicate pred_p (insn) cannot appear exactly N
insns apart from another insn 'insn2' that satisfies pred_p (insn2).
N is a constant". So, the problem here is that this restriction is
not something expressed in terms of cycles or DFA states, but rather
distance in the instruction stream.

I wasn't really suggesting to model it in DFA states, but instead use
the dependency analysis + hooks.   The dependency analysis in particular
when it's safe to interchange two insns.

Given the additional information, I think you'd want to note when an
insn fires and satisfies pred_p, and associate a counter with each
firing.  THe active counters are bumped (decremented?) at each firing
(so you can track how many insns appear after the one that satisfied
pred_p).  Note that for insns which generate multiple assembly
instructions, you need to decrement the counter by the number of
assembly instructions they emit.

Then when sorting the ready list, if you have an insn that satisfies
pred_p and an active counter has just reached zero, make sure some other
insn fires (what if there aren't any other ready insns?  Is this a
correctness or performance issue?)





I don't think I can do this reliably during sched2 because there is
still splitting that can be done that will create more insns that
will invalidate any book keeping that I do there.

Right.  You need everything split and you need accurate insn length
information for every insn in the backend that isn't split.  If this is
a correctness issue, then you also have to deal with final deleting
insns behind your back as well.

Many years ago I did something which required 100% accurate length
information from the backend.  It was painful, very very painful.
Ultimately it didn't work out and the code was scrapped.




However, during TARGET_MACHINE_DEPENDENT_REORG I can first split all
  insns and then call schedule_insns () to do another round of
scheduling. However, I'm a bit confused by all the different
scheduler hooks and when each one is called in relation to the
other.

You'll have to work through them -- I haven't kept close tabs on the
various hooks we have, I just know we have them.


I'd need to keep some kind of bitfield recording for the previous N
instructions in the stream whether they satisfy pred_p. Where would I
record that? Can I just do everything in TARGET_SCHED_REORDER? i.e.
given a ready list check that no pred_p insns in it appear N insns
apart from another such insn (using my bitfield as a lookup helper),
reorder insns as appropriate and then record the order of the pred_p
insns in the bitfield. Would the scheduler respect the order of the
insns that was set by TARGET_SCHED_REORDER and not do any further
reordering?

The problem I see is that once one of these insns fire, other new insns
will be added to the ready list.  So you have to keep some kind of state
about how many instructions back one of these insns fired and consult
that data when making a decision about the next instruction to fire.

All this will fall apart if this is a correctness issue since you'd have
to issue a nop or somesuch.  Though I guess you might be able to arrange
to get a nop into the scheduled stream.  If this is a correctness issue,
tackling it in the assembler may make more sense.


Thanks. This is not a correctness issue.
I got some code to add nops and it was easy enough,
but I'm using it only to investigate the effectiveness
of a proper scheduling approach (the more effective the
scheduling approach, the fewer nops we emit).
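
For reference, the nop fallback is roughly the following (just a sketch:
pred_p, PROBLEM_DISTANCE and distance_to_prev_pred_p are placeholder names,
and it assumes the port has a "nop" pattern so that gen_nop () exists):

  if (pred_p (insn)
      && distance_to_prev_pred_p (insn) == PROBLEM_DISTANCE)
    emit_insn_before (gen_nop (), insn);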

I'll try out the ideas you suggested.

Thanks,
Kyrill



Jeff





Re: Finding insns to reorder using dataflow

2015-08-14 Thread Kyrill Tkachov

Hi Jeff,

On 13/08/15 17:20, Jeff Law wrote:

On 08/13/2015 05:06 AM, Kyrill Tkachov wrote:

Hi all,

I'm implementing a target-specific reorg pass, and one thing that I want
to do
is for a given insn in the stream to find an instruction
in the stream that I can swap it with, without violating any dataflow
dependencies.
The candidate instruction could be earlier or later in the stream.

I'm stuck on finding an approach to do this. It seems that using some of
the dataflow
infrastructure is the right way to go, but I can't figure out the details.
can_move_insns_across looks relevant, but it seems too heavyweight, with
quite a lot of arguments.

I suppose somehow constructing regions of interchangeable instructions
would be the way
to go, but I'm not sure how clean/cheap that would be outside the scheduler

Any ideas would be appreciated.

I think you want all the dependency analysis done by the scheduler.

Which leads to the question, can you model what you're trying to do in
the various scheduler hooks -- in particular walking through the ready
list seems appropriate.


The problem I'm trying to solve can be expressed in this way:
"An insn that satisfies predicate pred_p (insn) cannot appear exactly N insns
apart from another insn 'insn2' that satisfies pred_p (insn2). N is a constant".
So, the problem here is that this restriction is not something expressed in 
terms
of cycles or DFA states, but rather distance in the instruction stream.

I don't think I can do this reliably during sched2 because there is still 
splitting
that can be done that will create more insns that will invalidate any book 
keeping
that I do there.

However, during TARGET_MACHINE_DEPENDENT_REORG I can first split all insns and 
then
call schedule_insns () to do another round of scheduling. However, I'm a bit 
confused
by all the different scheduler hooks and when each one is called in relation to
the other.

I'd need to keep some kind of bitfield recording for the previous N 
instructions in
the stream whether they satisfy pred_p. Where would I record that?
Can I just do everything in TARGET_SCHED_REORDER? i.e.
given a ready list check that no pred_p insns in it appear N insns apart from
another such insn (using my bitfield as a lookup helper), reorder insns as
appropriate and then record the order of the pred_p insns in the bitfield.
Would the scheduler respect the order of the insns that was set by 
TARGET_SCHED_REORDER
and not do any further reordering?
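
To make the question concrete, something like this very rough sketch is what
I have in mind (pred_p, PROBLEM_DISTANCE and the book-keeping variable are
placeholders, and nothing here is tested):

  /* Updated from the TARGET_SCHED_VARIABLE_ISSUE hook as insns (and the
     extra instructions of multi-insn patterns) are issued; -1 means no
     pred_p insn has been issued yet.  */
  static int insns_since_last_pred_p = -1;

  static int
  example_sched_reorder (FILE *dump ATTRIBUTE_UNUSED,
                         int verbose ATTRIBUTE_UNUSED,
                         rtx_insn **ready, int *n_readyp,
                         int clock ATTRIBUTE_UNUSED)
  {
    int n = *n_readyp;
    /* The scheduler issues from the end of the ready list, so
       ready[n - 1] is the insn that would go out next.  */
    if (n > 1
        && insns_since_last_pred_p == PROBLEM_DISTANCE
        && pred_p (ready[n - 1]))
      {
        /* Issuing another pred_p insn now would put the two exactly
           PROBLEM_DISTANCE insns apart, so swap in any insn that does
           not satisfy pred_p.  */
        for (int i = n - 2; i >= 0; i--)
          if (!pred_p (ready[i]))
            {
              std::swap (ready[i], ready[n - 1]);
              break;
            }
        /* If every ready insn satisfies pred_p we would have to fall
           back to emitting a nop.  */
      }
    /* A real implementation would return the issue rate here.  */
    return 1;
  }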

Thanks,
Kyrill



jeff





Finding insns to reorder using dataflow

2015-08-13 Thread Kyrill Tkachov

Hi all,

I'm implementing a target-specific reorg pass, and one thing that I want to do
is for a given insn in the stream to find an instruction
in the stream that I can swap it with, without violating any dataflow 
dependencies.
The candidate instruction could be earlier or later in the stream.

I'm stuck on finding an approach to do this. It seems that using some of the 
dataflow
infrastructure is the right way to go, but I can't figure out the details.
can_move_insns_across looks relevant, but it seems too heavyweight, with
quite a lot of arguments.

I suppose somehow constructing regions of interchangeable instructions would be 
the way
to go, but I'm not sure how clean/cheap that would be outside the scheduler

Any ideas would be appreciated.

Thanks,
Kyrill



Re: [DWARF] Tracking uninitialized variables

2015-07-17 Thread Kyrill Tkachov


On 17/07/15 11:43, Nikolai Bozhenov wrote:

Hello!

It is certainly true that debugging an optimized code is an inherently
difficult task. Though, I wonder if the compiler could make such
debugging experience slightly less surprising.

Consider the following example:

  1 extern void bar(int *i1, int *i2, int *i3);
  2
  3 int __attribute__((noinline)) foo(int i1, int i2) {
  4   int a, b, c;
  5   a = i1 << i2;
  6   b = (i1 + i2) * i1;
  7   c = (b + i1);
  8   bar(&a, &b, &c);
  9 }
 10
 11 int main() {
 12   foo(42, 12);
 13 }

Let's compile it:

 $ gcc-trunk tst.c -g -fvar-tracking-uninit -O2


Just a drive-by thought.
Have you tried the -Og option? The documentation for it says:
"Optimize debugging experience.-Ogenables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while 
maintaining fast compilation and a good debugging experience."
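
e.g. simply recompiling the example above with -Og instead of -O2:

 $ gcc-trunk tst.c -g -fvar-tracking-uninit -Og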


Kyrill



Re: ifcvt limitations?

2015-06-10 Thread Kyrill Tkachov


On 02/06/15 17:50, Jeff Law wrote:

On 06/02/2015 09:57 AM, Kyrill Tkachov wrote:

I'm stuck on noce_process_if_block (in ifcvt.c) and what I think is a
restriction that the THEN-block contents have to be only a single set
insn. This fails on aarch64 because we get an extra zero_extend.

In particular, the following check in noce_process_if_block triggers:
   insn_a = first_active_insn (then_bb);
   if (! insn_a
   || insn_a != last_active_insn (then_bb, FALSE)
   || (set_a = single_set (insn_a)) == NULL_RTX)
 return FALSE;

Is there any particular reason why the code shouldn't be able to handle
arbitrarily large contents
in then_bb (within a sane limit)?

It's just never been implemented or tested per this comment in 
noce_process_if_block.

  /* We're looking for patterns of the form

 (1) if (...) x = a; else x = b;
 (2) x = b; if (...) x = a;
 (3) if (...) x = a;   // as if with an initial x = x.

 The later patterns require jumps to be more expensive.

 ??? For future expansion, look for multiple X in such patterns.  */

I think folks would look favorably upon removing that limitation, obviously 
with some kind of cost checking.


Thanks, I've made some progress towards making it more aggressive.
A question since I'm in the area...
noce_try_cmove_arith that I've been messing around with has this code:

  /* A conditional move from two memory sources is equivalent to a
 conditional on their addresses followed by a load.  Don't do this
 early because it'll screw alias analysis.  Note that we've
 already checked for no side effects.  */
  /* ??? FIXME: Magic number 5.  */
  if (cse_not_expected
  && MEM_P (a) && MEM_P (b)
  && MEM_ADDR_SPACE (a) == MEM_ADDR_SPACE (b)
  && if_info->branch_cost >= 5)


Any ideas on where the rationale for that 5 came from?
I see it's been there since the very introduction of ifcvt.c
I'd like to replace it with something more sane, maybe even remove it?

Thanks,
Kyrill



Jeff




ifcvt limitations?

2015-06-02 Thread Kyrill Tkachov

Hi all,

I'm looking at a case on aarch64 that's not if-converted to use conditional 
moves:


typedef unsigned char uint8_t;
typedef unsigned int uint16_t;

uint8_t foo(const uint8_t byte, const uint16_t generator)
{
  if (byte & 0x80) {
return (byte << 1) ^ (generator & 0xff);
  } else {
return byte << 1;
  }
}

For aarch64 we fail to if-convert and generate:
foo:
uxtbw2, w0
lsl w3, w2, 1
uxtbw0, w3
tbnzx2, 7, .L5
ret
.p2align 3
.L5:
eor w0, w3, w1
uxtbw0, w0
ret


whereas on x86 we if convert successfully and use a conditional move/select:
leal(%rdi,%rdi), %eax
xorl%eax, %esi
testb   %dil, %dil
cmovs   %esi, %eax
ret



After fixing some of the branch costs in aarch64 and a bogus cost calculation 
in cheap_bb_rtx_cost_p
I'm stuck on noce_process_if_block (in ifcvt.c) and what I think is a 
restriction that the THEN-block contents have to be only a single set insn. 
This fails on aarch64 because we get an extra zero_extend.

In particular, the following check in noce_process_if_block triggers:
  insn_a = first_active_insn (then_bb);
  if (! insn_a
  || insn_a != last_active_insn (then_bb, FALSE)
  || (set_a = single_set (insn_a)) == NULL_RTX)
return FALSE;

Is there any particular reason why the code shouldn't be able to handle 
arbitrarily large contents
in then_bb (within a sane limit)?

Thanks,
Kyrill


Re: target attributes/pragmas changing vector instruction availability and custom types

2015-05-19 Thread Kyrill Tkachov


On 19/05/15 16:21, Kyrill Tkachov wrote:

On 19/05/15 15:55, Christian Bruel wrote:

Hi Kiril,

This is funny, I've updated bz65837 today in the same direction.

On 05/19/2015 04:17 PM, Kyrill Tkachov wrote:

Hi all,

I'm working on enabling target attributes and pragmas on aarch64 and I'm stuck 
on a particular issue.
I want to be able to use a target pragma to enable SIMD support in a SIMD 
intrinsics header file.
So it will look like this:

$ cat simd_header.h
#pragma GCC push_options
#pragma GCC target ("arch=armv8-a+simd")

#pragma GCC pop_options

I would then include it in a file with a function tagged with a simd target 
attribute:

$ cat foo.c
#include "simd_header.h"

__attribute__((target("arch=armv8-a+simd")))
uint32x4_t
foo (uint32x4_t a)
{
  return simd_intrinsic (a); //simd_intrinsic defined in simd_header.h and 
implemented by a target builtin
}

This works fine for me. But if I try to compile this without SIMD support, say: 
aarch64-none-elf-gcc -c -march=armv8-a+nosimd foo.c
I get an ICE during builtin expansion time.

I think I've tracked it down to the problem that the type uint32x4_t is a 
builtin type that we define in the backend
(with add_builtin_type) during the target builtins initialisation code.

From what I can see, this code gets called early on after the command line 
options have been processed,
but before target pragmas or attributes are processed, so the builtin types are 
laid out assuming that no SIMD is available,
as per the command line option -march=armv8-a+nosimd, but later while expanding 
the builtin in simd_intrinsic with SIMD available
the ICE occurs. I think that is because the types were not re-laid out.


I share this analysis.


I'm somewhat stumped on ideas to work around this issue.
I notice that rs6000 also defines custom builtin vector types.
Michael, did you notice any issue similar to what I described above?

Would re-laying out the builtin vector types on target attribute changes be a valid
way to go forward here?

this is what I've done on arm, seems to work locally.

for aarch64, you can try to call aarch64_init_simd_builtins () from your
target hook, then we need to think how to unset them.

Hmmm, calling aarch64_init_simd_builtin_types in the VALID_TARGET_ATTRIBUTE_P 
hook seems to work for me.


Actually, scratch that. I had used the wrong compiler. Calling the builtin init 
code through
the target attribute hook didn't work :(.
Looking further into it...

Kyrill


Thanks for the suggestion! I think undefining shouldn't be a concern since they 
are target-specific builtins and we make no guarantee on their availability or 
behaviour through any other use other than
the intrinsics in arm_neon.h. Of course, we'd need to massage the 
aarch64_init_simd_builtin_types to only
re-layout the types once, so that we don't end up doing redundant work or 
bloating memory.

Thanks again!
Kyrill


Cheers

Christian


Thanks,
Kyrill





Re: target attributes/pragmas changing vector instruction availability and custom types

2015-05-19 Thread Kyrill Tkachov


On 19/05/15 15:55, Christian Bruel wrote:

Hi Kiril,

This is funny, I've updated bz65837 today in the same direction.

On 05/19/2015 04:17 PM, Kyrill Tkachov wrote:

Hi all,

I'm working on enabling target attributes and pragmas on aarch64 and I'm stuck 
on a particular issue.
I want to be able to use a target pragma to enable SIMD support in a SIMD 
intrinsics header file.
So it will look like this:

$ cat simd_header.h
#pragma GCC push_options
#pragma GCC target ("arch=armv8-a+simd")

#pragma GCC pop_options

I would then include it in a file with a function tagged with a simd target 
attribute:

$ cat foo.c
#include "simd_header.h"

__attribute__((target("arch=armv8-a+simd")))
uint32x4_t
foo (uint32x4_t a)
{
 return simd_intrinsic (a); //simd_intrinsic defined in simd_header.h and 
implemented by a target builtin
}

This works fine for me. But if I try to compile this without SIMD support, say: 
aarch64-none-elf-gcc -c -march=armv8-a+nosimd foo.c
I get an ICE during builtin expansion time.

I think I've tracked it down to the problem that the type uint32x4_t is a 
builtin type that we define in the backend
(with add_builtin_type) during the target builtins initialisation code.

   From what I can see, this code gets called early on after the command line 
options have been processed,
but before target pragmas or attributes are processed, so the builtin types are 
laid out assuming that no SIMD is available,
as per the command line option -march=armv8-a+nosimd, but later while expanding 
the builtin in simd_intrinsic with SIMD available
the ICE occurs. I think that is because the types were not re-laid out.


I share this analysis.


I'm somewhat stumped on ideas to work around this issue.
I notice that rs6000 also defines custom builtin vector types.
Michael, did you notice any issue similar to what I described above?

Would re-laying out the builtin vector types on target attribute changes be a valid
way to go forward here?

this is what I've done on arm, seems to work locally.

for aarch64, you can try to call aarch64_init_simd_builtins () from your
target hook, then we need to think how to unset them.


Hmmm, calling aarch64_init_simd_builtin_types in the VALID_TARGET_ATTRIBUTE_P 
hook seems to work for me.
Thanks for the suggestion! I think undefining shouldn't be a concern since they 
are target-specific builtins and we make no guarantee on their availability or 
behaviour through any other use other than
the intrinsics in arm_neon.h. Of course, we'd need to massage the 
aarch64_init_simd_builtin_types to only
re-layout the types once, so that we don't end up doing redundant work or 
bloating memory.
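
Something like this trivial guard is what I have in mind (a sketch only,
the wrapper name is invented and it is untested):

  static bool simd_types_relaid_out_p;

  static void
  aarch64_relayout_simd_types (void)
  {
    if (simd_types_relaid_out_p)
      return;
    simd_types_relaid_out_p = true;
    aarch64_init_simd_builtin_types ();
  }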

Thanks again!
Kyrill



Cheers

Christian


Thanks,
Kyrill





target attributes/pragmas changing vector instruction availability and custom types

2015-05-19 Thread Kyrill Tkachov

Hi all,

I'm working on enabling target attributes and pragmas on aarch64 and I'm stuck 
on a particular issue.
I want to be able to use a target pragma to enable SIMD support in a SIMD 
intrinsics header file.
So it will look like this:

$ cat simd_header.h
#pragma GCC push_options
#pragma GCC target ("arch=armv8-a+simd")

#pragma GCC pop_options

I would then include it in a file with a function tagged with a simd target 
attribute:

$ cat foo.c
#include "simd_header.h"

__attribute__((target("arch=armv8-a+simd")))
uint32x4_t
foo (uint32x4_t a)
{
  return simd_intrinsic (a); //simd_intrinsic defined in simd_header.h and 
implemented by a target builtin
}

This works fine for me. But if I try to compile this without SIMD support, say: 
aarch64-none-elf-gcc -c -march=armv8-a+nosimd foo.c
I get an ICE during builtin expansion time.

I think I've tracked it down to the problem that the type uint32x4_t is a 
builtin type that we define in the backend
(with add_builtin_type) during the target builtins initialisation code.

From what I can see, this code gets called early on after the command line 
options have been processed,
but before target pragmas or attributes are processed, so the builtin types are 
laid out assuming that no SIMD is available,
as per the command line option -march=armv8-a+nosimd, but later while expanding 
the builtin in simd_intrinsic with SIMD available
the ICE occurs. I think that is because the types were not re-laid out.

I'm somewhat stumped on ideas to work around this issue.
I notice that rs6000 also defines custom builtin vector types.
Michael, did you notice any issue similar to what I described above?

Would re-laying out the builtin vector types on target attribute changes be a valid
way to go forward here?

Thanks,
Kyrill



Re: target attributes, pragmas and preprocessor macros

2015-05-18 Thread Kyrill Tkachov


On 18/05/15 09:25, Kyrill Tkachov wrote:

Hi Christian,

On 18/05/15 07:26, Christian Bruel wrote:

Hi Kyrill,

On 05/13/2015 05:43 PM, Kyrill Tkachov wrote:

Hi all,

Are target attributes supposed to redefine the preprocessor macros available?
For example, on aarch64 if the file is compiled with floating point support
the __ARM_FEATURE_FMA predefine is available. If the user adds to a function
a target attribute disabling floating point, then is __ARM_FEATURE_FMA supposed
to be undefined in the body of that function?

Looking at some backends, it seems that only #pragmas are supposed to have that 
effect,
but I just wanted to confirm.


Yes, they do. But be careful, even the "inherited" macros should be redefined...

Thanks.
Implementation question then.
Is TARGET_PRAGMA_PARSE actually called during preprocessing time?
I built an x86_64 toolchain, compiled a file with target pragmas in it,
put a breakpoint on the hook entry and it only triggered later on during 
parsing.
Same for an aarch64 implementation of target pragmas that I'm working on.
Am I just doing something wrong? Or is there some other place where I should
be looking?


Actually, scratch that. I got it working, sorry for the noise.

Kyrill



Thanks,
Kyrill


for instance for arm/thumb we can have something like:

#pragma GCC target ("thumb")

#ifndef __thumb__
#error "__thumb__ is not defined"
#endif

#ifdef __thumb2__
#ifndef __ARM_32BIT_STATE
#error  "__ARM_32BIT_STATE is not defined"
#endif
#else /* thumb1 */
#ifdef __ARM_32BIT_STATE
#error  "__ARM_32BIT_STATE is defined"
#endif
#endif /* thumb1 */

...


Thanks,
Kyrill


Cheers

Christian





Re: target attributes, pragmas and preprocessor macros

2015-05-18 Thread Kyrill Tkachov

Hi Christian,

On 18/05/15 07:26, Christian Bruel wrote:

Hi Kyrill,

On 05/13/2015 05:43 PM, Kyrill Tkachov wrote:

Hi all,

Are target attributes supposed to redefine the preprocessor macros available?
For example, on aarch64 if the file is compiled with floating point support
the __ARM_FEATURE_FMA predefine is available. If the user adds to a function
a target attribute disabling floating point, then is __ARM_FEATURE_FMA supposed
to be undefined in the body of that function?

Looking at some backends, it seems that only #pragmas are supposed to have that 
effect,
but I just wanted to confirm.


Yes, they do. But be careful, even the "inherited" macros should be redefined...


Thanks.
Implementation question then.
Is TARGET_PRAGMA_PARSE actually called during preprocessing time?
I built an x86_64 toolchain, compiled a file with target pragmas in it,
put a breakpoint on the hook entry and it only triggered later on during 
parsing.
Same for an aarch64 implementation of target pragmas that I'm working on.
Am I just doing something wrong? Or is there some other place where I should
be looking?

Thanks,
Kyrill



for instance for arm/thumb we can have something like:

#pragma GCC target ("thumb")

#ifndef __thumb__
#error "__thumb__ is not defined"
#endif

#ifdef __thumb2__
#ifndef __ARM_32BIT_STATE
#error  "__ARM_32BIT_STATE is not defined"
#endif
#else /* thumb1 */
#ifdef __ARM_32BIT_STATE
#error  "__ARM_32BIT_STATE is defined"
#endif
#endif /* thumb1 */

...


Thanks,
Kyrill


Cheers

Christian





target attributes, pragmas and preprocessor macros

2015-05-13 Thread Kyrill Tkachov

Hi all,

Are target attributes supposed to redefine the preprocessor macros available?
For example, on aarch64 if the file is compiled with floating point support
the __ARM_FEATURE_FMA predefine is available. If the user adds to a function
a target attribute disabling floating point, then is __ARM_FEATURE_FMA supposed
to be undefined in the body of that function?

Looking at some backends, it seems that only #pragmas are supposed to have that 
effect,
but I just wanted to confirm.

Thanks,
Kyrill



Re: Target attribute hooks questions

2015-05-05 Thread Kyrill Tkachov

Hi Christian,

On 05/05/15 15:25, Christian Bruel wrote:

Hi Kyrill,

you are right, it's not easy to find one's way among all those macros; my
main source of inspiration for ARM was the x86 implementation.


Yeah, I've been looking at that and rs6000 for some perspective.



You can have a look at the ARM implementation to start with (on
gcc-patches, under review). It would be best not to diverge too much;
aarch64 might have some code to share with the arm backend. FYI I'm planning
to add the fpu/neon attribute extensions.

A few quick answer bellow, ask if you need more.

Cheers

Christian

On 05/05/2015 03:38 PM, Kyrill Tkachov wrote:

Hi all,

I'm looking at implementing target attributes for aarch64 and I have some 
questions about the hooks involved.
I haven't looked at this part of the compiler before, so forgive me if some of 
them seem obvious. I couldn't
figure it out from the documentation 
(https://gcc.gnu.org/onlinedocs/gccint/Target-Attributes.html#Target-Attributes)

* Seems to me that TARGET_OPTION_VALID_ATTRIBUTE_P is the most important one 
that parses
the string inside the __attribute__ ((target ("..."))) and sets the 
target-specific
flags appropriately. Is that correct?

Yes, it parses the string that goes into DECL_FUNCTION_SPECIFIC_TARGET
(fndecl) and then builds the struct gcc_options that will be switched
between functions. Note that this must then go again through the
option_override machinery since global options can be affected by the
target options.


Right, so I'll need to call the option_override hook manually...




* What is TARGET_ATTRIBUTE_TABLE used for? It's supposed to map attributes to 
handlers?
Isn't that what TARGET_OPTION_VALID_ATTRIBUTE_P is for?

I think it's different.  The TARGET_ATTRIBUTE_TABLE specifies specific
attributes (e.g. naked, interrupt, ...) while the target attribute allows
passing target flags (e.g. -marm, -mfpu=neon, ...).


Ok, I see from the rs6000 backend that it's something different, I'll leave it 
alone for now.




* What is the use of TARGET_OPTION_SAVE and TARGET_OPTION_RESTORE? Is that used 
during
   something like LTO when different object files and functions are compiled 
with different
flags? Are these functions just supposed to 'backup' various tuning and ISA 
decisions?


This is to save custom function information that is not restored by
TARGET_SET_CURRENT_FUNCTION. I didn't need it for arm/thumb.


I'm looking at these in the context of LTO. From what I understand, LTO uses 
target attributes
to tag each function with target-specific flags so that it can keep track of 
the flags when
linking object files compiled with different target flags (e.g. different mcpu 
options).
Which hooks are used in this process?




* Is TARGET_COMP_TYPE_ATTRIBUTES the one that's supposed to handle incompatible 
attributes
being specified? (for example incompatible endianness or architecture levels)?

like TARGET_ATTRIBUTE_TABLE, it's different and doesn't pertain to
attribute target


Thanks for the help!
Kyrill



Cheers

Christian


Thanks for any insight,
Kyrill





Target attribute hooks questions

2015-05-05 Thread Kyrill Tkachov

Hi all,

I'm looking at implementing target attributes for aarch64 and I have some 
questions about the hooks involved.
I haven't looked at this part of the compiler before, so forgive me if some of 
them seem obvious. I couldn't
figure it out from the documentation 
(https://gcc.gnu.org/onlinedocs/gccint/Target-Attributes.html#Target-Attributes)

* Seems to me that TARGET_OPTION_VALID_ATTRIBUTE_P is the most important one 
that parses
the string inside the __attribute__ ((target ("..."))) and sets the 
target-specific
flags appropriately. Is that correct?

* What is TARGET_ATTRIBUTE_TABLE used for? It's supposed to map attributes to 
handlers?
Isn't that what TARGET_OPTION_VALID_ATTRIBUTE_P is for?

* What is the use of TARGET_OPTION_SAVE and TARGET_OPTION_RESTORE? Is that used 
during
 something like LTO when different object files and functions are compiled with 
different
flags? Are these functions just supposed to 'backup' various tuning and ISA 
decisions?

* Is TARGET_COMP_TYPE_ATTRIBUTES the one that's supposed to handle incompatible 
attributes
being specified? (for example incompatible endianness or architecture levels)?

Thanks for any insight,
Kyrill



Bootstrap configuration for hppa-linux-gnu ?

2015-04-29 Thread Kyrill Tkachov

Hi all,

I'm trying to run a bootstrap on a 64-bit hppa-linux-gnu but am getting
an error when building libgcc:
/usr/include/features.h:374:25: fatal error: sys/cdefs.h: No such file or 
directory

This suggests to me that it's a problem with multilibs.
I configured the build with  --disable-multiarch 
--enable-languages=c,c++,fortran.
Is there anything else that's usually added to bootstrap on hppa-linux-gnu?

Thanks,
Kyrill



Re: Fixing inconsistent uses of address costs

2015-03-30 Thread Kyrill Tkachov


On 30/03/15 08:14, Bin.Cheng wrote:

On Sat, Mar 28, 2015 at 1:31 AM, Sandra Loosemore
 wrote:

On 03/27/2015 03:43 AM, Kyrill Tkachov wrote:


On 27/03/15 03:29, Bin.Cheng wrote:

[much snippage]


As for tree ivopts, address cost is used in both ways.  For any
address computation that's invalid, it tries to legitimize it into two
parts, the first part results in alu instructions, the second part
results in an address expression of different addressing modes.  Right
now the rtx cost (for the first part) and the address cost (for the
second part) are accumulated and compared altogether.  I do like the
idea of splitting costs into different types because all elements are
entangled in a single cost model, and it's hard to tune for a specific
case.


Thanks for explaining.
I think it would be possible to make the comparisons there a bit more
sane.
If an address computation is split into alu instructions and a
legitimate address
then carry around the rtx cost of the alu part and the address. When the
time
comes to compare two computations, we create a more involved way of
comparing.
For example (would need benchmarking, of course):
* Compare the rtx costs of the alu components and check the address
preference for
the legitimate address components using the new hook.
* If the alu part costs are of equal rtx cost, pick the one which has
the preferable legitimate address.
* If expression 'a' has a more expensive alu component than 'b' but a more
preferable address component, then use some tie breaker. Perhaps apply
rtx costs
on the address expression and compare those...


Just as an aside here, tree-ssa-loop-ivopts.c has a lot of other problems
with how it's computing address costs, or it least it did when I last looked
at it a few years ago:

https://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html

Shortly after I posted that, Qualcomm lost interest in pushing the Hexagon
port upstream or improving performance, so I lost interest in pursuing that
patch when it was evident that it was going to take a lot of work to resolve
the objections.  I think fixes for problems (2) and (3) there have since
been pushed by other people, but I'm not sure about the others.

FWIW, I think a big part of the problem here is that the GCC internals
documentation isn't very clear on where the cost of legitimizing an address
(the "alu components" above) should be computed.  IIRC when I was last

I agree that the legitimize_address interface matters more than the address
cost itself.  IVOPT uses it to compute the cost but doesn't use it to
generate final code when rewriting address type uses (though it tries
to mimic the legitimization behavior and generate lower cost
instructions and addressing expressions?).


When you say the legitimize_address interface, do you mean the stuff generated
from TARGET_LEGITIMIZE_ADDRESS and memory_address_addr_space in explow.c?
Do you think we can just use memory_address_addr_space and estimate the cost
of that in ivopts?

Thanks,
Kyrill



Thanks,
bin

looking at current practice, most targets' implementation of
TARGET_RTX_COSTS didn't make any attempt to account for the address cost in
a MEM -- either adding the cost of legitimizing the address or the cost of
the addressing mode itself (TARGET_ADDRESS_COST).  If TARGET_RTX_COSTS is
supposed to do that, its documentation should say so.  Or maybe we need a
separate hook like TARGET_LEGITIMIZE_ADDRESS_COST to capture the "alu
components" cost.

-Sandra





Re: Fixing inconsistent uses of address costs

2015-03-30 Thread Kyrill Tkachov


On 30/03/15 08:07, Bin.Cheng wrote:

On Fri, Mar 27, 2015 at 5:43 PM, Kyrill Tkachov  wrote:

On 27/03/15 03:29, Bin.Cheng wrote:

On Thu, Mar 26, 2015 at 11:40 PM, Kyrill Tkachov 
wrote:

Hi all,

I'd like to attempt to make GCC's usage of costs in the backends
consistent.
We have a lot of different types: rtx costs, address costs, regmove
costs,
vector costs etc. Some of them are used in different units, compared
against
different types of costs and in general are a bit of a mess. For now I'd
like
to clean up address costs since they are actually not used as much as
some
other costs and seem to me to be a slightly simpler task.

  From what I can see, address costs i.e. the TARGET_ADDRESS_COST hook are
used
in 5 places overall:
* fwprop.c: forward propagation tries to propagate an rtx into an address
and compare the address cost of the old address and the one it wants to
propagate into it and picks the most profitable.

* postreload.c: Again, tries to replace an address with a new one and
compares
the two address costs and picks the most profitable.

* tree-ssa-loop-ivopts.c: A bit more involved here. From what I can tell it
is used to assign a cost to various addressing modes, but in the
computation_cost function it ends up adding/mixing the address cost with rtx
costs.

* Some targets use address costs as part of their calculation of rtx
costs
for, say, a memory access instruction.

* The final and worst part is in loop-invariant.c in the
create_new_invariant
function that needs to get a feel for how 'cheap' an address is and for
that
it compares the address cost to a magic number 3. See the thread that
introduced it at http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html.

There are a few issues with the above usages:
- The magic number in loop-invariant.c was picked to help bwaves on x86
and
makes no analytical sense for any other target.

- Some targets use COSTS_N_INSNS units for their address costs, others
use
normal scalar units (every target that doesn't define the address costs
hook
ends up going through rtx costs which use COSTS_N_INSNS).
I think this rather undermines the usage in tree-ssa-loop-ivopts.c which
uses
rtx costs in COSTS_N_INSNS and address costs interchangeably.

An insight here is that address costs have two types of usage:
1. To compare two addresses and choose whether to replace one by the
other
(as in fwprop and postreload)
2. To give a metric of cheapness for use in loop optimisations.

So I propose to split the two uses up. For the fwprop and postreload case
introduce a hook that will take two addresses 'x' and 'y' and determine
whether to substitute 'x' by 'y'. This can use address costs and allow
the
backend to do as deep comparison between two addresses as it could
possibly
want, unconstrained by the requirement to assign just a single number to
each one individually and having the midend just pick the smallest
number.
We can call it something like TARGET_ADDRESS_PREFERABLE_P or something
like
that. Then get fwprop and postreload to use that to determine whether to
substitute one address for another.

What to do with the remaining address costs hook and its uses in
tree-ssa-loop-ivopts and loop-invariant? Since tree-ssa-loop-ivopts uses
the address costs interchangeably with rtx costs I think we should at
least
convert all address costs to use the same units: COSTS_N_INSNS. This
raises
a question: could we not just use rtx costs for the addresses? Someone
with
more familiarity with the code can correct me, but is the address cost
usage
in that file supposed to reflect the cost of computing the address
outside
of
an addressing mode? If so, then shouldn't rtx costs be used? Conversely,
if
address costs are a measure of something different, surely we shouldn't
be
adding or comparing them with rtx costs, but rather create some
two-dimensional
structure to compare two rtx/address entities? (similarly to the
rtx cost/latency calculations in expmed for the mult synthesis code).

Then there's the usage in loop-invariant.c where the address cost number
is
compared with '3'. This is really not portable across targets. The code
wants
some measure of whether an address is 'cheap'. There could be various
approaches on how to provide that.
Could we enumerate across an rtx array of representatives of
all possible addresses for a target sorted by the
TARGET_ADDRESS_PREFERABLE_P
hook above? Should we be asking the target to tell us which address type
is
cheap? In any case, it seems to me that the cheapest address in a target
is
something that only the target can tell, unless we agree on a universal
meaning and unit of measurement for address costs.

What do people think? Would this approach be worth pursuing if the above
questions can be answered? From having a quick look at config/ I think
that
converting the targets to use COSTS_N_INSNS units would not be a
controvers

Re: Fixing inconsistent uses of address costs

2015-03-30 Thread Kyrill Tkachov


On 27/03/15 17:31, Sandra Loosemore wrote:

On 03/27/2015 03:43 AM, Kyrill Tkachov wrote:

On 27/03/15 03:29, Bin.Cheng wrote:

[much snippage]

As for tree ivopts, address cost is used in both ways.  For any
address computation that's invalid, it tries to legitimize it into two
parts, the first part results in alu instructions, the second part
results in an address expression of different addressing modes.  Right
now the rtx cost (for the first part) and the address cost (for the
second part) are accumulated and compared altogether.  I do like the
idea of splitting costs into different types because all elements are
entangled in a single cost model, and it's hard to tune for a specific
case.

Thanks for explaining.
I think it would be possible to make the comparisons there a bit more sane.
If an address computation is split into alu instructions and a
legitimate address
then carry around the rtx cost of the alu part and the address. When the
time
comes to compare two computations, we create a more involved way of
comparing.
For example (would need benchmarking, of course):
* Compare the rtx costs of the alu components and check the address
preference for
the legitimate address components using the new hook.
* If the alu part costs are of equal rtx cost, pick the one which has
the preferable legitimate address.
* If expression 'a' has a more expensive alu component than 'b' but a more
preferable address component, then use some tie breaker. Perhaps apply
rtx costs
on the address expression and compare those...

Just as an aside here, tree-ssa-loop-ivopts.c has a lot of other
problems with how it's computing address costs, or it least it did when
I last looked at it a few years ago:

https://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html


Thanks, that's a useful read.



Shortly after I posted that, Qualcomm lost interest in pushing the
Hexagon port upstream or improving performance, so I lost interest in
pursuing that patch when it was evident that it was going to take a lot
of work to resolve the objections.  I think fixes for problems (2) and
(3) there have since been pushed by other people, but I'm not sure about
the others.

FWIW, I think a big part of the problem here is that the GCC internals
documentation isn't very clear on where the cost of legitimizing an
address (the "alu components" above) should be computed.  IIRC when I
was last looking at current practice, most targets' implementation of
TARGET_RTX_COSTS didn't make any attempt to account for the address cost
in a MEM -- either adding the cost of legitimizing the address or the
cost of the addressing mode itself (TARGET_ADDRESS_COST).


Currently some targets (aarch64 at least) add the TARGET_ADDRESS_COST to the
cost of a MEM, some (most?) don't, though I'm not sure how much effect that
has had on codegen quality. Do you think it would be a good idea to require
from TARGET_RTX_COST to try to estimate the legitimization cost of an
invalid address in units compatible with rtx cost for the purposes of
ivopts?


   If
TARGET_RTX_COSTS is supposed to do that, its documentation should say
so.  Or maybe we need a separate hook like
TARGET_LEGITIMIZE_ADDRESS_COST to capture the "alu components" cost.


Can't we just capture the sequence that TARGET_LEGITIMIZE_ADDRESS would
emit and call rtx_cost (or seq_cost) on that?
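
Something along these lines is what I mean (a rough sketch, untested; it
just runs the address through the generic legitimization entry point and
costs whatever insns fall out):

  static int
  legitimization_cost (rtx addr, machine_mode mem_mode, addr_space_t as,
                       bool speed)
  {
    start_sequence ();
    rtx legit = memory_address_addr_space (mem_mode, addr, as);
    rtx_insn *seq = get_insns ();
    end_sequence ();
    /* ALU part in COSTS_N_INSNS units, plus the cost of the resulting
       legitimate address.  */
    return seq_cost (seq, speed) + address_cost (legit, mem_mode, as, speed);
  }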

Thanks,
Kyrill


-Sandra





Re: Fixing inconsistent uses of address costs

2015-03-27 Thread Kyrill Tkachov


On 27/03/15 03:29, Bin.Cheng wrote:

On Thu, Mar 26, 2015 at 11:40 PM, Kyrill Tkachov  wrote:

Hi all,

I'd like to attempt to make GCC's usage of costs in the backends consistent.
We have a lot of different types: rtx costs, address costs, regmove costs,
vector costs etc. Some of them are used in different units, compared against
different types of costs and in general are a bit of a mess. For now I'd
like
to clean up address costs since they are actually not used as much as some
other costs and seem to me to be a slightly simpler task.

 From what I can see, address costs i.e. the TARGET_ADDRESS_COST hook are
used
in 5 places overall:
* fwprop.c: forward propagation tries to propagate an rtx into an address
and compare the address cost of the old address and the one it wants to
propagate into it and picks the most profitable.

* postreload.c: Again, tries to replace an address with a new one and
compares
the two address costs and picks the most profitable.

* tree-ssa-loop-ivopts.c: A bit more involved here. From what I can tell it
is used to assign a cost to various addressing modes, but in the
computation_cost function it ends up adding/mixing the address cost with rtx costs.

* Some targets use address costs as part of their calculation of rtx costs
for, say, a memory access instruction.

* The final and worst part is in loop-invariant.c in the
create_new_invariant
function that needs to get a feel for how 'cheap' an address is and for that
it compares the address cost to a magic number 3. See the thread that
introduced it at http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html.

There are a few issues with the above usages:
- The magic number in loop-invariant.c was picked to help bwaves on x86 and
makes no analytical sense for any other target.

- Some targets use COSTS_N_INSNS units for their address costs, others use
normal scalar units (every target that doesn't define the address costs hook
ends up going through rtx costs which use COSTS_N_INSNS).
I think this rather undermines the usage in tree-ssa-loop-ivopts.c which
uses
rtx costs in COSTS_N_INSNS and address costs interchangeably.

An insight here is that address costs have two types of usage:
1. To compare two addresses and choose whether to replace one by the other
(as in fwprop and postreload)
2. To give a metric of cheapness for use in loop optimisations.

So I propose to split the two uses up. For the fwprop and postreload case
introduce a hook that will take two addresses 'x' and 'y' and determine
whether to substitute 'x' by 'y'. This can use address costs and allow the
backend to do as deep comparison between two addresses as it could possibly
want, unconstrained by the requirement to assign just a single number to
each one individually and having the midend just pick the smallest number.
We can call it something like TARGET_ADDRESS_PREFERABLE_P or something like
that. Then get fwprop and postreload to use that to determine whether to
substitute one address for another.

What to do with the remaining address costs hook and its uses in
tree-ssa-loop-ivopts and loop-invariant? Since tree-ssa-loop-ivopts uses
the address costs interchangeably with rtx costs I think we should at least
convert all address costs to use the same units: COSTS_N_INSNS. This raises
a question: could we not just use rtx costs for the addresses? Someone with
more familiarity with the code can correct me, but is the address cost usage
in that file supposed to reflect the cost of computing the address outside
of
an addressing mode? If so, then shouldn't rtx costs be used? Conversely, if
address costs are a measure of something different, surely we shouldn't be
adding or comparing them with rtx costs, but rather create some
two-dimensional
structure to compare two rtx/address entities? (similarly to the
rtx cost/latency calculations in expmed for the mult synthesis code).

Then there's the usage in loop-invariant.c where the address cost number is
compared with '3'. This is really not portable across targets. The code
wants
some measure of whether an address is 'cheap'. There could be various
approaches on how to provide that.
Could we enumerate across an rtx array of representatives of
all possible addresses for a target sorted by the
TARGET_ADDRESS_PREFERABLE_P
hook above? Should we be asking the target to tell us which address type is
cheap? In any case, it seems to me that the cheapest address in a target is
something that only the target can tell, unless we agree on a universal
meaning and unit of measurement for address costs.

What do people think? Would this approach be worth pursuing if the above
questions can be answered? From having a quick look at config/ I think that
converting the targets to use COSTS_N_INSNS units would not be a
controversial
task as long as the midend usage of address costs is consistent.

I totally agree that use 

Fixing inconsistent uses of address costs

2015-03-26 Thread Kyrill Tkachov

Hi all,

I'd like to attempt to make GCC's usage of costs in the backends consistent.
We have a lot of different types: rtx costs, address costs, regmove costs,
vector costs etc. Some of them are used in different units, compared against
different types of costs and in general are a bit of a mess. For now I'd like
to clean up address costs since they are actually not used as much as some
other costs and seem to me to be a slightly simpler task.

From what I can see, address costs i.e. the TARGET_ADDRESS_COST hook are used
in 5 places overall:
* fwprop.c: forward propagation tries to propagate an rtx into an address
and compare the address cost of the old address and the one it wants to
propagate into it and picks the most profitable.

* postreload.c: Again, tries to replace an address with a new one and compares
the two address costs and picks the most profitable.

* tree-ssa-loop-ivopts.c: A bit more involved here. From what I can tell it is
used to assign a cost to various addressing modes, but in the computation_cost
function it ends up adding/mixing the address cost with rtx costs.

* Some targets use address costs as part of their calculation of rtx costs
for, say, a memory access instruction.

* The final and worst part is in loop-invariant.c in the create_new_invariant
function that needs to get a feel for how 'cheap' an address is and for that
it compares the address cost to a magic number 3. See the thread that
introduced it at http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html.

There are a few issues with the above usages:
- The magic number in loop-invariant.c was picked to help bwaves on x86 and
makes no analytical sense for any other target.

- Some targets use COSTS_N_INSNS units for their address costs, others use
normal scalar units (every target that doesn't define the address costs hook
ends up going through rtx costs which use COSTS_N_INSNS).
I think this rather undermines the usage in tree-ssa-loop-ivopts.c which uses
rtx costs in COSTS_N_INSNS and address costs interchangeably.

An insight here is that address costs have two types of usage:
1. To compare two addresses and choose whether to replace one by the other
(as in fwprop and postreload)
2. To give a metric of cheapness for use in loop optimisations.

So I propose to split the two uses up. For the fwprop and postreload case
introduce a hook that will take two addresses 'x' and 'y' and determine
whether to substitute 'x' by 'y'. This can use address costs and allow the
backend to do as deep comparison between two addresses as it could possibly
want, unconstrained by the requirement to assign just a single number to
each one individually and having the midend just pick the smallest number.
We can call it something like TARGET_ADDRESS_PREFERABLE_P or something like
that. Then get fwprop and postreload to use that to determine whether to
substitute one address for another.
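
To make this a bit more concrete, here is a rough sketch (names and signature
hypothetical, since the hook does not exist yet) of what the fwprop/postreload
use could look like, with a trivial default that just falls back to the
existing per-address cost:

/* Hypothetical hook: return true if NEW_ADDR should replace OLD_ADDR.
   A target can compare the two addresses directly instead of reducing
   each one to a single number.  */
static bool
default_address_preferable_p (rtx new_addr, rtx old_addr,
                              machine_mode mode, addr_space_t as,
                              bool speed)
{
  /* Trivial default: keep using the per-address cost.  */
  return (address_cost (new_addr, mode, as, speed)
          < address_cost (old_addr, mode, as, speed));
}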

What to do with the remaining address costs hook and its uses in
tree-ssa-loop-ivopts and loop-invariant? Since tree-ssa-loop-ivopts uses
the address costs interchangeably with rtx costs I think we should at least
convert all address costs to use the same units: COSTS_N_INSNS. This raises
a question: could we not just use rtx costs for the addresses? Someone with
more familiarity with the code can correct me, but is the address cost usage
in that file supposed to reflect the cost of computing the address outside of
an addressing mode? If so, then shouldn't rtx costs be used? Conversely, if
address costs are a measure of something different, surely we shouldn't be
adding or comparing them with rtx costs, but rather create some two-dimensional
structure to compare two rtx/address entities? (similarly to the
rtx cost/latency calculations in expmed for the mult synthesis code).

Then there's the usage in loop-invariant.c where the address cost number is
compared with '3'. This is really not portable across targets. The code wants
some measure of whether an address is 'cheap'. There could be various
approaches on how to provide that.
Could we enumerate across an rtx array of representatives of
all possible addresses for a target sorted by the TARGET_ADDRESS_PREFERABLE_P
hook above? Should we be asking the target to tell us which address type is
cheap? In any case, it seems to me that the cheapest address in a target is
something that only the target can tell, unless we agree on a universal
meaning and unit of measurement for address costs.

What do people think? Would this approach be worth pursuing if the above
questions can be answered? From having a quick look at config/ I think that
converting the targets to use COSTS_N_INSNS units would not be a controversial
task as long as the midend usage of address costs is consistent.

Thanks,
Kyrill



Re: Is there a way to use define_subst when operands need to change modes?

2015-03-02 Thread Kyrill Tkachov


On 02/03/15 17:38, Ilya Tocar wrote:

On 02 Mar 15:22, Kyrill Tkachov wrote:

Hi all,

I'm looking at using the define_subst machinery to auto-generate
zero-extended
versions of some patterns, for example having:
(set reg:SI
  (xor:SI a:SI b:SI))

generate a pattern of the form:
(set reg:DI
  (zero_extend:DI
(xor:SI (a:SI b:SI

How do I go about achieving this? From the documentation, I think I need
something like:
  (define_subst "add_z_extend"
[(set (match_operand:SI 0 "" "")
  (match_operand:SI 1 "" ""))]
""
[(set (match_dup 0)
  (zero_extend:DI (match_dup 1)))]

but in the resultant pattern I need operand 0 to be transformed into DImode.
Is there a way to write that?


Can't you just use  [(set (match_operand:DI 0 "" "")...  instead of
match_dup?


So, something like:

 (define_subst "add_z_extend"
   [(set (match_operand:SI 0 "" "")
 (match_operand:SI 1 "" ""))]
   ""
   [(set (match_operand:DI 0 "" "")
 (zero_extend:DI (match_dup 1)))]


?


Are we allowed to match an operand twice?




Is there a way to use define_subst when operands need to change modes?

2015-03-02 Thread Kyrill Tkachov

Hi all,

I'm looking at using the define_subst machinery to auto-generate zero-extended
versions of some patterns, for example having:
(set reg:SI
 (xor:SI a:SI b:SI))

generate a pattern of the form:
(set reg:DI
 (zero_extend:DI
   (xor:SI (a:SI b:SI

How do I go about achieving this? From the documentation, I think I need 
something like:

 (define_subst "add_z_extend"
   [(set (match_operand:SI 0 "" "")
 (match_operand:SI 1 "" ""))]
   ""
   [(set (match_dup 0)
 (zero_extend:DI (match_dup 1)))]

but in the resultant pattern I need operand 0 to be transformed into DImode.
Is there a way to write that?

Thanks,
Kyrill




Re: Confusing description of GCC option `-freorder-blocks'

2014-12-01 Thread Kyrill Tkachov


On 01/12/14 08:20, Pengfei Yuan wrote:

Hi,

Hi,



In https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html , the
description of option `-freorder-blocks' says `in order to reduce
number of taken branches and improve code locality'. It is confusing.
When will the `and' condition happen? That is, taken branches reduced
AND code locality improved.


I would think that one implies the other. If you reorder the blocks
so as to reduce taken branches, that is by definition an improvement of
code locality since you don't jump all over the place as much.

Cheers,
Kyrill



Thanks!

YUAN, Pengfei






How to prune tests that are too large for a tiny memory model in libstdc++?

2014-11-28 Thread Kyrill Tkachov

Hi all,

I'm seeing some relocation truncation failures in the libstdc++ testsuite where
the test is too large to fit into the memory model. In the gcc testsuite we mark
these tests as unsupported using something like the below fragment in gcc-dg.exp

if { [regexp "(^|\n)\[^\n\]*: relocation truncated to fit" $text]
  && [check_effective_target_tiny] } {
return "::unsupported::memory full"
}

Where would I go about adding similar logic in the libstdc++ testsuite .exp files?

I can't find similar pruning infrastructure there...

Thanks,
Kyrill



Re: Expansion of memset and memcpy calls.

2014-10-21 Thread Kyrill Tkachov


On 21/10/14 08:37, Ajit Kumar Agarwal wrote:

Hello All:

Memset and memcpy calls are extensively used in many benchmarks. Inlining or
expanding the memcpy and memset calls improves the performance of many
benchmarks.
I have implemented the expansion of strcmp to an optimized sequence of
instructions in the Open64 compiler for the AMD x86 target.

Can I suggest and propose to expand the memset and memcpy calls to a sequence
of instructions, as many targets like ARM are moving the implementation of
memcpy and memset to assembly instead of C. This makes it easier to expand the
memcpy and memset calls in gcc.


There is the 'movmem' standard name that backends can expand to 
implement memcpy semantics. For aarch64, for example, look at the 
movmemdi pattern in aarch64.md and the aarch64_expand_movmem helper in 
aarch64.c
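
To make the terminology concrete, a purely illustrative C-level example (not
from the GCC sources) of the kind of call such expansion targets: when the
size is a compile-time constant, the call can be emitted as a few loads and
stores instead of an out-of-line library call.

#include <string.h>

struct pair { int a, b; };

/* Small, constant-size copy: a candidate for inline expansion rather
   than a call to the memcpy library routine.  */
void
copy_pair (struct pair *dst, const struct pair *src)
{
  memcpy (dst, src, sizeof *dst);
}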


Kyrill



To implement this in GCC we need to expand these calls similarly to how the
builtins are implemented.

Let me know what do you think.

Thanks & Regards
Ajit







Re: optimization for simd intrinsics vld2_dup_* on aarch64-none-elf

2014-09-03 Thread Kyrill Tkachov

Hi Shanyao,

On 03/09/14 16:02, shanyao chen wrote:

Hi,
I found there is a performance problem with some simd intrinsics
(vld2_dup_*) on aarch64-none-elf. Now the vld2_dup_* are defined as
follows:

#define __LD2R_FUNC(rettype, structtype, ptrtype, \
 regsuffix, funcsuffix, Q) \
   __extension__ static __inline rettype \
   __attribute__ ((__always_inline__))  \
   vld2 ## Q ## _dup_ ## funcsuffix (const ptrtype *ptr) \
   { \
 rettype result; \
 __asm__ ("ld2r {v16." #regsuffix ", v17." #regsuffix "}, %1\n\t" \
  "st1 {v16." #regsuffix ", v17." #regsuffix "}, %0\n\t" \
  : "=Q"(result) \
  : "Q"(*(const structtype *)ptr) \
  : "memory", "v16", "v17"); \
 return result; \
   }

It loads from memory to registers, and then stores the value of the
registers to memory as a result. Such code performs terribly because of the
redundant memory accesses and the constrained register allocation.

Some intrinsics like vld2_* are similar to vld2_dup_*, but they are now
implemented with builtin functions.

__extension__ static __inline int16x4x2_t __attribute__ ((__always_inline__))
vld2_s16 (const int16_t * __a)
{
   int16x4x2_t ret;
   __builtin_aarch64_simd_oi __o;
   __o = __builtin_aarch64_ld2v4hi ((const __builtin_aarch64_simd_hi *) __a);
   ret.val[0] = (int16x4_t) __builtin_aarch64_get_dregoiv4hi (__o, 0);
   ret.val[1] = (int16x4_t) __builtin_aarch64_get_dregoiv4hi (__o, 1);
   return ret;
}

Could vld2_dup_* also be written as builtins?  If not, I think the
inline assembler can be optimized as follows:


The arm port implements these with builtins; it should be possible to
implement them that way on aarch64 as well.
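
For illustration only -- the ld2r builtin name below is hypothetical, chosen
by analogy with __builtin_aarch64_ld2v4hi above -- a builtin-based definition
would look roughly like:

__extension__ static __inline int16x4x2_t __attribute__ ((__always_inline__))
vld2_dup_s16 (const int16_t * __a)
{
  int16x4x2_t ret;
  __builtin_aarch64_simd_oi __o;
  /* Hypothetical "load 2 and replicate" builtin that such a patch would
     add alongside the existing ld2 builtins.  */
  __o = __builtin_aarch64_ld2rv4hi ((const __builtin_aarch64_simd_hi *) __a);
  ret.val[0] = (int16x4_t) __builtin_aarch64_get_dregoiv4hi (__o, 0);
  ret.val[1] = (int16x4_t) __builtin_aarch64_get_dregoiv4hi (__o, 1);
  return ret;
}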


Could you log an issue in bugzilla please, including some source code 
demonstrating the poor

codegen if possible?

Thanks,
Kyrill


#define __LD2R_FUNC(rettype, structtype, ptrtype, \
 regsuffix, funcsuffix, Q) \
   __extension__ static __inline rettype \
   __attribute__ ((__always_inline__))  \
   vld2 ## Q ## _dup_ ## funcsuffix (const ptrtype *ptr) \
   { \
 rettype result; \
  __asm__ ( \
  "ld2r {%0.4h, %1.4h}, %2"   \
  : "=V16"(result.val[0]), "=V17"(result.val[1]) \
  : "Q"(*(const structtype *)ptr) \
  : "memory", "v16", "v17"); \
 return result; \
   }

It needs a new register class covering v16 and v17 and constraints V16 & V17
for them. For this, aarch64.h, aarch64.c and constraints.md should be
modified.





Re: Conditional negation elimination in tree-ssa-phiopt.c

2014-08-18 Thread Kyrill Tkachov


On 18/08/14 10:19, Richard Earnshaw wrote:

On 14/08/14 09:45, Kyrill Tkachov wrote:

On 13/08/14 18:32, Segher Boessenkool wrote:

On Wed, Aug 13, 2014 at 03:57:31PM +0100, Richard Earnshaw wrote:

The problem with the frankenmonster patterns is that they tend to
proliferate into the machine description, and before you know where you
are the back-end is full of them.

Furthermore, they are very sensitive to the greedy first-match nature of
combine: a better, later, combination is missed because a less good,
earlier, optimization matched.  If the first insn in the sequence is
merged into an earlier instruction then you can end up with a junk
sequence that completely fails to simplify.  That ends up with
super-frankenmonster patterns to deal with all the subcases and the
problems grow exponentially from there.

Right.  Of course, combine should be fixed, yadda yadda.


I really do think that the best solution would be to try and catch this
during expand if possible and generate the right pattern from the start;
then you don't risk combine failing to come to the rescue after several
intermediate transformations have taken place.

I think ssa-phiopt should simply not do this obfuscation at all.  Without
it, RTL ifcvt picks it up just fine on targets with conditional assignment
instructions.  I agree on targets without expand should do a better job
(also for more generic conditional assignment).

That particular transformation was added to tree-ssa-phiopt.c for PR
45685, the problem it was trying to solve was a missed vectorisation
opportunity and transforming it made it into straightline code that was
more amenable to vectorisation, that's why I'm somewhat reluctant to
completely disable it.

Hmm... I noticed in the midend we guard some optimisations on
HAVE_conditional_move. Maybe we can guard this one on something like
!HAVE_conditional_negation ?


Can't we just guard it on HAVE_conditional_move?  With such an
instruction expand would then generate

t1 = -a
r =  ? b : t1

and combine will do the rest.


That was my first idea, but then it disables this transformation for
x86, for which it was added specifically to solve PR45685...

Kyrill


R.


Kyrill


Instruction selection belongs in RTL land.


Segher








Re: Conditional negation elimination in tree-ssa-phiopt.c

2014-08-14 Thread Kyrill Tkachov


On 13/08/14 18:32, Segher Boessenkool wrote:

On Wed, Aug 13, 2014 at 03:57:31PM +0100, Richard Earnshaw wrote:

The problem with the frankenmonster patterns is that they tend to
proliferate into the machine description, and before you know where you
are the back-end is full of them.

Furthermore, they are very sensitive to the greedy first-match nature of
combine: a better, later, combination is missed because a less good,
earlier, optimization matched.  If the first insn in the sequence is
merged into an earlier instruction then you can end up with a junk
sequence that completely fails to simplify.  That ends up with
super-frankenmonster patterns to deal with all the subcases and the
problems grow exponentially from there.

Right.  Of course, combine should be fixed, yadda yadda.


I really do think that the best solution would be to try and catch this
during expand if possible and generate the right pattern from the start;
then you don't risk combine failing to come to the rescue after several
intermediate transformations have taken place.

I think ssa-phiopt should simply not do this obfuscation at all.  Without
it, RTL ifcvt picks it up just fine on targets with conditional assignment
instructions.  I agree on targets without expand should do a better job
(also for more generic conditional assignment).


That particular transformation was added to tree-ssa-phiopt.c for PR 
45685, the problem it was trying to solve was a missed vectorisation 
opportunity and transforming it made it into straightline code that was 
more amenable to vectorisation, that's why I'm somewhat reluctant to 
completely disable it.


Hmm... I noticed in the midend we guard some optimisations on 
HAVE_conditional_move. Maybe we can guard this one on something like 
!HAVE_conditional_negation ?


Kyrill



Instruction selection belongs in RTL land.


Segher






Re: Conditional negation elimination in tree-ssa-phiopt.c

2014-08-12 Thread Kyrill Tkachov


On 12/08/14 16:16, Kyrill Tkachov wrote:

On 12/08/14 16:11, Kyrill Tkachov wrote:

On 12/08/14 15:22, Jeff Law wrote:

On 08/12/14 04:31, Kyrill Tkachov wrote:

On 12/08/14 10:39, Richard Biener wrote:

On Mon, Aug 11, 2014 at 9:56 PM, Jeff Law  wrote:

On 08/11/14 07:41, Kyrill Tkachov wrote:

I haven't been able to get combine to match the comparison+xor+neg+plus
RTL and it seems like it would be just a workaround to undo the
tree-level transformation.

Yea, it'd just be a workaround, but it's probably the easiest way to
deal
with this problem.  Can you describe in further detail why you
weren't able
to get this to work?

Too many instructions to combine I guess.  You might want to add
intermediate "combine" insn-and-splits.  If that's still a no-go then
read on.

My guess was too many insns as well..  But that's often solvable.


From the combine dump I can see that it tried to combine up to:
(set (reg/i:SI 0 x0)
(plus:SI (xor:SI (neg:SI (reg:SI 84 [ D.2565 ]))
(reg:SI 73 [ D.2564 ]))
(reg:SI 84 [ D.2565 ])))

And did it find a match for this?   What happens if (just for testing
purposes), you create a pattern for this?  Does combine then try
something even more complex, possibly getting your conditional negation?

I managed to get combine to recognise this pattern:
(set (match_operand:GPI 0 "register_operand" "=r")
   (plus:GPI (xor:GPI (neg:GPI (match_operand:GPI 1
"register_operand" "r"))
  (match_operand:GPI 2 "register_operand" "r"))
 (match_dup 1)))

Now what I need is for operand 1 to instead be a cc_reg comparison, but
when I change operand 1 to this pattern:

(set (match_operand:GPI 0 "register_operand" "=r")
   (plus:GPI (xor:GPI (neg:GPI (match_operator:GPI 1
"aarch64_comparison_operator"
   [(match_operand:CC 2 "cc_register" "")
(const_int 0)]))
  (match_operand:GPI 3 "register_operand" "r"))
 (match_dup 1)))

This doesn't match. Is there any way to express this in a combineable
pattern?

argh, Thunderbird enforced the 80 character limit...

The pattern that matched was:
(set (match_operand:GPI 0 "register_operand" "=r")
  (plus:GPI
(xor:GPI
  (neg:GPI (match_operand:GPI 1 "register_operand" "r"))
  (match_operand:GPI 2 "register_operand" "r"))
(match_dup 1)))


If I match that to a define_and_split and split it into two sets:
(set (reg1) (neg reg2))
(set (plus (xor (reg1) (reg3))
(reg2)))

and then add a define_insn with the full pattern:

(set (match_operand:GPI 0 "register_operand" "=r")
 (plus:GPI
   (xor:GPI
 (neg:GPI
   (match_operator:GPI 1 "aarch64_comparison_operator"
 [(match_operand:CC 2 "cc_register" "")
(const_int 0)]))
   (match_operand:GPI 3 "register_operand" "r"))
   (match_dup 1)))

Then this does manage to match the full thing that I want. But I had to 
add another define_split to break up the plus+xor Frankenstein's monster 
back into two separate insns for the cases where it picks up 
non-conditional-negate patterns.


Does that seem reasonable or too much of a hack? Any plus+xor rtxs that 
get left after combine should be split up again relatively quickly in 
split1 and shouldn't inhibit further optimisation too badly, no?



Kyrill


And the one that failed to combine (with operand 1 substituted for a
cc_reg comparison) was:

(set (match_operand:GPI 0 "register_operand" "=r")
  (plus:GPI
(xor:GPI
  (neg:GPI
(match_operator:GPI 1 "aarch64_comparison_operator"
  [(match_operand:CC_ZERO 2 "cc_register" "")
 (const_int 0)]))
(match_operand:GPI 3 "register_operand" "r"))
(match_dup 1)))


Thanks,
Kyrill



On the other hand, I did manage to write a peephole2 that detected the
sequence of compare+neg+xor+plus and transformed it into the
if_then_else form that our current conditional negation pattern has,
although I'm not sure how flexible this is.

Probably not very.  We really should be looking at combine.  In fact, I
would argue that we should be looking at combine regardless of whether
or not we twiddle expansion as humans or machine generated code could
look like this...

jeff












Re: Conditional negation elimination in tree-ssa-phiopt.c

2014-08-12 Thread Kyrill Tkachov


On 12/08/14 16:11, Kyrill Tkachov wrote:

On 12/08/14 15:22, Jeff Law wrote:

On 08/12/14 04:31, Kyrill Tkachov wrote:

On 12/08/14 10:39, Richard Biener wrote:

On Mon, Aug 11, 2014 at 9:56 PM, Jeff Law  wrote:

On 08/11/14 07:41, Kyrill Tkachov wrote:

I haven't been able to get combine to match the comparison+xor+neg+plus
RTL and it seems like it would be just a workaround to undo the
tree-level transformation.

Yea, it'd just be a workaround, but it's probably the easiest way to
deal
with this problem.  Can you describe in further detail why you
weren't able
to get this to work?

Too many instructions to combine I guess.  You might want to add
intermediate "combine" insn-and-splits.  If that's still a no-go then
read on.

My guess was too many insns as well..  But that's often solvable.


   From the combine dump I can see that it tried to combine up to:
(set (reg/i:SI 0 x0)
   (plus:SI (xor:SI (neg:SI (reg:SI 84 [ D.2565 ]))
   (reg:SI 73 [ D.2564 ]))
   (reg:SI 84 [ D.2565 ])))

And did it find a match for this?   What happens if (just for testing
purposes), you create a pattern for this?  Does combine then try
something even more complex, possibly getting your conditional negation?

I managed to get combine to recognise this pattern:
(set (match_operand:GPI 0 "register_operand" "=r")
  (plus:GPI (xor:GPI (neg:GPI (match_operand:GPI 1
"register_operand" "r"))
 (match_operand:GPI 2 "register_operand" "r"))
(match_dup 1)))

Now what I need is for operand 1 to instead be a cc_reg comparison, but
when I change operand 1 to this pattern:

(set (match_operand:GPI 0 "register_operand" "=r")
  (plus:GPI (xor:GPI (neg:GPI (match_operator:GPI 1
"aarch64_comparison_operator"
  [(match_operand:CC 2 "cc_register" "")
(const_int 0)]))
 (match_operand:GPI 3 "register_operand" "r"))
(match_dup 1)))

This doesn't match. Is there any way to express this in a combineable
pattern?


argh, Thunderbird enforced the 80 character limit...

The pattern that matched was:
(set (match_operand:GPI 0 "register_operand" "=r")
(plus:GPI
  (xor:GPI
(neg:GPI (match_operand:GPI 1 "register_operand" "r"))
(match_operand:GPI 2 "register_operand" "r"))
  (match_dup 1)))


And the one that failed to combine (with operand 1 substituted for a 
cc_reg comparison) was:


(set (match_operand:GPI 0 "register_operand" "=r")
(plus:GPI
  (xor:GPI
(neg:GPI
  (match_operator:GPI 1 "aarch64_comparison_operator"
[(match_operand:CC_ZERO 2 "cc_register" "")
   (const_int 0)]))
  (match_operand:GPI 3 "register_operand" "r"))
  (match_dup 1)))



Thanks,
Kyrill



On the other hand, I did manage to write a peephole2 that detected the
sequence of compare+neg+xor+plus and transformed it into the
if_then_else form that our current conditional negation pattern has,
although I'm not sure how flexible this is.

Probably not very.  We really should be looking at combine.  In fact, I
would argue that we should be looking at combine regardless of whether
or not we twiddle expansion as humans or machine generated code could
look like this...

jeff










Re: Conditional negation elimination in tree-ssa-phiopt.c

2014-08-12 Thread Kyrill Tkachov


On 12/08/14 15:22, Jeff Law wrote:

On 08/12/14 04:31, Kyrill Tkachov wrote:

On 12/08/14 10:39, Richard Biener wrote:

On Mon, Aug 11, 2014 at 9:56 PM, Jeff Law  wrote:

On 08/11/14 07:41, Kyrill Tkachov wrote:

I haven't been able to get combine to match the comparison+xor+neg+plus
RTL and it seems like it would be just a workaround to undo the
tree-level transformation.

Yea, it'd just be a workaround, but it's probably the easiest way to
deal
with this problem.  Can you describe in further detail why you
weren't able
to get this to work?

Too many instructions to combine I guess.  You might want to add
intermediate "combine" insn-and-splits.  If that's still a no-go then
read on.

My guess was too many insns as well..  But that's often solvable.


  From the combine dump I can see that it tried to combine up to:
(set (reg/i:SI 0 x0)
  (plus:SI (xor:SI (neg:SI (reg:SI 84 [ D.2565 ]))
  (reg:SI 73 [ D.2564 ]))
  (reg:SI 84 [ D.2565 ])))

And did it find a match for this?   What happens if (just for testing
purposes), you create a pattern for this?  Does combine then try
something even more complex, possibly getting your conditional negation?


I managed to get combine to recognise this pattern:
(set (match_operand:GPI 0 "register_operand" "=r")
(plus:GPI (xor:GPI (neg:GPI (match_operand:GPI 1 
"register_operand" "r"))

   (match_operand:GPI 2 "register_operand" "r"))
  (match_dup 1)))

Now what I need is for operand 1 to instead be a cc_reg comparison, but 
when I change operand 1 to this pattern:


(set (match_operand:GPI 0 "register_operand" "=r")
(plus:GPI (xor:GPI (neg:GPI (match_operator:GPI 1 
"aarch64_comparison_operator"
[(match_operand:CC 2 "cc_register" "") 
(const_int 0)]))

   (match_operand:GPI 3 "register_operand" "r"))
  (match_dup 1)))

This doesn't match. Is there any way to express this in a combineable 
pattern?


Thanks,
Kyrill






On the other hand, I did manage to write a peephole2 that detected the
sequence of compare+neg+xor+plus and transformed it into the
if_then_else form that our current conditional negation pattern has,
although I'm not sure how flexible this is.

Probably not very.  We really should be looking at combine.  In fact, I
would argue that we should be looking at combine regardless of whether
or not we twiddle expansion as humans or machine generated code could
look like this...

jeff







Re: Conditional negation elimination in tree-ssa-phiopt.c

2014-08-12 Thread Kyrill Tkachov


On 12/08/14 10:39, Richard Biener wrote:

On Mon, Aug 11, 2014 at 9:56 PM, Jeff Law  wrote:

On 08/11/14 07:41, Kyrill Tkachov wrote:


I haven't been able to get combine to match the comparison+xor+neg+plus
RTL and it seems like it would be just a workaround to undo the
tree-level transformation.

Yea, it'd just be a workaround, but it's probably the easiest way to deal
with this problem.  Can you describe in further detail why you weren't able
to get this to work?

Too many instructions to combine I guess.  You might want to add
intermediate "combine" insn-and-splits.  If that's still a no-go then
read on.


From the combine dump I can see that it tried to combine up to:
(set (reg/i:SI 0 x0)
(plus:SI (xor:SI (neg:SI (reg:SI 84 [ D.2565 ]))
(reg:SI 73 [ D.2564 ]))
(reg:SI 84 [ D.2565 ])))

What I need is for that reg 84 to be the result of the comparison, 
something like:

(ne (cc_reg) (const_int 0)) which I couldn't get combine to shove in there.


On the other hand, I did manage to write a peephole2 that detected the 
sequence of compare+neg+xor+plus and transformed it into the 
if_then_else form that our current conditional negation pattern has, 
although I'm not sure how flexible this is.




OTOH a suitable place to "undo" would be smarter RTL expansion
that detects this pattern (and hope for TER to still apply - thus
no CSE opportunities going in your way).  For your testcase TER
allows

r_5 replace with --> r_5 = (int) _4;

_8 replace with --> _8 = _4 != 0;

_10 replace with --> _10 = -_9;

_11 replace with --> _11 = _10 ^ r_5;

_12 replace with --> _12 = _11 + _9;

which unfortunately already "breaks" because of the multi-use _9
and _4.

Now - the way out here is a GIMPLE pass right before expansion
that performs pattern detection and generates target specific code suitable
for optimal expand.  Fortunately we already have that with
pass_optimize_widening_mul (which does widen-mul and fma detection).


Maybe we can piggyback on this pass then? (probably renaming it to 
something more generic in the process...)




This is the place where you'd deal with such pattern, generating
a suitable compound operation (or two, dependent on expansion
and insn pattern details).  This of course usually requires new
tree codes or internal functions.  For more arcane stuff like
your conditional negate I'd prefer an internal function.


What is an internal function in this context? Is that like a 
target-specific builtin?




Of course you then have to add an optab or a target hook to
do custom internal function expansion.  I suppose for internal
functions a target hook would be nicer than a generic optab?


So something like TARGET_EXPAND_CONDITIONAL_NEGATION that gets called 
from the aforementioned GIMPLE pass when it finds a an appropriate 
sequence of compare+neg+xor+plus ?


Thanks for the pointers,
Kyrill



Thanks,
Richard.




What is the most acceptable way of disabling this transformation for a
target that has a conditional negation instruction?

In general, we don't want target dependencies in the gimple/ssa optimizers.

Jeff





Conditional negation elimination in tree-ssa-phiopt.c

2014-08-11 Thread Kyrill Tkachov

Hi all,

The aarch64 target has a conditional negation instruction
CSNEG Rd, Rs1, Rs2, cond

with semantics Rd = if cond then Rs1 else -Rs2.

This, however doesn't get end up getting matched for code such as:
int
foo2 (unsigned a, unsigned b)
{
  int r = 0;
  r = a & b;
  if (a & b)
return -r;
  return r;
}

because the code in tree-ssa-phiopt.c transforms conditional negation 
into (rhs ^ -cond) + cond. The transformation is guarded by this:


  /* The replacement of conditional negation with a non-branching
 sequence is really only a win when optimizing for speed and we
 can avoid transformations by gimple if-conversion that result
 in poor RTL generation.

 Ideally either gimple if-conversion or the RTL expanders will
 be improved and the code to emit branchless conditional negation
 can be removed.  */
  bool replace_conditional_negation = false;
  if (!do_store_elim)
replace_conditional_negation
  = ((!optimize_size && optimize >= 2)
 || (((flag_tree_loop_vectorize || cfun->has_force_vectorize_loops)
  && flag_tree_loop_if_convert != 0)
 || flag_tree_loop_if_convert == 1
 || flag_tree_loop_if_convert_stores == 1));

I haven't been able to get combine to match the comparison+xor+neg+plus 
RTL and it seems like it would be just a workaround to undo the 
tree-level transformation.
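
(For reference, the phiopt rewrite is just the usual two's-complement
identity; in plain C, illustrative only:)

/* cond is 0 or 1, so -cond is 0 or all-ones and
   (r ^ -cond) + cond == cond ? -r : r.  */
int
branchless_cond_neg (int r, int cond)
{
  return (r ^ -cond) + cond;
}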


What is the most acceptable way of disabling this transformation for a 
target that has a conditional negation instruction?


Thanks,
Kyrill



Re: [ARM] Is TARGET_UNIFIED_ASM still needed?

2014-07-23 Thread Kyrill Tkachov


On 23/07/14 09:59, Kyrill Tkachov wrote:

On 23/07/14 09:55, Richard Earnshaw wrote:

On 22/07/14 16:23, Ramana Radhakrishnan wrote:

On 22/07/14 14:14, Kyrill Tkachov wrote:

Hi all,

In the arm backend we've got this TARGET_UNIFIED_ASM macro that is
currently on for TARGET_THUMB2 with a comment that says:
/* We could use unified syntax for arm mode, but for now we just use it
   for Thumb-2.  */

I've been doing some work converting the pre-UAL floating point
mnemonics to unified syntax and it seems if we were to strictly adhere
to this TARGET_UNIFIED_ASM I would have to guard those changes, which
would be somewhat ugly.

I would just change vfp.md to UAL and expect it to work because GAS
accepts unified syntax for the floating point instructions even without
.syntax unified.

We need T_U_A until the point of time that the Thumb1 port is converted
to UAL, GAS validated against Thumb1 and the rest of the "arm" port is
converted to UAL and verified with GAS.

Additionally if someone were to do the full transition, remember that
users need to have a way of mixing non-unified syntax in inline
assembler with unified syntax in the rest of the C code.


regards
Ramana


Is it perhaps time to just drop this and assume unified asm everywhere?


Kyrill





We also need to be able to support User's inline assembly that is not in
unified syntax.  Though that might be a different issue to the one
you're trying to address here.

Thanks for the responses,
I was just thinking that the TARGET_UNIFIED_ASM macro is not honored
right now anyway (due to the pre-UAL mnemonics in vfp.md) so we might
want to get rid of it. I don't think this would be a user-visible
change, now would it reject pre-UAL inline assembly from what I can see
in its uses.


s/now would it/nor would it/



Kyrill


R.








Re: [ARM] Is TARGET_UNIFIED_ASM still needed?

2014-07-23 Thread Kyrill Tkachov


On 23/07/14 09:55, Richard Earnshaw wrote:

On 22/07/14 16:23, Ramana Radhakrishnan wrote:


On 22/07/14 14:14, Kyrill Tkachov wrote:

Hi all,

In the arm backend we've got this TARGET_UNIFIED_ASM macro that is
currently on for TARGET_THUMB2 with a comment that says:
/* We could use unified syntax for arm mode, but for now we just use it
  for Thumb-2.  */

I've been doing some work converting the pre-UAL floating point
mnemonics to unified syntax and it seems if we were to strictly adhere
to this TARGET_UNIFIED_ASM I would have to guard those changes, which
would be somewhat ugly.

I would just change vfp.md to UAL and expect it to work because GAS
accepts unified syntax for the floating point instructions even without
.syntax unified.

We need T_U_A until the point of time that the Thumb1 port is converted
to UAL, GAS validated against Thumb1 and the rest of the "arm" port is
converted to UAL and verified with GAS.

Additionally if someone were to do the full transition, remember that
users need to have a way of mixing non-unified syntax in inline
assembler with unified syntax in the rest of the C code.


regards
Ramana


Is it perhaps time to just drop this and assume unified asm everywhere?


Kyrill






We also need to be able to support User's inline assembly that is not in
unified syntax.  Though that might be a different issue to the one
you're trying to address here.


Thanks for the responses,
I was just thinking that the TARGET_UNIFIED_ASM macro is not honored 
right now anyway (due to the pre-UAL mnemonics in vfp.md) so we might 
want to get rid of it. I don't think this would be a user-visible 
change, now would it reject pre-UAL inline assembly from what I can see 
in its uses.


Kyrill



R.





[ARM] Is TARGET_UNIFIED_ASM still needed?

2014-07-22 Thread Kyrill Tkachov

Hi all,

In the arm backend we've got this TARGET_UNIFIED_ASM macro that is 
currently on for TARGET_THUMB2 with a comment that says:

/* We could use unified syntax for arm mode, but for now we just use it
   for Thumb-2.  */

I've been doing some work converting the pre-UAL floating point 
mnemonics to unified syntax and it seems if we were to strictly adhere 
to this TARGET_UNIFIED_ASM I would have to guard those changes, which 
would be somewhat ugly.


Is it perhaps time to just drop this and assume unified asm everywhere?


Kyrill



Re: Question for ARM person re asm_fprintf

2014-07-22 Thread Kyrill Tkachov

Hi David,

On 22/07/14 02:46, David Wohlferd wrote:

I have been looking at asm_fprintf in final.c, and I think there's a
design flaw.  But since the change affects ARM and since I have no
access to an ARM system, I need a second opinion.

asm_fprintf allows platforms to add support for new format specifiers by
using the ASM_FPRINTF_EXTENSIONS macro.  ARM uses this to add support
for %@ and %r specifiers.  Pretty straight-forward.

However, it isn't enough to add these two items to the case statement in
asm_fprintf.  Over in c-format.c, there is compile-time checking that is
done against calls to asm_fprintf to validate the format string.  %@ and
%r have been added to this checking (see asm_fprintf_char_table), but
NOT in a platform-specific way.

This means that using %r or %@ will successfully pass the format
checking on all platforms, but will ICE on non-ARM platforms since there
are no case statements in asm_fprintf to support them.

Compiling the code in asm_fprintf-1.c (see the patch) with this patch
correctly reports "unknown conversion type character" for both 'r' and
'@' in  x86_64-pc-cygwin.  It would be helpful if someone could confirm
that it still compiles without error under ARM after applying this
patch.  I'm reluctant to post this to gcc-patches when it has never been
run.


I've tested the asm_fprintf-1.c test on an arm-none-eabi cross compiler
with this patch, which I think should be enough for this patch's
purposes. I've confirmed that the generated assembly contains the:

.ascii  "%@%r\000"

string so I'd suggest you go ahead and post it to gcc-patches.
This is, of course, not a full regression suite run and I'm not sure if 
there's anything else in the testsuite that would exercise this bit of 
code, but I can kick off a run for that later if you'd like.


Kyrill

dw





Some __builtin_round and cast subtleties

2014-07-04 Thread Kyrill Tkachov

Hi all,

Consider code:
long int
foo (double a)
{
   return __builtin_round (a);
}

Compiling for aarch64-none-elf (bare-metal aarch64 with newlib as 
C-library) with -O2 gives the 003t.original dump:


;; Function foo (null)
;; enabled by -tree-original


{
  return (long int) __builtin_round (a);
}


whereas compiling for aarch64-none-linux-gnu (linux target with glibc) 
gets translated into:


;; Function foo (null)
;; enabled by -tree-original


{
  return __builtin_lround (a);
}

These end up taking different codepaths through the compiler because
__builtin_lround has to take -fmath-errno into account and does not end
up getting inlined (generating a call to the library lround).


__builtin_round, however, is defined everywhere and ends up getting 
expanded to a


rounddf optab + (set r:DI (fix:DI (r:DF)))

which then later gets combined into the expansion we've got for the 
lrounddfdi2 optab


Is that correct/expected behaviour?
I tried grepping around the gcc sources but I'm not familiar with code 
that would do the frontend transformation mentioned above.


Thanks,
Kyrill



Re: TARGET_MACRO_FUSION_PAIR for something besides compare-and-branch ?

2014-05-28 Thread Kyrill Tkachov


On 28/05/14 17:36, Alexander Monakov wrote:


On Wed, 28 May 2014, Kyrill Tkachov wrote:


Hi all,

The documentation for TARGET_MACRO_FUSION_PAIR says that it can be used to
tell the scheduler that two insns should not be scheduled apart. It doesn't
specify what kinds of insns those can be.

Yet from what I can see in sched-deps.c it can only be used on compares and
conditional branches, as implemented in i386.

Please note that it's not only restricted to conditional branches, but also to
keeping the instructions together if they were consecutive in the first place
(i.e. it does not try to move a compare insn closer to the branch).

Doing it that way allowed to solve the issue at hand at that time without a
separate scan of the whole RTL instruction stream.
  

Say I want to specify two other types of instruction that I want to force
together, would it be worth generalising the TARGET_MACRO_FUSION_PAIR usage
to achieve that?

Hi Alexander,
Thanks for the insight,


I'd say yes, but that would be the least of the problems; the more important
question is how to trigger the hook (you probably want to integrate it into
the existing scheduler dependencies evaluation loop rather than adding a new
loop just to discover macro-fusable pairs).


Yeah, I was afraid that would be the case.


  You'll also have to invent
something new if you want to move non-consecutive fusable insns together if
they are apart.


Seems to me that TARGET_SCHED_REORDER would be a better fit for that.

Kyrill



HTH.
Alexander






TARGET_MACRO_FUSION_PAIR for something besides compare-and-branch ?

2014-05-28 Thread Kyrill Tkachov

Hi all,

The documentation for TARGET_MACRO_FUSION_PAIR says that it can be used 
to tell the scheduler that two insns should not be scheduled apart. It 
doesn't specify what kinds of insns those can be.


Yet from what I can see in sched-deps.c it can only be used on compares 
and conditional branches,

as implemented in i386.
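
(For reference, a back-end implementation of that hook looks roughly like the
following sketch -- not the actual i386 code:)

/* Sketch only: keep a flag-setting compare and the conditional branch
   that consumes it adjacent, so the hardware can macro-fuse them.  */
static bool
example_macro_fusion_pair_p (rtx prev, rtx curr)
{
  if (!any_condjump_p (curr))
    return false;
  rtx set = single_set (prev);
  return set != NULL_RTX && GET_CODE (SET_SRC (set)) == COMPARE;
}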

Say I want to specify two other types of instruction that I want to
force together, would it be worth generalising the TARGET_MACRO_FUSION_PAIR
usage to achieve that?

Thanks,
Kyrill



Re: Live range shrinkage in pre-reload scheduling

2014-05-16 Thread Kyrill Tkachov

On 15/05/14 09:52, Ramana Radhakrishnan wrote:

On Thu, May 15, 2014 at 8:36 AM, Maxim Kuvyrkov
 wrote:

On May 15, 2014, at 6:46 PM, Ramana Radhakrishnan  
wrote:

I'm not claiming it's a great heuristic or anything.  There's bound to
be room for improvement.  But it was based on "reality" and real results.

Of course, if it turns out not be a win for ARM or s390x any more then it
should be disabled.

The current situation that Kyrill is investigating is a case where we
notice the first scheduler pass being a bit too aggressive with
creating ILP opportunities with the A15 scheduler that causes
performance differences with not turning on the first scheduler pass
vs using the defaults.

Charles has a work-in-progress patch that fixes a bug in SCHED_PRESSURE_MODEL 
that causes the above symptoms.  The bug causes 1st scheduler to unnecessarily 
increase live ranges of pseudo registers when there are a lot of instructions 
in the ready list.

Is this something that you've seen shows up in general integer code as
well ? Do you or Charles have an example for us to look at ? I'm not
sure what "lot of instructions in the ready list" really means here.
The specific case Kyrill's been looking into is Dhrystone Proc_8 when
tuned for a Cortex-A15 with neon and float-abi=hard but I am not sure
if that has "too many instructions" :) .

Kyrill, could you also look into the other cases we have from SPEC2k
where we see this as well and come back with any specific testcases
that Charles / Richard could also take a look into.

Hi all,

From what I can see the most significant regression from this pre-regalloc 
scheduling on SPEC2k is in 171.swim. It seems to suffer from similar symptoms to 
Proc_8 (lots of extra spills on the stack)


Looking forward to the patch :). Let me know if I can help with any 
testing/validation.


Kyrill

Charles, can you finish your patch in the next several days and post it for 
review?

I think we'll await this but we'll go look into some of the benchmarks.

Ramana


Thank you,

--
Maxim Kuvyrkov
www.linaro.org







Live range shrinkage in pre-reload scheduling

2014-05-13 Thread Kyrill Tkachov

Hi all,

In haifa-sched.c (in rank_for_schedule) I notice that live range shrinkage is 
not performed when SCHED_PRESSURE_MODEL is used and the comment mentions that it 
results in much worse code.


Could anyone elaborate on this? Was it just empirically noticed on x86_64?

Thanks,
Kyrill



Re: status of wide-int patch.

2014-04-29 Thread Kyrill Tkachov

On 28/04/14 18:03, Kenneth Zadeck wrote:

At this point we believe that we have addressed all of the concerns
that the community has made about the wide-int branch.   We have also
had each of the sections of the branch approved by the area maintainers.

We are awaiting a clean build on the arm


Unfortunately arm bootstrap fails a bit further down the line in stage2 while 
building libstdc++-v3/src/c++98/ios.cc:


xgcc: internal compiler error: Segmentation fault (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.

Running the cc1plus subcommand through gdb gives:
Program received signal SIGSEGV, Segmentation fault.
0x005c32c8 in real_to_decimal_for_mode(char*, real_value const*, unsigned int, 
unsigned int, int, machine_mode) ()

(gdb) bt
#0  0x005c32c8 in real_to_decimal_for_mode(char*, real_value const*, unsigned 
int, unsigned int, int, machine_mode) ()

#1  0x33f6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

The debug info seems to be scarce here, any hints on where to look?

Thanks,
Kyrill

P.S. The aarch64 build and testsuite run looks fine.



Re: [buildrobot] ARM: Missing initializers for Cortex A8

2014-04-25 Thread Kyrill Tkachov

On 24/04/14 22:07, Jan-Benedict Glaw wrote:

Hi!

Seems the new cost model for Cortex A8 is missing two initializer
fields:

g++ -c   -g -O2 -DIN_GCC  -DCROSS_DIRECTORY_STRUCTURE  -fno-exceptions 
-fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings 
-Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual -pedantic 
-Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common 
 -DHAVE_CONFIG_H -I. -I. -I../../../gcc/gcc -I../../../gcc/gcc/. 
-I../../../gcc/gcc/../include -I../../../gcc/gcc/../libcpp/include 
-I/opt/cfarm/mpc/include  -I../../../gcc/gcc/../libdecnumber 
-I../../../gcc/gcc/../libdecnumber/dpd -I../libdecnumber 
-I../../../gcc/gcc/../libbacktrace-o arm.o -MT arm.o -MMD -MP -MF 
./.deps/arm.TPo ../../../gcc/gcc/config/arm/arm.c
../../../gcc/gcc/config/arm/arm.c:1714:1: error: missing initializer for member 
‘tune_params::disparage_flag_setting_t16_encodings’ 
[-Werror=missing-field-initializers]
  };
  ^
../../../gcc/gcc/config/arm/arm.c:1714:1: error: missing initializer for member 
‘tune_params::disparage_partial_flag_setting_t16_encodings’ 
[-Werror=missing-field-initializers]
cc1plus: all warnings being treated as errors
make[2]: *** [arm.o] Error 1


I'll fix it up. The problem is the new Cortex-A8 tuning struct rather than the 
cost table itself. The new fields were added very recently and the Cortex-A8 
patch was developed before that.
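
(For anyone hitting the same warning, a trimmed illustration -- not the real
tune_params -- of what -Werror=missing-field-initializers complains about and
what the fix looks like:)

#include <stdbool.h>

struct tune_example
{
  int prefetch_distance;
  bool disparage_flag_setting_t16_encodings;
  bool disparage_partial_flag_setting_t16_encodings;
};

/* Warns: the two trailing members are left out of the initializer.  */
static const struct tune_example a8_tune_bad = { 4 };

/* Clean: every member is given an explicit value.  */
static const struct tune_example a8_tune_ok = { 4, false, false };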


Thanks for reporting.

Kyrill




(Cf. http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=207555)

MfG, JBG






Re: add_branch_dependences in sched-rgn.c

2014-04-10 Thread Kyrill Tkachov

On 10/04/14 02:50, Maxim Kuvyrkov wrote:

On Apr 9, 2014, at 4:15 AM, Kyrill Tkachov  wrote:


Hi all,

I'm looking at some curious pre-reload scheduling behaviour and I noticed this:

At the add_branch_dependences function sched-rgn.c there is a comment that says 
"branches, calls, uses, clobbers, cc0 setters, and instructions that can throw 
exceptions" should be scheduled at the end of the basic block.

However right below it the code that detects this kind of insns seems to only 
look for these insns that are directly adjacent to the end of the block 
(implemented with a while loop that ends as soon as the current insn is not one 
of the aforementioned).

Shouldn't the code look through the whole basic block, gather all of the 
branches, clobbers etc. and schedule them at the end?


Not really.  The instruction sequences mentioned in the comment end a basic block by
definition -- if there is a jump or other "special" sequence, then the basic block
can't continue beyond that as control may be transferred to something other than the
next instruction.


Makes sense for things like branches, calls and potential exception-throwers, 
but how can clobbers and uses change the control flow?


Thanks for the help,
Kyrill



   add_branch_dependences() makes sure that the scheduler does not "accidentally" place
something after those "special" sequences, thus creating a corrupted basic block.

--
Maxim Kuvyrkov
www.linaro.org








add_branch_dependences in sched-rgn.c

2014-04-08 Thread Kyrill Tkachov

Hi all,

I'm looking at some curious pre-reload scheduling behaviour and I noticed this:

At the add_branch_dependences function sched-rgn.c there is a comment that says 
"branches, calls, uses, clobbers, cc0 setters, and instructions that can throw 
exceptions" should be scheduled at the end of the basic block.


However right below it the code that detects this kind of insns seems to only 
look for these insns that are directly adjacent to the end of the block 
(implemented with a while loop that ends as soon as the current insn is not one 
of the aforementioned).


Shouldn't the code look through the whole basic block, gather all of the 
branches, clobbers etc. and schedule them at the end?



Any ideas?

Thanks,
Kyrill



Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread Kyrill Tkachov

On 27/02/14 11:07, Andrew Haley wrote:

Over the years there has been a great deal of traffic on these lists
caused by misunderstandings of GCC's inline assembler.  That's partly
because it's inherently tricky, but the existing documentation needs
to be improved.

dw  has done a fairly thorough reworking of
the documentation.  I've helped a bit.

Section 6.41 of the GCC manual has been rewritten.  It has become:

6.41 How to Use Inline Assembly Language in C Code
6.41.1 Basic Asm - Assembler Instructions with No Operands
6.41.2 Extended Asm - Assembler Instructions with C Expression Operands

We could simply post the patch to GCC-patches and have at it, but I
think it's better to discuss the document here first.  You can read it
at

http://www.LimeGreenSocks.com/gcc/Basic-Asm.html
http://www.LimeGreenSocks.com/gcc/Extended-Asm.html
http://www.LimeGreenSocks.com/gcc/extend04.zip (contains .texi, .patch,
and affected html pages)

All comments are very welcome.

Hi Andrew, dw,

Thanks for doing this!

In the Extended Asm documentation: Other format strings section:
"'%=' outputs a number that is unique to each instruction in the entire 
compilation."


I find the term 'instruction' to be confusing here. From what I understand the 
number is unique to each asm statement, which may contain multiple assembly 
instructions. IMHO it would be clearer to say "unique to each asm statement"
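
A small illustrative example (not from the proposed docs) of why the wording
matters: the asm statement below contains several instructions but only one
%= number, and each expansion of the macro gets its own number, so the local
label never clashes.

/* Clamp x to at most 1; "cc" is clobbered by the cmp.  */
#define CLAMP_TO_ONE(x)                        \
  __asm__ ("cmp   %0, #1\n\t"                  \
           "ble   .Ldone%=\n\t"                \
           "mov   %0, #1\n"                    \
           ".Ldone%=:"                         \
           : "+r" (x) : : "cc")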


Kyrill




Andrew.






libatomic Makefile unconditionally sets -march=armv7-a when configuring with ifunc support

2014-01-16 Thread Kyrill Tkachov

Hi Richard,

I noticed that Makefile.in in libatomic sets -march=armv7-a when compiling for 
arm linux targets with ifunc support:


@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@IFUNC_OPTIONS = -march=armv7-a 
-DHAVE_KERNEL64

Is there any particular reason why it must do that?
It seems that if we're trying to build a multilib for armv8-a, for example, this 
would override any architecture options, and even conflict with any -mcpu 
options that we might specify elsewhere.


Would there be any fallout I'm not seeing from removing -march=armv7-a from 
there?

Thanks,
Kyrill



Truncate optimisation question

2013-12-03 Thread Kyrill Tkachov

Hi all,

I'm investigating a testsuite failure on arm: gcc.target/arm/unsigned-extend-1.c

For code:

unsigned char foo (unsigned char c)
{
  return (c >= '0') && (c <= '9');
}

we end up generating:

sub r0, r0, #48
uxtbr0, r0
cmp r0, #9
movhi   r0, #0
movls   r0, #1
bx  lr

The extra uxtb (extend) is causing the test to fail. We started generating the
extra extend when a particular optimisation went in (revision r191928).


The comment in simplify-rtx.c says it transforms
(truncate:SI (op:DI (x:DI) (y:DI)))

into

(op:SI (truncate:SI (x:DI)) (truncate:SI (y:DI)))

but from what I can see it also transforms

(truncate:QI (op:SI (x:SI) (y:SI)))

into

(op:QI (truncate:QI (x:SI)) (truncate:QI (y:SI)))
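
(The rule itself is just the statement that truncation distributes over
wrapping arithmetic; a quick C illustration, not compiler code:)

/* For +, - and * the low bits of the result depend only on the low bits
   of the operands, which is what lets simplify-rtx push the truncate in.  */
unsigned char
sum_low_byte (unsigned int x, unsigned int y)
{
  unsigned char a = (unsigned char) (x + y);
  unsigned char b = (unsigned char) ((unsigned char) x + (unsigned char) y);
  return a == b;   /* always 1 */
}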

From the combine dump I see that the sub and extend operations come from the 
RTL:

(insn 6 3 7 2 (set (reg:SI 116)
(plus:SI (reg:SI 0 r0 [ c ])
(const_int -48 [0xffd0])))

(insn 7 6 8 2 (set (reg:SI 117)
(zero_extend:SI (subreg:QI (reg:SI 116) 0)))


If I add a QImode compare pattern to the arm backend it gets matched and the 
extra extend goes away, but it seems to me that that's not the correct solution. 
Ideally, if a QImode operation is performed as an SImode operation on a target 
(like the sub and compare operations on arm) then we should not be doing this 
optimisation?


My question is, how does one express that information in the simplify-rtx.c 
code?
It seems that the PR that optimisation fixed (54457) only cared about DI -> SI 
truncations, so perhaps we should disable it for conversions between other modes 
where it's not beneficial altogether?


Thanks,
Kyrill



Re: Compilation flags in libgfortran

2013-10-16 Thread Kyrill Tkachov

On 16/10/13 10:37, pins...@gmail.com wrote:

On Oct 15, 2013, at 6:58 AM, Igor Zamyatin  wrote:
Hi All!

Is there any particular reason that matmul* modules from libgfortran
are compiled with -O2 -ftree-vectorize?

I see some regressions on Atom processor after r202980
(http://gcc.gnu.org/ml/gcc-cvs/2013-09/msg00846.html)

Why not just use O3 for those modules?

-O3 and -O2 -ftree-vectorize won't give much performance difference.  What you
are seeing is that the cost model needs improvement; at least for atom.

Hi all,
I think http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01908.html introduced the 
new "cheap" vectoriser cost model that favors compilation time over runtime 
performance and is set as default for -O2. -O3 uses the "dynamic" model which 
potentially gives better runtime performance in exchange for longer compile 
times (if I understand the new rules correctly).

Therefore, I'd expect -O3 to give a better vector performance than -O2...

Kyrill