Re: FDO and LTO on ARM

2011-08-08 Thread Jan Hubicka
> On Fri, Aug 5, 2011 at 3:24 PM, Jan Hubicka  wrote:
> >> >
> >> > In a way I like the current scheme since it is simple and extending it
> >> > should IMO have some good reason. We could refine -Os behaviour without
> >> > changing current predicates to optimize for speed in
> >> > a) functions declared as "hot" by user and BBs in them that are not 
> >> > proved
> >> > cold.
> >> > b) based on profile feedback - i.e. we could have two thresholds, BBs 
> >> > with
> >> > very arge counts wil be probably hot, BBs in between will be maybe
> >> > hot/normal and BBs with low counts will be cold.
> >> > This would probably motivate introduction of probably_hot predicate that
> >> > summarize the above.
> >>
> >> Introducing a new 'probably_hot' will be very confusing -- unless you
> >> also rename 'maybe_hot', but this leads to finer grained control:
> >> very_hot, hot, normal, cold, unlikely which can be hard to use.  The
> >> three state partition (not counting exec_once) seems ok, but
> >
> > OK, I also preffer to have fewer stages than more ;)
> >>
> >> 1) the unlikely state does not have controllable parameter
> >
> > Well, it is defined as something that is not likely to be executed, so the 
> > requirement
> > on count to be less than 1/(number_of_test_runs*2) is very natural and 
> > don't seem
> > to need to be tuned.
> 
> Ok, so it is defined to be different from 'rarely-executed' case.
> However rarely-executed seems more general and can perhaps be used in
> place of unlikely case. If there are situation that applies only to
> 'unlikely', they can be split apart.

So you thing of having hot (as for optimize for speed), cold (as for optimize
for size) and rarely executed (as for optimize very heavily for size)?
(as a replacement of current hot=speed/cold=size scheme)

It may not be completely crazy - i.e. at least kernel people tends to call
for something that is like -Os but not doing extreme tradeoffs (like not
expanding simple division by constant sequences or doing similar things that
hurts performance a lot and usually save just small amount of code).

I however wonder how large portions of program can be safely characterized as
rarely executed that are not unlikely. I.e. my intuition would be that it is
relatively small portion of program since code tends to be either dead or used
resonably often.

BTW The original motivation for "unlikely" was the function splitting pass, so
the functions put into unlikely section are having good chance to be never
touched in program execution and thus never mapped in.

It is in fact the only place it seems to be used in till today...

Honza
> 
> David


Re: FDO and LTO on ARM

2011-08-08 Thread Xinliang David Li
On Fri, Aug 5, 2011 at 3:24 PM, Jan Hubicka  wrote:
>> >
>> > In a way I like the current scheme since it is simple and extending it
>> > should IMO have some good reason. We could refine -Os behaviour without
>> > changing current predicates to optimize for speed in
>> > a) functions declared as "hot" by user and BBs in them that are not proved
>> > cold.
>> > b) based on profile feedback - i.e. we could have two thresholds, BBs with
>> > very arge counts wil be probably hot, BBs in between will be maybe
>> > hot/normal and BBs with low counts will be cold.
>> > This would probably motivate introduction of probably_hot predicate that
>> > summarize the above.
>>
>> Introducing a new 'probably_hot' will be very confusing -- unless you
>> also rename 'maybe_hot', but this leads to finer grained control:
>> very_hot, hot, normal, cold, unlikely which can be hard to use.  The
>> three state partition (not counting exec_once) seems ok, but
>
> OK, I also preffer to have fewer stages than more ;)
>>
>> 1) the unlikely state does not have controllable parameter
>
> Well, it is defined as something that is not likely to be executed, so the 
> requirement
> on count to be less than 1/(number_of_test_runs*2) is very natural and don't 
> seem
> to need to be tuned.

Ok, so it is defined to be different from 'rarely-executed' case.
However rarely-executed seems more general and can perhaps be used in
place of unlikely case. If there are situation that applies only to
'unlikely', they can be split apart.

David


Re: PATCH RFA: Build stages 2 and 3 with C++

2011-08-08 Thread Romain Geissler

Le 8 août 2011 à 20:49, Ian Lance Taylor a écrit :
> 
> However, since we currently permit plugins to call anything in gcc, I
> think the answer is going to have to be that plugins which do that
> should be compiled with C++.

Ok, i'll move to C++ then, until a dedicated C plugin API comes out.

> I don't think that adding extern "C" to
> all gcc header files is the right approach.  Adding extern "C" to a few
> selected header files seems fine.

Adding extern "C" to a small set of files doesn't make sense to me. When
working with real-world plugin, you will certainly end up calling many
different gcc primitives coming from a wide bunch of header files (which
for most of them don't use extern "C" for now). This hack should be applied
to all of the plugin visible files, or none, but not to just a few when someone
needs it.


Re: Question about SMS scheduling windows

2011-08-08 Thread Ayal Zaks
2011/8/8 Richard Sandiford 
>
> Ayal Zaks  writes:
> >> OK.  For the follow-on iv patch, it seemed easier to keep both bounds
> >> inclusive for the loop, then make the "end" exclusive when setting the
> >> out parameters.  (Note that there shouldn't be any problem with overflow
> >> when making the bound exclusive, because the size of the range has been
> >> restricted to "ii" by that stage.  The follow-on patch does use
> >> saturating maths for all ops though.)
> >
> > Sure, no problem having "end" inclusive, as it simplifies things. We
> > can even keep it inclusive all the way through, iterating over: for (c
> > = start; c != end+step; c += step)...
>
> Yeah, I'd prefer that too.  I might do it once Revital's and
> Roman's changes are in.
>
> >> I don't have powerpc hardware that I can do meaningful performance
> >> testing on, but I did run it through a Popular* Embedded Benchmark
> >> on an ARM Cortex-A8 board with -O3 -fmodulo-sched
> >> -fmodulo-sched-allow-regmoves.  There were no changes.  (And this is
> >> a benchmark that does benefit from modulo scheduling, in some cases
> >> by a significant amount.)
> >>
> > Ok, the patch should be neutral performance-wise. One could check btw
> > if SMS produces exactly the same output in both cases, even w/o a
> > powerpc (or any other) HW.
>
> OK.  The patch does change the output a little because of the MEM_DEP
> range thing.


Ahh, right.

>
>  To compensate for that, I tried compiling libav on ARM with:
>
> Index: gcc/modulo-sched.c
> ===
> --- gcc/modulo-sched.c  2011-08-05 11:23:16.0 +0100
> +++ gcc/modulo-sched.c  2011-08-05 11:27:57.0 +0100
> @@ -1680,7 +1680,7 @@ get_sched_window (partial_schedule_ptr p
>                          p_st, early_start, e->latency);
>
>              if (e->data_type == MEM_DEP)
> -               end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> +               end = MIN (end, SCHED_TIME (v_node) + ii);
>            }
>          else if (dump_file)
>             fprintf (dump_file, "the node is not scheduled\n");
> @@ -1729,7 +1729,7 @@ get_sched_window (partial_schedule_ptr p
>                          s_st, late_start, e->latency);
>
>              if (e->data_type == MEM_DEP)
> -               end = MAX (end, SCHED_TIME (v_node) - ii + 1);
> +               end = MAX (end, SCHED_TIME (v_node) - ii);
>              if (dump_file)
>                  fprintf (dump_file, "end = %d\n", end);
>
> @@ -1791,7 +1791,7 @@ get_sched_window (partial_schedule_ptr p
>                 count_preds++;
>
>              if (e->data_type == MEM_DEP)
> -               end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> +               end = MIN (end, SCHED_TIME (v_node) + ii);
>            }
>           else if (dump_file)
>             fprintf (dump_file, "the node is not scheduled\n");
>
> applied, then tried the same thing with the revised patch below.
> The output was the same.


ok, that's a good sanity check.

>
> (FWIW, libav did show up extra differences when using the patch
> that I'd originally submitted.  They were due to the count_preds
> and count_succs thing that you picked up in your review.)


(These differences had no noticable consequences performance-wise, right?)

>
> >> Bootstrapped & regression-tested on powerpc-ibm-aix5.3 with the
> >> additional patch:
> >>
> >> Index: gcc/opts.c
> >> ===
> >> --- gcc/opts.c  2011-08-03 10:56:50.0 +0100
> >> +++ gcc/opts.c  2011-08-03 10:56:50.0 +0100
> >> @@ -449,6 +449,8 @@ static const struct default_options defa
> >>     { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 },
> >>     { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 },
> >>     { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 },
> >> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched, NULL, 1 },
> >> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched_allow_regmoves, NULL, 1 },
> >>
> >>     /* -O2 optimizations.  */
> >>     { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 },
> >>
> >> applied for both the "before" and "after" runs.  OK to install?
> >
> >
> > Yes, with just a couple of minor non-mandatory comments:
> > 1. Setting "count_preds = psp_not_empty;" and "count_succs =
> > pss_not_empty;" suggests that you want to decrement non-interesting
> > neighbors instead of incrementing interesting ones. Otherwise can
> > initialize to zero (doesn't really matter, just a bit confusing).
>
> Yeah, good point.  I initialised them to zero, like you said,
> and changed the reversal condition to:
>
>  if (pss_not_empty && count_succs >= count_preds)
>
> Which I have admit makes a lot more sense than what I submitted,
> and avoids some unwanted changes in output.
>
> > 2. I find it easier to read "late_start = MIN (late_start, early_start
> > + (ii - 1));" instead of "late_start = MIN (early_start + (ii - 1),
> > late_start);".
>
> Oops, that was accidental.  Fixed

Re: [named address] rejects-valid: error: initializer element is not computable at load time

2011-08-08 Thread Georg-Johann Lay

Ulrich Weigand schrieb:

Georg-Johann Lay wrote:


Ulrich Weigand wrote:


This is pretty much working as expected.  "pallo" is a string literal
which (always) resides in the default address space.  According to the
named address space specification (TR 18037) there are no string literals
in non-default address spaces ...


The intension of TR 18037 to supply "Extension to Support Embedded Systems"
and these are systems where every byte counts -- and it counts *where* the
byte will be placed.

Basically named AS provide something like target specific qualifiers, and
if GCC, maybe under the umbrella of GNU-C, would actually implement a feature
like target specific qualifiers, that would be a great gain and much more
appreciated than -- users will perceive it that way -- being more catholic
than the pope ;-)


The problem with all language extensions is that you really need to be careful
that the new feature you want to add is fully specified, in all its potential
interactions with the rest of the (existing) language features.  If you don't,
some of those ambiguities are certain to be causing you problems later on --
in fact, that's just what has happened time and again with GCC extensions
that were added early on ...  This is why these days, extensions usually are
accepted only if they *are* fully specified (which usually means providing
a "diff" to the C standard text that would add the feature to the standard).

This is a non-trivial task.  One of the reasons why we decided to follow the
TR 18037 spec when implementing the __ea extension for SPU is that this task
had already been done for us.  If you want to deviate from that existing spec,
you're back to doing this work yourself.



For example, you can have any combination of qualifiers like const, restrict
or volatile, but it is not possible for named AS.  That's clear as long as
named AS is as strict as TR 18037.  However, users want features to write
down their code an a comfortable, type-safe way and not as it is at the moment,
i.e. by means of dreaded inline assembler macros (for my specific case).


A named AS qualifier *can* be combined with other qualifiers like const.
It cannot be combined with *other* named AS qualifiers, because that doesn't
make sense in the semantics underlying the address space concept of TR 18037.
What would you expect a combination of two AS qualifiers to mean?


Not two AS qualifiers, but two or more qualifiers, some of which might 
be target specific.  You cannot use AS qualifiers and "abuse" them to 
implement a feature like target specific qualifiers.



The assignment above would therefore need to convert a pointer to the
string literal in the default space to a pointer to the __pgm address
space.  This might be impossible (depending on whether __pgm encloses
the generic space), and even if it is possible, it is not guaranteed
that the conversion can be done as a constant expression at compile time.


The backend can tell. It likes to implement features to help users.
It knows about the semantic and if it's legal or not.

And even if it's all strict under TR 18037, the resulting error messages
are *really* confusing to users because to them, a string literal's address
is known.


It would be possible to the extend named AS implementation to allow AS pointer
conversions in initializers in those cases where the back-end knows this can
be done at load time.  (Since this is all implementation-defined anyway, it
would cause no issues with the standard.  We simply didn't do it because on
the SPU, it is not easily possible.)

Of course, that still wouldn't place the string literal into the non-generic
address space, it just would convert its address.


What I'd do to get a string placed into the __pgm space is to explicitly
initialize an *array* in __pgm space, e.g. like so:

const __pgm char pallo[] = "pallo";

This is defined to work according to TR 18037, and it does actually
work for me on spu-elf.


Ya, but it different to the line above.


Sure, because it allocates only the string data, and not in addition a
pointer to it as your code did ...



Was just starting with the work and
it worked some time ago, so I wondered.


I think some time in the past, there was a bug where initalizers like in
you original line were silently accepted but then incorrect code was
generated (i.e. the pointer would just be initialized to an address in
the generic address space, without any conversion).


And I must admit I am not familiar
will all the dreaded restriction TR 18037 imposes to render it less functional 
:-(


It's not a restriction so much, it's simply that TR 18037 does not say anything
about string literals at all, so they keep working as they do in standard C.


Take the following bit more elaborate example where you want to describe 
a menu:


typedef struct
{
int id;
const char * labels[];
} menu1_t;

const menu1_t menu1  =
{
1,
{
"Yes",
"No",
"Don't really know and will decide

Re: PATCH RFA: Build stages 2 and 3 with C++

2011-08-08 Thread Ian Lance Taylor
Romain Geissler  writes:

> This new build behavior broke former plugins built with gcc. Indeed,
> all cc1 function symbols are now mangled and thus with the current
> trunk, plugins should also look for mangled symbols (and so built
> with g++).
>
> What's the new GCC policy about that ? Do plugins have to be built
> using g++ only, or does the plugin developer have the choice to
> use both gcc and g++ according to it's need (at the cost of adding
> extern "C" {…} in almost every headers to forbid mangling) ?

I think that ideally we would have a well-defined plugin interface,
plugins would stick to calling that, and that interface would be a C
interface.  (I see that gcc-plugin.h does already provide a C interface
even when gcc is compiled with C++.)

However, since we currently permit plugins to call anything in gcc, I
think the answer is going to have to be that plugins which do that
should be compiled with C++.  I don't think that adding extern "C" to
all gcc header files is the right approach.  Adding extern "C" to a few
selected header files seems fine.

Ian


Re: PATCH RFA: Build stages 2 and 3 with C++

2011-08-08 Thread Romain Geissler
Hi

Le 16 juil. 2011 à 08:52, Ian Lance Taylor a écrit :

> I would like to propose this patch as a step toward building gcc using a
> C++ compiler.  This patch builds stage1 with the C compiler as usual,
> and defaults to building stages 2 and 3 with a C++ compiler built during
> stage 1.  This means that the gcc installed and used by most people will
> be built by a C++ compiler.  This will ensure that gcc is fully
> buildable with C++, while retaining the ability to bootstrap with only a
> C compiler, not a C++ compiler.  This will permit us to experiment with
> optionally using C++ for some code, using a #ifdef to select the C
> implementation or the C++ implementation.
> 
> I would suggest that we consider releasing 4.7 this way, as a small
> trial for building gcc with C++.
> 
> This is a big step, so I am sending the patch to both gcc@ and
> gcc-patches@ for comments.
> 
> Bootstrapped and ran testsuite on x86_64-unknown-linux-gnu.
> 
> Ian

This new build behavior broke former plugins built with gcc. Indeed,
all cc1 function symbols are now mangled and thus with the current
trunk, plugins should also look for mangled symbols (and so built
with g++).

What's the new GCC policy about that ? Do plugins have to be built
using g++ only, or does the plugin developer have the choice to
use both gcc and g++ according to it's need (at the cost of adding
extern "C" {…} in almost every headers to forbid mangling) ?

Romain Geissler


Re: [RFC PATCH, i386]: Allow zero_extended addresses (+ problems with reload and offsetable address, "o" constraint)

2011-08-08 Thread H.J. Lu
On Mon, Aug 8, 2011 at 10:11 AM, Uros Bizjak  wrote:
> On Mon, Aug 8, 2011 at 5:30 PM, Ulrich Weigand  wrote:
>> Uros Bizjak wrote:
>>
>>> Although, it would be nice for reload to subsequently fix CSE'd
>>> non-offsetable memory by copying address to temporary reg (*as said in
>>> the documentation*), we could simply require an XMM temporary for
>>> TImode reloads to/from integer registers, and this fixes ICE for x32.
>>
>> Moves are special as far as reload is concerned.  If there is already
>> a move instruction present *before* reload, it will get fixed up
>> according to its constraints as any other instruction.
>>
>> However, reload will *introduce* new moves as part of its operation,
>> and those will *not* themselves get reloaded.  Instead, reload simply
>> assumes that every plain move will just succeed without requiring
>> any reload; if this is not true, the target *must* provide a
>> secondary reload for this move.
>>
>> (Note that the secondary reload could also work by reloading the
>> target address into a temporary; that's up to the target to
>> implement.)
>
> Whoa, indeed.
>
> Using attached patch that reloads memory address instead of going
> through XMM register, the code for the testcase improves from:
>
> test:
> .LFB0:
>        .cfi_startproc
>        pushq   %rbx
>        .cfi_def_cfa_offset 16
>        .cfi_offset 3, -16
>        sall    $4, %esi
>        addl    %edi, %esi
>        movdqa  (%esi), %xmm0
>        movdqa  %xmm0, -16(%rsp)
>        movq    -16(%rsp), %rcx
>        movq    -8(%rsp), %rbx
>        addq    $1, %rcx
>        adcq    $0, %rbx
>        movq    %rcx, -16(%rsp)
>        sall    $4, %edx
>        movq    %rbx, -8(%rsp)
>        movdqa  -16(%rsp), %xmm0
>        movdqa  %xmm0, (%esi)
>        pxor    %xmm0, %xmm0
>        movdqa  %xmm0, (%edx,%esi)
>        popq    %rbx
>        .cfi_def_cfa_offset 8
>        ret
>
> to:
>
> test:
> .LFB0:
>        .cfi_startproc
>        sall    $4, %esi
>        pushq   %rbx
>        .cfi_def_cfa_offset 16
>        .cfi_offset 3, -16
>        addl    %edi, %esi
>        pxor    %xmm0, %xmm0
>        mov     %esi, %eax
>        movq    (%rax), %rcx
>        movq    8(%rax), %rbx
>        addq    $1, %rcx
>        adcq    $0, %rbx
>        sall    $4, %edx
>        movq    %rcx, (%rax)
>        movq    %rbx, 8(%rax)
>        movdqa  %xmm0, (%edx,%esi)
>        popq    %rbx
>        .cfi_def_cfa_offset 8
>        ret
>
> H.J., can you please test attached patch? This optimization won't
> trigger on x86_64 anymore.
>

I will test it.

Thanks.


-- 
H.J.


Re: [RFC PATCH, i386]: Allow zero_extended addresses (+ problems with reload and offsetable address, "o" constraint)

2011-08-08 Thread Uros Bizjak
On Mon, Aug 8, 2011 at 5:30 PM, Ulrich Weigand  wrote:
> Uros Bizjak wrote:
>
>> Although, it would be nice for reload to subsequently fix CSE'd
>> non-offsetable memory by copying address to temporary reg (*as said in
>> the documentation*), we could simply require an XMM temporary for
>> TImode reloads to/from integer registers, and this fixes ICE for x32.
>
> Moves are special as far as reload is concerned.  If there is already
> a move instruction present *before* reload, it will get fixed up
> according to its constraints as any other instruction.
>
> However, reload will *introduce* new moves as part of its operation,
> and those will *not* themselves get reloaded.  Instead, reload simply
> assumes that every plain move will just succeed without requiring
> any reload; if this is not true, the target *must* provide a
> secondary reload for this move.
>
> (Note that the secondary reload could also work by reloading the
> target address into a temporary; that's up to the target to
> implement.)

Whoa, indeed.

Using attached patch that reloads memory address instead of going
through XMM register, the code for the testcase improves from:

test:
.LFB0:
.cfi_startproc
pushq   %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
sall$4, %esi
addl%edi, %esi
movdqa  (%esi), %xmm0
movdqa  %xmm0, -16(%rsp)
movq-16(%rsp), %rcx
movq-8(%rsp), %rbx
addq$1, %rcx
adcq$0, %rbx
movq%rcx, -16(%rsp)
sall$4, %edx
movq%rbx, -8(%rsp)
movdqa  -16(%rsp), %xmm0
movdqa  %xmm0, (%esi)
pxor%xmm0, %xmm0
movdqa  %xmm0, (%edx,%esi)
popq%rbx
.cfi_def_cfa_offset 8
ret

to:

test:
.LFB0:
.cfi_startproc
sall$4, %esi
pushq   %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
addl%edi, %esi
pxor%xmm0, %xmm0
mov %esi, %eax
movq(%rax), %rcx
movq8(%rax), %rbx
addq$1, %rcx
adcq$0, %rbx
sall$4, %edx
movq%rcx, (%rax)
movq%rbx, 8(%rax)
movdqa  %xmm0, (%edx,%esi)
popq%rbx
.cfi_def_cfa_offset 8
ret

H.J., can you please test attached patch? This optimization won't
trigger on x86_64 anymore.

Thanks,
Uros.
Index: config/i386/i386.md
===
--- config/i386/i386.md (revision 177565)
+++ config/i386/i386.md (working copy)
@@ -2073,6 +2073,40 @@
(const_string "orig")))
(set_attr "mode" "SI,DI,DI,DI,SI,DI,DI,DI,DI,DI,TI,DI,TI,DI,DI,DI,DI,DI")])
 
+;; Reload patterns to support multi-word load/store
+;; with non-offsetable address.
+(define_expand "reload_noff_store"
+  [(parallel [(match_operand 0 "memory_operand" "=m")
+  (match_operand 1 "register_operand" "r")
+  (match_operand:DI 2 "register_operand" "=&r")])]
+  "TARGET_64BIT"
+{
+  rtx mem = operands[0];
+  rtx addr = XEXP (mem, 0);
+
+  emit_move_insn (operands[2], addr);
+  mem = replace_equiv_address_nv (mem, operands[2]);
+
+  emit_insn (gen_rtx_SET (VOIDmode, mem, operands[1]));
+  DONE;
+})
+
+(define_expand "reload_noff_load"
+  [(parallel [(match_operand 0 "register_operand" "=r")
+  (match_operand 1 "memory_operand" "m")
+  (match_operand:DI 2 "register_operand" "=r")])]
+  "TARGET_64BIT"
+{
+  rtx mem = operands[1];
+  rtx addr = XEXP (mem, 0);
+
+  emit_move_insn (operands[2], addr);
+  mem = replace_equiv_address_nv (mem, operands[2]);
+
+  emit_insn (gen_rtx_SET (VOIDmode, operands[0], mem));
+  DONE;
+})
+
 ;; Convert impossible stores of immediate to existing instructions.
 ;; First try to get scratch register and go through it.  In case this
 ;; fails, move by 32bit parts.
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 177566)
+++ config/i386/i386.c  (working copy)
@@ -28245,18 +28245,25 @@
 
 static reg_class_t
 ix86_secondary_reload (bool in_p, rtx x, reg_class_t rclass,
-  enum machine_mode mode,
-  secondary_reload_info *sri ATTRIBUTE_UNUSED)
+  enum machine_mode mode, secondary_reload_info *sri)
 {
   /* Double-word spills from general registers to non-offsettable memory
- references (zero-extended addresses) go through XMM register.  */
+ references (zero-extended addresses) require special handling.  */
   if (TARGET_64BIT
   && MEM_P (x)
   && GET_MODE_SIZE (mode) > UNITS_PER_WORD
   && rclass == GENERAL_REGS
   && !offsettable_memref_p (x))
-return SSE_REGS;
+{
+  sri->icode = (in_p
+   ? CODE_FOR_reload_noff_load
+   : CODE_FOR_reload_noff_store);
+  /* Add the cost of move to a temporary.  */
+  sri->extra_cost = 1;
 
+  return NO_R

Re: FDO and LTO on ARM

2011-08-08 Thread Jonathan Wakely
On 8 August 2011 13:20, Mike Hommey wrote:
>
> I unfortunately hit several problems with gcc 4.7 (latest snapshot).
> One is bug 50022 that I filed today.
>
> Another is the following error in stlport headers:
>  error: invalid use of incomplete type 'std::string {aka struct
>  std::basic_string, std::allocator >}'
>
> I also tried GNU libstdc++ instead of stlport but I hit some other
> errors that boil down to the following:
>  error: 'std::wstring' has not been declared

They both look as though they could be caused by something as simple
as failing to include  rather than a problem in GCC.  Could
you send me more context for the errors (offlist if you prefer)?  I'll
see if it's something we've changed in libstdc++, though given that
STlport fails too it seems unlikely.


Re: [RFC PATCH, i386]: Allow zero_extended addresses (+ problems with reload and offsetable address, "o" constraint)

2011-08-08 Thread Ulrich Weigand
Uros Bizjak wrote:

> Although, it would be nice for reload to subsequently fix CSE'd
> non-offsetable memory by copying address to temporary reg (*as said in
> the documentation*), we could simply require an XMM temporary for
> TImode reloads to/from integer registers, and this fixes ICE for x32.

Moves are special as far as reload is concerned.  If there is already
a move instruction present *before* reload, it will get fixed up
according to its constraints as any other instruction.

However, reload will *introduce* new moves as part of its operation,
and those will *not* themselves get reloaded.  Instead, reload simply
assumes that every plain move will just succeed without requiring
any reload; if this is not true, the target *must* provide a
secondary reload for this move.

(Note that the secondary reload could also work by reloading the
target address into a temporary; that's up to the target to
implement.)

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com


Re: FDO and LTO on ARM

2011-08-08 Thread Mike Hommey
On Thu, Aug 04, 2011 at 08:51:41PM +0200, Jan Hubicka wrote:
> > +Mark who has done size optimization tuning with FDO.
> > 
> > On Thu, Aug 4, 2011 at 7:05 AM, Mike Hommey  wrote:
> > > Hi,
> > >
> > > We (Mozilla) are trying to get the best of the ARM toolchain for our
> > > Android build. I recently built an Android Native-code Development Kit
> > > with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
> > > 2.19 that come with the default NDK.
> > >
> > > LTO doesn't work at all, I'm getting an ICE that looks like the one from
> > > bug 41159.
> > >
> > > FDO however, works, but sadly, the resulting build is not only quite
> > > bigger,
> > 
> > Is this true for both 4.6 and 4.4 gcc? There is a bug in 4.6 that
> > prevents cold functions from be optimized for size with FDO. The bug
> > was fixed in trunk recently.
> 
> You can also backport the patch to 4.6 tree. If the bug exists there, consider
> the patch preaproved.

Do you have a link to the bug or the patch?

> With FDO, -O2 and -O3 is not really that significandly different (i.e. -O2
> gets all the extra inlining, but it does not get vectorization that is 
> probably
> not big deal for you). -Os is however different storry.
> > >
> > > Has there been significant changes to the ARM backend that would justify
> > > that I try some more with current GCC HEAD? Should I maybe try some more
> > > with the linaro GCC branch? Are there things we can do to help getting
> > > better ARM performance?
> > 
> > It does not hurt to try it :)
> 
> One thing that is really changed is inliner heuristics.  If would be very 
> happy
> to have some feedback on this early, since we plan to do re-tunning of it
> (LTO changes many things and there are also fortran benchmarks that shows a 
> lot
> of problems. Mozilla may chime in and make my life even harder with hopefully
> some positive results on it).
> 
> As discused earlier, I think it would be very good idea to start trakcing
> perfomrance of Mozilla built with mainline GCC like we track other benchmarks.
> We don't really monitor anything of this size and thus we are quite likely
> to find new interesting issues by doing so. 

I unfortunately hit several problems with gcc 4.7 (latest snapshot).
One is bug 50022 that I filed today.

Another is the following error in stlport headers:
  error: invalid use of incomplete type 'std::string {aka struct
  std::basic_string, std::allocator >}'

I also tried GNU libstdc++ instead of stlport but I hit some other
errors that boil down to the following:
  error: 'std::wstring' has not been declared

Mike


Re: FDO and LTO on ARM

2011-08-08 Thread Mike Hommey
On Mon, Aug 08, 2011 at 11:25:56AM +0200, Mike Hommey wrote:
> On Fri, Aug 05, 2011 at 09:32:05AM +0200, Richard Guenther wrote:
> > Well, but unless your training coverage is 100% all parts with no coverage
> > get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
> > isn't even close to 100%.  Thus I think recommending -O3 for FDO is
> > usually a bad idea.
> 
> Experience shows that this isn't true.
> 
> > So - did you try FDO with -O2? ;)
> 
> Leads to roughly the same as -O3+FDO. The apk is also the same size, as
> well as individual libraries.

Actually, I have to revise my result, I had forgottent to change the
optimization level for the javascript engine. The apk and library sizes
are still similar, but the performance is actually slower than both
-O3+PGO and -Os.

Mike


Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands

2011-08-08 Thread Georg-Johann Lay
Ulrich Weigand wrote:
> Georg-Johann Lay wrote:
>> Ulrich Weigand wrote:
>>> This means that problems like the one you're seeing have been hidden
>>> so far.  I've started looking into fixing this, but since I don't
>>> have a target where this is needed, this effort never really went
>>> anywhere.  I'll append below a patch I did a year or so ago; just
>>> as something to look at, it probably will not even apply to the
>>> current sources any more.
>> Sounds great that you tried to fix it!  Much work below... wow.
>>
>>> I'd be happy to bring this up to date if you're willing to work with
>>> me to get this tested on a target that needs this support ...
> 
> Here's an updated version of the patch that build with current GCC
> mainline and passes the SPU test suite with no regressions.
> 
> Let me know if this works for you ...

Thanks, it works.

However, it results in bloated code. I guess it's an IRA issue.

int read_from_pgm_3 (const int __pgm * const __pgm * const addr)
{
return **addr;
}

results in bloated (-Os -mmcu=atmega8)

Z = r30/r31
Y = r28/r29 is frame pointer

Even if the dead store insn 30 is eliminated a frame is generated.

read_from_pgm_3:
push r28 ;  31  pushqi1/1   [length = 1]
push r29 ;  32  pushqi1/1   [length = 1]
rcall .  ;  36  *addhi3_sp_R_pc2[length = 1]
in r28,__SP_L__  ;  37  *movhi_sp/2 [length = 2]
in r29,__SP_H__
/* prologue: function */
/* frame size = 2 */
/* stack size = 4 */
.L__stack_usage = 4
movw r30,r24 ;  28  *movphi/1   [length = 2]
lpm __tmp_reg__,Z+   ;  19  *movphi/2   [length = 6]
lpm r31,Z
mov r30,__tmp_reg__
std Y+2,r31  ;  30  *movphi/3   [length = 7]
std Y+1,r30
lpm r24,Z+   ;  23  *movhi/2[length = 2]
lpm r25,Z+
/* epilogue start */
pop __tmp_reg__  ;  42  *addhi3_sp_R_pc2[length = 2]
pop __tmp_reg__
pop r29  ;  43  popqi   [length = 1]
pop r28  ;  44  popqi   [length = 1]
ret  ;  45  return_from_epilogue[length = 1]


The code could be just


read_from_pgm_3:
/* prologue: function */
;; Z = r24
movw r30,r24 ;  28  *movphi/1   [length = 2]
;; Z = *Z
lpm __tmp_reg__,Z+   ;  19  *movphi/2   [length = 6]
lpm r31,Z
mov r30,__tmp_reg__
;; r24 = *Z++
lpm r24,Z+   ;  23  *movhi/2[length = 2]
lpm r25,Z+
/* epilogue start */
ret  ;  45  return_from_epilogue[length = 1]


MODE_CODE_BASE_REG_CLASS no reads

reg_class_t
avr_mode_code_base_reg_class (enum machine_mode mode ATTRIBUTE_UNUSED,
  addr_space_t address_space,
  RTX_CODE outer_code ATTRIBUTE_UNUSED,
  RTX_CODE index_code ATTRIBUTE_UNUSED)
{
  if (ADDR_SPACE_PGM == address_space)
{
  return POINTER_Z_REGS;
}

  return reload_completed ? BASE_POINTER_REGS : POINTER_REGS;
}

even though for __pgm no base+offset addressing is available.
Returning NO_REGS for __pgm results in ICE in push_relaod.

I frequently see IRA doing a very bad job for small register classes
like here.  Maybe it's better to take it completely away from IRA
and write the load as

(set (reg:HI)
 (unspec:HI (post_inc:PHI (reg:PHI Z

Loading from __pgm is actually an unspec, i.e. reading two times from
the same address will yield the same result.

IRA gets fancy about spilling:
Spilling for insn 19.
Using reg 30 for reload 0
  Try Assign 47(a0), cost=48000
changing reg in insn 19
changing reg in insn 25
changing reg in insn 22
  Assigning 47(freq=3000) a new slot 0
 Register 47 now on stack.

Spilling for insn 19.
Using reg 30 for reload 1
Using reg 30 for reload 0
Using reg 18 for reload 3
Spilling for insn 22.
Using reg 26 for reload 0
Spilling for insn 25.
Using reg 26 for reload 0
Spilling for insn 19.
Using reg 30 for reload 0
Using reg 18 for reload 1
Spilling for insn 25.
Spilling for insn 19.
Using reg 30 for reload 0
Using reg 18 for reload 1
Spilling for insn 25.

Johann


> Bye,
> Ulrich
> 
> 
> ChangeLog:
> 
>   * doc/tm.texi.in (MODE_CODE_BASE_REG_CLASS): Add address space
>   argument.
>   (REGNO_MODE_CODE_OK_FOR_BASE_P): Likewise.
>   * doc/tm.texi: Regenerate.
> 
>   * config/cris/cris.h (MODE_CODE_BASE_REG_CLASS): Add address
>   space argument.
>   (REGNO_MODE_CODE_OK_FOR_BASE_P): Likewise.
>   * config/bfin/bfin.h (MODE_CODE_BASE_REG_CLASS): Likewise.
>   (REGNO_MODE_CODE_OK_FOR_BASE_P): Likewise.
> 
>   * addresses.h (base_reg_class): Add address space argument.
>   Pass to MODE_CODE_BASE_REG_CLASS.
>   (ok_for_base_p_1): Add address space argument.  Pass to
>   REGNO_MODE_CODE_OK_FOR_BASE_P.
>   (regno_ok_for_base_p): Add address space argument.  Pass to
>   ok_for_base_p_1.
> 
>   * regrename.c (scan_rtx

Re: libgcc: strange optimization

2011-08-08 Thread Paolo Bonzini

On 08/08/2011 10:06 AM, Richard Guenther wrote:

Like if

register unsigned char *ip;

would increase spill cost of ip compared to

unsigned char *ip;

?


Remember we're talking about a function with 11000 pseudos and 4000 
allocnos (not to mention a 1500 basic blocks).  You cannot really blame 
IRA for not doing the right thing.  And actually, ip and sp are live 
everywhere, so there's no hope of reserving a register for them, 
especially since all x86 callee-save registers have special uses in 
string functions.


If I understand the huge dumps correctly, the missing part is trying to 
use callee-save registers for spilling, rather than memory.  However, 
perhaps another way to do it is a specialized region management scheme 
for large switch statements, treating each switch arm as a separate 
region??  There are few registers live across the switch, and all of 
them are used either "a lot" or "almost never" (and always in cold blocks).


BTW, here are some measurements on x86-64:

1) with regalloc hints: 450060432 bytecodes/sec; 12819996 calls/sec
2) without regalloc hints: 263002439 bytecodes/sec; 9458816 sends/sec

Probably even worse on x86-32.

None of -fira-region=all, -fira-region=one, -fira-algorithm=priority had 
significant changes.  In fact, it's pretty much a "binary" result: I'd 
expect register allocation results to be either on par with (1) or 
similar to (2); everything else is mostly noise.


Paolo


New mirror

2011-08-08 Thread Sergey Kutserey
Hi there! We just raised a new mirror in US, Missouri, Saint Louis.
It has 100Mb/s connection and synced twice a day from main site gcc.gnu.org
URL of mirror is: gcc.petsads.us
My email s.kutse...@gmail.com
My name is Sergey Kutserey

Hopefully you can add this mirror into public mirror list for GCC project.
Thank you.


Re: Question about SMS scheduling windows

2011-08-08 Thread Richard Sandiford
Ayal Zaks  writes:
>> OK.  For the follow-on iv patch, it seemed easier to keep both bounds
>> inclusive for the loop, then make the "end" exclusive when setting the
>> out parameters.  (Note that there shouldn't be any problem with overflow
>> when making the bound exclusive, because the size of the range has been
>> restricted to "ii" by that stage.  The follow-on patch does use
>> saturating maths for all ops though.)
>
> Sure, no problem having "end" inclusive, as it simplifies things. We
> can even keep it inclusive all the way through, iterating over: for (c
> = start; c != end+step; c += step)...

Yeah, I'd prefer that too.  I might do it once Revital's and
Roman's changes are in.

>> I don't have powerpc hardware that I can do meaningful performance
>> testing on, but I did run it through a Popular* Embedded Benchmark
>> on an ARM Cortex-A8 board with -O3 -fmodulo-sched
>> -fmodulo-sched-allow-regmoves.  There were no changes.  (And this is
>> a benchmark that does benefit from modulo scheduling, in some cases
>> by a significant amount.)
>>
> Ok, the patch should be neutral performance-wise. One could check btw
> if SMS produces exactly the same output in both cases, even w/o a
> powerpc (or any other) HW.

OK.  The patch does change the output a little because of the MEM_DEP
range thing.  To compensate for that, I tried compiling libav on ARM with:

Index: gcc/modulo-sched.c
===
--- gcc/modulo-sched.c  2011-08-05 11:23:16.0 +0100
+++ gcc/modulo-sched.c  2011-08-05 11:27:57.0 +0100
@@ -1680,7 +1680,7 @@ get_sched_window (partial_schedule_ptr p
  p_st, early_start, e->latency);
 
  if (e->data_type == MEM_DEP)
-   end = MIN (end, SCHED_TIME (v_node) + ii - 1);
+   end = MIN (end, SCHED_TIME (v_node) + ii);
}
  else if (dump_file)
 fprintf (dump_file, "the node is not scheduled\n");
@@ -1729,7 +1729,7 @@ get_sched_window (partial_schedule_ptr p
  s_st, late_start, e->latency);
 
  if (e->data_type == MEM_DEP)
-   end = MAX (end, SCHED_TIME (v_node) - ii + 1);
+   end = MAX (end, SCHED_TIME (v_node) - ii);
  if (dump_file)
  fprintf (dump_file, "end = %d\n", end);
 
@@ -1791,7 +1791,7 @@ get_sched_window (partial_schedule_ptr p
 count_preds++;
 
  if (e->data_type == MEM_DEP)
-   end = MIN (end, SCHED_TIME (v_node) + ii - 1);
+   end = MIN (end, SCHED_TIME (v_node) + ii);
}
   else if (dump_file)
 fprintf (dump_file, "the node is not scheduled\n");

applied, then tried the same thing with the revised patch below.
The output was the same.

(FWIW, libav did show up extra differences when using the patch
that I'd originally submitted.  They were due to the count_preds
and count_succs thing that you picked up in your review.)

>> Bootstrapped & regression-tested on powerpc-ibm-aix5.3 with the
>> additional patch:
>>
>> Index: gcc/opts.c
>> ===
>> --- gcc/opts.c  2011-08-03 10:56:50.0 +0100
>> +++ gcc/opts.c  2011-08-03 10:56:50.0 +0100
>> @@ -449,6 +449,8 @@ static const struct default_options defa
>>     { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 },
>>     { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 },
>>     { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 },
>> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched, NULL, 1 },
>> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched_allow_regmoves, NULL, 1 },
>>
>>     /* -O2 optimizations.  */
>>     { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 },
>>
>> applied for both the "before" and "after" runs.  OK to install?
>
>
> Yes, with just a couple of minor non-mandatory comments:
> 1. Setting "count_preds = psp_not_empty;" and "count_succs =
> pss_not_empty;" suggests that you want to decrement non-interesting
> neighbors instead of incrementing interesting ones. Otherwise can
> initialize to zero (doesn't really matter, just a bit confusing).

Yeah, good point.  I initialised them to zero, like you said,
and changed the reversal condition to:

  if (pss_not_empty && count_succs >= count_preds)

Which I have admit makes a lot more sense than what I submitted,
and avoids some unwanted changes in output.

> 2. I find it easier to read "late_start = MIN (late_start, early_start
> + (ii - 1));" instead of "late_start = MIN (early_start + (ii - 1),
> late_start);".

Oops, that was accidental.  Fixed.

> 3. Suggest to set step=1 at the beginning, explaining from the start
> that we first compute a forward range (start possibly revert it.

Yeah, that's better.

> 4. "late_start" would better be renamed "late_end".

Since you said this was optional, I decided to leave it as it is.
The name and value of "late start" come directly from th

Re: FDO and LTO on ARM

2011-08-08 Thread Mike Hommey
On Fri, Aug 05, 2011 at 09:32:05AM +0200, Richard Guenther wrote:
> Well, but unless your training coverage is 100% all parts with no coverage
> get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
> isn't even close to 100%.  Thus I think recommending -O3 for FDO is
> usually a bad idea.

Experience shows that this isn't true.

> So - did you try FDO with -O2? ;)

Leads to roughly the same as -O3+FDO. The apk is also the same size, as
well as individual libraries.

Mike


Re: libgcc: strange optimization

2011-08-08 Thread Richard Guenther
On Sat, Aug 6, 2011 at 5:00 PM, Paolo Bonzini  wrote:
> On 08/04/2011 01:10 PM, Andrew Haley wrote:

 >>  It's the sort of thing that gets done in threaded interpreters,
 >>  where you really need to keep a few pointers in registers and
 >>  the interpreter itself is a very long function.  gcc has always
 >>  done a dreadful job of register allocation in such cases.
>>>
>>> >
>>> >  Sure, but what I have seen people use global register variables
>>> >  for this (which means they get taken away from the register
>>> > allocator).
>>
>> Not always though, and the x86 has so few registers that using a
>> global register variable is very problematic.  I suppose you could
>> compile the threaded interpreter in a file of its own, but I'm not
>> sure that has quite the same semantics as local register variables.
>
> Indeed, local register variables give almost the same benefit as globals
> with half the burden.  The idea is that you don't care about the exact
> register that holds the contents but, by specifying a callee-save register,
> GCC will use those instead of memory across calls.  This reduces _a lot_ the
> number of spills.
>
>> The problem is that people who care about this stuff very much don't
>> always read...@gcc.gnu.org  so won't be heard.  But in their own world
>> (LISP, Forth) nice features like register variables and labels as
>> values have led to gcc being the preferred compiler for this kind of
>> work.
>
> /me raises hands.
>
> For GNU Smalltalk, using
>
> #if defined(__i386__)
> # define __DECL_REG1 __asm("%esi")
> # define __DECL_REG2 __asm("%edi")
> # define __DECL_REG3 /* no more caller-save regs if PIC is in use!  */
> #endif
>
> #if defined(__x86_64__)
> # define __DECL_REG1 __asm("%r12")
> # define __DECL_REG2 __asm("%r13")
> # define __DECL_REG3 __asm("%rbx")
> #endif
>
> ...
>
>  register unsigned char *ip __DECL_REG1;
>  register OOP * sp __DECL_REG2;
>  register intptr_t arg __DECL_REG3;
>
> improves performance by up to 20% if I remember correctly.  I can benchmark
> it if desired.
>
> It does not come for free, in some cases the register allocator does some
> stupid things due to the hard register declaration.  But it gets much better
> code overall, so who cares about the microoptimization.
>
> Of course, if the register allocator did the right thing, or if I could use
> simply
>
>  unsigned char *ip __attribute__(__do_not_spill_me__(20)));
>  OOP *sp __attribute__(__do_not_spill_me__(10)));
>  intptr_t arg __attrbite__(__do_not_spill_me__(0)));
>
> that would be just fine.

Like if

register unsigned char *ip;

would increase spill cost of ip compared to

unsigned char *ip;

?  It is, after all, a cost issue - forcefully pinning down registers can
lead to problems.  We'd of course have to somehow "preserve" the
register state of ip for all relevant pseudos (and avoid coalescing with
non-register ones).

Richard.

> Paolo
>