[Bug target/62275] ARM should use vcvta instructions when possible for float -> int rounding

2014-09-02 Thread josh.m.conner at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62275

--- Comment #5 from Joshua Conner josh.m.conner at gmail dot com ---
Thanks!


[Bug target/62275] New: ARM should use vcvta instructions when possible for float -> int rounding

2014-08-26 Thread josh.m.conner at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62275

Bug ID: 62275
   Summary: ARM should use vcvta instructions when possible for
float -> int rounding
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: josh.m.conner at gmail dot com

Instead of generating a library call for lround/lroundf, the ARM backend should
use the vcvta.s32.f64 and vcvta.s32.f32 instructions (as long as
-fno-math-errno has been given, since these obviously won't set errno).
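
For example (a hypothetical illustration of the kind of call this covers,
assuming -O2 -fno-math-errno on a target with the ARMv8 FP instructions):

  #include <math.h>

  long round_float (float x)
  {
    /* With -fno-math-errno there is no errno handling to preserve, so this
       could be expanded inline to a single vcvta.s32.f32.  */
    return lroundf (x);
  }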


[Bug middle-end/56924] Folding of checks into a range check should check upper boundary

2014-07-31 Thread josh.m.conner at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56924

--- Comment #3 from Joshua Conner josh.m.conner at gmail dot com ---
It appears that gcc has a different approach now, which has its own advantages
and disadvantages.  Specifically, when I compile this same example I'm now
seeing an initial tree of:

  if ((SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 224
      || (SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 240)
{
  bar ();
}

Which indeed generates much better assembly code (for ARM):

and r0, r0, #224
cmp r0, #224
beq .L4

But with a slight modification of the original code to:

  if ((input.val == 0xd) || (input.val == 0xe) || (input.val == 0xf))
bar();

The tree looks like:

  if (((SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 208
       || (SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 224)
      || (BIT_FIELD_REF <input, 8, 0> & 240) == 240)

And the generated assembly is:

uxtb    r0, r0
and r3, r0, #240
and r0, r0, #208
cmp r0, #208
cmpne   r3, #224
beq .L4

Which could be much better as:

ubfx    r0, r0, #4, #4
cmp r0, #12
bhi .L4
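
In source terms (my reading of the desired assembly above), that is just the
single range check:

  if (input.val >= 0xd)
    bar ();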


[Bug target/56315] ARM: Improve use of 64-bit constants in logical operations

2014-01-22 Thread josh.m.conner at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56315

--- Comment #4 from Joshua Conner josh.m.conner at gmail dot com ---
Excellent - thanks!


[Bug rtl-optimization/57462] ira-costs considers only a single register at a time

2013-06-03 Thread josh.m.conner at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57462

--- Comment #2 from Joshua Conner josh.m.conner at gmail dot com ---
No problem - I appreciate you taking the time to respond.  This has a
noticeable impact on codegen for ARM because of the redundancy in the CPU/FPU
functionality and the cost of transferring data between integer/FP registers, so I
thought it would be worth mentioning in case it wasn't recognized already. 
Thanks.


[Bug rtl-optimization/57462] New: ira-costs considers only a single register at a time

2013-05-29 Thread josh.m.conner at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57462

Bug ID: 57462
   Summary: ira-costs considers only a single register at a time
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: josh.m.conner at gmail dot com

In this code:

  int PopCnt(unsigned long long a, unsigned long long b)
  {
    register int c=0;

    while(a) {
      c++;
      a &= a + b;
    }
    return(c);
  }

Built for ARM with:

  gcc test.c -O2 -S -o test.s

The code generated for the loop is:

  .L3:
  fmdrr   d18, r0, r1 @ int
  vadd.i64    d16, d18, d17
  fmrrd   r4, r5, d16 @ int
  and r0, r0, r4
  and r1, r1, r5
  orrs    r5, r0, r1
  add r3, r3, #1
  bne .L3

There is quite a bit of gymnastics in order to use the FP registers for the add
instruction.  The code is simpler if all registers are allocated to integer
registers:

  .L3:
  adds    r2, r4, r6
  adc r3, r5, r7
  and r4, r4, r2
  and r5, r5, r3
  orrs    r3, r4, r5
  add r0, r0, #1
  bne .L3

The code is shorter, and doesn't include the potentially-expensive FP<->INT
register move operations.

*** The rest of this bug is my analysis, providing an explanation of why I have
put this bug into the rtl-optimization category.

The problem I see is that the register classifier (ira-costs.c) makes decisions
on register classes for each register in relative isolation, without adequately
considering the impact of that decision on other registers.  In this example,
we have 3 main registers we're concerned with: a, b, and a temporary register
(ignoring c, which we don't need to consider).  The code when costs are
calculated is roughly:

  tmp = a + b
  a = a & tmp
  CC = compare (a, 0)

Both the adddi3 and anddi3 operations can be performed in either integer or FP
regs, with a preference for the FP regs because the sequence is shorter (1 insn
instead of 2).

The compare operation can only be performed in an integer register.

In the first pass of the cost analysis:
a is assigned to the integer registers, since the cheaper adddi/anddi
operations are outweighed by the cost of having to move the value from FP->INT
for the compare.
b and tmp are both assigned to FP registers, since they are only involved
in operations that are cheaper with the FP hardware.

In the second pass of the cost analysis, each register is again analyzed
independently:
a is left in the integer register because moving it to an FP register would
add an additional FP->INT move for the compare.
b and tmp are both left in FP registers because moving either one would
still leave us with mixed FP/INT operations.

The biggest problem I see is that the first pass should recognize that since
a must be in an integer register, there is an unconsidered cost to putting
b and tmp in FP registers since they are involved in instructions where the
operands must be in the same register class.

A lesser, and probably more difficult, problem is that the second pass could do
better if it would consider changing register classes of more than one register
at a time.  This seems potentially complex, but perhaps we could just consider
register pairs that are involved in instructions with mismatched operand
classes, where the combination is invalid for the instruction.


[Bug rtl-optimization/57231] Hoist zero-extend operations when possible

2013-05-10 Thread josh.m.conner at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57231

--- Comment #3 from Joshua Conner josh.m.conner at gmail dot com ---
Exactly - there's no need to truncate every iteration, we should be able to
safely do it when the loop is complete.


[Bug rtl-optimization/57231] New: Hoist zero-extend operations when possible

2013-05-09 Thread josh.m.conner at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57231

Bug ID: 57231
   Summary: Hoist zero-extend operations when possible
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: josh.m.conner at gmail dot com

Compiling this code at -O2:

  unsigned char *value;

  unsigned short foobar (int iters)
  {
    unsigned short total;
    unsigned int i;

    for (i = 0; i < iters; i++)
      total += value[i];

    return total;
  }

On ARM generates a zero-extend of total for every iteration of the loop:

  .L3:
    ldrb    r1, [ip, r3]    @ zero_extendqisi2
    add r3, r3, #1
    cmp r3, r0
    add r2, r2, r1
    uxth    r2, r2
    bne .L3

I believe we should be able to hoist the zero-extend (uxth) after the loop.

Note that although I manifested this for ARM, I believe it's a general case
that would have to be handled by the rtl optimizers.
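
As a source-level sketch of the intent (my own illustration, not part of the
original test case), the accumulation could stay in a full-width register
inside the loop and be truncated just once afterwards:

  extern unsigned char *value;

  unsigned short foobar_hoisted (int iters)
  {
    unsigned int total = 0;
    unsigned int i;

    for (i = 0; i < (unsigned int) iters; i++)
      total += value[i];

    /* One truncation (uxth) after the loop instead of one per iteration.  */
    return (unsigned short) total;
  }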

This shows up in a hot loop of bzip2:

for (i = gs; i <= ge; i++) {
   UInt16 icv = szptr[i];
   cost0 += len[0][icv];
   cost1 += len[1][icv];
   cost2 += len[2][icv];
   cost3 += len[3][icv];
   cost4 += len[4][icv];
   cost5 += len[5][icv];
}


[Bug c/56924] New: Folding of checks into a range check should check upper boundary

2013-04-11 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56924

 Bug #: 56924
   Summary: Folding of checks into a range check should check
            upper boundary
Classification: Unclassified
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

When we are performing folding of checks into a range check, if the values are
at the top end of the range we should just use a >= test instead of normalizing
them into the bottom of the range and using a <= test.



For example, consider:



  struct stype {
    unsigned int pad:4;
    unsigned int val:4;
  };

  void bar (void);

  void foo (struct stype input)
  {
    if ((input.val == 0xe) || (input.val == 0xf))
      bar();
  }





When compiled at -O2, the original tree generated is:





  ;; Function foo (null)
  ;; enabled by -tree-original

  {
    if (input.val + 2 <= 1)
      {
        bar ();
      }
  }



This is likely to be more efficient if we instead generate:



    if (input.val >= 0xe)
      {
        bar ();
      }



This can be seen in the inefficient codegen for an ARM cortex-a15:



    ubfx    r0, r0, #4, #4
    add r3, r0, #2
    and r3, r3, #15
    cmp r3, #1



(the add and the and are not necessary if we change the test condition).



I was able to improve this by adding detection of this case into

build_range_check.


[Bug tree-optimization/56925] New: SRA should take into account likelihood of statements being executed

2013-04-11 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56925

 Bug #: 56925
   Summary: SRA should take into account likelihood of statements
            being executed
Classification: Unclassified
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

In the following code:



  struct stype {
    unsigned int pad:4;
    unsigned int val:4;
  };

  void bar (void);
  void baz (void);

  int x, y;

  unsigned int foo (struct stype input)
  {
    if (__builtin_expect (x, 0))
      return input.val;

    if (__builtin_expect (y, 0))
      return input.val + 1;

    return 0;
  }



When compiled with -O2, SRA moves the read of input.val to the top of the

function:





  ;; Function foo (foo, funcdef_no=0, decl_uid=4988, cgraph_uid=0)

  Candidate (4987): input
  Rejected (4999): not aggregate: y.1
  Rejected (4993): not aggregate: x.0
  Created a replacement for input offset: 4, size: 4: input$val
  ...

  <bb 2>:
  input$val_14 = input.val;
  x.0_3 = x;
  _4 = __builtin_expect (x.0_3, 0);
  if (_4 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;
  ...



Which means that the critical path for this function now executes an extra

instruction.



It would be nice if SRA would take into account the likelihood of statement

execution when deciding whether to apply the transformation.  We currently

verify that there are at least two reads -- perhaps we should check that there

are at least two reads that are likely to occur.



This can be seen in sub-optimal codegen for ARM, where a bitfield extract

(ubfx) is moved out of unlikely code into the critical path:



  foo:
    movw    r3, #:lower16:x
    ubfx    r2, r0, #4, #4
    movt    r3, #:upper16:x
    ldr r3, [r3]
    cmp r3, #0
    bne .L6
    ...


[Bug tree-optimization/56352] New: Simplify testing of related conditions in for loop

2013-02-15 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56352

 Bug #: 56352
   Summary: Simplify testing of related conditions in for loop
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

If we have a loop like this:



    for (i = 0; i < a && i < b; i++)
    {
      /* Code which cannot affect i, a, or b */
    }



gcc should be able to optimize this into:



    tmp = MIN(a,b)
    for (i = 0; i < tmp; i++)
    {
      /* Body */
    }



But it does not.  Similarly, code like:



    for (i = 0; i < a; i++)
    {
      if (i >= b)
        break;

      /* Code which cannot affect i, a, or b */
    }



Should be similarly optimized.


[Bug target/56313] New: aarch64 backend not using fmls instruction

2013-02-13 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56313

 Bug #: 56313
   Summary: aarch64 backend not using fmls instruction
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

When this code is compiled with -O2 -ffast-math -S for an aarch64-linux-gnu

target:



float v1 __attribute__((vector_size(8)));
float v2 __attribute__((vector_size(8)));
float result __attribute__((vector_size(8)));

void foo (void)
{
  result = result + (-v1 * v2);
}



The following is generated:



    ld1 {v0.2s}, [x0]
    fneg    v2.2s, v2.2s
    ld1 {v1.2s}, [x1]
    fmla    v0.2s, v2.2s, v1.2s
    st1 {v0.2s}, [x0]



This code could be improved to:

    ld1 {v0.2s}, [x0]
    ld1 {v1.2s}, [x1]
    fmls    v0.2s, v2.2s, v1.2s
    st1 {v0.2s}, [x0]


[Bug target/56313] aarch64 backend not using fmls instruction

2013-02-13 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56313



--- Comment #1 from Joshua Conner josh.m.conner at gmail dot com 2013-02-14 
01:39:55 UTC ---

In case it helps, the pattern for aarch64_vmls<mode> is written as:



  (set (op0)
       (minus (op1)
              (mult (op2)
                    (op3))))


Restructuring this to:



  (set (op0)
       (fma (neg (op1))
            (op2)
            (op3)))



Allows the combiner to take advantage of the pattern.


[Bug target/56315] New: ARM: Improve use of 64-bit constants in logical operations

2013-02-13 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56315

 Bug #: 56315
   Summary: ARM: Improve use of 64-bit constants in logical
            operations
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

In the ARM backend, support was added for recognizing addition with 64-bit

constants that could be split up into two 32-bit literals that could be handled

with immediates in the adds/adc operations.  However, this support is still not

present for the logical operations.  For example, compiling this code with -O2:



  unsigned long long or64 (unsigned long long input)
  {
    return input | 0x200000004ULL;
  }



Gives us:



    mov r2, #4
    mov r3, #2
    orr r0, r0, r2
    orr r1, r1, r3



When it could produce:



    orr r0, r0, #4
    orr r1, r1, #2



The same improvement could be applied to & and ^ operations as well.
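
For illustration (hypothetical examples of my own, mirroring or64 above), the
analogous cases would be:

  unsigned long long and64 (unsigned long long input)
  {
    /* Both halves of the inverted constant fit in ARM immediates, so this
       could become a bic with #4 on the low word and #2 on the high word.  */
    return input & ~0x200000004ULL;
  }

  unsigned long long xor64 (unsigned long long input)
  {
    /* Likewise: eor with #4 and #2 on the two halves.  */
    return input ^ 0x200000004ULL;
  }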


[Bug tree-optimization/56094] New: Invalid line number info generated with tree-level ivopts

2013-01-23 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56094

 Bug #: 56094
   Summary: Invalid line number info generated with tree-level
            ivopts
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

The attached code has a number of instructions that are associated with the
head of the function.  This was showing up when setting a breakpoint on the
function itself, where gdb was setting several breakpoint locations - not a
problem in general, except that the statements were more appropriately
correlated with intra-loop calculations than with anything to do with the
prologue.



To reproduce, compile the attached file with -g -O2, and notice the statements

associated with line 83.



These lines are in fact statements that are generated during tree-level

induction variable optimization, but which aren't getting their location data

copied over from the gimple statement and so they default to the start of the

function (sorry for being a bit vague, but it's been a while since I looked

into the mechanics of this and I don't recall the details).



The fix I have implemented in our local tree is in rewrite_use_nonlinear_expr,
where after generating the computation (comp) I verify that it has a location
associated with it - if it doesn't but the use stmt (use->stmt) does have a
location, I copy the location from use->stmt over to comp.
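
A minimal sketch of that idea (my reconstruction, not the actual local patch;
it assumes the names used in tree-ssa-loop-ivopts.c):

  /* After comp has been built in rewrite_use_nonlinear_expr: let it inherit
     the use statement's location when it has none of its own.  */
  if (!EXPR_HAS_LOCATION (comp) && gimple_has_location (use->stmt))
    protected_set_expr_location (comp, gimple_location (use->stmt));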


[Bug tree-optimization/56094] Invalid line number info generated with tree-level ivopts

2013-01-23 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56094



--- Comment #1 from Joshua Conner josh.m.conner at gmail dot com 2013-01-24 
04:03:44 UTC ---

Created attachment 29263

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29263

Reduced test case


[Bug tree-optimization/56094] Invalid line number info generated with tree-level ivopts

2013-01-23 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56094



--- Comment #2 from Joshua Conner josh.m.conner at gmail dot com 2013-01-24 
04:05:09 UTC ---

Sorry, I should have been more specific -- the function I'm describing in the

previous comments is test_main.


[Bug rtl-optimization/55747] New: Extra registers are saved in functions that only call noreturn functions

2012-12-19 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55747

 Bug #: 55747
   Summary: Extra registers are saved in functions that only call
            noreturn functions
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

On architectures such as ARM, where a link register is used to save the return

address, this value does not need to be saved in a function that only calls

noreturn functions.



For example, if I build the following source:



  __attribute__((noreturn))
  extern void bar (void);

  int x;

  void foo (void)
  {
    if (x)
      bar ();
  }



Using the options -O2, the link register is saved:



  stmfd   sp!, {r3, lr}
  ...
  ldmeqfd sp!, {r3, pc}



However, this is unnecessary - the link register does not need to be preserved
across the call, since any call to bar will not return.



Note that I am not filing this as an ARM target bug since the issue appears to

be a general problem related to dataflow analysis not tracking the difference

between calls to normal functions and calls to noreturn functions.  At any

rate, I see a similar problem in our custom target as well.


[Bug target/55701] New: Inline some instances of memset for ARM

2012-12-14 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55701

 Bug #: 55701
   Summary: Inline some instances of memset for ARM
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

memset() is almost never inlined on ARM, even at -O3.  If the target is known

to be 4-byte aligned or greater, it will be inlined for 1, 2, or 4 byte

lengths.  If the target alignment is unknown, it will be inlined only for a

single byte.



I don't see this problem with similar builtins (memcpy, memmove, and memclear

(memset with a target value of zero)) - they all inline small cases.



It probably makes sense for memset to be inlined up to at least 16 bytes or so

in all cases.



When aligned, memcpy and memmove use an ldmia/stmia (load multiple/store
multiple) sequence to create fairly compact inline code.  We could consider
doing the same sort of optimization with memset, using stmia only.
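
As a hypothetical example of the kind of call this is about (not taken from the
original report): with a 4-byte-aligned destination and a small constant
length, an stmia-style inline expansion would be reasonable instead of a
library call:

  #include <string.h>

  void clear16 (unsigned int *buf)
  {
    memset (buf, 0, 16);   /* currently emitted as a call to memset */
  }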


[Bug c/55681] New: Qualifiers on asm statements are order-dependent

2012-12-13 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55681

 Bug #: 55681
   Summary: Qualifiers on asm statements are order-dependent
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

The syntax that is accepted for asm statement qualifiers is:



  asm {volatile | const | restrict} {goto}



(this can easily be seen by looking at the code in c_parser_asm_statement).



This means, for example, that gcc isn't particularly orthogonal in what it

chooses to accept and reject:



  asm volatile ("nop");                     // accepted
  asm const ("nop");                        // accepted with warning
  asm __restrict ("nop");                   // accepted with warning
  asm const volatile ("nop");               // parse error
  asm const __restrict ("nop");             // parse error
  asm volatile goto ("nop" : : : : label);  // accepted
  asm goto volatile ("nop" : : : : label);  // parse error



This is probably rarely a problem, since most of the statements that would

result in an error are not likely to be seen (I came across this when adding a

new qualifier for our local port, which exacerbated the problem), but I thought

I would mention it anyway -- the fix is relatively straightforward since the

qualifiers are independent.


[Bug middle-end/55653] New: Unnecessary initialization of vector register

2012-12-11 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55653

 Bug #: 55653
   Summary: Unnecessary initialization of vector register
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

When initializing all lanes of a vector register, I notice that the register is

first initialized to zero and then all lanes of the vector are independently

initialized, resulting in extra code.



Specifically, I'm looking at the aarch64 target, with the following source:



void
fmla_loop (double * restrict result, double * restrict mul1,
           double mul2, int size)
{
  int i;

  for (i = 0; i < size; i++)
    result[i] = result[i] + mul1[i] * mul2;
}



Compiled with:



aarch64-linux-gnu-gcc -std=c99 -O3 -ftree-vectorize -S -o test.s test.c



The resultant code to initialize a vector register with two instances of mul2

is:



  adr x3, .LC0
  ld1 {v3.2d}, [x3]
  ins v3.d[0], v0.d[0]
  ins v3.d[1], v0.d[0]
  ...
.LC0:
  .word   0
  .word   0
  .word   0
  .word   0



Where the first two instructions (that initialize the vector register) are

unnecessary, as is the space for .LC0.



Note that this initialization is being performed here in store_constructor:



    /* Inform later passes that the old value is dead.  */
    if (!cleared && !vector && REG_P (target))
      emit_move_insn (target, CONST0_RTX (GET_MODE (target)));



right after another check to see if the vector needs to be cleared out (and
determining that it doesn't).



Instead of the emit_move_insn, that code used to be:



   emit_insn (gen_rtx_CLOBBER (VOIDmode, target));



But was changed in r101169, with the comment:



  The expr.c change elides an extra move that's creeped in since we

changed clobbered values to get new registers in reload.



(see full checkin text here:

http://gcc.gnu.org/ml/gcc-patches/2005-06/msg01584.html)



It's not clear to me whether this can be changed back, or if later passes

should be recognizing this initialization as redundant, or whether we need a

new expand pattern to match vector fill (vector duplicate).  At any rate, the

code is certainly not ideal as it stands.



Thanks!


[Bug tree-optimization/55213] vectorizer ignores __restrict__

2012-11-29 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55213



--- Comment #4 from Joshua Conner josh.m.conner at gmail dot com 2012-11-29 
22:17:50 UTC ---

I'm also seeing this same issue in libgfortran's matmul_r8.c, where the inner

loop has an aliasing check even though all of the pointer dereferences are via

restricted pointers.  Again, the problem is worse because the aliasing

versioning prevents us from doing vector alignment peeling.


[Bug tree-optimization/55213] vectorizer ignores __restrict__

2012-11-20 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55213



Joshua Conner josh.m.conner at gmail dot com changed:



   What|Removed |Added



 CC||josh.m.conner at gmail dot

   ||com



--- Comment #3 from Joshua Conner josh.m.conner at gmail dot com 2012-11-20 
18:05:26 UTC ---

I'm running into a similar problem in code like this:



void
inner (float * restrict x, float * restrict y, int n)
{
  int i;

  for (i = 0; i < n; i++)
    x[i] *= y[i];
}



void
outer (float *arr, int offset, int bytes)
{
  inner (&arr[0], &arr[offset], bytes);
}



In the out-of-line instance of inner(), no alias detection code is generated

(correctly, since the pointers are restricted).



When inner() is inlined into outer(), however, alias detection code is

unnecessarily generated.  This alone isn't a terrible penalty except that the

generation of a versioned loop to handle aliasing prevents us from performing

loop peeling for alignment, and so we end up with a vectorized unaligned loop

with poor performance.



Note that the place where I'm actually running into the problem is in fortran,

where pointer arguments are implicitly non-aliasing.


[Bug tree-optimization/55216] New: Infinite loop generated on non-infinite code

2012-11-05 Thread josh.m.conner at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55216

 Bug #: 55216
   Summary: Infinite loop generated on non-infinite code
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: josh.m.con...@gmail.com

Attempting to compile this code:





  int d[16];

  int
  SATD (void)
  {
    int k, satd = 0, dd;

    for (dd=d[k=0]; k<16; dd=d[++k])
    {
      satd += (dd < 0 ? -dd : dd);
    }

    return satd;
  }



with -O2 generates an infinite loop:





  .L2:
    b   .L2



I am using trunk gcc (sync'd to r193173) configured with:



  --target=arm-linux-gnueabi --with-cpu=cortex-a15 --with-gnu-as --with-gnu-ld
  --enable-__cxa_atexit --disable-libssp --disable-libmudflap
  --enable-languages=c,c++,fortran --disable-nls



I am pretty sure this is a tree optimization issue and not a target issue,
because I see the transformation from a valid loop into an invalid loop during
vrp1.



Specifically, when visiting this PHI node for the last time:



  Visiting PHI node: k_1 = PHI <0(2), k_8(4)>

  Argument #0 (2 -> 3 executable)
  0
  Value: [0, 0]

  Argument #1 (4 -> 3 executable)
  k_8
  Value: [1, 15]



vrp_visit_phi_node determines that the range for k_1 is:



  k_1: [0,14]



If I'm understanding this correctly, the union of these ranges should give us

[0,15] instead (and would, except that adjust_range_with_scev() overrides it). 

This invalid range leads to the belief that the loop exit condition can never

be met.


[Bug lto/48508] ICE in output_die, at dwarf2out.c:11409

2011-11-06 Thread josh.m.conner at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48508

Joshua Conner josh.m.conner at gmail dot com changed:

   What|Removed |Added

 CC||josh.m.conner at gmail dot
   ||com

--- Comment #6 from Joshua Conner josh.m.conner at gmail dot com 2011-11-06 
19:01:26 UTC ---
I ran into this bug building SPEC2k for ARM (176.gcc) w/LTO, and have done some
investigation.

In the provided test case, during inlining we generate an abstract function die
for js_InternNonIntElementIdSlow (and the inlined instance with an
abstract_origin referring to the abstract function die).

Later, when we are generating the debug information for the non-slow version of
the function, js_InternNonIntElementId, we process the declaration that appears
inside that function:

  extern bool js_InternNonIntElementIdSlow (JSContext *, JSObject *,
                                            const js::Value &, long int *,
                                            js::Value *);

We attempt to generate a die for this, and in doing so when looking up the decl
using lookup_decl_die, we are returned the abstract instance of the ...Slow
function.  We then attempt to re-define this die by clearing out the parameters
from the old instance and re-using it (see the code that follows this comment in
gen_subprogram_die):

  /* If the definition comes from the same place as the declaration,
 maybe use the old DIE.  We always want the DIE for this function
 that has the *_pc attributes to be under comp_unit_die so the
 debugger can find it.  We also need to do this for abstract
 instances of inlines, since the spec requires the out-of-line copy
 to have the same parent.  For local class methods, this doesn't
 apply; we just use the old DIE.  */

Once we clear out the parameters, then the abstract_origin parameters in our
original inlined instance now point to unreachable/unallocated dies, triggering
the assertion failure.

It's not clear to me what the fix is, so I could use some insight into what
cases this code is supposed to handle.  From reading the comments and code, it
appears that we're trying to catch a case where we have a declaration followed
by a definition?  So, it's possible that we should recognize that we don't have
a definition here, just a declaration.  Alternatively (or in addition), should
we recognize that we are dealing with an abstract declaration and not try to
re-use it, since doing so will break any references to it that have almost
certainly already been generated?