[Bug target/62275] ARM should use vcvta instructions when possible for float -> int rounding
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62275 --- Comment #5 from Joshua Conner josh.m.conner at gmail dot com --- Thanks!
[Bug target/62275] New: ARM should use vcvta instructions when possible for float -> int rounding
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62275 Bug ID: 62275 Summary: ARM should use vcvta instructions when possible for float -> int rounding Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: josh.m.conner at gmail dot com Instead of generating a library call for lround/lroundf, the ARM backend should use the vcvta.s32.f64 and vcvta.s32.f32 instructions (as long as -fno-math-errno has been given, since these obviously won't set errno).
[Bug middle-end/56924] Folding of checks into a range check should check upper boundary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56924 --- Comment #3 from Joshua Conner josh.m.conner at gmail dot com --- It appears that gcc takes a different approach now, which has its own advantages and disadvantages. Specifically, when I compile the same example I now see an initial tree of:

  if ((SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 224 || (SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 240)
    {
      bar ();
    }

which indeed generates much better assembly code (for ARM):

  and   r0, r0, #224
  cmp   r0, #224
  beq   .L4

But with a slight modification of the original code to:

  if ((input.val == 0xd) || (input.val == 0xe) || (input.val == 0xf))
    bar();

the tree looks like:

  if (((SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 208 || (SAVE_EXPR <BIT_FIELD_REF <input, 8, 0> & 240>) == 224) || (BIT_FIELD_REF <input, 8, 0> & 240) == 240)

and the generated assembly is:

  uxtb  r0, r0
  and   r3, r0, #240
  and   r0, r0, #208
  cmp   r0, #208
  cmpne r3, #224
  beq   .L4

which could be much better as:

  ubfx  r0, r0, #4, #4
  cmp   r0, #12
  bhi   .L4
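The suggested ubfx/cmp/bhi sequence relies on the identity below; a small sketch (function names hypothetical) comparing the written-out test against the single unsigned compare it can fold to:

```c
/* The source-level test over the 4-bit field val ... */
static int in_set_by_cases (unsigned val)
{
  return val == 0xd || val == 0xe || val == 0xf;
}

/* ... is equivalent, for val in 0..15, to one unsigned compare:
   this is what ubfx r0, r0, #4, #4 / cmp r0, #12 / bhi computes. */
static int in_set_by_range (unsigned val)
{
  return val > 12;
}
```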
[Bug target/56315] ARM: Improve use of 64-bit constants in logical operations
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56315 --- Comment #4 from Joshua Conner josh.m.conner at gmail dot com --- Excellent - thanks!
[Bug rtl-optimization/57462] ira-costs considers only a single register at a time
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57462 --- Comment #2 from Joshua Conner josh.m.conner at gmail dot com --- No problem - I appreciate you taking the time to respond. This has a noticeable impact on codegen for ARM because of the overlap in CPU/FPU functionality and the cost of transferring data between integer and FP registers, so I thought it was worth mentioning in case it hadn't been recognized already. Thanks.
[Bug rtl-optimization/57462] New: ira-costs considers only a single register at a time
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57462 Bug ID: 57462 Summary: ira-costs considers only a single register at a time Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: josh.m.conner at gmail dot com In this code:

  int PopCnt (unsigned long long a, unsigned long long b)
  {
    register int c = 0;
    while (a)
      {
        c++;
        a &= a + b;
      }
    return c;
  }

built for ARM with:

  gcc test.c -O2 -S -o test.s

the code generated for the loop is:

  .L3:
    fmdrr    d18, r0, r1  @ int
    vadd.i64 d16, d18, d17
    fmrrd    r4, r5, d16  @ int
    and      r0, r0, r4
    and      r1, r1, r5
    orrs     r5, r0, r1
    add      r3, r3, #1
    bne      .L3

There is quite a bit of gymnastics here in order to use the FP registers for the add instruction. The code is simpler if all registers are allocated to integer registers:

  .L3:
    adds     r2, r4, r6
    adc      r3, r5, r7
    and      r4, r4, r2
    and      r5, r5, r3
    orrs     r3, r4, r5
    add      r0, r0, #1
    bne      .L3

This version is shorter, and doesn't include the potentially-expensive FP<->INT register move operations.

*** The rest of this bug is my analysis, explaining why I have filed it under the rtl-optimization category. The problem I see is that the register classifier (ira-costs.c) makes decisions on register classes for each register in relative isolation, without adequately considering the impact of that decision on other registers. In this example, there are three main registers we're concerned with: a, b, and a temporary register (ignoring c, which we don't need to consider). The code when costs are calculated is roughly:

  tmp = a + b
  a = a & tmp
  CC = compare (a, 0)

Both the adddi3 and anddi3 operations can be performed in either integer or FP regs, with a preference for the FP regs because the sequence is shorter (1 insn instead of 2). The compare operation can only be performed in an integer register.
In the first pass of the cost analysis: a is assigned to the integer registers, since the cheaper adddi/anddi operations are outweighed by the cost of having to move the value FP->INT for the compare; b and tmp are both assigned to FP registers, since they are only involved in operations that are cheaper in the FP hardware. In the second pass of the cost analysis, each register is again analyzed independently: a is left in an integer register because moving it to an FP register would add an additional FP->INT move for the compare; b and tmp are both left in FP registers because moving either one alone would still leave us with mixed FP/INT operations. The biggest problem I see is that the first pass should recognize that since a must be in an integer register, there is an unconsidered cost to putting b and tmp in FP registers, since they are involved in instructions whose operands must all be in the same register class. A lesser, and probably more difficult, problem is that the second pass could do better if it considered changing the register classes of more than one register at a time. This seems potentially complex, but perhaps we could just consider register pairs that are involved in instructions with mismatched operand classes, where the combination is invalid for the instruction.
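As an aside, with b == ~0ULL the PopCnt loop above is the classic Kernighan population count: a + ~0ULL is a - 1, and a &= a - 1 clears the lowest set bit. A standalone sketch of that special case:

```c
/* Kernighan bit count: one loop iteration per set bit.  This is what
   the PopCnt loop above computes when b == ~0ULL (so a &= a + b is
   a &= a - 1). */
static int popcnt_ref (unsigned long long a)
{
  int c = 0;
  while (a)
    {
      a &= a - 1;   /* clear the lowest set bit */
      c++;
    }
  return c;
}
```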
[Bug rtl-optimization/57231] Hoist zero-extend operations when possible
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57231 --- Comment #3 from Joshua Conner josh.m.conner at gmail dot com --- Exactly - there's no need to truncate every iteration, we should be able to safely do it when the loop is complete.
[Bug rtl-optimization/57231] New: Hoist zero-extend operations when possible
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57231 Bug ID: 57231 Summary: Hoist zero-extend operations when possible Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: josh.m.conner at gmail dot com Compiling this code at -O2:

  unsigned char *value;

  unsigned short foobar (int iters)
  {
    unsigned short total;
    unsigned int i;

    for (i = 0; i < iters; i++)
      total += value[i];
    return total;
  }

on ARM generates a zero-extend of total for every iteration of the loop:

  .L3:
    ldrb  r1, [ip, r3]  @ zero_extendqisi2
    add   r3, r3, #1
    cmp   r3, r0
    add   r2, r2, r1
    uxth  r2, r2
    bne   .L3

I believe we should be able to hoist the zero-extend (uxth) to after the loop. Note that although I observed this on ARM, I believe it's a general case that would have to be handled by the rtl optimizers. This shows up in a hot loop of bzip2:

  for (i = gs; i <= ge; i++) {
    UInt16 icv = szptr[i];
    cost0 += len[0][icv]; cost1 += len[1][icv];
    cost2 += len[2][icv]; cost3 += len[3][icv];
    cost4 += len[4][icv]; cost5 += len[5][icv];
  }
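The requested transformation, sketched by hand (function names illustrative; the pointer is taken as a parameter here for self-containment): the two forms always agree, because truncating the sum modulo 2^16 on every iteration yields the same final value as truncating once at the end.

```c
/* As compiled today: total is 16 bits, so each addition is followed
   by a truncation (the uxth inside the loop). */
unsigned short sum_truncate_each (const unsigned char *p, int n)
{
  unsigned short total = 0;
  for (int i = 0; i < n; i++)
    total += p[i];
  return total;
}

/* Hoisted form: accumulate in a full-width register and truncate
   once after the loop (a single uxth). */
unsigned short sum_truncate_once (const unsigned char *p, int n)
{
  unsigned int total = 0;
  for (int i = 0; i < n; i++)
    total += p[i];
  return (unsigned short) total;
}
```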
[Bug c/56924] New: Folding of checks into a range check should check upper boundary
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56924 Bug #: 56924 Summary: Folding of checks into a range check should check upper boundary Classification: Unclassified Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com When we are folding equality checks into a range check, if the values are at the top end of the range we should just use a single test instead of normalizing them into the bottom of the range and then testing. For example, consider:

  struct stype {
    unsigned int pad:4;
    unsigned int val:4;
  };

  void bar (void);

  void foo (struct stype input)
  {
    if ((input.val == 0xe) || (input.val == 0xf))
      bar();
  }

When compiled at -O2, the original tree generated is:

  ;; Function foo (null)
  ;; enabled by -tree-original
  {
    if (input.val + 2 <= 1)
      {
        bar ();
      }
  }

This is likely to be more efficient if we instead generate:

  if (input.val >= 0xe)
    {
      bar ();
    }

This can be seen in the inefficient codegen for an ARM cortex-a15:

  ubfx  r0, r0, #4, #4
  add   r3, r0, #2
  and   r3, r3, #15
  cmp   r3, #1

(the add and the and are not necessary if we change the test condition). I was able to improve this by adding detection of this case into build_range_check.
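A quick sketch (hypothetical names) checking that the normalized form gcc emits and the proposed top-of-range compare agree for 4-bit values:

```c
/* What the folded tree computes today: shift the range down to zero,
   then test against its width (add + and + cmp on ARM).  The & 15
   models the 4-bit bitfield wraparound. */
static int range_normalized (unsigned val)
{
  return ((val + 2) & 15) <= 1;
}

/* Proposed: values at the top of the range need only one compare. */
static int range_direct (unsigned val)
{
  return val >= 0xe;
}
```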
[Bug tree-optimization/56925] New: SRA should take into account likelihood of statements being executed
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56925 Bug #: 56925 Summary: SRA should take into account likelihood of statements being executed Classification: Unclassified Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: tree-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com In the following code:

  struct stype {
    unsigned int pad:4;
    unsigned int val:4;
  };

  void bar (void);
  void baz (void);

  int x, y;

  unsigned int foo (struct stype input)
  {
    if (__builtin_expect (x, 0))
      return input.val;
    if (__builtin_expect (y, 0))
      return input.val + 1;
    return 0;
  }

when compiled with -O2, SRA moves the read of input.val to the top of the function:

  ;; Function foo (foo, funcdef_no=0, decl_uid=4988, cgraph_uid=0)
  Candidate (4987): input
  Rejected (4999): not aggregate: y.1
  Rejected (4993): not aggregate: x.0
  Created a replacement for input offset: 4, size: 4: input$val
  ...
  <bb 2>:
    input$val_14 = input.val;
    x.0_3 = x;
    _4 = __builtin_expect (x.0_3, 0);
    if (_4 != 0)
      goto <bb 3>;
    else
      goto <bb 4>;
  ...

This means that the critical path for this function now executes an extra instruction. It would be nice if SRA took into account the likelihood of statement execution when deciding whether to apply the transformation. We currently verify that there are at least two reads - perhaps we should check that there are at least two reads that are likely to occur. This can be seen in sub-optimal codegen for ARM, where a bitfield extract (ubfx) is moved out of unlikely code into the critical path:

  foo:
    movw  r3, #:lower16:x
    ubfx  r2, r0, #4, #4
    movt  r3, #:upper16:x
    ldr   r3, [r3]
    cmp   r3, #0
    bne   .L6
    ...
[Bug tree-optimization/56352] New: Simplify testing of related conditions in for loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56352 Bug #: 56352 Summary: Simplify testing of related conditions in for loop Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com If we have a loop like this:

  for (i = 0; i < a && i < b; i++) {
    /* Code which cannot affect i, a, or b */
  }

gcc should be able to optimize this into:

  tmp = MIN(a, b);
  for (i = 0; i < tmp; i++) {
    /* Body */
  }

but it does not. Similarly, code like:

  for (i = 0; i < a; i++) {
    if (i >= b)
      break;
    /* Code which cannot affect i, a, or b */
  }

should be similarly optimized.
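The requested rewrite, spelled out as a sketch (the MIN macro and the counting body are illustrative):

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* As written: the loop tests two conditions every iteration. */
static int iterations_naive (int a, int b)
{
  int n = 0;
  for (int i = 0; i < a && i < b; i++)
    n++;                      /* body that cannot affect i, a, or b */
  return n;
}

/* Desired form: hoist the combined bound out of the loop, leaving a
   single compare per iteration. */
static int iterations_hoisted (int a, int b)
{
  int tmp = MIN (a, b);
  int n = 0;
  for (int i = 0; i < tmp; i++)
    n++;
  return n;
}
```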
[Bug target/56313] New: aarch64 backend not using fmls instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56313 Bug #: 56313 Summary: aarch64 backend not using fmls instruction Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com When this code is compiled with -O2 -ffast-math -S for an aarch64-linux-gnu target:

  float v1 __attribute__((vector_size(8)));
  float v2 __attribute__((vector_size(8)));
  float result __attribute__((vector_size(8)));

  void foo (void)
  {
    result = result + (-v1 * v2);
  }

the following is generated:

  ld1   {v0.2s}, [x0]
  fneg  v2.2s, v2.2s
  ld1   {v1.2s}, [x1]
  fmla  v0.2s, v2.2s, v1.2s
  st1   {v0.2s}, [x0]

This code could be improved to:

  ld1   {v0.2s}, [x0]
  ld1   {v1.2s}, [x1]
  fmls  v0.2s, v2.2s, v1.2s
  st1   {v0.2s}, [x0]
[Bug target/56313] aarch64 backend not using fmls instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56313 --- Comment #1 from Joshua Conner josh.m.conner at gmail dot com 2013-02-14 01:39:55 UTC --- In case it helps, the pattern for aarch64_vmls<mode> is written as:

  (set (op0) (minus (op1) (mult (op2) (op3))))

Restructuring this to:

  (set (op0) (fma (neg (op1)) (op2) (op3)))

allows the combiner to take advantage of the pattern.
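The source-level identity behind this rewrite can be checked directly: (-a) * b is the exact negation of a * b in IEEE-754, so adding the negated product equals subtracting the product. That is what lets the minus/mult form be re-expressed as an fma of a negated operand (function names below are illustrative):

```c
/* result + (-v1 * v2), as in the testcase ... */
static float mls_as_written (float acc, float a, float b)
{
  return acc + (-a * b);
}

/* ... equals result - v1 * v2, the operation fmls performs.  (The
   fused instruction additionally skips the intermediate rounding of
   the product, which is why the testcase uses -ffast-math.) */
static float mls_rewritten (float acc, float a, float b)
{
  return acc - a * b;
}
```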
[Bug target/56315] New: ARM: Improve use of 64-bit constants in logical operations
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56315 Bug #: 56315 Summary: ARM: Improve use of 64-bit constants in logical operations Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com In the ARM backend, support was added for recognizing addition with 64-bit constants that can be split into two 32-bit literals, handled as immediates in the adds/adc operations. However, this support is still not present for the logical operations. For example, compiling this code with -O2:

  unsigned long long or64 (unsigned long long input)
  {
    return input | 0x200000004ULL;
  }

gives us:

  mov  r2, #4
  mov  r3, #2
  orr  r0, r0, r2
  orr  r1, r1, r3

when it could produce:

  orr  r0, r0, #4
  orr  r1, r1, #2

The same improvement could be applied to & and ^ operations as well.
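The split can be sketched in C (assuming the constant is 0x200000004ULL, which matches the #4 and #2 immediates in the assembly shown): operate on the two 32-bit halves independently, each with an immediate ARM can encode directly.

```c
/* or64 performed on 32-bit halves: each half needs only a single
   orr-with-immediate.  Constant assumed to be 0x200000004ULL, to
   match the #4 (low word) and #2 (high word) immediates above. */
unsigned long long or64_split (unsigned long long input)
{
  unsigned int lo = (unsigned int) input         | 0x4u; /* orr r0, r0, #4 */
  unsigned int hi = (unsigned int) (input >> 32) | 0x2u; /* orr r1, r1, #2 */
  return ((unsigned long long) hi << 32) | lo;
}
```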
[Bug tree-optimization/56094] New: Invalid line number info generated with tree-level ivopts
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56094 Bug #: 56094 Summary: Invalid line number info generated with tree-level ivopts Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: minor Priority: P3 Component: tree-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com The attached code has a number of instructions that are associated with the head of the function. This showed up when setting a breakpoint on the function itself: gdb was setting several breakpoints - not a problem in general, except that the statements were more appropriately correlated with intra-loop calculations than with anything to do with the prologue. To reproduce, compile the attached file with -g -O2 and notice the statements associated with line 83. These are in fact statements generated during tree-level induction variable optimization which aren't getting their location data copied over from the gimple statement, so they default to the start of the function (sorry for being a bit vague, but it's been a while since I looked into the mechanics of this and I don't recall the details). The fix I have implemented in our local tree is in rewrite_use_nonlinear_expr: after generating the computation (comp), I verify that it has a location associated with it - if it doesn't, but the use statement (use->stmt) does have a location, I copy the location from use->stmt over to comp.
[Bug tree-optimization/56094] Invalid line number info generated with tree-level ivopts
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56094 --- Comment #1 from Joshua Conner josh.m.conner at gmail dot com 2013-01-24 04:03:44 UTC --- Created attachment 29263 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=29263 Reduced test case
[Bug tree-optimization/56094] Invalid line number info generated with tree-level ivopts
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56094 --- Comment #2 from Joshua Conner josh.m.conner at gmail dot com 2013-01-24 04:05:09 UTC --- Sorry, I should have been more specific -- the function I'm describing in the previous comments is test_main.
[Bug rtl-optimization/55747] New: Extra registers are saved in functions that only call noreturn functions
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55747 Bug #: 55747 Summary: Extra registers are saved in functions that only call noreturn functions Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: rtl-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com On architectures such as ARM, where a link register is used to hold the return address, this value does not need to be saved in a function that only calls noreturn functions. For example, if I build the following source:

  __attribute__((noreturn)) extern void bar (void);

  int x;

  void foo (void)
  {
    if (x)
      bar ();
  }

using the options -O2, the link register is saved:

  stmfd   sp!, {r3, lr}
  ...
  ldmeqfd sp!, {r3, pc}

However, this is unnecessary: the link register cannot be corrupted, since any call to bar() will never return. Note that I am not filing this as an ARM target bug, since the issue appears to be a general problem of dataflow analysis not tracking the difference between calls to normal functions and calls to noreturn functions. At any rate, I see a similar problem in our custom target as well.
[Bug target/55701] New: Inline some instances of memset for ARM
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55701 Bug #: 55701 Summary: Inline some instances of memset for ARM Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com memset() is almost never inlined on ARM, even at -O3. If the target is known to be 4-byte aligned or greater, it will be inlined for 1, 2, or 4 byte lengths. If the target alignment is unknown, it will be inlined only for a single byte. I don't see this problem with similar builtins (memcpy, memmove, and memclear (memset with a target value of zero)) - they all inline small cases. It probably makes sense for memset to be inlined up to at least 16 bytes or so in all cases. When aligned, memcpy and memmove use a ldmia/stmia (load multiple/store multiple) sequence to create fairly compact inline code. We could consider doing the same sort of optimization with memset, using stmia only.
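A sketch of what such an inline expansion might look like for one of the suggested cases (names hypothetical): a 16-byte memset on a word-aligned target, lowered to four word stores, the same stmia-friendly shape used for memcpy.

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical inline expansion of memset(p, c, 16) for a
   word-aligned target: splat the byte into a word and emit four
   word stores (a candidate for a single stmia on ARM). */
static void memset16_inline (uint32_t *p, unsigned char c)
{
  uint32_t w = c * 0x01010101u;   /* replicate the byte into all 4 lanes */
  p[0] = w;
  p[1] = w;
  p[2] = w;
  p[3] = w;
}

/* Cross-check against the library memset. */
static int memset16_matches (unsigned char c)
{
  uint32_t a[4], b[4];
  memset16_inline (a, c);
  memset (b, c, sizeof b);
  return memcmp (a, b, sizeof a) == 0;
}
```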
[Bug c/55681] New: Qualifiers on asm statements are order-dependent
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55681 Bug #: 55681 Summary: Qualifiers on asm statements are order-dependent Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com The syntax that is accepted for asm statement qualifiers is:

  asm {volatile | const | restrict} {goto}

(this can easily be seen by looking at the code in c_parser_asm_statement). This means, for example, that gcc isn't particularly orthogonal in what it chooses to accept and reject:

  asm volatile ("nop");                     // accepted
  asm const ("nop");                        // accepted with warning
  asm __restrict ("nop");                   // accepted with warning
  asm const volatile ("nop");               // parse error
  asm const __restrict ("nop");             // parse error
  asm volatile goto ("nop" : : : : label);  // accepted
  asm goto volatile ("nop" : : : : label);  // parse error

This is probably rarely a problem, since most of the statements that would result in an error are unlikely to be seen (I came across this when adding a new qualifier for our local port, which exacerbated the problem), but I thought I would mention it anyway - the fix is relatively straightforward since the qualifiers are independent.
[Bug middle-end/55653] New: Unnecessary initialization of vector register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55653 Bug #: 55653 Summary: Unnecessary initialization of vector register Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: middle-end AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com When initializing all lanes of a vector register, I notice that the register is first initialized to zero and then all lanes of the vector are independently initialized, resulting in extra code. Specifically, I'm looking at the aarch64 target, with the following source:

  void fmla_loop (double * restrict result, double * restrict mul1,
                  double mul2, int size)
  {
    int i;
    for (i = 0; i < size; i++)
      result[i] = result[i] + mul1[i] * mul2;
  }

compiled with:

  aarch64-linux-gnu-gcc -std=c99 -O3 -ftree-vectorize -S -o test.s test.c

The resultant code to initialize a vector register with two instances of mul2 is:

  adr  x3, .LC0
  ld1  {v3.2d}, [x3]
  ins  v3.d[0], v0.d[0]
  ins  v3.d[1], v0.d[0]
  ...
  .LC0:
    .word 0
    .word 0
    .word 0
    .word 0

where the first two instructions (which initialize the vector register) are unnecessary, as is the space for .LC0. Note that this initialization is being performed here in store_constructor:

  /* Inform later passes that the old value is dead.  */
  if (!cleared && !vector && REG_P (target))
    emit_move_insn (target, CONST0_RTX (GET_MODE (target)));

right after another check to see whether the vector needs to be cleared out (which determines that it doesn't). Instead of the emit_move_insn, that code used to be:

  emit_insn (gen_rtx_CLOBBER (VOIDmode, target));

but was changed in r101169, with the comment: "The expr.c change elides an extra move that's creeped in since we changed clobbered values to get new registers in reload."
(see full checkin text here: http://gcc.gnu.org/ml/gcc-patches/2005-06/msg01584.html) It's not clear to me whether this can be changed back, whether later passes should recognize this initialization as redundant, or whether we need a new expand pattern to match a vector fill (vector duplicate). At any rate, the code is certainly not ideal as it stands. Thanks!
[Bug tree-optimization/55213] vectorizer ignores __restrict__
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55213 --- Comment #4 from Joshua Conner josh.m.conner at gmail dot com 2012-11-29 22:17:50 UTC --- I'm also seeing this same issue in libgfortran's matmul_r8.c, where the inner loop has an aliasing check even though all of the pointer dereferences are via restricted pointers. Again, the problem is worse because the aliasing versioning prevents us from doing vector alignment peeling.
[Bug tree-optimization/55213] vectorizer ignores __restrict__
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55213 Joshua Conner josh.m.conner at gmail dot com changed: What|Removed |Added CC||josh.m.conner at gmail dot ||com --- Comment #3 from Joshua Conner josh.m.conner at gmail dot com 2012-11-20 18:05:26 UTC --- I'm running into a similar problem in code like this:

  void inner (float * restrict x, float * restrict y, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      x[i] *= y[i];
  }

  void outer (float *arr, int offset, int bytes)
  {
    inner (&arr[0], &arr[offset], bytes);
  }

In the out-of-line instance of inner(), no alias detection code is generated (correctly, since the pointers are restricted). When inner() is inlined into outer(), however, alias detection code is unnecessarily generated. This alone isn't a terrible penalty, except that the generation of a versioned loop to handle aliasing prevents us from performing loop peeling for alignment, so we end up with a vectorized unaligned loop with poor performance. Note that the place where I'm actually running into the problem is in Fortran, where pointer arguments are implicitly non-aliasing.
[Bug tree-optimization/55216] New: Infinite loop generated on non-infinite code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55216 Bug #: 55216 Summary: Infinite loop generated on non-infinite code Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: josh.m.con...@gmail.com Attempting to compile this code:

  int d[16];

  int SATD (void)
  {
    int k, satd = 0, dd;
    for (dd = d[k = 0]; k < 16; dd = d[++k])
      {
        satd += (dd < 0 ? -dd : dd);
      }
    return satd;
  }

with -O2 generates an infinite loop:

  .L2:
    b  .L2

I am using trunk gcc (sync'd to r193173) configured with: --target=arm-linux-gnueabi --with-cpu=cortex-a15 --with-gnu-as --with-gnu-ld --enable-__cxa_atexit --disable-libssp --disable-libmudflap --enable-languages=c,c++,fortran --disable-nls I am pretty sure this is a tree optimization issue and not a target issue, because I see the transformation from a valid loop into an invalid loop during vrp1. Specifically, when visiting this PHI node for the last time:

  Visiting PHI node: k_1 = PHI <0(2), k_8(4)>
    Argument #0 (2 -> 3 executable)
      0: Value: [0, 0]
    Argument #1 (4 -> 3 executable)
      k_8: Value: [1, 15]

vrp_visit_phi_node determines that the range for k_1 is:

  k_1: [0, 14]

If I'm understanding this correctly, the union of these ranges should give us [0, 15] instead (and would, except that adjust_range_with_scev() overrides it). This invalid range leads to the belief that the loop exit condition can never be met.
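For reference, an equivalent in-bounds form of the loop shows the intended behavior (signature changed to take the array, for self-containment): k takes every value in [0, 15] and the loop plainly terminates, which is why the PHI range union should be [0, 15] rather than [0, 14].

```c
/* Equivalent rewrite of the SATD loop: sums |d[k]| for k in [0, 15]
   and terminates after exactly 16 iterations. */
static int satd_ref (const int d[16])
{
  int s = 0;
  for (int k = 0; k < 16; k++)
    s += d[k] < 0 ? -d[k] : d[k];
  return s;
}
```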
[Bug lto/48508] ICE in output_die, at dwarf2out.c:11409
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48508 Joshua Conner josh.m.conner at gmail dot com changed: What|Removed |Added CC||josh.m.conner at gmail dot ||com --- Comment #6 from Joshua Conner josh.m.conner at gmail dot com 2011-11-06 19:01:26 UTC --- I ran into this bug building SPEC2k for ARM (176.gcc) w/LTO, and have done some investigation. In the provided test case, during inlining we generate an abstract function die for js_InternNonIntElementIdSlow (and the inlined instance with an abstract_origin referring to the abstract function die). Later, when we are generating the debug information for the non-slow version of the function, js_InternNonIntElementId, we process the declaration that appears inside that function:

  extern bool js_InternNonIntElementIdSlow (JSContext *, JSObject *,
                                            const js::Value &, long int *,
                                            js::Value *);

We attempt to generate a die for this, and in doing so, when looking up the decl using lookup_decl_die, we are returned the abstract instance of the ...Slow function. We then attempt to re-define this die by clearing out the parameters from the old instance and re-using it (see the code that follows this comment in gen_subprogram_die):

  /* If the definition comes from the same place as the declaration,
     maybe use the old DIE.  We always want the DIE for this function
     that has the *_pc attributes to be under comp_unit_die so the
     debugger can find it.  We also need to do this for abstract
     instances of inlines, since the spec requires the out-of-line copy
     to have the same parent.  For local class methods, this doesn't
     apply; we just use the old DIE.  */

Once we clear out the parameters, the abstract_origin parameters in our original inlined instance point to unreachable/unallocated dies, triggering the assertion failure. It's not clear to me what the fix is, so I could use some insight into what cases this code is supposed to handle.
From reading the comments and code, it appears that we're trying to catch a case where we have a declaration followed by a definition. So it's possible that we should recognize that we don't have a definition here, just a declaration. Alternatively (or in addition), should we recognize that we are dealing with an abstract declaration and not try to re-use it, since doing so will break any references that have almost certainly already been generated?