[Taking this back to the forum, if you don't mind, since it seems to be relevant.]
On Wed, Nov 27, 2013 at 03:36:17PM -0500, Jerry wrote:
> "H. S. Teoh" <[email protected]> writes:
>
> > I'm interested in the disassembly of the faulty executable. Maybe
> > you could run `objdump -D $program_name` and send me the output? I'm
> > curious to see what got messed up. (With your reduced test case,
> > that is. The disassembly of your actual program would be too
> > unwieldy.)
[...]
> Here's the disassembly. I'm sending out of gnus in emacs. If it
> doesn't come through, let me know and I'll resend from my regular mail
> prog.

It came through, thanks.

I compared the disassembly with my working version of the same code, and
found that the problem appears to be a wrong jump table in the switch
statement. Here's the relevant snippet from near the end of CC.create():

[Jerry's version]:
  417590:  e8 1f 2d 00 00          callq  41a2b4 <_d_switch_string>
  417595:  48 89 c6                mov    %rax,%rsi
  417598:  83 fe 03                cmp    $0x3,%esi
  41759b:  77 18                   ja     4175b5 <_D9switchbug2CC6createFAyaZC9switchbug2CC+0x59>
  41759d:  ff 24 f5 20 01 43 00    jmpq   *0x430120(,%rsi,8)
  4175a4:  48 bf a0 72 43 00 00    movabs $0x4372a0,%rdi
  4175ab:  00 00 00
  4175ae:  e8 55 1b 00 00          callq  419108 <_d_newclass>
  4175b3:  c9                      leaveq
  4175b4:  c3                      retq

In simple terms, what this code does is:

417590: calls _d_switch_string, a function in druntime that does the
string comparisons in a switch over string values, and returns a uint
index of the matching switch case number (uint.max if not found). In
this case, you have 4 switch cases, so they map to indices 0, 1, 2, 3.

41759b: checks the return value of _d_switch_string, and if it's > 3,
then branch to <CC.create + 0x59>, which is where the default case is
implemented (not quoted above, but if you look in your disassembly
you'll see that it creates an Exception then calls the stack unwinding
routine). Since we aren't hitting the default case in our test case, the
control would pass to the next instruction.

41759d: here's the interesting part.
This looks up a jump table at 0x430120 using the index returned by
_d_switch_string, and branches to that address. Looking up this address
in the disassembly dump, I find this:

0000000000430120 <_D9switchbug2BB6__initZ>:
  430120:  40 01 43 00             rex add %eax,0x0(%rbx)
        ...
  430130:  73 77                   jae    4301a9 <_D9switchbug2DD6__vtblZ+0x19>

Now this looks odd. _D9switchbug2BB6__initZ looks like the static
initializer data for class BB; it is certainly NOT a switch statement
jump table!! The first 4 bytes, in fact, correspond to the address
430140, which contains this:

0000000000430140 <_D9switchbug2BB6__vtblZ>:
  430140:  00 72 43                add    %dh,0x43(%rdx)
  430143:  00 00                   add    %al,(%rax)
  430145:  00 00                   add    %al,(%rax)
  430147:  00 90 76 41 00 00       add    %dl,0x4176(%rax)
  43014d:  00 00                   add    %al,(%rax)
  [... snipped ...]

This is the virtual function table of class BB, so it's *definitely* not
a valid jump destination of a switch statement! So this looks like where
the problem came from. If the CPU ends up here, it would try to
interpret function pointers as instructions, and would basically do
random nonsensical things until it hits something that can't be
interpreted as an instruction, or tries to access a random address
that's outside the process address space, upon which the OS kicks in and
sends a SIGSEGV to terminate the program.

...

Now, for comparison, here is the corresponding disassembly from my
working version of your code:

[Teoh's version, in CC.create]:
  417084:  e8 a7 27 00 00          callq  419830 <_d_switch_string>
  417089:  48 89 c6                mov    %rax,%rsi
  41708c:  83 fe 03                cmp    $0x3,%esi
  41708f:  77 18                   ja     4170a9 <_D4test2CC6createFAyaZC4test2CC+0x59>
  417091:  ff 24 f5 60 e9 42 00    jmpq   *0x42e960(,%rsi,8)
  417098:  48 bf 40 45 63 00 00    movabs $0x634540,%rdi
  41709f:  00 00 00
  4170a2:  e8 dd 15 00 00          callq  418684 <_d_newclass>
  4170a7:  c9                      leaveq
  4170a8:  c3                      retq

Other than the different addresses, which are to be expected (different
compile environments, etc.), this code is basically the same as in your
version.
The only difference lies in the contents of the jump table, which in my
version is at 42e960 (as can be seen from the instruction at 417091
above), which contains:

  42e95f:  00 98 70 41 00 00       add    %bl,0x4170(%rax)
  42e965:  00 00                   add    %al,(%rax)
  42e967:  00 98 70 41 00 00       add    %bl,0x4170(%rax)
  42e96d:  00 00                   add    %al,(%rax)
  42e96f:  00 98 70 41 00 00       add    %bl,0x4170(%rax)
  42e975:  00 00                   add    %al,(%rax)
  42e977:  00 98 70 41 00 00       add    %bl,0x4170(%rax)
  42e97d:  00 00                   add    %al,(%rax)
        ...

(Note that since this part of the executable is data rather than code,
the disassembler got a bit confused trying to interpret it as
instructions, so the addresses are 1 byte off. So the jump table
actually starts at the bytes 98 70 41 00 ... .)

Now, *this* looks like a proper jump table. In fact, the aforementioned
bytes represent the address 417098, which, if you look at the
disassembly snippet above, is the very next instruction after the jump
table lookup. This makes sense, since the next thing it does is to call
_d_newclass to create an instance of DD, after which it simply returns
(leaving the address of the new instance of DD in %rax, which is the
register containing the return value as per the x86-64 calling
conventions). This corresponds with what the source code says. So here,
everything is correct and the program works as expected.

...

Now, all this raises the question of why dmd produced the wrong code in
your environment, but produces the *right* code in mine. Since we're
both using the same dmd source code (I believe!), it seems that the most
likely culprit must be the linker -- especially since you mentioned
something about gold vs. ld. My conjecture is that somehow the linker
got mixed up, and wrote the wrong address for the switch's jump table
into your executable. (These addresses are generally not fixed until
link time, because the compiler doesn't know in advance exactly where
each symbol will end up in the final executable.)

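To make the jump-table arithmetic above concrete, here is a minimal
sketch in Python (purely illustrative, not part of the original
analysis). It models what `jmpq *base(,%rsi,8)` does: read the 8-byte
little-endian pointer at base + index*8 and jump there. The helper names
`slot_address` and `decode_slot` are made up for this example. The bytes
are copied from the two dumps above, with one assumption: for the faulty
executable only the low 4 bytes (40 01 43 00) are visible, and the upper
4 bytes of the slot are assumed to be zero.

```python
import struct

def slot_address(table_base: int, index: int) -> int:
    """Address of the table slot read by `jmpq *table_base(,%rsi,8)`."""
    return table_base + index * 8

def decode_slot(raw: bytes) -> int:
    """Decode one 8-byte little-endian table entry into a jump target."""
    (target,) = struct.unpack("<Q", raw)
    return target

# Working executable: four identical entries starting at 0x42e960.
working_table = bytes.fromhex("9870410000000000") * 4
working_targets = [
    decode_slot(working_table[i * 8:(i + 1) * 8]) for i in range(4)
]

# The jmpq at 0x417091 is 7 bytes long (ff 24 f5 60 e9 42 00),
# so the instruction following it sits at 0x417091 + 7.
working_next_insn = 0x417091 + 7

# Faulty executable: slot 0 at 0x430120 holds 40 01 43 00 ...
# ASSUMPTION: the upper 4 bytes of the slot are zero (not shown above).
faulty_slot0 = bytes.fromhex("40014300") + bytes(4)
faulty_target = decode_slot(faulty_slot0)

# The jmpq at 0x41759d is also 7 bytes long, so the case body that
# should have been the target starts at 0x41759d + 7.
faulty_next_insn = 0x41759d + 7

print("working targets:", [hex(t) for t in working_targets])
print("working next insn:", hex(working_next_insn))
print("faulty target:", hex(faulty_target))
print("faulty next insn:", hex(faulty_next_insn))
```

In the working executable every slot decodes to the instruction right
after the jmpq, whereas slot 0 of the faulty executable decodes to an
address in the data area (the vtable of BB) instead of the case body.
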
Further confirmation for this can be found by searching for the byte
sequence a4 75 41 in your disassembly dump (this is the address 4175a4,
the next instruction after the jump table lookup, which is where the
jump destination *should* have been in the first place). This sequence
appears here in your version of the executable:

00000000004301f0 <_TMP3>:
  [... snipped ...]
  4301ff:  00 a4 75 41 00 00 00    add    %ah,0x41(%rbp,%rsi,2)
  430206:  00 00                   add    %al,(%rax)
  430208:  a4                      movsb  %ds:(%rsi),%es:(%rdi)
  430209:  75 41                   jne    43024c <_D9switchbug2CC6__vtblZ+0xc>
  43020b:  00 00                   add    %al,(%rax)
  43020d:  00 00                   add    %al,(%rax)
  43020f:  00 a4 75 41 00 00 00    add    %ah,0x41(%rbp,%rsi,2)
  430216:  00 00                   add    %al,(%rax)
  430218:  a4                      movsb  %ds:(%rsi),%es:(%rdi)
  430219:  75 41                   jne    43025c <_D9switchbug2CC6__vtblZ+0x1c>
  43021b:  00 00                   add    %al,(%rax)
  43021d:  00 00                   add    %al,(%rax)
        ...

Ignoring the instructions on the right (they are basically objdump
getting confused by data that aren't intended to be instructions), we
see that the sequence a4 75 41 00 00 00 00 00 appears exactly 4 times,
consecutively. This corresponds with the 4 cases of the switch
statement. So, this must be where the real jump table is, at 430200, NOT
at 430120 (which is 224 (0xe0) bytes too early).

It seems unlikely that the compiler would screw things up *this* badly
(esp. since it didn't screw up in my environment!), so this seems to
reinforce my conclusion that the fault lies with the linker.

Well, I hope this helps. :)


[...]
> p.s. How do you prefer to be addressed?

I go by my last name.


T

-- 
Almost all proofs have bugs, but almost all theorems are true. -- Paul Pedersen
