We are getting a very bizarre error on a binary compiled with gcc-2.95.3.
This program has a main event loop which calls select(), and sometimes it
comes back from the select() and then the following subroutine call
branches into neverland.
Here is a piece of the code, disassembled from the on-disk binary image:
00450c74 <event_loop>:
450c74: 90 6f f0 18 stm %r6,%r15,24(%r15)
450c78: a7 d5 00 1c bras %r13,450cb0 <event_loop+0x3c>
450c7c: 00 44 d4 d8 .long 0x0044d4d8
450c80: 00 44 d9 98 .long 0x0044d998
450c84: 00 44 da a8 .long 0x0044daa8
450c88: 00 44 db 68 .long 0x0044db68
450c8c: 00 45 22 84 .long 0x00452284
450c90: 00 0f 42 40 .long 0x000f4240
450c94: 00 40 91 c0 .long 0x004091c0
450c98: 00 40 89 60 .long 0x00408960
450c9c: 00 44 dc c4 .long 0x0044dcc4
450ca0: 00 44 df dc .long 0x0044dfdc
450ca4: 00 44 e2 14 .long 0x0044e214
450ca8: 00 44 e1 54 .long 0x0044e154
450cac: 00 45 23 50 .long 0x00452350
450cb0: 18 1f lr %r1,%r15
Basically, we enter the function, save some registers, then jump around
this object which I'm calling the "constants table". This table consists
of:
entries 1-4: function entry points with static linkage within the module
entry 5: function with external linkage from another module
entry 6: a scalar constant, 1000000, used within the function
entry 7: a double-indirection to select() in glibc with external linkage
entry 8: as above, but for usleep()
entries 9-12: more function entry points with static linkage
entry 13: function with external linkage from another module
When the program loads, the dynamic linker rewrites these addresses for the
virtual address space under which the binary is executing. The table
becomes:
44 8e f3 68
44 8e f8 28
44 8e f9 38
44 8e f9 f8
44 8f 41 14
00 0f 42 40
40 ?? 7d 48
40 2c 78 f0
44 8e fb 54
44 8e fe 6c
44 8f 00 a4
44 8e ff e4
44 8f 41 e0
Note that one byte is unknown; gdb masked it away when it tried to
disassemble the constants as valid machine code. Anyway, this is the way
the code appears during execution. I can CTRL-C it several times in the
debugger, and it always looks like this.
As the program runs, and passes through this event loop hundreds or
thousands of times, there comes a time when something strange happens.
Suddenly, gdb reports a segmentation violation because we tried to branch
to address 0x00021b54. The branch fails because our binary is mapped in at
addresses starting around 0x448ce000, and nothing is mapped at 0x00021b54.
Looking at the constants table, it has been changed:
00 02 13 68
00 02 18 28
00 02 19 38
00 02 19 f8
00 00 00 00
00 0f 42 40
00 00 00 00
00 00 00 00
00 02 1b 54
00 02 1e 6c
00 02 20 a4
00 02 1f e4
00 00 00 00
Looking at the backtrace, it fails the first time we try to branch from the
constants table following a select(). It appears that, very rarely, within
select(), something with a detailed knowledge of the binary format decides
to rewrite the jump addresses, but not the scalar constant 0x000f4240.
Pointers to functions with static linkage within the same module are set to
different values (but with the same spacings as in the original code).
Pointers to functions with external linkage, whether in the binary or in
external libraries, are zeroed out. The value 0x00021b54 now occupies the
position in the constants table which used to be held by the address of the
next function to be called after select(). The memory on either side of
the constants table is not touched.
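
The pattern described above can be checked mechanically. The following is a rough sketch in C that copies the three tables from this post (on-disk, loaded, and corrupted) and tests two things: that every pointer into our own module was relocated by the same delta, and that each corrupted static pointer equals its loaded value minus 0x448ce000. Treating 0x448ce000 as the module's load base is an assumption taken from the mapping address quoted earlier. The names (kinds, module_delta, corruption_matches_pattern, UNKNOWN) are made up for illustration, and the word whose byte gdb hid is stored as a placeholder and never compared.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_LEN 13
#define UNKNOWN   0xffffffffu  /* placeholder: gdb hid one byte of this word */
#define LOAD_BASE 0x448ce000u  /* assumed load base, from the mapping address */

/* How each word of the constants table is used, per the description above. */
enum kind { LOCAL_FN, EXTERN_FN, SCALAR, LIBC_FN };

static const enum kind kinds[TABLE_LEN] = {
    LOCAL_FN, LOCAL_FN, LOCAL_FN, LOCAL_FN, EXTERN_FN, SCALAR, LIBC_FN,
    LIBC_FN,  LOCAL_FN, LOCAL_FN, LOCAL_FN, LOCAL_FN,  EXTERN_FN
};

static const uint32_t on_disk[TABLE_LEN] = {
    0x0044d4d8, 0x0044d998, 0x0044daa8, 0x0044db68, 0x00452284, 0x000f4240,
    0x004091c0, 0x00408960, 0x0044dcc4, 0x0044dfdc, 0x0044e214, 0x0044e154,
    0x00452350
};

static const uint32_t loaded[TABLE_LEN] = {
    0x448ef368, 0x448ef828, 0x448ef938, 0x448ef9f8, 0x448f4114, 0x000f4240,
    UNKNOWN,    0x402c78f0, 0x448efb54, 0x448efe6c, 0x448f00a4, 0x448effe4,
    0x448f41e0
};

static const uint32_t corrupted[TABLE_LEN] = {
    0x00021368, 0x00021828, 0x00021938, 0x000219f8, 0x00000000, 0x000f4240,
    0x00000000, 0x00000000, 0x00021b54, 0x00021e6c, 0x000220a4, 0x00021fe4,
    0x00000000
};

/* Relocation delta shared by every in-module pointer, or 0 if they disagree. */
uint32_t module_delta(void)
{
    uint32_t delta = loaded[0] - on_disk[0];
    for (size_t i = 0; i < TABLE_LEN; i++) {
        if (kinds[i] != LOCAL_FN && kinds[i] != EXTERN_FN)
            continue;               /* skip the scalar and the glibc entries */
        if (loaded[i] - on_disk[i] != delta)
            return 0;
    }
    return delta;
}

/* Returns 1 if the corrupted table fits the observed pattern: static
 * pointers equal (loaded - LOAD_BASE), the scalar is unchanged, and every
 * externally linked pointer is zero. Returns 0 otherwise. */
int corruption_matches_pattern(void)
{
    for (size_t i = 0; i < TABLE_LEN; i++) {
        uint32_t expect;
        switch (kinds[i]) {
        case LOCAL_FN: expect = loaded[i] - LOAD_BASE; break;
        case SCALAR:   expect = loaded[i];             break;
        default:       expect = 0;                     break;
        }
        if (corrupted[i] != expect)
            return 0;
    }
    return 1;
}
```

On the values quoted here, both checks pass: the in-module pointers all share the delta 0x444a1e90, and the corrupted static pointers are exactly the loaded addresses with 0x448ce000 subtracted. That only restates the symptom in numbers; it does not say who wrote those values.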
This code is single-threaded, and not particularly magical. It is running
as a fork()-ed child of another process, though.
Note that all of the kernels we have tested have a bug in mprotect(): it
returns success while failing to set any protection on the page(s). At
run-time, these pages are being mprotect()-ed:
mprotect(0x448ce000, 172032, PROT_READ|PROT_WRITE) = 0
mprotect(0x448ce000, 172032, PROT_READ|PROT_EXEC) = 0
The first mprotect() is to allow the jump tables to be rewritten by the
run-time linker. The second is to protect the text segment so that nobody
overwrites the pointer tables. Well, the pages are not protected, and
somebody is overwriting the page under some very rare set of circumstances
under select(), but I can't see who is doing it. Using hardware
watchpoints under gdb did not reveal the offending instruction; the
watchpoint never triggered. This is a bit of a mystery, as hardware
watchpoints do work under gdb on 390, but watching a particular address
within the jump table did not result in any trap when the value changed.
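
For anyone who wants to check whether their kernel shows the mprotect() behaviour described above, here is a minimal sketch in C. It maps a fresh anonymous page, marks it PROT_READ, and then tries to write to it with a fault handler installed. The function names (write_protection_enforced, on_fault) are hypothetical. It tests a scratch page rather than the text segment of the affected binary, so it only shows whether the kernel enforces a read-only protection at all.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;

/* Fault handler: jump back to the sigsetjmp() point in the test below. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);
}

/* Returns 1 if a write to a PROT_READ page faults (protection enforced),
 * 0 if the write goes through (protection NOT enforced),
 * -1 if the test could not be set up. */
int write_protection_enforced(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;

    void *mem = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    volatile unsigned char *page = mem;
    page[0] = 0x5a;                 /* touch the page while it is writable */

    if (mprotect(mem, (size_t)pagesz, PROT_READ) != 0) {
        munmap(mem, (size_t)pagesz);
        return -1;
    }

    struct sigaction sa, old_segv, old_bus;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    int enforced;
    if (sigsetjmp(fault_jmp, 1) == 0) {
        page[0] = 0xa5;             /* should fault if protection works */
        enforced = 0;               /* reached only if the write succeeded */
    } else {
        enforced = 1;               /* we arrived here via the fault handler */
    }

    /* Restore the previous handlers and release the scratch page. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(mem, (size_t)pagesz);
    return enforced;
}
```

Call write_protection_enforced() from a small main() and print the result. A return value of 0 would match the behaviour reported above, where mprotect() returns success but the page stays writable.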
So, has anybody seen anything like this before? It looks like a possible
ld.so bug, though I suppose it is possible it is a kernel bug involving
bringing back pages which have been paged out (this might explain why it is
so dreadfully sporadic).
--
Jason McMullan, Senior Linux Consultant
Linuxcare, Inc. 412.432.6457 tel, 412.656.3519 cell
[EMAIL PROTECTED], http://www.linuxcare.com/
Linuxcare. Putting open source to work.