hi -

it turns out that parlib's ctors, uthread_lib_init() and
vcore_lib_init(), can be run multiple times, despite the
init_once_racy() thing.

any shared objects (.so but not .a, such as libelf.so) we make also
have their own copy of those ctors.  (surprise!)  when linked with a
binary that also has those ctors, both copies of the ctors run, and
each copy uses its own data structures (so the static bool controlling
whether it ran or not lives at a different location in memory for each
copy).

it makes sense.  if they think they have a ctor, then they gotta run
it.  there's no dynamic lookup saying "run this other function".
the .so happens to have a function called uthread_lib_init().  it's
just undesirable and unnecessary to have that ctor in that library.
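
a minimal sketch of the effect in one C file, not parlib's actual code.
the macro DEFINE_PARLIB_COPY and the binary_/libelf_ prefixes are made
up for illustration; they stand in for the same archive being linked
into two different objects, each getting a private static flag.

```c
#include <stdbool.h>

/*
 * Stamps out one private copy of a guarded init function, roughly what
 * statically linking the same archive into two objects gives you.
 * Every copy gets its own "already ran" flag and its own run counter.
 */
#define DEFINE_PARLIB_COPY(tag)                 \
	static bool tag##_initialized;              \
	static int tag##_init_runs;                 \
	static void tag##_uthread_lib_init(void)    \
	{                                           \
		if (tag##_initialized)                  \
			return;                             \
		tag##_initialized = true;               \
		tag##_init_runs++;                      \
	}

DEFINE_PARLIB_COPY(binary)	/* hypothetical: copy in the final program */
DEFINE_PARLIB_COPY(libelf)	/* hypothetical: copy in the .so */

/* Calls every copy's "ctor" twice and returns how many init bodies ran. */
int run_all_ctors(void)
{
	binary_uthread_lib_init();
	libelf_uthread_lib_init();
	binary_uthread_lib_init();
	libelf_uthread_lib_init();
	return binary_init_runs + libelf_init_runs;
}
```

each guard works fine for its own copy, so repeated calls are
harmless, but the init body still runs once per copy.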

i found this because i had a bug in parlib initialization that i fixed,
but then saw it again in a different program.  that program linked in
elfutils (-lelf).  i saw the same problem with another -lelf binary.
that, plus the fact it was dying very early in the process's lifetime,
made me suspect constructors and the .so.  

overall, this is probably due to how we tell gcc to do a
"--whole-archive -lparlib --no-whole-archive" for anything we build.
we want all final programs to be linked with parlib, but not
necessarily all .sos.

i can get around the fact that init_once_racy() doesn't work (use the
'early SCP / VC-CTX-READY' flag in procdata).  but we'd still have to
be careful to always rebuild any shared libraries whenever we change
parts of parlib.
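
a rough sketch of that workaround, assuming the guard moves from a
per-copy static to a flag that exists once per process.  struct
fake_procdata, its vc_ctx_ready field, and the exe_/so_ names are
hypothetical stand-ins, not the real procdata layout or flag name.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for procdata: there is exactly one instance per
 * process, no matter how many copies of the ctor got linked in.
 */
struct fake_procdata {
	bool vc_ctx_ready;
};

static struct fake_procdata fake_pd;

/* Stamps out one copy of the ctor that checks the shared flag. */
#define DEFINE_SHARED_GUARD_COPY(tag)           \
	static int tag##_init_runs;                 \
	static void tag##_uthread_lib_init(void)    \
	{                                           \
		if (fake_pd.vc_ctx_ready)               \
			return;                             \
		tag##_init_runs++;                      \
		fake_pd.vc_ctx_ready = true;            \
	}

DEFINE_SHARED_GUARD_COPY(exe)	/* copy in the final program */
DEFINE_SHARED_GUARD_COPY(so)	/* copy in the shared library */

/* Runs both copies and returns how many init bodies actually executed. */
int count_shared_guard_runs(void)
{
	so_uthread_lib_init();
	exe_uthread_lib_init();
	return exe_init_runs + so_init_runs;
}
```

whichever copy runs first wins and the other one bails out.  note that
this only fixes the double init; the stale-copy problem (the .so still
carrying an old parlib) is untouched.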

it'd be nice to change our gcc hack so that it only applies to final
objects.  we might just have to live with it, though.

for those curious, here's the BT of the bug:

#01 Addr 0x00004000002fcc3e is in libc-2.19.so at offset 0x00000000000d1c3e  
__ros_syscall_errno@@GLIBC_2.2.5+0x6e
#02 Addr 0x0000400000016ab2 is in libelf-0.164.so at offset 0x0000000000014ab2  
sys_yield+0x22
#03 Addr 0x00004000000186f7 is in libelf-0.164.so at offset 0x00000000000166f7  
thread0_sched_entry+0x37
#04 Addr 0x0000400000015a01 is in libelf-0.164.so at offset 0x0000000000013a01  
uthread_vcore_entry+0xf1
#05 Addr 0x0000400000005e34 is in libelf-0.164.so at offset 0x0000000000003e34  
__get_dtls.part.0+0x0
#06 Addr 0x0000400000015dbd is in libelf-0.164.so at offset 0x0000000000013dbd  
uthread_yield+0x21d
#07 Addr 0x00004000002fcc57 is in libc-2.19.so at offset 0x00000000000d1c57  
__ros_syscall_errno@@GLIBC_2.2.5+0x87
#08 Addr 0x00004000002fcf6c is in libc-2.19.so at offset 0x00000000000d1f6c  
mmap@@GLIBC_2.2.5+0x5c
#09 Addr 0x0000400000018deb is in libelf-0.164.so at offset 0x0000000000016deb  
ucq_init+0x2b
#10 Addr 0x000040000001a00d is in libelf-0.164.so at offset 0x000000000001800d  
get_eventq+0x1d
#11 Addr 0x000040000001b702 is in libelf-0.164.so at offset 0x0000000000019702  
init_posix_signals+0x32
#12 Addr 0x0000400000005f03 is in libelf-0.164.so at offset 0x0000000000003f03  
uthread_lib_init+0xa3
#13 Addr 0x000040000001df66 is in libelf-0.164.so at offset 0x000000000001bf66  
__do_global_ctors_aux+0x26
#14 Addr 0x0000400000005847 is in libelf-0.164.so at offset 0x0000000000003847  
_init+0x1f
#15 Addr 0x000000000010eb6c is in ld-2.19.so at offset 0x000000000000eb6c  
_dl_init_internal+0x8c
#16 Addr 0x0000000000100e9a is in ld-2.19.so at offset 0x0000000000000e9a  
oom+0x6a

a few things in parlib's initialization assumed some syscalls would
never block.  under the new strace, many syscalls can block, including
mmap.  (the kernel is aware of syscalls that can never block, and
strace will just drop those under load).
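
a toy model of that policy, reading "drop those" as "drop the trace
record".  the queue type, its capacity, and trace_syscall() are all
invented for the sketch and are not the kernel's real strace code.

```c
#include <stdbool.h>
#include <stddef.h>

#define TRACE_QUEUE_CAP 2	/* tiny on purpose, to force overflow */

enum trace_result {
	TRACE_QUEUED,		/* record stored, syscall proceeds */
	TRACE_DROPPED,		/* queue full, record thrown away */
	TRACE_MUST_BLOCK,	/* queue full, syscall waits for space */
};

struct trace_queue {
	size_t nr_pending;
};

/*
 * Decides what happens to one trace record.  'can_block' is false for
 * syscalls that are known to never block.
 */
enum trace_result trace_syscall(struct trace_queue *q, bool can_block)
{
	if (q->nr_pending < TRACE_QUEUE_CAP) {
		q->nr_pending++;
		return TRACE_QUEUED;
	}
	return can_block ? TRACE_MUST_BLOCK : TRACE_DROPPED;
}
```

with a small queue, almost any traced syscall can end up on the
TRACE_MUST_BLOCK path, which is how mmap started blocking.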

what happened here is that mmap (#08) blocked in the middle of
uthread_lib_init.  we had just set the 2LS ops, set blockon, and
told the kernel we were ready for VC ctx.  but we hadn't installed any
syscall event handlers (or INDIR handlers, which also matters).  so the
kernel would try to send an event, but we hadn't set anything up to
receive it.  we yielded and never woke up.
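
the ordering problem can be modeled in a few lines.  this is a
simulation under simplifying assumptions, not the real init path:
struct fake_proc, kernel_send_sysc_event(), and the handlers_first knob
are hypothetical, and the "kernel" here just drops an event when no
handler is installed.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*ev_handler_t)(bool *blocked);

struct fake_proc {
	bool vc_ctx_ready;		/* we told the kernel we take events */
	ev_handler_t sysc_handler;	/* NULL until a handler is installed */
	bool thread0_blocked;
};

/* Handler that restarts thread0 when its syscall completes. */
static void wake_thread0(bool *blocked)
{
	*blocked = false;
}

/* "Kernel" side: the blocked syscall finished, try to notify the process. */
static void kernel_send_sysc_event(struct fake_proc *p)
{
	if (p->vc_ctx_ready && p->sysc_handler)
		p->sysc_handler(&p->thread0_blocked);
	/* otherwise nobody is listening and the event goes nowhere */
}

/* Returns true if thread0 wakes up after mmap blocks during init. */
bool init_survives_blocking_mmap(bool handlers_first)
{
	struct fake_proc p = { 0 };

	if (handlers_first)
		p.sysc_handler = wake_thread0;
	p.vc_ctx_ready = true;

	/* mmap blocks here: thread0 yields and waits for the event */
	p.thread0_blocked = true;
	kernel_send_sysc_event(&p);

	return !p.thread0_blocked;
}
```

init_survives_blocking_mmap(false) is the buggy order from the
backtrace: ready flag first, handlers never installed, thread0 stays
asleep.  installing handlers before announcing readiness wakes it.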

strace saw it too:

E Syscall  18 (        mmap):(0x0, 0x2000, 0x3, 0x8020, 0xffffffffffffffff, 0x0) ret: --- proc: 26 core: 0 vcore: 0 data: ''
X Syscall  18 (        mmap):(0x0, 0x2000, 0x3, 0x8020, 0xffffffffffffffff, 0x0) ret: 0x40000057c000 proc: 26 core: 0 vcore: 0 data: ''
E Syscall  13 (  proc_yield):(0x0, 0x0, 0x0, 0x0, 0x0, 0x0) ret: --- proc: 26 core: 0 vcore: 0 data: ''
X Syscall  13 (  proc_yield):(0x0, 0x0, 0x0, 0x0, 0x0, 0x0) ret: 0x0 proc: 26 core: 0 vcore: 0 data: ''

there are actually a few bugs that have popped up due to having some
syscalls block that had never blocked before.  (they are blocking
because the strace queue is overflowing - i have it set to hold only 1
or 2 entries).  i'll probably make a test that makes every syscall
(that isn't blacklisted) block briefly.  i've already fixed most of the
bugs i found, but the .so ctor surprise added a couple hours to it. =)

barret
