hi - it turns out that parlib's ctors, uthread_lib_init() and vcore_lib_init(), can be run multiple times, despite the init_once_racy() thing.
any shared objects we make (.so but not .a, such as libelf.so) also have their own copy of those ctors. (surprise!) when such a .so is linked with a binary that also has those ctors, both copies of the ctors run, and each uses its own data structures (so the static bool controlling whether the ctor already ran lives at a different location in memory for each copy).

it makes sense. if they think they have a ctor, then they gotta run it. there's no dynamic lookup saying "run this other function". the .so happens to have a function called uthread_lib_init(). it's just undesirable and unnecessary to have that ctor in that library.

i found this because i had a bug in parlib initialization that i fixed, but then saw it again in a different program. that program linked in elfutils (-lelf). i saw the same problem with another -lelf binary. that, plus the fact it was dying very early in the process's lifetime, made me suspect constructors and the .so.

overall, this is probably due to how we tell gcc to do a "--whole-archive -lparlib --no-whole-archive" for anything we build. we want all final programs to be linked with parlib, but not necessarily all .sos.

i can get around the fact that init_once_racy() doesn't work (use the 'early SCP / VC-CTX-READY' flag in procdata). but we'd still have to be careful to always rebuild any shared libraries whenever we change parts of parlib. it'd be nice to change our gcc hack so that it only applies to final objects. we might just have to live with it, though.
for those curious, here's the BT of the bug:

#01 Addr 0x00004000002fcc3e is in libc-2.19.so at offset 0x00000000000d1c3e __ros_syscall_errno@@GLIBC_2.2.5+0x6e
#02 Addr 0x0000400000016ab2 is in libelf-0.164.so at offset 0x0000000000014ab2 sys_yield+0x22
#03 Addr 0x00004000000186f7 is in libelf-0.164.so at offset 0x00000000000166f7 thread0_sched_entry+0x37
#04 Addr 0x0000400000015a01 is in libelf-0.164.so at offset 0x0000000000013a01 uthread_vcore_entry+0xf1
#05 Addr 0x0000400000005e34 is in libelf-0.164.so at offset 0x0000000000003e34 __get_dtls.part.0+0x0
#06 Addr 0x0000400000015dbd is in libelf-0.164.so at offset 0x0000000000013dbd uthread_yield+0x21d
#07 Addr 0x00004000002fcc57 is in libc-2.19.so at offset 0x00000000000d1c57 __ros_syscall_errno@@GLIBC_2.2.5+0x87
#08 Addr 0x00004000002fcf6c is in libc-2.19.so at offset 0x00000000000d1f6c mmap@@GLIBC_2.2.5+0x5c
#09 Addr 0x0000400000018deb is in libelf-0.164.so at offset 0x0000000000016deb ucq_init+0x2b
#10 Addr 0x000040000001a00d is in libelf-0.164.so at offset 0x000000000001800d get_eventq+0x1d
#11 Addr 0x000040000001b702 is in libelf-0.164.so at offset 0x0000000000019702 init_posix_signals+0x32
#12 Addr 0x0000400000005f03 is in libelf-0.164.so at offset 0x0000000000003f03 uthread_lib_init+0xa3
#13 Addr 0x000040000001df66 is in libelf-0.164.so at offset 0x000000000001bf66 __do_global_ctors_aux+0x26
#14 Addr 0x0000400000005847 is in libelf-0.164.so at offset 0x0000000000003847 _init+0x1f
#15 Addr 0x000000000010eb6c is in ld-2.19.so at offset 0x000000000000eb6c _dl_init_internal+0x8c
#16 Addr 0x0000000000100e9a is in ld-2.19.so at offset 0x0000000000000e9a oom+0x6a

a few things in parlib's initialization assumed some syscalls would never block. under the new strace, many syscalls can block, including mmap. (the kernel is aware of syscalls that can never block, and strace will just drop those under load). what happened here is that mmap (#08) blocked in the middle of uthread_lib_init.
we had just set the 2LS ops, set blockon, and told the kernel we were ready for VC ctx. but we hadn't installed any syscall event handlers (or INDIR handlers, which also matters). so the kernel would try to send an event, but we hadn't set anything up to receive it. we yielded and never woke up. strace saw it too:

E Syscall 18 ( mmap):(0x0, 0x2000, 0x3, 0x8020, 0xffffffffffffffff, 0x0) ret: --- proc: 26 core: 0 vcore: 0 data: ''
X Syscall 18 ( mmap):(0x0, 0x2000, 0x3, 0x8020, 0xffffffffffffffff, 0x0) ret: 0x40000057c000 proc: 26 core: 0 vcore: 0 data: ''
E Syscall 13 ( proc_yield):(0x0, 0x0, 0x0, 0x0, 0x0, 0x0) ret: --- proc: 26 core: 0 vcore: 0 data: ''
X Syscall 13 ( proc_yield):(0x0, 0x0, 0x0, 0x0, 0x0, 0x0) ret: 0x0 proc: 26 core: 0 vcore: 0 data: ''

There are actually a few bugs that have popped up because some syscalls now block that had never blocked before. (They are blocking because the strace queue is overflowing - i have it set to hold only 1 or 2 entries). I'll probably make a test that makes every syscall (that isn't blacklisted) block briefly. I've already fixed most of the bugs i found, but the .so ctor surprise added a couple hours to it. =)

barret
