On Thursday, 02 November 2017 at 16:16 +0100, Nigel Babu wrote:
> Hello folks,
>
> Yesterday, we had an unplanned Gerrit outage. We have now determined that
> the machine rebooted for some reason. Michael is continuing to debug what
> led to this issue. At this point, Gerrit does not start automatically when
> the VM restarts.

So I investigated, and... *drum roll* it's a kernel crash.

I suspect it's some weird race condition somewhere, given the traceback I
got with crash:

    [exception RIP: shmem_free_inode+19]
    RIP: ffffffff81198c23  RSP: ffff8816a232fd28  RFLAGS: 00010246
    RAX: ffff8817092dd440  RBX: 0000000000000000  RCX: 0000000100400009
    RDX: 000000010040000a  RSI: ffffea005c24b740  RDI: ffff8812ad30a800
    RBP: ffff8816a232fd38   R8: ffff8817092dd440   R9: 0000000100400009
    R10: 00000000092dd201  R11: ffffea005c24b740  R12: ffff8817092dd440
    R13: ffff880f6d8cc000  R14: ffff880f6d8cc018  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8816a232fd40] shmem_evict_inode at ffffffff8119d21f
#11 [ffff8816a232fd70] evict at ffffffff8121a2e7
#12 [ffff8816a232fd98] iput at ffffffff8121ab85
#13 [ffff8816a232fdc8] devpts_del_ref at ffffffff81283868
#14 [ffff8816a232fde0] pty_unix98_shutdown at ffffffff813ec526
#15 [ffff8816a232fdf8] release_tty at ffffffff813e1477
#16 [ffff8816a232fe10] tty_release at ffffffff813e26dd
#17 [ffff8816a232fea8] __fput at ffffffff81200109
#18 [ffff8816a232fef0] ____fput at ffffffff812003be
#19 [ffff8816a232ff00] task_work_run at ffffffff810ace97
#20 [ffff8816a232ff30] do_notify_resume at ffffffff8102ab22
#21 [ffff8816a232ff50] int_signal at ffffffff81696dbd

crash /usr/lib/debug/lib/modules/3.10.0-514.10.2.el7.x86_64/vmlinux ./vmcore

The only useful entry I found in dmesg is:

[20035599.848892] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day...

I didn't find an open bug about it, and I guess that unless I can
reproduce it, I can't do much.

So for me, the case is closed (minus the systemd unit, which we have been
testing for 4 days:
https://github.com/gluster/gluster.org_ansible_configuration/commit/9aa279acf9316eae6ff7afff36ad630fc42edeff
)

> We are currently testing a systemd unit file for Gerrit in staging. Once
> that's in place, we can ensure that we start Gerrit automatically when we
> restart the server.
>
> Timeline of events (in CET):

16:24:24 kernel crash
16:26 kernel start

> 16:29 - I receive an alert that Gerrit is down. This goes ignored because
> we're still working on Jenkins.
>
> 18:25 - I notice the alerts as we're packing up for the day and start
> Gerrit.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
_______________________________________________
Gluster-infra mailing list
http://lists.gluster.org/mailman/listinfo/gluster-infra
