Hi Arnau,

over the weekend we managed to reproduce identical behaviour: jobs crash during the epilog phase, when the job's cgroup gets removed.
Did you already open a bug with Univa or somewhere else?

Cheers,
Andreas

On Friday, 27.02.2015, at 13:03 +0100, Arnau Bria wrote:
> Dear all,
>
> I'm running SL 6.5. The last update installed kernel
> 2.6.32-504.1.3.el6.x86_64.
>
> Some of our nodes act as compute nodes in a Univa GE cluster, so they
> are used for running batch jobs. UGE supports cgroups: for each job
> that runs on a node, a cpuset is created and some memory limits are
> set through the UGE daemon (sge_execd).
>
> This worked nicely with our previous kernel,
> 2.6.32-431.29.2.el6.x86_64, but since we upgraded, most of the nodes
> have rebooted unexpectedly, leaving a vmcore in the crash directory,
> e.g.:
>
> # ls -lsa /var/crash/127.0.0.1-2015-02-23-08\:35\:23/vmcore
> 1153928 -rw------- 1 root root 1181615424 feb 23 08:40 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore
> 100 -rw-r--r-- 1 root root 99806 feb 23 08:35 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore-dmesg.txt
>
> When I look at the vmcore-dmesg.txt file I see some strange messages
> about a cgroup BUG, but as I'm not a kernel expert I'd like to ask
> this mailing list for help. The log shows:
>
> <3>INFO: task bedtools:32790 blocked for more than 120 seconds.
> <3> Not tainted 2.6.32-504.1.3.el6.x86_64 #1
> <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> <6>bedtools D 0000000000000008 0 32790 32789 0x00000080
> <4> ffff881f1cf4d9a8 0000000000000082 0000000000000000 ffff881f1cf4d96c
> <4> 0000000000000000 ffff88103fe71800 00001a68da2304b8 ffff880061b768c0
> <4> 0000000000000800 0000000101b6c99f ffff882026c8a638 ffff881f1cf4dfd8
> <4>Call Trace:
> [...]
>
> This is more or less common, and we have had some complaints about
> scientific programs (samtools, etc.),
> but the important thing comes at the end of the file:
>
> <4>------------[ cut here ]------------
> <4>WARNING: at kernel/cgroup.c:4428 __css_put+0x70/0x80() (Not tainted)
> <4>Hardware name: ProLiant BL460c Gen8
> <4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> <4>Pid: 3512, comm: sge_execd Not tainted 2.6.32-504.1.3.el6.x86_64 #1
> <4>Call Trace:
> <4> [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
> <4> [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
> <4> [<ffffffff810cff80>] ? __css_put+0x70/0x80
> <4> [<ffffffff811813ce>] ? mem_cgroup_force_empty+0x3e/0x50
> <4> [<ffffffff811813f4>] ? mem_cgroup_pre_destroy+0x14/0x20
> <4> [<ffffffff810cfa90>] ? cgroup_rmdir+0xe0/0x560
> <4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> <4> [<ffffffff8119ccf0>] ? vfs_rmdir+0xc0/0xf0
> <4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
> <4> [<ffffffff8119ff64>] ? do_rmdir+0x184/0x1f0
> <4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
> <4> [<ffffffff811a0026>] ? sys_rmdir+0x16/0x20
> <4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
> <4>---[ end trace 8eae4afa57f7484f ]---
> <4>------------[ cut here ]------------
>
> <4>------------[ cut here ]------------
> <2>kernel BUG at kernel/cgroup.c:3725!
> <4>invalid opcode: 0000 [#1] SMP
> <4>last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
> <4>CPU 15
> <4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> <4>
> <4>Pid: 3512, comm: sge_execd Tainted: G W --------------- 2.6.32-504.1.3.el6.x86_64 #1 HP ProLiant BL460c Gen8
> <4>RIP: 0010:[<ffffffff810cfef6>] [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
> <4>RSP: 0018:ffff8820272e3db8 EFLAGS: 00010046
> <4>RAX: 0000000000000004 RBX: ffff882028150200 RCX: ffffffff81c0cb00
> <4>RDX: ffffc9001cf76000 RSI: ffff88102549a000 RDI: 0000000000000246
> <4>RBP: ffff8820272e3e48 R08: 0000000000000000 R09: 0000000000000000
> <4>R10: 000000000000000f R11: 0000000000000008 R12: 0000000000000000
> <4>R13: ffff882028150308 R14: ffff8820272e3de8 R15: ffff882026e4e040
> <4>FS: 00007fc508ca0740(0000) GS:ffff8810788e0000(0000) knlGS:0000000000000000
> <4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> <4>CR2: 00007fc5085e1000 CR3: 00000020271f1000 CR4: 00000000000407e0
> <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>Process sge_execd (pid: 3512, threadinfo ffff8820272e2000, task ffff882026e4e040)
> <4>Stack:
> <4> ffff8820272e3e28 ffffffff81c0cb00 ffff882028150220 ffff882028150318
> <4><d> ffff882028150220 ffff88101ed25a00 0000000000000000 ffff882026e4e040
> <4><d> ffffffff8109eb00 ffffffff81aaa768 ffffffff81aaa768 00007fc50842f400
> <4>Call Trace:
> <4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> <4> [<ffffffff8119ccf0>] vfs_rmdir+0xc0/0xf0
> <4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
> <4> [<ffffffff8119ff64>] do_rmdir+0x184/0x1f0
> <4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
> <4> [<ffffffff811a0026>] sys_rmdir+0x16/0x20
> <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> <4>Code: 45 00 31 c0 e9 40 fe ff ff 48 8d b8 90 00 00 00 e8 80 ae 45 00 e9 1e ff ff ff 48 8d b8 90 00 00 00 e8 6f af 45 00 e9 c9 fe ff ff <0f> 0b eb fe 0f 0b 0f 1f 40 00 eb fa 0f 0b eb fe 66 2e 0f 1f 84
> <1>RIP [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
> <4> RSP <ffff8820272e3db8>
>
> Then the node reboots. I've read about tainted kernels, but I can't
> figure out what's happening on my systems. Could anyone help me
> understand what is going on? Is this a real cgroups bug? Is sge_execd
> doing something strange when purging cgroups? Should I report this to
> the kernel developers or to Univa?
>
> Many thanks in advance,
> Cheers,
> Arnau

--
| Andreas Haupt    | E-Mail: [email protected]
| DESY Zeuthen     | WWW: http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6  | Phone: +49/33762/7-7359
| D-15738 Zeuthen  | Fax: +49/33762/7-7216
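[Editor's note] The rmdir(2) at the top of both traces is the normal way a batch daemon tears down a per-job cgroup, which is why sge_execd (pid 3512) shows up as the crashing task. A minimal sketch of the cgroup v1 lifecycle presumably driven here; the mount point, job id, and 4 GiB limit are illustrative assumptions, and a temporary directory stands in for the real cgroupfs mount so the sketch can run anywhere:

```shell
#!/bin/sh
set -e
# Per-job cgroup v1 lifecycle as a batch daemon like sge_execd would
# drive it. A temp directory stands in for the real cgroupfs mount
# (e.g. /cgroup/memory); job id and limit are made-up values.
CG_ROOT="$(mktemp -d)"            # stand-in for the memory cgroup mount
JOB_CG="$CG_ROOT/uge/job_12345"   # hypothetical per-job cgroup

# Prolog: create the job's cgroup.
mkdir -p "$JOB_CG"

# On a real cgroupfs the control files already exist, so attaching the
# job and setting its limit is just a write (skipped on a plain fs):
if [ -e "$JOB_CG/tasks" ]; then
    echo "$$" > "$JOB_CG/tasks"
    echo $((4 * 1024 * 1024 * 1024)) > "$JOB_CG/memory.limit_in_bytes"
fi

# ... the job runs inside the cgroup ...

# Epilog: once the cgroup is empty, it is removed with a plain rmdir(2),
# exactly the path at the top of the crashing stack:
#   sys_rmdir -> vfs_rmdir -> cgroup_rmdir
rmdir "$JOB_CG"
echo "removed $JOB_CG"
rm -rf "$CG_ROOT"
```

On the affected kernel, that final rmdir appears to hit the WARN at kernel/cgroup.c:4428 and then the BUG at kernel/cgroup.c:3725 instead of returning, so the daemon's routine epilog cleanup is enough to bring the node down.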
