Hi Arnau,

over the weekend we managed to reproduce identical behaviour: jobs crash during the epilog phase, when the job's cgroup gets removed.
Did you already open a bug with Univa or somewhere else?

Cheers,
Andreas

On Friday, 27.02.2015, at 13:03 +0100, Arnau Bria wrote:
> Dear all,
>
> I'm running SL 6.5. The last update installed kernel
> 2.6.32-504.1.3.el6.x86_64.
>
> Some of our nodes act as compute nodes in a Univa GE cluster, so they
> are used for running batch jobs. UGE supports cgroups: for each job
> that runs on a node, a cpuset is created and some memory limits are
> set through the UGE daemon (sge_execd).
>
> This worked nicely with our previous kernel,
> 2.6.32-431.29.2.el6.x86_64, but since we upgraded, most of the nodes
> have rebooted unexpectedly, leaving a vmcore in the crash directory,
> e.g.:
>
> # ls -lsa /var/crash/127.0.0.1-2015-02-23-08\:35\:23/vmcore
> 1153928 -rw------- 1 root root 1181615424 feb 23 08:40 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore
> 100 -rw-r--r-- 1 root root 99806 feb 23 08:35 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore-dmesg.txt
>
> When I look at the vmcore-dmesg.txt file I see some strange messages
> about a cgroup BUG, but as I'm not a kernel expert I'd like to ask
> this mailing list for help. The log shows:
>
> <3>INFO: task bedtools:32790 blocked for more than 120 seconds.
> <3> Not tainted 2.6.32-504.1.3.el6.x86_64 #1
> <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> <6>bedtools D 0000000000000008 0 32790 32789 0x00000080
> <4> ffff881f1cf4d9a8 0000000000000082 0000000000000000 ffff881f1cf4d96c
> <4> 0000000000000000 ffff88103fe71800 00001a68da2304b8 ffff880061b768c0
> <4> 0000000000000800 0000000101b6c99f ffff882026c8a638 ffff881f1cf4dfd8
> <4>Call Trace:
> [...]
>
> This is more or less common, and we have had some complaints about
> scientific programs (samtools, etc.),
> but the important thing comes at the end of the file:
>
> <4>------------[ cut here ]------------
> <4>WARNING: at kernel/cgroup.c:4428 __css_put+0x70/0x80() (Not tainted)
> <4>Hardware name: ProLiant BL460c Gen8
> <4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> <4>Pid: 3512, comm: sge_execd Not tainted 2.6.32-504.1.3.el6.x86_64 #1
> <4>Call Trace:
> <4> [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
> <4> [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
> <4> [<ffffffff810cff80>] ? __css_put+0x70/0x80
> <4> [<ffffffff811813ce>] ? mem_cgroup_force_empty+0x3e/0x50
> <4> [<ffffffff811813f4>] ? mem_cgroup_pre_destroy+0x14/0x20
> <4> [<ffffffff810cfa90>] ? cgroup_rmdir+0xe0/0x560
> <4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> <4> [<ffffffff8119ccf0>] ? vfs_rmdir+0xc0/0xf0
> <4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
> <4> [<ffffffff8119ff64>] ? do_rmdir+0x184/0x1f0
> <4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
> <4> [<ffffffff811a0026>] ? sys_rmdir+0x16/0x20
> <4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
> <4>---[ end trace 8eae4afa57f7484f ]---
> <4>------------[ cut here ]------------
>
> <4>------------[ cut here ]------------
> <2>kernel BUG at kernel/cgroup.c:3725!
> <4>invalid opcode: 0000 [#1] SMP
> <4>last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
> <4>CPU 15
> <4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> <4>
> <4>Pid: 3512, comm: sge_execd Tainted: G W --------------- 2.6.32-504.1.3.el6.x86_64 #1 HP ProLiant BL460c Gen8
> <4>RIP: 0010:[<ffffffff810cfef6>] [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
> <4>RSP: 0018:ffff8820272e3db8 EFLAGS: 00010046
> <4>RAX: 0000000000000004 RBX: ffff882028150200 RCX: ffffffff81c0cb00
> <4>RDX: ffffc9001cf76000 RSI: ffff88102549a000 RDI: 0000000000000246
> <4>RBP: ffff8820272e3e48 R08: 0000000000000000 R09: 0000000000000000
> <4>R10: 000000000000000f R11: 0000000000000008 R12: 0000000000000000
> <4>R13: ffff882028150308 R14: ffff8820272e3de8 R15: ffff882026e4e040
> <4>FS: 00007fc508ca0740(0000) GS:ffff8810788e0000(0000) knlGS:0000000000000000
> <4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> <4>CR2: 00007fc5085e1000 CR3: 00000020271f1000 CR4: 00000000000407e0
> <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>Process sge_execd (pid: 3512, threadinfo ffff8820272e2000, task ffff882026e4e040)
> <4>Stack:
> <4> ffff8820272e3e28 ffffffff81c0cb00 ffff882028150220 ffff882028150318
> <4><d> ffff882028150220 ffff88101ed25a00 0000000000000000 ffff882026e4e040
> <4><d> ffffffff8109eb00 ffffffff81aaa768 ffffffff81aaa768 00007fc50842f400
> <4>Call Trace:
> <4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> <4> [<ffffffff8119ccf0>] vfs_rmdir+0xc0/0xf0
> <4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
> <4> [<ffffffff8119ff64>] do_rmdir+0x184/0x1f0
> <4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
> <4> [<ffffffff811a0026>] sys_rmdir+0x16/0x20
> <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> <4>Code: 45 00 31 c0 e9 40 fe ff ff 48 8d b8 90 00 00 00 e8 80 ae 45 00 e9 1e ff ff ff 48 8d b8 90 00 00 00 e8 6f af 45 00 e9 c9 fe ff ff <0f> 0b eb fe 0f 0b 0f 1f 40 00 eb fa 0f 0b eb fe 66 2e 0f 1f 84
> <1>RIP [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
> <4> RSP <ffff8820272e3db8>
>
> Then the node reboots. I've read about tainted kernels, but I can't
> figure out what's happening on my systems. Could anyone help me
> understand what is going on? Is this a real cgroups bug? Is sge_execd
> doing something strange when purging cgroups? Should I report this to
> the kernel developers or to Univa?
>
> Many thanks in advance,
> Cheers,
> Arnau

--
| Andreas Haupt    | E-Mail: [email protected]
| DESY Zeuthen     | WWW: http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6  | Phone: +49/33762/7-7359
| D-15738 Zeuthen  | Fax: +49/33762/7-7216
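[Editor's note] The rmdir(2) at the top of both traces is the normal way a batch daemon tears down a per-job cgroup, which is why sge_execd (pid 3512) shows up as the crashing task. A minimal sketch of the cgroup v1 lifecycle presumably driven here; the mount point, job id, and 4 GiB limit are illustrative assumptions, and a temporary directory stands in for the real cgroupfs mount so the sketch can run anywhere:

```shell
#!/bin/sh
set -e
# Per-job cgroup v1 lifecycle as a batch daemon like sge_execd would
# drive it. A temp directory stands in for the real cgroupfs mount
# (e.g. /cgroup/memory); job id and limit are made-up values.
CG_ROOT="$(mktemp -d)"            # stand-in for the memory cgroup mount
JOB_CG="$CG_ROOT/uge/job_12345"   # hypothetical per-job cgroup

# Prolog: create the job's cgroup.
mkdir -p "$JOB_CG"

# On a real cgroupfs the control files already exist, so attaching the
# job and setting its limit is just a write (skipped on a plain fs):
if [ -e "$JOB_CG/tasks" ]; then
    echo "$$" > "$JOB_CG/tasks"
    echo $((4 * 1024 * 1024 * 1024)) > "$JOB_CG/memory.limit_in_bytes"
fi

# ... the job runs inside the cgroup ...

# Epilog: once the cgroup is empty, it is removed with a plain rmdir(2),
# exactly the path at the top of the crashing stack:
#   sys_rmdir -> vfs_rmdir -> cgroup_rmdir
rmdir "$JOB_CG"
echo "removed $JOB_CG"
rm -rf "$CG_ROOT"
```

On the affected kernel, that final rmdir appears to hit the WARN at kernel/cgroup.c:4428 and then the BUG at kernel/cgroup.c:3725 instead of returning, so the daemon's routine epilog cleanup is enough to bring the node down.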
