Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc
tags 557520 + unreproducible
thanks

Hi,

On Tuesday, 1 December 2009, Tom Feiner wrote:
> I'm not sure how to continue with this bug without a way to reproduce it.
> Let's keep it open for now until some new info becomes available.

Yup, that's the way to go. After some time it's also reasonable to close such bug reports.

In this specific case it would also help if Rémy could confirm it's indeed a hardware issue.

regards,
Holger
Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc
Hi Rémy,

Rémy Sanchez wrote:
> Also, I have a pretty low confidence in the quality of my hard disk, since
> it's a cheap dedicated host, the hardware in the server is probably of very
> poor quality. It might be a hardware failure and there is just nothing to be
> done...

Maybe smartctl will show you something regarding the disks. You can also run a S.M.A.R.T. short (or long) self-test manually to check the disk health.

> (I know it's useless to have 4 distinct swap files, but I actually had a
> problem with svn that turned out to use an increasing amount of RAM in a very
> short time, so I added some swap on the fly to avoid an OOM, but I did not
> remove them afterward).
>
>> Is this problem reproducible? And if so, can you attach an strace of the
>> process?
>
> No I'm sorry I could not figure out how this happened, this had just appeared
> by itself in normal operation, I did not touch anything in any configuration
> of the server for days...

I'm not sure how to continue with this bug without a way to reproduce it. Let's keep it open for now until some new info becomes available.

Thanks,
Tom
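
For reference, the smartctl checks suggested above could be scripted roughly as follows. This is only a sketch: the device name passed in (for example /dev/sda) is an assumption, since the thread never names the disk, and the helper names smartctl_commands and run_smart_checks are made up for illustration. The flags used (-H, -t short, -l selftest) are standard smartctl options from smartmontools.

```python
import shutil
import subprocess


def smartctl_commands(device):
    """Return the smartctl invocations for a basic health check of `device`."""
    return [
        ["smartctl", "-H", device],              # overall health self-assessment
        ["smartctl", "-t", "short", device],     # start a short self-test
        ["smartctl", "-l", "selftest", device],  # print the self-test log
    ]


def run_smart_checks(device):
    """Run each command in turn and print its output (usually needs root)."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found; install smartmontools")
    for argv in smartctl_commands(device):
        result = subprocess.run(argv, capture_output=True, text=True)
        print("$", " ".join(argv))
        print(result.stdout)
```

Calling run_smart_checks("/dev/sda") as root would run the three commands in order. The short self-test runs in the background on the drive, so the log printed right after starting it will not yet contain its result; the -l selftest command has to be repeated a few minutes later.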
Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc
tags 557520 = moreinfo
quit

Hi Rémy,

Thanks for the detailed bug report! And sorry for taking so long to answer the bug, I've been working hard lately getting the munin 1.4 package ready :).

Have you discovered anything else since you reported the bug? Are other processes suffering from the same problem as munin-graph?

The reason I'm asking is that AFAIK, processes that go into disk sleep state and stay there are either:

* Trying to communicate with a failed NFS/some other remote filesystem.
* Trying to communicate with a failed/failing disk drive.
* Trying to write to a filesystem mounted using a new/non-stable driver.

Are any of the filesystems that munin-graph writes to NFS based? Or some other remote filesystems / external drives / drives mounted with a new driver which might have problems?

Is this problem reproducible? And if so, can you attach an strace of the process?

Regards,
Tom Feiner
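
To answer the question of whether other processes are stuck in the same state, something like the following could list every process currently in disk sleep. It is a minimal sketch that assumes a Linux /proc filesystem and reads only /proc/<pid>/status; the function names are illustrative and not part of munin or any existing tool.

```python
import glob


def parse_status(text):
    """Turn the text of /proc/<pid>/status into a {field: value} dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_disk_sleep(fields):
    """True when the State field starts with 'D' (uninterruptible sleep)."""
    return fields.get("State", "").startswith("D")


def find_disk_sleepers():
    """Yield (pid, name) for every process currently in state D."""
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path, errors="replace") as handle:
                fields = parse_status(handle.read())
        except OSError:
            continue  # process exited between listing and reading
        if is_disk_sleep(fields):
            yield fields.get("Pid"), fields.get("Name")


if __name__ == "__main__":
    for pid, name in find_disk_sleepers():
        print(pid, name)
```

A process that only shows up in state D for a moment is normal (it is waiting for I/O); the ones worth looking at are those that stay in the list across repeated runs.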
Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc
On Sunday 29 November 2009 21:33:35 Tom Feiner wrote:
> Thanks for the detailed bug report! And sorry for taking so long to answer
> the bug, I've been working hard lately getting munin 1.4 package ready :).

Thanks for answering :)

> Have you discovered anything else since you've reported the bug? Are other
> processes suffering from the same problem as munin-graph?
>
> The reason I'm asking is that AFAIK, processes that run into disk sleep state
> and stay there, are either:
>
> * Trying to communicate with a failed NFS/some other remote filesystem.
> * Trying to communicate with a failed/failing disk drive.
> * Trying to write to a filesystem mounted using a new/non-stable driver?
>
> Are any of the filesystems that munin-graph writes to NFS based? Or some
> other remote filesystems / external drives / drives mounted with a new driver
> which might have problems?

Everything's ext3, on an internal hard disk. However, I've got 4 swap files:

/home/swapfile1 swap swap defaults 0 0
/home/swapfile2 swap swap defaults 0 0
/home/swapfile3 swap swap defaults 0 0
/home/swapfile4 swap swap defaults 0 0

Maybe the problem came from that, since in the call stack there is system_call_after_swapgs, but I'm just guessing.

Also, I have pretty low confidence in the quality of my hard disk: since it's a cheap dedicated host, the hardware in the server is probably of very poor quality. It might be a hardware failure and there is just nothing to be done...

(I know it's useless to have 4 distinct swap files, but I actually had a problem with svn that turned out to use an increasing amount of RAM in a very short time, so I added some swap on the fly to avoid an OOM, but I did not remove them afterward).

> Is this problem reproducible? And if so, can you attach an strace of the
> process?

No, I'm sorry, I could not figure out how this happened. It just appeared by itself in normal operation; I had not touched anything in the configuration of the server for days...

-- 
Rémy Sanchez
http://hyperthese.net
Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc
Package: munin
Version: 1.2.6-10~lenny1
Severity: normal

This morning, I woke up to tons of mails telling me that

Lock already exists: /var/run/munin/munin-graph.lock. Dying.

So I thought that munin-graph was still running, and I launched htop to see whether it was stuck in an infinite loop or something like that. Htop launched, cleared the screen, and then drew nothing, with no way to kill it. After that I tried to see what had happened with ps, but ps never returned. By the way, at that time the load average was 2.44, with no significant slowdown.

After some investigation, I could determine that munin-graph had been put into disk sleep, and so had the other frozen processes (htop, ps, and the various cat commands that I ran on /proc/something).

So far, here is what I could observe:

* munin-graph (or htop/ps/etc) cannot be killed by TERM, KILL or SIGHUP
* getting info about disk-sleeping processes other than munin-graph does not block
* not all files in /proc/4304 (4304 is the PID of munin-graph) are blocking. According to my tests, I get locked when doing a cat on /proc/4304/{cmdline,environ,maps,numa_maps,smaps}, but other files are behaving just fine.

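
The per-file test described in the last point could be automated along these lines. This is a hedged sketch, not what was actually run for this report: it reads each file in a child process with a timeout, so that a hang does not freeze the calling shell. The names probe and probe_pid are made up for illustration, and the file list is the one from the report plus status for comparison. Be aware that every probe that blocks leaves one more unkillable process behind, exactly like the manual cat commands did.

```python
import subprocess
import sys

# Files the report names as blocking, plus one that was readable.
CANDIDATES = ["cmdline", "environ", "maps", "numa_maps", "smaps", "status"]

# Tiny reader run in a child process so a hang cannot freeze the caller.
READER = "import sys; open(sys.argv[1], 'rb').read()"


def probe(path, timeout=2.0):
    """Return 'ok', 'error' or 'blocked' for one read attempt on `path`."""
    child = subprocess.Popen(
        [sys.executable, "-c", READER, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        code = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        child.kill()  # has no effect while the child is in disk sleep
        return "blocked"
    return "ok" if code == 0 else "error"


def probe_pid(pid, names=CANDIDATES):
    """Map each /proc/<pid>/<name> to the outcome of probing it."""
    return {name: probe(f"/proc/{pid}/{name}") for name in names}
```

For the process in this report, probe_pid(4304) would be the call to make. An "error" result only means the file could not be read at all (missing file or insufficient permissions), which is different from a read that never returns.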
Here is what dmesg tells me:

[6428739.796022] Modules linked in: ppdev parport_pc lp parport xt_multiport iptable_filter ip_tables x_tables ipv6 dm_snapshot dm_mirror dm_log dm_mod loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc serio_raw psmouse pcspkr i2c_viapro i2c_core shpchp pci_hotplug button via_agp evdev ext3 jbd mbcache sd_mod ata_generic floppy ahci via_rhine mii via82cxxx libata scsi_mod dock ide_pci_generic ide_core thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[6428739.796022] Pid: 4304, comm: munin-graph Tainted: G M 2.6.26-2-amd64 #1
[6428739.796022] RIP: 0010:[80283764] [80283764] find_vma+0x2b/0x57
[6428739.796022] RSP: 0018:81001f95bf50 EFLAGS: 00010206
[6428739.796022] RAX: 81003d42c3f8 RBX: 01ed2000 RCX: 810010887d98
[6428739.796022] RDX: 0080 RSI: 01ed2000 RDI: 81800440
[6428739.796022] RBP: 01f06000 R08: 0004 R09: 0003
[6428739.796022] R10: 7f0e59567a50 R11: 0206 R12: 81800440
[6428739.796022] R13: 01f06000 R14: 818004a0 R15: 00023ba0
[6428739.796022] FS: 7f0e5a1846e0() GS:8053c000() knlGS:
[6428739.796022] CS: 0010 DS: ES: CR0: 80050033
[6428739.796022] CR2: 7f0e53417000 CR3: 376be000 CR4: 06e0
[6428739.796022] DR0: DR1: DR2:
[6428739.796022] DR3: DR6: 0ff0 DR7: 0400
[6428739.796022] Process munin-graph (pid: 4304, threadinfo 81001f95a000, task 81002ceff120)
[6428739.796022] Stack: 8028542c 01ed2000 00034000 00034000
[6428739.796022] 00034000 7f0e595679e0 8020beca 0206
[6428739.796022] 7f0e59567a50 0003 0004 000c
[6428739.796022] Call Trace:
[6428739.796022] [8028542c] ? sys_brk+0xc5/0x111
[6428739.796022] [8020beca] ? system_call_after_swapgs+0x8a/0x8f
[6428739.796022]
[6428739.796022]
[6428739.796022] Code: 31 c0 48 85 ff 74 4f eb 05 48 89 c8 eb 3f 48 8b 47 10 48 85 c0 74 0c 48 39 70 10 76 06 48 39 70 08 76 33 48 8b 57 08 31 c0 eb 1d 48 39 72 e0 48 8d 4a d0 76 0f 48 39 72 d8 76 ce 48 8b 52 10 48
[6428739.796022] RIP [80283764] find_vma+0x2b/0x57
[6428739.796022] RSP 81001f95bf50
[6428739.797272] ---[ end trace 6e1e9d617365f1ff ]---

And also, here is the status file of the process 4304:

Name: munin-graph
State: D (disk sleep)
Tgid: 4304
Pid: 4304
PPid: 4106
TracerPid: 0
Uid: 112 112 112 112
Gid: 113 113 113 113
FDSize: 256
Groups: 113
VmPeak: 119792 kB
VmSize: 119536 kB
VmLck: 0 kB
VmHWM: 15416 kB
VmRSS: 15292 kB
VmData: 9772 kB
VmStk: 84 kB
VmExe: 4 kB
VmLib: 10828 kB
VmPTE: 252 kB
Threads: 1
SigQ: 4/8062
SigPnd:
ShdPnd: 00044101
SigBlk:
SigIgn: 0080
SigCgt: 00018000
CapInh:
CapPrm:
CapEff:
CapBnd: