Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc

2009-12-02 Thread Holger Levsen
tags 557520 + unreproducible
thanks

Hi,

On Tuesday, 1 December 2009, Tom Feiner wrote:
> I'm not sure how to continue with this bug without a way to reproduce it.
> Let's keep it open for now until some new info becomes available.

Yup, that's the way to go. After some time it's also reasonable to close such
bug reports. In this specific case it would also help if Rémy could confirm
that it's indeed a hardware issue.


regards,
Holger




Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc

2009-11-30 Thread Tom Feiner
Hi Rémy,

Rémy Sanchez wrote:
> Also, I have pretty low confidence in the quality of my hard disk: since
> it's a cheap dedicated host, the hardware in the server is probably of very
> poor quality. It might be a hardware failure and there is just nothing to be
> done...

Maybe smartctl will show you something regarding the disks. You can also run
a short (or long) S.M.A.R.T. self-test manually to check the disk health.
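
For reference, a minimal sketch of how those checks might be scripted. It assumes smartmontools is installed and that the disk is /dev/sda, which is only a placeholder since the thread does not name the device; the helper names are hypothetical.

```python
import subprocess


def smart_commands(device):
    """Build the smartctl invocations for a basic disk health check.

    `device` is a placeholder such as /dev/sda (an assumption; the
    thread does not say which device the server uses).
    """
    return [
        ["smartctl", "-H", device],              # overall health assessment
        ["smartctl", "-t", "short", device],     # start a short self-test
        ["smartctl", "-l", "selftest", device],  # show the self-test log
    ]


def run_health_check(device):
    """Run only the health query and return (exit status, output).

    Needs root privileges and smartmontools; nothing is executed
    until this function is called.
    """
    result = subprocess.run(
        smart_commands(device)[0],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout
```

The self-test runs in the background on the drive, so the log has to be read some minutes after the test was started.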

>
> (I know it's useless to have 4 distinct swap files, but I actually had a
> problem with svn that turned out to use an increasing amount of RAM in a very
> short time, so I added some swap on the fly to avoid an OOM, but I did not
> remove them afterward).
>
>> Is this problem reproducible? And if so, can you attach an strace of the
>> process?
>
> No, I'm sorry, I could not figure out how this happened; it just appeared
> by itself in normal operation. I did not touch anything in any configuration
> of the server for days...
>

I'm not sure how to continue with this bug without a way to reproduce it. Let's
keep it open for now until some new info becomes available.

Thanks,
Tom






Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc

2009-11-29 Thread Tom Feiner
tags 557520 = moreinfo
quit

Hi Rémy,

Thanks for the detailed bug report! And sorry for taking so long to answer the
bug, I've been working hard lately getting the munin 1.4 package ready :).

Have you discovered anything else since you reported the bug? Are other
processes suffering from the same problem as munin-graph? The reason I'm
asking is that AFAIK, processes that enter the disk sleep state and stay there
are usually doing one of the following:

* Trying to communicate with a failed NFS/some other remote filesystem.
* Trying to communicate with a failed/failing disk drive.
* Trying to write to a filesystem mounted using a new/non-stable driver?

Are any of the filesystems that munin-graph writes to NFS based? Or some other
remote filesystems / external drives / drives mounted with a new driver which
might have problems?

Is this problem reproducible? And if so, can you attach an strace of the 
process?

Regards,
Tom Feiner











Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc

2009-11-29 Thread Rémy Sanchez
On Sunday 29 November 2009 21:33:35 Tom Feiner wrote:
> Thanks for the detailed bug report! And sorry for taking so long to answer
> the bug, I've been working hard lately getting the munin 1.4 package ready :).

Thanks for answering :)

> Have you discovered anything else since you reported the bug? Are other
> processes suffering from the same problem as munin-graph? The reason I'm
> asking is that AFAIK, processes that enter the disk sleep state and stay
> there are usually doing one of the following:
>
> * Trying to communicate with a failed NFS/some other remote filesystem.
> * Trying to communicate with a failed/failing disk drive.
> * Trying to write to a filesystem mounted using a new/non-stable driver?
>
> Are any of the filesystems that munin-graph writes to NFS based? Or some
> other remote filesystems / external drives / drives mounted with a new
> driver which might have problems?

Everything's ext3, internal hard disk. However I've got 4 swap files

/home/swapfile1  swap  swap  defaults  0  0
/home/swapfile2  swap  swap  defaults  0  0
/home/swapfile3  swap  swap  defaults  0  0
/home/swapfile4  swap  swap  defaults  0  0

Maybe the problem comes from that, since in the call stack there is
system_call_after_swapgs, but I'm just guessing.

Also, I have pretty low confidence in the quality of my hard disk: since it's
a cheap dedicated host, the hardware in the server is probably of very poor
quality. It might be a hardware failure and there is just nothing to be
done...

(I know it's useless to have 4 distinct swap files, but I actually had a problem
with svn that turned out to use an increasing amount of RAM in a very short
time, so I added some swap on the fly to avoid an OOM, but I did not remove
them afterward).

> Is this problem reproducible? And if so, can you attach an strace of the
> process?

No, I'm sorry, I could not figure out how this happened; it just appeared
by itself in normal operation. I did not touch anything in any configuration of
the server for days...

-- 
Rémy Sanchez
http://hyperthese.net




Bug#557520: munin-graph goes into disk sleep and freezes ps/top/htop/etc

2009-11-22 Thread Rémy Sanchez
Package: munin
Version: 1.2.6-10~lenny1
Severity: normal

This morning, I woke up to tons of mails telling me that:

Lock already exists: /var/run/munin/munin-graph.lock. Dying.

So I thought that munin-graph was still running, and I launched htop to see
whether it was stuck in an infinite loop or something like that. Htop launched,
cleared the screen, and then drew nothing, with no way to kill it. After that
I tried to see what had happened with ps, but ps never returned. By the way, at
that time the load average was 2.44, with no significant slowdown. After
some investigation, I determined that munin-graph had been put into disk sleep,
and so had the other frozen processes (htop, ps, and various cat commands that
I ran on /proc/something). So far, here is what I have observed:

* munin-graph (or htop/ps/etc) cannot be killed by SIGTERM, SIGKILL or SIGHUP
* getting info about disk-sleeping processes other than munin-graph does
  not block
* not all files in /proc/4304 (4304 is the PID of munin-graph) block.
  According to my tests, cat hangs on
  /proc/4304/{cmdline,environ,maps,numa_maps,smaps}, but the other files
  behave just fine.
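
Since tools like ps and htop hang here, one way to list the stuck processes is to read only /proc/<pid>/status, which did not block in the tests above. A minimal sketch, assuming Python is available on the host; the function names are hypothetical, and nothing guarantees that status stays readable on every kernel.

```python
import os


def parse_status(text):
    """Parse the text of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def list_disk_sleepers(proc_root="/proc"):
    """Return sorted (pid, name) pairs for processes in state D.

    Only the status file is opened; cmdline, environ, maps and
    friends are deliberately avoided because reading them hung
    in this report.
    """
    sleepers = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        path = os.path.join(proc_root, entry, "status")
        try:
            with open(path) as handle:
                fields = parse_status(handle.read())
        except OSError:
            continue  # process exited or file not readable
        if fields.get("State", "").startswith("D"):
            sleepers.append((int(entry), fields.get("Name", "?")))
    return sorted(sleepers)


if __name__ == "__main__":
    for pid, name in list_disk_sleepers():
        print(pid, name)
```

A process that shows up in this list on several consecutive runs is stuck rather than just briefly waiting on I/O.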

Here is what dmesg tells me

[6428739.796022] Modules linked in: ppdev parport_pc lp parport xt_multiport iptable_filter ip_tables x_tables ipv6 dm_snapshot dm_mirror dm_log dm_mod loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc serio_raw psmouse pcspkr i2c_viapro i2c_core shpchp pci_hotplug button via_agp evdev ext3 jbd mbcache sd_mod ata_generic floppy ahci via_rhine mii via82cxxx libata scsi_mod dock ide_pci_generic ide_core thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[6428739.796022] Pid: 4304, comm: munin-graph Tainted: G   M  2.6.26-2-amd64 #1
[6428739.796022] RIP: 0010:[80283764]  [80283764] find_vma+0x2b/0x57
[6428739.796022] RSP: 0018:81001f95bf50  EFLAGS: 00010206
[6428739.796022] RAX: 81003d42c3f8 RBX: 01ed2000 RCX: 810010887d98
[6428739.796022] RDX: 0080 RSI: 01ed2000 RDI: 81800440
[6428739.796022] RBP: 01f06000 R08: 0004 R09: 0003
[6428739.796022] R10: 7f0e59567a50 R11: 0206 R12: 81800440
[6428739.796022] R13: 01f06000 R14: 818004a0 R15: 00023ba0
[6428739.796022] FS:  7f0e5a1846e0() GS:8053c000() knlGS:
[6428739.796022] CS:  0010 DS:  ES:  CR0: 80050033
[6428739.796022] CR2: 7f0e53417000 CR3: 376be000 CR4: 06e0
[6428739.796022] DR0:  DR1:  DR2:
[6428739.796022] DR3:  DR6: 0ff0 DR7: 0400
[6428739.796022] Process munin-graph (pid: 4304, threadinfo 81001f95a000, task 81002ceff120)
[6428739.796022] Stack:  8028542c 01ed2000 00034000 00034000
[6428739.796022]  00034000 7f0e595679e0 8020beca 0206
[6428739.796022]  7f0e59567a50 0003 0004 000c
[6428739.796022] Call Trace:
[6428739.796022]  [8028542c] ? sys_brk+0xc5/0x111
[6428739.796022]  [8020beca] ? system_call_after_swapgs+0x8a/0x8f
[6428739.796022]
[6428739.796022]
[6428739.796022] Code: 31 c0 48 85 ff 74 4f eb 05 48 89 c8 eb 3f 48 8b 47 10 48 85 c0 74 0c 48 39 70 10 76 06 48 39 70 08 76 33 48 8b 57 08 31 c0 eb 1d 48 39 72 e0 48 8d 4a d0 76 0f 48 39 72 d8 76 ce 48 8b 52 10 48
[6428739.796022] RIP  [80283764] find_vma+0x2b/0x57
[6428739.796022]  RSP 81001f95bf50
[6428739.797272] ---[ end trace 6e1e9d617365f1ff ]---
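
A trace like this can be summarized mechanically. The sketch below pulls out the PID, the command name, the taint letters and the call-trace symbols. It assumes each kernel record sits on a single line (mail wrapping undone), and the regular expressions are only guesses fitted to this one trace. Only the two taint letters that appear here are decoded: G means no proprietary modules were loaded, and M means the kernel recorded a machine check exception, which would fit the suspicion of a hardware fault.

```python
import re

HEADER = re.compile(r"Pid: (\d+), comm: (\S+)")
TAINT = re.compile(r"Tainted: ([A-Z ]+)")
# A call-trace record: "[timestamp]  [address] ? symbol+0xoff/0xsize"
FRAME = re.compile(
    r"\]\s+\[[0-9a-f<>]+\]\s+(?:\?\s+)?([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Only the letters seen in this report; the kernel defines more.
TAINT_MEANING = {
    "G": "no proprietary modules loaded",
    "M": "machine check exception reported",
}


def summarize_oops(text):
    """Extract pid, comm, taint letters and call-trace symbols."""
    info = {"pid": None, "comm": None, "taint": [], "trace": []}
    in_trace = False
    for line in text.splitlines():
        match = HEADER.search(line)
        if match:
            info["pid"] = int(match.group(1))
            info["comm"] = match.group(2)
        match = TAINT.search(line)
        if match:
            info["taint"] = [c for c in match.group(1) if c != " "]
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME.search(line)
            if match:
                info["trace"].append(match.group(1))
            else:
                in_trace = False  # first non-frame line ends the trace
    return info
```

Applied to the trace above, the symbols it collects are sys_brk and system_call_after_swapgs, i.e. the process was inside a brk() system call when the fault occurred.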

And also, here is the status file of the process 4304

Name:       munin-graph
State:      D (disk sleep)
Tgid:       4304
Pid:        4304
PPid:       4106
TracerPid:  0
Uid:        112 112 112 112
Gid:        113 113 113 113
FDSize:     256
Groups:     113
VmPeak:     119792 kB
VmSize:     119536 kB
VmLck:      0 kB
VmHWM:      15416 kB
VmRSS:      15292 kB
VmData:     9772 kB
VmStk:      84 kB
VmExe:      4 kB
VmLib:      10828 kB
VmPTE:      252 kB
Threads:    1
SigQ:       4/8062
SigPnd:
ShdPnd:     00044101
SigBlk:
SigIgn:     0080
SigCgt:     00018000
CapInh:
CapPrm:
CapEff:
CapBnd:
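
The ShdPnd field is a hexadecimal bitmask of signals that are pending for the whole process, with bit 0 standing for signal 1. A minimal sketch of decoding it follows. It assumes the value shown is the low-order part of the mask (leading zeros do not change the result) and uses the x86 Linux signal numbering; the function names are hypothetical.

```python
def parse_status(text):
    """Parse the text of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def signals_in_mask(mask_hex):
    """Return the signal numbers whose bits are set in a hex mask."""
    if not mask_hex:
        return []
    mask = int(mask_hex, 16)
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Names for the numbers that occur here (x86 Linux numbering, an
# assumption about the reporting machine's architecture).
X86_LINUX_NAMES = {1: "SIGHUP", 9: "SIGKILL", 15: "SIGTERM", 19: "SIGSTOP"}

if __name__ == "__main__":
    status = parse_status("State:\tD (disk sleep)\nShdPnd:\t00044101\n")
    for number in signals_in_mask(status["ShdPnd"]):
        print(number, X86_LINUX_NAMES.get(number, "?"))
```

Under those assumptions the mask 00044101 decodes to signals 1, 9, 15 and 19, meaning SIGHUP, SIGKILL, SIGTERM and SIGSTOP are queued but not delivered. That would be consistent with the kill attempts having no effect while the process stays in disk sleep.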