Hi,

Thanks for dealing with this.

I'm glad we made Savannah more like other FSF systems, so you could
more easily fix low-level issues :)

I disabled the monthly array checks as you suggested. Previously we got
spurious "array is busy" messages each month (which Danny had a look at);
something's definitely not clean there yet.

--
Sylvain

On Sun, Feb 07, 2010 at 09:50:18AM -0500, Ward Vandewege wrote:
> OK - in the end I
>
> 1. killed the rsync backups on savannah-backup
> 2. xm destroy'ed all the domUs on savannah
> 3. rebooted savannah
>
> I have left the vcs-noshell-snapshot there, you can delete it when you've
> decided it's no longer needed?
>
> Colonialone is now on the -21 version of the kernel package, which is nice.
>
> I'd recommend switching off the monthly array checks in /etc/default/mdadm.
>
> Thanks,
> Ward.
>
>
> On Sun, Feb 07, 2010 at 09:24:53AM -0500, Ward Vandewege wrote:
> > Hi Sylvain,
> >
> > Savannah's having problems.
> >
> > It seems to have been triggered by a combination of having the monthly mdadm
> > array check switched on in /etc/default/mdadm and the rsync backup script in
> > /root/remote_backup.sh that kicked off around 7am this morning.
> >
> > All the domUs are so starved for CPU and/or IO that they have lots of these
> > on console:
> >
> > [2423148.713493] =======================
> > [2423152.912986] INFO: task kjournald:506 blocked for more than 120 seconds.
> > [2423152.913002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [2423152.913009] kjournald D ed753dec 0 506 2
> > [2423152.913018] ecc13280 00000246 ed6a1f0c ed753dec c010621f ecc13408 c115ab40 00000000
> > [2423152.913033] ec1dce40 8abe4250 240b36fe 8abe34a7 0013398d ed753dec 0ed26a5a ed6a1f0c
> > [2423152.913047] ed753dec c0133204 0ed26a5a ed753dec 0ed26a5a ed6a1f0c c115ab40 00d9b000
> > [2423152.913065] Call Trace:
> > [2423152.913073] [<c010621f>] xen_clocksource_read+0xc/0x164
> > [2423152.913085] [<c0133204>] getnstimeofday+0x37/0xbc
> > [2423152.913096] [<c02caa7c>] io_schedule+0x49/0x80
> > [2423152.913104] [<c018e286>] sync_buffer+0x30/0x33
> > [2423152.913114] [<c02cac6a>] __wait_on_bit+0x33/0x58
> > [2423152.913121] [<c018e256>] sync_buffer+0x0/0x33
> > [2423152.913130] [<c018e256>] sync_buffer+0x0/0x33
> > [2423152.913136] [<c02cacee>] out_of_line_wait_on_bit+0x5f/0x67
> > [2423152.913146] [<c012ecc5>] wake_bit_function+0x0/0x3c
> > [2423152.913155] [<c018e222>] __wait_on_buffer+0x16/0x18
> > [2423152.913162] [<ee04134a>] journal_commit_transaction+0x7dc/0xcae [jbd]
> > [2423152.913180] [<c0126799>] lock_timer_base+0x19/0x35
> > [2423152.913191] [<ee044054>] kjournald+0xbc/0x225 [jbd]
> > [2423152.913204] [<c012ec98>] autoremove_wake_function+0x0/0x2d
> > [2423152.913211] [<ee043f98>] kjournald+0x0/0x225 [jbd]
> > [2423152.913224] [<c012ebd5>] kthread+0x38/0x5f
> > [2423152.913231] [<c012eb9d>] kthread+0x0/0x5f
> > [2423152.913238] [<c010425f>] kernel_thread_helper+0x7/0x10
> >
> > Meanwhile the resync on md3 is pretty much stuck:
> >
> > md3 : active raid1 sda6[0] sdb6[3] sdc6[2] sdd6[1]
> >       955128384 blocks [4/4] [UUUU]
> >       [============>........]  check = 60.0% (573359936/955128384) finish=1258443.5min speed=5K/sec
> >
> > I tried killing the backup rsync, but without success so far (it ignores even
> > kill -9).
> >
> > It feels like something is deadlocked; the lvs command simply does not return
> > either (or, perhaps it takes more than 10 minutes...).
> >
> > I tried to bring down the vcs-noshell and builder domUs gracefully with xm
> > shutdown, but they are too locked up to respond to that.
> >
> > Unless you have a better idea, I think the best course of action would be to
> > reboot colonialone - we may have to xm destroy the running domUs first. I'm
> > a little worried about the lvm snapshot and potential filesystem corruption
> > from shutting down the domUs uncleanly.
> >
> > Restarting would also bring colonialone and its domUs up to the -21 kernel
> > packages which fixed a nasty dom0 kernel panic on heavy IO (we suffered from
> > that on another server).
> >
> > What do you think?
> >
> > Thanks,
> > Ward.
> >
> > --
> > Ward Vandewege <[email protected]>
> > Free Software Foundation - Senior Systems Administrator
> --
> Ward Vandewege <[email protected]>
> Free Software Foundation - Senior Systems Administrator
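
For anyone who wants to catch a stalled array check like the md3 one quoted above before the domUs lock up, here is a rough sketch in Python of parsing the progress line from /proc/mdstat. The helper names (parse_md_progress, looks_stalled) and the 1000 K/sec threshold are my own assumptions for illustration, not something from the Savannah setup, and the pattern only covers the line format shown in Ward's mail.

```python
import re

# Hypothetical helper, not part of any Savannah tooling: parse the progress
# line that /proc/mdstat prints while an md array is being checked or resynced.
PROGRESS_RE = re.compile(
    r"(?P<action>check|resync|recovery|reshape)\s*=\s*(?P<percent>[\d.]+)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish_min>[\d.]+)min\s*"
    r"speed=(?P<speed_k>\d+)K/sec"
)


def parse_md_progress(text):
    """Return a dict describing the md progress line in `text`, or None."""
    match = PROGRESS_RE.search(text)
    if match is None:
        return None
    return {
        "action": match.group("action"),
        "percent": float(match.group("percent")),
        "done_blocks": int(match.group("done")),
        "total_blocks": int(match.group("total")),
        "finish_minutes": float(match.group("finish_min")),
        "speed_k_per_sec": int(match.group("speed_k")),
    }


def looks_stalled(progress, min_speed_k=1000):
    """Assumed threshold: anything slower than min_speed_k K/sec counts as stalled."""
    return progress["speed_k_per_sec"] < min_speed_k


# The progress line from the md3 output quoted in the thread.
sample = (
    "[============>........]  check = 60.0% (573359936/955128384) "
    "finish=1258443.5min speed=5K/sec"
)
progress = parse_md_progress(sample)
```

Run against the quoted md3 line, looks_stalled(progress) flags the 5K/sec check as stalled; in practice the text would come from reading /proc/mdstat, and the threshold would need tuning per machine.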
