Hi,

Today I saw the same behaviour as yesterday: at 23:00 cron started the backup job and the whole node immediately went out of order. Via the web interface I could still see the running VMs, but they were no longer reachable via RDP/SSH. vzctl enter did not work either.

Again there were kernel errors in the syslog.

After a hard reset we noticed a logical volume that had been created at the same time as the crash:

promo3:~# lvdisplay

--- Logical volume ---
  LV Path                /dev/promo3/vzsnap-promo3-0
  LV Name                vzsnap-promo3-0
  VG Name                promo3
  LV UUID                DswPod-t1lR-yKen-vwDH-sG5D-Djpl-wo9iSX
  LV Write Access        read/write
  LV Creation host, time promo3, 2013-03-21 23:00:02 +0100
  LV Status              available
  # open                 0
  LV Size                4,00 GiB
  Current LE             1024
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3


After deleting it via lvremove I started a manual backup (snapshot) for CTs and VMs - no problem occurred.
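
For reference, the removal was along these lines (LV path taken from the lvdisplay output above):

promo3:~# lvremove /dev/promo3/vzsnap-promo3-0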

Then I created a new cron backup job via the web interface for the same machine that had just been backed up successfully.

A few seconds after the start, the following errors occurred:

Mar 22 01:00:01 promo3 vzdump[12622]: <root@pam> starting task UPID:promo3:00003150:0004D259:514B9F01:vzdump::root@pam:
Mar 22 01:00:01 promo3 vzdump[12624]: INFO: starting new backup job: vzdump 306 --quiet 1 --mailto [email protected] --mode snapshot --compress lzo --storage p2_storage
Mar 22 01:00:01 promo3 vzdump[12624]: INFO: Starting Backup of VM 306 (openvz)
Mar 22 01:00:02 promo3 pmxcfs[4048]: [status] notice: received log
Mar 22 01:00:02 promo3 kernel: EXT3-fs: barriers disabled
Mar 22 01:00:02 promo3 kernel: kjournald starting. Commit interval 5 seconds
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): using internal journal
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148485
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148484
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148481
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148429
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148427
Mar 22 01:00:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70156303
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): 6 orphan inodes deleted
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): recovery complete
Mar 22 01:00:02 promo3 kernel: EXT3-fs (dm-3): mounted filesystem with ordered data mode
Mar 22 01:00:02 promo3 pmxcfs[4048]: [status] notice: received log
Mar 22 01:01:06 promo3 pvestatd[4511]: WARNING: command 'df -P -B 1 /mnt/pve/p3_storage' failed: got timeout
Mar 22 01:01:36 promo3 pvestatd[4511]: WARNING: command 'df -P -B 1 /mnt/pve/p3_storage' failed: got timeout
Mar 22 01:02:58 promo3 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Mar 22 01:03:05 promo3 kernel: Aborting journal on device dm-3.
Mar 22 01:03:05 promo3 kernel: Buffer I/O error on device dm-3, logical block 342819330
Mar 22 01:03:05 promo3 kernel: lost page write due to I/O error on dm-3
Mar 22 01:03:05 promo3 kernel: JBD: I/O error detected when updating journal superblock for dm-3.
Mar 22 01:03:05 promo3 kernel: EXT3-fs (dm-3): error: ext3_journal_start_sb: Detected aborted journal
Mar 22 01:03:05 promo3 kernel: EXT3-fs (dm-3): error: remounting filesystem read-only
Mar 22 01:03:09 promo3 kernel: EXT3-fs (dm-3): error: ext3_put_super: Couldn't clean up the journal
Mar 22 01:03:10 promo3 vzdump[12624]: ERROR: Backup of VM 306 failed - command '(cd /mnt/vzsnap0/private/306;find . '(' -regex '^\.$' ')' -o '(' -type 's' -prune ')' -o -print0|sed 's/\\/\\\\/g'|tar cpf - --totals --sparse --numeric-owner --no-recursion --one-file-system --null -T -|lzop) >/mnt/pve/p2_storage/dump/vzdump-openvz-306-2013_03_22-01_00_01.tar.dat' failed: exit code 2
Mar 22 01:03:10 promo3 vzdump[12624]: INFO: Backup job finished with errors
Mar 22 01:03:12 promo3 citadel: 1 unique messages to be merged
Mar 22 01:03:12 promo3 citadel: 1 unique messages to be merged
Mar 22 01:03:12 promo3 vzdump[12624]: job errors
Mar 22 01:03:12 promo3 vzdump[12622]: <root@pam> end task UPID:promo3:00003150:0004D259:514B9F01:vzdump::root@pam: job errors
Mar 22 01:03:12 promo3 /USR/SBIN/CRON[12617]: (CRON) error (grandchild #12620 failed with exit status 255)

This time the backup did not crash the whole node, but it failed.
lvdisplay also showed the snapshot LV again during the failed backup:

--- Logical volume ---
  LV Path                /dev/promo3/vzsnap-promo3-0
  LV Name                vzsnap-promo3-0
  VG Name                promo3
  LV UUID                UkxdQW-GGM7-raEO-MSxS-k9jZ-s0D1-g2ZO9M
  LV Write Access        read/write
  LV Creation host, time promo3, 2013-03-22 01:00:01 +0100
  LV snapshot status     active destination for data
  LV Status              available
  # open                 1
  LV Size                2,55 TiB
  Current LE             669651
  COW-table size         4,00 GiB
  COW-table LE           1024
  Allocated to snapshot  60,92%
  Snapshot chunk size    4,00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
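
The "Unable to allocate exception" message together with the 4,00 GiB COW table makes me suspect the snapshot simply fills up while the backup is running. If so, a bigger snapshot might help. As far as I know vzdump takes a size option for the LVM snapshot (in MB) - untested on my side:

size: 8192

in /etc/vzdump.conf, or directly on the command line:

promo3:~# vzdump 306 --mode snapshot --size 8192 --compress lzo --storage p2_storage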


I added a new backup cron job a few minutes later. It started, and I tried to have a look at the LVs - but lvdisplay did not answer. I started a new SSH session and tried lvscan - it did not answer during the backup either. CTRL-C ended with:

promo3:~# lvscan
^C  CTRL-c detected: giving up waiting for lock
/run/lock/lvm/V_promo3: flock failed: Unterbrechung während des Betriebssystemaufrufs ("Interrupted system call")
  Can't get lock for promo3
  Skipping volume group promo3
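
To see which process is actually holding the VG lock in such a state, something like this might help next time (fuser is from the psmisc package; the lock path is taken from the lvscan output above):

promo3:~# fuser -v /run/lock/lvm/V_promo3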

The syslog from this run:

Mar 22 01:08:01 promo3 /USR/SBIN/CRON[13592]: (root) CMD (vzdump 306 --quiet 1 --mode snapshot --mailto [email protected] --compress lzo --storage p2_storage)
Mar 22 01:08:02 promo3 vzdump[13593]: <root@pam> starting task UPID:promo3:0000351B:00058E23:514BA0E2:vzdump::root@pam:
Mar 22 01:08:02 promo3 vzdump[13595]: INFO: starting new backup job: vzdump 306 --quiet 1 --mailto [email protected] --mode snapshot --compress lzo --storage p2_storage
Mar 22 01:08:02 promo3 vzdump[13595]: INFO: Starting Backup of VM 306 (openvz)
Mar 22 01:09:54 promo3 rrdcached[4027]: flushing old values
Mar 22 01:09:54 promo3 rrdcached[4027]: rotating journals
Mar 22 01:09:54 promo3 rrdcached[4027]: started new journal /var/lib/rrdcached/journal//rrd.journal.1363910994.615643
Mar 22 01:11:07 promo3 kernel: ct0 nfs: server 10.1.0.3 not responding, still trying
Mar 22 01:11:22 promo3 kernel: INFO: task nfsd:3957 blocked for more than 120 seconds.
Mar 22 01:11:22 promo3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 22 01:11:22 promo3 kernel: nfsd D ffff880879f2d1e0 0 3957 2 0 0x00000000
Mar 22 01:11:22 promo3 kernel: ffff880879f2f900 0000000000000046 ffff8808619a4fc0 0000000000000001
Mar 22 01:11:22 promo3 kernel: 00000000000005a8 ffff88087bb77e00 0000000000000080 0000000000000004
Mar 22 01:11:22 promo3 kernel: ffff880879f2f8d0 ffffffff81182e1b ffff880879f2d7a8 000000000001e9c0
Mar 22 01:11:22 promo3 kernel: Call Trace:
Mar 22 01:11:22 promo3 kernel: [<ffffffff81182e1b>] ? cache_flusharray+0xab/0x100
Mar 22 01:11:22 promo3 kernel: [<ffffffff810974ee>] ? prepare_to_wait+0x4e/0x80
Mar 22 01:11:22 promo3 kernel: [<ffffffff8119ad69>] __sb_start_write+0x169/0x1a0
Mar 22 01:11:22 promo3 kernel: [<ffffffff81097200>] ? autoremove_wake_function+0x0/0x40
Mar 22 01:11:22 promo3 kernel: [<ffffffff81127489>] generic_file_aio_write+0x69/0x100
Mar 22 01:11:22 promo3 kernel: [<ffffffff81127420>] ? generic_file_aio_write+0x0/0x100
Mar 22 01:11:22 promo3 kernel: [<ffffffff8119872b>] do_sync_readv_writev+0xfb/0x140
Mar 22 01:11:22 promo3 kernel: [<ffffffff811b3e40>] ? iput+0x30/0x70
Mar 22 01:11:22 promo3 kernel: [<ffffffff81097200>] ? autoremove_wake_function+0x0/0x40
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0361e70>] ? nfsd_acceptable+0x0/0x120 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffff81198557>] ? rw_copy_check_uvector+0x97/0x120
Mar 22 01:11:22 promo3 kernel: [<ffffffff81199696>] do_readv_writev+0xd6/0x1f0
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0361ff2>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0571e99>] ? vzquota_qlnk_destroy+0x29/0x110 [vzdquota]
Mar 22 01:11:22 promo3 kernel: [<ffffffff811997f8>] vfs_writev+0x48/0x60
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0363a25>] nfsd_vfs_write+0x115/0x480 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0365cbb>] ? nfsd_open+0x23b/0x2c0 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa0366107>] nfsd_write+0xe7/0x100 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa036e1df>] nfsd3_proc_write+0xaf/0x140 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa035e52e>] nfsd_dispatch+0xfe/0x240 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa027e174>] svc_process_common+0x344/0x650 [sunrpc]
Mar 22 01:11:22 promo3 kernel: [<ffffffff8105a620>] ? default_wake_function+0x0/0x20
Mar 22 01:11:22 promo3 kernel: [<ffffffffa027e7b2>] svc_process+0x102/0x150 [sunrpc]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa035ee5d>] nfsd+0xcd/0x180 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffffa035ed90>] ? nfsd+0x0/0x180 [nfsd]
Mar 22 01:11:22 promo3 kernel: [<ffffffff81096c26>] kthread+0x96/0xa0
Mar 22 01:11:22 promo3 kernel: [<ffffffff8100c1aa>] child_rip+0xa/0x20
Mar 22 01:11:22 promo3 kernel: [<ffffffff81096b90>] ? kthread+0x0/0xa0
Mar 22 01:11:22 promo3 kernel: [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
Mar 22 01:11:22 promo3 kernel: INFO: task nfsd:3958 blocked for more than 120 seconds.
Mar 22 01:11:22 promo3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 22 01:11:22 promo3 kernel: nfsd D ffff880879f2c700 0 3958 2 0 0x00000000
Mar 22 01:11:22 promo3 kernel: ffff88087aa97900 0000000000000046 ffff88087aa978a0 ffff881064c4b000
Mar 22 01:11:22 promo3 kernel: ffff8808619a4fc0 ffff88087bb77e00 ffff88086de62bf8 0000000000000020
Mar 22 01:11:22 promo3 kernel: ffff88087aa978d0 ffffffff81182e1b ffff880879f2ccc8 000000000001e9c0



During the job, listing the mounted PVE storage did not work either:

promo3:~# ls /mnt/pve/

The command simply hung.
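
To avoid hanging the shell next time, a bounded check should at least show whether the mount is wedged (timeout is from coreutils; note that with a hard NFS mount the child can still get stuck in D state):

promo3:~# timeout 10 ls /mnt/pve/p2_storage || echo "storage not responding"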

All machines on the node were inaccessible again.

I rebooted into kernel 2.6.32-17.

I entered a new cron job and the backup worked as it had for the past 12 months. Here is the syslog:

Mar 22 02:22:01 promo3 /USR/SBIN/CRON[3738]: (root) CMD (vzdump 306 --quiet 1 --mode snapshot --compress lzo --storage p2_storage)
Mar 22 02:22:01 promo3 vzdump[3739]: <root@pam> starting task UPID:promo3:00000E9D:00007A5A:514BB239:vzdump::root@pam:
Mar 22 02:22:01 promo3 vzdump[3741]: INFO: starting new backup job: vzdump 306 --quiet 1 --mailto [email protected] --mode snapshot --compress lzo --storage p2_storage
Mar 22 02:22:01 promo3 vzdump[3741]: INFO: Starting Backup of VM 306 (openvz)
Mar 22 02:22:02 promo3 kernel: EXT3-fs: barriers disabled
Mar 22 02:22:02 promo3 kernel: kjournald starting. Commit interval 5 seconds
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): using internal journal
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148484
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148481
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148465
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148429
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70148427
Mar 22 02:22:02 promo3 kernel: ext3_orphan_cleanup: deleting unreferenced inode 70156303
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): 6 orphan inodes deleted
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): recovery complete
Mar 22 02:22:02 promo3 kernel: EXT3-fs (dm-3): mounted filesystem with ordered data mode
Mar 22 02:22:30 promo3 ntpd[1884]: Listen normally on 38 veth306.0 fe80::c04e:44ff:fe61:ecfe UDP 123
Mar 22 02:25:49 promo3 vzdump[3741]: INFO: Finished Backup of VM 306 (00:03:48)
Mar 22 02:25:49 promo3 vzdump[3741]: INFO: Backup job finished successfully
Mar 22 02:25:49 promo3 citadel: 1 unique messages to be merged
Mar 22 02:25:49 promo3 citadel: 1 unique messages to be merged
Mar 22 02:25:49 promo3 vzdump[3739]: <root@pam> end task UPID:promo3:00000E9D:00007A5A:514BB239:vzdump::root@pam: OK
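
As a stopgap I will keep booting the older kernel. If anyone needs it: the matching menu entry can be found with

promo3:~# grep ^menuentry /boot/grub/grub.cfg

and then set as GRUB_DEFAULT in /etc/default/grub, followed by update-grub (entry names and indexes depend on the installed images, so please check your own grub.cfg).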


Any hints on how to prevent this issue with kernel 2.6.32-19?

Regards, Martin


_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
