Hi Gents,

For now Daan and I have put in a temporary fix for 4.1.1 (and the same for 4.0) that delays the deletion of the files until a day after their create time. This means that snapshots still in progress are not deleted, unless they take more than a day to make.

We put the fix in place because we have seen two production hypervisors collapse in the past two days. The scavenger removed the files, tapdisk/sparse_dd went bananas, and over a prolonged period that hit something with NFS in the kernel that the kernel didn't like.

Besides this, we are going to figure out what we hit in the kernel. We hope the next update cycle contains a patch for it; if not, we'll have to conjure one up.
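
A minimal sketch of the age check behind this temporary fix, written in Java for illustration only; it is not the actual patch. The class name SnapshotCleanupGuard, both method names, and the GRACE_PERIOD constant are hypothetical. The sketch also assumes the file system reports a usable creation time; where it does not, Java returns an implementation-specific default, typically the last-modified time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of CloudStack: decides whether the cleanup
// may delete a file, based only on how long ago the file was created.
final class SnapshotCleanupGuard {

    // Assumed grace period: one day, as described in the message above.
    static final Duration GRACE_PERIOD = Duration.ofDays(1);

    // Pure check: true only when at least `grace` has passed since `created`.
    static boolean isOldEnoughToDelete(Instant created, Instant now, Duration grace) {
        return Duration.between(created, now).compareTo(grace) >= 0;
    }

    // Reads the creation time from the file system and applies the check.
    // If the file system does not record a creation time, the JDK returns
    // an implementation-specific default (often the last-modified time).
    static boolean mayDelete(Path file, Instant now) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        return isOldEnoughToDelete(attrs.creationTime().toInstant(), now, GRACE_PERIOD);
    }
}
```

A cleanup loop would call `SnapshotCleanupGuard.mayDelete(path, Instant.now())` for each candidate file and skip any file for which it returns false, so a backup that is still being written is left alone for its first day.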

Cheers,
Funs

-----Original Message-----
From: Joris van Lieshout
Sent: Friday 20 September 2013 14:51
To: 'min.c...@citrix.com'; 'sudi...@gmail.com'
Cc: 'Daan Hoogland (daan.hoogl...@gmail.com)'; Funs Kessen; 'dev@cloudstack.apache.org'; 'Hugo Trippaers'
Subject: FW: (CLOUDSTACK-692) The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being copied to secondary storage

Hi Min and Edison,

I hope you don't mind me addressing you directly. I see that you two have done most of the work on the Snapshot parts of CS. We've been having production-impacting issues due to the bug I tried to describe below (and in ticket 692). Yes, it's my first time engaging in the community, so I hope I've taken the right approach. :)

I've also done some digging around in the CS 4.0, 4.1 and 4.2 code bases and see that large parts of the Snapshot process have been revised in 4.2. The issues we have been having were with the 4.0 and 4.1 code bases and, more particularly, were due to "snapshot ... is not recorded in DB, remove it" in CleanupSnapshotBackupCommand of NfsSecondaryStorageResource.java. Because CleanupSnapshotBackupCommand was removed in commit https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=commit;h=27133fba7daefcea6ddba943efb9c96f23dacef2, I wonder whether this bug has therefore also been solved?

Thanks in advance.

Kind regards,
Joris van Lieshout

-----Original Message-----
From: Joris van Lieshout [mailto:jvanliesh...@schubergphilis.com]
Sent: Tuesday 17 September 2013 15:56
To: 'dev@cloudstack.apache.org'
Subject: (CLOUDSTACK-692) The CleanupSnapshotBackup process on SSVM deletes snapshots that are still in the process of being copied to secondary storage

Hi there,

I was wondering if anyone can help us with this issue. There seems to be a situation where the CleanupSnapshotBackup process deletes vhd files belonging to an active BackupSnapshot process.
I've created CLOUDSTACK-692 for it and logged as much info as possible, including the steps I use to clean up the mess after we have hit this. We have seen it happen in CS 4.0 and 4.1.1, and from what I have seen in the code it probably also exists in 4.2.

I haven't reproduced the issue in a lab because we are hitting it quite often in production and UAT, so I have all the examples I need. :) But I guess the best way to reproduce it is to create a VM with quite some I/O activity (so snapshots will be big), enable hourly snapshots, and shorten the storage.cleanup.interval global setting so the cleanup process gets triggered more frequently.

We are hitting this on XenServer 6.0.2, but if this snapshot cleanup and backup process is generally the same across other hypervisor types, I would imagine this is relevant for those as well...

Kind regards,
Joris van Lieshout