Hi All, I've seen some stuff I don't understand with a restore stg command, and I'd like anyone who does understand it to enlighten me.
TSM server is 4.2.3.3 on AIX 5.1 RML03 running in a p690 LPAR. All disk is FC-attached to IBM ESS arrays. Disk stgpools are not mirrored, but the DB and log use AIX mirroring. All TSM files are on filesystems; most are JFS, but newer stgpool data is on JFS2.

We are very cautious here. The TSM LPAR has two HBAs connected to separate SAN fabrics, connected via multiple paths to two ESSes. Despite that, yesterday all four paths to one ESS dropped out together. Nothing much was happening in TSM at the time, so we resynched the disks and continued.

However, because of unrelated issues hung over from the weekend, one of our disk pools was 99% full, so we decided to migrate all our data early. During the migration the same disk dropped out again. Some of the stgpool being migrated was on this disk and not mirrored, and we got repeated error messages about errors reading the disk until the disk was brought back online to AIX, at which point the errors stopped and the migration continued, finishing with a FAILURE notification.

Afterward there was no data to be migrated, but the diskpool and some of its volumes weren't empty. Accordingly I ran an AUDIT VOL FIX=yes against one of the affected volumes. This went OK, but on the second volume the TSM server died with an error attempting rollback and would not restart. Since TSM is down, we run fsck on all the affected filesystems and they are all clean.

So, now we're in trouble and need to restore the DB. We do that and the redo of the log fails with the same error attempting the rollback. So, now we're in DEEP trouble and just do the restore to point in time of the last backup.

<aside> We run a normal sort of backup pattern. Most data goes to diskpools overnight. In the morning diskpools are backed up, then tapepools are backed up, a DB backup is taken, expiry is run and tapes are ejected.
In the evening the same diskpool/tapepool/db backup sequence is run, although it normally doesn't do much, and then diskpools are force migrated to tape. Our problems happened after the morning backup cycle. </aside>

According to the admin guide we must now run AUDIT VOL FIX=yes on all our diskpool volumes. This takes 4 hours and reports huge numbers of missing files. Next we run a restore stg on one diskpool and it finishes neatly, without mounting a tape. The second pool calls for a tape that was not created in the last 24 hours. Hmm, defer that; this pool isn't important. The third and final pool is the one that was in the middle of its migrate at the crash. This calls for some really strange tapes. Eventually we produce a list that is 1/3 of the offsite tape pool, including some tapes that were last written at the end of June.

Eventually, we mark all of the volumes in the diskpools as destroyed, then rename the underlying files. We add these renamed files back in to the diskpools as new volumes, enable sessions and we are in business again, able to run restore stg at our leisure the following day.

OK, so the first question is: when TSM couldn't read the diskpool volumes in the first place, I would have expected it to immediately mark them as off-line and stop using them, but it didn't. Why? Under what circumstances do volumes go off-line?

Second question: restoring of diskpools seems strange. There are two possibilities.

If a diskpool is "cleared" by a migration, then the data is unavailable after the DB restore to previous point in time, but the restore stg should only refer to tapes created in the most recent backup stg operation.

On the other hand, if the diskpool is not "cleared" by migration, but rather the data is left in place and "forgotten", then only files that are overwritten by new data after the DB restore point should be damaged and need restoration. The rest should just magically reappear when their DB references are restored.
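The two possibilities above can be written out as a toy model in Python. This is only a sketch of the reasoning, not a description of how TSM tracks data internally: the function names, the file names and the tape labels COPY01/COPY02 are all hypothetical, and both tapes are assumed to have been written by the most recent backup stg run.

```python
# Toy model of the two hypotheses; NOT how TSM works internally.

def files_to_restore(on_disk_at_backup, migrated_after, overwritten_after, model):
    """Files a storage pool restore would need to fetch from the copy pool."""
    if model == "cleared":
        # Hypothesis 1: migration erases the data from the disk volume,
        # so every file migrated after the DB backup is gone from disk.
        lost = migrated_after
    elif model == "forgotten":
        # Hypothesis 2: migration only drops the DB reference; the data
        # survives on disk until new data overwrites it.
        lost = migrated_after & overwritten_after
    else:
        raise ValueError(f"unknown model: {model}")
    # Only files the restored DB still places in the disk pool matter.
    return on_disk_at_backup & lost


def tapes_to_mount(files, copy_tape_of):
    """Copy pool tapes holding the backups of the given files."""
    return {copy_tape_of[f] for f in files}


# Hypothetical state at the DB restore point: four files in the disk pool,
# all backed up that morning to two copy pool tapes.
on_disk = {"a", "b", "c", "d"}
copy_tape_of = {"a": "COPY01", "b": "COPY01", "c": "COPY02", "d": "COPY02"}

# Hypothetical activity between the DB backup and the crash.
migrated = {"a", "b", "c"}
overwritten = {"a"}

cleared = files_to_restore(on_disk, migrated, overwritten, "cleared")
forgotten = files_to_restore(on_disk, migrated, overwritten, "forgotten")

print(sorted(cleared), sorted(tapes_to_mount(cleared, copy_tape_of)))
print(sorted(forgotten), sorted(tapes_to_mount(forgotten, copy_tape_of)))
```

Under either hypothesis the model only ever asks for tapes from the latest backup stg run, which is why a request for a third of the offsite pool fits neither.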
I can't think of any other possibilities.

Sorry to have been so detailed, but I wanted you all to have the full story. The whole concept of having to restore data from 85 tapes after a two-second outage is extremely worrying. Having to get significant numbers of tapes back from off-site storage to do this in a hurry is even more so.

Thanks

Steve Harris
AIX and TSM Admin
Queensland Health, Brisbane Australia.
