Hi All, 

I've seen some stuff I don't understand with a reclaim stg command and I'd like anyone 
who does understand to enlighten me.

TSM server is 4.2.3.3 on AIX 5.1 RML03, running in a p690 LPAR.  All disk is 
FC-attached to IBM ESS arrays; disk stgpools are not mirrored, but the DB and log use 
AIX mirroring.  All TSM files are on filesystems; most are JFS, but newer stgpool data 
is on JFS2.

We are very cautious here.  The TSM LPAR has two HBAs connected to separate SAN 
fabrics, which connect via multiple paths to two ESSes.
Despite that, yesterday all four paths to one ESS dropped out together.  Nothing much 
was happening in TSM at the time, so we resynched the disks and continued.  However, 
because of unrelated issues hanging over from the weekend, one of our disk pools was 
99% full, so we decided to migrate all our data early.  During the migration the same 
disk dropped out again.  Some of the stgpool being migrated was on this disk and not 
mirrored, and TSM gave repeated messages about errors reading the disk until it was 
brought back online to AIX, at which point the errors stopped and the migration 
continued, finishing with a FAILURE notification.

Afterward there was no data left to migrate, but the diskpool and some of its volumes 
weren't empty.  Accordingly, I ran an AUDIT VOL FIX=YES against one of the affected 
volumes.  This went OK, but on the second volume the TSM server died with an error 
while attempting a rollback, and would not restart.

Since TSM was down, we ran fsck on all the affected filesystems, and they all came up 
clean.

So, now we're in trouble and need to restore the DB.  We do that, and the roll-forward 
of the log fails with the same error attempting the rollback.
So, now we're in DEEP trouble, and just do a restore to the point in time of the last 
backup.
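
For reference, the two restore attempts amount to something like the following, run 
with the server halted.  This is a sketch only; the exact syntax is in the 4.2 Admin 
Reference, and the date/time shown is illustrative, not our real backup time:

```
# First attempt: roll-forward restore to the most current state
# (this is the one that failed while re-applying the log for us)
dsmserv restore db

# Fallback: point-in-time restore to the last DB backup
# (date and time here are illustrative)
dsmserv restore db todate=07/28/2003 totime=09:30
```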

<aside>
We run a normal sort of backup pattern.  Most data goes to diskpools overnight.  In 
the morning diskpools are backed up, then tapepools are backed up, a DB backup is 
taken, expiry is run and tapes are ejected.  In the evening the same 
diskpool/tapepool/db backup sequence is run, although it normally doesn't do much, and 
then diskpools are force migrated to tape.
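
In admin-schedule terms, the morning cycle is roughly the following macro.  The pool 
and device class names here are illustrative, not our real ones:

```
/* morning cycle: copy disk, copy tape, back up DB, expire */
backup stgpool DISKPOOL COPYPOOL
backup stgpool TAPEPOOL COPYPOOL
backup db devclass=TAPECLASS type=full
expire inventory
/* the evening run additionally forces migration to tape */
update stgpool DISKPOOL highmig=0 lowmig=0
```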

Our problems happened after the morning backup cycle.
</aside>

According to the admin guide we must now run AUDIT VOL FIX=yes on all our diskpool 
volumes.  This takes 4 hours and reports huge numbers of missing files.
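
Scripted from the AIX side, that audit pass looks something like this.  The admin ID, 
password, and volume names are illustrative:

```
# loop the mandated AUDIT VOL FIX=YES over every diskpool volume
for v in /tsm/stg/vol01.dsm /tsm/stg/vol02.dsm
do
    dsmadmc -id=admin -password=secret "audit volume $v fix=yes"
done
```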

Next we run a reclaim stg on one diskpool, and it finishes neatly without mounting a 
tape.
The second pool calls for a tape that was not created in the last 24 hours.  Hmm, 
defer that one - this pool isn't important.
The third and final pool is the one that was in the middle of its migration at the 
crash.  This calls for some really strange tapes.  Eventually we produce a list that 
is one-third of the offsite tape pool, including some tapes that were last written at 
the end of June.

Eventually, we mark all of the volumes in the diskpools as destroyed, then rename the 
underlying files.  We add these renamed files back into the diskpools as new volumes, 
enable sessions, and we are in business again, able to run restore stg at our leisure 
the following day.
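
Sketched per volume, with illustrative names, the recovery sequence was roughly the 
following (the rename happens at the AIX prompt between the TSM commands, and on some 
server levels the enable command is ENABLE SESSIONS rather than plain ENABLE):

```
/* mark the original volume destroyed */
update volume /tsm/stg/vol01.dsm access=destroyed
/* at the AIX prompt: mv /tsm/stg/vol01.dsm /tsm/stg/vol01.sav */
/* add the renamed, already-formatted file back as a new volume */
define volume DISKPOOL /tsm/stg/vol01.sav
/* re-enable client sessions, then recover from the copy pool */
enable
restore stgpool DISKPOOL
```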

OK, so the first question is :-

When TSM couldn't read the diskpool volumes in the first place I would have expected 
it to immediately mark them as off-line and stop using them, but it didn't. Why? Under 
what circumstances do volumes go off-line?


Second question.

Restoring diskpools seems strange.  There are two possibilities.  If a diskpool is 
"cleared" by migration, then the data is unavailable after the DB restore to a 
previous point in time, but the restore stg should only refer to tapes created in the 
most recent backup stg operation.  
On the other hand, if the diskpool is not "cleared" by migration, but rather the data 
is left in place and "forgotten", then only files that are overwritten by new data 
after the DB restore point should be damaged and need restoration.  The rest should 
just magically reappear when their DB references are restored.

I can't think of any other possibilities.

Sorry to have been so detailed, but I wanted you all to have the full story.  The 
whole concept of having to restore data from 85 tapes after a two-second outage is 
extremely worrying.  Having to get significant numbers of tapes back from off-site 
storage to do this in a hurry is even more so.

Thanks

Steve Harris
AIX and TSM Admin
Queensland Health, Brisbane Australia.



