Hi All,

I've been working with AIX for over 10 years now, starting with AIX 4.1.2, and have always been impressed by the robustness of its mksysb process. However, recent versions have become less robust, and as you will see, the various changes can sometimes paint you into a corner.

This is not strictly a TSM related post, but I know that there are a lot of small AIX/TSM installations out there that might be relying on a mksysb restore as the cornerstone of a DR procedure.

The scenario is that this client has two P620-6M2 servers. One is the production SAP machine, the second is the Quality Assurance machine. QA has 2 CPUs instead of the 4 on prod, 1 fibre card instead of two, and 4GB of memory rather than 8GB. Both machines back up using AIX mksysb and savevg to one half of a 3582 autoloader over an LTO2 fibre connection. Both had rootvg mirrored on dual internal 146GB disks. The OS was AIX 5.2 at ML007, originally installed from ML002 CD media. Microcode was way out of date.

The test was to restore a mksysb of the prod system on the QA system, then restore the SAP data and bring up SAP.

I developed a process. The tape drive was fibre attached, so it was not bootable. The tape itself was also not bootable because the boot image was too large, a problem with AIX 5.2. So the plan was to boot from the install media, then use that to install from the mksysb image on the fibre attached tape drive.

This worked well at first: the CD booted OK, we could see the fibre drive, and the data restored. But when we got to the end of the process the installation looped with "process killed" messages.

A call was placed with IBM AIX support. It turns out that as of AIX 5.2, when you restore using a bootable CD, the CD has to be at the same or a later level than the system you are restoring; otherwise, results are unpredictable.

At this point we had blown away the QA system but were unable to restore prod. We were also unable to restore QA for the same reason (yes, I tried).

A bit more investigation turned up a procedure for creating a CD image on the current prod system that should be able to boot and allow the restore to proceed. As there was no burner on these machines, I created the image on prod, copied it to a Windows box and burnt it there.

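The level rule from IBM support (boot media at the same or a later level than the mksysb being restored) is easy to check before starting a restore. Here is a minimal sketch in Python. It assumes the levels are written the way `oslevel -r` prints them, e.g. "5200-07" for 5.2 at ML007; the function names are made up for illustration and are not part of any AIX tool.

```python
import re
from typing import Tuple


def parse_level(level: str) -> Tuple[int, int]:
    """Split a level string such as '5200-07' into (release, maintenance level).

    The 'RRRR-MM' format is an assumption based on 'oslevel -r' output.
    """
    match = re.fullmatch(r"(\d{4})-(\d{2})", level.strip())
    if match is None:
        raise ValueError(f"unrecognised level string: {level!r}")
    return int(match.group(1)), int(match.group(2))


def media_can_restore(media_level: str, mksysb_level: str) -> bool:
    """Return True if the boot media is at the same or a later level
    than the system image to be restored (hypothetical helper)."""
    return parse_level(media_level) >= parse_level(mksysb_level)


if __name__ == "__main__":
    # The situation in this post: ML002 install CDs, ML007 mksysb.
    print(media_can_restore("5200-02", "5200-07"))
    # The 52-008 upgrade CDs against the same mksysb.
    print(media_can_restore("5200-08", "5200-07"))
```

The comparison only encodes the rule as it was described to me; it says nothing about other causes of an unusable boot CD, such as the Atape issue.
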
The QA machine booted from the resultant CD, but after boot the only device that it could see was the CD drive. The tape drive was simply not visible.

Another call to AIX Support. It seems that the Atape driver that is required to use the LTO2 tape drive is developed by the storage people and not by the AIX people. It is non-standard, and when installed it changes the AIX ODM in a way that corrupts the generated recovery CD so that the tape drives are not visible, giving the symptoms that I saw.

At this point the client was getting anxious to have their machine back. Using the original 52-002 install CDs, I did a new OS install to one of the internal disk drives, including the alt_disk_install package, then used alt_disk_install to restore the old OS to the other internal disk. It would not boot from the new image.

This was now late Friday afternoon, and two full days had been expended in the process. Over the weekend the client attempted a restore using my original plan but got the same result as I did.

Monday morning I was back on site with a colleague who is also well versed in AIX. We decided to apply a microcode update to the machine that would allow it to boot from the larger boot image direct from tape. We brought with us another, SCSI attached LTO2 drive that we connected to the built-in SCSI bus on the machine.

The microcode update was very slow, and not helped by the fact that we needed to use floppies. No-one uses these any more, so it took some time to find them. Eventually the microcode was updated, but the machine would not boot from the tape, giving an error code that indicated I/O errors on the tape drive.

After a couple of attempts we went back to re-installing the operating system and performing an alt_disk_install. We had upgrade CDs to 52-008 on hand, so we upgraded to that level before doing the alt_disk_install process.

We could not read the tape on the SCSI attached drive, so eventually we went back to the fibre attached drive and the data restored.

This time the OS booted, but we could not connect to it. The machine has a video card and uses a standard LCD screen and keyboard to provide a console. Nothing was appearing on it, so we used a serial cable to connect to the serial port using hyperterm from the laptop. We could get a login prompt, but could not log in. This was how we ended day three.

Day four, we attended with a second SCSI attached tape drive and a 3151 terminal. Once the terminal was plugged in we could log on, then activate the lft console, and before long everything was back up and running.

Lessons:

1. Always keep a green screen terminal on hand - actually a null modem cable into hyperterm would have done the trick.
2. AIX 5.2 and above require you to generate a new boot CD after every maintenance update.
3. If you use the Atape driver, the generated boot CD is not usable. I would suggest that Atape users order a refresh of their install CDs with every maintenance update they do.
4. Test your recovery procedures before you need to. At least in this case production was not affected.

I hope all this helps someone.

Regards

Steve

Steven Harris
AIX and TSM Admin
Brisbane Australia
