Until fairly recently my employer had one production TSM server with a DR plan involving a commercial hot site provider. We now have multiple production servers at multiple locations and a DR plan using our own facilities. We think we understand how recovery from a real disaster would work, but we are having trouble figuring out how to run DR tests in the new environment.

We have two IBM z10 systems in different locations. Each has a Linux image for hosting the production TSM instance or instances for its location. One site has a single TSM instance for storing client data. The other site has a client data storage instance, which also serves as a configuration manager for both sites, and a small instance configured as a library manager. Each site has a primary pool on sequential disk files and a copy pool on tape volumes in a shared tape library at a third location.

We make extensive use of server-to-server communications. I have already mentioned configuration management and library sharing. We use virtual volumes to send recovery plans between locations and to send library manager database backups to the other location. We have command routing between any pair of servers available for the convenience of the TSM administrators.

Our DR plan involves a standby Linux image at each location. Each standby image will have empty versions of the instance or instances from the other location installed and ready for database restores. We would like to be able to test the database restore process while all the production instances are active. We are prepared to suspend normal tape activity during DR tests. We would like to be able to run test client restores from the recreated instances.

We are a bit nervous about the idea of two TSM instances on different Linux images with the same server name and with both configured for communications with other servers. One of the options we are considering is to execute a 'set servername' command as soon as possible after a TSM database restore to eliminate the server name collisions as quickly as possible. We have already thought of several complications that would result from this approach. We would need to execute some 'define server' commands. We would need to change the ownership of tape volumes used during tests.
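
To make the renaming idea concrete, here is a rough sketch of the kind of command list we have in mind for the first minutes after a test restore. It is only an illustration: the server names, password, address, and port are made-up placeholders, the helper function is hypothetical, and it covers only the 'set servername' and 'define server' steps, not the tape ownership changes. The script just builds the command text; it does not talk to a TSM server.

```python
def post_restore_commands(new_name, peers):
    """Build admin commands to run right after a test database restore.

    new_name: temporary name for the restored instance (placeholder).
    peers: dict mapping a peer server name to a tuple of
           (password, high-level address, low-level address/port).
           All values here are illustrative assumptions.
    """
    # Rename first, so the restored copy stops sharing a name with production.
    commands = [f"set servername {new_name}"]
    # Then define the renamed test peers this instance should talk to.
    for name, (password, address, port) in peers.items():
        commands.append(
            f"define server {name} serverpassword={password} "
            f"hladdress={address} lladdress={port}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical test names; none of these exist in our environment.
    test_peers = {"DRTESTLM": ("changeme", "10.0.0.5", 1500)}
    for command in post_restore_commands("DRTESTA", test_peers):
        print(command)
```

The output could be pasted into an administrative session or saved as a macro. The ordering is the point: the rename comes before anything that could trigger server-to-server traffic.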
In some cases we would also need to update a device configuration file to support a TSM database restore using a renamed library manager and then update the library definition in the restored database.

We would appreciate any advice or warnings from TSM administrators who have run DR tests in environments similar to ours.

Thomas Denier
Thomas Jefferson University Hospital
