"ADSM: Dist Stor Manager" <[email protected]> wrote on 12/22/2005 11:53:11 AM:
> In a MSCS cluster, an admin of one of our higher profile client machines
> failed over from one machine (OLALPHA) back to the other (OLBRAVO) after
> BRAVO crashed this morning.
>
> Since then I've been having a devil of a time with a MSCS cluster resource
> that serves as the scheduler for the cluster drive on BRAVO not coming
> up. To begin with, it posted ANS1835E, ANS1025E, ANS1570E, all of which
> point to authentication problems. I updated the node password, issued a
> 'q ses -optfile...', and it would authenticate fine. When I try to bring
> the cluster resource back online, it stays up for a few seconds, fails,
> and when I check the registry, the password has disappeared! What in
> the world? It has also posted ANS1029E and ANS2050E since I've been
> playing around trying to get the cluster resource to work, and also the
> base client (to back up C/D/system state) has been issuing ANS1977E with
> the "ccCreateTimerFile: Unable to create timer file" and "errno=13
> error: Permission denied".
>

It sounds like the services weren't set up properly from the start, or the service password somehow got out of sync. When setting up the services in a cluster, it is very important to set them up fully on each node and be sure they are working BEFORE setting up the service in the cluster manager.

I think your only solution is to remove the service from the cluster configuration, then remove and re-create the services on one node, restart the service several times, and make sure it works OK. Then fail over to the other node and repeat. Once you are sure both work, add the service back into the cluster and make sure the right registry key is set up to replicate during failover. Fail back and forth a couple of times to make sure all is working properly.

The big drawback here is that you will need to do this during downtime, when you can fail over between nodes quite a few times. That is why it is so important to ensure it is done right from the start.

Every time I have seen the disappearing password in a cluster, it was because the services weren't set up correctly, or weren't fully set up, before being configured in the cluster. In one rare case, special characters in the node password also caused a problem and the password wouldn't replicate properly. For this reason I always use only letters and numbers in cluster node passwords (no underscores, dashes, etc.).
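
If you want to check a candidate node password against that letters-and-numbers habit before you set it, something like the following would do. This is only a rough sketch of my own rule of thumb, not anything the TSM client or server enforces; the function names are made up for illustration, and I am assuming plain ASCII letters and digits are what you want to allow.

```python
import string

# Assumption: "letters or numbers" means ASCII a-z, A-Z and 0-9 only.
ALLOWED = set(string.ascii_letters + string.digits)


def is_cluster_safe_password(password):
    """Return True if the password is non-empty and purely alphanumeric."""
    return bool(password) and all(ch in ALLOWED for ch in password)


def offending_characters(password):
    """Return the distinct disallowed characters, in first-seen order."""
    seen = []
    for ch in password:
        if ch not in ALLOWED and ch not in seen:
            seen.append(ch)
    return seen
```

For example, `is_cluster_safe_password("node_pw-1")` is false, and `offending_characters("node_pw-1")` reports the underscore and the dash as the characters to drop.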
