Your failback concern is definitely valid. We may be able to add an automatic pairresync to our failover script: either doing a pairresync -swaps during the failover away from the failed primary node (if this command can succeed without access to the remote horcm), or doing a pairresync -swapp when the device group is moved back to the original, recovered primary (I've never played with this variant of swap). Give me a little time to play with this and see if I can put together code that is reliable enough (if you have any experience with, or input on, the options above, it would certainly help).
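For what it's worth, here is roughly the decision I'd want the failover script to make, sketched as a shell function. The states are the standard CCI pair statuses (SSWS meaning the secondary took over via a swap takeover); how we'd actually extract the status from pairdisplay output, and the device-group name, are left out here as they'd be site-specific:

```shell
# Sketch only -- maps a CCI pair status string (as reported by something
# like "pairdisplay -g <group> -l -CLI") to the resync action the
# failover script would attempt. Parsing pairdisplay itself is omitted.
decide_resync() {
  case "$1" in
    SSWS)      echo "pairresync -swaps" ;;  # swap takeover happened; reverse the pair
    PSUS|PSUE) echo "pairresync" ;;         # suspended or errored; plain resync
    PAIR|COPY) echo "none" ;;               # already in sync or syncing
    *)         echo "manual" ;;             # anything else needs a human
  esac
}
```

The script would then run the printed command with -g <devicegroup> after the takeover completes; whether -swaps can actually succeed while the old primary's horcm is unreachable is exactly what I need to test.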
Maria, can I snag a TrueCopy cluster to play with this? This is the issue that you and I discussed at length; we pretty much decided TrueCopy was working as intended, but we are seeing this time and time again, so I may want to put in some auto-recovery if possible.

stephen

On 07/08/09 02:55, Sergei Kolodka wrote:
> Stephen, thanks for your definitive answer. I'm hoping I can get the same
> definitive answer from Sun support for the logged call, unless it's from you,
> of course ;-)
>
> To be honest, I'd like this to be in the Sun Cluster documentation somewhere,
> in bold, large font, for two reasons.
>
> The first reason: in the company I work for, for example, we have a storage
> team, I'm not really allowed to touch the SAN, and I know little about it,
> and I know quite a few large companies and government departments that work
> the same way. The couple of certified Sun Cluster admins we asked about this
> problem knew little about it and had never seen this behaviour before, and
> the person who designed and built the cluster had no idea a pairresync must
> be done after each failover. If this were in the SC manual, or the SC manual
> at least referenced the TC manual for this particular case, it would greatly
> help in troubleshooting this issue/feature.
> In fact, I just searched for +pairresync +swaps on Sun's SC 3.2
> documentation web site, and surprisingly all 21 results relate to the
> Geographic Edition, which is not quite the same as the usual Sun Cluster;
> there's basically not much information in the SC manual about
> troubleshooting and resolving problems with non-geo SC + TC.
>
> The second reason matters much more to me: the possible consequences of not
> doing a pairresync after failover. The first time we encountered this
> problem, we did not check the pair status; we booted the original node and
> flipped the resource groups back without doing a pairresync, and as a result
> completely locked TrueCopy, with no hope of doing a pairresync, either
> -swaps or -swapp.
> The storage admins had to split the pair in their own SAN interface, and
> only then were we able to start our cluster. This happened, by accident,
> four weeks before going into production and took a whole day to resolve
> because no one knew what had happened; had it been production, it would have
> cost us a few million dollars for every hour of downtime. One page in the SC
> manual stating that admins must always do a pairresync after failover, with
> at least a reference to the TC manual, could easily save that money, and at
> least five years of a miserable admin's life in a real disaster.
>
> There is another, very much related question I would like to find the answer
> to. Right now we have Failback set to False for all our RGs. If you could
> shed some light on what will happen if Failback is set to True, i.e. the
> first node reboots and the RGs start moving back before the admins have a
> chance to do a pairresync, I'd greatly appreciate it.
>
> Once more, thank you for your help,
> Sergei
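Until we have a definitive answer on Failback=True, one interim safeguard would be to gate any failback on the pair actually being back in sync. A minimal sketch, assuming a wrapper around whatever moves the RG back (the helper name current_pair_state and the group/RG/node names are hypothetical; PAIR is the standard CCI "in sync" status):

```shell
# Sketch only: refuse a failback unless the pair is in the PAIR state.
# The caller is assumed to have extracted the current CCI status string
# (e.g. from pairdisplay output) before switching the RG back.
safe_to_failback() {
  [ "$1" = "PAIR" ]
}

# Hypothetical wrapper: only switch the RG back when the pair is in sync.
# if safe_to_failback "$(current_pair_state oradg)"; then
#   scswitch -z -g ora-rg -h node1
# else
#   echo "pair not in PAIR state; run pairresync first" >&2
# fi
```

This doesn't help if Failback=True moves the RG automatically, which is why I'd keep Failback at False until the auto-resync piece is proven.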