Your failback concern is definitely valid. We may be able to add an automatic pairresync to our failover script: either doing a pairresync -swaps during the failover away from the failed primary node (if this command can succeed without access to the remote horcm), or doing a pairresync -swapp when the device group is moved back to the original, recovered primary (I've never played with this variant of swap). Give me a little time to play with this and see if I can put together code that is reliable enough (if you have any experience with, or input on, the options above, it would certainly help).
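For what it's worth, here is roughly the decision I'd want the failover script to make, sketched as a shell function. The states are the standard CCI pair statuses (SSWS meaning the secondary took over via a swap takeover); how we'd actually extract the status from pairdisplay output, and the device-group name, are left out here as they'd be site-specific:

```shell
# Sketch only -- maps a CCI pair status string (as reported by something
# like "pairdisplay -g <group> -l -CLI") to the resync action the
# failover script would attempt. Parsing pairdisplay itself is omitted.
decide_resync() {
  case "$1" in
    SSWS)      echo "pairresync -swaps" ;;  # swap takeover happened; reverse the pair
    PSUS|PSUE) echo "pairresync" ;;         # suspended or errored; plain resync
    PAIR|COPY) echo "none" ;;               # already in sync or syncing
    *)         echo "manual" ;;             # anything else needs a human
  esac
}
```

The script would then run the printed command with -g <devicegroup> after the takeover completes; whether -swaps can actually succeed while the old primary's horcm is unreachable is exactly what I need to test.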
Maria, can I snag a TrueCopy cluster to play with this? This is the issue that you and I discussed at length; we pretty much decided TrueCopy was working as intended, but we are seeing this time and time again, so I may want to put in some auto-recovery if possible.

stephen

On 07/08/09 02:55, Sergei Kolodka wrote:
> Stephen, thanks for your definitive answer. I'm hoping I can get the same
> definitive answer from Sun support for the logged call, unless it's from you,
> of course ;-)
>
> To be honest, I'd like this to be in the Sun Cluster documentation somewhere,
> in bold, large font, for two reasons.
>
> The first reason: in the company I work for, for example, we have a storage
> team, I'm not really allowed to touch the SAN, and I know little about it,
> and I know quite a few large companies and government departments that work
> the same way. The couple of certified Sun Cluster admins we asked about this
> problem knew little about it and had never seen this behaviour before, and
> the person who designed and built the cluster had no idea a pairresync must
> be done after each failover. If this were in the SC manual, or the SC manual
> at least referenced the TC manual for this particular case, it would greatly
> help in troubleshooting this issue/feature.
> In fact, I just searched for +pairresync +swaps on Sun's SC 3.2
> documentation web site, and surprisingly all 21 results relate to the
> Geographic Edition, which is not quite the same as the usual Sun Cluster;
> there's basically not much information in the SC manual about
> troubleshooting and resolving problems with non-geo SC + TC.
>
> The second reason matters much more to me: the possible consequences of not
> doing a pairresync after failover. The first time we encountered this
> problem, we did not check the pair status; we booted the original node and
> flipped the resource groups back without doing a pairresync, and as a result
> completely locked TrueCopy, with no hope of doing a pairresync, either
> -swaps or -swapp.
> The storage admins had to split the pair in their own SAN interface, and
> only then were we able to start our cluster. This happened, by accident,
> four weeks before going into production and took a whole day to resolve
> because no one knew what had happened; had it been production, it would have
> cost us a few million dollars for every hour of downtime. One page in the SC
> manual stating that admins must always do a pairresync after failover, with
> at least a reference to the TC manual, could easily save that money, and at
> least five years of a miserable admin's life in a real disaster.
>
> There is another, very much related question I would like to find the answer
> to. Right now we have Failback set to False for all our RGs. If you could
> shed some light on what will happen if Failback is set to True, i.e. the
> first node reboots and the RGs start moving back before the admins have a
> chance to do a pairresync, I'd greatly appreciate it.
>
> Once more, thank you for your help,
> Sergei
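Until we have a definitive answer on Failback=True, one interim safeguard would be to gate any failback on the pair actually being back in sync. A minimal sketch, assuming a wrapper around whatever moves the RG back (the helper name current_pair_state and the group/RG/node names are hypothetical; PAIR is the standard CCI "in sync" status):

```shell
# Sketch only: refuse a failback unless the pair is in the PAIR state.
# The caller is assumed to have extracted the current CCI status string
# (e.g. from pairdisplay output) before switching the RG back.
safe_to_failback() {
  [ "$1" = "PAIR" ]
}

# Hypothetical wrapper: only switch the RG back when the pair is in sync.
# if safe_to_failback "$(current_pair_state oradg)"; then
#   scswitch -z -g ora-rg -h node1
# else
#   echo "pair not in PAIR state; run pairresync first" >&2
# fi
```

This doesn't help if Failback=True moves the RG automatically, which is why I'd keep Failback at False until the auto-resync piece is proven.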