Solved it. Had to do two things: first, make the script resource depend on the network name and IP resources (so the network was already up on the 2nd node before the script came online), and second, the script apparently hadn't cached the SSH host key of the failover SAN, so it was prompting us to accept it. Once we fed a "y" into the pipeline (by running "echo y | plink" to accept it), we could then run non-interactive plink commands (using the -batch parameter), and all was well.
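In PowerShell terms, the two fixes amount to roughly the following. The script resource name ("RPA Failover Script"), the plink path, and the RPA address/account are placeholders for our site-specific values; the cmdlets are from the standard FailoverClusters module. Also keep in mind that plink caches host keys per Windows user profile, so accept the key under whatever account the cluster actually uses to run the script.

  # 1. Make the generic script resource depend on the SQL network name, so the
  #    network is online on the node before the script is brought online.
  #    (Add the IP address resource the same way if you want it explicit.)
  Add-ClusterResourceDependency -Resource "RPA Failover Script" -Provider "SQL Network Name (MS1)"
  Get-ClusterResourceDependency -Resource "RPA Failover Script"

  # 2. Accept the RPA's SSH host key once on each node, then use -batch so
  #    plink can never stop and wait for an interactive prompt.
  echo y | & "C:\RPA\plink.exe" -ssh admin@rpa-dr.example.com "exit"
  & "C:\RPA\plink.exe" -batch -ssh admin@rpa-dr.example.com "<RPA CLI command here>"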
I thought for sure we had cached that key, but apparently not. Once we figured out how to access the cluster log (the PowerShell "Get-ClusterLog" command), we could see what was happening.

On Tue, Jun 23, 2015 at 10:32 AM, Michael Leone <[email protected]> wrote:
> Some progress! I discovered the PowerShell "Get-ClusterLog -Timespan 5" command, to show the detailed cluster log for the last 5 minutes. So I do see my script failing - the return code is being set to 1. I don't know why; the return code should be zero, since the command to verify the location of the disks should just return properly. But it's not ... I am wondering if it's the network address at the 2nd node that is the problem. If the 2nd node doesn't have networking access on its own NIC, then the plink command to the RPA would never get to be executed.
>
> But then I see this:
>
> Network Name <SQL Network Name (MS1)>: Obtaining IP info for resource SQL IP Address 2 (MS1)
> Network Name <SQL Network Name (MS1)>: Using provider SQL IP Address 2 (MS1), ip address <Node-2-IP>, mask 255.255.255.0, prefix length 24
> Network Name <SQL Network Name (MS1)>: Obtaining IP info for resource SQL IP Address 1 (MS1)
> Network Name <SQL Network Name (MS1)>: Using provider SQL IP Address 1 (MS1), ip address <Node-1-IP>, mask 255.255.255.0, prefix length 24
> Network Name <SQL Network Name (MS1)>: Unable to get provider's transport name, status 2. Checking if the provider is online..
> Network Name <SQL Network Name (MS1)>: Ignoring transport name since provider is not online
> Network Name <SQL Network Name (MS1)>: Resource has 2 IPs
> Network Name <SQL Network Name (MS1)>: IP: Type Ipv4, Address <Node-2-IP>:~0~, Prefix <Node-2-IP>/24, Online true, Transport \Device\NetBt_If14
> Network Name <SQL Network Name (MS1)>: IP: Type Ipv4, Address <Node-1-IP>:~0~, Prefix <Node-1-IP>/24, Online false, Transport
> Network Name <SQL Network Name (MS1)>: Handling provider state change finished with result: 0
> Network Name <SQL Network Name (MS1)>: Configuration: In ProviderStateChangeImp, bIsOnline is true, bProviderStateChangeInProgress_ is false
> Network Name <SQL Network Name (MS1)>: Configuration: Netname is online, scheduling the timer to run after 30 seconds
> Network Name <SQL Network Name (MS1)>: Dns: Number of IPs that match current node id is 1. Registering MS1 for multichannel support
> Network Name <SQL Network Name (MS1)>: Dns: Registering netname for multichannel support returned 0
>
> Which I think means that the network for node 2 came up, as it should. So maybe that's not it ...
>
> On Tue, Jun 23, 2015 at 9:43 AM, Michael Leone <[email protected]> wrote:
>> We are working on a multisite cluster (in this test case, it's a SQL 2012 cluster). We have installed clustering and SQL, and told it that it will be multisite.
>>
>> We use EMC RecoverPoint to keep the SAN here and the one at the other site in sync. EMC tells us we need to use a Generic Script Resource that will tell RecoverPoint when to transfer the active consistency group (basically, the SAN LUN) to the other site, when we fail from one node to the other.
>>
>> So they gave us a sample VB script that uses plink (from the PuTTY folks) to issue failover commands to the RecoverPoint; the RPA (RecoverPoint Appliance) has a CLI mode that you can access via SSH.
>>
>> Hope all that's clear - when you fail over to the other node, the VB script tells the RPA to move the disks there (if they are not already there), so the cluster resource can come online at that site.
>>
>> So here's the problem: the script works in one direction and not the other. :-) Meaning: if the role is at HQ, but the RPA has the disks active at the remote site, the script properly tells the RPA to move the disks back to HQ, and the cluster comes online at HQ.
>>
>> (We copied the same script to a folder on the C: drive of each node; the one at HQ is customized to send to the RPA at HQ, and the script at DR is the same, except that it is customized to send to the RPA at DR.)
>>
>> However, at the remote site, the cluster won't even come up, even if the script has nothing to do (if the role is at the remote site, and the disks are at the remote site, there is nothing for the script to do, and it just exits after checking where the disks are). The problem is that we get an error "Incorrect function" on the script resource at the remote site (the "information details" link just says "0x80070001 Incorrect function"). It never even executes, just errors out, and so the role never comes online there. And there's nothing in the event logs telling us why, or what function is incorrect, or why ...
>>
>> I'd like the script to actually write out some debugging info into the log, so EMC added some "Resource.LogInformation" lines to the script. But where do these lines write to? Is there supposed to be some cluster.log written? I didn't see anything in "%windir%\cluster" anywhere. What should be happening (in this test case) is that the script queries where the disks are (are they at DR?), and then exits with a "success" status code (=0), because the answer is yes, the disks are at DR. But obviously it's not doing that correctly ...
>>
>> Anyone using a generic script resource? If so, are you using it like we want to? :-)
>>
>> I am missing something basic here, since it works in one direction, but I don't know what ...
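To close the loop on the question in the quoted mail about where "Resource.LogInformation" writes: those lines end up in the cluster log, which isn't sitting in "%windir%\cluster" as a ready-made text file - you generate it on demand. Something like the following (the node name and destination folder are placeholders):

  # Dump the last 5 minutes of the cluster log from every node into one folder;
  # by default each node writes Cluster.log under %windir%\Cluster\Reports.
  Get-ClusterLog -TimeSpan 5 -Destination C:\Temp\ClusterLogs

  # Or generate it for just the node you're debugging
  Get-ClusterLog -Node NODE2 -TimeSpan 5

That's where we finally saw what the script was actually doing (and the "return code is being set to 1" entries).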

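And on the earlier worry about whether the 2nd node could even reach the RPA over its own NIC: a quick sanity check, run from that node itself (the RPA address is a placeholder), is

  # Verify the RPA's SSH port is reachable from the DR node before blaming the script
  Test-NetConnection -ComputerName rpa-dr.example.com -Port 22

which takes the script out of the picture while you confirm basic connectivity.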