Hi Venkateswarlu,

Thanks for your response! I am using shared QFS with the SUNW.qfs agent, and running the "clrg online" command has no effect; the resource group is still offline.
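In case it helps, this is roughly the sequence I am running on the surviving node. The group and resource names fsg and fs1 are the ones that appear in the logs below; the samsharefs call is just my own way of peeking at the shared hosts table (it may need the -R raw-device option depending on where it is run), not something the agent does:

    # check the state of the resource group and of the QFS resource
    clrg status fsg
    clrs status fs1

    # request the group online on the surviving node c2
    clrg online -n c2 fsg

    # show the shared hosts table / which host is currently the metadata server
    samsharefs fs1

The "clrg online" request has no visible effect and the group stays offline. A sketch of how the resources were originally created is in the P.S. at the end of this mail.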
Could you please CC me the bug ID, or add me to the interest list, when you file the bug. Thanks!

Regards,
Lifeng

Venkateswarlu Tella wrote:
> Hi Yang,
>
> First, a few details on QFS...
>
> QFS can be configured as:
>
> 1. Standalone QFS
>    In the case of standalone QFS you can configure it as an HA file system
>    by using the HAStoragePlus resource type.
> 2. Shared QFS
>    In the case of a shared QFS file system there is an MDS (metadata server)
>    for each file system, which is used to serve the file system requests.
>    When the file system is configured on multiple nodes, one has to ensure
>    that the MDS is up and running on one of the nodes. So when shared QFS is
>    configured in a cluster environment you can use the SUNW.qfs agent, which
>    ensures that the MDS will be running on a cluster node that is up.
>
> In your case it looks like you are using a shared QFS file system
> configuration, not a standalone QFS file system.
>
> The "QFS filesystem group" is just the MDS that moves from one node to the
> other, not the actual file system. Note that wherever the QFS RG is up, you
> can access the file system from all nodes.
>
> Coming to your problem, it looks like a bug in the SUNW.qfs agent. For some
> reason the node which has connectivity to the storage complains that a
> failover action is already in progress and hence does not continue to host
> the QFS RG.
>
> Have you tried to bring the RG online again after it went offline? If the
> problem still occurs I can file a QFS bug on your behalf.
>
> Thanks
> -Venku
>
>
> On 01/16/09 16:46, lf yang wrote:
>> Hi Guys,
>> I set up a two-node cluster (version 16/09/20080) and connected both nodes
>> to a fiber disk array for the global devices.
>> I installed the latest version of SAM/QFS and configured SAM/QFS in the
>> Sun Cluster environment to make it an HA file system.
>>
>> Then I used "clrg switch" to switch the QFS file system resource group;
>> that works fine, the file system switches over successfully. When I power
>> off one of the nodes, the file system also switches to the surviving node
>> successfully. But when I pull out the fiber cable of one node, the file
>> system cannot be switched and the resource group goes offline.
>>
>> Before pulling the cable the file system metadata server is node1 (c1);
>> after I pull the cable on c1 it should switch to c2, but that appears to
>> fail.
>>
>> Here are the messages:
>>
>> cluster node1 messages:
>> ====================
>> Jan 17 17:59:43 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_monitor_start>:tag=<global.fsg.fs1.7>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 17:59:43 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_monitor_start> completed successfully for resource <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
>> Jan 17 18:00:45 c1 qlc: [ID 439991 kern.info] NOTICE: Qlogic qlc(2,0): Loop OFFLINE
>> Jan 17 18:00:46 c1 qlc: [ID 439991 kern.info] NOTICE: Qlogic qlc(3,0): Loop OFFLINE
>> Jan 17 18:01:45 c1 smbd[907]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
>> Jan 17 18:02:15 c1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::OFFLINE timeout
>> Jan 17 18:02:16 c1 fctl: [ID 517869 kern.warning] WARNING: fp(5)::OFFLINE timeout
>> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci@0,0/pci10de,5d@e/pci1077,138@0/fp@0,0 (fcp2):
>> Jan 17 18:02:35 c1 offlining lun=4 (trace=0), target=1 (trace=2800004)
>> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci@0,0/pci10de,5d@e/pci1077,138@0/fp@0,0 (fcp2):
>> Jan 17 18:02:35 c1 offlining lun=3 (trace=0), target=1 (trace=2800004)
>> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci@0,0/pci10de,5d@e/pci1077,138@0/fp@0,0 (fcp2):
>> Jan 17 18:02:35 c1 offlining lun=2 (trace=0), target=1 (trace=2800004)
>> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci@0,0/pci10de,5d@e/pci1077,138@0/fp@0,0 (fcp2):
>> Jan 17 18:02:35 c1 offlining lun=1 (trace=0), target=1 (trace=2800004)
>> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci@0,0/pci10de,5d@e/pci1077,138@0/fp@0,0 (fcp2):
>> Jan 17 18:02:35 c1 offlining lun=0 (trace=0), target=1 (trace=2800004)
>> Jan 17 18:02:35 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_probe]: [ID 139143 daemon.error] test: Unable to get QFS hosts table. Error: ^Xx^D^H\377\377\377\377\270\257\347\376?\242\346\376\240w^D^HT\242\346\376\270w^D^H^Xx^D^H\324x^D^H\320x^D^H\270\257\347\376\324w^D^H\276\237\361\376\270w^D^H^Px^D^H^X6^G^H(6^G^H^B
>> Jan 17 18:02:35 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_probe]: [ID 831072 daemon.notice] Issuing a resource restart request because of probe failures.
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 494478 daemon.notice] resource fs1 in resource group fsg has requested restart of the resource on c1.
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_monitor_stop> for resource <fs1>, resource group <fsg>, node <c1>, timeout <120> seconds
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_monitor_stop>:tag=<global.fsg.fs1.8>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_monitor_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_stop> for resource <fs1>, resource group <fsg>, node <c1>, timeout <120> seconds
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_start> for resource <fs1>, resource group <fsg>, node <c1>, timeout <120> seconds
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_start>:tag=<global.fsg.fs1.0>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:02:35 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_start]: [ID 139143 daemon.error] test: Unable to get QFS hosts table. Error:
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method <scqfs_start> failed on resource <fs1> in resource group <fsg> [exit code <1>, time used: 0% of timeout <120 seconds>]
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_stop> for resource <fs1>, resource group <fsg>, node <c1>, timeout <120> seconds
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_postnet_stop> for resource <fs1>, resource group <fsg>, node <c1>, timeout <120> seconds
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_postnet_stop>:tag=<global.fsg.fs1.11>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_postnet_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
>> Jan 17 18:02:41 c1 Cluster.qdmd: [ID 564960 daemon.notice] qdmd: An error occurred while opening quorum device /dev/did/rdsk/d11s2
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_prenet_start> for resource <fs1>, resource group <fsg>, node <c1>, timeout <300> seconds
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_prenet_start>:tag=<global.fsg.fs1.10>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:07:39 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 139143 daemon.error] test: Unable to get QFS hosts table. Error:
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method <scqfs_prenet_start> failed on resource <fs1> in resource group <fsg> [exit code <255>, time used: 0% of timeout <300 seconds>]
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_stop> for resource <fs1>, resource group <fsg>, node <c1>, timeout <120> seconds
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_postnet_stop> for resource <fs1>, resource group <fsg>, node <c1>, timeout <120> seconds
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_postnet_stop>:tag=<global.fsg.fs1.11>: Calling security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_postnet_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
>> Jan 17 18:07:47 c1 Cluster.qdmd: [ID 564960 daemon.notice] qdmd: An error occurred while opening quorum device /dev/did/rdsk/d11s2
>> Jan 17 18:08:46 c1 smbd[907]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
>> Jan 17 18:09:48 c1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g001b4d280100f619 (sd3):
>> Jan 17 18:09:48 c1 transport rejected fatal error
>> Jan 17 18:09:48 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the path to device: /dev/did/rdsk/d5s0 has changed to FAILED
>> Jan 17 18:09:48 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the path to device: /dev/did/rdsk/d3s0 has changed to FAILED
>> Jan 17 18:09:48 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the path to device: /dev/did/rdsk/d11s0 has changed to FAILED
>> Jan 17 18:10:18 c1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g001b4d280200f619 (sd2):
>> Jan 17 18:10:18 c1 transport rejected fatal error
>> Jan 17 18:10:18 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the path to device: /dev/did/rdsk/d7s0 has changed to FAILED
>>
>> cluster node2 messages:
>> ====================
>> Jan 16 20:05:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource fs1 status msg on node c2 change to <Starting>
>> Jan 16 20:05:46 c2 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_prenet_start>:tag=<global.fsg.fs1.10>: Calling security_clnt_connect(..., host=<c2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 374142 daemon.notice] test: Metadata server c1 found.
>> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 763423 daemon.notice] test: Attempting voluntary failover.
>> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 634975 daemon.notice] test: Initiating voluntary MDS change to node c2
>> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 734485 daemon.error] test: Host failover already pending.
>> Jan 16 20:05:48 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 298843 daemon.notice] test: Waiting for switchover to complete.
>> Jan 16 20:06:48 c2 last message repeated 1 time
>> Jan 16 20:07:49 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 298843 daemon.notice] test: Waiting for switchover to complete.
>> Jan 16 20:09:49 c2 last message repeated 2 times
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 764140 daemon.error] Method <scqfs_prenet_start> on resource <fs1>, resource group <fsg>, node <c2>: Timeout.
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource fs1 state on node c2 change to R_START_FAILED
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group fsg state on node c2 change to RG_PENDING_OFF_START_FAILED
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource fs1 status on node c2 change to R_FM_FAULTED
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource fs1 status msg on node c2 change to <>
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource fs1 state on node c2 change to R_STOPPING
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_stop> for resource <fs1>, resource group <fsg>, node <c2>, timeout <120> seconds
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource fs1 status on node c2 change to R_FM_UNKNOWN
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource fs1 status msg on node c2 change to <Stopping>
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., host=<c2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c2>, time used: 0% of timeout <120 seconds>
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource fs1 state on node c2 change to R_STOPPED
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource fs1 state on node c2 change to R_POSTNET_STOPPING
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <scqfs_postnet_stop> for resource <fs1>, resource group <fsg>, node <c2>, timeout <120> seconds
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/bin/scqfs_postnet_stop>:tag=<global.fsg.fs1.11>: Calling security_clnt_connect(..., host=<c2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <scqfs_postnet_stop> completed successfully for resource <fs1>, resource group <fsg>, node <c2>, time used: 0% of timeout <120 seconds>
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource fs1 state on node c2 change to R_OFFLINE
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource fs1 status on node c2 change to R_FM_OFFLINE
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource fs1 status msg on node c2 change to <>
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group fsg state on node c2 change to RG_OFFLINE_START_FAILED
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group fsg state on node c2 change to RG_OFFLINE
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not attempting to start resource group <fsg> on node <c1> because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not attempting to start resource group <fsg> on node <c2> because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds
>> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 674214 daemon.notice] rebalance: no primary node is currently found for resource group <fsg>.
>>
>> Is there any clue for this?
>> Thanks very much!
>>
>> Regards,
>> Lifeng
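P.S. For the bug report, here is a sketch of how the shared QFS resource group and resource were created. I am reconstructing the commands from memory, so treat this as approximate: the group, resource, and node names match the logs, but the mount point /global/fs1 is only an example:

    # register the SUNW.qfs resource type and create the resource group
    clrt register SUNW.qfs
    clrg create -n c1,c2 fsg

    # create the QFS resource that manages the metadata server of the shared
    # file system mounted at /global/fs1 (example mount point)
    clrs create -g fsg -t SUNW.qfs -p QFSFileSystem=/global/fs1 fs1

    # bring the group online; later "clrg switch -n <node> fsg" moves the MDS
    clrg online fsg

My understanding from the SAM-QFS documentation (please correct me if this is wrong) is that the MDS could also be moved by hand with something like "samsharefs -s c2 fs1" from a node that still sees the storage; that is what I would try next if the agent keeps reporting "Host failover already pending".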