Hi Yang,

First, a few details on QFS...

QFS can be configured in two ways:

1. Standalone QFS
     A standalone QFS file system can be made an HA file system
     by using the HAStoragePlus resource type.
2. Shared QFS
     A shared QFS file system has a metadata server (MDS) for each
     file system, which serves the file system requests.
     When the file system is configured on multiple nodes, one has to
     ensure that the MDS is up and running on one of them.
     So when shared QFS is configured in a cluster environment, you can
     use the SUNW.qfs agent, which ensures that the MDS runs on a
     cluster node that is up.
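For reference, putting a shared QFS file system under the SUNW.qfs agent typically looks something like the sketch below (the group name "fsg", resource name "fs1", and mount point "/global/fs1" are illustrative; substitute your own):

```shell
# Register the SUNW.qfs resource type (once per cluster).
clresourcetype register SUNW.qfs

# Create a failover resource group for the MDS, listing the
# nodes that can host it.
clresourcegroup create -n c1,c2 fsg

# Create the QFS resource; QFSFileSystem points at the shared
# file system's mount point.
clresource create -g fsg -t SUNW.qfs \
    -p QFSFileSystem=/global/fs1 fs1

# Bring the resource group (and hence the MDS) online.
clresourcegroup online -M fsg
```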

In your case it looks like you are using a shared QFS file system
configuration, not a standalone QFS file system.

What the "QFS filesystem group" moves from one node to the other is
just the MDS server, not the actual file system. Note
that wherever the QFS RG is up, you can access the file system from all
nodes.
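If you want to check which node is currently the MDS, you can inspect the shared hosts table with samsharefs (a sketch; "fs1" stands in for your file system's family set name):

```shell
# Show the shared hosts table; the entry marked as the server
# is the current MDS.
samsharefs fs1

# Read the raw hosts table from disk instead (useful when the
# file system is not mounted on this node).
samsharefs -R fs1
```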

Coming to your problem: it looks like a bug in the SUNW.qfs agent. For
some reason the node which has connectivity to the storage complains
that a failover action is already in progress, and hence does not continue
to host the QFS RG.

Have you tried to bring the RG online again after it went offline? If the
problem still occurs, I can file a bug against QFS on your behalf.
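To retry after the RG went offline, something along these lines should work (a sketch; note the RGM refuses to restart a group that has failed to start twice within an hour, so clearing the error state and switching explicitly may be needed):

```shell
# Clear any error state left on the resource.
clresource clear -f STOP_FAILED fs1

# Try to bring the group online on the node that still sees
# the storage (c2 in your case).
clresourcegroup switch -n c2 fsg

# Check the result.
clresourcegroup status fsg
clresource status fs1
```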

Thanks
-Venku


On 01/16/09 16:46, lf yang wrote:
> Hi Guys,
> I set up a two-node cluster (version 16/09/20080) and connected
> both nodes to a fibre disk array for the global devices.
> I installed the latest version of SAM/QFS on it, and also configured
> SAM/QFS in the Sun Cluster environment to make
> it an HA file system.
> 
> Then I used "clrg switch" to switch the QFS filesystem cluster
> group; that works fine, and the file system switches over successfully. Also,
> when I power off one of the nodes, the file system switches
> to the surviving node successfully. But when I pull out the fibre cable
> of one node, the file system cannot be switched and the resource group
> goes offline.
> 
> Before pulling out the cable, the file system master was node1 (c1). After
> I pulled out the cable on c1, it should have switched to c2, but it looks like it failed.
> 
> Here are the messages:
> 
> cluster node1 messages: 
> ====================
> Jan 17 17:59:43 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_monitor_start>:tag=<global.fsg.fs1.7>: Calling 
> security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1
>> , ...)
> Jan 17 17:59:43 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_monitor_start> completed successfully for resour
> ce <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 
> seconds>
> Jan 17 18:00:45 c1 qlc: [ID 439991 kern.info] NOTICE: Qlogic qlc(2,0): Loop 
> OFFLINE
> Jan 17 18:00:46 c1 qlc: [ID 439991 kern.info] NOTICE: Qlogic qlc(3,0): Loop 
> OFFLINE
> Jan 17 18:01:45 c1 smbd[907]: [ID 766186 daemon.error] NbtDatagramDecode[11]: 
> too small packet
> Jan 17 18:02:15 c1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::OFFLINE 
> timeout
> Jan 17 18:02:16 c1 fctl: [ID 517869 kern.warning] WARNING: fp(5)::OFFLINE 
> timeout
> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci at 0,0/pci10de,5d at 
> e/pci1077,138 at 0/fp at 0,0 (fcp2):
> Jan 17 18:02:35 c1      offlining lun=4 (trace=0), target=1 (trace=2800004)
> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci at 0,0/pci10de,5d at 
> e/pci1077,138 at 0/fp at 0,0 (fcp2):
> Jan 17 18:02:35 c1      offlining lun=3 (trace=0), target=1 (trace=2800004)
> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci at 0,0/pci10de,5d at 
> e/pci1077,138 at 0/fp at 0,0 (fcp2):
> Jan 17 18:02:35 c1      offlining lun=2 (trace=0), target=1 (trace=2800004)
> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci at 0,0/pci10de,5d at 
> e/pci1077,138 at 0/fp at 0,0 (fcp2):
> Jan 17 18:02:35 c1      offlining lun=1 (trace=0), target=1 (trace=2800004)
> Jan 17 18:02:35 c1 scsi: [ID 243001 kern.info] /pci at 0,0/pci10de,5d at 
> e/pci1077,138 at 0/fp at 0,0 (fcp2):
> Jan 17 18:02:35 c1      offlining lun=0 (trace=0), target=1 (trace=2800004)
> Jan 17 18:02:35 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_probe]: [ID 139143 
> daemon.error] test: Unable to get QFS hosts table.  Error: ^Xx^
> D^H\377\377\377\377\270\257\347\376?\242\346\376\240w^D^HT\242\346\376\270w^D^H^Xx^D^H\324x^D^H\320x^D^H\270\257\347\376\324w^D^H\27
> 6\237\361\376\270w^D^H^Px^D^H^X6^G^H(6^G^H^B
> Jan 17 18:02:35 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_probe]: [ID 831072 
> daemon.notice] Issuing a resource restart request because of pr
> obe failures.
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 494478 daemon.notice] 
> resource fs1 in resource group fsg has requested restart of th
> e resource on c1.
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_monitor_stop> for resource <fs1>, reso
> urce group <fsg>, node <c1>, timeout <120> seconds
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_monitor_stop>:tag=<global.fsg.fs1.8>: Calling 
> security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>
> , ...)
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_monitor_stop> completed successfully for resourc
> e <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 
> seconds>
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_stop> for resource <fs1>, resource gro
> up <fsg>, node <c1>, timeout <120> seconds
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., 
> host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_stop> completed successfully for resource <fs1>,
>  resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_start> for resource <fs1>, resource gr
> oup <fsg>, node <c1>, timeout <120> seconds
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_start>:tag=<global.fsg.fs1.0>: Calling security_clnt_connect(..., 
> host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
> Jan 17 18:02:35 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_start]: [ID 139143 
> daemon.error] test: Unable to get QFS hosts table.  Error:
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method 
> <scqfs_start> failed on resource <fs1> in resource group
>  <fsg> [exit code <1>, time used: 0% of timeout <120 seconds>]
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_stop> for resource <fs1>, resource gro
> up <fsg>, node <c1>, timeout <120> seconds
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., 
> host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_stop> completed successfully for resource <fs1>,
>  resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_postnet_stop> for resource <fs1>, reso
> urce group <fsg>, node <c1>, timeout <120> seconds
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_postnet_stop>:tag=<global.fsg.fs1.11>: Calling 
> security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1
>> , ...)
> Jan 17 18:02:35 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_postnet_stop> completed successfully for resourc
> e <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 
> seconds>
> Jan 17 18:02:41 c1 Cluster.qdmd: [ID 564960 daemon.notice] qdmd: An error 
> occurred while opening quorum device /dev/did/rdsk/d11s2
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_prenet_start> for resource <fs1>, reso
> urce group <fsg>, node <c1>, timeout <300> seconds
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_prenet_start>:tag=<global.fsg.fs1.10>: Calling 
> security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1
>> , ...)
> Jan 17 18:07:39 c1 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 139143 
> daemon.error] test: Unable to get QFS hosts table.  Erro
> r:
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method 
> <scqfs_prenet_start> failed on resource <fs1> in resourc
> e group <fsg> [exit code <255>, time used: 0% of timeout <300 seconds>]
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_stop> for resource <fs1>, resource gro
> up <fsg>, node <c1>, timeout <120> seconds
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., 
> host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_stop> completed successfully for resource <fs1>,
>  resource group <fsg>, node <c1>, time used: 0% of timeout <120 seconds>
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_postnet_stop> for resource <fs1>, reso
> urce group <fsg>, node <c1>, timeout <120> seconds
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_postnet_stop>:tag=<global.fsg.fs1.11>: Calling 
> security_clnt_connect(..., host=<c1>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1
>> , ...)
> Jan 17 18:07:39 c1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_postnet_stop> completed successfully for resourc
> e <fs1>, resource group <fsg>, node <c1>, time used: 0% of timeout <120 
> seconds>
> Jan 17 18:07:47 c1 Cluster.qdmd: [ID 564960 daemon.notice] qdmd: An error 
> occurred while opening quorum device /dev/did/rdsk/d11s2
> Jan 17 18:08:46 c1 smbd[907]: [ID 766186 daemon.error] NbtDatagramDecode[11]: 
> too small packet
> Jan 17 18:09:48 c1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk at 
> g001b4d280100f619 (sd3):
> Jan 17 18:09:48 c1      transport rejected fatal error
> Jan 17 18:09:48 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the 
> path to device: /dev/did/rdsk/d5s0 has changed to FAIL
> ED
> Jan 17 18:09:48 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the 
> path to device: /dev/did/rdsk/d3s0 has changed to FAIL
> ED
> Jan 17 18:09:48 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the 
> path to device: /dev/did/rdsk/d11s0 has changed to FAI
> LED
> Jan 17 18:10:18 c1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk at 
> g001b4d280200f619 (sd2):
> Jan 17 18:10:18 c1      transport rejected fatal error
> Jan 17 18:10:18 c1 Cluster.scdpmd: [ID 977412 daemon.notice] The state of the 
> path to device: /dev/did/rdsk/d7s0 has changed to FAIL
> ED
> 
> 
> 
> 
> 
> 
> 
> 
> 
> cluster node2 messages:
> ====================
> Jan 16 20:05:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] 
> resource fs1 status msg on node c2 change to <Starting>
> Jan 16 20:05:46 c2 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_prenet_start>:tag=<global.fsg.fs1.10>: Calling 
> security_clnt_connect(..., host=<c2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1
>> , ...)
> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 374142 
> daemon.notice] test: Metadata server c1 found.
> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 763423 
> daemon.notice] test: Attempting voluntary failover.
> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 634975 
> daemon.notice] test: Initiating voluntary MDS change to
> node c2
> Jan 16 20:05:46 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 734485 
> daemon.error] test:  Host failover already pending.
> Jan 16 20:05:48 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 298843 
> daemon.notice] test: Waiting for switchover to complete.
> Jan 16 20:06:48 c2 last message repeated 1 time
> Jan 16 20:07:49 c2 SC[,SUNW.qfs:4.6,fsg,fs1,scqfs_prenet_start]: [ID 298843 
> daemon.notice] test: Waiting for switchover to complete.
> Jan 16 20:09:49 c2 last message repeated 2 times
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 764140 daemon.error] Method 
> <scqfs_prenet_start> on resource <fs1>, resource group <
> fsg>, node <c2>: Timeout.
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] 
> resource fs1 state on node c2 change to R_START_FAILED
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] 
> resource group fsg state on node c2 change to RG_PENDING_OFF_S
> TART_FAILED
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] 
> resource fs1 status on node c2 change to R_FM_FAULTED
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] 
> resource fs1 status msg on node c2 change to <>
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] 
> resource fs1 state on node c2 change to R_STOPPING
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_stop> for resource <fs1>, resource gro
> up <fsg>, node <c2>, timeout <120> seconds
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] 
> resource fs1 status on node c2 change to R_FM_UNKNOWN
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] 
> resource fs1 status msg on node c2 change to <Stopping>
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_stop>:tag=<global.fsg.fs1.1>: Calling security_clnt_connect(..., 
> host=<c2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_stop> completed successfully for resource <fs1>,
>  resource group <fsg>, node <c2>, time used: 0% of timeout <120 seconds>
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] 
> resource fs1 state on node c2 change to R_STOPPED
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] 
> resource fs1 state on node c2 change to R_POSTNET_STOPPING
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] 
> launching method <scqfs_postnet_stop> for resource <fs1>, reso
> urce group <fsg>, node <c2>, timeout <120> seconds
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 204411 daemon.notice] 41 
> fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWsamfs/sc/b
> in/scqfs_postnet_stop>:tag=<global.fsg.fs1.11>: Calling 
> security_clnt_connect(..., host=<c2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1
>> , ...)
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method 
> <scqfs_postnet_stop> completed successfully for resourc
> e <fs1>, resource group <fsg>, node <c2>, time used: 0% of timeout <120 
> seconds>
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] 
> resource fs1 state on node c2 change to R_OFFLINE
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] 
> resource fs1 status on node c2 change to R_FM_OFFLINE
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] 
> resource fs1 status msg on node c2 change to <>
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] 
> resource group fsg state on node c2 change to RG_OFFLINE_START
> _FAILED
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] 
> resource group fsg state on node c2 change to RG_OFFLINE
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not 
> attempting to start resource group <fsg> on node <c1> beca
> use this resource group has already failed to start on this node 2 or more 
> times in the past 3600 seconds
> 
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not 
> attempting to start resource group <fsg> on node <c2> beca
> use this resource group has already failed to start on this node 2 or more 
> times in the past 3600 seconds
> Jan 16 20:10:46 c2 Cluster.RGM.global.rgmd: [ID 674214 daemon.notice] 
> rebalance: no primary node is currently found for resource gro
> up <fsg>.
> 
> 
> Is there any clue for this?
> Thanks very much!
> 
> Regards,
> Lifeng

