[ha-clusters-discuss] HAStoragePlus resource with a zone on top, unable to migrate

Venkateswarlu Tella Tue, 08 Dec 2009 11:42:54 +0530

Hi Tundra,

I don't see any message in the log that explains why gds_svc_start timed 
out, or what it was waiting for during the timeout period.


> root@mltproc1:~# zfs set mountpoint=/common_pool0/common_zone/root/personal_pool0/personal personal_pool0/personal

Don't set the mountpoint to "/common_pool0/ ...", because that hardcodes it. 
It has to stay at the default value. The reason is that when HASP imports the 
pool for a zone, all the file systems are mounted relative to the zone root, 
so a hardcoded mountpoint will cause problems.
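
A minimal sketch of undoing the hardcoded mountpoint, assuming the dataset 
name from the quoted command. `zfs inherit` clears a locally set property so 
the dataset falls back to its inherited or default value. The `run` helper and 
the DRY_RUN variable are illustrative only and not part of ZFS; by default the 
commands are printed, not executed.

```shell
#!/bin/sh
# Sketch: drop the locally set mountpoint so the dataset uses its default.
# DATASET is taken from the quoted command; `run` and DRY_RUN are
# hypothetical helpers added for illustration.
DATASET="${DATASET:-personal_pool0/personal}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

reset_mountpoint() {
    # `zfs inherit` removes the local value of the property.
    run zfs inherit mountpoint "$1"
    # Afterwards the SOURCE column should no longer read "local".
    run zfs get zoned,mountpoint "$1"
}

reset_mountpoint "$DATASET"
```

Run it with DRY_RUN=0 on the node that currently has the pool imported once 
the printed commands look right.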

I suggest you keep the dependencies as we discussed earlier and remove the 
lofs part as well.

Now bring all the resource groups online, and check the whole file system 
hierarchy to ensure that everything is mounted in the expected order at the 
expected mountpoints.
If that looks fine, try taking the RGs offline.
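
The online/check/offline sequence above could be sketched roughly as follows. 
This assumes the resource group name common_shares from the quoted log and 
that the group is already managed; `run`, `cycle_rg` and DRY_RUN are 
hypothetical helpers, and the commands are only printed unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: online the RG, look at the ZFS mount hierarchy, then offline it.
# RG is assumed from the quoted log; `run`, `cycle_rg` and DRY_RUN are
# hypothetical names used only for this illustration.
RG="${RG:-common_shares}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

cycle_rg() {
    run clrg online "$1"
    run clrg status "$1"
    # Check that every file system is mounted where it is expected.
    run zfs list -o name,mountpoint,mounted
    run clrg offline "$1"
}

cycle_rg "$RG"
```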

If you have issues, send the dependency information, the pools involved, and 
the file system hierarchy (i.e., zpool list, zfs list, df -F zfs), captured 
while the RG is online.
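
One possible way to gather that information in a single file is sketched 
below. The report path, the `section` and `collect` helpers, and the exact 
`zfs list` columns are assumptions for illustration; the resource names are 
the ones from the quoted commands. Commands that are not installed are noted 
in the report rather than causing a failure.

```shell
#!/bin/sh
# Sketch: collect the requested details into one report while the RG is online.
# REPORT, `section` and `collect` are hypothetical names for this example.
REPORT="${REPORT:-/tmp/rg-online-report.txt}"

section() {
    # Write a header, then the command output, or a note if it is missing.
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "(command not found: $1)"
    fi
}

collect() {
    section zpool list
    section zfs list -o name,mountpoint,mounted,zoned
    section df -F zfs
    # Dependencies of the resources named in the thread.
    section clrs show -p Resource_dependencies common_zpool personal_pool common_zone
}

collect > "$REPORT"
echo "report written to $REPORT"
```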

Thanks
-Venku

> root@mltproc1:~# zfs get zoned,mountpoint personal_pool0/personal
> NAME                     PROPERTY    VALUE                                                   SOURCE
> personal_pool0/personal  zoned       off                                                     local
> personal_pool0/personal  mountpoint  /common_pool0/common_zone/root/personal_pool0/personal  local
> root@mltstore0:~# clrs set -p Resource_dependencies+=common_zpool personal_pool
> root@mltstore0:~# clrs set -p Resource_dependencies+=personal_pool common_zone
> 
> I didn't get any errors in this process.
> 
> Now when I attempt to switch or start the 'common_shares' resource group, it 
> just keeps migrating from node to node, never getting out of 'Pending 
> online', and in the logging host for the cluster I see the following, which 
> looks to me like the zone just doesn't start (timeout at 11:08:40 - looks 
> like 192.168.11.21 and 192.168.11.22 are 2 sec off in timesync), but I don't 
> see any details of why:
> 
> Dec  7 11:03:37 [192.168.11.21.214.62] 
> SC[,SUNW.gds:6,common_shares,common_zone,gds_svc_start]: [ID 661560 
> daemon.info] All the SUNW.HAStoragePlus resources that this resource depends 
> on are online on the local node. Proceeding with the checks for the existence 
> and permissions of the start/stop/probe commands.
> Dec  7 11:03:37 [192.168.11.21.214.62] 
> SC[,SUNW.gds:6,common_shares,common_zone,gds_svc_start]: [ID 268646 
> daemon.info] Extension property <network_aware> has a value of <1>
> Dec  7 11:03:37 [192.168.11.21.214.62] 
> SC[,SUNW.LogicalHostname:3,common_shares,common_lhname,hafoip_monitor_start]: 
> [ID 211198 daemon.info] Completed successfully.
> Dec  7 11:03:37 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 515159 
> daemon.notice] method <hafoip_monitor_start> completed successfully for 
> resource <common_lhname>, resource group <common_shares>, node <mltstore1>, 
> time used: 0% of timeout <300 seconds>
> Dec  7 11:03:35 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 443746 
> daemon.notice] resource common_lhname state on node mltstore1 change to 
> R_ONLINE
> Dec  7 11:03:37 [192.168.11.21.214.62] 
> SC[,SUNW.gds:6,common_shares,common_zone,gds_svc_start]: [ID 887138 
> daemon.info] Extension property <Child_mon_level> has a value of <-1>
> Dec  7 11:03:37 [192.168.11.21.214.62] 
> SC[,SUNW.gds:6,common_shares,common_zone,gds_svc_start]: [ID 833212 
> daemon.info] Attempting to start the data service under process monitor 
> facility.
> Dec  7 11:03:37 [192.168.11.21.214.62] 
> SC[,SUNW.gds:6,common_shares,common_zone,gds_svc_start]: [ID 569559 
> daemon.info] Start of /opt/SUNWsczone/sczbt/bin/start_sczbt -R common_zone -G 
> common_shares -P /common_pool0/common_zone/parameters  completed successfully.
> Dec  7 11:03:37 [192.168.11.21.214.62] 
> SC[,SUNW.gds:6,common_shares,common_zone,gds_svc_start]: [ID 268646 
> daemon.info] Extension property <network_aware> has a value of <1>
> Dec  7 11:03:38 [192.168.11.21.214.62] genunix: [ID 408114 kern.info] 
> /pseudo/zconsnex@1/zcons@1 (zcons1) online
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 764140 
> daemon.error] Method <gds_svc_start> on resource <common_zone>, resource 
> group <common_shares>, node <mltstore1>: Timeout.
> Dec  7 11:08:38 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 443746 
> daemon.error] resource common_zone state on node mltstore1 change to 
> R_START_FAILED
> Dec  7 11:08:38 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 784560 
> daemon.notice] resource common_zone status on node mltstore1 change to 
> R_FM_FAULTED
> Dec  7 11:08:38 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 922363 
> daemon.notice] resource common_zone status msg on node mltstore1 change to <>
> Dec  7 11:08:38 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 529407 
> daemon.error] resource group common_shares state on node mltstore1 change to 
> RG_PENDING_OFF_START_FAILED
> Dec  7 11:08:38 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 784560 
> daemon.notice] resource common_zone status on node mltstore1 change to 
> R_FM_UNKNOWN
> Dec  7 11:08:38 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 922363 
> daemon.notice] resource common_zone status msg on node mltstore1 change to 
> <Stopping>
> Dec  7 11:08:38 [192.168.11.22.236.224] Cluster.RGM.global.rgmd: [ID 443746 
> daemon.notice] resource common_zone state on node mltstore1 change to 
> R_STOPPING
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 224900 
> daemon.notice] launching method <hastorageplus_monitor_stop> for resource 
> <personal_pool>, resource group <common_shares>, node <mltstore1>, timeout 
> <90> seconds
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 224900 
> daemon.notice] launching method <hafoip_monitor_stop> for resource 
> <common_lhname>, resource group <common_shares>, node <mltstore1>, timeout 
> <300> seconds
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 224900 
> daemon.notice] launching method <hastorageplus_monitor_stop> for resource 
> <common_zpool>, resource group <common_shares>, node <mltstore1>, timeout 
> <90> seconds
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 224900 
> daemon.notice] launching method <gds_svc_stop> for resource <common_zone>, 
> resource group <common_shares>, node <mltstore1>, timeout <300> seconds
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 669833 
> daemon.debug] 68 fe_rpc_command: 
> cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_monitor_stop>:tag=<common_shares.personal_pool.8>:
>  Calling security_clnt_connect(..., host=<mltstore1>, sec_type {0:WEAK, 
> 1:STRONG, 2:DES} =<1>, ...)
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 653003 
> daemon.debug] 73 fe_rpc_command: 
> cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_monitor_stop>:tag=<common_shares.common_zpool.8>:
>  Calling security_clnt_connect(..., host=<mltstore1>, sec_type {0:WEAK, 
> 1:STRONG, 2:DES} =<1>, ...)
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 846460 
> daemon.debug] 65 fe_rpc_command: 
> cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hafoip/hafoip_monitor_stop>:tag=<common_shares.common_lhname.8>:
>  Calling security_clnt_connect(..., host=<mltstore1>, sec_type {0:WEAK, 
> 1:STRONG, 2:DES} =<1>, ...)
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 170767 
> daemon.debug] 71 fe_rpc_command: 
> cmd_type(enum):<1>:cmd=</opt/SUNWscgds/bin/gds_svc_stop>:tag=<common_shares.common_zone.1>:
>  Calling security_clnt_connect(..., host=<mltstore1>, sec_type {0:WEAK, 
> 1:STRONG, 2:DES} =<1>, ...)
> Dec  7 11:08:40 [192.168.11.21.214.62] Cluster.RGM.global.rgmd: [ID 515159 
> daemon.notice] method <hastorageplus_monitor_stop> completed successfully for 
> resource <personal_pool>, resource group <common_shares>, node <mltstore1>, 
> time used: 0% of timeout <90 seconds>
