To manually force the export, is that just 'zpool export common_pool0', or something else?
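To be concrete, here is what I was planning to try by hand from the global zone (the pool name and zone root path are just what I see in my logs below, and I don't know whether HAStoragePlus does anything beyond this under the hood, so please correct me if the test should look different):

  # zpool export -f common_pool0

and, if that still reports a busy dataset, forcing the unmount of the zone root first:

  # zfs unmount -f /common_pool0/common_zone/root
  # zpool export -f common_pool0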
One item that I'm curious about - is it possible that zfs-auto-snapshot is conflicting with this? I presumed that it's smart enough to handle that (the export being atomic as well as the snapshot being atomic, they shouldn't step on each other). And should zfs-auto-snapshot be running in the specific zone, or in the global zone?

The bug that Amit mentioned - would that trigger with TWO zpools in the resource?

Thanks to all for their ideas, suggestions and knowledge. (Below the quoted thread I've also put the manual checks I plan to run the next time the export fails, as suggested.)

On Thu, Dec 3, 2009 at 6:42 AM, Venkateswarlu Tella <Venkateswarlu.Tella at sun.com> wrote:

> Hi,
>
> HAStoragePlus uses force export and force unmount on the pool and its
> file systems respectively. So the umount has to succeed even if someone
> is using the file system. Not sure if something is broken in ZFS umount
> semantics.
>
> It would be a good thing to test the force export manually, as Hartmut
> suggested.
>
> Thanks
> -Venku
>
>
> On 12/03/09 14:11, Hartmut Streppel wrote:
>
>> Hi,
>> Hard to diagnose. Your dependencies are correct. The messages indicate
>> that the zone is down before the HAStoragePlus resource tries to export
>> the zpool. The only possibility left in my mind is that there is some
>> other process, running in the global zone, using the zpool to be
>> exported.
>>
>> In this situation, are you able to export the zpool manually? If not,
>> it is not a cluster problem. You could try to find out which processes
>> are using a file or directory on that zpool.
>>
>> Regards
>> Hartmut
>>
>>
>> On 12/02/09 23:24, Tundra Slosek wrote:
>>
>>> Is there a concise way to dump the pertinent details of a group?
>>>
>>> If I understand correctly, this shows that the resource 'common_zone'
>>> (the gds resource created by sczbt_register) depends on 'common_lhname'
>>> (the logicalhostname resource) and 'common_zpool' (the HAStoragePlus
>>> resource). I am certainly open to being enlightened...
>>>
>>> root@mltproc1:~# /usr/cluster/bin/clresource show -y Resource_dependencies common_zone
>>>
>>> === Resources ===
>>>
>>> Resource:                  common_zone
>>>   Resource_dependencies:   common_lhname common_zpool
>>>
>>> Also, a more complete snippet of the log file going back further in
>>> time - does the first log entry at 11:43:18 show that the zone actually
>>> stopped, or that the cluster incorrectly thinks the zone is stopped
>>> when it isn't?
>>>
>>> Dec 2 11:43:17 mltproc1 SC[SUNWsczone.stop_sczbt]:common_shares:common_zone: [ID 567783 daemon.notice] stop_command rc<0> - Changing to init state 0 - please wait
>>> Dec 2 11:43:17 mltproc1 SC[SUNWsczone.stop_sczbt]:common_shares:common_zone: [ID 567783 daemon.notice] stop_command rc<0> - Shutdown started. Wed Dec 2 11:40:30 EST 2009
>>> Dec 2 11:43:18 mltproc1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <gds_svc_stop> completed successfully for resource <common_zone>, resource group <common_shares>, node <mltproc1>, time used: 56% of timeout <300 seconds>
>>> Dec 2 11:43:18 mltproc1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hafoip_stop> for resource <common_lhname>, resource group <common_shares>, node <mltproc1>, timeout <300> seconds
>>> Dec 2 11:43:18 mltproc1 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 192.168.011.005:0, remote = 000.000.000.000:0, start = -2, end = 6
>>> Dec 2 11:43:18 mltproc1 ip: [ID 302654 kern.notice] TCP_IOC_ABORT_CONN: aborted 0 connection
>>> Dec 2 11:43:18 mltproc1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hafoip_stop> completed successfully for resource <common_lhname>, resource group <common_shares>, node <mltproc1>, time used: 0% of timeout <300 seconds>
>>> Dec 2 11:43:19 mltproc1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_postnet_stop> for resource <personal_pool>, resource group <common_shares>, node <mltproc1>, timeout <1800> seconds
>>> Dec 2 11:43:19 mltproc1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_postnet_stop> for resource <common_zpool>, resource group <common_shares>, node <mltproc1>, timeout <1800> seconds
>>> Dec 2 11:43:22 mltproc1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hastorageplus_postnet_stop> completed successfully for resource <personal_pool>, resource group <common_shares>, node <mltproc1>, time used: 0% of timeout <1800 seconds>
>>> Dec 2 11:43:22 mltproc1 SC[,SUNW.HAStoragePlus:8,common_shares,common_zpool,hastorageplus_postnet_stop]: [ID 471757 daemon.error] cannot unmount '/common_pool0/common_zone/root' : Device busy
>>> Dec 2 11:43:22 mltproc1 SC[,SUNW.HAStoragePlus:8,common_shares,common_zpool,hastorageplus_postnet_stop]: [ID 952001 daemon.error] Failed to export :common_pool0
>>> Dec 2 11:43:22 mltproc1 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method <hastorageplus_postnet_stop> failed on resource <common_zpool> in resource group <common_shares> [exit code <1>, time used: 0% of timeout <1800 seconds>]
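
Following up on Hartmut's suggestion to find out which processes are still using the zpool: the checks below are what I plan to run from the global zone the next time hastorageplus_postnet_stop fails with 'Device busy'. Treat this as a sketch - the path comes from the log above and the exact fuser/ps invocations are from memory:

  # fuser -c /common_pool0/common_zone/root
  # ps -o pid,zone,args -p <PIDs reported by fuser>

If fuser turns up a zfs-auto-snapshot (or zfs send/receive) process, that would answer my own question above; if it turns up nothing and 'zpool export -f' still fails, I'll assume the problem is in ZFS rather than in the cluster framework.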