> One causes the other, I assume the processes of the sczbt get killed because they are getting too close to their stop_timeout. So question number one is the real root cause. I assume that in your version of the sczbt there is a pmfadm -s <name> KILL, and the clear_zone script is started with hatimerun and not with pmfadm.
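For reference, a minimal sketch of how that stop_timeout can be inspected and raised on the zone boot resource with clrs; the resource name matches the one used below, and the value of 600 seconds is only an illustration, not a recommendation from this thread:

    # Show the current Stop_timeout of the sczbt (zone boot) resource
    clrs show -v -p Stop_timeout smb1_zone
    # Raise it (example value); takes effect for subsequent stop attempts
    clrs set -p Stop_timeout=600 smb1_zone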
Do these two grep searches confirm your question? If not, let me know where I can look.

root@mltproc1:~# grep -i pmfadm /opt/SUNWsczone/sczbt/bin/*
/opt/SUNWsczone/sczbt/bin/functions:    ${PMFADM} -s ${RESOURCEGROUP},${RESOURCE},0.svc
/opt/SUNWsczone/sczbt/bin/functions:    ${PMFADM} -s ${RESOURCEGROUP},${RESOURCE},0.svc KILL 2> /dev/null

root@mltproc1:~# grep -i hatimerun /opt/SUNWsczone/sczbt/bin/functions
        /usr/cluster/bin/hatimerun -t ${CLEAR_STOP_TIMEOUT} /opt/SUNWsczone/sczbt/bin/clear_zone ${Zonepath} ${RESOURCEGROUP} ${RESOURCE} >>${LOGFILE}

> So my suggestion would be to increase the stop_timeout of the zone's boot resource and see if it was just too small.
>
> The root cause is that the stop of the zone takes longer than expected.

The following entry is recorded in syslog at the same time as the sample I detailed:

Dec 14 09:55:57 mltproc1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <gds_svc_stop> completed successfully for resource <smb1_zone>, resource group <smb1_rg>, node <mltproc1>, time used: 4% of timeout <300 seconds>

I read this to mean that the resource smb1_zone reached stop successfully within 4% of 300 seconds. Given this, what STOP_TIMEOUT value do you think would make a difference? I have some entries in the log of stop successful with time used of 63% and 80% - is there some other variable of 'how much of STOP_TIMEOUT to use before failing' that I'm not seeing?

> Detlef
>
> Tundra Slosek wrote:
>>> I do not understand this as well; the only possibility is that the stop_timeout is exceeded, but then the status of the sczbt resource must be stop_failed.
>>
>> It is not the sczbt resource (in this case, smb1_zone) which is in stop_failed. It is the underlying HAStoragePlus resource (in this case, smb1_zpool) which is failed.
>>
>>> So we have to solve two questions: what is blocking the zfs umount? Is the stop_timeout exceeded? This must be reflected in /var/adm/messages of the node, with something like: "Function: stop_sczbt - Manual intervention needed for non-global zone". If the stop_timeout is exceeded, I would try to raise it, just a try, but question one needs to be resolved first.
>>
>> I have instrumented this in DTrace (perhaps incorrectly or incompletely, so I am open to suggestions on changes).
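For anyone wanting to reproduce this kind of tracing, the following is a minimal sketch of a D script that would emit umount2 and exec records of the shape quoted below; it is an assumption about the approach, not the actual script posted earlier in the thread:

    # Trace every umount2(2) entry/return and every exec, with caller identity,
    # to see which process unmounts which mountpoint and in what order.
    dtrace -qn '
    syscall::umount2:entry
    {
        printf("time:%d umount2-execname:%s mountpoint:%s flag:%d PID:%d ParentPID:%d\n",
            timestamp, execname, copyinstr(arg0), arg1, pid, ppid);
    }
    syscall::umount2:return
    {
        printf("time:%d umount2-execname:%s return arg0:%d PID:%d ParentPID:%d\n",
            timestamp, execname, arg0, pid, ppid);
    }
    syscall::exec*:entry
    {
        printf("time:%d exec-execname:%s target:%s PID:%d ParentPID:%d\n",
            timestamp, execname, copyinstr(arg0), pid, ppid);
    }'

With -q set, only the printf records appear, which matches the style of the excerpt that follows.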
>> My DTrace script and complete dumps are available earlier in this thread, however the relevant (as far as I can see) portions are as follows:
>>
>> time:225428757327280 umount2-execname:zoneadmd mountpoint:/smb1_pool0/smb1_zone/root/dev flag:0 PID:4873 ParentPID:1
>> time:225428762957138 umount2-execname:zoneadmd return arg0:0 PID:4873 ParentPID:1
>> time:225428766242637 exec-execname:zoneadmd target:/bin/sh PID:7316 ParentPID:4873
>> time:225428833735669 exec-execname:ksh93 target:/usr/sbin/umount PID:7329 ParentPID:7316
>> time:225428837002545 exec-execname:umount target:/usr/lib/fs/zfs/umount PID:7329 ParentPID:7316
>> time:225428847206624 umount2-execname:zfs mountpoint:/smb1_pool0/smb1_zone/root flag:0 PID:7329 ParentPID:7316
>> time:225432170675815 umount2-execname:hastorageplus_po mountpoint:/smb1_pool0/smb1_zone/root flag:1024 PID:7450 ParentPID:1179
>> time:225435468361047 umount2-execname:zfs return arg0:0 PID:7329 ParentPID:7316
>> time:225435468446546 umount2-execname:hastorageplus_po return arg0:-1 PID:7450 ParentPID:1179
>> time:225435475257693 umount2-execname:zoneadmd mountpoint:/var/run/zones/smb1.zoneadmd_door flag:0 PID:4873 ParentPID:1
>> time:225435483475900 umount2-execname:zoneadmd return arg0:0 PID:4873 ParentPID:1
>>
>> If I read this correctly, zoneadmd has finished stopping the named zone and is unmounting various mountpoints within the zone's tree (/smb1_pool0/smb1_zone/root/dev at the beginning).
>>
>> It then calls (indirectly) /usr/lib/fs/zfs/umount, which starts to umount2 /smb1_pool0/smb1_zone/root.
>>
>> Before that call to umount2 made by /usr/lib/fs/zfs/umount returns, however, hastorageplus_po tries to umount2 the same mountpoint (well, hastorageplus_po is trying to export the pool, but part of that is to umount2 all mounted zfs mountpoints recursively first).
>>
>> Then the zfs umount2 completes with success.
>>
>> Then the hastorageplus_po umount2 fails (this makes sense, in a very limited scope, as the mountpoint is gone after the call is made and before it completes)... which puts the resource named smb1_zpool into failed state.
>>
>> What I don't understand is why smb1_zpool (the resource that should have called hastorageplus_po) is beginning the 'stop' sequence when the zfs umount2 hasn't completed yet.
>>
>>> Detlef
>>>
>>> Tundra Slosek wrote:
>>>>> Hi Tundra,
>>>>>
>>>>> The reasoning behind it is that the root directory is a property of Solaris, and placing something in here might have some impact. It could have been that the zoneadm halt tried to unmount the root fs without success, because the gds is sitting on it.
>>>>
>>>> As a recap - sometimes stop (no matter the source) works correctly, sometimes it doesn't. When it doesn't, it is because zoneadm issues a zfs umount against the root directory and that is still lingering when the underlying zpool's hastorageplus tries to export the zpool. What I have noticed is that when the timing is right (i.e. zfs umount completes first), then the zpool export happens without the 'FORCE' flag set, but when the timing is wrong (and zfs umount has not yet completed), then the 'FORCE' flag is set on the zpool export (and it fails because the device is in use, and then immediately after, the zfs umount completes).
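As an aside on reading the flag column in the trace above: flag:0 is a plain unmount, while flag:1024 (0x400) should correspond to MS_FORCE, the forced-unmount flag used on the zpool export path. That can be checked against the system header; a sketch, assuming the usual header location:

    # Confirm what umount2 flag 1024 means on this system (0x400 == 1024)
    grep MS_FORCE /usr/include/sys/mount.h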
>>>> I do not understand why the hastorageplus begins its 'stop' before the zone is completely stopped - what seems to happen is that the zone stops, and then issues a zoneadm request to unmount the zonepath; however the gds returns to the rgm with success before zoneadm is actually finished.
>>>>
>>>>> Anyway, silly question: you do have a dependency between the sczbt resource and the HAStoragePlus resource?
>>>>
>>>> No question is silly. If I understand the output of clrs here, then the dependency is set.
>>>>
>>>> root@mltstore1:~# clrs show -v smb1_zone | grep smb1
>>>>   Resource:                 smb1_zone
>>>>                             smb1_rg
>>>>   Resource_dependencies:    smb1_lhname smb1_zpool
>>>>   Start_command:            /opt/SUNWsczone/sczbt/bin/start_sczbt -R smb1_zone -G smb1_rg -P /smb1_pool0/parameters
>>>>   Stop_command:             /opt/SUNWsczone/sczbt/bin/stop_sczbt -R smb1_zone -G smb1_rg -P /smb1_pool0/parameters
>>>>   Probe_command:            /opt/SUNWsczone/sczbt/bin/probe_sczbt -R smb1_zone -G smb1_rg -P /smb1_pool0/parameters
>>>>   Network_resources_used:   smb1_lhname
>>>>
>>>>> Tundra Slosek wrote:
>>>>>>> Hi Tundra,
>>>>>>>
>>>>>>> One thing which you should never do is move the parameter directory into the root file system for the zone. This is what might cause the headache, because the sczbt resource accesses the parameter directory and calls zoneadm halt, which tries to remove the mount, and this might not work.
>>>>>>>
>>>>>>> I would suggest to move the parameters directory to: /smb1_pool0/parameters
>>>>>>
>>>>>> I'm not sure I understand why a file open in /smb1_pool0/smb1_zone/parameters/ would prevent zfs unmounting of /smb1_pool0/smb1_zone/root, however it's easy enough to test; I don't see any harm in the suggested change and I remain open to the possibility that there is something fundamental I'm misunderstanding.
>>>>>>
>>>>>> Done (created the directory above, copied the existing contents of parameters and changed the clrs Start_command, Stop_command and Probe_command to point at /smb1_pool0/parameters instead of /smb1_pool0/smb1_zone/parameters), however the exact same behavior exists - i.e. overlap between zfs unmount of /smb1_pool0/smb1_zone/root and hastorageplus attempting to export the smb1_pool0 zpool. DTrace log available as per prior efforts, if anyone thinks it will be helpful, however it doesn't seem different to me.
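One way to watch that unmount window by hand, outside the cluster framework, is sketched below; this is an assumption about how to observe it manually (zone, mountpoint and pool names as used above), not a procedure from this thread, and it presumes the resources are disabled so the cluster is not also acting on the zone:

    # Halt the zone by hand, then poll /etc/mnttab to see how long the zone
    # root (and anything under it) stays mounted before zoneadmd's zfs umount
    # actually completes.
    zoneadm -z smb1 halt
    while grep -q '/smb1_pool0/smb1_zone/root' /etc/mnttab; do
        sleep 1
    done
    echo "zone root unmounted; an export of smb1_pool0 should no longer need -f"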
--
This message posted from opensolaris.org