One causes the other, I assume: the processes of the sczbt get killed because they are getting too close to their Stop_timeout. So question number one is the real root cause. I assume in your version of the sczbt there is a pmfadm -s <name> KILL, and the clearzone script is started with hatimerun and not with pmfadm.
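As an aside on the timeout point, checking and raising the timeout is a one-liner with clrs. A minimal sketch, assuming the resource names used in this thread (smb1_zone, smb1_zpool) and a purely illustrative value of 600 seconds:

```shell
# Show the current Stop_timeout of the sczbt (zone boot) resource.
clrs show -p Stop_timeout smb1_zone

# Raise it; 600 is only an example - choose a value comfortably above
# the worst zone shutdown time observed in /var/adm/messages.
clrs set -p Stop_timeout=600 smb1_zone

# The HAStoragePlus resource exporting the zpool has its own
# Stop_timeout and may need the same adjustment.
clrs set -p Stop_timeout=600 smb1_zpool
```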
So my suggestion would be to increase the Stop_timeout of the zone's boot resource and see if it was just too small. The root cause is that the stop of the zone takes longer than expected.

Detlef

Tundra Slosek wrote:
>> I do not understand this as well. The only possibility is that the
>> stop_timeout is exceeded, but then the status of the sczbt resource
>> must be stop_failed.
>
> It is not the sczbt resource (in this case, smb1_zone) which is in
> stop_failed. It is the underlying HAStoragePlus resource (in this case,
> smb1_zpool) which is failed.
>
>> So we have to solve two questions:
>> what is blocking the zfs umount?
>> is the stop_timeout exceeded? This must be reflected in
>> /var/adm/messages of the node, something like:
>> "Function: stop_sczbt - Manual intervention needed for non-global zone".
>> If the stop_timeout is exceeded, I would try to raise it, just a try,
>> but question one needs to be resolved first.
>
> I have instrumented this in DTrace (perhaps incorrectly or incompletely,
> so I am open to suggestions on changes).
> My DTrace script and complete dumps are available earlier in this
> thread; however, the relevant (as far as I can see) portions are as
> follows:
>
> time:225428757327280 umount2-execname:zoneadmd mountpoint:/smb1_pool0/smb1_zone/root/dev flag:0 PID:4873 ParentPID:1
> time:225428762957138 umount2-execname:zoneadmd return arg0:0 PID:4873 ParentPID:1
> time:225428766242637 exec-execname:zoneadmd target:/bin/sh PID:7316 ParentPID:4873
> time:225428833735669 exec-execname:ksh93 target:/usr/sbin/umount PID:7329 ParentPID:7316
> time:225428837002545 exec-execname:umount target:/usr/lib/fs/zfs/umount PID:7329 ParentPID:7316
> time:225428847206624 umount2-execname:zfs mountpoint:/smb1_pool0/smb1_zone/root flag:0 PID:7329 ParentPID:7316
> time:225432170675815 umount2-execname:hastorageplus_po mountpoint:/smb1_pool0/smb1_zone/root flag:1024 PID:7450 ParentPID:1179
> time:225435468361047 umount2-execname:zfs return arg0:0 PID:7329 ParentPID:7316
> time:225435468446546 umount2-execname:hastorageplus_po return arg0:-1 PID:7450 ParentPID:1179
> time:225435475257693 umount2-execname:zoneadmd mountpoint:/var/run/zones/smb1.zoneadmd_door flag:0 PID:4873 ParentPID:1
> time:225435483475900 umount2-execname:zoneadmd return arg0:0 PID:4873 ParentPID:1
>
> If I read this correctly, zoneadmd has finished stopping the named zone
> and is unmounting various mountpoints within the zone's tree
> (/smb1_pool0/smb1_zone/root/dev at the beginning).
>
> It then calls (indirectly) /usr/lib/fs/zfs/umount, which starts to
> umount2 /smb1_pool0/smb1_zone/root.
>
> Before that call to umount2 made by /usr/lib/fs/zfs/umount returns,
> however, hastorageplus_po tries to umount2 the same mountpoint (well,
> hastorageplus_po is trying to export the pool, but part of that is to
> umount2 all mounted zfs mountpoints recursively first).
> Then the zfs umount2 completes with success.
>
> Then the hastorageplus_po umount2 fails (this makes sense, in a very
> limited scope, as the mountpoint is gone after the call is made and
> before it completes)... which puts the resource named smb1_zpool into
> failed state.
>
> What I don't understand is why smb1_zpool (the resource that should
> have called hastorageplus_po) is beginning the 'stop' sequence when the
> zfs umount2 hasn't completed yet.
>
>> Detlef
>>
>> Tundra Slosek wrote:
>>>> Hi Tundra,
>>>>
>>>> The reasoning behind this is that the root directory is a property
>>>> of Solaris, and placing something in it might have some impact. It
>>>> could have been that the zoneadm halt tried to unmount the root fs
>>>> without success, because the gds is sitting on it.
>>>
>>> As a recap - sometimes stop (no matter the source) works correctly,
>>> sometimes it doesn't. When it doesn't, it is because zoneadm issues a
>>> zfs umount against the root directory and that is still lingering
>>> when the underlying zpool's hastorageplus tries to export the zpool.
>>> What I have noticed is that when the timing is right (i.e. zfs umount
>>> completes first), then the zpool export happens without the 'FORCE'
>>> flag set, but when the timing is wrong (and zfs umount has not yet
>>> completed), then the 'FORCE' flag is set on the zpool export (and it
>>> fails because the device is in use, and then immediately after, the
>>> zfs umount completes).
>>>
>>> I do not understand why the hastorageplus begins its 'stop' before
>>> the zone is completely stopped - what seems to happen is that the
>>> zone stops and then issues a zoneadm request to unmount the zonepath;
>>> however, the gds returns to the rgm with success before zoneadm is
>>> actually finished.
>>>
>>>> Anyway, silly question: you do have a dependency between the sczbt
>>>> resource and the HAStoragePlus resource?
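For reference, output in the format quoted above could be produced by a small DTrace script along these lines. This is my reconstruction, not necessarily the script actually used earlier in the thread:

```d
#!/usr/sbin/dtrace -qs
/* Log umount2(2) entry/return and process exec events with timestamps,
 * to observe the interleaving of zoneadmd, zfs and hastorageplus_po. */

syscall::umount2:entry
{
	printf("time:%d umount2-execname:%s mountpoint:%s flag:%d PID:%d ParentPID:%d\n",
	    timestamp, execname, copyinstr(arg0), arg1, pid, ppid);
}

syscall::umount2:return
{
	printf("time:%d umount2-execname:%s return arg0:%d PID:%d ParentPID:%d\n",
	    timestamp, execname, arg0, pid, ppid);
}

proc:::exec
{
	printf("time:%d exec-execname:%s target:%s PID:%d ParentPID:%d\n",
	    timestamp, execname, args[0], pid, ppid);
}
```

Note that flag:1024 in the trace corresponds to MS_FORCE (0x400), which matches the observation that the forced export path is taken when the timing is wrong.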
>>> No question is silly. If I understand the output of clrs here, then
>>> the dependency is set.
>>>
>>> root at mltstore1:~# clrs show -v smb1_zone | grep smb1
>>>   Resource:                smb1_zone
>>>   smb1_rg
>>>   Resource_dependencies:   smb1_lhname smb1_zpool
>>>   Start_command:           /opt/SUNWsczone/sczbt/bin/start_sczbt -R smb1_zone -G smb1_rg -P /smb1_pool0/parameters
>>>   Stop_command:            /opt/SUNWsczone/sczbt/bin/stop_sczbt -R smb1_zone -G smb1_rg -P /smb1_pool0/parameters
>>>   Probe_command:           /opt/SUNWsczone/sczbt/bin/probe_sczbt -R smb1_zone -G smb1_rg -P /smb1_pool0/parameters
>>>   Network_resources_used:  smb1_lhname
>>>
>>>> Tundra Slosek wrote:
>>>>>> Hi Tundra,
>>>>>>
>>>>>> One thing which you should never do is move the parameter
>>>>>> directory into the root file system for the zone. This is what
>>>>>> might cause the headache, because the sczbt resource accesses the
>>>>>> parameter directory and calls zoneadm halt, which tries to remove
>>>>>> the mount, and this might not work.
>>>>>>
>>>>>> I would suggest moving the parameters directory to:
>>>>>> /smb1_pool0/parameters
>>>>>
>>>>> I'm not sure I understand why a file open in
>>>>> /smb1_pool0/smb1_zone/parameters/ would prevent zfs unmounting of
>>>>> /smb1_pool0/smb1_zone/root; however, it's easy enough to test, I
>>>>> don't see any harm in the suggested change, and I remain open to
>>>>> the possibility that there is something fundamental I'm
>>>>> misunderstanding.
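For completeness, the dependency shown in the clrs output above can be inspected and adjusted like this. A sketch using the resource names from this thread, not verified on a live cluster:

```shell
# Inspect the dependency list of the sczbt resource.
clrs show -p Resource_dependencies smb1_zone

# Ensure the zone boot resource depends on both the logical hostname
# and the HAStoragePlus (zpool) resource. Note that clrs set replaces
# the whole property list, so repeat any existing dependencies.
clrs set -p Resource_dependencies=smb1_lhname,smb1_zpool smb1_zone
```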
>>>>> Done (created the directory above, copied the existing contents of
>>>>> parameters, and changed the clrs Start_command, Stop_command and
>>>>> Probe_command to point at /smb1_pool0/parameters instead of
>>>>> /smb1_pool0/smb1_zone/parameters); however, the exact same behavior
>>>>> exists - i.e. overlap between the zfs unmount of
>>>>> /smb1_pool0/smb1_zone/root and hastorageplus attempting to export
>>>>> the smb1_pool0 zpool. A DTrace log is available as per prior
>>>>> efforts, if anyone thinks it will be helpful; however, it doesn't
>>>>> seem different to me.

-- 
*****************************************************************************
Detlef Ulherr
Staff Engineer                             Tel: (++49 6103) 752-248
Availability Engineering                   Fax: (++49 6103) 752-167
Sun Microsystems GmbH
Amperestr. 6                               mailto:detlef.ulherr at sun.com
63225 Langen                               http://www.sun.de/
*****************************************************************************

Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht München: HRB 161028
Geschäftsführer: Thomas Schröder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Häring
*****************************************************************************

_______________________________________________
ha-clusters-discuss mailing list
ha-clusters-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss