Hi Armin,

Thanks for reproducing the problem. Can you check for any error messages from the HAStoragePlus resource?
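For the error-message check, grepping the system log on each node is one way to do it. A rough sketch (the real command on the nodes would be `grep -i hastorageplus /var/adm/messages`; here a sample log line from this thread stands in for the log file):

```shell
# Sketch: pull HAStoragePlus-related messages out of the system log.
# On the cluster nodes this would be:
#   grep -i 'hastorageplus' /var/adm/messages
# A sample line stands in for the real log here.
sample_log='Oct 27 16:09:10 voelsung Cluster.RGM.rgmd: [ID 904914 daemon.error] fatal: Aborting this node because method <hastorageplus_prenet_start> on resource <vb1-storage> for node <voelsung> is unkillable'

printf '%s\n' "$sample_log" | grep -i 'hastorageplus'
```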
If the problem can be reproduced within a few failovers, can you track each failover and provide more information from both nodes whenever the zpool shows as imported on both? For each failover, please check:
- What is the zpool status on the node the failover happened from?
- Did the failover itself fail?

Sorry to bother you by requesting so much information; we are worried because corruption is happening and data is being lost.

Thanks,
-Venku

On 10/27/08 20:53, Armin Ollig wrote:
> Hi Venku and all others,
>
> thanks for your suggestions.
> I wrote a script to do some I/O from both hosts (in non-cluster mode) to the
> FC LUNs in question and check the md5sums of all files afterwards. As
> expected there was no corruption.
>
> After recreating the cluster resource and a few failovers I found the HASP
> resource in this state, with the vb1 zfs concurrently mounted on *both* nodes:
>
> # clresource status vb1-storage
> === Cluster Resources ===
>
> Resource Name    Node Name    State     Status Message
> -------------    ---------    -----     --------------
> vb1-storage      siegfried    Offline   Offline
>                  voelsung     Starting  Unknown - Starting
>
> siegfried# zpool status
>   pool: vb1
>  state: ONLINE
>  scrub: none requested
> config:
>         NAME                                      STATE   READ WRITE CKSUM
>         vb1                                       ONLINE     0     0     0
>         c4t600D0230000000000088824BC4228807d0s0   ONLINE     0     0     0
>
> errors: No known data errors
>
> voelsung# zpool status
>   pool: vb1
>  state: ONLINE
>  scrub: none requested
> config:
>         NAME                                      STATE   READ WRITE CKSUM
>         vb1                                       ONLINE     0     0     0
>         c4t600D0230000000000088824BC4228807d0s0   ONLINE     0     0     0
>
> In this state filesystem corruption can occur easily.
> The zpool was created using the cluster-wide DID device:
> zpool create vb1 /dev/did/dsk/d12s0
>
> There was no FC path failure to the LUNs, and both interconnects are normal.
> After some minutes in this state a kernel panic is triggered and both
> nodes reboot.
>
> Oct 27 16:09:10 voelsung Cluster.RGM.fed: [ID 922870 daemon.error] tag
> vb1.vb1-storage.10: unable to kill process with SIGKILL
> Oct 27 16:09:10 voelsung Cluster.RGM.rgmd: [ID 904914 daemon.error] fatal:
> Aborting this node because method <hastorageplus_prenet_start> on resource
> <vb1-storage> for node <voelsung> is unkillable
>
> Best wishes,
> Armin
> --
> This message posted from opensolaris.org
> _______________________________________________
> ha-clusters-discuss mailing list
> ha-clusters-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss
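The double-import condition Armin shows above could also be detected mechanically at each failover. A rough sketch (the node and pool names are taken from this thread; gathering the "node:pool" lines from each node, e.g. via `ssh "$node" zpool list -H -o name`, is assumed to happen elsewhere):

```shell
# Detect a pool imported on more than one node at once.
# In practice the node:pool lines would be collected from each cluster
# node with something like:  ssh "$node" zpool list -H -o name
# Sample data below reproduces the state seen in the thread.
collected='siegfried:vb1
voelsung:vb1'

# A pool name appearing more than once means a concurrent import.
dups=$(printf '%s\n' "$collected" | cut -d: -f2 | sort | uniq -d)
if [ -n "$dups" ]; then
  echo "DOUBLE IMPORT detected for pool(s): $dups"   # prints: DOUBLE IMPORT detected for pool(s): vb1
fi
```

Run from a cron job or wrapped around each test failover, this would pinpoint the exact moment both nodes hold the pool.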