Hi Armin,

Thanks for reproducing the problem.
Can you check for any error messages from the HAStoragePlus resource?

If the problem can be reproduced within a few failovers, can you track 
each failover and collect more information at the point where the zpool 
shows as imported on both nodes?
For each failover, please check (a sketch of the commands is below):
- What does zpool status report on the node the failover happened from?
- Did the failover itself fail?
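
For example, something along these lines on each node right after a 
failover would help (vb1 and vb1-storage are taken from your mail below; 
adjust if the names differ):

  # clresource status vb1-storage
  # zpool status vb1
  # zpool list vb1
  # grep -i hastorageplus /var/adm/messages | tail -50

The /var/adm/messages entries from both nodes around the failover time 
would also be useful.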

Sorry to bother you by requesting so much information. We are worried 
because there is corruption happening and data is being lost.

Thanks
-Venku

On 10/27/08 20:53, Armin Ollig wrote:
> Hi Venku and all others,
> 
>  thanks for your suggestions. 
> I wrote a script to do some IO from both hosts (in non-cluster mode) to the 
> FC LUNs in question and check the md5sums of all files afterwards. As 
> expected, there was no corruption. A simplified sketch of the script is below.
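> 
> (File count, size and mount point here are illustrative, not necessarily 
> the exact values used; the idea is just write, checksum, compare.)
> 
> #!/bin/ksh
> # write a set of test files onto the LUN-backed filesystem
> MNT=/vb1/iotest
> mkdir -p $MNT
> i=1
> while [ $i -le 100 ]; do
>         dd if=/dev/urandom of=$MNT/file.$i bs=1024k count=16 2>/dev/null
>         i=$((i + 1))
> done
> # record the md5 sums; the lists are compared between the hosts afterwards
> digest -a md5 $MNT/file.* > /var/tmp/md5.`hostname`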
> 
> After recreating the cluster resource and a few failovers I found the HASP 
> resource in this state, with the vb1 pool concurrently imported on *both* 
> nodes:
> 
> # clresource status vb1-storage
> === Cluster Resources ===
> Resource Name      Node Name      State         Status Message
> -------------      ---------      -----         --------------
> vb1-storage        siegfried      Offline       Offline
>                     voelsung       Starting      Unknown - Starting
> 
> 
>  siegfried# zpool status
>   pool: vb1
>  state: ONLINE
>  scrub: none requested
> config:
>         NAME                                       STATE     READ WRITE CKSUM
>         vb1                                        ONLINE       0     0     0
>           c4t600D0230000000000088824BC4228807d0s0  ONLINE       0     0     0
> 
> errors: No known data errors
> 
> voelsung# zpool status
>   pool: vb1
>  state: ONLINE
>  scrub: none requested
> config:
>         NAME                                       STATE     READ WRITE CKSUM
>         vb1                                        ONLINE       0     0     0
>           c4t600D0230000000000088824BC4228807d0s0  ONLINE       0     0     0
> 
> 
> In this state filesystem corruption can occur easily. 
> The zpool was created using the cluster-wide DID device:
> zpool create vb1 /dev/did/dsk/d12s0
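> 
> For completeness, the pool is handed to the cluster in the standard 
> HAStoragePlus way, roughly like this (the resource group name here is 
> illustrative):
> 
> # clresourcetype register SUNW.HAStoragePlus
> # clresourcegroup create vb1-rg
> # clresource create -g vb1-rg -t SUNW.HAStoragePlus -p Zpools=vb1 vb1-storage
> # clresourcegroup online -M vb1-rg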
> 
> There was no FC path failure to the LUNs, and both cluster interconnects are 
> normal. After a few minutes in this state a kernel panic is triggered and 
> both nodes reboot.
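> 
> (The usual checks for that kind of problem are, for example,
> 
> # mpathadm list lu
> # cldevice status
> # clinterconnect status
> 
> though the exact commands run here may have differed.)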
> 
> Oct 27 16:09:10 voelsung Cluster.RGM.fed: [ID 922870 daemon.error] tag 
> vb1.vb1-storage.10: unable to kill process with SIGKILL
> Oct 27 16:09:10 voelsung Cluster.RGM.rgmd: [ID 904914 daemon.error] fatal: 
> Aborting this node because method <hastorageplus_prenet_start> on resource 
> <vb1-storage> for node <voelsung> is unkillable
>  
> 
> Best wishes,
>  Armin
> --
> This message posted from opensolaris.org
> _______________________________________________
> ha-clusters-discuss mailing list
> ha-clusters-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss
