Hi Armin,

Thanks for reproducing the problem.

I am not able to figure out any obvious reason why multiple nodes would
import the same ZFS pool.

One thing I want to mention is that HASP intentionally imports the pool
with an alternate root path (zpool import -R / <poolname>), so that the pool
configuration is not cached by Solaris. This prevents the node that imported
the pool from auto-importing it again when it boots.

This is especially important for avoiding the multiple-node import problem
during failovers.
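
As a rough illustration only (a sketch reusing the pool name from your setup,
not output from your nodes), the alternate-root import and a quick check that
the pool stayed out of the cache file look like this:

# zpool import -R / vb1
# strings /etc/zfs/zpool.cache | grep vb1

No output is expected from the second command: the pool is not recorded in
the cache file, so Solaris will not auto-import it on the next boot.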

Thanks & Regards
-Venku

On 10/29/08 19:04, Armin Ollig wrote:
> Hi Venku, 
> 
> thanks for your input. Meanwhile I cloned the systems with lucreate and 
> updated them to the ce 09/08 release. I removed and recreated the cluster. The 
> failure where both nodes mount the ZFS pool concurrently happens after I 
> trigger a failover of the HASP resource (by rebooting the active node). This 
> leads to a kernel panic; afterwards I manually import another pool and 
> reboot. Finally both nodes mount the HASP ZFS pool. Here are the commands and 
> logs involved:
> 
> Remove old HASP resource:
> 
> siegfried# clresource disable vb1-storage
> siegfried# clresource delete vb1-storage
> siegfried# clresourcegroup  delete vb1
> siegfried# zpool destroy vb1
> siegfried# zpool create UNUSED___ /dev/did/dsk/d5s0 # overwrite the zfs labels
> siegfried# rm /etc/zfs/zpool.cache 
> voelsung# rm /etc/zfs/zpool.cache 
> voelsung# scshutdown -y -g0 # end episode 1
> 
> 
> Re-create the resource:
> 
> siegfried# zpool create -f vb1 /dev/did/dsk/d5s0
> siegfried# zfs create vb1/vb1
> siegfried# zpool  status vb1
>   pool: vb1
>  state: ONLINE
>  scrub: none requested
> config:
>         NAME                 STATE     READ WRITE CKSUM
>         vb1                  ONLINE       0     0     0
>           /dev/did/dsk/d5s0  ONLINE       0     0     0
> errors: No known data errors
> siegfried# clresourcegroup create vb1
> siegfried# clresourcegroup manage vb1
> siegfried# clrg online vb1
> siegfried# clresource create -t SUNW.HAStoragePlus \
>> -g vb1 \
>> -p Zpools=vb1 \
>> -p AffinityOn=True vb1-storage 
> siegfried# clresource status vb1-storage
> === Cluster Resources ===
> Resource Name       Node Name      State        Status Message
> -------------       ---------      -----        --------------
> vb1-storage         voelsung       Online       Online
>                     siegfried      Offline      Offline
> 
> voelsung# mount| grep vb1
> /vb1 on vb1 read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90002 
> on Wed Oct 29 13:30:57 2008
> /vb1/vb1 on vb1/vb1 
> read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90003 on Wed Oct 29 
> 13:30:57 2008
> siegfried# mount| grep vb1
> siegfried#                              
> # ...no problem up to this point...
> 
> 
> Things go wrong (voelsung is rebooted, siegfried panics):
> voelsung# reboot 
> Oct 29 13:38:47 siegfried genunix: NOTICE: clcomm: Path siegfried:e1000g2 - 
> voelsung:e1000g2 being drained
> Oct 29 13:38:47 siegfried genunix: NOTICE: clcomm: Path siegfried:e1000g1 - 
> voelsung:e1000g1 being drained
> Oct 29 13:38:47 siegfried ip: TCP_IOC_ABORT_CONN: aborted 0 connection
> Oct 29 13:38:49 siegfried genunix: NOTICE: CMM: Node voelsung (nodeid = 1) is 
> down.
> Oct 29 13:38:49 siegfried genunix: NOTICE: CMM: Cluster members: siegfried.
> Oct 29 13:38:49 siegfried genunix: NOTICE: CMM: node reconfiguration #3 
> completed.
> Notifying cluster that this node is panicking
> 
> panic[cpu3]/thread=ffffff000f631c80: Reservation Conflict
> Disk: /scsi_vhci/disk@g600d02300000000000888275cd1cc500
> 
> ffffff000f631a00 sd:sd_panic_for_res_conflict+4f ()
> ffffff000f631a40 sd:sd_pkt_status_reservation_conflict+a8 ()
> ffffff000f631a90 sd:sdintr+44e ()
> ffffff000f631b30 scsi_vhci:vhci_intr+6ac ()
> ffffff000f631b50 fcp:fcp_post_callback+1e ()
> ffffff000f631b90 fcp:fcp_cmd_callback+4b ()
> ffffff000f631bd0 emlxs:emlxs_iodone+b1 ()
> ffffff000f631c20 emlxs:emlxs_iodone_server+15d ()
> ffffff000f631c60 emlxs:emlxs_thread+15e ()
> ffffff000f631c70 unix:thread_start+8 ()
> 
> syncing file systems... 6 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 done 
> (not all i/o completed)
> dumping to /dev/dsk/c4t600D0230000000000088824BC4228803d0s1, offset 
> 1719074816, content: kernel
> 100% done: 178044 pages dumped, compression ratio 5.30, dump succeeded
> rebooting...
> 
> I import the pool "dummypool" manually on siegfried (this zpool is not under 
> HASP control and is only ever imported manually on node siegfried):
> siegfried# zpool  import  dummypool   
> siegfried# zpool  status  dummypool                                           
>                                                                 
>   pool: dummypool
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME                                       STATE     READ WRITE CKSUM
>         dummypool                                  ONLINE       0     0     0
>           c4t600D02300000000000888275CD1CC500d0s0  ONLINE       0     0     0
> 
> errors: No known data errors
> siegfried# reboot  
> Kernel errors while booting siegfried:
> Hostname: siegfried
> WARNING: /scsi_vhci/disk@g600d02300000000000888275cd1cc500 (sd15):
>         reservation conflict
> WARNING: /scsi_vhci/disk@g600d0230000000000088824bc4228807 (sd13):
>         reservation conflict
> 
> Here, finally, is the error: both nodes mount the ZFS pool concurrently:
> siegfried# zpool  status     
>   pool: dummypool
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME                                       STATE     READ WRITE CKSUM
>         dummypool                                  ONLINE       0     0     0
>           c4t600D02300000000000888275CD1CC500d0s0  ONLINE       0     0     0
> 
> errors: No known data errors
> 
>   pool: vb1
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME                                       STATE     READ WRITE CKSUM
>         vb1                                        ONLINE       0     0     0
>           c4t600D0230000000000088824BC4228807d0s0  ONLINE       0     0     0
> 
> errors: No known data errors
> 
> voelsung# zpool  status 
>   pool: vb1
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME                                       STATE     READ WRITE CKSUM
>         vb1                                        ONLINE       0     0     0
>           c4t600D0230000000000088824BC4228807d0s0  ONLINE       0     0     0
> 
> errors: No known data errors
> voelsung# clresource status vb1-storage
> 
> === Cluster Resources ===
> 
> Resource Name       Node Name      State        Status Message
> -------------       ---------      -----        --------------
> vb1-storage         voelsung       Online       Online
>                     siegfried      Offline      Offline
> 
> 
> Best wishes,
>  Armin
