Hi Venku, 

thanks for your input. In the meantime I cloned the systems with lucreate and 
updated them to the CE 09/08 release. I removed and recreated the cluster. The 
failure where both nodes mount the ZFS pool concurrently happens after I 
trigger a failover of the HASP resource (by rebooting the active node). This 
leads to a kernel panic; afterwards I manually import another pool and reboot, 
and finally both nodes have the HASP ZFS mounted. Here are the commands and 
logs involved:

Remove old HASP resource:

siegfried# clresource disable vb1-storage
siegfried# clresource delete vb1-storage
siegfried# clresourcegroup  delete vb1
siegfried# zpool destroy vb1
siegfried# zpool create UNUSED___ /dev/did/dsk/d5s0 # overwrite the zfs labels
siegfried# rm /etc/zfs/zpool.cache 
voelsung# rm /etc/zfs/zpool.cache 
voelsung# scshutdown -y -g0 # end episode 1
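
(Before re-using the device I also double-check that the cache file is really
gone on both nodes and what labels the DID device now carries; the zdb label
dump is only a sanity check, and I am assuming here that zdb accepts the raw
DID path:)

siegfried# zdb -l /dev/did/rdsk/d5s0   # labels should now belong to UNUSED___, not vb1
siegfried# ls -l /etc/zfs/zpool.cache  # should no longer exist on either node
voelsung# ls -l /etc/zfs/zpool.cache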


Re-create the resource:

siegfried# zpool create -f vb1 /dev/did/dsk/d5s0
siegfried# zfs create vb1/vb1
siegfried# zpool  status vb1
  pool: vb1
 state: ONLINE
 scrub: none requested
config:
        NAME                 STATE     READ WRITE CKSUM
        vb1                  ONLINE       0     0     0
          /dev/did/dsk/d5s0  ONLINE       0     0     0
errors: No known data errors
siegfried# clresourcegroup create vb1
siegfried# clresourcegroup manage vb1
siegfried# clrg online vb1
siegfried# clresource create -t SUNW.HAStoragePlus \
> -g vb1 \
> -p Zpools=vb1 \
> -p AffinityOn=True vb1-storage 
siegfried# clresource status vb1-storage
=== Cluster Resources ===
Resource Name       Node Name      State        Status Message
-------------       ---------      -----        --------------
vb1-storage         voelsung       Online       Online
                    siegfried      Offline      Offline
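
(For reference, the Zpools extension property of the new resource can be
double-checked with the standard verbose show; the resource name is the one
created above:)

siegfried# clresource show -v vb1-storage | grep -i zpools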

voelsung# mount| grep vb1
/vb1 on vb1 read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90002 on Wed Oct 29 13:30:57 2008
/vb1/vb1 on vb1/vb1 read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=2d90003 on Wed Oct 29 13:30:57 2008
siegfried# mount| grep vb1
siegfried#                              
#....no problem until here...
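
(Side note: the failover could presumably also be triggered without a reboot,
via a controlled switchover of the resource group; this is the standard
clresourcegroup switch syntax with the node names of this setup. Below I
trigger it by rebooting the active node instead.)

siegfried# clresourcegroup switch -n siegfried vb1   # move the RG away from voelsung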


Things go wrong (voelsung is rebooted, siegfried panics):
voelsung# reboot 
Oct 29 13:38:47 siegfried genunix: NOTICE: clcomm: Path siegfried:e1000g2 - voelsung:e1000g2 being drained
Oct 29 13:38:47 siegfried genunix: NOTICE: clcomm: Path siegfried:e1000g1 - voelsung:e1000g1 being drained
Oct 29 13:38:47 siegfried ip: TCP_IOC_ABORT_CONN: aborted 0 connection
Oct 29 13:38:49 siegfried genunix: NOTICE: CMM: Node voelsung (nodeid = 1) is down.
Oct 29 13:38:49 siegfried genunix: NOTICE: CMM: Cluster members: siegfried.
Oct 29 13:38:49 siegfried genunix: NOTICE: CMM: node reconfiguration #3 completed.
Notifying cluster that this node is panicking

panic[cpu3]/thread=ffffff000f631c80: Reservation Conflict
Disk: /scsi_vhci/disk@g600d02300000000000888275cd1cc500

ffffff000f631a00 sd:sd_panic_for_res_conflict+4f ()
ffffff000f631a40 sd:sd_pkt_status_reservation_conflict+a8 ()
ffffff000f631a90 sd:sdintr+44e ()
ffffff000f631b30 scsi_vhci:vhci_intr+6ac ()
ffffff000f631b50 fcp:fcp_post_callback+1e ()
ffffff000f631b90 fcp:fcp_cmd_callback+4b ()
ffffff000f631bd0 emlxs:emlxs_iodone+b1 ()
ffffff000f631c20 emlxs:emlxs_iodone_server+15d ()
ffffff000f631c60 emlxs:emlxs_thread+15e ()
ffffff000f631c70 unix:thread_start+8 ()

syncing file systems... 6 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 done (not all i/o completed)
dumping to /dev/dsk/c4t600D0230000000000088824BC4228803d0s1, offset 1719074816, content: kernel
100% done: 178044 pages dumped, compression ratio 5.30, dump succeeded
rebooting...
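
(The disk named in the panic can be mapped back to its DID device and c#t#d#
path on both nodes by grepping the standard device listing for the GUID from
the panic message:)

siegfried# cldevice list -v | grep -i 600d02300000000000888275cd1cc500
voelsung# cldevice list -v | grep -i 600d02300000000000888275cd1cc500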

Next I import the pool "dummypool" manually on siegfried (this zpool is not 
under HASP control and is only ever imported manually, on node siegfried):
siegfried# zpool  import  dummypool   
siegfried# zpool  status  dummypool
  pool: dummypool
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        dummypool                                  ONLINE       0     0     0
          c4t600D02300000000000888275CD1CC500d0s0  ONLINE       0     0     0

errors: No known data errors
siegfried# reboot  
Kernel errors while booting siegfried:
Hostname: siegfried
WARNING: /scsi_vhci/disk@g600d02300000000000888275cd1cc500 (sd15):
        reservation conflict
WARNING: /scsi_vhci/disk@g600d0230000000000088824bc4228807 (sd13):
        reservation conflict

Here, finally, is the error: both nodes have the ZFS pool imported and mounted concurrently:
siegfried# zpool  status     
  pool: dummypool
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        dummypool                                  ONLINE       0     0     0
          c4t600D02300000000000888275CD1CC500d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: vb1
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        vb1                                        ONLINE       0     0     0
          c4t600D0230000000000088824BC4228807d0s0  ONLINE       0     0     0

errors: No known data errors

voelsung# zpool  status 
  pool: vb1
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        vb1                                        ONLINE       0     0     0
          c4t600D0230000000000088824BC4228807d0s0  ONLINE       0     0     0

errors: No known data errors
voelsung# clresource status vb1-storage

=== Cluster Resources ===

Resource Name       Node Name      State        Status Message
-------------       ---------      -----        --------------
vb1-storage         voelsung       Online       Online
                    siegfried      Offline      Offline
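
(For completeness: to see whether vb1 ended up in both nodes' zpool.cache,
which might explain why both import it at boot, and to clean up the double
import by hand while HASP keeps the resource online on voelsung, I would try
the following; the strings/grep on the cache file is only a quick-and-dirty
check:)

siegfried# strings /etc/zfs/zpool.cache | grep vb1
voelsung# strings /etc/zfs/zpool.cache | grep vb1
siegfried# zpool export vb1   # drop the stray import where the HASP resource is offline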


Best wishes,
 Armin
-- 
This message posted from opensolaris.org
