[removed zones-discuss after sending heads-up that the conversation
will continue at zfs-discuss]

On Mon, Jan 4, 2010 at 5:16 PM, Cindy Swearingen
<cindy.swearin...@sun.com> wrote:
> Hi Mike,
>
> It is difficult to comment on the root cause of this failure since
> the several interactions of these features are unknown. You might
> consider seeing how Ed's proposal plays out and let him do some more
> testing...

Unfortunately, last I heard, Ed's proposal is not funded.  Ops Center
uses many of the same mechanisms for putting zones on ZFS; that is
where I initially saw the problem.

> If you are interested in testing this with NFSv4 and it still fails
> the same way, then also consider testing this with a local file
> instead of a NFS-mounted file and let us know the results. I'm also
> unsure of using the same path for the pool and the zone root path,
> rather than one path for pool and a pool/dataset path for zone
> root path. I will test this myself if I get some time.

I have been unable to reproduce this with a local file, but I can
reproduce it with NFSv4 on build 130.  Rather surprisingly, the actual
checksums found in the ereports are sometimes "0x0 0x0 0x0 0x0" or
"0xbaddcafe00 ...".

Here's what I did:

- Install OpenSolaris build 130 (ldom on T5220)
- Mount some NFS space at /nfszone:
   mount -F nfs -o vers=4 $server:/path /nfszone
- Create a 10 GB sparse file
   cd /nfszone
   mkfile -n 10g root
- Create a zpool
   zpool create -m /zones/nfszone nfszone /nfszone/root
- Configure and install a zone
   zonecfg -z nfszone
    set zonepath = /zones/nfszone
    set autoboot = false
    verify
    commit
    exit
   chmod 700 /zones/nfszone
   zoneadm -z nfszone install

- Verify that the nfszone pool is clean.  First, pkg history in the
zone shows the timestamp of the last package operation:

  2010-01-07T20:27:07 install                   pkg             Succeeded

At 20:31 I ran:

# zpool status nfszone
  pool: nfszone
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        nfszone          ONLINE       0     0     0
          /nfszone/root  ONLINE       0     0     0

errors: No known data errors

I booted the zone.  By 20:32 it had accumulated 132 checksum errors:

 # zpool status nfszone
  pool: nfszone
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        nfszone          DEGRADED     0     0     0
          /nfszone/root  DEGRADED     0     0   132  too many errors

errors: No known data errors

fmdump has some very interesting things to say about the actual
checksums.  The 0x0 and 0xbaddcafe00 values seem to shout that these
checksum errors are not due to a couple of flipped bits:

# fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
   2    cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62
0x290cbce13fc59dce
   3    cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400
0x7e0aef335f0c7f00
   3    cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800
0xd4f1025a8e66fe00
   4    cksum_actual = 0x0 0x0 0x0 0x0
   4    cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900
0x330107da7c4bcec0
   5    cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73
0x4e0b3a8747b8a8
   6    cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00
0x280934efa6d20f40
   6    cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00
0x89715e34fbf9cdc0
  16    cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00
0x7f84b11b3fc7f80
  48    cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500
0x82804bc6ebcfc0
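Those repeated values look like checksums of well-known fill patterns
rather than of corrupted media.  0xbaddcafe is the pattern Solaris kmem
debugging uses for uninitialized memory.  As a sanity check (a sketch,
assuming the pool is using the default fletcher4 checksum: four 64-bit
accumulators updated once per 32-bit word), the 16-count entry above is
exactly fletcher4 of a 1 KiB block filled with 0xbaddcafe:

```python
# Sketch of ZFS's fletcher4: four 64-bit accumulators, one update per
# 32-bit word (256 words cannot overflow 64 bits, so masking at the
# end is equivalent to masking per step).
def fletcher4(words):
    a = b = c = d = 0
    for w in words:
        a += w   # running sum of words
        b += a   # running sum of a
        c += b   # running sum of b
        d += c   # running sum of c
    mask = (1 << 64) - 1
    return (a & mask, b & mask, c & mask, d & mask)

# 1 KiB = 256 32-bit words of the kmem uninitialized-memory pattern
print([hex(x) for x in fletcher4([0xbaddcafe] * 256)])
# -> ['0xbaddcafe00', '0x5dcc54647f00', '0x1f82a459c2aa00', '0x7f84b11b3fc7f80']
```

That matches the 16-count entry word for word, and fletcher4 of an
all-zero block is "0x0 0x0 0x0 0x0", which would fit reads that came
back as holes in the sparse backing file.  So the blocks being
checksummed look like uninitialized or zeroed memory, not bit flips
on the wire.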

I halted the zone, exported the pool, imported the pool, then did a
scrub.  Everything seemed to be OK:

# zpool export nfszone
# zpool import -d /nfszone nfszone
# zpool status nfszone
  pool: nfszone
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        nfszone          ONLINE       0     0     0
          /nfszone/root  ONLINE       0     0     0

errors: No known data errors
# zpool scrub nfszone
# zpool status nfszone
  pool: nfszone
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Thu Jan  7 21:56:47 2010
config:

        NAME             STATE     READ WRITE CKSUM
        nfszone          ONLINE       0     0     0
          /nfszone/root  ONLINE       0     0     0

errors: No known data errors

But then I booted the zone...

# zoneadm -z nfszone boot
# zpool status nfszone
  pool: nfszone
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Thu Jan  7 21:56:47 2010
config:

        NAME             STATE     READ WRITE CKSUM
        nfszone          ONLINE       0     0     0
          /nfszone/root  ONLINE       0     0   109

errors: No known data errors

I'm confused as to why this pool remains quite usable even with so
many checksum errors.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
