Re: a strange and terrible saga of the cursed iSCSI ZFS SAN

Eugene M. Zheganin Tue, 08 Aug 2017 00:24:08 -0700

On 05.08.2017 22:08, Eugene M. Zheganin wrote:

Hi,
I have a problem that I cannot solve by myself. I have an iSCSI ZFS SAN system that crashes, corrupting its data. I'll be brief and try to describe its history:
1) autumn 2016, SAN is set up: Supermicro server, external JBOD, SanDisk SSDs, several redundant pools, FreeBSD 11.x (probably release, don't really remember - see below).
2) this works just fine until early spring 2017

3) system starts to crash (various panics):

panic: general protection fault
panic: page fault
panic: Solaris(panic): zfs: allocating allocated segment(offset=6599069589504 size=81920)
panic: page fault
panic: page fault
panic: Solaris(panic): zfs: allocating allocated segment(offset=8245779054592 size=8192)
panic: page fault
panic: page fault
panic: page fault
panic: Solaris(panic): zfs: allocating allocated segment(offset=1792100934656 size=46080)
4) we memtested it immediately, no problems found.
5) we switch the SanDisks to Toshibas, we also switch the server to an identical one and the JBOD to an identical one, leaving the same cables.
6) crashes don't stop.
7) we found that field engineers physically damaged (sic!) the SATA cables (main one and spare ones), and that 90% of the disks show ICRC SMART errors.
8) we replaced the cable (brand new HP one).

9) ATA SMART errors stopped increasing.

10) crashes continue.
11) we decided that when ZFS was moved over damaged cables between JBODs it was probably damaged too, so now it's panicking because of that. So we wiped the data completely, reinitialized the SAN system and put it back into production. We even dd'ed each disk with zeroes (!) - just in case. Important note: the data was imported using zfs send from another, stable system that is running in production in another DC.
12) today we got another panic.

btw the pools look now like this:


# zpool status -v
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0    62
          raidz1-0  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0
            da11    ONLINE       0     0     0
          raidz1-2  ONLINE       0     0    62
            da12    ONLINE       0     0     0
            da13    ONLINE       0     0     0
            da14    ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        data/userdata/worker208:<0x1>

  pool: userdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME               STATE     READ WRITE CKSUM
        userdata           ONLINE       0     0  216K
          mirror-0         ONLINE       0     0  432K
            gpt/userdata0  ONLINE       0     0  432K
            gpt/userdata1  ONLINE       0     0  432K

errors: Permanent errors have been detected in the following files:

        userdata/worker36:<0x1>
        userdata/worker30:<0x1>
        userdata/worker31:<0x1>
        userdata/worker35:<0x1>
13) somewhere between steps 5 and 10 the pool became deduplicated (not directly connected to the problem, just for production reasons).
So, concluding: we had bad hardware, we replaced EACH piece (server, adapter, JBOD, cable, disks), and the crashes just don't stop. We have 5 other iSCSI SAN systems, almost fully identical, that don't crash. Crashes on this particular system began when it was running the same set of versions as the stable systems.
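
For anyone comparing several such boxes, the CKSUM column of `zpool status -v` shown above can be pulled out mechanically. The following is only a minimal sketch in Python, not something that was run on the affected system: the function name `devices_with_cksum_errors` and the regular expression are made up for illustration, and the parser assumes the plain NAME/STATE/READ/WRITE/CKSUM table layout seen in this message. Counters are kept as strings because zpool abbreviates large values (e.g. "216K").

```python
import re

# One row of the "config:" table: NAME STATE READ WRITE CKSUM.
# The STATE alternatives are an assumption covering the common vdev states;
# the header row does not match because "STATE" is not one of them.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)

def devices_with_cksum_errors(status_text):
    """Return (pool, device, cksum) for every config row whose CKSUM is not "0"."""
    pool = None
    hits = []
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            # Remember which pool the following rows belong to.
            pool = stripped.split(":", 1)[1].strip()
            continue
        m = ROW.match(line)
        if m and m.group(5) != "0":
            hits.append((pool, m.group(1), m.group(5)))
    return hits

# Excerpt of the output quoted in this message.
SAMPLE = """\
  pool: userdata
 state: ONLINE
config:

        NAME               STATE     READ WRITE CKSUM
        userdata           ONLINE       0     0  216K
          mirror-0         ONLINE       0     0  432K
            gpt/userdata0  ONLINE       0     0  432K
            gpt/userdata1  ONLINE       0     0  432K
"""

if __name__ == "__main__":
    for pool, dev, cksum in devices_with_cksum_errors(SAMPLE):
        print(pool, dev, cksum)
```

Fed the full output above, the same function would list only `data` and `raidz1-2` for the first pool, since the individual disks there show zero checksum errors.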

So far my leading theory is that something was broken in the iSCSI+ZFS stack somewhere between r310734 (most recent version on my SAN systems that works) and r320056 (probably earlier, but r320056 is the first revision with a documented crash).

So I downgraded back to r310734 (from 11.1-RELEASE, which is affected, if I'm right).


Some things speak in favor of this theory:

- the system was stable pre-spring 2017, before the upgrade happened

- ZFS corruption happens _only_ on the pools that iSCSI is serving from; no corruption happens on the ZFS pools that have nothing to do with providing zvols as iSCSI targets (and this seems to be the most convincing point).

- the faulty hardware was changed. It was changed to identical hardware, BUT I have the very same set of identical hardware working in an almost identical environment under r310734 in another DC.

So far I'm not sure, because only 20 hours have passed since the downgrade. However, if the system stays stable for more than a week (it was never stable that long on recent revisions), it will prove I'm right and I'll file the PR.
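
To give a sense of how wide the suspect window is, here is a rough back-of-the-envelope sketch in Python. It assumes a plain binary search over every revision between the last known-good and first known-bad one, and one week of soak time per tested build (taken from the "stable for more than a week" criterion above). Both are assumptions for illustration; in practice only a fraction of those commits touch the iSCSI or ZFS code, so the real search space would be smaller.

```python
GOOD_REV = 310734      # most recent revision reported as working
BAD_REV = 320056       # first revision with a documented crash
DAYS_PER_TEST = 7      # assumption: one week of uptime counts as "stable"

def bisect_steps(good, bad):
    """Worst-case number of builds a binary search must test to find the
    first bad revision in the range (good, bad]."""
    candidates = bad - good            # revisions that could be the first bad one
    # ceil(log2(candidates)) computed with integers only
    return (candidates - 1).bit_length()

if __name__ == "__main__":
    steps = bisect_steps(GOOD_REV, BAD_REV)
    print("candidate revisions:", BAD_REV - GOOD_REV)
    print("worst-case builds to test:", steps)
    print("worst-case days of soak testing:", steps * DAYS_PER_TEST)
```

Under these assumptions a full bisection would take on the order of months, which is why confirming the downgrade first is the cheaper test.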



Thanks.

Eugene.

_______________________________________________
[email protected] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[email protected]"
