Re: [discuss] Again: illumos based ZFS storage failure

George Wilson Fri, 23 Nov 2012 05:49:53 -0800

It's possible that the adaptec pool has suspended because of some error on the storage. Can you run the following as root and provide the output:


echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k


- George

On 11/23/12 5:43 AM, Gabriele Bulfon wrote:

Hi, I got the same problem this morning.
Thanks to Alasdair's suggestion, I commented out the zfs "quota" command in /etc/profile, so I
could enter the bash prompt and investigate.

As a summary of the problem:
- 3 zfs pools (rpool, areca, adaptec), each on a different controller: rpool as a mirror of internal disks, areca as raidz on 7 sata disks of the areca controller plus half the space of SD as zlog, adaptec as raidz on 8 disks of the adaptec controller plus half the space of SD as zlog.
- areca space is used for NFS sharing to unix servers, and always responds.
- adaptec space is used for CIFS sharing and an iSCSI volume for the Windows PDC.

Then we have a vmware server with an iSCSI resource store, given to the virtualized PDC as a secondary disk for sqlserver data and some more. The PDC boots directly from the vmware server disks.
All at once, both CIFS from the storage and the PDC iSCSI disk stop responding.
CIFS probably fails because the PDC AD is not responding, probably because it is busy checking the iSCSI disk in a loop.
Going into the storage bash:

- zpool status shows everything fine, every pool is correct.
- /var/adm/messages shows only smbd timeouts with the PDC, no hardware or zfs problem.
- at the time of failure, fmdump -eVvm showed the same previously found errors, from 3 days earlier
- after rebooting all the infrastructure, fmdump -eVvm showed the same previously found errors around the time of rebooting the storage, not at the time of the experienced failures. We find one such error for each disk of the adaptec controller (cut & paste at the end)
- a zfs list areca/* showed all the areca filesystems
- a zfs list adaptec blocked, never returning
- any access to the zfs structure of the adaptec would block
- as suggested by Alasdair, I ran savecore -L (I checked that I had a dump device with enough space)
- the savecore command ran for some time until reaching 100%, then blocked, never returning.
- I could not "init 5" the storage, it never returned
- I tried sending the poweroff signal with the power button; the console showed its intention to power off, but it never did
- I forced a power off via the power button.
- Once everything was powered on again, everything ran fine.
- I looked for the dump into /var/crash, but I had no /var/crash at all.

How can I investigate this problem further?
Do I have any chance to find the savecore output in the dump device, even if I have no /var/crash?
Here is the fmdump output:

Nov 23 2012 09:25:04.422282821 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
class = ereport.io.scsi.cmd.disk.dev.uderr
ena = 0x2b08d604c300001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3595@2/pci8086,370@0/pci9005,2bc@e/disk@7,0
devid = id1,sd@TAdaptec_3805____________8366BAF3
(end detector)

devid = id1,sd@TAdaptec_3805____________8366BAF3
driver-assessment = fail
op-code = 0x1a
cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
pkt-reason = 0x0
pkt-state = 0x1f
pkt-stats = 0x0
stat-code = 0x0
un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
__ttl = 0x1
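
For reading the ereport above, the cdb field can be decoded by hand. This is a minimal Python sketch, assuming the standard SCSI MODE SENSE(6) layout (opcode 0x1a, page code in the low six bits of byte 2, allocation length in byte 4). The function name decode_mode_sense6 is made up for illustration and is not part of any illumos tool.

```python
def decode_mode_sense6(cdb_text):
    """Decode a 6-byte MODE SENSE CDB given as space-separated hex bytes."""
    cdb = [int(b, 16) for b in cdb_text.split()]
    if len(cdb) != 6 or cdb[0] != 0x1A:
        raise ValueError("not a MODE SENSE(6) CDB")
    return {
        "dbd": bool(cdb[1] & 0x08),     # disable block descriptors bit
        "page_control": cdb[2] >> 6,    # 0 means "current values"
        "page_code": cdb[2] & 0x3F,     # 0x08 is the caching mode page
        "subpage_code": cdb[3],
        "allocation_length": cdb[4],    # bytes the initiator will accept
    }


# The cdb line copied from the fmdump output above.
fields = decode_mode_sense6("0x1a 0x0 0x8 0x0 0x18 0x0")
print(fields)
```

For this cdb the request is for page code 0x08 (the caching mode page) with a 24-byte allocation length. That is consistent with the sd_get_write_cache_enabled message, and suggests the driver asked for the caching page and got a reply whose page code did not match.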





----------------------------------------------------------------------------------

From: Alasdair Lumsden <[email protected]>
To: [email protected]
Date: 10 November 2012 14:37:20 CET
Subject: Re: [discuss] illumos based ZFS storage failure

    I haven't read the whole thread, but the next time it happens you'll
    want to invoke a panic and make the dump file available. You'll want
    to ensure that:

    1. Multithreaded dump is disabled in /etc/system with:

    * Disable MT dump
    set dump_plat_mincpu=0

    Without this there is a risk of your dump not saving correctly.

    2. That you have a dump device and that it's big enough to capture
    your kernel size (zfs set volsize=X rpool/dump)

    3. That dumpadm is happy and set to save cores etc:

    dumpadm -y -z on -c kernel -d /dev/zvol/dsk/rpool/dump

    There's lots of good info here:

    http://wiki.illumos.org/display/illumos/How+To+Report+Problems

    You can also inspect things with mdb while the system is up, but if
    it's a production system normally you want to get it rebooted and
    into production again ASAP. So in that situation, you can take a
    dump of the running system with:

    savecore -L

    One thing to keep in mind is /etc/profile runs /usr/sbin/quota, which
    can screw over logins when the zfs subsystem is unhappy. I really
    think it should be removed by default since on most systems quotas
    aren't even used. So comment it out - we do so on all our systems.
    This will give you a better chance of logging in when things go wrong.

    I think there's a way to SSH in bypassing /etc/profile but I can't
    remember what it is - perhaps someone can chime in.

    Good luck. Centralised storage is difficult to do and when it goes
    wrong everything that depends on it goes down. It's a "all your eggs
    in one giant failbasket". Doing it homebrew with ZFS is cost effective
    and can be fast, but it is also risky. This is why there are companies
    like Nexenta out there with certified combinations of hardware and
    software engineered to work together. This extends to validating
    firmware combinations of disks/HBAs/etc.

    Cheers,

    Alasdair


    -------------------------------------------
    illumos-discuss
    Archives: https://www.listbox.com/member/archive/182180/=now
    RSS Feed:
    https://www.listbox.com/member/archive/rss/182180/21175541-02f10c6f
    Modify Your Subscription: https://www.listbox.com/member/?&;
    Powered by Listbox: http://www.listbox.com




