Mentioned a while ago on this list: someone who had a _little_ trouble with Adaptec cards ...
http://marc.info/?l=openbsd-misc&m=125783114503531&w=2
http://marc.info/?l=openbsd-misc&m=126775051500581&w=2

It doesn't help the discussion much, other than being one more reason we don't use Adaptec any more.

Jon

On 27 November 2012 14:10, George Wilson <[email protected]> wrote:
> Was this data you provided below from a time when the server was hung?
> If not, try running this the next time you see the issue.
>
> - George
>
> On 11/23/12 9:44 AM, Gabriele Bulfon wrote:
>
> Hi, here is the output. It looks sane:
>
> sonicle@xstorage:~# echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
> spa_name = [ "adaptec" ]
> spa_suspended = 0
> spa_name = [ "areca" ]
> spa_suspended = 0
> spa_name = [ "rpool1" ]
> spa_suspended = 0
>
> ------------------------------
>
> From: George Wilson <[email protected]>
> To: [email protected]
> Cc: Gabriele Bulfon <[email protected]>
> Date: 23 November 2012 14:46:55 CET
> Subject: Re: [discuss] Again: illumos based ZFS storage failure
>
> It's possible that the adaptec pool has suspended because of some error
> on the storage. Can you run the following as root and provide the output:
>
> echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
>
> - George
>
> On 11/23/12 5:43 AM, Gabriele Bulfon wrote:
>
> Hi, I got the same problem this morning.
> Thanks to Alasdair's suggestion, I commented out the zfs "quota" command
> in /etc/profile, so I could get to the bash prompt and investigate.
>
> As a summary of the problem:
> - 3 zfs pools (rpool, areca, adaptec), each on a different controller:
>   rpool is a mirror of internal disks; areca is a raidz on 7 SATA disks of
>   the Areca controller plus half the space of the SD as zlog; adaptec is a
>   raidz on 8 disks of the Adaptec controller plus the other half of the SD
>   as zlog.
> - The areca space is used for NFS sharing to unix servers, and always
>   responds.
> - The adaptec space is used for CIFS sharing and an iSCSI volume for the
>   Windows PDC.
>
> Then we have a vmware server with an iSCSI resource store, given to the
> virtualized PDC as a secondary disk for SQL Server data and some more.
> The PDC boots directly from the vmware server's disks.
>
> All at once, both CIFS from the storage and the PDC iSCSI disk stop
> responding. CIFS probably fails because the PDC AD is not responding,
> itself probably busy checking the iSCSI disk in a loop.
>
> Going into the storage bash:
>
> - zpool status shows everything fine; every pool is correct.
> - /var/adm/messages shows only smbd timeouts with the PDC, no hardware or
>   zfs problem.
> - At the time of failure, fmdump -eVvm showed only the same previously
>   seen errors, from 3 days earlier.
> - After rebooting the whole infrastructure, fmdump -eVvm showed the same
>   previously seen errors around the time the storage was rebooted, not at
>   the time of the experienced failures. We found one such error for each
>   disk of the Adaptec controller (cut & paste at the end).
> - "zfs list areca/*" showed all the areca filesystems.
> - "zfs list adaptec" blocked and never returned.
> - Any access to the zfs structure of the adaptec pool would block.
> - As suggested by Alasdair, I ran savecore -L (I checked that I have a
>   dump device with enough space).
> - The savecore command ran for some time until reaching 100%, then blocked
>   and never returned.
> - "init 5" on the storage never returned.
> - I tried sending the poweroff signal with the power button; the console
>   showed its intention to power off, but never did.
> - I forced power off via the power button.
> - Once everything was powered on again, everything ran fine.
> - I looked for the dump in /var/crash, but I had no /var/crash at all.
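The checks in that list can be gathered in one pass the next time the hang appears, with a timeout around each command so that a wedged pool cannot block the whole collection. The following is only a minimal sketch: the script name, the output directory under /var/tmp and the 60-second timeout are arbitrary choices, not part of the setup described above, and it has to run as root for mdb -k.

    #!/bin/sh
    # collect-zfs-state.sh -- hypothetical helper to gather diagnostics
    # while a pool is hung. Paths and the 60-second timeout are arbitrary.

    OUT=/var/tmp/zfs-hang.$(date +%Y%m%d-%H%M%S)
    mkdir -p "$OUT"

    # Run a command with a crude timeout, so a command that never returns
    # (like "zfs list" on the hung pool above) cannot stall the collection.
    run() {
        name=$1; shift
        ( "$@" > "$OUT/$name" 2>&1 ) &
        pid=$!
        ( sleep 60; kill "$pid" 2>/dev/null ) &
        watchdog=$!
        wait "$pid"
        kill "$watchdog" 2>/dev/null
    }

    # Is any pool suspended? (the mdb check George asked for)
    run spa-suspended sh -c \
        'echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k'

    run zpool-status zpool status -v
    run fmdump       fmdump -eVvm
    run zfs-list     zfs list
    run messages     tail -200 /var/adm/messages

    echo "diagnostics collected in $OUT"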
>
> How can I investigate this problem further?
> Do I have any chance of finding the savecore output in the dump device,
> even though I have no /var/crash?
>
> Here is the fmdump output:
>
> Nov 23 2012 09:25:04.422282821 ereport.io.scsi.cmd.disk.dev.uderr
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.dev.uderr
>         ena = 0x2b08d604c300001
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,3595@2/pci8086,370@0/pci9005,2bc@e/disk@7,0
>                 devid = id1,sd@TAdaptec_3805____________8366BAF3
>         (end detector)
>
>         devid = id1,sd@TAdaptec_3805____________8366BAF3
>         driver-assessment = fail
>         op-code = 0x1a
>         cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
>         pkt-reason = 0x0
>         pkt-state = 0x1f
>         pkt-stats = 0x0
>         stat-code = 0x0
>         un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
>         __ttl = 0x1
>
> ----------------------------------------------------------------------------------
>
> From: Alasdair Lumsden <[email protected]>
> To: [email protected]
> Date: 10 November 2012 14:37:20 CET
> Subject: Re: [discuss] illumos based ZFS storage failure
>
> I haven't read the whole thread, but the next time it happens you'll
> want to invoke a panic and make the dump file available. You'll want to
> ensure that:
>
> 1. Multithreaded dump is disabled in /etc/system with:
>
>    * Disable MT dump
>    set dump_plat_mincpu=0
>
>    Without this there is a risk of your dump not saving correctly.
>
> 2. You have a dump device and it's big enough to capture your kernel
>    size (zfs set volsize=X rpool/dump).
>
> 3. dumpadm is happy and set to save cores etc:
>
>    dumpadm -y -z on -c kernel -d /dev/zvol/dsk/rpool/dump
>
> There's lots of good info here:
>
> http://wiki.illumos.org/display/illumos/How+To+Report+Problems
>
> You can also inspect things with mdb while the system is up, but if it's
> a production system you normally want to get it rebooted and back into
> production ASAP. In that situation, you can take a dump of the running
> system with:
>
> savecore -L
>
> One thing to keep in mind is that /etc/profile runs /usr/sbin/quota,
> which can break logins when the zfs subsystem is unhappy. I really think
> it should be removed by default, since on most systems quotas aren't even
> used. So comment it out, as we do on all our systems. This will give you
> a better chance of logging in when things go wrong.
>
> I think there's a way to SSH in bypassing /etc/profile, but I can't
> remember what it is. Perhaps someone can chime in.
>
> Good luck. Centralised storage is difficult to do, and when it goes wrong,
> everything that depends on it goes down: all your eggs in one giant
> failbasket. Doing it homebrew with ZFS is cost effective and can be fast,
> but it is also risky. This is why there are companies like Nexenta out
> there with certified combinations of hardware and software engineered to
> work together, extending to validated firmware combinations of
> disks/HBAs/etc.
>
> Cheers,
>
> Alasdair
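Rolled together, the dump preparation Alasdair describes amounts to roughly the following root session. This is only a sketch of the steps quoted above, not a tested procedure: the 16G volsize is a placeholder for the "X" above (size it to hold your kernel memory), and the /etc/system change takes effect at the next boot.

    # 1. Disable multithreaded dump (append to /etc/system; needs a reboot).
    cat >> /etc/system <<'EOF'
    * Disable MT dump
    set dump_plat_mincpu=0
    EOF

    # 2. Check the dump zvol and make sure it is big enough.
    zfs get volsize rpool/dump
    zfs set volsize=16G rpool/dump        # placeholder size, adjust to fit

    # 3. Point dumpadm at it: compress, dump kernel pages, save cores on boot.
    dumpadm -y -z on -c kernel -d /dev/zvol/dsk/rpool/dump

    # Later, if the system is hung but still lets you log in, a dump of the
    # live system can be taken with:
    savecore -L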
