Mentioned a while ago on this list: someone who had a _little_ trouble with Adaptec cards ...
http://marc.info/?l=openbsd-misc&m=125783114503531&w=2
http://marc.info/?l=openbsd-misc&m=126775051500581&w=2

It doesn't help the discussion much, other than being one more reason we don't use Adaptec any more.

Jon

On 27 November 2012 14:10, George Wilson <[email protected]> wrote:
> Was this data you provided below from a time when the server was hung?
> If not, try running this the next time you see the issue.
>
> - George
>
> On 11/23/12 9:44 AM, Gabriele Bulfon wrote:
>
> Hi, here is the output. It looks sane:
>
> sonicle@xstorage:~# echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
> spa_name = [ "adaptec" ]
> spa_suspended = 0
> spa_name = [ "areca" ]
> spa_suspended = 0
> spa_name = [ "rpool1" ]
> spa_suspended = 0
>
> ------------------------------
>
> From: George Wilson <[email protected]>
> To: [email protected]
> Cc: Gabriele Bulfon <[email protected]>
> Date: 23 November 2012 14:46:55 CET
> Subject: Re: [discuss] Again: illumos based ZFS storage failure
>
> It's possible that the adaptec pool has suspended because of some error
> on the storage. Can you run the following as root and provide the output:
>
> echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
>
> - George
>
> On 11/23/12 5:43 AM, Gabriele Bulfon wrote:
>
> Hi, I got the same problem this morning.
> Thanks to Alasdair's suggestion, I commented out the zfs "quota" command
> in /etc/profile, so I could get to the bash prompt and investigate.
>
> As a summary of the problem:
> - 3 zfs pools (rpool, areca, adaptec), each on a different controller:
>   rpool is a mirror of internal disks; areca is a raidz on 7 SATA disks of
>   the Areca controller plus half the space of the SD as zlog; adaptec is a
>   raidz on 8 disks of the Adaptec controller plus the other half of the SD
>   as zlog.
> - The areca space is used for NFS sharing to unix servers, and always
>   responds.
> - The adaptec space is used for CIFS sharing and an iSCSI volume for the
>   Windows PDC.
>
> Then we have a vmware server with an iSCSI resource store, given to the
> virtualized PDC as a secondary disk for SQL Server data and some more.
> The PDC boots directly from the vmware server's disks.
>
> All at once, both CIFS from the storage and the PDC iSCSI disk stop
> responding. CIFS probably fails because the PDC AD is not responding,
> itself probably busy checking the iSCSI disk in a loop.
>
> Going into the storage bash:
>
> - zpool status shows everything fine; every pool is correct.
> - /var/adm/messages shows only smbd timeouts with the PDC, no hardware or
>   zfs problem.
> - At the time of failure, fmdump -eVvm showed only the same previously
>   seen errors, from 3 days earlier.
> - After rebooting the whole infrastructure, fmdump -eVvm showed the same
>   previously seen errors around the time the storage was rebooted, not at
>   the time of the experienced failures. We found one such error for each
>   disk of the Adaptec controller (cut & paste at the end).
> - "zfs list areca/*" showed all the areca filesystems.
> - "zfs list adaptec" blocked and never returned.
> - Any access to the zfs structure of the adaptec pool would block.
> - As suggested by Alasdair, I ran savecore -L (I checked that I have a
>   dump device with enough space).
> - The savecore command ran for some time until reaching 100%, then blocked
>   and never returned.
> - "init 5" on the storage never returned.
> - I tried sending the poweroff signal with the power button; the console
>   showed its intention to power off, but never did.
> - I forced power off via the power button.
> - Once everything was powered on again, everything ran fine.
> - I looked for the dump in /var/crash, but I had no /var/crash at all.
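The checks in that list can be gathered in one pass the next time the hang appears, with a timeout around each command so that a wedged pool cannot block the whole collection. The following is only a minimal sketch: the script name, the output directory under /var/tmp and the 60-second timeout are arbitrary choices, not part of the setup described above, and it has to run as root for mdb -k.

    #!/bin/sh
    # collect-zfs-state.sh -- hypothetical helper to gather diagnostics
    # while a pool is hung. Paths and the 60-second timeout are arbitrary.

    OUT=/var/tmp/zfs-hang.$(date +%Y%m%d-%H%M%S)
    mkdir -p "$OUT"

    # Run a command with a crude timeout, so a command that never returns
    # (like "zfs list" on the hung pool above) cannot stall the collection.
    run() {
        name=$1; shift
        ( "$@" > "$OUT/$name" 2>&1 ) &
        pid=$!
        ( sleep 60; kill "$pid" 2>/dev/null ) &
        watchdog=$!
        wait "$pid"
        kill "$watchdog" 2>/dev/null
    }

    # Is any pool suspended? (the mdb check George asked for)
    run spa-suspended sh -c \
        'echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k'

    run zpool-status zpool status -v
    run fmdump       fmdump -eVvm
    run zfs-list     zfs list
    run messages     tail -200 /var/adm/messages

    echo "diagnostics collected in $OUT"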
>
> How can I investigate this problem further?
> Do I have any chance of finding the savecore output in the dump device,
> even though I have no /var/crash?
>
> Here is the fmdump output:
>
> Nov 23 2012 09:25:04.422282821 ereport.io.scsi.cmd.disk.dev.uderr
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.dev.uderr
>         ena = 0x2b08d604c300001
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,3595@2/pci8086,370@0/pci9005,2bc@e/disk@7,0
>                 devid = id1,sd@TAdaptec_3805____________8366BAF3
>         (end detector)
>
>         devid = id1,sd@TAdaptec_3805____________8366BAF3
>         driver-assessment = fail
>         op-code = 0x1a
>         cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
>         pkt-reason = 0x0
>         pkt-state = 0x1f
>         pkt-stats = 0x0
>         stat-code = 0x0
>         un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
>         __ttl = 0x1
>
> ----------------------------------------------------------------------------------
>
> From: Alasdair Lumsden <[email protected]>
> To: [email protected]
> Date: 10 November 2012 14:37:20 CET
> Subject: Re: [discuss] illumos based ZFS storage failure
>
> I haven't read the whole thread, but the next time it happens you'll
> want to invoke a panic and make the dump file available. You'll want to
> ensure that:
>
> 1. Multithreaded dump is disabled in /etc/system with:
>
>    * Disable MT dump
>    set dump_plat_mincpu=0
>
>    Without this there is a risk of your dump not saving correctly.
>
> 2. You have a dump device and it's big enough to capture your kernel
>    size (zfs set volsize=X rpool/dump).
>
> 3. dumpadm is happy and set to save cores etc:
>
>    dumpadm -y -z on -c kernel -d /dev/zvol/dsk/rpool/dump
>
> There's lots of good info here:
>
> http://wiki.illumos.org/display/illumos/How+To+Report+Problems
>
> You can also inspect things with mdb while the system is up, but if it's
> a production system you normally want to get it rebooted and back into
> production ASAP. In that situation, you can take a dump of the running
> system with:
>
> savecore -L
>
> One thing to keep in mind is that /etc/profile runs /usr/sbin/quota,
> which can break logins when the zfs subsystem is unhappy. I really think
> it should be removed by default, since on most systems quotas aren't even
> used. So comment it out, as we do on all our systems. This will give you
> a better chance of logging in when things go wrong.
>
> I think there's a way to SSH in bypassing /etc/profile, but I can't
> remember what it is. Perhaps someone can chime in.
>
> Good luck. Centralised storage is difficult to do, and when it goes wrong,
> everything that depends on it goes down: all your eggs in one giant
> failbasket. Doing it homebrew with ZFS is cost effective and can be fast,
> but it is also risky. This is why there are companies like Nexenta out
> there with certified combinations of hardware and software engineered to
> work together, extending to validated firmware combinations of
> disks/HBAs/etc.
>
> Cheers,
>
> Alasdair
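Rolled together, the dump preparation Alasdair describes amounts to roughly the following root session. This is only a sketch of the steps quoted above, not a tested procedure: the 16G volsize is a placeholder for the "X" above (size it to hold your kernel memory), and the /etc/system change takes effect at the next boot.

    # 1. Disable multithreaded dump (append to /etc/system; needs a reboot).
    cat >> /etc/system <<'EOF'
    * Disable MT dump
    set dump_plat_mincpu=0
    EOF

    # 2. Check the dump zvol and make sure it is big enough.
    zfs get volsize rpool/dump
    zfs set volsize=16G rpool/dump        # placeholder size, adjust to fit

    # 3. Point dumpadm at it: compress, dump kernel pages, save cores on boot.
    dumpadm -y -z on -c kernel -d /dev/zvol/dsk/rpool/dump

    # Later, if the system is hung but still lets you log in, a dump of the
    # live system can be taken with:
    savecore -L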
