I got the problem again today. I had moved the iSCSI volume to the Areca and left
only CIFS on the Adaptec card, and now only CIFS is failing. This is probably a
problem with the Adaptec card itself, but I don't know whether it's the combination
of the illumos kernel and this card.
There is no sign of an error anywhere, but zfs commands do not respond on the
filesystems over the Adaptec. Even df -h does not return.
I tried the "walk spa" command during the failure, but it printed the same output:
sonicle@xstorage:~# echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
spa_name = [ "adaptec" ]
spa_suspended = 0
spa_name = [ "areca" ]
spa_suspended = 0
spa_name = [ "rpool1" ]
spa_suspended = 0
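When it next hangs, a further check that may show where the Adaptec pool's I/O is
stuck (a sketch on my part, not something suggested in this thread; it assumes
mdb -k still responds while the pool is hung) is to dump the kernel stacks of
threads in the zfs module:
sonicle@xstorage:~# echo "::stacks -m zfs" | mdb -k
# prints unique kernel stacks involving zfs; threads parked in zio_wait() or
# txg_wait_synced() would point at I/O that never completes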
From: George Wilson
To: Gabriele Bulfon
Cc: [email protected]
Date: 27 November 2012 15:10:45 CET
Subject: Re: [discuss] Again: illumos based ZFS storage failure
Was the data you provided below from a time when the server was hung? If not, try running this the next time you see the issue.
- George
On 11/23/12 9:44 AM, Gabriele Bulfon wrote:
Hi,
here is the output. Looks sane:
sonicle@xstorage:~# echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
spa_name = [ "adaptec" ]
spa_suspended = 0
spa_name = [ "areca" ]
spa_suspended = 0
spa_name = [ "rpool1" ]
spa_suspended = 0
From: George Wilson
To: [email protected]
Cc: Gabriele Bulfon
Date: 23 November 2012 14:46:55 CET
Subject: Re: [discuss] Again: illumos based ZFS storage failure
It's possible that the adaptec pool has suspended because of some error on the storage. Can you run the following as root and provide the output:
echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
- George
On 11/23/12 5:43 AM, Gabriele Bulfon wrote:
Hi, I got the same problem this morning.
Thanks to Alasdair's suggestion, I commented out the zfs "quota" command from /etc/profile, so I could enter the bash prompt and investigate.
As a summary of the problem:
- 3 zfs pools (rpool, areca, adaptec), each on a different controller: rpool as a mirror of internal disks, areca as raidz on 7 SATA disks of the Areca controller plus half the space of the SD as zlog, adaptec as raidz on 8 disks of the Adaptec controller plus the other half of the SD as zlog.
- areca space is used for NFS sharing to unix servers, and always responds.
- adaptec space is used for CIFS sharing and an iSCSI volume for the Windows PDC.
Then we have a vmware server with an iSCSI resource store, given to the virtualized PDC as a secondary disk for sqlserver data and some more. The PDC boots directly from the vmware server disks.
All at once, both CIFS from the storage and the PDC iSCSI disk stop responding.
CIFS probably fails because the PDC's AD is not responding, likely busy checking the iSCSI disk in a loop.
Going into the storage's bash shell:
- zpool status shows everything fine; every pool is correct.
- /var/adm/messages shows only smbd timeouts with the PDC, no hardware or zfs problem.
- at the time of failure, fmdump -eVvm showed the same previously found errors, from 3 days earlier.
- after rebooting all the infrastructure, fmdump -eVvm showed the same previously found errors around the time of rebooting the storage, not at the time of the experienced failures. We find one such error for each disk of the adaptec controller (cut & paste at the end).
- a zfs list areca/* showed all the areca filesystems.
- a zfs list adaptec blocked, never returning.
- any access to the zfs structure of the adaptec would block.
- as suggested by Alasdair, I ran savecore -L (I checked that I have the dump device and enough space; a sketch of that check follows this list).
- the savecore command ran for some time until reaching 100%, then blocked, never returning.
- I could not "init 5" the storage; it never returned.
- I tried sending the poweroff signal with the power button; the console showed its intention to power off, but never did it.
- I forced power off via the power button.
- once everything was powered on again, everything ran fine.
- I looked for the dump in /var/crash, but I had no /var/crash at all.
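A minimal sketch of the pre-savecore check mentioned above (the exact commands are my reconstruction, not quoted from the original mail):
dumpadm                      # confirms the dump device and the savecore directory
zfs get volsize rpool/dump   # size of the dump zvol, assuming rpool/dump is the device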
How can I investigate this problem further?
Do I have any chance of finding the savecore output in the dump device, even though I have no /var/crash?
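If savecore -L finished writing to the dump device before it hung, the dump may still be recoverable after a reboot. A sketch (the directory name is only an example):
mkdir -p /var/crash/xstorage
savecore -vd /var/crash/xstorage   # -d disregards the dump-valid flag, which a live dump may need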
Here is the fmdump output:
Nov 23 2012 09:25:04.422282821 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.uderr
        ena = 0x2b08d604c300001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,3595@2/pci8086,370@0/pci9005,2bc@e/disk@7,0
                devid = id1,sd@TAdaptec_3805____________8366BAF3
        (end detector)
        devid = id1,sd@TAdaptec_3805____________8366BAF3
        driver-assessment = fail
        op-code = 0x1a
        cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        stat-code = 0x0
        un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
        __ttl = 0x1
----------------------------------------------------------------------------------
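The uderr above is the sd driver failing to read the Mode Sense caching page from the Adaptec's logical drives (op-code 0x1a is SCSI MODE SENSE(6)). One way to inspect the write-cache state of such a disk interactively (a sketch; the cache menu requires format's expert mode):
format -e    # select the Adaptec disk, then: cache -> write_cache -> display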
From: Alasdair Lumsden
To: [email protected]
Date: 10 November 2012 14:37:20 CET
Subject: Re: [discuss] illumos based ZFS storage failure
I haven't read the whole thread, but the next time it happens you'll want to invoke a panic and make the dump file available. You'll want to ensure that:
1. Multithreaded dump is disabled in /etc/system with:
* Disable MT dump
set dump_plat_mincpu=0
Without this there is a risk of your dump not saving correctly.
2. That you have a dump device and that it's big enough to capture your kernel size (zfs set volsize=X rpool/dump)
3. That dumpadm is happy and set to save cores etc:
dumpadm -y -z on -c kernel -d /dev/zvol/dsk/rpool/dump
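A quick sketch for verifying all three points (the volsize value is an example, not a recommendation):
grep dump_plat_mincpu /etc/system   # 1. expect: set dump_plat_mincpu=0
zfs get volsize rpool/dump          # 2. current dump zvol size
zfs set volsize=8g rpool/dump       #    example only: grow it if the kernel won't fit
dumpadm                             # 3. current dump device and savecore settings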
There's lots of good info here:
http://wiki.illumos.org/display/illumos/How+To+Report+Problems
You can also inspect things with mdb while the system is up, but if it's a production system normally you want to get it rebooted and into production again ASAP. So in that situation, you can take a dump of the running system with:
savecore -L
One thing to keep in mind is /etc/profile runs /usr/sbin/quota, which can screw over logins when the zfs subsystem is unhappy. I really think it should be removed by default since on most systems quotas aren't even used. So comment it out - we do so on all our systems. This will give you a better chance of logging in when things go wrong.
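The edit itself is one line in /etc/profile, i.e. prefix the quota call with a '#' (the surrounding context varies by distribution):
# /usr/sbin/quota    <- commented out so logins don't block on a hung pool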
I think there's a way to SSH in bypassing /etc/profile but I can't remember what it is - perhaps someone can chime in.
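One approach that should work (my assumption, not confirmed in the thread): /etc/profile is read only by login shells, so asking sshd to run the shell as a command yields a non-login shell that skips it:
ssh -t sonicle@xstorage /usr/bin/bash   # -t allocates a tty; bash invoked this way is non-login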
Good luck. Centralised storage is difficult to do and when it goes wrong everything that depends on it goes down. It's an "all your eggs in one giant failbasket" situation. Doing it homebrew with ZFS is cost effective and can be fast, but it is also risky. This is why there are companies like Nexenta out there with certified combinations of hardware and software engineered to work together. This extends to validating firmware combinations of disks/HBAs/etc.
Cheers,
Alasdair