I will be replacing that card with another Areca on the running storage soon.
I will then move that card to a test storage machine and put it under heavy load to reproduce
the problem, so I will be able to run your other checks.
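(A minimal sketch of one way to keep the card busy during the reproduction attempt, assuming the test pool is imported as "adaptec" and mounted at /adaptec; the dataset and file names are purely illustrative:)
# zfs create adaptec/stress
# while true; do dd if=/dev/urandom of=/adaptec/stress/burn bs=1M count=4096; sync; rm -f /adaptec/stress/burn; done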
And no, I can't generate a crash dump, because the command to do it never returns after reaching 100%.
Thanks a lot.
Gabriele.
From: George Wilson
To: Gabriele Bulfon
Cc: [email protected]
Date: 11 December 2012 15:13:05 CET
Subject: Re: [discuss] Again: illumos based ZFS storage failure
So this tells us that the pool is not suspending because of any errors. The next thing to look at is whether or not the I/Os are just not returning. Are you able to generate a crash dump?
Here are some other things to try:
# echo "::stacks -m zfs"  | mdb -kzfs_threads.out
# echo "::stacks -c spa_sync" | mdb -k
# echo "::zio_state -r" | mdb -kzios.out
- George
On 12/11/12 6:15 AM, Gabriele Bulfon wrote:
I got the problem again today. I had moved the iSCSI to the Areca and left only CIFS on the Adaptec card, and now only CIFS is not working. This is probably a problem of the Adaptec card itself, but I don't know if it's the combination of the illumos kernel and this card.
I have no sign of errors anywhere, but zfs commands are not responding on the filesystem over the Adaptec. Even df -h does not return.
I tried the "walk spa" command during the failure, but the output was the same:
sonicle@xstorage:~# echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
spa_name = [ "adaptec" ]
spa_suspended = 0
spa_name = [ "areca" ]
spa_suspended = 0
spa_name = [ "rpool1" ]
spa_suspended = 0
From: George Wilson
To: Gabriele Bulfon
Cc: [email protected]
Date: 27 November 2012 15:10:45 CET
Subject: Re: [discuss] Again: illumos based ZFS storage failure
Was this data you provided below from a time when the server was hung? If not, try running this the next time you see the issue.
- George
On 11/23/12 9:44 AM, Gabriele Bulfon wrote:
Hi,
here is the output. Looks sane:
sonicle@xstorage:~# echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
spa_name = [ "adaptec" ]
spa_suspended = 0
spa_name = [ "areca" ]
spa_suspended = 0
spa_name = [ "rpool1" ]
spa_suspended = 0
From: George Wilson
To: [email protected]
Cc: Gabriele Bulfon
Date: 23 November 2012 14:46:55 CET
Subject: Re: [discuss] Again: illumos based ZFS storage failure
It's possible that the adaptec pool has suspended because of some error on the storage. Can you run the following as root and provide the output:
echo "::walk spa | ::print spa_t spa_name spa_suspended" | mdb -k
- George
On 11/23/12 5:43 AM, Gabriele Bulfon wrote:
Hi, I got the same problem this morning.
Thanks to Alasdair's suggestion, I commented out the zfs "quota" command from /etc/profile, so I could enter the bash prompt and investigate.
As a summary of the problem:
- 3 ZFS pools (rpool, areca, adaptec), each on a different controller: rpool is a mirror of internal disks; areca is a raidz on 7 SATA disks of the Areca controller plus half the space of the SD as zlog; adaptec is a raidz on 8 disks of the Adaptec controller plus the other half of the SD as zlog.
- The areca pool is used for NFS sharing to Unix servers, and it always responds.
- The adaptec pool is used for CIFS sharing and an iSCSI volume for the Windows PDC.
Then we have a VMware server with an iSCSI resource store, given to the virtualized PDC as a secondary disk for SQL Server data and more. The PDC boots directly from the VMware server's disks.
All at once, both CIFS from the storage and the PDC's iSCSI disk stop responding.
CIFS probably fails because the PDC's AD is not responding, itself probably busy checking the iSCSI disk in a loop.
Going into the storage's bash prompt:
- zpool status shows everything fine; every pool is correct.
- /var/adm/messages shows only smbd timeouts with the PDC, no hardware or ZFS problems.
- At the time of the failure, fmdump -eVvm showed the same previously found errors, from 3 days earlier.
- After rebooting all the infrastructure, fmdump -eVvm showed the same previously found errors around the time the storage was rebooted, not at the time of the experienced failures. We find one such error for each disk of the Adaptec controller (cut & paste at the end).
- A "zfs list areca/*" showed all the areca filesystems.
- A "zfs list adaptec" blocked, never returning.
- Any access to the ZFS structures of the adaptec pool would block.
- As suggested by Alasdair, I ran savecore -L (I checked that I have a dump device with enough space).
- The savecore command ran for some time until it reached 100%, then blocked, never returning.
- I could not "init 5" the storage; it never returned.
- I tried sending the power-off signal with the power button; the console showed its intention to power off, but it never did.
- I forced a power-off via the power button.
- Once everything was powered on again, everything ran fine.
- I looked for the dump in /var/crash, but I had no /var/crash at all.
How can I investigate this problem further?
Do I have any chance of finding the savecore output in the dump device, even though I have no /var/crash?
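(For what it's worth, a minimal sketch, assuming the dump was actually written to the dump device configured in dumpadm: after a reboot, savecore reads that device and can still extract unix.N / vmcore.N into a directory created by hand. The directory name below is just an example based on the hostname:)
# mkdir -p /var/crash/xstorage
# savecore -v /var/crash/xstorage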
Here is the fmdump output:
Nov 23 2012 09:25:04.422282821 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.uderr
        ena = 0x2b08d604c300001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,3595@2/pci8086,370@0/pci9005,2bc@e/disk@7,0
                devid = id1,sd@TAdaptec_3805____________8366BAF3
        (end detector)
        devid = id1,sd@TAdaptec_3805____________8366BAF3
        driver-assessment = fail
        op-code = 0x1a
        cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        stat-code = 0x0
        un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
        __ttl = 0x1
----------------------------------------------------------------------------------
From: Alasdair Lumsden
To: [email protected]
Date: 10 November 2012 14:37:20 CET
Subject: Re: [discuss] illumos based ZFS storage failure
I haven't read the whole thread, but the next time it happens you'll want to invoke a panic and make the dump file available. You'll want to ensure that:
1. Multithreaded dump is disabled in /etc/system with:
* Disable MT dump
set dump_plat_mincpu=0
Without this there is a risk of your dump not saving correctly.
2. That you have a dump device and that it's big enough to capture your kernel size (zfs set volsize=X rpool/dump)
3. That dumpadm is happy and set to save cores etc (a quick check of all three settings is sketched just below):
dumpadm -y -z on -c kernel -d /dev/zvol/dsk/rpool/dump
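(As a sketch, one quick way to double-check the three settings above on a running system; the mdb write on the second line applies the /etc/system change without a reboot, so use it with care:)
# echo "dump_plat_mincpu/D" | mdb -k
# echo "dump_plat_mincpu/W 0" | mdb -kw
# zfs get volsize rpool/dump
# dumpadm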
There's lots of good info here:
http://wiki.illumos.org/display/illumos/How+To+Report+Problems
You can also inspect things with mdb while the system is up, but if it's a production system normally you want to get it rebooted and into production again ASAP. So in that situation, you can take a dump of the running system with:
savecore -L
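(If savecore -L itself never returns, as in the report above, one hedged alternative is to force a panic so the dump is written to the dump device and collected at the next boot; this of course takes the system down immediately:)
# reboot -d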
One thing to keep in mind is /etc/profile runs /usr/sbin/quota, which can screw over logins when the zfs subsystem is unhappy. I really think it should be removed by default since on most systems quotas aren't even used. So comment it out - we do so on all our systems. This will give you a better chance of logging in when things go wrong.
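(For reference, a sketch of the edit, assuming the call appears as a bare line; the exact wording in /etc/profile may differ on your distribution. Change:
    /usr/sbin/quota
to:
    # /usr/sbin/quota
and new logins will no longer block on the quota lookup.)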
I think there's a way to SSH in bypassing /etc/profile but I can't remember what it is - perhaps someone can chime in.
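(One approach that may be what's meant here, though I haven't verified it on this setup: ask sshd to run a shell as a remote command instead of a login shell, since /etc/profile is only read by login shells. For example:)
$ ssh -t sonicle@xstorage /usr/bin/bash --noprofile --norc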
Good luck. Centralised storage is difficult to do and when it goes wrong everything that depends on it goes down. It's "all your eggs in one giant failbasket". Doing it homebrew with ZFS is cost effective and can be fast, but it is also risky. This is why there are companies like Nexenta out there with certified combinations of hardware and software engineered to work together. This extends to validating firmware combinations of disks/HBAs/etc.
Cheers,
Alasdair