Going deeper into the problem with the Adaptec interface, I talked with the hardware guys, and someone told me that this hardware / raidz combination may cause ZFS to hang:
- SATA disks: Western Digital RAID Edition
- Adaptec 3805 (cache disabled).
I know that SAS disks are always a better solution than SATA, but SATA is cheaper, and we opted for it.
The hardware guy told me there is a possibility that, because of the nature of SATA, ZFS may get mad in case of a disk failure when using raidz: not receiving a correct response from the controller, zfs commands may hang.
Is this correct?
If this is correct, it means I had a disk failure on the Adaptec controller, but having no way to log in (because of what Alasdair kindly noted) I had no way to see it.
Anyway, once the machine was reset, I got everything up and running, and the zpool looks fine.
Should I run a scrub, or something else, to check for disk problems on that controller?
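For reference, a minimal check sequence might look like the following ("datapool" is just a placeholder for the Adaptec pool's real name):

    # start a scrub: ZFS re-reads and verifies every block's checksum
    zpool scrub datapool
    # watch progress, repaired data and per-device error counters
    zpool status -v datapool
    # check whether FMA has recorded any disk or controller faults
    fmadm faulty
    # per-device soft/hard/transport error counters from the sd driver
    iostat -En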
Thanx again,
Gabriele.
From: Gabriele Bulfon
To: [email protected]
Date: 12 November 2012 14:43:45 CET
Subject: Re: [discuss] illumos based ZFS storage failure
Thanks for all the clues.
The Supermicro system is configured with 3 pools:
- 1 boot pool on a ZFS mirror of 2 disks connected to the motherboard SATA ports
- 1 data pool on a raidz of 8 disks connected to an Adaptec interface
- 1 data pool on a raidz of 7 disks connected to an Areca interface
Each data pool also has a log device: an SSD disk split into two Solaris partitions, each functioning as the log device for one data pool.
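Just to illustrate that layout, the two data pools could have been built with commands along these lines (pool and device names here are invented, not the real ones):

    # 8-disk raidz on the Adaptec, first SSD slice as a separate log device
    zpool create adaptec_pool raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0
    zpool add adaptec_pool log c4t0d0s0
    # 7-disk raidz on the Areca, second slice of the same SSD as its log device
    zpool create areca_pool raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0
    zpool add areca_pool log c4t0d0s1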
Your question made me think about a possibility:
- The only portions of the storage still responding were the NFS shares
- The NFS shares are all on the Areca pool
- The CIFS and iSCSI volumes are all on the Adaptec interface
Maybe the Adaptec pool had problems? Or the Adaptec interface itself?
At the moment I see nothing bad on it, but this may be a possibility.
BTW, I understand that HA clustering and the two-head Supermicro can help, but if the problem was just the ZFS iSCSI software not responding, I don't think hardware HA would have solved it.
Don't you think so?
Gabriele.
----------------------------------------------------------------------------------
From: Jim Klimov
To: [email protected]
Date: 10 November 2012 14:08:14 CET
Subject: Re: [discuss] illumos based ZFS storage failure
On 2012-11-10 10:09, Gabriele Bulfon wrote:
Hi, the PDC system disk is not on the storage, just a 150GB partition for databases.
That's why I can't see why Windows did not let me in, even on the VMware console.
The requirement to have several DCs is a very nice trick from Microsoft to get more licenses...
This is quite a normal requirement for highly available infrastructure services. You likely do have DNS replicas, or several equivalent SMTP relays, perhaps multi-mastered LDAP and so on? Do you use clustered databases like PgSQL or MySQL?
That it costs extra money for some solutions is another matter.
I heard, but never got to check, that one of SAMBA4's goals was to
replace the MS Domain Controllers in a manner compatible with MS AD
(and backed by an LDAP service for HA storage of domain data).
You might want to take a look at that and get a free solution, if
it already works.
There is no zfs command in my .bashrc, but now you've opened my eyes:
just before entering the system via ssh, I tried to check the storage via our web interface, and it was responding correctly until I went to the Pool management page, where the web interface issued a "zpool list", and it showed me the available pools.
Then I opened the tree to see the filesystems... and there it stopped responding...
At least I understand why I could not enter the system anymore (not even
on console...).
Last questions:
- shouldn't I find some logs in the svc logs of the iSCSI services? (I don't...)
Maybe... unless the disk I/Os froze and the logs couldn't be saved.
Are the rpool and data pool drives (and pools) separate, or all in one?
I wonder now if SMF can write logs to remote systems, like syslog...
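(As an aside, each SMF service keeps its own log, and its path can be queried directly; the FMRI below assumes the COMSTAR iSCSI target, adjust it to the services actually in use:)

    # print the path of the service's SMF log file
    svcs -L svc:/network/iscsi/target:default
    # then inspect its tail for errors around the time of the hang
    tail -50 /var/svc/log/network-iscsi-target:default.log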
- should I raise the swap space? (it's now 4.5GB, phys memory is 8GB).
Depends on what your box is doing. If it is mostly ZFS storage with
RAM going to ARC cache, likely swap won't help. If it has userspace
tasks that may need or require disk-based swap guarantees (notably
VirtualBox VMs) - you may need more swap, at least 1:1 with RAM.
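(For the mechanics, a rough sketch on illumos, assuming the common default of zvol-backed swap on rpool/swap:)

    # list current swap devices and overall usage
    swap -l
    swap -s
    # grow zvol-backed swap: remove it, resize the zvol, add it back
    swap -d /dev/zvol/dsk/rpool/swap
    zfs set volsize=8G rpool/swap
    swap -a /dev/zvol/dsk/rpool/swap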
- what may be the reasons for the pool failing? A zpool status shows it's all fine.
I'd bet on software problems - like running out of memory, or bugs
in code - but have little proof or testing techniques except trying
to recreate the problem while monitoring the various stats closely.
Also it may be that some disk in the pool timed out on responses
and was not kicked by the SD driver and/or ZFS timeouts...
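(If you do try to recreate it, these are the kinds of standard stats worth watching while the load runs; nothing here is specific to any one box:)

    # per-device latencies and queue depths, every 5 seconds
    iostat -xn 5
    # pool-level I/O broken down per vdev
    zpool iostat -v 5
    # memory pressure: watch free memory and the scan rate
    vmstat 5
    # ARC size and hit/miss counters
    kstat -m zfs -n arcstats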
- any other way I can prevent this from happening?
HA clustering, shared storage, detect a dead node and STONITH? ;)
Perhaps one of those two-motherboards-in-one-rackcase servers
from Supermicro (with shared SAS buckets of drives) that Nexenta
announced partnering with and recommending a while ago...
HTH,
//Jim Klimov