[zfs-discuss] Understanding when (and how) ZFS will use spare disks

2009-09-04 Thread Chris Siebenmann
 We have a number of shared spares configured in our ZFS pools, and
we're seeing weird issues where spares don't get used under some
circumstances.  We're running Solaris 10 U6 using pools made up of
mirrored vdevs, and what I've seen is:

* if ZFS detects enough checksum errors on an active disk, it will
  automatically pull in a spare.
* if the system reboots without some of the disks available (so that
  half of the mirrored pairs drop out, for example), spares will *not*
  get used. ZFS recognizes that the disks are not there; they are marked
  as UNAVAIL and the vdevs (and pools) as DEGRADED, but it doesn't try to
  use spares.

(This is in a SAN environment where one side of each mirror comes from
one controller and the other side comes from a second controller.)

 All of this makes me think that I don't understand how ZFS spares
really work, and under what circumstances they'll get used. Does
anyone know if there's a writeup of this somewhere?

(What I've gathered so far from reading zfs-discuss archives is that
ZFS spares are not handled automatically in the kernel code but are
instead deployed to pools by a fmd ZFS management module[*], doing more
or less 'zpool replace pool failing-dev spare' (presumably through
an internal code path, since 'zpool history' doesn't seem to show spare
deployment). Is this correct?)
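
(For reference, the by-hand equivalent appears to be roughly the
following; the pool and device names here are made up:
    zpool status -x tank                 # find the FAULTED/UNAVAIL disk
    zpool replace tank c4t12d0 c4t20d0   # c4t20d0 is one of the configured spares
    zpool status tank                    # the spare should now show as INUSE
)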

 Also, searching turns up some old zfs-discuss messages suggesting that
not bringing in spares in response to UNAVAIL disks was a bug that's now
fixed in at least OpenSolaris. If so, does anyone know if the fix has
made it into S10 U7 (or is planned or available as a patch)?

 Thanks in advance.

- cks
[*: http://blogs.sun.com/eschrock/entry/zfs_hot_spares suggests that
it is 'zfs-retire', which is separate from 'zfs-diagnosis'.]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How are you supposed to remove faulted spares from pools?

2009-08-26 Thread Chris Siebenmann
 We have a situation where all of the spares in a set of pools have
gone into a faulted state and now, apparently, we can't remove them
or otherwise de-fault them. I'm confident that the underlying disks
are fine, but ZFS seems quite unwilling to do anything with the spares
situation.

(The specific faulted state is 'FAULTED   corrupted data' in
'zpool status' output.)

 Environment: Solaris 10 U6 on x86 hardware. The disks are iSCSI LUNs
from backend storage devices.

 I have tried:
- 'zpool remove': it produces no errors, but doesn't remove anything.
- 'zpool replace pool drive': it reports that the device is reserved
  as a hot spare.
- 'zpool replace pool drive unused-drive': also reports 'device
  is reserved as a hot spare'.
- 'zpool clear': reports that it can't clear errors, the device is
  reserved as a hot spare.

 Because these are iSCSI LUNs, I can actually de-configure them (on the
Solaris side); would that make ZFS change its mind about the situation
and move to a state where I could remove them from the pools?

(Would exporting and then importing the pools make any difference,
especially if the iSCSI LUNs of the spares were removed? These are
production pools, so I can't just try it to see; it would create a
user-visible downtime.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Can a zpool cachefile be copied between systems?

2008-11-25 Thread Chris Siebenmann
 Suppose that you have a SAN environment with a lot of LUNs. In the
normal course of events this means that 'zpool import' is very slow,
because it has to probe all of the LUNs all of the time.

 In S10U6, the theoretical 'obvious' way to get around this for your
SAN filesystems seems to be to use a non-default cachefile (likely one
cachefile per virtual fileserver, although you could go all the way to
one cachefile per pool) and then copy this cachefile from the master
host to all of your other hosts. When you need to rapidly bring up a
virtual fileserver on a non-default host, you can just run
zpool import -c /where/ever/host.cache -a
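
(To make that concrete, a sketch of the whole workflow with invented
paths and pool names:
    # on the master host, give the pool a non-default cachefile
    zpool create -o cachefile=/var/zfs/fs3.cache fs3pool mirror c4t1d0 c5t1d0
    # copy the cachefile to every host that might need to take over
    scp /var/zfs/fs3.cache otherhost:/var/zfs/fs3.cache
    # later, on the failover host
    zpool import -c /var/zfs/fs3.cache -a
)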

 However, the S10U6 zpool documentation doesn't say if zpool cachefiles
can be copied between systems and used like this. Does anyone know if
this is a guaranteed property that is sure to keep working, something
that works right now but there is no guarantee that it will keep working
in future versions of Solaris and patches, or something that doesn't
work reliably in general?

(I have done basic tests with my S10U6 test machine, and it seems to
work ... but I might easily be missing something that makes it not
reliable.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] shrinking a zpool - roadmap

2008-08-21 Thread Chris Siebenmann
| The errant command which accidentally adds a vdev could just as easily
| be a command which scrambles up or erases all of the data.

 The difference between a mistaken command that accidentally adds a vdev
and the other ways to lose your data with ZFS is that the 'add a vdev'
accident is only one omitted word different from a command that you use
routinely. This is a very close distance, especially for fallible humans.

('zpool add ... mirror A B' and 'zpool add ... spare A'; omit either
'mirror' or 'spare' by accident and boom.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Forensic analysis [was: more ZFS recovery]

2008-08-12 Thread Chris Siebenmann
| As others have noted, the COW nature of ZFS means that there is a good
| chance that on a mostly-empty pool, previous data is still intact long
| after you might think it is gone.

 In the cases I am thinking of I am sure that the data was there.
Kernel panics just didn't let me get at it. Fortunately it was only
testing data, but I am now concerned about it happening in production.

| A utility to recover such data is (IMHO) more likely to be in the
| category of forensic analysis than a mount (import) process. There is
| more than enough information publically available for someone to build
| such a tool (hint, hint :-)

 To put it crudely, if I wanted to write my own software for this sort
of thing I would run Linux.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] more ZFS recovery

2008-08-11 Thread Chris Siebenmann
 I'm not Anton Rang, but:
| How would you describe the difference between the data recovery
| utility and ZFS's normal data recovery process?

 The data recovery utility should not panic my entire system if it runs
into some situation that it utterly cannot handle. Solaris 10 U5 kernel
ZFS code does not have this property; it is possible to wind up with ZFS
pools that will panic your system when you try to touch them.

(The same thing is true of a theoretical file system checking utility.)

 The data recovery utility can ask me questions about what I want it
to do in an ambiguous situation, or give me only partial results.

 The data recovery utility can be run read-only, so that I am sure that any
problems in it are not making my situation worse.

| Nobody thinks that an answer of "sorry, we lost all of your data" is
| acceptable.  However, there are failures which will result in loss of
| data no matter how clever the file system is.

 The problem is that there are currently ways to make ZFS lose all your
data when there are no hardware faults or failures, merely people or
software mis-handling pools. This is especially frustrating when the
only thing that is likely to be corrupted is ZFS metadata and the vast
majority (or all) of the data in the pool is intact, readable, and so
on.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Terrible zfs performance under NFS load

2008-08-01 Thread Chris Siebenmann
| Syslog is funny in that it does a lot of open/write/close cycles so
| that rotate can work trivially.

 I don't know of any version of syslog that does this (certainly Solaris
10 U5 syslog does not). The traditional syslog(d) performance issue
is that it fsync()'s after writing each log message, in an attempt to
maximize the chances that the log message will make it to disk and
survive a system crash, power outage, etc.

(Some versions of syslog let you turn this off for specific log files,
which is very useful for high volume, low importance ones.)

 I've heard that at one point, NFS + ZFS was known to have performance
issues with fsync()-heavy workloads. I don't know if that's still true
today (in either Solaris 10U5 or current OpenSolaris builds), or if all
of the issues have been fixed.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] What's the best way to get pool vdev structure information?

2008-08-01 Thread Chris Siebenmann
 For various sorts of manageability reasons[*], I need to be able to
extract information about the vdev and device structure of our ZFS pools
(partly because we're using iSCSI and MPXIO, which create basically
opaque device names). Unfortunately, Solaris 10 U5 doesn't currently seem
to provide any script- or machine-readable output format for this
information, so I need to build something to do it myself.

 I can think of three different ways to do this:
* parse the output of 'zpool status'
* write a C program that directly uses libzfs to dump the information
  in a more script-readable format
* use Will Murnane's recently announced 'pyzfs' module to dump the
  information (conveniently I am already writing some of the management
  programs in Python)

 Each approach has its own set of drawbacks, so I'm curious if people
have opinions on which one will probably be the best/most stable over
time/etc. And if anyone already has code (or experience of subtle things
to watch out for), I would love to hear from you.
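
(To illustrate the first option, the sort of parsing I have in mind is
roughly the following, assuming the usual 'zpool status' layout and
cNtNdN-style device names; treat it as a sketch, not tested code:
    zpool status | awk '
        $1 == "pool:"             { pool = $2 }        # remember the current pool
        $1 ~ /^c[0-9]+t.+d[0-9]+/ { print pool, $1 }   # leaf devices in its config
    '
)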

 Thanks in advance.

- cks
[*: for example, we need to be able to generate a single list of all of
the iSCSI target+LUNs that are in use on all of the fileservers, and
how the usage is distributed among fileservers and ZFS pools.
]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] memory hog

2008-06-16 Thread Chris Siebenmann
| I guess I find it ridiculous you're complaining about ram when I can
| purchase 4gb for under 50 dollars on a desktop.
|
| Its not like were talking about a 500 dollar purchase.

 'On a desktop' is an important qualification here. Server RAM is
more expensive, and then you get to multiply it by the number of
servers you are buying. It does add up.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-12 Thread Chris Siebenmann
| Every time I've come across a usage scenario where the submitter asks
| for per user quotas, its usually a university type scenario where
| univeristies are notorious for providing lots of CPU horsepower (many,
| many servers) attached to a simply dismal amount of back-end storage.

 Speaking as one of those pesky university people (although we don't use
quotas): one of the reasons this happens is that servers are a lot less
expensive than disk space. With disk space you have to factor in the
cost of backups and ongoing maintenance, whereas another server is just N
thousand dollars in one-time costs and some rack space.

(This assumes that you are not rack space, heat, or power constrained,
which I think most university environments generally are not.)

 Or to put it another way: disk space is a permanent commitment,
servers are not.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-05 Thread Chris Siebenmann
| The ZFS filesystem approach is actually better than quotas for User
| and Shared directories, since the purpose is to limit the amount of
| space taken up *under that directory tree*.

 Speaking only for myself, I would find ZFS filesystems somewhat more
useful if they were more like directory trees and less like actual
filesystems. Right now, their filesystem nature creates several
limitations:

- you cannot semi-transparently convert an existing directory tree into
  a new ZFS filesystem; you need to move the directory tree aside, make
  a new filesystem, and copy all the data over.
- as separate filesystems, they have to be separately NFS mounted.
- you cannot hardlink between separate filesystems, which is a problem
  if you want to use a lot of ZFS filesystems for fine-grained management
  of things like NFS permissions.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Tracking down the causes of a mysteriously shrinking ARC cache?

2008-06-05 Thread Chris Siebenmann
 I have a test Solaris machine with 8 GB of memory. When freshly booted,
the ARC consumes 5 GB (and I would be happy to make it consume more)
and file-level prefetching works great even when I hit the machine with
a lot of simultaneous sequential reads.  But overnight, the ARC has
shrunk to 2 GB (as reported by arcstat.pl) and file-level prefetching
is (as expected at that level) absolutely murdering the performance of
simultaneous sequential reads.

 So: is there any way to find out what is consuming the memory and
causing the ARC to shrink (and/or to reset its target size, since
arcstat.pl reports that 'c', the ARC target size, is also 2 GB)?
I've looked at ps, which shows no large processes; I have a ::kmastat
dump from mdb (but I don't know much about how to read it); top still
reports that the system has 8 GB; and vmstat says that there is 1 GB of
free memory and no paging activity.
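
(For anyone following along, two things that seem worth looking at are
'::memstat' in mdb, which gives a coarse breakdown of where physical
memory is going, and the 'arcstats' kstat, which is the ARC's own view
of its size and targets:
    echo ::memstat | mdb -k
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
Neither says *why* 'c' was driven down, of course.)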

 Also, is there any way to force-reset the size of the ARC on a live
system, so I could at least periodically kick its maximum size up to
5 GB or 6 GB or so and hope that it sticks?

 Thanks in advance.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Bad results from importing a pool on two machines at once

2008-06-03 Thread Chris Siebenmann
 As part of testing for our planned iSCSI + ZFS NFS server environment,
I wanted to see what would happen if I imported a ZFS pool on two
machines at once (as might happen someday in, for example, a failover
scenario gone horribly wrong).

 What I expected was something between a pool with damage and a pool
that was unrecoverable. What I appear to have got is a ZFS pool
that panics the system whenever you try to import it. The panic is a
'bad checksum (read on unknown off 0: ... [L0 packed nvlist]' error
from zfs:zfsctl_ops_root (I've put the whole thing at the end of this
message).

 I got this without doing very much to the dual-imported pool:
- import on both systems (-f'ing on one)
- read a large file a few times on both systems
- zpool export on one system
- zpool scrub on the other; system panics
- zpool import now panics either system

 One system was Solaris 10 U4 server with relatively current patches;
the other was Solaris 10 U5 with current patches.  (Both 64-bit x86.)

 What appears to be the same issue was reported back in April 2007 on
the mailing list, in the message
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-April/039238.html,
but I don't see any followups.

 Is this a known and filed bug? Is there any idea when it might be fixed
(or the fix appear in Solaris 10)?

 I have to say that I'm disappointed with ZFS's behavior here; I don't
expect a filesystem that claims to have all sorts of checksums and
survive all sorts of disk corruptions to *ever* panic because it doesn't
like the data on the disk. That is very definitely not 'surviving disk
corruption', especially since it seems to have happened to someone who
was not doing violence to their ZFS pools the way I was.

- cks
[The full panic:
Jun  3 11:05:14 sansol2 genunix: [ID 809409 kern.notice] ZFS: bad checksum 
(read on unknown off 0: zio 8e508340 [L0 packed nvlist] 4000L/600P 
DVA[0]=0:a8000c000:600 DVA[1]=0:1040003000:600 fletcher4 lzjb LE contiguous 
birth=119286 fill=1 
cksum=6e160f6970:632da4719324:3057ff16f69527:10e6e1af42eb9b10): error 50
Jun  3 11:05:14 sansol2 unix: [ID 10 kern.notice] 
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9dac0 
zfs:zfsctl_ops_root+3003724c ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9dad0 
zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9db00 
zfs:zio_wait_for_children+49 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9db10 
zfs:zio_wait_children_done+15 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9db20 
zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9db60 
zfs:zio_vdev_io_assess+84 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9db70 
zfs:zio_next_stage+65 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9dbd0 
zfs:vdev_mirror_io_done+c1 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9dbe0 
zfs:zio_vdev_io_done+14 ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9dc60 
genunix:taskq_thread+bc ()
Jun  3 11:05:14 sansol2 genunix: [ID 655072 kern.notice] fe8000f9dc70 
unix:thread_start+8 ()
]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Not automatically importing ZFS pools at boot

2008-06-03 Thread Chris Siebenmann
 Is there any way to set ZFS on a system so that it will not
automatically import all of the ZFS pools it had active when it was last
running?

 The problem with automatic importation is preventing disasters in a
failover situation. Assume that you have a SAN environment with the same
disks visible to system A and system B. If system A loses power (or
otherwise goes down) with ZFS pools live, you 'zpool import -f' them on
system B to get them available again, and when system A comes back up,
it will happily import the pools too, despite them being in use on
system B.

(And then there are explosions. Bad explosions. You will probably lose
pools hard, per my previous email.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Not automatically importing ZFS pools at boot

2008-06-03 Thread Chris Siebenmann
| On Nevada, use the 'cachefile' property.  On S10 releases, use '-R /'
| when creating/importing the pool.

 The drawback of '-R /' appears to be that it requires forcing the
import after a system reboot *all* the time (unless you explicitly
export the pool during reboot).
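
(For the archives, my understanding of the two variants, with a made-up
pool name:
    # Nevada: keep the pool out of /etc/zfs/zpool.cache entirely
    zpool set cachefile=none tank
    # S10: import under an alternate root so it is not remembered at boot
    zpool import -R / tank
I have only lightly tested the '-R /' form, per the above.)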

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-06-02 Thread Chris Siebenmann
| My impression is that the only real problem with incrementals from
| ufsdump or star is that you would like to have a database that tells
| you in which incremental a specific file with a specific time stamp
| may be found.

 In our situation here, this is done by the overall backup system
driving ufsdump et al (Amanda in our case). I think this is the best
way, because you don't necessarily want to keep the index on the machine
that you are backing up.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-05-29 Thread Chris Siebenmann
| I very strongly disagree.  The closest ZFS equivalent to ufsdump is
| 'zfs send'. 'zfs send' like ufsdump has initmiate awareness of the
| the actual on disk layout and is an integrated part of the filesystem
| implementation.

 I must strongly disagree in turn, at least for Solaris 10. 'zfs send'
suffers from three significant defects:

- you cannot selectively restore files from a 'zfs send' archive;
  restoring is an all or nothing affair.

- incrementals can only be generated relative to a snapshot, which
  means that doing incrementals may require you to use up significant
  amounts of disk space (a sketch of this follows the list).

- it is currently explicitly documented as not being forward or backwards
  compatible. (I understand that this is not really the case and that this
  change of heart will be officially documented at some point; I hope that
  people will forgive me for not basing a backup strategy on word of future
  changes.)
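
(A sketch of the snapshot dance behind the second point, with made-up
names; the base snapshot has to stay around until you no longer need to
generate incrementals against it, which is where the disk space goes:
    zfs snapshot tank/home@monday
    zfs send tank/home@monday > /backups/home-full
    # later, an incremental relative to the earlier snapshot
    zfs snapshot tank/home@tuesday
    zfs send -i tank/home@monday tank/home@tuesday > /backups/home-incr
)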

 The first issue alone makes 'zfs send' completely unsuitable for the
purposes that we currently use ufsdump. I don't believe that we've lost
a complete filesystem in years, but we restore accidentally deleted
files all the time. (And snapshots are not the answer, as it is common
that a user doesn't notice the problem until well after the fact.)

('zfs send' to live disks is not the answer, because we cannot afford
the space, heat, power, disks, enclosures, and servers to spin as many
disks as we have tape space, especially if we want the fault isolation
that separate tapes give us, most especially if we have to build a
second, physically separate machine room in another building to put the
backups in.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Project Hardware

2008-05-25 Thread Chris Siebenmann
|  Primarily cost, reliability (less complex hw = less hw that can
|  fail), and serviceability (no need to rebuy the exact same raid card
|  model when it fails, any SATA controller will do).
|
| As long as the RAID is self-contained on the card, and the disks are
| exported as JBOD, then you should be able to replace the card with any
| adaptor supporting at least as many ports.

 I believe it's common for PC-level hardware RAID cards to save the RAID
configuration on the disks themselves, which takes a bit of space and
(if it's done at the start of the disk) may make the disk unrecognizable
by standard tools, even with a JBOD setting.

 The vendors presumably do this, among other reasons, so that replacing
a dead controller doesn't require your operating system and so on to be
running in order to upload a saved configuration or the like.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ways to speed up 'zpool import'?

2008-05-21 Thread Chris Siebenmann
[Eric Schrock:]
| Look at alternate cachefiles ('zpool set cachefile', 'zpool import -c
| cachefile', etc).  This avoids scanning all devices in the system
| and instead takes the config from the cachefile.

 This sounds great.
 Is there any information on when this change will make it to Solaris?
(In particular, is it going to be in S10 update 6, or only in a later
version?)

[Rich Teer:]
| How many pools is a bunch?  The ideal number of pools per server
| tends to one, so reducing the number of pools might be your best
| option.

 At a guess, we will probably have on the order of fifty to seventy
pools per fileserver (each of them using at least two LUNs, and we
will have multiple fileservers, all of which can see all of the LUNs).
We want to use many pools because we feel that using many pools is a
simpler way to manage selling fixed-size chunks of storage to users than
putting them all in a few pools and using quotas.

(There are other situations that call for many pools; the example I
remember from an earlier discussion on zfs-discuss is someone who
expected to have a bunch of zones and wanted to be able to move each of
them between machines.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08

2008-05-20 Thread Chris Siebenmann
| So, from a feature perspective it looks like S10U6 is going to be in
| pretty good shape ZFS-wise. If only someone could speak to (perhaps
| under the cloak of anonymity ;) ) the timing side :).

 For what it's worth, back in January or so we were told that S10U6 was
scheduled for August. Given that we were told more or less the same
thing about S10U4 last year and it slipped somewhat, I'm not expecting
S10U6 before about October or so.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Ways to speed up 'zpool import'?

2008-05-20 Thread Chris Siebenmann
 We're planning to build a ZFS-based Solaris NFS fileserver environment
with the backend storage being iSCSI-based, in part because of the
possibilities for failover. In exploring things in our test environment,
I have noticed that 'zpool import' takes a fairly long time; about
35 to 45 seconds per pool. A pool import time this slow obviously
has implications for how fast we can import a bunch of pools during
a failover situation, so I'd like to speed it up somehow (ideally in
non-hacky ways).

(Trying to do all of the 'zpool import's in parallel doesn't seem
to speed the collective set of them up relative to doing them
sequentially.)

 My test environment currently has 132 iSCSI LUNs (and 132 pools, one
per LUN, because I wanted to test with extremes) on an up to date S10U4
machine.  A truss of a 'zpool import' suggests that it spends most of
its time opening various disk devices and reading things from them and
most of the rest of the time doing modctl() calls and ZFS ioctls().

(Also, using 'zpool import -d' with a prepared directory that has only
symlinks to the particular /dev/dsk device entries for a pool's LUN
speeds things up dramatically.)
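
(Concretely, the trick looks something like this; the directory and
device names are invented:
    mkdir /var/run/pool-tank
    ln -s /dev/dsk/c4t13d0s0 /var/run/pool-tank/
    zpool import -d /var/run/pool-tank tank
)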

 So, are there any tricks to speeding up ZFS pool import here (short of
the 'zpool import -d' stuff)? Would Sun Cluster manage this faster, or
does its ZFS pool failover stuff basically reduce to 'zpool import' too?

 Thanks in advance.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Weird performance issue with ZFS with lots of simultaneous reads

2008-05-16 Thread Chris Siebenmann
| Have you tried to disable vdev caching and leave file level
| prefetching?

 If you mean setting zfs_vdev_cache_bshift to 13 (per the ZFS Evil
Tuning Guide) to turn off device-level prefetching then yes, I have
tried turning off just that; it made no difference.

 If there's another tunable then I don't know about it and haven't
tried it (and would be pleased to).

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Weird performance issue with ZFS with lots of simultaneous reads

2008-05-14 Thread Chris Siebenmann
I wrote:
|  I have a ZFS-based NFS server (Solaris 10 U4 on x86) where I am
| seeing a weird performance degradation as the number of simultaneous
| sequential reads increases.

 To update zfs-discuss on this: after more investigation, this seems
to be due to file-level prefetching. Turning file-level prefetching
off (following the directions of the ZFS Evil Tuning Guide) returns
NFS server performance to full network bandwidth when there are lots
of simultaneous sequential reads. Unfortunately it significantly
reduces the performance of a single sequential read (when the server is
otherwise idle).
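
(For anyone who wants to reproduce this: the knob involved is the
zfs_prefetch_disable tunable from the Evil Tuning Guide, which can be
set for future boots in /etc/system:
    set zfs:zfs_prefetch_disable = 1
The usual caveats about undocumented tunables apply.)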

 The problem is definitely not an issue of having too many pools or too
many LUNS; I saw the same issue with a single striped pool made from 12
whole-disk LUNs. (And the issue happens locally as well as remotely, so
it's not NFS; it's just easier to measure with an NFS client, because
you can clearly see the (maximum) aggregate data rate to all of the
sequential reads.)

| CS (It is limited testing because it is harder to accurately measure
| CS what aggregate data rate I'm getting and harder to run that many
| CS simultaneous reads, as if I run too many of them the Solaris
| CS machine locks up due to overload.)
|
| that's strange - what exactly happens when it locks up? Does it
| panic?

 I have to apologize; this happened during an earlier round of
tests, when the Solaris machine had too little memory for the number
of pools I had on it. According to my notes, the behavior in the
with-prefetch state is that the machine can survive but is extremely
unresponsive until the test programs finish. (I haven't retested with
file prefetching turned off.)

(Here 'locks up' means it becomes basically totally unresponsive,
although it seems to still be doing IO.)

 I am using a test program that is basically dd with some reporting; it
reads a 1 MB buffer from standard in and writes it to standard out. In
these tests, each reader's stdin is a (different) 10 GB file and their
stdout is /dev/null.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Weird performance issue with ZFS with lots of simultaneous reads

2008-05-09 Thread Chris Siebenmann
 I have a ZFS-based NFS server (Solaris 10 U4 on x86) where I am seeing
a weird performance degradation as the number of simultaneous sequential
reads increases.

 Setup:
NFS client - Solaris NFS server - iSCSI target machine

 There are 12 physical disks on the iSCSI target machine. Each of them
is sliced up into 11 parts and the parts exported as individual LUNs to
the Solaris server. The Solaris server uses each LUN as a separate ZFS
pool (giving 132 pools in total) and exports them all to the NFS client.

(The NFS client and the iSCSI target machine are both running Linux.
The Solaris NFS server has 4 GB of RAM.)

 When the NFS client starts a sequential read against one filesystem
from each physical disk, the iSCSI target machine and the NFS client
both use the full network bandwidth and each individual read gets
1/12th of it (about 9.something MBytes/sec). Starting a second set of
sequential reads against each disk (to a different pool) behaves the
same, as does starting a third set.

 However, when I add a fourth set of reads, things change; while the
NFS server continues to read from the iSCSI target at full speed, the
data rate to the NFS client drops significantly. By the time I hit
9 reads per physical disk, the NFS client is getting a *total* of 8
MBytes/sec.  In other words, it seems that ZFS on the NFS server is
somehow discarding most of what it reads from the iSCSI disks, although
I can't see any sign of this in 'vmstat' output on Solaris.

 Also, this may not be just an NFS issue; in limited testing with local
IO on the Solaris machine it seems that I may be seeing the same effect
with the same rough magnitude.

(It is limited testing because it is harder to accurately measure what
aggregate data rate I'm getting and harder to run that many simultaneous
reads, as if I run too many of them the Solaris machine locks up due to
overload.)

 Does anyone have any ideas of what might be going on here, and how I
might be able to tune things on the Solaris machine so that it performs
better in this situation (ideally without harming performance under
smaller loads)? Would partitioning the physical disks on Solaris instead
of splitting them up on the iSCSI target make a significant difference?

 Thanks in advance.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-05-05 Thread Chris Siebenmann
[Jeff Bonwick:]
| That said, I suspect I know the reason for the particular problem
| you're seeing: we currently do a bit too much vdev-level caching.
| Each vdev can have up to 10MB of cache.  With 132 pools, even if
| each pool is just a single iSCSI device, that's 1.32GB of cache.
| 
| We need to fix this, obviously.  In the interim, you might try
| setting zfs_vdev_cache_size to some smaller value, like 1MB.

 I wanted to update the mailing list with a success story: I added
another 2GB of memory to the server (bringing it to 4GB total),
tried my 132-pool tests again, and things worked fine. So this seems
to have been the issue and I'm calling it fixed now.

(I decided that adding some more memory to the server was simpler
in the long run than setting system parameters.)
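
(For anyone who goes the tunable route instead, my understanding is
that Jeff's suggested 1MB value would look like this in /etc/system:
    set zfs:zfs_vdev_cache_size = 0x100000
I haven't tested it, since I went with more memory.)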

 I can still make the Solaris system lock up solidly if I do extreme
things, like doing 'zpool scrub' for all 132 pools, but I'm not
too surprised by that; you can always kill a system if you try hard
enough. The important thing for me is that routine things don't kill the
system any more just because it has so many pools.

 So: thank you, everyone.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-05-01 Thread Chris Siebenmann
| I think the root cause of the issue is that multiple groups are buying
| physical rather than virtual storage yet it is all being attached to a
| single system.

 They're actually buying constant-sized chunks of virtual storage, which
is provided through a pool of SAN-based disk space. This means that
we're always going to have a certain number of logical pools of storage
space to manage that are expanded in fixed-size chunks; the question is
whether to manage them as separate ZFS pools or to aggregate them into
fewer ZFS pools and then use quotas on sub-hierarchies.

(With local storage you wouldn't have much choice; the physical disk
size is not likely to map nicely into the constant-sized chunks you sell
to people. With SAN storage you can pretty much make the 'disks' that
Solaris sees map straight to the chunk size.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-05-01 Thread Chris Siebenmann
| There are two issues here.  One is the number of pools, but the other
| is the small amount of RAM in the server.  To be honest, most laptops
| today come with 2 GBytes, and most servers are in the 8-16 GByte range
| (hmmm... I suppose I could look up the average size we sell...)

 Speaking as a sysadmin (and a Sun customer), why on earth would I have
to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
have that memory in the NFS client machines, where it can actually be
put to work by user programs.

(If I have decently provisioned NFS client machines, I don't expect much
from the NFS fileserver's cache. Given that the clients have caches too,
I believe that the server's cache will mostly be hit for things that the
clients cannot cache because of NFS semantics, like NFS GETATTR requests
for revalidation and the like.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-04-30 Thread Chris Siebenmann
 I have a test system with 132 (small) ZFS pools[*], as part of our
work to validate a new ZFS-based fileserver environment. In testing,
it appears that we can produce situations that will run the kernel out
of memory, or at least out of some resource such that things start
complaining 'bash: fork: Resource temporarily unavailable'. Sometimes
the system locks up solid.

 I've found at least two situations that reliably do this:
- trying to 'zpool scrub' each pool in sequence (waiting for each scrub
  to complete before starting the next one).
- starting simultaneous sequential read IO from all pools from a NFS client.
  (trying to do the same IO from the server basically kills the server
  entirely.)

 If I aggregate the same disk space into 12 pools instead of 132, the
same IO load does not kill the system.

 The ZFS machine is an X2100 M2 with 2 GB of physical memory and 1 GB
of swap, running 64-bit Solaris 10 U4 with an almost current set of
patches; it gets the storage from another machine via iSCSI. The pools
are non-redundant, with each vdev being a whole iSCSI LUN.

 Is this a known issue (or issues)? If this isn't a known issue, does
anyone have pointers to good tools to trace down what might be happening
and where memory is disappearing and so on? Does the system plain need
more memory for this number of pools and if so, does anyone know how
much?

 Thanks in advance.

(I was pointed to mdb -k's '::kmastat' by some people on the OpenSolaris
IRC channel but I haven't spotted anything particularly enlightening in
its output, and I can't run it once the system has gone over the edge.)

- cks
[*: we have an outstanding uncertainty over how many ZFS pools a
single system can sensibly support, so testing something larger
than we'd use in production seemed sensible.]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-04-30 Thread Chris Siebenmann
| Still, I'm curious -- why lots of pools?  Administration would be
| simpler with a single pool containing many filesystems.

 The short answer is that it is politically and administratively easier
to use (at least) one pool per storage-buying group in our environment.
This got discussed in more detail in the 'How many ZFS pools is it
sensible to use on a single server' zfs-discuss thread I started earlier
this month[*].

(Trying to answer the question myself is the reason I wound up setting
up 132 pools on my test system and discovering this issue.)

- cks
[*: http://opensolaris.org/jive/thread.jspa?threadID=56802]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How many ZFS pools is it sensible to use on a single server?

2008-04-12 Thread Chris Siebenmann
| Hi Chris, I would have thought that managing multiple pools (you
| mentioned 200) would be an absolute administrative nightmare. If you
| give more details about your storage needs like number of users, space
| required etc it might become clearer what you're thinking of setting
| up.

 Every university department has to face the issue of how to allocate
disk space to people. Here, we handle storage allocation decisions
through the relatively simple method of selling fixed-size chunks of
storage to faculty (either single professors or groups of them) for a
small one-time fee.

(We went with fixed-size chunks partly because it is simpler to administer
and to set prices, and partly because it is our current model in our
Solaris 8 + DiskSuite + constant-sized partitions environment.)

 So, we are always going to have a certain number of logical pools of
storage space to manage. The question is whether to handle them as
separate ZFS pools or aggregate them into fewer ZFS pools and then
administer them as sub-hierarchies using quotas[*], and our current
belief is that doing the former is simpler to administer and simpler to
explain to users.

 200 pools on a single server is probably pessimistic (hopefully there
will be significantly fewer), but could happen if people go wild with
separate pools and there is a failover situation where a single physical
server has to handle several logical fileservers at once.

| Also, I see you were considering 200 pools on a single
| server. Considering that you'll want redundancy in each pool, if
| you're forming your pools from complete physical disks, you are
| looking at 400 disks minimum if you use a simple 2-disk mirror for
| each pool. I think it's not recommended t use partial disk slices to
| form pools -- use whole disks.

 We're not going to use local disk storage on the fileservers for
various reasons, including failover and easier long-term storage
management and expansion. We have pretty much settled on iSCSI
(mirroring each ZFS vdev across two controllers, so our fileservers do
not panic if we lose a single controller). The fixed-size chunks will be
done at the disk level, either as slices from a single LUN on Solaris or
as individual LUNs sliced out of each disk on the iSCSI target.

(Probably the latter, because it lets us use more slices per disk and
we have some number of 'legacy' 35 GB disk chunks that we cannot really
give free size upgrades to.)

(Using full disks as the chunk size is infeasible for several reasons.)

- cks
[*: we've experimented, and quotas turn out to work better than reservations
for this purpose. If anyone wants more details, see
http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSReservationsVsQuotas
]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How many ZFS pools is it sensible to use on a single server?

2008-04-12 Thread Chris Siebenmann
| I don't think that's the case.  What's wrong with setting both a quota
| and a reservation on your user filesystems?

 In a shared ZFS pool situation I don't think we'd get anything from
using both. We have to use something to limit people to the storage that
they bought, and in at least S10 U4 quotas work better for this (we
tested).

| What advantage will multiple zpools present over a single one with
| filesystems carved out of it?  With a single pool, you can expand
| filesystems if the user requests it just by changing the quota and
| reservation for that filesystem, and add more capacity if necessary
| by adding more disks to the pool.  If your policy is to use, say, a
| single pair of 35GB mirrors per zpool and the user wants more space,
| they need to split their files into categories somehow.

 Pools can/will have more than one vdev. The plan is that we will
have a set of unallocated fixed-size chunks of disk space (as LUNs or
slices). When someone buys more space, we pair up two such chunks and
add them to the person's pool as a mirrored vdev.
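
(Mechanically, selling a group another chunk would then be something
like the following, with invented names and sizes:
    zpool add grouppool mirror c4t9d0 c5t9d0
    # and, if the pool is capped at its purchased size:
    zfs set quota=70g grouppool
)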

 With the single pool approach, you have a number of issues:
- if you keep pools at their currently purchased space, you have to both
  add a new vdev *and* bump someone's quota by the appropriate amount.
  This is somewhat more work and opens you up to the possibility of stupid
  mistakes when you change the quotas.

- if you preallocate space to pools before it is purchased by anyone,
  you have to statically split your space between fileservers in advance.
  You may also need to statically split the space between multiple pools
  on a single fileserver, if a single pool would otherwise have too many
  disks to make you comfortable; this limits how much space a person can
  add to their existing allocation in an artificial way.

- if a disaster happens and you lose both sides of a mirrored vdev, you
  will have lost a *lot* more data (and a lot more people will be affected)
  than if you had things split up into separate pools. (Of course, this
  depends on how many of your separate pools had vdevs involving the
  pair of disks that you just lost; you could lose nearly as much data,
  if most of your pools were using chunks of the disk.)

  This argues for having multiple pools on a fileserver, which runs you
  into the 'people can only grow so far' problem.

We plan to use snapshots only while we take backups, partly because of
their effects on quotas and so on. Any additional usage of snapshots
would probably be under user control, so that the people who own the
space can make decisions like 'we will accept losing some space so that
we can instantly go back to yesterday'.

(There are groups that would probably take that, and groups that never
would.)

| You might want to use refquota and refreservation if you're
| running a Solaris that supports them---that precludes Solaris 10u4,
| unfortunately.  If you're running Nevada, though, they're definitely
| the way to go.

 This is going to be a production environment, so we're pretty much
stuck with whichever Solaris 10 update is current.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris ZFS NAS Setup

2008-04-08 Thread Chris Siebenmann
| Is it really true that as the guy on the above link states (Please
| read the link, sorry) when one iSCSI mirror goes off line, the
| initiator system will panic?  Or even worse, not boot its self cleanly
| after such a panic?  How could this be?  Anyone else with experience
| with iSCSI based ZFS mirrors?

 Our experience with Solaris 10U4 and iSCSI targets is that Solaris only
panics if the pool fails entirely (eg, you lose both/all mirrors in a
mirrored vdev). The fix for this is in current OpenSolaris builds, and
we have been told by our Sun support people that it will (only) appear
in Solaris 10 U6, apparently scheduled for sometime around fall.

 My experience is that Solaris will normally recover after the panic and
reboot, although failed ZFS pools will be completely inaccessible as you'd
expect. However, there are two gotchas:

* under at least some circumstances, a completely inaccessible iSCSI
  target (as you might get with, eg, a switch failure) will stall booting
  for a significant length of time (tens of minutes, depending on how many
  iSCSI disks you have on it).

* if a ZFS pool's storage is present but unwritable for some reason,
  Solaris 10 U4 will panic the moment it tries to bring the pool up;
  you will wind up stuck in a perpetual 'boot, panic, reboot, ...'
  cycle until you forcibly remove the storage entirely somehow.

The second issue is presumably fixed as part of the general fix of 'ZFS
panics on pool failure', although we haven't tested it explicitly. I
don't know if the first issue is fixed in current Nevada builds.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How many ZFS pools is it sensible to use on a single server?

2008-04-08 Thread Chris Siebenmann
 In our environment, the politically and administratively simplest
approach to managing our storage is to give each separate group at
least one ZFS pool of their own (into which they will put their various
filesystems). This could lead to a proliferation of ZFS pools on our
fileservers (my current guess is at least 50 pools and perhaps up to
several hundred), which leaves us wondering how well ZFS handles this
many pools.

 So: is ZFS happy with, say, 200 pools on a single server? Are there any
issues (slow startup, say, or peculiar IO performance) that we'll run
into? Has anyone done this in production? If there are issues, is there
any sense of what the recommended largest number of pools per server is?

 Thanks in advance.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and multipath with iSCSI

2008-04-05 Thread Chris Siebenmann
| You DO mean IPMP then.  That's what I was trying to sort out, to make
| sure that you were talking about the IP part of things, the iSCSI
| layer.

 My apologies for my lack of clarity. We are not looking at IPMP
multipathing; we are using MPxIO multipathing (mpathadm et al), which
operates at what one can think of as a higher level.

(IPMP gives you a single session to iSCSI storage over multiple network
devices. MPxIO and appropriate lower-level iSCSI settings give you
multiple sessions to iSCSI storage over multiple networks and multiple
network devices.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS and multipath with iSCSI

2008-04-04 Thread Chris Siebenmann
 We're currently designing a ZFS fileserver environment with iSCSI based
storage (for failover, cost, ease of expansion, and so on). As part of
this we would like to use multipathing for extra reliability, and I am
not sure how we want to configure it.

 Our iSCSI backend only supports multiple sessions per target, not
multiple connections per session (and my understanding is that the
Solaris initiator doesn't currently support multiple connections
anyways). However, we have been cautioned that there is nothing in
the backend that imposes a global ordering for commands between the
sessions, and so disk IO might get reordered if Solaris's multipath load
balancing submits part of it to one session and part to another.

 So: does anyone know if Solaris's multipath and iSCSI systems already
take care of this, or if ZFS already is paranoid enough to deal
with this, or if we should configure Solaris multipathing to not
load-balance?

(A load-balanced multipath configuration is simpler for us to
administer, at least until I figure out how to tell Solaris multipathing
which is the preferred network for any given iSCSI target so we can
balance the overall network load by hand.)
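
(If we did decide to turn load balancing off, my understanding is that
it is a one-line change in /kernel/drv/scsi_vhci.conf, followed by a
reboot:
    load-balance="none";
instead of the default "round-robin"; corrections welcome if there is a
better way.)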

 Thanks in advance.

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and multipath with iSCSI

2008-04-04 Thread Chris Siebenmann
| I assume you mean IPMP here, which refers to ethernet multipath.
|
| There is also the other meaning of multipath referring to multiple
| paths to the storage array typically enabled by stmsboot command.

 We are currently looking at (and testing) the non-ethernet sort of
multipathing, partly as being the simplest way to have those multiple
paths to the iSCSI backend storage by using two completely independent
networks. This should also give us greater aggregate bandwidth through
the entire fabric.

(Each iSCSI backend unit will have two network interfaces with separate
IP addresses and so on.)

- cks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss