Re: [zfs-discuss] XATTRs, ZAP and the Mac
On Wed, May 03, 2006 at 03:22:53PM -0400, Maury Markowitz wrote:
> > > I think that's the disconnect. WHY are they full-fledged files?
> >
> > Because that's what the specification calls for.
>
> Right, but that's my concern. To me this sounds like historically
> circular reasoning...
>
>   20xx) we need a new file system that supports xattrs well; xattrs are
>         this second file, so...
>
> To me it appears that there is some confusion between the purpose and
> the implementation. Certainly if xattrs were originally introduced to
> store, well, x attrs, then the implementation is a poor one. Years
> later the _implementation_ was copied, even though it was never a good
> one.

I think you are confusing the interface with the implementation. ZFS has copied (aka adhered to) a pre-existing interface[*]. Our implementation of that interface is in some ways similar to other implementations. I believe that our implementation is a very good one, but if you have specific suggestions for how it could be improved, we'd love to hear them.

[*] The Solaris extended attributes interface is actually more accurately called named streams, and has been used as the back-end for the CIFS (Windows) and NFSv4 named-streams protocols. See the fsattr(5) manpage.

We appreciate your suggestion that we implement a higher-performance method for storing additional metadata associated with files. This will most likely not be possible within the extended attribute interface, and will require that we design (and applications use) a new interface. Having specific examples of how that interface would be used will help us to design a useful feature.

> The real problem is that there is nothing like a general overview of
> the zfs system as a whole.

I agree that a higher-level overview would be useful.

> COMPARING the system with the widely understood UFS would be
> invaluable, IMHO.

Agreed, thanks for the suggestion.
Unfortunately, ZFS and UFS are sufficiently different that I think the comparison would only be useful for a very limited part of ZFS, say from the file/directory level down.

> But to the specifics. You asked why I thought it was that the file name
> did not appear. Well, that's because the term file name (or filename)
> does not appear anywhere in the document.

Thanks, maybe we should use that keyword in section 6.2 to help when doing a search.

> So then, at a first glance it seems that one would expect to find the
> directory description in Chapter 6, which has a subsection called
> Directories and Directory Traversal.

I believe that that section does in fact describe directories. Perhaps the description could be made more explicit (eg. "The ZAP object which stores the directory maps from filename to object number. Each entry in the ZAP is a single directory entry. The entry's name is the filename, and its value is the object number which identifies that file.")

> That section describes the znode_phys_t structure.

You're right, it also describes the znode_phys_t. There should be a section break after the first paragraph, before we start talking about the znode_phys_t.

> Maybe I'm going down a dark alley here, but is there any reason this
> split still exists under zfs? IE, I assumed that the znode_phys_t would
> be located in the directory ZAP, because to my mind, that's where
> metadata belongs.

ZFS must support POSIX semantics, part of which is hard links. Hard links allow you to create multiple names (directory entries) for the same file. Therefore, all UNIX filesystems have chosen to store the file information separately from the directory entries (otherwise, you'd have multiple copies, and need pointers between all of them so you could update them all -- yuck). Hard links suck for FS designers because they constrain our implementation in this way. We'd love to have the flexibility to easily store metadata with the directory entry.
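The split described above (names live in the directory ZAP, per-file metadata lives once in the znode) can be illustrated with a small conceptual model. This is hypothetical Python, not ZFS code; the dictionaries and field names are invented for illustration:

```python
# Conceptual model (not ZFS code): a directory is a ZAP-like map from
# filename -> object number; per-file metadata (the znode) is stored once,
# keyed by object number, so a hard link is just another directory entry.
objects = {
    123: {"type": "file", "links": 0, "size": 1400},  # znode-like record
}
directories = {
    "/a": {},
    "/b": {},
}

def link(dirname, name, objnum):
    """Create a directory entry (a hard link) pointing at an object."""
    directories[dirname][name] = objnum
    objects[objnum]["links"] += 1

link("/a", "test.h", 123)
link("/b", "alias.h", 123)   # a second name for the same file

# Both names resolve to the same metadata, so updating it once is enough.
assert directories["/a"]["test.h"] == directories["/b"]["alias.h"]
objects[123]["size"] = 2800
print(objects[directories["/b"]["alias.h"]]["size"])  # -> 2800
print(objects[123]["links"])                          # -> 2
```

If the znode lived inside one directory's ZAP, the second link would either duplicate it or need a pointer back to the first directory, which is exactly the constraint described above.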
We've actually contemplated caching the metadata needed to do a stat(2) in the directory entry, to improve performance of directory traversals like find(1). Perhaps we'll be able to add this performance improvement in a future release.

--matt
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota
On Thu, May 18, 2006 at 12:46:28PM -0700, Charlie wrote:
> Traditional (amanda). I'm not seeing a way to dump zfs file systems to
> tape without resorting to 'zfs send' being piped through gtar or
> something. Even then, the only thing I could restore was an entire file
> system. (We frequently restore single files for users...) Perhaps,
> since zfs isn't limited to one snapshot per FS like fssnap is, I should
> be redesigning everything. It sounds like I should look at using many
> snapshots, and dumping to tape (each file system, somehow) less
> frequently.

That's right. With ZFS, there should never be a need to go to tape to recover an accidentally deleted file, because it's easy[*] to keep lots of snapshots around.

[*] Well, modulo 6373978 "want to take lots of snapshots quickly ('zfs snapshot -r')". I'm working on that...

--matt
Re: [zfs-discuss] tracking error to file
On Tue, May 23, 2006 at 11:49:47AM +0200, Wout Mertens wrote:
> Can that same method be used to figure out what files changed between
> snapshots?

To figure out what files changed, we need to (a) figure out what object numbers changed, and (b) do the object number to file name translation. The method I described (using zdb) will not be involved in either step. zdb is an undocumented interface, and using it for this purpose is only a workaround. However, the same algorithms implemented in zdb will be used to do step (b), the object number to file name translation.

--matt
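Step (b), translating an object number back to a name, amounts to a reverse lookup over directory entries. A rough sketch of the idea, as a hypothetical Python model (the directory layout and helper are invented, and real implementations avoid a full scan by walking parent pointers):

```python
# Hypothetical model: directories are maps of name -> object number.
# Translating an object number to a path means finding the directory
# entry that references it.
directories = {
    "/": {"home": 4},
    "/home": {"notes.txt": 7, "code": 9},
    "/home/code": {"main.c": 12},
}

def object_to_path(objnum):
    """Return the first path whose directory entry references objnum."""
    for dirpath, entries in directories.items():
        for name, num in entries.items():
            if num == objnum:
                prefix = "" if dirpath == "/" else dirpath
                return prefix + "/" + name
    return None  # object exists but has no (known) name

print(object_to_path(12))  # -> /home/code/main.c
```

Note that hard links make this translation one-to-many: a single object number may be reachable under several names.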
Re: [zfs-discuss] Misc questions
On Tue, May 23, 2006 at 02:34:30PM -0700, Jeff Victor wrote:
> * When you share a ZFS fs via NFS, what happens to files and
>   filesystems that exceed the limits of NFS?

What limits do you have in mind? I'm not an NFS expert, but I think that NFSv4 (and probably v3) supports 64-bit file sizes, so there would be no limit mismatch there.

> * Is there a recommendation or some guidelines to help answer the
>   question "how full should a pool be before deciding it's time to add
>   disk space to a pool?"

I'm not sure, but I'd guess around 90%.

> * Migrating pre-ZFS backups to ZFS backups: is there a better method
>   than "restore the old backup into a ZFS fs, then back it up using
>   zfs send"?

No.

> * Are ZFS quotas enforced assuming that compressed data is compressed,
>   or uncompressed?

Quotas apply to the amount of space used, after compression. This is the space reported by 'zfs list', 'zfs get used', 'df', 'du', etc.

> The former seems to imply that the following would create a mess:
>   1) Turn on compression
>   2) Store data in the pool until the pool is almost full
>   3) Turn off compression
>   4) Read and re-write every file (thus expanding each file)

Since this example doesn't involve quotas, their behavior is not applicable here. In this example, there will be insufficient space in the pool to store your data, so your write operation will fail with ENOSPC. Perhaps a messy situation, but I don't see any alternative. If this is a concern, don't use compression. If you filled up a filesystem's quota rather than a pool, the behavior would be the same except you would get EDQUOT rather than ENOSPC.

> * What block sizes will ZFS use? Is there an explanation somewhere
>   about its method of choosing a blocksize for a particular workload?

Files smaller than 128k will be stored in a single block, whose size is rounded up to the nearest sector (512 bytes). Files larger than 128k will be stored in multiple 128k blocks (unless the recordsize property has been set -- see the zfs(1m) manpage for an explanation of this).
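The block-sizing rule above can be sketched as a small function. This is a simplified model (it ignores compression and indirect blocks), not ZFS source:

```python
RECORDSIZE = 128 * 1024  # ZFS default recordsize
SECTOR = 512

def block_layout(filesize, recordsize=RECORDSIZE):
    """Return (number_of_blocks, block_size) for a file, per the rule
    above: small files get one block rounded up to a 512-byte sector;
    larger files get multiple recordsize blocks."""
    if filesize <= recordsize:
        bsize = max(SECTOR, -(-filesize // SECTOR) * SECTOR)  # ceil to sector
        return 1, bsize
    nblocks = -(-filesize // recordsize)  # ceiling division
    return nblocks, recordsize

print(block_layout(1000))        # -> (1, 1024): one block of two sectors
print(block_layout(300 * 1024))  # -> (3, 131072): three 128k blocks
```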
Thanks for using zfs!

--matt
Re: [zfs-discuss] ZFS and HSM
On Wed, May 24, 2006 at 03:43:54PM -0400, Scott Dickson wrote:
> I said I had several questions to start threads on... What about ZFS
> and various HSM solutions? Do any of them already work with ZFS? Are
> any going to? It seems like HSM solutions that access things at a file
> level would have little trouble integrating with ZFS. But ones that
> work at a block level would have a harder time.

Sun is working on getting SAM (an HSM which is currently wedded to QFS) working with ZFS.

--matt
Re: [zfs-discuss] ZFS mirror and read policy; kstat I/O values for zfs
On Fri, May 26, 2006 at 09:40:57PM +0200, Daniel Rock wrote:
> So you can see the second disk of each mirror pair (c4tXd0) gets almost
> no I/O. How does ZFS decide from which mirror device to read?

You are almost certainly running into this known bug:

  630 reads from mirror are not spread evenly

--matt
Re: [zfs-discuss] question about ZFS performance for webserving/java
On Thu, Jun 01, 2006 at 11:35:41AM -1000, David J. Orman wrote:
> 3 - App server would be running in one zone, with a (NFS) mounted ZFS
>     filesystem as storage.
> 4 - DB server (PgSQL) would be running in another zone, with a (NFS)
>     mounted ZFS filesystem as storage.

Why would you use NFS? These zones are on the same machine as the storage, right? You can simply export filesystems in your pool to the various zones (see the zfs(1m) and zonecfg(1m) manpages). This will result in better performance.

> 5 - Multiple disk redundancy is needed. So, I'm assuming two raid-z
>     pools of 3 drives each, mirrored is the solution. If people have a
>     better suggestion, tell me! :P

There is no need for multiple pools. Perhaps you meant two raid-z groups (aka vdevs) in a single pool? Also, wouldn't you want to use all 8 disks, therefore use two 4-disk raid-z groups? This way you would get 3 disks worth of usable space.

Depending on how much space you need, you should consider using a single double-parity RAID-Z group with your 8 disks. This would give you 6 disks worth of usable space. Given that you want to be able to tolerate two failures, that is probably your best solution. Other solutions would include three 3-way mirrors (if you can fit another drive in your machine), giving you 3 disks worth of usable space.

--matt
Re: [zfs-discuss] Proposal: delegated administration
On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote:
> So as administrator what do I need to do to set /export/home up for
> users to be able to create their own snapshots, create dependent
> filesystems (but still mounted underneath their /export/home/username)?
> In other words, is there a way to specify the rights of the owner of a
> filesystem rather than the individual - eg, delayed evaluation of the
> owner?

I think you're asking for the -c (Creator) flag. This allows permissions (eg, to take snapshots) to be granted to whoever creates the filesystem. The above example shows how this might be done.

--matt

> Actually, I think I mean owner. I want root to create a new filesystem
> for a new user under the /export/home filesystem, but then have that
> user get the right privs via inheritance rather than requiring root to
> run a set of zfs commands.

In that case, how should the system determine who the owner is? We toyed with the idea of figuring out the user based on the last component of the filesystem name, but that seemed too tricky, at least for the first version. FYI, here is how you can do it with an additional zfs command:

  # zfs create tank/home/barts
  # zfs allow barts create,snapshot,... tank/home/barts

--matt
Re: [zfs-discuss] ZFS needs a viable backup mechanism
On Fri, Jul 07, 2006 at 04:00:38PM -0400, Dale Ghent wrote:
> Add an option to zpool(1M) to dump the pool config as well as the
> configuration of the volumes within it to an XML file. This file could
> then be sucked in to zpool at a later date to recreate/replicate the
> pool and its volume structure in one fell swoop. After that, Just Add
> Data(tm).

Yep, this has been on our to-do list for quite some time:

  RFE #6276640 zpool config
  RFE #6276912 zfs config

--matt
Re: [zfs-discuss] metadata inconsistency?
On Thu, Jul 06, 2006 at 12:46:57AM -0700, Patrick Mauritz wrote:
> Hi, after some unscheduled reboots (to put it lightly), I've got an
> interesting setup on my notebook's zfs partition.
>
> Setup: simple zpool, no raid or mirror, a couple of zfs partitions, one
> zvol for swap. /foo is one such partition, /foo/bar the directory with
> the issue.
>
> Directly after the reboot happened:
>   $ ls /foo/bar
>   test.h
>   $ ls -l /foo/bar
>   total 0
> The file wasn't accessible with cat, etc.

This can happen when the file appears in the directory listing (ie. getdents(2)), but a stat(2) on the file fails. Why that stat would fail is a bit of a mystery, given that ls doesn't report the error. It could be that the underlying hardware has failed, and the directory is still intact but the file's metadata has been damaged. (Note, this would be a hardware error, not metadata inconsistency.) Another possibility is that the file's inode number is too large to be expressed in 32 bits, thus causing a 32-bit stat() to fail. However, I don't think that Sun's ls(1) should be issuing any 32-bit stats (even on a 32-bit system, it should be using stat64).

> Somewhat later (new data appeared on /foo, in /foo/baz):
>   $ ls -l /foo/bar
>   total 3
>   -rw-r--r-- 1 user group 1400 Jul 6 02:14 test.h
> The content of test.h is the same as the content of /foo/baz/quux now,
> but the refcount is 1!
>   $ chmod go-r /foo/baz/quux
>   $ ls -l /foo/bar
>   total 3
>   -rw------- 1 user group 1400 Jul 6 02:14 test.h

This behavior could also be explained if there is an unknown bug which causes the object representing the file to be deleted, but not the directory entry pointing to it.

> Anyway, how do I get rid of test.h now without making quux unreadable?
> (The brute force approach would be a new partition, moving data over
> with copying - instead of moving - the troublesome file, just in case;
> not sure if zfs allows for links that cross zfs partitions and thus
> optimizes such moves. Then zfs destroy data/test. But there might be a
> better way?)
Before trying to rectify the problem, could you email me the output of 'zpool status' and 'zdb -vvv foo'? FYI, there are no cross-filesystem links, even with ZFS.

--matt
Re: [zfs-discuss] How can I watch IO operations with dtrace on zfs?
On Thu, Jul 20, 2006 at 12:58:31AM -0700, Trond Norbye wrote:
> I have been using the iosnoop script (see
> http://www.opensolaris.org/os/community/dtrace/scripts/) written by
> Brendan Gregg to look at the IO operations of my application. ... So
> how can I get the same information from a ZFS file-system?

As you can see, ZFS is not yet fully integrated with the dtrace i/o provider. With ZFS, writes are (typically) deferred, so it is nontrivial to assign each write i/o to a particular application. If you are familiar with dtrace, you can use fbt to look at the zio_done() function, eg. with something like this:

  zio_done:entry
  /args[0]->io_type == 1 && args[0]->io_bp != NULL/
  {
          @bytes["read", args[0]->io_bookmark.zb_objset,
              args[0]->io_bookmark.zb_object,
              args[0]->io_bookmark.zb_level,
              args[0]->io_bookmark.zb_blkid != 0] =
              /* sum(args[0]->io_size); */ count();
  }

  zio_done:entry
  /args[0]->io_type == 2/
  {
          @bytes["write", args[0]->io_bookmark.zb_objset,
              args[0]->io_bookmark.zb_object,
              args[0]->io_bookmark.zb_level,
              args[0]->io_bookmark.zb_blkid != 0] =
              /* sum(args[0]->io_size); */ count();
  }

  END
  {
          printf("r/w objset object level blk0 i/os\n");
          printa("%5s %4d %7d %d %d %@d\n", @bytes);
  }

--matt
Re: [zfs-discuss] Re: Quotas and Snapshots
On Tue, Jul 25, 2006 at 11:13:16AM -0700, Brad Plecs wrote:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6431277
>
> What I'd really like to see is ... the ability for the snapshot space
> to *not* impact the filesystem space.

Yep, as Eric mentioned, that is the purpose of this RFE (want filesystem-only quotas). I imagine that this would be implemented as a quota against the space referenced (as currently reported by 'zfs list', 'zfs get refer', 'df', etc; see the zfs(1m) manpage for details).

> In fact, I think a lot of ZFS's hierarchical features would be more
> valuable if parent filesystems included their descendants (backups and
> NFS sharing, for example), but I'm sure there are just as many
> arguments against that as for it.

Yep, we're working on making more features work on "this and all descendents". For example, the recently implemented 'zfs snapshot -r' can create snapshots of a filesystem and all its descendents. This feature will be part of Solaris 10 update 3. We're also working on 'zfs send -r' (RFE 6421958).

--matt
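The difference between today's quota (charged against all space used, snapshots included) and the proposed referenced-space quota can be modeled roughly. This is a hypothetical sketch; the function and its arguments are invented for illustration:

```python
def over_quota(referenced, snapshot_only, quota, referenced_style=False):
    """Hypothetical model: a classic ZFS quota counts live data plus
    space held only by snapshots; a referenced-style (filesystem-only)
    quota counts live data alone."""
    used = referenced + snapshot_only
    charged = referenced if referenced_style else used
    return charged > quota

GB = 1024 ** 3
# 8 GB of live data, 3 GB held only by snapshots, 10 GB quota:
print(over_quota(8 * GB, 3 * GB, 10 * GB))                        # -> True
print(over_quota(8 * GB, 3 * GB, 10 * GB, referenced_style=True)) # -> False
```

Under the classic quota, taking snapshots can push a user over the limit even though their live data shrank; under the referenced-style quota it cannot.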
Re: [zfs-discuss] Re: Quotas and Snapshots
On Tue, Jul 25, 2006 at 07:24:51PM -0500, Mike Gerdts wrote:
> On 7/25/06, Brad Plecs wrote:
> > What I'd really like to see is ... the ability for the snapshot space
> > to *not* impact the filesystem space.
>
> The idea is that you have two storage pools - one for live data, one
> for backup data. Your live data is *probably* on faster disks than your
> backup data. The live data and backup data may or may not be on the
> same server. Whenever you need to perform backups you do something
> along the lines of:
>
>   yesterday=$1
>   today=$2
>   for user in $allusers ; do
>       zfs snapshot users/$user@$today
>       zfs snapshot backup/$user/$yesterday@$today
>       zfs clone backup/$user/$yesterday@$today backup/$user/$today
>       rsync -axuv /users/$user/.zfs/snapshot/$today /backup/$user/$today
>       zfs destroy users/$user@$yesterday
>       zfs destroy backup/$user/$lastweek
>   done

You can simplify and improve the performance of this considerably by using 'zfs send':

  for user in $allusers ; do
      zfs snapshot users/$user@$today
      zfs send -i $yesterday users/$user@$today | \
          ssh $host zfs recv -d $backpath
      ssh $host zfs destroy $backpath/$user/$lastweek
  done

You can send the backup to the same or different host, and the same or different pool, as your hardware needs dictate. 'zfs send' will be much faster than rsync because we can use ZFS metadata to determine which blocks were changed without traversing all files and directories.

--matt
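The metadata trick behind that last point can be sketched with a toy block tree. This is a hypothetical model of the idea, not ZFS code: every block records the transaction group (txg) in which it was born, so an incremental send can skip any subtree whose birth txg is no newer than the base snapshot, without reading it:

```python
# Toy block tree: each node has a birth txg and child blocks.
tree = {
    "birth": 90,
    "children": [
        {"birth": 40, "children": []},        # unchanged subtree: pruned
        {"birth": 90, "children": [
            {"birth": 85, "children": []},    # unchanged leaf
            {"birth": 90, "children": []},    # changed leaf
        ]},
    ],
}

def blocks_to_send(node, base_txg):
    """Count leaf blocks modified after the base snapshot's txg,
    pruning whole subtrees whose birth txg is not newer."""
    if node["birth"] <= base_txg:
        return 0                       # entire subtree unchanged: skip
    sent = 0 if node["children"] else 1
    for child in node["children"]:
        sent += blocks_to_send(child, base_txg)
    return sent

print(blocks_to_send(tree, base_txg=85))  # -> 1 (only the txg-90 leaf)
```

rsync, by contrast, must stat every file to discover the same single change.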
Re: [zfs-discuss] Supporting ~10K users on ZFS
On Thu, Jun 29, 2006 at 08:20:56PM +0200, Robert Milkowski wrote:
> btw: I believe it was discussed here before - it would be great if one
> could automatically convert a given directory on a zfs filesystem into
> a zfs filesystem (without actually copying all the data).

Yep, and an RFE has been filed:

  6400399 want zfs split, and vice versa (making a given zfs filesystem a directory)

But more filesystems is better! :-) (And this would be pretty nontrivial; we'd have to resolve conflicting inode (object) numbers, thus rewriting all the metadata.)

Back to slogging through old mail archives,
--matt
Re: [zfs-discuss] This may be a somewhat silly question ...
On Tue, Jun 27, 2006 at 06:30:46PM -0400, Dennis Clarke wrote:
> ... but I have to ask. How do I back this up?

The following two RFEs would help you out enormously:

  6421958 want recursive zfs send ('zfs send -r')
  6421959 want zfs send to preserve properties ('zfs send -p')

As far as RFEs go, these are pretty high priority...

--matt
Re: [zfs-discuss] ZFS compression best-practice?
On Thu, Jul 27, 2006 at 03:54:02PM -0400, Christine Tran wrote:
> - What is the compression algorithm used?

It is based on the Lempel-Ziv algorithm.

> - Is there a ZFS feature that will output the real uncompressed size of
>   the data? The scenario is if they had to move a compressed ZFS
>   filesystem back to UFS, say. 'ls' will give the file's real
>   uncompressed size, but the customer would rather not write a script
>   to sum everything up.

You can multiply the 'referenced' and 'compressratio' properties of a filesystem to find out how much space it would use if uncompressed.

> - Customer wants to do a diff between snapshots. Is there an RFE
>   already filed?

Two, in fact:

  6370738 zfs diffs filesystems
  6425091 want 'zfs diff' to list files that have changed between snapshots

> - Customer would like benchmarking numbers. I think there is a blog
>   item but do we have something more official?

No; we're working on some more unofficial benchmark numbers, though :-)

--matt
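The referenced-times-compressratio arithmetic is simple enough to show with made-up numbers (these property values are hypothetical, standing in for the output of 'zfs get referenced,compressratio'):

```python
# Hypothetical figures for one filesystem:
referenced = 10 * 1024 ** 3   # 10 GB actually stored after compression
compressratio = 1.55          # reported by zfs as "1.55x"

# Space the same data would need uncompressed (e.g. after moving to UFS):
uncompressed = referenced * compressratio
print(round(uncompressed / 1024 ** 3, 1))  # -> 15.5 (GB)
```

This is an estimate: the ratio is an aggregate over all blocks, so per-file results on UFS will vary slightly.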
Re: [zfs-discuss] ZFS Boot Disk
On Thu, Jul 27, 2006 at 08:17:03PM -0500, Malahat Qureshi wrote:
> Is there any way to boot from a zfs disk, or a workaround?

Yes, see http://blogs.sun.com/roller/page/tabriz?entry=are_you_ready_to_rumble

--matt
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Tue, Aug 08, 2006 at 06:11:09PM +0200, Robert Milkowski wrote:
> filebench/singlestreamread v440
>
>   1. UFS, noatime, HW RAID5 6 disks, S10U2                         70MB/s
>   2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (same lun as in #1)   87MB/s
>   3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2                     130MB/s
>   4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44                    133MB/s

FYI, streaming read performance is improved considerably by Mark's prefetch fixes, which are in build 45. (However, as mentioned, you will soon run into the bandwidth of a single fibre channel connection.)

--matt
Re: [zfs-discuss] ZFS RAID10
On Tue, Aug 08, 2006 at 09:54:16AM -0700, Robert Milkowski wrote:
> Hi. snv_44, v440. filebench/varmail results for ZFS RAID10 with 6 disks
> and 32 disks. What is surprising is that the results for both cases are
> almost the same!
>
> 6 disks:
>   IO Summary: 566997 ops 9373.6 ops/s, (1442/1442 r/w) 45.7mb/s, 299us cpu/op, 5.1ms latency
>   IO Summary: 542398 ops 8971.4 ops/s, (1380/1380 r/w) 43.9mb/s, 300us cpu/op, 5.4ms latency
>
> 32 disks:
>   IO Summary: 572429 ops 9469.7 ops/s, (1457/1457 r/w) 46.2mb/s, 301us cpu/op, 5.1ms latency
>   IO Summary: 560491 ops 9270.6 ops/s, (1426/1427 r/w) 45.4mb/s, 300us cpu/op, 5.2ms latency
>
> Using iostat I can see that with 6 disks in a pool I get about 100-200
> IO/s per disk, and with a 32 disk pool I get only 30-70 IO/s per disk.
> Each CPU is used at about 25% in SYS (there are 4 CPUs). Something is
> wrong here.

It's possible that you are CPU limited. I'm guessing that your test uses only one thread, so that may be the limiting factor. We can get a quick idea of where that CPU is being spent if you can run 'lockstat -kgIW sleep 60' while your test is running, and send us the first 100 lines of output. It would be nice to see the output of 'iostat -xnpc 3' while the test is running, too.

--matt
Re: [zfs-discuss] Re: ZFS RAID10
On Tue, Aug 08, 2006 at 10:42:41AM -0700, Robert Milkowski wrote:
> filebench in varmail by default creates 16 threads - I confirmed it
> with prstat, 16 threads are created and running.

Ah, OK. Looking at these results, it doesn't seem to be CPU bound, and the disks are not fully utilized either. However, because the test is doing so many synchronous writes (eg. by calling fsync()), we are continually writing out the intent log. Unfortunately, we are only able to issue a small number of concurrent i/os while doing the intent log writes. All the threads must wait for the intent log blocks to be written before they can enqueue more data. Therefore, we are essentially doing:

  1. many threads call fsync();
  2. one of them flushes the intent log, issuing a few writes to the disks;
  3. all of the threads wait for those writes to complete;
  4. repeat.

This test fundamentally requires waiting for lots of synchronous writes. Assuming no other activity on the system, the performance of synchronous writes does not scale with the number of drives; it is bounded by the drive's write latency. If you were to alter the test to not require everything to be done synchronously, then you would see much different behavior.

--matt
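The scaling argument in that reply can be reduced to a one-line model. This is a deliberately simplified, hypothetical sketch (it ignores batching and queueing), meant only to show that the result is independent of the number of spindles:

```python
def fsync_ops_per_sec(write_latency_ms, ndisks):
    """Hypothetical model: every thread waits on the shared intent-log
    write, so sustained fsync throughput tracks one disk's write
    latency, not the number of disks in the pool."""
    del ndisks  # extra spindles do not hide per-write latency
    return 1000.0 / write_latency_ms

# With ~5 ms per intent-log write, 6 disks and 32 disks model the same:
print(fsync_ops_per_sec(5.0, 6))   # -> 200.0
print(fsync_ops_per_sec(5.0, 32))  # -> 200.0
```

A latency-bound workload like this only speeds up with lower-latency log devices, not with wider stripes.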
Re: [zfs-discuss] Proposal: 'canmount' option
On Thu, Aug 10, 2006 at 10:23:20AM -0700, Eric Schrock wrote:
> A new option will be added, 'canmount', which specifies whether the
> given filesystem can be mounted with 'zfs mount'. This is a boolean
> property, and is not inherited.

Cool, looks good. Do you plan to implement this using the generic (inheritable) property infrastructure (eg. dsl_prop_set/get()), and just ignore the setting if it is inherited?

--matt
Re: [zfs-discuss] Proposal: 'canmount' option
On Thu, Aug 10, 2006 at 10:44:46AM -0700, Eric Schrock wrote:
> Right now I'm using the generic property mechanism, but have a special
> case in dsl_prop_get_all() to ignore searching parents for this
> particular property. I'm not thrilled about it, but I only see two
> other options:
>
> 1. Do not use the generic infrastructure. This requires much more
>    invasive changes that I'd rather avoid.
>
> 2. From the kernel's perspective have it be inheritable, but then fake
>    up the non-inherited state in libzfs. i.e. if the source is not
>    ZFS_SRC_LOCAL, then pretend like it isn't set at all.
>
> If the current hack is too offensive, moving it into libzfs seems like
> a reasonable option.

Yeah, I guess I was suggesting (2), but having a check in the dsl_prop code might be better. It would probably be better to base it off some value stored in the zfs_prop_t, though, rather than hard-coding canmount into dsl_prop.c.

--matt
Re: [zfs-discuss] Difficult to recursive-move ZFS filesystems to another server
On Fri, Aug 11, 2006 at 10:02:41AM -0700, Brad Plecs wrote:
> There doesn't appear to be a way to move zfspool/www and its
> descendants en masse to a new machine with those quotas intact. I have
> to script the recreation of all of the descendant filesystems by hand.

Yep, you need:

  6421959 want zfs send to preserve properties ('zfs send -p')
  6421958 want recursive zfs send ('zfs send -r')

--matt
Re: [zfs-discuss] in-kernel gzip compression
On Thu, Aug 17, 2006 at 02:53:09PM +0200, Robert Milkowski wrote:
> Hello zfs-discuss, Is someone actually working on it? Or any other
> algorithms? Any dates?

Not that I know of. Any volunteers? :-)

(Actually, I think that an RLE compression algorithm for metadata is a higher priority, but if someone from the community wants to step up, we won't turn your code away!)

--matt
Re: [zfs-discuss] in-kernel gzip compression
On Thu, Aug 17, 2006 at 10:28:10AM -0700, Adam Leventhal wrote:
> On Thu, Aug 17, 2006 at 10:00:32AM -0700, Matthew Ahrens wrote:
> > (Actually, I think that an RLE compression algorithm for metadata is
> > a higher priority, but if someone from the community wants to step
> > up, we won't turn your code away!)
>
> Is RLE likely to be more efficient for metadata?

No, it is not likely to achieve a higher compression ratio. However, it should use significantly less CPU time. We've seen some circumstances where the CPU usage caused by compressing metadata can be less trivial than we'd like.

--matt
Re: [zfs-discuss] problem with zfs receive -i
On Sat, Aug 19, 2006 at 07:21:52PM -0700, Frank Cusack wrote:
> On August 19, 2006 7:06:06 PM -0700 Matthew Ahrens wrote:
> > My guess is that the filesystem is not mounted. It should be
> > remounted after the 'zfs recv', but perhaps that is not happening
> > correctly. You can see if it's mounted by running 'df' or
> > 'zfs list -o name,mounted'.
>
> You are right, it's not mounted.
>
> > Did the 'zfs recv' print any error messages?
>
> nope.
>
> > Are you able to reproduce this behavior?
>
> easily.

Hmm, I think there must be something special about your filesystems or configuration; I'm not able to reproduce it. One possible cause for trouble is if you are doing the 'zfs receive' into a filesystem which has descendent filesystems (eg, you are doing 'zfs recv pool/fs@snap' and pool/fs/child exists). This isn't handled correctly now, but you should get an error message in that case. (This will be fixed by some changes Noel is going to putback next week.) Could you send me the output of 'truss zfs recv ...', and 'zfs list' and 'zfs get -r all pool' on both the source and destination systems?

> ah ok. Note that if I do zfs send; zfs send -i on the local side, then
> do zfs list; zfs mount -a on the remote side, I still show space used
> in the @7.1 snapshot, even though I didn't touch anything. I guess
> mounting accesses the mount point and updates the atime.

Hmm, maybe. I'm not sure if that's exactly what's happening, because mounting and unmounting a filesystem doesn't seem to update the atime for me. Does the @7.1 snapshot show used space before you do the 'zfs mount -a'?

> On the local side, how come after I take the 7.1 snapshot and then
> 'ls', the 7.1 snapshot doesn't start using up space? Shouldn't my ls of
> the mountpoint update the atime also?

I believe what's happening here is that although we update the in-core atime, we sometimes defer pushing it to disk. You can force the atime to be pushed to disk by unmounting the filesystem.
--matt
Re: [zfs-discuss] Niagara and ZFS compression?
On Sun, Aug 20, 2006 at 08:38:03PM -0700, Luke Lonergan wrote:
> Matthew, On 8/20/06 6:20 PM, Matthew Ahrens wrote:
> > This was not the design; we're working on fixing this bug so that
> > many threads will be used to do the compression.
>
> Is this also true of decompression?

I believe that decompression already runs in many threads. If you see differently, let us know.

--matt
Re: [zfs-discuss] ZFS Performance compared to UFS VxFS
On Tue, Aug 22, 2006 at 06:15:08AM -0700, Tony Galway wrote:
> A question (well, let's make it 3 really) - Is vdbench a useful tool
> when testing file system performance of a ZFS file system? Secondly, is
> ZFS write performance really much worse than UFS or VxFS? And third,
> what is a good benchmarking tool to test ZFS vs UFS vs VxFS? ...
>
>   sd=ZFS,lun=/pool/TESTFILE,size=10g,threads=8
>   wd=ETL,sd=ZFS,rdpct=0,seekpct=80
>   rd=ETL,wd=ETL,iorate=max,elapsed=1800,interval=5,forxfersize=(1k,4k,8k,32k)

ZFS write performance should be much better than UFS or VxFS. What exactly is the write workload? It sounds like it is doing effectively random writes of various (1k, 4k, 8k, 32k) record sizes. As these record sizes are all smaller than ZFS's default block size (128k), they will all require ZFS to read in the 128k block. Whereas UFS (on x86) uses a 4k block size by default, so the 4k, 8k, and 32k record-size writes will not require any reads; only the 1k records will require UFS to read the block in from disk.

When doing record-structured access (eg. databases), it is recommended that you do 'zfs set recordsize=XXX' to set ZFS's block size to match your application's record size. In this case perhaps you should set it to 4k to match UFS.

> I am seeing large periods of time where there is no reported activity,
> and if I am looking at zpool iostat I do see consistent writing,
> however.

How are you measuring this reported activity? If your application is trying to write faster than the storage can keep up with, then it will have to be throttled. So if you are measuring this at the application or syscall level, then this is the expected behavior and does not indicate a performance problem in and of itself.

--matt
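The read-modify-write cost described in that reply can be sketched numerically. This is a simplified, hypothetical model (it assumes the block is not already cached and ignores parity and metadata I/O):

```python
def bytes_read_per_write(write_size, blocksize):
    """Hypothetical read-modify-write model: a write smaller than the
    filesystem block must first read the whole block it lands in; a
    full-block write needs no read."""
    return blocksize if write_size < blocksize else 0

K = 1024
# Default 128k ZFS recordsize: each random 4k write reads 128k first.
print(bytes_read_per_write(4 * K, 128 * K))  # -> 131072
# recordsize set to match the application's 4k records: no read needed.
print(bytes_read_per_write(4 * K, 4 * K))    # -> 0
```

That 32x read amplification on every small write is why matching recordsize to the application's record size matters for this workload.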
Re: [zfs-discuss] ZFS compression
On Tue, Aug 22, 2006 at 08:43:32AM -0700, roland wrote:
> can someone tell how effective zfs compression and space-efficiency
> (regarding small files) is? since compression works at the block level,
> i assume compression may not come into effect as some may expect.
> (maybe i'm wrong here)

It's true that since we are compressing a block at a time, some of the
efficiencies of whole-large-file compression will be lost. However, since
ZFS uses 128k blocks on large files, the difference should be negligible.
For smaller files, ZFS uses a single block that exactly fits the file
(compressed or not), rounded up to the nearest sector size (512 bytes). So
I believe that ZFS's compression infrastructure permits good efficiency.

However, at this point we have only implemented one compression algorithm,
which is much faster than, but does not compress as much as, gzip. We plan
to implement a broader range of compression algorithms in the future.

--matt
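The block-at-a-time point can be demonstrated with ordinary zlib (a
stand-in only; this sketch does not model ZFS's actual compression
algorithm or on-disk layout): compressing per block gives up a little
versus compressing the whole file, and larger blocks give up less.

```python
import zlib

def compress_blocks(data: bytes, block_size: int) -> int:
    """Total compressed size when each block is compressed independently,
    as a block-level filesystem must do (simplified model)."""
    total = 0
    for off in range(0, len(data), block_size):
        total += len(zlib.compress(data[off:off + block_size]))
    return total

data = b"a highly repetitive line of text\n" * 8192   # ~256k sample
whole = len(zlib.compress(data))                      # whole-file baseline
per_128k = compress_blocks(data, 128 * 1024)          # ZFS-sized blocks
per_4k = compress_blocks(data, 4 * 1024)              # much smaller blocks
print(whole, per_128k, per_4k)
```

On data like this, per_128k lands close to the whole-file size while
per_4k is noticeably larger, since each small block pays its own header
and starts with no history.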
Re: [zfs-discuss] Issue with zfs snapshot replication from version2 to version3 pool.
Shane, I wasn't able to reproduce this failure on my system. Could you try
running Eric's D script below and send us the output while running 'zfs
list'?

thanks,
--matt

On Fri, Aug 18, 2006 at 09:47:45AM -0700, Eric Schrock wrote:
> Can you send the output of this D script while running 'zfs list'?
>
> #!/sbin/dtrace -s
>
> zfs_ioc_snapshot_list_next:entry
> {
>         trace(stringof(args[0]->zc_name));
> }
>
> zfs_ioc_snapshot_list_next:return
> {
>         trace(arg1);
> }
>
> - Eric
>
> On Fri, Aug 18, 2006 at 09:27:36AM -0700, Shane Milton wrote:
>> I did a little bit of digging and didn't turn up any known issues. Any
>> insight would be appreciated.
>>
>> Basically I replicated a zfs snapshot from a version2 storage pool into
>> a version3 pool and it seems to have corrupted the version3 pool. At
>> the time of the error both pools were running on the same system (amd64
>> build44). The command used was something similar to the following:
>>
>> zfs send [EMAIL PROTECTED] | zfs recv [EMAIL PROTECTED]
>>
>> 'zfs list', 'zfs list -r version3pool_name', and 'zpool destroy
>> version3pool_name' all end with a core dump. After a little digging
>> with mdb and truss, it seems to be dying around the function
>> ZFS_IOC_SNAPSHOT_LIST_NEXT. I'm away from the system at the moment, but
>> do have some of the core files and truss output for those interested.
>>
>> # truss zfs list
>> execve(/sbin/zfs, 0x08047E90, 0x08047E9C) argc = 2
>> resolvepath(/usr/lib/ld.so.1, /lib/ld.so.1, 1023) = 12
>> resolvepath(/sbin/zfs, /sbin/zfs, 1023) = 9
>> sysconfig(_CONFIG_PAGESIZE) = 4096
>> xstat(2, /sbin/zfs, 0x08047C48) = 0
>> open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
>> xstat(2, /lib/libzfs.so.1, 0x08047448) = 0
>> resolvepath(/lib/libzfs.so.1, /lib/libzfs.so.1, 1023) = 16
>> open(/lib/libzfs.so.1, O_RDONLY) = 3
>> ...
ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08045FBC) = 0
ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08046DFC) = 0
ioctl(3, ZFS_IOC_OBJSET_STATS, 0x080450BC) = 0
ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08045EFC) Err#3 ESRCH
ioctl(3, ZFS_IOC_SNAPSHOT_LIST_NEXT, 0x08045EFC) Err#22 EINVAL
fstat64(2, 0x08044EE0) = 0
internal error: write(2, i n t e r n a l   e r r.., 16) = 16
Invalid argumentwrite(2, I n v a l i d   a r g u.., 16) = 16
write(2, \n, 1) = 1
sigaction(SIGABRT, 0x, 0x08045E30) = 0
sigaction(SIGABRT, 0x08045D70, 0x08045DF0) = 0
schedctl() = 0xFEBEC000
lwp_sigmask(SIG_SETMASK, 0x, 0x) = 0xFFBFFEFF [0x]
lwp_kill(1, SIGABRT) = 0
Received signal #6, SIGABRT [default]
siginfo: SIGABRT pid=1444 uid=0 code=-1

Thanks
-Shane

This message posted from opensolaris.org

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Re: [zfs-discuss] zpool import: snv_33 to S10 6/06
On Wed, Aug 23, 2006 at 09:57:04AM -0400, James Foronda wrote:
> Hi,
>
> [EMAIL PROTECTED] cat /etc/release
> Solaris Nevada snv_33 X86
> Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
> Use is subject to license terms.
> Assembled 06 February 2006
>
> I have zfs running well on this box. Now I want to upgrade to the
> Solaris 10 6/06 release. Question: Will the 6/06 release recognize the
> zfs created by snv_33? I seem to recall something about being at a
> certain release level for 6/06 to be able to import without problems. I
> searched the archives but I can't find where I read that anymore.

Yes, new releases of Solaris can seamlessly access any ZFS pools created
with Solaris Nevada or 10 (but not pools from before ZFS was integrated
into Solaris, in October 2005). However, once you upgrade to build 35 or
later (including S10 6/06), do not downgrade back to build 34 or earlier,
per the following message:

  Summary: If you use ZFS, do not downgrade from build 35 or later to
  build 34 or earlier.

  This putback (into Solaris Nevada build 35) introduced a
  backwards-compatible change to the ZFS on-disk format. Old pools will
  be seamlessly accessed by the new code; you do not need to do anything
  special. However, do *not* downgrade from build 35 or later to build 34
  or earlier. If you do so, some of your data may be inaccessible with
  the old code, and attempts to access this data will result in an
  assertion failure in zap.c.

  We have fixed the version-checking code so that if a similar change
  needs to be made in the future, the old code will fail gracefully with
  an informative error message.

After upgrading, you should consider running 'zpool upgrade' to enable
the latest features of ZFS, including ditto blocks, hot spares, and
double-parity RAID-Z.

--matt
Re: [zfs-discuss] zpool import: snv_33 to S10 6/06
On Thu, Aug 24, 2006 at 08:12:34AM +1000, Boyd Adamson wrote:
> Isn't the whole point of the zpool upgrade process to allow users to
> decide when they want to remove the "fall back to old version" option?
> In other words, shouldn't any change that eliminates going back to an
> old rev require an explicit zpool upgrade?

Yes, that is exactly the case. Unfortunately, builds prior to 35 had some
latent bugs which made implementation of 'zpool upgrade' nontrivial. Thus
we issued this one-time "do not downgrade" message and promptly
implemented 'zpool upgrade'.

--matt
[zfs-discuss] space accounting with RAID-Z
I just realized that I forgot to send this message to zfs-discuss back in
May when I fixed this bug. Sorry for the delay.

The putback of the following bug fix to Solaris Nevada build 42 and
Solaris 10 update 3 build 3 (coinciding with the change to ZFS on-disk
version 3) changes the behavior of space accounting when using pools with
raid-z:

  6288488 du reports misleading size on RAID-Z

The old behavior is that on raidz vdevs, the space used and available
includes the space used to store the data redundantly (ie. the parity
blocks). On mirror vdevs, and in all other products' RAID-4/5
implementations, it does not, leading to confusion. Customers are
accustomed to the redundant space not being reported, so this change
makes zfs do that for raid-z vdevs as well.

The new behavior applies to:

(a) newly created pools (with version 3 or later)
(b) old (version 1 or 2) pools which, when 'zpool upgrade'-ed, did not
    have any raid-z vdevs (but have since 'zpool add'-ed a raid-z vdev)

Note that the space accounting behavior will never change on old raid-z
pools. If the new behavior is desired, these pools must be backed up,
destroyed, and re-'zpool create'-ed.

The 'zpool list' output is unchanged (ie. it still includes the space
used for parity information). This is bug 6308817 "discrepancy between
zfs and zpool space accounting".

The reported space used may be slightly larger than the parity-free size,
because the amount of space used to store parity with RAID-Z varies
somewhat with blocksize (eg. even small blocks need at least 1 sector of
parity). On most workloads[*], the overwhelming majority of space is
stored in 128k blocks, so this effect is typically not very pronounced.

--matt

[*] One workload where this effect can be noticeable is when the
'recordsize' property has been decreased, eg. for a database or zvol.
However, in this situation the rounding-error space can be completely
eliminated by using an appropriate number of disks in the raid-z group,
according to the following table:

                exact optimal number of disks
  recordsize      raidz1          raidz2
  8k+             3, 5 or 9       6, 10 or 18
  4k              3 or 5          6 or 10
  2k              3               6
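The "even small blocks need at least 1 sector of parity" effect can be
sketched with a simplified allocation model (my own illustration, not ZFS
code; it ignores the raid-z allocation rounding that the table above
accounts for):

```python
import math

# Simplified raid-z space model: each stripe row of (ndisks - nparity)
# data sectors gets `nparity` parity sectors, so even a one-sector
# block needs a full row's worth of parity.
SECTOR = 512

def raidz_sectors(block_bytes: int, ndisks: int, nparity: int = 1):
    """Return (data_sectors, parity_sectors) for one block."""
    data = math.ceil(block_bytes / SECTOR)
    rows = math.ceil(data / (ndisks - nparity))
    parity = rows * nparity
    return data, parity

# A 128k block on 5-disk raidz1: 256 data sectors, 64 parity -> 25%.
d, p = raidz_sectors(128 * 1024, ndisks=5)
print(d, p, p / d)

# A 512-byte block on the same group: 1 data sector still needs 1
# parity sector -> 100% overhead, which is why small-recordsize
# workloads see more parity space per byte of data.
d, p = raidz_sectors(512, ndisks=5)
print(d, p, p / d)
```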
Re: [zfs-discuss] Need Help: didn't create the pool as radiz but stripes
On Thu, Aug 24, 2006 at 10:12:12AM -0600, Arlina Goce-Capiral wrote:
> It does appear that the disk is filled up by 140G.

So this confirms what I was saying: they are only able to write
(ndisks-1) disks' worth of data (in this case, ~68GB * (3-1) == ~136GB).
So there is no unexpected behavior with respect to the size of their
raid-z pool, just the known (and now fixed) bug.

> I think I now know what happened. I created a raidz pool and I did not
> write any data to it before I just pulled out a disk. So I believe the
> zfs filesystem did not initialize yet, and this is why my zfs filesystem
> was unusable. Can you confirm this?

No, that should not be the case. As soon as the 'zfs' or 'zpool' command
completes, everything will be on disk for the requested action.

> But when I created a zfs filesystem and wrote data to it, it could now
> lose a disk and just be degraded. I tested this part by removing the
> disk partition in format.

Well, it sounds like you are testing two different things: first you
tried physically pulling out a disk, then you tried re-partitioning a
disk. It sounds like there was a problem when you pulled out the disk. If
you can describe the problem further (Did the machine panic? What was the
panic message?) then perhaps we can diagnose it.

> I will try this same test to reproduce my issue, but can you confirm for
> me whether my zfs raidz filesystem requires me to write data to it first
> before it's really ready?

No, that should not be the case.

> Any idea when Solaris 10 update 3 (11/06) will be released?

I'm not sure, but November or December sounds about right. And of course,
if they want the fix sooner they can always use Solaris Express or
OpenSolaris!

--matt
Re: [zfs-discuss] unaccounted for daily growth in ZFS disk space usage
On Thu, Aug 24, 2006 at 07:07:45AM -0700, Joe Little wrote:
> We finally flipped the switch on one of our ZFS-based servers, with
> approximately 1TB of 2.8TB used (3 stripes of 950GB or so, each of which
> is a RAID5 volume on the adaptec card). We have snapshots every 4 hours
> for the first few days. If you add up the snapshot references it appears
> somewhat high versus daily use (mostly mail boxes, spam, etc changing),
> but say an aggregate of no more than 400+MB a day. However, zfs list
> shows our daily pool as a whole, and per day we are growing by 0.08TB,
> or more specifically 80GB a day. That's a far cry different from the
> 400MB we can account for. Is it possible that metadata/ditto blocks, or
> the like, is truly growing that rapidly? By our calculations, we will
> triple our disk space (sitting still) in 6 months and use up the
> remaining 1.7TB. Of course, this is only with 2-3 days of churn, but
> it's an alarming rate, where before on the NetApp we didn't see anything
> close to this rate.

How are you calculating this 400MB/day figure? Keep in mind that the
space used by each snapshot is the amount of space unique to that
snapshot. Adding up the space used by all your snapshots is *not* the
amount of space that they are all taking up cumulatively. For leaf
filesystems (those with no descendents), you can calculate the space used
by all snapshots as (fs's used - fs's referenced).

How many filesystems do you have? Can you send me the output of 'zfs
list' and 'zfs get -r all pool'? How much space did you expect to be
using, and what data is that based on? Are you sure you aren't writing
80GB/day to your pool?

--matt
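The leaf-filesystem rule above amounts to a one-line subtraction. With
hypothetical numbers (my own illustration, not figures from this thread):

```python
# For a leaf filesystem (no descendents):
#   "used"       = everything charged to the filesystem, snapshots included
#   "referenced" = what the live filesystem currently points to
# so the space held by ALL snapshots together is used - referenced.
# Summing each snapshot's own "used" column would understate this,
# because a snapshot's "used" counts only blocks unique to it.
fs_used = 120 * 1024        # MB, hypothetical
fs_referenced = 100 * 1024  # MB, hypothetical

snapshot_space = fs_used - fs_referenced
print(f"space held by all snapshots: {snapshot_space} MB")
```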
Re: [zfs-discuss] ZFS and very large directories
On Thu, Aug 24, 2006 at 01:15:51PM -0500, Nicolas Williams wrote:
> I just tried creating 150,000 directories in a ZFS root directory. It
> was speedy. Listing individual directories (lookup) is fast.

Glad to hear that it's working well for you!

> Listing the large directory isn't, but that turns out to be either
> terminal I/O or collation in a UTF-8 locale (which is what I use; a
> simple DTrace script showed that to be my problem):
>
> % ptime ls
> ...
> real    9.850
> user    6.263   <- ouch, UTF-8 hurts
> sys     0.245

Yep, beware of using 'ls' on large directories! See also:

  6299769 'ls' memory usage is excessive
  6299767 'ls -f' should not buffer output

--matt
Re: Re: [zfs-discuss] unaccounted for daily growth in ZFS disk space usage
On Thu, Aug 24, 2006 at 02:21:33PM -0700, Joe Little wrote:
> Well, by deleting my 4-hourlies I reclaimed most of the space. To answer
> some of the questions: it's about 15 filesystems (descendents included).
> I'm aware of the space used by snapshots overlapping. I was looking at
> the total space (zpool iostat reports) and seeing the diff per day. The
> 400MB/day was by inspection and by looking at our nominal growth on a
> netapp. It would appear that if one takes many snapshots, there is an
> initial quick growth in disk usage, but once those snapshots meet their
> retention level (say 12), the growth would appear to match our typical
> 400MB/day. Time will prove this one way or the other. By simply getting
> rid of hourly snapshots and collapsing to dailies for two days' worth,
> I reverted to only ~1-2GB total growth, which is much more in line with
> expectations.

OK, so it sounds like there is no problem here, right? You were taking
snapshots every 4 hours, which took up no more space than was needed, but
more than you would like (and more than daily snapshots). Using daily
snapshots, the space usage is in line with daily snapshots on NetApp.

> For various reasons, I can't post the zfs list type results as yet. I'll
> need to get the ok for that first. Sorry.

It sounds like there is no problem here, so no need to post the output.

--matt
Re: [zfs-discuss] ZFS + rsync, backup on steroids.
James Dickens wrote:
>> Why not make snapshots on production and then send incremental backups
>> over the net? Especially with a lot of files it should be MUCH faster
>> than rsync.
>
> because it's a ZFS-limited solution; if the source is not ZFS it won't
> work, and I'm not sure how much faster incrementals would be than rsync,
> since rsync only shares checksums until it finds a block that has
> changed.

'zfs send' is *incredibly* faster than rsync. rsync needs to traverse all
the metadata, so it is fundamentally O(all metadata). It needs to read
every directory and stat every file to figure out what's been changed.
Then it needs to read all of every changed file to figure out what parts
of it have been changed.

In contrast, 'zfs send' essentially only needs to read the changed data,
so it is O(changed data). We can do this by leveraging our knowledge of
the zfs internal structure, eg. block birth times.

That said, there is still a bunch of low-hanging performance fruit in
'zfs send', which I'll be working on over the next few months. And of
course, if you need a cross-filesystem tool then 'zfs send' is not for
you. But give it a try if you can, and let us know how it works for you!

--matt
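The birth-time idea can be sketched in a few lines (an illustration of
the principle only; the names and structure here are made up and do not
reflect ZFS's actual on-disk format):

```python
# Every block pointer records the transaction group ("birth time") in
# which it was written.  An incremental send from snapshot at txg T can
# therefore skip any subtree whose root was born at or before T, making
# the traversal proportional to the changed data, not all metadata.

class Block:
    def __init__(self, birth_txg, children=(), data=None):
        self.birth_txg = birth_txg
        self.children = list(children)
        self.data = data

def changed_blocks(block, from_txg):
    """Yield only data blocks written after from_txg, pruning old subtrees."""
    if block.birth_txg <= from_txg:
        return                      # unchanged: skip the whole subtree
    if block.data is not None:
        yield block
    for child in block.children:
        yield from changed_blocks(child, from_txg)

old = Block(5, data=b"old")         # untouched since txg 5
new = Block(12, data=b"new")        # rewritten in txg 12
root = Block(12, children=[old, new])
print([b.data for b in changed_blocks(root, from_txg=10)])  # [b'new']
```

rsync, with no such bookkeeping available, has to stat and checksum its
way to the same answer.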
Re: [zfs-discuss] ZFS + rsync, backup on steroids.
Dick Davies wrote:
> On 30/08/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
>> 'zfs send' is *incredibly* faster than rsync.
>
> That's interesting. We had considered it as a replacement for a certain
> task (publishing a master docroot to multiple webservers), but a quick
> test with ~500MB of data showed the zfs send/recv to be about 5x slower
> than rsync for the initial copy. You're saying subsequent copies (zfs
> send -i?) should be faster?

Yes. The architectural benefits of 'zfs send' over rsync only apply to
sending incremental changes. When sending a full backup, both schemes
have to traverse all the metadata and send all the data, so they *should*
be about the same speed.

However, as I mentioned, there are still some low-hanging performance
issues with 'zfs send', although I'm surprised that it was 5x slower than
rsync! I'd like to look into that issue some more... What type of files
were you sending? Eg. approximately what size files, how many files, how
many files per directory?

--matt
Re: [zfs-discuss] ZFS with expanding LUNs
Theo Bongers wrote:
> Please can anyone tell me how to handle a LUN that is expanded (on a
> RAID array or SAN storage) and grow the filesystem without data loss?
> How does ZFS look at the volume? In other words, how can I grow the
> filesystem after LUN expansion? Do I need to
> format/type/autoconfigure/label the specific device?

I believe that if you have given ZFS the whole disk, then it will
automatically detect that the LUN has grown when it opens the device. You
can cause this to happen by rebooting the machine, or running
'zpool export poolname; zpool import poolname'.

--matt
Re: [zfs-discuss] Re: ZFS + rsync, backup on steroids.
Roch wrote:
> Matthew Ahrens writes:
>> Robert Milkowski wrote:
>>> IIRC unmounting a ZFS file system won't flush its caches - you've got
>>> to export the entire pool.
>>
>> That's correct. And I did ensure that the data was not cached before
>> each of my tests.
>
> Matt? It seems to me that (at least in the past) unmount would actually
> cause the data to not be accessible (a read would issue an I/O), even if
> potentially the memory associated with previously cached data was not
> quite reaped back to the OS.

Looks like you're right, we do (mostly) evict the data when a filesystem
is unmounted. The exception is if some of its cached data is being shared
with another filesystem (eg. via a clone fs); then that data will not be
evicted.

--matt
Re: [zfs-discuss] migrating data across boxes
John Beck wrote:
> % zfs snapshot -r [EMAIL PROTECTED]
> % zfs send space/[EMAIL PROTECTED] | ssh newbox zfs recv -d space
> % zfs send space/[EMAIL PROTECTED] | ssh newbox zfs recv -d space
> ...
> % zfs set mountpoint=/export/home space
> % zfs set mountpoint=/usr/local space/local
> % zfs set sharenfs=on space/jbeck space/local

I'm working on some enhancements to zfs send/recv that will simplify this
even further, especially in cases where you have many filesystems,
snapshots, or changed properties. In particular, you'll be able to simply
do:

  # zfs snapshot -r [EMAIL PROTECTED]
  # zfs send -r -b [EMAIL PROTECTED] | ssh newbox zfs recv -p -d newpool

The send -b flag means to send from the beginning. This will send a full
stream of the oldest snapshot, and incrementals up to the named snapshot
(eg. from @a to @b, from @b to @c, ... @j to @today). This way your new
pool will have all of the snapshots from your old pool.

The send -r flag means to do this for all the filesystem's descendants as
well (in this case, space/jbeck and space/local).

The recv -p flag means to preserve locally-set properties (in this case,
the mountpoint and sharenfs settings).

For more information, see RFEs 6421959 and 6421958, and watch for a
forthcoming formal interface proposal.

--matt
Re: [zfs-discuss] zfs clones
Marlanne DeLaSource wrote:
> As I understand it, the snapshot of a set is used as a reference by the
> clone. So the clone is initially a set of pointers to the snapshot.
> That's why it is so fast to create. How can I separate it from the
> snapshot? (so that df -k or zfs list will display, for a 48G drive,
>
>   pool/fs1     4G  40G
>   pool/clone   4G  40G
>
> instead of
>
>   pool/fs1     4G  44G
>   pool/clone   4G  44G
>
> ) I hope I am clear enough :/

There is no way to separate a clone from its origin snapshot.

I think the numbers you're posting are:

  FS          REFD  AVAIL
  pool/fs1    4G    40G
  pool/clone  4G    40G

So you want it to say that less space is available than really is?
Perhaps what you want is to set a reservation on the clone for its
initial size, so that you will be guaranteed to have enough space to
overwrite its initial contents with new contents of the same size?

--matt
Re: [zfs-discuss] Need Help: Getting error zfs:bad checksum (read on unknown off...)
Arlina Goce-Capiral wrote:
> Customer's main concern right now is to make the system bootable, but it
> seems they couldn't do that since the bad disk is part of the zfs
> filesystems. Is there a way to disable or clear out the bad zfs
> filesystem so the system can be booted?

Yes, see this FAQ: http://opensolaris.org/os/community/zfs/faq/#zfspanic

quote:

  What can I do if ZFS panics on every boot?

  ZFS is designed to survive arbitrary hardware failures through the use
  of redundancy (mirroring or RAID-Z). Unfortunately, certain failures in
  non-replicated configurations can cause ZFS to panic when trying to
  load the pool. This is a bug, and will be fixed in the near future
  (along with several other nifty features like background scrubbing and
  the ability to see a list of corrupted files). In the meantime, if you
  find yourself in the situation where you cannot boot due to a corrupt
  pool, do the following:

  1. boot using '-m milestone=none'
  2. # mount -o remount /
  3. # rm /etc/zfs/zpool.cache
  4. # reboot

  This will remove all knowledge of pools from your system. You will have
  to re-create your pool and restore from backup.

--matt
[zfs-discuss] Proposal: multiple copies of user data
Here is a proposal for a new 'copies' property which would allow
different levels of replication for different filesystems. Your comments
are appreciated!

--matt

A. INTRODUCTION

ZFS stores multiple copies of all metadata. This is accomplished by
storing up to three DVAs (Disk Virtual Addresses) in each block pointer.
This feature is known as "ditto blocks". When possible, the copies are
stored on different disks. See bug 6410698 "ZFS metadata needs to be more
highly replicated (ditto blocks)" for details on ditto blocks.

This case will extend this feature to allow system administrators to
store multiple copies of user data as well, on a per-filesystem basis.
These copies are in addition to any redundancy provided at the pool level
(mirroring, raid-z, etc).

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored. Its value must be 1, 2, or 3.
Like other properties (eg. checksum, compression), it only affects
newly-written data. As such, it is recommended that the 'copies' property
be set at filesystem-creation time (eg. 'zfs create -o copies=2
pool/fs'). The pool must be at least on-disk version 2 to use this
feature (see 'zpool upgrade').

By default (copies=1), only two copies of most filesystem metadata are
stored. However, if we are storing multiple copies of user data, then 3
copies (the maximum) of filesystem metadata will be stored.

This feature is similar to using mirroring, but differs in several
important ways:

* Different filesystems in the same pool can have different numbers of
  copies.
* The storage configuration is not constrained as it is with mirroring
  (eg. you can have multiple copies even on a single disk).
* Mirroring offers slightly better performance, because only one DVA
  needs to be allocated.
* Mirroring offers slightly better redundancy, because one disk from
  each mirror can fail without data loss.
It is important to note that the copies provided by this feature are in
addition to any redundancy provided by the pool configuration or the
underlying storage. For example:

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default)
  will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any
  1 disk failing without data loss.

* In a pool with 2-way mirrors, a filesystem with copies=3 will be
  stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks
  failing without data loss (assuming that there are at least ncopies=3
  mirror groups).

* In a pool with single-parity raid-z, a filesystem with copies=2 will
  be stored with 2 copies, each copy protected by its own parity block.
  The filesystem can tolerate any 3 disks failing without data loss
  (assuming that there are at least ncopies=2 raid-z groups).

C. MANPAGE CHANGES

*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
      they are inherited.

+     copies=1 | 2 | 3
+         Controls the number of copies of data stored for this dataset.
+         These copies are in addition to any redundancy provided by the
+         pool (eg. mirroring or raid-z). The copies will be stored on
+         different disks if possible.
+
+         Changing this property only affects newly-written data.
+         Therefore, it is recommended that this property be set at
+         filesystem creation time, using the '-o copies=' option.
+
      Temporary Mountpoint Properties

      When a file system is mounted, either through mount(1M) for
      legacy mounts or the zfs mount command for normal file

D. REFERENCES
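The multiplication behind the examples above is simple best-case
arithmetic; a tiny sketch (my own illustration, which assumes the copies
really do land on distinct disks):

```python
# Best-case model: total physical copies of a block
#   = pool-level redundancy (mirror ways) x the 'copies' property,
# and any (total - 1) of the disks holding them can fail.
def total_copies(mirror_ways: int, copies: int) -> int:
    return mirror_ways * copies

def tolerable_failures(mirror_ways: int, copies: int) -> int:
    return total_copies(mirror_ways, copies) - 1

print(total_copies(2, 1), tolerable_failures(2, 1))  # 2-way mirror, copies=1
print(total_copies(2, 3), tolerable_failures(2, 3))  # 2-way mirror, copies=3
```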
Re: [zfs-discuss] Proposal: multiple copies of user data
James Dickens wrote:
> On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
>> B. DESCRIPTION
>>
>> A new property will be added, 'copies', which specifies how many
>> copies of the given filesystem will be stored. Its value must be 1, 2,
>> or 3. Like other properties (eg. checksum, compression), it only
>> affects newly-written data. As such, it is recommended that the
>> 'copies' property be set at filesystem-creation time (eg. 'zfs create
>> -o copies=2 pool/fs').
>
> would the user be held accountable for the space used by the extra
> copies?

Doh! Sorry I forgot to address that. I'll amend the proposal and manpage
to include this information... Yes, the space used by the extra copies
will be accounted for, eg. in stat(2), ls -s, df(1m), du(1), and zfs
list, and will count against the user's quota.

> so if a user has a 1GB quota and stores one 512MB file with two copies
> activated, all his space will be used?

Yes, and as mentioned this will be reflected in all the space accounting
tools.

> what happens if the same user stores a file that is 756MB on the
> filesystem with multiple copies enabled and a 1GB quota? does the save
> fail?

Yes, they will get ENOSPC and see that their filesystem is full.

> How would the user tell that his filesystem is full, since all the
> tools he is used to report that he is using only 1/2 the space?

Any tool will report that in fact all space is being used.

> Is there a way for the sysadmin to get rid of the excess copies should
> disk space needs require it?

No, not without rewriting them. (This is the same behavior we have today
with the 'compression' and 'checksum' properties. It's a long-term goal
of ours to be able to go back and change these things after the fact
("scrub them in", so to say), but with snapshots, this is extremely
nontrivial to do efficiently and without increasing the amount of space
used.)

> If I start out with 2 copies and later change it to only 1 copy, do the
> files created before keep their 2 copies?

Yep, the property only affects newly-written data.
> what happens if root needs to store a copy of an important file and
> there is no space, but there is space if extra copies are reclaimed?

They will get ENOSPC.

> Will this be configurable behavior?

No.

--matt
Re: [zfs-discuss] Proposal: multiple copies of user data
Mike Gerdts wrote:
> Is there anything in the works to compress (or encrypt) existing data
> after the fact? For example, a special option to scrub that causes the
> data to be re-written with the new properties could potentially do
> this.

This is a long-term goal of ours, but with snapshots, it is extremely
nontrivial to do efficiently and without increasing the amount of space
used.

> If so, this feature should subscribe to any generic framework provided
> by such an effort.

Yep, absolutely.

>> * Mirroring offers slightly better redundancy, because one disk from
>>   each mirror can fail without data loss.
>
> Is this use of "slightly" based upon disk failure modes? That is, when
> disks fail, do they tend to get isolated areas of badness compared to
> complete loss? I would suggest that "complete loss" should include
> someone tripping over the power cord to the external array that houses
> the disk.

I'm basing this "slightly better" call on a model of random, whole-disk
failures; I know that this is only an approximation. With many mirrors,
most (but not all) 2-disk failures can be tolerated. With copies=2,
almost no 2-top-level-vdev failures will be tolerated, because it's
likely that *some* block will have both its copies on those 2 disks.
With mirrors, you can also arrange to mirror across cabinets, not within
them, which you can't do with copies.

>> It is important to note that the copies provided by this feature are
>> in addition to any redundancy provided by the pool configuration or
>> the underlying storage. For example:
>
> All of these examples seem to assume that there are six disks.

Not really. There could be any number of mirrors or raid-z groups
(although, I note, you need at least 'copies' groups to survive the
maximum number of whole-disk failures).

>> * In a pool with 2-way mirrors, a filesystem with copies=1 (the
>>   default) will be stored with 2 * 1 = 2 copies. The filesystem can
>>   tolerate any 1 disk failing without data loss.
>> * In a pool with 2-way mirrors, a filesystem with copies=3 will be
>>   stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5
>>   disks failing without data loss (assuming that there are at least
>>   ncopies=3 mirror groups).
>
> This one assumes the best-case scenario with 6 disks. Suppose you had
> 4 x 72 GB and 2 x 36 GB disks. You could end up with multiple copies on
> the 72 GB disks.

Yes, all these examples assume that our policy of putting the copies on
different disks when possible actually worked out. It will almost
certainly work out unless you have a small number of different-sized
devices, or are running with very little free space. If you need hard
guarantees, you need to use actual mirroring.

> Any statement about physical location on the disk? It would seem as
> though locating two copies sequentially on the disk would not provide
> nearly the amount of protection as having them fairly distant from
> each other.

Yep, if the copies can't be stored on different disks, they will be
stored spread out on the same disk if possible (I think we aim for one
on each quarter of the disk).

--matt
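A rough model of why *some* block is likely to be hit (my own
illustration, not part of the proposal): if each block's two copies land
on a uniformly random pair of n disks, the chance that a particular
2-disk failure destroys both copies of at least one of many blocks tends
to 1.

```python
from math import comb

# Probability that, after a specific pair of disks fails, at least one
# of `nblocks` blocks had both of its copies on exactly that pair
# (copies placed on an independent uniform-random pair per block).
def p_some_block_lost(ndisks: int, nblocks: int) -> float:
    p_block_on_failed_pair = 1 / comb(ndisks, 2)
    return 1 - (1 - p_block_on_failed_pair) ** nblocks

print(p_some_block_lost(8, 10))         # few blocks: often survivable
print(p_some_block_lost(8, 1_000_000))  # many blocks: data loss ~certain
```

This is why copies=2 survives "almost no" 2-top-level-vdev failures,
while a mirrored pool survives any 2-disk failure that doesn't take out
both halves of the same mirror.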
Re: [zfs-discuss] Proposal: multiple copies of user data
James Dickens wrote: though I think this is a cool feature, I think it needs more work. I think there should be an option to make extra copies expendable. So the extra copies are a request: if the space is available, make them; if not, complete the write, and log the event. Are you asking for the extra copies that have already been written to be dynamically freed up when we are running low on space? That could be useful, but it isn't the problem I'm trying to solve with the 'copies' property (not to mention it would be extremely difficult to implement). If the user really requires guaranteed extra copies, then use mirrored or raided disks. Right, if you want everything to have extra redundancy, that use case is handled just fine today by mirrors or RAIDZ. The case where 'copies' is useful is when you want some data to be stored with more redundancy than others, without the burden of setting up different pools. It seems just to be a nightmare for the administrator: you start with 3 copies and then change to 2 copies, and you will have phantom copies that are only known to exist by the OS; it won't show in any reports, and zfs list doesn't have an option to show which files have multiple clones and which don't. There is no way to destroy multiple clones without rewriting every file on the disk. (I'm assuming you mean copies, not clones.) So would you prefer that the property be restricted to only being set at filesystem creation time, and not changed later? That way the number of copies of all files in the filesystem is always the same. It seems like the issue of knowing how many copies there are would be much worse in the system you're asking for, where the extra copies are freed up as needed... --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and free space
Robert Milkowski wrote: Hello Mark, Monday, September 11, 2006, 4:25:40 PM, you wrote: MM Jeremy Teo wrote: Hello, how are writes distributed as the free space within a pool reaches a very small percentage? I understand that when free space is available, ZFS will batch writes and then issue them in sequential order, maximising write bandwidth. When free space reaches a minimum, what happens? Thanks! :) MM Just what you would expect to happen: MM As contiguous write space becomes unavailable, writes will become MM scattered and performance will degrade. More importantly: at this MM point ZFS will begin to heavily write-throttle applications in order MM to ensure that there is sufficient space on disk for the writes to MM complete. This means that there will be fewer writes to batch up MM in each transaction group for contiguous IO anyway. MM As with any file system, performance will tend to degrade at the MM limits. ZFS keeps a small overhead reserve (much like other file MM systems) to help mitigate this, but you will definitely see an MM impact. I hope it won't be a problem if space is getting low in a file system with a quota set, while in the pool the file system is in there's plenty of space, right? If you are running close to your quota, there will be a little bit of performance degradation, but not to the same degree as when running low on free space in the pool. The reason performance degrades when you're near your quota is that we aren't exactly sure how much space will be used until we actually get around to writing it out (due to compression, snapshots, etc). So we have to write things out in smaller batches (ie. flush out transaction groups more frequently than is optimal). --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
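The "smaller batches" behavior near a quota can be sketched with a toy model. This is purely illustrative (the function and the worst_case_ratio knob are invented, not ZFS internals): because the final on-disk size isn't known until write time, the open transaction group is flushed whenever its worst-case space estimate would overrun the remaining quota headroom, so tight headroom means more, smaller groups.

```python
def batch_writes(write_sizes, quota_headroom, worst_case_ratio=1.0):
    """Hypothetical write-throttle sketch: group writes into
    transaction groups, flushing whenever the worst-case space
    estimate for the open group would exceed the headroom left
    under the quota.  Smaller headroom => more, smaller groups."""
    groups, current, estimate = [], [], 0
    for size in write_sizes:
        worst = size * worst_case_ratio
        if current and estimate + worst > quota_headroom:
            groups.append(current)
            current, estimate = [], 0
        current.append(size)
        estimate += worst
    if current:
        groups.append(current)
    return groups

# Plenty of headroom: one big group.  Near the quota: many small flushes.
print(len(batch_writes([10] * 8, quota_headroom=1000)))  # 1
print(len(batch_writes([10] * 8, quota_headroom=25)))    # 4
```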
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Dick Davies wrote: For the sake of argument, let's assume: 1. disk is expensive 2. someone is keeping valuable files on a non-redundant zpool 3. they can't scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*) Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1. Also note that using files to back vdevs is not a recommended solution. If the user wants to make sure the file is 'safer' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever. It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution. For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I'd hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently. The redundancy you're talking about is what you'd get from 'cp /foo/bar.jpg /foo/bar.jpg.ok', except it's hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future. Whether it's hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn't cause any trouble when extending or porting ZFS. I'm afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven't seen a convincing use case. Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
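Matt's point about manually merging two partially-damaged copies is the crux. Here is a minimal sketch of the per-block repair that per-block copies give you transparently (hypothetical code, not the ZFS implementation): with a checksum recorded per block, a reader can take each block from whichever copy is still intact, which is exactly the reconstruction the 'cp bar.jpg bar.jpg.ok' approach leaves to the user.

```python
import hashlib

def good(block, checksum):
    """A block is usable if it matches its recorded checksum."""
    return block is not None and hashlib.sha256(block).hexdigest() == checksum

def reconstruct(copy_a, copy_b, checksums):
    """Hypothetical per-block repair: for each block, take whichever
    copy still matches the checksum; fail only if both are damaged."""
    out = []
    for a, b, ck in zip(copy_a, copy_b, checksums):
        if good(a, ck):
            out.append(a)
        elif good(b, ck):
            out.append(b)
        else:
            raise IOError("block lost in both copies")
    return b"".join(out)

blocks = [b"alpha", b"bravo", b"charlie"]
cks = [hashlib.sha256(b).hexdigest() for b in blocks]
# Copy A lost block 0, copy B lost block 2 -- a manual `cp` backup
# would leave you splicing these two files together by hand.
copy_a = [None, b"bravo", b"charlie"]
copy_b = [b"alpha", b"bravo", None]
print(reconstruct(copy_a, copy_b, cks))  # b'alphabravocharlie'
```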
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: Matthew Ahrens wrote: The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data than another? If that's the case, I'm still confused as to what failure cases would still allow you to retrieve your data if there are more than one copy in the fs or pool. But I'll gladly take some enlightenment. :) (My apologies for the length of this response, I'll try to address most of the issues brought up recently...) When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure). One case where this feature would be useful is if you have a pool with no redundancy (ie. no mirroring or raid-z), because most of the data in the pool is not very important. However, the pool may have a bunch of disks in it (say, four). The administrator/user may realize (perhaps later on) that some of their data really *is* important and they would like some protection against losing it if a disk fails. They may not have the option of adding more disks to mirror all of their data (cost or physical space constraints may apply here). 
Their problem is solved by creating a new filesystem with copies=2 and putting the important data there. Now, if a disk fails, then the data in the copies=2 filesystem will not be lost. Approximately 1/4 of the data in other filesystems will be lost. (There is a small chance that some tiny fraction of the data in the copies=2 filesystem will still be lost if we were forced to put both copies on the disk that failed.) Another plausible use case would be where you have some level of redundancy, say you have a Thumper (X4500) with its 48 disks configured into 9 5-wide single-parity raid-z groups (with 3 spares). If a single disk fails, there will be no data loss. However, if two disks within the same raid-z group fail, data will be lost. In this scenario, imagine that this data loss probability is acceptable for most of the data stored here, but there is some extremely important data for which this is unacceptable. Rather than reconfiguring the entire pool for higher redundancy (say, double-parity raid-z) and less usable storage, you can simply create a filesystem with copies=2 within the raid-z storage pool. Data within that filesystem will not be lost even if any three disks fail. I believe that these use cases, while not being extremely common, do occur. The extremely low amount of engineering effort required to implement the feature (modulo the space accounting issues) seems justified. The fact that this feature does not solve all problems (eg, it is not intended to be a replacement for mirroring) is not a downside; not all features need to be used in all situations :-) The real problem with this proposal is the confusion surrounding disk space accounting with copies>1. While the same issues are present when using compression, people are understandably less upset when files take up less space than expected. Given the current lack of interest in this feature, the effort required to address the space accounting issue does not seem justified at this time. 
--matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
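The Thumper example above can be checked with a small model. This sketch assumes the best case where the two copies of every block land in two different raid-z groups (which, as noted earlier in the thread, is best-effort rather than guaranteed): no 3-disk failure loses data, but a 2+2 failure across two groups can.

```python
from itertools import combinations

def group_of(disk, width=5):
    """Which raid-z group a disk belongs to (disks numbered 0..44)."""
    return disk // width

def data_lost(failed, width=5, ngroups=9, copies=2):
    """Toy model of the Thumper example: 9 single-parity raid-z
    groups, 5 disks wide.  A group loses data once 2 or more of its
    disks fail; with copies=2 a block is lost only if the (assumed
    distinct) groups holding both its copies each lose data."""
    broken = {g for g in range(ngroups)
              if sum(group_of(d, width) == g for d in failed) >= 2}
    return len(broken) >= copies

disks = range(45)  # 9 groups x 5 disks (the 3 spares are ignored here)
# No combination of 3 failed disks can break two groups at once:
print(any(data_lost(set(c)) for c in combinations(disks, 3)))  # False
# But 2 + 2 failures across two groups can:
print(data_lost({0, 1, 5, 6}))                                 # True
```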
Re: [zfs-discuss] Snapshots and backing store
Nicolas Dorfsman wrote: Hi, There's something really bizarre in the ZFS snapshot specs: "Uses no separate backing store." Hum... if I want to share one physical volume somewhere in my SAN as THE snapshot backing store... it becomes impossible to do! Really bad. Is there any chance of having a backing-store-file option in a future release? Along the same lines, it would be great to have some sort of property to add a disk/LUN/physical space to a pool, reserved only for the backing store. Right now, the only thing I can see to prevent users from using my backing-store space for their own purposes is to set a quota. If you want to copy your filesystems (or snapshots) to other disks, you can use 'zfs send' to send them to a different pool (which may even be on a different machine!). --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Snapshots and backing store
Nicolas Dorfsman wrote: We need to think of ZFS as ZFS, and not as just another filesystem! I mean, the whole concept is different. Agreed. So. What could be the best architecture? What is the problem? With UFS, I used to have separate metadevices/LUNs for each application. With ZFS, I thought it would be nice to use a separate pool for each application. Ick. It would be much better to have one pool, and a separate filesystem for each application. But, it means multiply snapshot backing-store OR dynamically remove/add this space/LUN to pool where we need to do backups. I don't understand this statement. What problem are you trying to solve? If you want to do backups, simply take a snapshot, then point your backup program at it. If you want faster incremental backups, use 'zfs send -i' to generate the file to backup. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Access to ZFS checksums would be nice and very useful feature
Bady, Brant RBCM:EX wrote: Actually to clarify - what I want to do is to be able to read the associated checksums ZFS creates for a file and then store them in an external system e.g. an oracle database most likely Rather than storing the checksum externally, you could simply let ZFS verify the integrity of the data. Whenever you want to check it, just run 'zpool scrub'. Of course, if you don't trust ZFS to do that for you, you probably wouldn't trust it to tell you the checksum either! --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
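For the external-audit use case: since ZFS's block checksums are internal and not exposed through any public interface, an application that wants checksums in an external database has to hash the file contents itself. A minimal sketch (the function names are mine, and the catalog is a stand-in for the Oracle table the poster mentions):

```python
import hashlib

def file_digest(path, algo="sha256", bufsize=1 << 20):
    """Application-level integrity record: hash the file's contents.
    Unlike ZFS's per-block checksums, this survives moving the file
    to another filesystem entirely."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def verify(catalog):
    """Re-hash each file and compare against the recorded digest.
    `catalog` maps path -> digest recorded at ingest time."""
    return {p: file_digest(p) == d for p, d in catalog.items()}
```

Note that this duplicates work 'zpool scrub' already does end-to-end inside the pool; the only reason to do it is if the audit trail must live outside ZFS.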
Re: [zfs-discuss] Re: Re: zfs clones
Jan Hendrik Mangold wrote: I didn't ask the original question, but I have a scenario where I want to use clone as well and encounter a (designed?) behaviour I am trying to understand. I create a filesystem A with ZFS and modify it to a point where I create a snapshot [EMAIL PROTECTED] Then I clone that snapshot to create a new filesystem B. I seem to have two filesystem entities I can make independent modifications and snapshots with/on/from. The problem I am running into is that when modifying A and wanting to roll back to the snapshot [EMAIL PROTECTED] I can't do that as long as the clone B is mounted. Is this a case where I would benefit from the ability to separate the clone? Or is this something not possible with ZFS? Hmm, actually this is unexpected; you shouldn't have to unmount the clone to do the rollback on the origin filesystem. I think that our command-line tool is simply being a bit overzealous. I've filed bug 6472202 to track this issue; it should be pretty straightforward to fix. Thanks for bringing this to our attention! --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: zfs clones
Mike Gerdts wrote: A couple scenarios from environments that I work in, using legacy file systems and volume managers: 1) Various test copies need to be on different spindles to remove any perceived or real performance impact imposed by one or the other. Arguably by having the IO activity spread across all the spindles there would be fewer bottlenecks. However, if you are trying to simulate the behavior of X production spindles, doing so with 1.3 X or 2 X spindles is not a proper comparison. Hence being wasteful and getting suboptimal performance may be desirable. If you don't understand that logic, you haven't worked in a big enough company or studied Dilbert enough. :) Here it makes sense to be using X spindles. However, using a clone filesystem will perform the same as a non-clone filesystem. So if you have enough space on those X spindles for the clone, I don't think there's any need for additional separation. Of course, this may not eliminate imagined performance difference (eg, your Dilbert reference :-), in which case you can simply use 'zfs send | zfs recv' to send the snapshot to a suitably-isolated pool/machine. 2) One of the copies of the data needs to be portable to another system while the original stays put. This could be done to refresh non-production instances from production, to perform backups in such a way that it doesn't put load on the production spindles, networks, etc. This is a case where you should be using multiple pools (possibly on the same host), and using 'zfs send | zfs recv' between them. In some cases, you may be able to attach the storage to the destination machine and use the network to move the data, eg. 'zfs send | ssh dest zfs recv'. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Fastest way to send 100gb ( with ZFS send )
Anantha N. Srirama wrote: You most certainly are hitting the SSH limitation. Note that SSH/SCP sessions are single-threaded and won't utilize all of the system resources even if they are available. You may want to try 'ssh -c blowfish' to use the (faster) blowfish encryption algorithm rather than the default of triple-DES. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to make an extended LUN size known to ZFS and Solaris
Michael Phua - PTS wrote: Hi, Our customer has a Sun Fire X4100 with Solaris 10 using ZFS and a HW RAID array (STK D280). He has extended a LUN on the storage array and wants to make this new size known to ZFS and Solaris. Does anyone know if this can be done and how it can be done? Unfortunately, there's no good way to do this at the moment. When you give ZFS the whole disk, we put an EFI label on the disk and make one big slice for our use. However, when the LUN grows, that slice stays the same size. ZFS needs to write a new EFI label describing the new size before it can use the new space. I've filed bug 6475340 to track this issue. As a workaround, it *should* be possible to manually relabel the device with format(1M), but unfortunately bug 4967547 (a problem with format) prevents this from working correctly. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What's going to make it into 11/06?
Darren Dunham wrote: What about ZFS root? And compatibility with Live Upgrade? Any timetable estimation? ZFS root has been previously announced as targeted for update 4. ZFS root support will most likely not be available in Solaris 10 until update 5. (And of course this is subject to change...) --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: directory tree removal issue with zfs on Blade 1500/PC rack server IDE disk
Stefan Urbat wrote: By the way, I have to wait a few hours to umount and check mountpoint permissions, because an automated build is currently running on that zfs --- the performance of [EMAIL PROTECTED] is indeed rather poor (much worse than ufs), but this is another, already documented issue with a bug entry filed. Really? Are you allowing ZFS to use the entire disk (and thus turn on the disk's write cache)? Can you describe your workload and give numbers on both ZFS and UFS? What bug was filed? --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Unbootable system recovery
Ewen Chan wrote: However, in order for me to lift the unit, I needed to pull the drives out so that it would actually be moveable, and in doing so, I think that the drive-cable-port allocation/assignment has changed. If that is the case, then ZFS would automatically figure out the new mapping. (Of course, there could be an undiscovered bug in that code.) --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A versioning FS
[EMAIL PROTECTED] wrote: On Fri, Oct 06, 2006 at 01:14:23AM -0600, Chad Leigh -- Shire.Net LLC wrote: But I would dearly like to have a versioning capability. Me too. Example (real-life scenario): there is a samba server for about 200 concurrently connected users. They keep mainly doc/xls files on the server. From time to time they (somehow) corrupt their files (they share the files, so it is possible), and the files are recovered from backup. With versioning they could be told that if their main file is corrupted they can open the previous version and keep working. ZFS snapshots are not a solution in this case because we would have to create snapshots for 400 filesystems (yes, each user has his own filesystem, and I said that there are 200 concurrent connections but there are many more accounts on the server) each hour or so. I completely disagree. In this scenario (and almost all others), use of regular snapshots will solve the problem. 'zfs snapshot -r' is extremely fast, and I'm working on some new features that will make using snapshots for this even easier and better-performing. If you disagree, please tell us *why* you think snapshots don't solve the problem. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
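To make the "regular snapshots" alternative concrete: because 'zfs snapshot -r' covers every descendant filesystem in one atomic operation, an hourly job only has to pick one snapshot name and prune old ones, regardless of whether there are 400 filesystems or 4. A hypothetical rotation helper (the naming scheme and keep policy are invented for illustration):

```python
from datetime import datetime, timedelta

def rotate(existing, now, keep=24, prefix="hourly-"):
    """Hypothetical schedule around 'zfs snapshot -r': one recursive
    snapshot per hour, pruning all but the newest `keep`.  Returns
    (snapshot name to create, list of snapshot names to destroy)."""
    name = prefix + now.strftime("%Y%m%d%H")
    to_create = None if name in existing else name
    # fixed-width timestamps sort chronologically as strings
    snaps = sorted(s for s in existing | {name} if s.startswith(prefix))
    to_destroy = snaps[:-keep] if len(snaps) > keep else []
    return to_create, to_destroy

now = datetime(2006, 10, 6, 12)
have = {"hourly-" + (now - timedelta(hours=h)).strftime("%Y%m%d%H")
        for h in range(1, 30)}
create, destroy = rotate(have, now)
print(create)        # hourly-2006100612
print(len(destroy))  # 6 oldest snapshots pruned to keep 24
```

The driver would then run 'zfs snapshot -r tank@' + create, and 'zfs destroy -r' for each pruned name.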
Re: [zfs-discuss] A versioning FS
Jeremy Teo wrote: A couple of use cases I was considering offhand: 1. Oops, I truncated my file. 2. Oops, I saved over my file. 3. Oops, an app corrupted my file. 4. Oops, I rm -rf'ed the wrong directory. All of which can be solved by periodic snapshots, but versioning gives us immediacy. So is immediacy worth it to you folks? I'd rather not embark on writing and finishing code for something no one wants besides me. In my opinion, the marginal benefit of per-write(2) versions over snapshots (which can be per-transaction, ie. every ~5 seconds) does not outweigh the complexity of implementation and use/administration. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: [EMAIL PROTECTED]:~]# zfs send -i export/zone/www/[EMAIL PROTECTED] export/zone/www/[EMAIL PROTECTED] | ssh cookies zfs recv export/zone/www/html cannot receive: destination has been modified since most recent snapshot -- use 'zfs rollback' to discard changes I was going to try deleting all snaps and start over with a new snap but I thought someone might be interested in figuring out what's going on here. That should not be necessary! I assume that you already followed the suggestion of doing 'zfs rollback', and you got the same message after trying the incremental recv again. If not, try that first. There are a couple of things that could cause this. One is that some process is inadvertently modifying the destination (eg. by reading something, causing the atime to be updated). You can get around this by making the destination fs readonly=on. Another possibility is that you're hitting 6343779 ZPL's delete queue causes 'zfs restore' to fail. In either case, you can fix the problem by using zfs recv -F which will do the rollback for you and make sure nothing happens between the rollback and the recv. You need to be running build 48 or later to use 'zfs recv -F'. If you can't run build 48 or later, then you can workaround the problem by not mounting the filesystem in between the 'rollback' and the 'recv': cookies# zfs set mountpoint=none export/zone/www/html cookies# zfs rollback export/zone/www/[EMAIL PROTECTED] milk# zfs send -i @4 export/zone/www/[EMAIL PROTECTED] | ssh cookies zfs recv export/zone/www/html Let me know if one of those options works for you. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: If you can't run build 48 or later, then you can workaround the problem by not mounting the filesystem in between the 'rollback' and the 'recv': cookies# zfs set mountpoint=none export/zone/www/html cookies# zfs rollback export/zone/www/[EMAIL PROTECTED] milk# zfs send -i @4 export/zone/www/[EMAIL PROTECTED] | ssh cookies zfs recv export/zone/www/html Let me know if one of those options works for you. Setting mountpoint=none works, but once I set the mountpoint option back it fails again. That is, I successfully send the incremental, reset the mountpoint option, rollback and send and it fails. I don't follow... could you list the exact sequence of commands you used and their output? I think you're saying that you were able to successfully receive the @[EMAIL PROTECTED] incremental, but when you tried the @[EMAIL PROTECTED] incremental without doing mountpoint=none, the recv failed. So you're saying that you need mountpoint=none for any incremental recv's, not just @[EMAIL PROTECTED] So I guess there is a filesystem access somewhere somehow immediately after the rollback. I can't run b48 (any idea if -F will be in 11/06?). I don't think so. Look for it in Solaris 10 update 4. However, I really do this via a script which does a rollback then immediately does the send. This script always fails. It sounds like the mountpoint=none trick works for you, so can't you just incorporate it into your script? Eg: while (want to send snap) { zfs set mountpoint=none destfs zfs rollback [EMAIL PROTECTED] zfs send -i @bla [EMAIL PROTECTED] | ssh desthost zfs recv bla zfs inherit mountpoint destfs sleep ... } readonly=on doesn't help. That is, cookies# zfs set readonly=on export/zone/www/html cookies# zfs rollback export/zone/www/[EMAIL PROTECTED] milk# zfs send ... ... destination has been modified ... This implies that you are hitting 6343779 (or some other bug) which is causing your fs to be modified, rather than some spurious process. 
But I would expect that to be rare, so it would be surprising if you see this happening with many different snapshots. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: No, I just tried the @[EMAIL PROTECTED] incremental again. I didn't think to try another incremental. So I was basically doing the mountpoint=none trick, then trying @[EMAIL PROTECTED] again without doing mountpoint=none. Again, seeing the exact sequence of commands you ran would make it quicker for me to diagnose this. I think you're saying that you ran: zfs set mountpoint=none destfs zfs rollback [EMAIL PROTECTED] zfs send -i @4 [EMAIL PROTECTED] | zfs recv ... - success zfs inherit mountpoint destfs zfs rollback -r [EMAIL PROTECTED] zfs send -i @4 [EMAIL PROTECTED] | zfs recv ... - failure This would be consistent with hitting bug 6343779. It sounds like the mountpoint=none trick works for you, so can't you just incorporate it into your script? Eg: Sure. I was just trying to identify the problem correctly, in case this isn't just another instance of an already-known problem. mountpoint=none is really suboptimal for me though; it means I cannot have services running on the receiving host. I was hoping readonly=on would do the trick. Really? I find it hard to believe that mountpoint=none causes any more problems than 'zfs recv' by itself, since 'zfs recv' of an incremental stream always unmounts the destination fs while the recv is taking place. It's all existing snapshots on that one filesystem. If I take a new snapshot (@6) and send it, it works. Which seems weird to me. It seems to be something to do with the sending host, not the receiving host. From the information you've provided, my best guess is that the problem is associated with your @4 snapshot, and you are hitting 6343779. Here is the bug description: Even when not accessing a filesystem, it can become dirty due to the zpl's delete queue. This means that even if you are just 'zfs restore'-ing incremental backups into the filesystem, it may fail because the filesystem has been modified. 
One possible solution would be to make filesystems created by 'zfs restore' be readonly by default, and have the zpl not process the delete queue if it is mounted readonly. *** (#1 of 2): 2005-10-31 03:31:02 PST [EMAIL PROTECTED] Note, currently even if you manually set the filesystem to be readonly, the ZPL will still process the delete queue, making it particularly difficult to ensure there are no changes since a most recent snapshot which has entries in the delete queue. The only workaround I could find is to not mount the filesystem. *** (#2 of 2): 2005-10-31 03:34:56 PST [EMAIL PROTECTED] --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: Really? I find it hard to believe that mountpoint=none causes any more problems than 'zfs recv' by itself, since 'zfs recv' of an incremental stream always unmounts the destination fs while the recv is taking place. You're right. I forgot I was having problems with this anyway. You'd probably be interested in RFE 6425096 want online (read-only) 'zfs recv'. Unfortunately this isn't a priority at the moment. It's all existing snapshots on that one filesystem. If I take a new snapshot (@6) and send it, it works. Which seems weird to me. It seems to be something to do with the sending host, not the receiving host. From the information you've provided, my best guess is that the problem is associated with your @4 snapshot, and you are hitting 6343779. Well, all existing snapshots (@0, @1 ... @4). I will add changing of the mountpoint property to my script. That's a bit surprising, but I'm glad we have a workaround for you. 'zfs recv -F' will make this a bit smoother once you have it. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool list No known data errors
ttoulliu2002 wrote: Hi: I have a zpool created:

# zpool list
NAME      SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
ktspool   34,5G  33,5K  34,5G  0%   ONLINE  -

However, zpool status shows no known data errors. May I know what is the problem?

# zpool status
  pool: ktspool
 state: ONLINE
 scrub: none requested
config:

        NAME       STATE   READ WRITE CKSUM
        ktspool    ONLINE     0     0     0
        c0t1d0s6   ONLINE     0     0     0

errors: No known data errors

Please do not crosspost to both zfs-discuss and zfs-code. zfs-code is a subset of zfs-discuss, so just post to zfs-discuss. To answer your question, there does not appear to be any problem. Why do you think there is a problem? --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: directory tree removal issue with zfs on Blade 1500/PC rack server IDE disk
Stefan Urbat wrote: What bug was filed? 6421427 is nfs related, but another forum member thought, that it is in fact a general IDE performance bottleneck behind, and was only made visible in this case. There is a report, that on an also with simple IDE equipped Blade 150 the same issue with low performance is visible: http://www.opensolaris.org/jive/thread.jspa?messageID=57201 Ah yes, good old 6421427. The fix for that should be putback into opensolaris any day now. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to carve up 8 disks
Brian Hechinger wrote: Ok, previous threads have led me to believe that I want to make raidz vdevs [0] either 3, 5 or 9 disks in size [1]. Let's say I have 8 disks. Do I want to create a zfs pool with a 5-disk vdev and a 3-disk vdev? Are there performance issues with mixing differently sized raidz vdevs in a pool? If there *is* a performance hit to mix like that, would it be greater or lesser than building an 8-disk vdev? Unless you are running a database (or other record-structured application), or have specific performance data for your workload that supports your choice, I wouldn't worry about using the power-of-two-plus-parity size stripes. I'd choose between (in order of decreasing available io/s): 4x 2-way mirrors (most io/s and most read bandwidth) 2x 4-way raidz1 1x 8-way raidz1 (most write bandwidth) 1x 8-way raidz2 (most redundant) [0] - Just for clarity, what are the sub-pools in a pool, the actual raidz/mirror/etc containers called? What is the correct term to refer to them? I don't want any extra confusion here. ;) We would usually just call them vdevs (or to be more specific, top-level vdevs). --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
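The io/s ordering in the list above follows from a crude model: for small random reads, each mirror disk can serve an independent read, while a raid-z group behaves roughly like one disk because every read touches the whole stripe. A back-of-the-envelope sketch under those assumptions (the function is mine, and disk_iops/disk_size are made-up round numbers, not measurements):

```python
def layout_estimate(ndisks, groups, parity_per_group, mirror=False,
                    disk_iops=100, disk_size=500):
    """Toy model: mirrored pools serve ~one random read per disk;
    each raid-z group serves ~one random read at a time.  Usable
    space excludes parity/mirror copies."""
    if mirror:
        # each 2-way mirror group stores one copy's worth of data
        return {"read_iops": ndisks * disk_iops,
                "usable_gb": groups * disk_size}
    data_disks = ndisks // groups - parity_per_group
    return {"read_iops": groups * disk_iops,
            "usable_gb": groups * data_disks * disk_size}

layouts = {
    "4x 2-way mirror": layout_estimate(8, 4, 0, mirror=True),
    "2x 4-way raidz1": layout_estimate(8, 2, 1),
    "1x 8-way raidz1": layout_estimate(8, 1, 1),
    "1x 8-way raidz2": layout_estimate(8, 1, 2),
}
for name, est in layouts.items():
    print(name, est)
```

Even this toy model reproduces the trade-off in the post: mirrors win heavily on random io/s, while the wide raidz1 maximizes usable space.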
Re: [zfs-discuss] Where is the ZFS configuration data stored?
Steven Goldberg wrote: Thanks Matt. So is the config/meta info for the pool that is stored within the pool kept in a file? Is the file user readable or binary? It is not user-readable. See the on-disk format document, linked here: http://www.opensolaris.org/os/community/zfs/docs/ --matt
Re: [zfs-discuss] Self-tuning recordsize
Jeremy Teo wrote: Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort? It would be really great to automatically select the proper recordsize for each file! How do you suggest doing so? --matt
Re: [zfs-discuss] zfs and zones
Roshan Perera wrote: Hi Jeff Robert, Thanks for the reply. Your interpretation is correct and the answer spot on. This is going to be at a VIP client's QA/production environment and their first introduction to Solaris 10, zones and zfs. Anything unsupported is not allowed. Hence I may have to wait for the fix. Do you know roughly when the fixes will be available, so that I can give the customer some time-related info? Thanks again. Roshan Using ZFS for a zone's root is currently planned to be supported in solaris 10 update 5, but we are working on moving it up to update 4. --matt
Re: [zfs-discuss] Thumper and ZFS
Robert Milkowski wrote: Hello Richard, Friday, October 13, 2006, 8:05:18 AM, you wrote: REP Do you want data availability, data retention, space, or performance? data availability, space, performance However we're talking about quite a lot of small IOs (r+w). Then you should seriously consider using mirrors. The real question was what do you think about creating each raid group only from disks from different controllers so controller failure won't affect data availability. On thumper, where the controllers (and cables, etc) are integrated into the system board, controller failure is extremely unlikely. These controllers are much more reliable than your traditional SCSI card in a PCI slot. In fact, most controller failures are due to SCSI bus negotiation problems (confused devices, bad cables, etc), which simply don't exist in the point-to-point (ie. SATA, SAS) world. So I wouldn't worry very much about spreading across controllers for the sake of controller failure. --matt
Re: [zfs-discuss] ZFS Usability issue : improve means of finding ZFS-physdevice(s) mapping
Robert Milkowski wrote: Hello Noel, Friday, October 13, 2006, 11:22:06 PM, you wrote: ND I don't understand why you can't use 'zpool status'? That will show ND the pools and the physical devices in each and is also a pretty basic ND command. Examples are given in the sysadmin docs and manpages for ND ZFS on the opensolaris ZFS community page. Showing physical devs in df output with ZFS is not right, and I cannot imagine how one would show them in df output for a pool with a dozen disks. But an option to the zpool command to display the config in such a way that it's easy (almost copy-paste) to recreate such a config would be useful. Something like metastat -p. Agreed, see 6276640 zpool config. --matt
Re: [zfs-discuss] Self-tuning recordsize
Jeremy Teo wrote: Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort? Here is one relatively straightforward way you could implement this. You can't (currently) change the recordsize once there are multiple blocks in the file. This shouldn't be too bad because by the time they've written 128k, you should have enough info to make the choice. In fact, that might make a decent algorithm: * Record the first write size (in the ZPL's znode) * If subsequent writes differ from that size, reset write size to zero * When a write comes in past 128k, see if the write size is still nonzero; if so, then read in the 128k, decrease the blocksize to the write size, fill in the 128k again, and finally do the new write. Obviously you will have to test this algorithm and make sure that it actually detects the recordsize on various databases. They may like to initialize their files with large writes, which would break this. If you have to change the recordsize once the file is big, you will have to rewrite everything[*], which would be time consuming. --matt [*] Or if you're willing to hack up the DMU and SPA, you'll just have to re-read everything to compute the new checksums and re-write all the indirect blocks.
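Matt's three-bullet detection heuristic can be sketched in a few lines. This is only a toy model of the decision logic (the class and method names are made up; in the real implementation the state would live in the ZPL's znode, as he describes):

```python
RECORDSIZE_MAX = 128 * 1024  # ZFS's current maximum/default recordsize

class WriteSizeDetector:
    """Remember the first write size, reset it to zero if later writes
    differ, and decide once a write comes in past one max-sized block."""

    def __init__(self):
        self.write_size = 0   # candidate recordsize (0 = no consistent size)
        self.offset = 0       # bytes written so far
        self.decided = False

    def on_write(self, size):
        """Feed one sequential write; returns the chosen recordsize, or None."""
        if self.decided:
            return None
        if self.offset == 0:
            self.write_size = size        # record the first write's size
        elif size != self.write_size:
            self.write_size = 0           # inconsistent sizes: give up detecting
        self.offset += size
        if self.offset > RECORDSIZE_MAX:  # a write came in past 128k: decide now
            self.decided = True
            if 0 < self.write_size < RECORDSIZE_MAX:
                return self.write_size    # shrink the blocksize to the write size
            return RECORDSIZE_MAX         # mixed or large writes: keep the default
        return None

# A database doing consistent 8k writes is detected once it passes 128k:
d = WriteSizeDetector()
results = [d.on_write(8192) for _ in range(17)]
assert results[-1] == 8192 and all(r is None for r in results[:-1])
```

As the thread notes, files initialized with a few large writes would defeat this, which is why the decision falls back to the default whenever the observed sizes are inconsistent.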
Re: [zfs-discuss] Re: Configuring a 3510 for ZFS
Torrey McMahon wrote: Richard Elling - PAE wrote: Anantha N. Srirama wrote: I'm glad you asked this question. We are currently expecting 3511 storage sub-systems for our servers. We were wondering about their configuration as well. This ZFS thing throws a wrench in the old line think ;-) Seriously, we now have to put on a new hat to figure out the best way to leverage both the storage sub-system as well as ZFS. [for the archives] There is *nothing wrong* with treating ZFS like UFS when configuring with LUNs hosted on RAID arrays. It is true that you will miss some of the self-healing features of ZFS, but at least you will know when the RAID array has munged your data -- a feature missing on UFS and most other file systems. Or you just offer ZFS multiple LUNs from the RAID array. The issue is putting ZFS on a single LUN, be it a disk in a JBOD or a LUN offered from a HW RAID array. If something goes wrong and the LUN becomes inaccessible then ... blamo! You're toasted. If ZFS detects a data inconsistency then it can't look to another block for a mirrored copy, ala ZFS mirror, or to a parity block, ala RAIDZ. Right, I think Richard's point is that even if you just give ZFS a single LUN, ZFS is still more reliable than other filesystems (eg, due to its checksums to prevent silent data corruption and multiple copies of metadata to lessen the hurt of small amounts of data loss). --matt
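The checksum point can be illustrated with a toy end-to-end verification. ZFS actually stores a fletcher or SHA-256 checksum in the parent block pointer; this sketch just uses Python's hashlib to show the read-path behavior that other filesystems lack:

```python
import hashlib

def write_block(data):
    """Store a block along with its checksum (ZFS keeps it in the block pointer)."""
    return {"data": bytearray(data), "checksum": hashlib.sha256(data).digest()}

def read_block(block):
    """Verify on read; a filesystem without checksums would just return bad data."""
    if hashlib.sha256(bytes(block["data"])).digest() != block["checksum"]:
        raise IOError("checksum mismatch: device returned corrupt data")
    return bytes(block["data"])

blk = write_block(b"important records")
assert read_block(blk) == b"important records"

blk["data"][0] ^= 0xFF          # simulate silent corruption by the device
try:
    read_block(blk)
except IOError:
    pass  # ZFS reports this -- and self-heals from a mirror or parity, if present
```

With a single LUN there is no second copy to heal from, but the corruption is at least detected and reported instead of being handed to the application.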
Re: [zfs-discuss] Snapshots impact on performance
Robert Milkowski wrote: If it happens again I'll try to get some more specific data - however it depends on when it happens as during peak hours I'll probably just destroy a snapshot to get it working. If it happens again, it would be great if you could gather some data before you destroy the snapshot so we have some chance of figuring out what's going on here. 'iostat -xnpc 1' will tell us if it's CPU or disk bound. 'lockstat -kgiw sleep 10' will tell us what functions are using CPU. 'echo ::walk thread|::findstack | mdb -k' will tell us where threads are stuck. Actually, if you could gather each of those both while you're observing the problem, and then after the problem goes away, that would be helpful. --matt
Re: [zfs-discuss] Re: Self-tuning recordsize
Jeremy Teo wrote: Heya Anton, On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote: No, the reason to try to match recordsize to the write size is so that a small write does not turn into a large read + a large write. In configurations where the disk is kept busy, multiplying 8K of data transfer up to 256K hurts. (Actually ZFS goes up to 128k not 256k (yet!)) Ah. I knew I was missing something. What COW giveth, COW taketh away... Yes, although actually most non-COW filesystems have this same problem, because they don't write partial blocks either, even though technically they could. (And FYI, checksumming would take away the ability to write partial blocks too.) 1) Set recordsize manually 2) Allow the blocksize of a file to be changed even if there are multiple blocks in the file. Or, as has been suggested, add an API for apps to tell us the recordsize before they populate the file. --matt
Re: [zfs-discuss] ENOSPC : No space on file deletion
Erblichs wrote: Now the stupid question.. If the snapshot is identical to the FS, I can't remove files from the FS because of the snapshot, and removing files from the snapshot only removes a reference to the file and leaves the data. So, how do I do atomic file removes on both the original and the snapshot(s)? Yes, I am assuming that I have backed up the file offline. Can I request a possible RFE to be able to force a file remove from the original FS, and if found elsewhere remove that location too, IFF an ENOSPC would fail the original rm? No, you cannot remove files from snapshots. Snapshots cannot be changed. If you are out of space because of snapshots, you can always 'zfs destroy' the snapshot :-) --matt
Re: [zfs-discuss] Mirrored Raidz
Richard Elling - PAE wrote: Anthony Miller wrote: Hi, I've searched the forums and not found any answer to the following. I have 2 JBOD arrays each with 4 disks. I want to create a raidz on one array and have it mirrored to the other array. Today, the top level raid sets are assembled using dynamic striping. There is no option to assemble the sets with mirroring. Perhaps the ZFS team can enlighten us on their intentions in this area? Our thinking is that if you want more redundancy than RAID-Z, you should use RAID-Z with double parity, which provides more reliability and more usable storage than a mirror of RAID-Zs would. (Also, expressing a mirror of RAID-Zs from the CLI would be a bit messy; you'd have to introduce parentheses in vdev descriptions or something.) --matt
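The usable-storage half of that argument is simple arithmetic. A quick sketch for the 8-disk case (assuming equal-size disks and ignoring metadata overhead):

```python
disks = 8

# Mirror of RAID-Zs: two 4-disk raidz1 halves, mirrored.
# Each half holds 3 disks of data; mirroring keeps only one copy's worth.
mirror_of_raidz1_data = (disks // 2) - 1   # 3 disks of usable data

# One 8-disk raidz2: two disks of parity.
raidz2_data = disks - 2                    # 6 disks of usable data

assert mirror_of_raidz1_data == 3
assert raidz2_data == 6
```

So for the same 8 disks, raidz2 yields twice the usable capacity while still surviving any two disk failures.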
Re: [zfs-discuss] Changing number of disks in a RAID-Z?
Robert Milkowski wrote: Hello Jeremy, Monday, October 23, 2006, 5:04:09 PM, you wrote: JT Hello, Shrinking the vdevs requires moving data. Once you move data, you've got to either invalidate the snapshots or update them. I think that will be one of the more difficult parts. JT Updating snapshots would be non-trivial, but doable. Perhaps some sort JT of reverse mapping or brute force search to relate snapshots to JT blocks. IMHO the ability to shrink/grow pools, even if restricted so that no snapshots and clones can be present in a pool during shrinking/growing, would still be a great feature. FYI, we're working on being able to shrink pools with no restrictions. Unfortunately I don't have an ETA for you on this, though. And as I'm sure you know, you can always grow pools :-) --matt
Re: [zfs-discuss] Re: zone with lofs zfs - why legacy
Jens Elkner wrote: Yes, I guessed that, but hopefully not that much ... Thinking about it, it would suggest to me (if I need abs. max. perf) that the best thing to do is to create a pool inside the zone and to use zfs on it? Using a ZFS filesystem within a zone will go just as fast as in the global zone, so there's no need to create multiple pools. --matt
Re: [zfs-discuss] Changing number of disks in a RAID-Z?
Erik Trimble wrote: Matthew Ahrens wrote: Erik Trimble wrote: The ability to expand (and, to a lesser extent, shrink) a RAIDZ or RAIDZ2 device is actually one of the more critical missing features from ZFS, IMHO. It is very common for folks to add an additional shelf or shelves into an existing array setup, and if you have created a pool which uses RAIDZ across the shelves (a good idea), then you want to add the new shelves into the existing RAIDZ setup. Out of curiosity, what software (filesystem and/or volume manager) and configuration are you using today to achieve this? --matt I can't speak for VxVM, since I can't remember if it has the capability, but most hardware RAID controllers and SAN controllers have had this ability for ages (which combines with VxVM or other FSs that can grow/shrink a FS when the underlying partition size changes). See: IBM's ServeRAID controllers, HP's MSA-series array heads, etc. Right, but those are volume managers or hardware devices that export LUNs. They can shrink the LUN by simply throwing away the end of it. ZFS's zvols can do this too. Shrinking the *filesystem* that sits on top of that LUN is a much more difficult problem! (But, as I've mentioned, it's one we're going to solve.) --matt
Re: [zfs-discuss] Re: Snapshots impact on performance
Robert Milkowski wrote: Hi. On nfs clients which are mounting file system f3-1/d611 I can see 3-5s periods of 100% busy (iostat) and almost no IOs issued to the nfs server; on the nfs server at the same time disk activity is almost 0 (both iostat and zpool iostat). However CPU activity increases in SYS during those periods. Different time period when disk activity is small: # lockstat -kgIw sleep 10 | less Did you happen to get 'lockstat -kgIW' output while the problem was occurring? (note the capital W) I'm not sure how to interpret the -w output... (and sorry I gave you the wrong flags before). Now during another period when disk activity is low and nfs clients see the problem: # dtrace -n fbt:::entry'{self->vt=vtimestamp;}' -n fbt:::return'/self->vt/[EMAIL PROTECTED](vtimestamp-self->vt);self->vt=0;}' -n tick-5s'{printa(@);exit(0);}' [...] page_next_scan_large 23648600 generic_idle_cpu 69234100 disp_getwork 139261800 avl_walk 669424900 Hmm, that's a possibility, but the method you're using to gather this information (tracing *every* function entry and exit) is a bit heavy-handed, and it may be distorting the results. Heh, I'm sure I have seen avl_walk consuming a lot of CPU before... So wait for another such period and (6-7 seconds): # dtrace -n fbt::avl_walk:entry'[EMAIL PROTECTED]()]=count();}' [...] zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e4 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_alloc_blk+0x70 zfs`zil_lwb_write_start+0x8c zfs`zil_lwb_commit+0x1ac zfs`zil_commit+0x1b0 zfs`zfs_fsync+0xa8 genunix`fop_fsync+0x14 nfssrv`rfs3_create+0x700 nfssrv`common_dispatch+0x444 rpcmod`svc_getreq+0x154 rpcmod`svc_run+0x198 nfs`nfssys+0x1c4 unix`syscall_trap32+0xcc 1415957 Hmm, assuming that avl_walk() is actually consuming most of our CPU (which the lockstat -kgIW will confirm), this seems to indicate that we're taking a long time trying to find free chunks of space.
This may happen if you have lots of small fragments of free space, but no chunks large enough to hold the block we're trying to allocate. We try to avoid this situation by trying to allocate from the metaslabs with the most free space, but it's possible that there's an error in this algorithm. So let's destroy the oldest snapshot: # zfs destroy f3-1/[EMAIL PROTECTED] [it took about 4 minutes!] After the snapshot was destroyed the problem is completely gone. FYI, destroying the snapshot probably helped simply by (a) returning some big chunks of space to the pool and/or (b) perturbing the system enough so that we try different metaslabs which aren't so fragmented. --matt
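A toy model shows why fragmented free space makes allocation expensive. This is not the actual metaslab code (the real first-fit allocator walks an AVL tree of free extents, which is where avl_walk() shows up in the stacks above), just a sketch of the search cost:

```python
def first_fit(free_extents, size):
    """Walk the free list until an extent is big enough; count the steps."""
    steps = 0
    for offset, length in free_extents:
        steps += 1
        if length >= size:
            return offset, steps
    return None, steps  # nothing fit: we walked the entire list

# Healthy metaslab: one big 1GB extent, found immediately.
_, steps = first_fit([(0, 1 << 30)], 128 * 1024)
assert steps == 1

# Fragmented metaslab: thousands of 64k holes, none fits a 128k block,
# so every allocation attempt walks the whole list before giving up.
fragments = [(i * 128 * 1024, 64 * 1024) for i in range(10000)]
_, steps = first_fit(fragments, 128 * 1024)
assert steps == 10000
```

Destroying a snapshot merges freed blocks back into large extents, collapsing the list and making the walk short again, which matches the observed recovery.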
Re: [zfs-discuss] Re: ZFS hangs systems during copy
Juergen Keil wrote: Sounds familiar. Yes, it is a small system, a Sun Blade 100 with 128MB of memory. Oh, 128MB... Btw, does anyone know if there are any minimum hardware (physical memory) requirements for using ZFS? It seems as if ZFS wasn't tested that much on machines with 256MB (or less) of memory... The minimum hardware requirement for Solaris 10 (including ZFS) is 256MB, and we did test with that :-) On small-memory systems, make sure that you are running with kmem_flags=0 (this is the default on non-debug builds, but debug builds default to kmem_flags=f and you will have to manually change it in /etc/system). --matt
Re: [zfs-discuss] Re: copying a large file..
Jeremy Teo wrote: This is the same problem described in 6343653 : want to quickly copy a file from a snapshot. Actually it's a somewhat different problem. Copying a file from a snapshot is a lot simpler than copying a file from a different filesystem. With snapshots, things are a lot more constrained, and we already have the infrastructure for a filesystem referencing the same blocks as its snapshots. --matt
Re: [zfs-discuss] ZFS thinks my 7-disk pool has imaginary disks
Rince wrote: Hi all, I recently created a RAID-Z1 pool out of a set of 7 SCSI disks, using the following command: # zpool create magicant raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0 c5t6d0 It worked fine, but I was slightly confused by the size yield (99 GB vs the 116 GB I had on my other RAID-Z1 pool of same-sized disks). This is probably because your old pool was hitting 6288488 du reports misleading size on RAID-Z Pools created with more recent bits won't hit this. (note, 99/116 == 6/7) --matt
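The parenthetical (99/116 == 6/7) is just the raidz1 parity fraction: one of the seven disks goes to parity, so only 6/7 of the raw capacity is usable data. A quick check with the GB figures from the thread:

```python
n = 7                              # disks in the raidz1 vdev
usable_fraction = (n - 1) / n      # one disk's worth goes to parity

# The older pool (with bug 6288488) reported close to raw capacity: ~116 GB.
# Pools created with the fixed bits report only the usable portion:
assert round(116 * usable_fraction) == 99
```

The fix did not change how much space the pool actually has, only which number gets reported to du and friends.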
Re: [zfs-discuss] zfs receive into zone?
Jeff Victor wrote: If I add a ZFS dataset to a zone, and then want to zfs send from another computer into a file system that the zone has created in that data set, can I zfs send to the zone, or can I send to that zone's global zone, or will either of those work? I believe that the 'zfs send' can be done from either the global or local zone just fine. You can certainly do it from the local zone. FYI, if you are doing a 'zfs recv' into a filesystem that's been designated to a zone, you should do the 'zfs recv' inside the zone. (I think it's possible to do the 'zfs recv' in the global zone, but I think you'll have to first make sure that it isn't mounted in the local zone. This is because the global zone doesn't know how to go into the local zone and unmount it.) --matt
Re: [zfs-discuss] Size of raidz
Vahid Moghaddasi wrote: I created a raidz from three 70GB disks and got a total of 200GB out of it. Isn't that supposed to give 140GB? You are hitting 6288488 du reports misleading size on RAID-Z which affects pools created before build 42 or s10u3. --matt
Re: [zfs-discuss] linux versus sol10
Robert Milkowski wrote: PvdZ This could be related to Linux trading reliability for speed by doing PvdZ async metadata updates. PvdZ If your system crashes before your metadata is flushed to disk your PvdZ filesystem might be hosed and a restore PvdZ from backups may be needed. you can achieve something similar with fastfs on ufs file systems and setting zil_disable to 1 on ZFS. No, zil_disable does not trade off consistency for performance the way 'fastfs' on ufs or async metadata updates on EXT do! Setting zil_disable causes ZFS to not push synchronous operations (eg, fsync(), O_DSYNC, NFS ops) to disk immediately, but it does not compromise filesystem integrity in any way. Unlike these other filesystems' fast modes, ZFS (even with zil_disable=1) will not corrupt itself and send you to backup tapes. To state it another way, setting 'zil_disable=1' on ZFS will at worst cause some operations which requested synchronous semantics to not actually be on disk in the event of a crash, whereas other filesystems can corrupt themselves and lose all your data. All that said, 'zil_disable' is a completely unsupported hack, and subject to change at any time. It will probably eventually be replaced by 6280630 zil synchronicity. --matt
Re: [zfs-discuss] Production ZFS Server Death (06/06)
Elizabeth Schwartz wrote: On 11/28/06, *David Dyer-Bennet* [EMAIL PROTECTED] wrote: Looks to me like another example of ZFS noticing and reporting an error that would go quietly by on any other filesystem. And if you're concerned with the integrity of the data, why not use some ZFS redundancy? (I'm guessing you're applying the redundancy further downstream; but, as this situation demonstrates, separating it too far from the checksum verification makes it less useful.) Well, this error meant that two files on the file system were inaccessible, and one user was completely unable to use IMAP, so I don't know about unnoticeable. David said, [the error] would go quietly by on any other filesystem. The point is that ZFS detected and reported the fact that your hardware corrupted the data. A different filesystem would have simply given your application the incorrect data. How would I use more redundancy? By creating a zpool with some redundancy, eg. 'zpool create poolname mirror disk1 disk2'. --matt
Re: [zfs-discuss] ZFS related kernel panic
Jason J. W. Williams wrote: Hi all, Having experienced this, it would be nice if there was an option to offline the filesystem instead of kernel panicking, on a per-zpool basis. If it's a system-critical partition like a database I'd prefer it to kernel-panic and thereby trigger a fail-over of the application. However, if it's a zpool hosting some fileshares I'd prefer it to stay online. Putting that level of control in would alleviate a lot of the complaints, it seems to me... or at least give less of a leg to stand on. ;-) Agreed, and we are working on this. --matt
Re: [zfs-discuss] zpool mirror
Gino Ruopolo wrote: Hi All, we have some ZFS pools in production with hundreds of filesystems and thousands of snapshots on them. Now we do backups with zfs send/receive with some scripting, but I'm searching for a way to mirror each zpool to another one for backup purposes (so including all snapshots!). Is that possible? Not right now (without a bunch of shell-scripting). I'm working on being able to send a whole tree of filesystems and their snapshots. Would that do what you want? --matt
Re: [zfs-discuss] Re: Sol10u3 -- is du bug fixed?
Jeb Campbell wrote: After upgrade you did actually re-create your raid-z pool, right? No, but I did zpool upgrade -a. Hmm, I guess I'll try re-writing the data first. I know you have to do that if you change compression options. Ok -- rewriting the data doesn't work ... I'll create a new temp pool and see what that does ... then I'll investigate options for recreating my big pool ... Unfortunately, this bug is only fixed when you create the pool on the new bits. --matt
Re: [zfs-discuss] Performance problems during 'destroy' (and bizzare Zone problem as well)
Anantha N. Srirama wrote: - Why is the destroy phase taking so long? Destroying clones will be much faster with build 53 or later (or the unreleased s10u4 or later) -- see bug 6484044. - What can explain the unduly long snapshot/clone times? - Why didn't the Zone start up? - More surprisingly, why did the Zone start up after an hour? Perhaps there was so much activity on the system that we couldn't push out transaction groups in the usual 5 seconds. 'zfs snapshot' and 'zfs clone' take at least 1 transaction group to complete, so this could explain it. We've seen this problem as well and are working on a fix... --matt
Re: [security-discuss] Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto
Bill Sommerfeld wrote: On Tue, 2006-12-19 at 16:19 -0800, Matthew Ahrens wrote: Darren J Moffat wrote: I believe that ZFS should provide a method of bleaching a disk or part of it that works without crypto having ever been involved. I see two use cases here: I agree with your two, but I think I see a third use case in Darren's example: bleaching disks as they are removed from a pool. That sounds plausible too. (And you could implement it as 'zfs destroy -r pool; zpool bleach pool'.) We may need a second dimension controlling *how* to bleach... You mean whether we do a single overwrite with zeros, multiple overwrites with some crazy government-mandated patterns, etc, right? That's what I meant by the value of the property can specify what type of bleach to use (perhaps taking the metaphor a bit too far), for example, 'zfs set bleach=how fs'. Like other properties, we would provide bleach=on which would choose a reasonable default. --matt We'd need something similar with 'zpool bleach' (eg 'zpool bleach [-o how] pool'). --matt
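A userland sketch of what single-pass zero bleaching amounts to (the `how` values here are hypothetical stand-ins for the proposed property; the real feature would have to run inside the pool so it can also chase freed blocks):

```python
import os
import tempfile

def bleach_file(path, how="zero", passes=1):
    """Overwrite a file's contents in place, then unlink it."""
    size = os.path.getsize(path)
    pattern = b"\x00" if how == "zero" else b"\xff"  # stand-in for fancier patterns
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(pattern * size)
            f.flush()
            os.fsync(f.fileno())   # push each pass out to stable storage
    os.unlink(path)

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"sensitive payload")
bleach_file(path, how="zero")
assert not os.path.exists(path)
```

Note that on a copy-on-write filesystem like ZFS, an in-place userland overwrite does not actually touch the old blocks; the superseded copies linger on disk until reallocated, which is precisely why the thread argues bleaching must be a filesystem/pool-level feature rather than a tool like this.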
Re: [zfs-discuss] The size of a storage pool
Nathalie Poulet (IPSL) wrote: Hello, After an export and an import, the size of the pool remains unchanged. As there were no data on this partition, I destroyed and recreated the pool. The size was indeed taken into account. The correct size is indicated by the zpool list command. The df -k command shows a size higher than the real size. The zfs list command shows a lower size. Why? As Tomas pointed out, zfs list and df -k show the same size. zpool list shows slightly more, because it does its accounting differently, taking into account only actual blocks allocated, whereas the others show usable space, taking into account the small amount of space we reserve for allocation efficiency (as well as quotas or reservations, if you have them). The fact that 'zpool list' shows the raw values is bug 6308817 discrepancy between zfs and zpool space accounting. --matt
Re: [zfs-discuss] ZFS in a SAN environment
Jason J. W. Williams wrote: INFORMATION: If a member of this striped zpool becomes unavailable or develops corruption, Solaris will kernel panic and reboot to protect your data. This is a bug, not a feature. We are currently working on fixing it. --matt