Re: [zfs-discuss] XATTRs, ZAP and the Mac
On Wed, May 03, 2006 at 03:22:53PM -0400, Maury Markowitz wrote:
> > > I think that's the disconnect. WHY are they full-fledged files?
> >
> > Because that's what the specification calls for.
>
> Right, but that's my concern. To me this sounds like historically
> circular reasoning...
>
>   20xx) we need a new file system that supports xattrs well; xattrs are
>         this second file, so...
>
> To me it appears that there is some confusion between the purpose and
> the implementation. Certainly if xattrs were originally introduced to
> store, well, x attrs, then the implementation is a poor one. Years
> later the _implementation_ was copied, even though it was never a good
> one.

I think you are confusing the interface with the implementation. ZFS has copied (aka adhered to) a pre-existing interface[*]. Our implementation of that interface is in some ways similar to other implementations. I believe that our implementation is a very good one, but if you have specific suggestions for how it could be improved, we'd love to hear them.

[*] The Solaris extended attributes interface is actually more accurately called named streams, and has been used as the back-end for the CIFS (Windows) and NFSv4 named-streams protocols. See the fsattr(5) manpage.

We appreciate your suggestion that we implement a higher-performance method for storing additional metadata associated with files. This will most likely not be possible within the extended attribute interface, and will require that we design (and applications use) a new interface. Having specific examples of how that interface would be used will help us to design a useful feature.

> The real problem is that there is nothing like a general overview of
> the zfs system as a whole.

I agree that a higher-level overview would be useful.

> COMPARING the system with the widely understood UFS would be
> invaluable, IMHO.

Agreed, thanks for the suggestion.
Unfortunately, ZFS and UFS are sufficiently different that I think the comparison would only be useful for a very limited part of ZFS, say from the file/directory level down.

> But to the specifics. You asked why I thought it was that the file name
> did not appear. Well, that's because the term file name (or filename)
> does not appear anywhere in the document.

Thanks, maybe we should use that keyword in section 6.2 to help when doing a search.

> So then, at a first glance it seems that one would expect to find the
> directory description in Chapter 6, which has a subsection called
> Directories and Directory Traversal.

I believe that that section does in fact describe directories. Perhaps the description could be made more explicit (eg. "The ZAP object which stores the directory maps from filename to object number. Each entry in the ZAP is a single directory entry. The entry's name is the filename, and its value is the object number which identifies that file.")

> That section describes the znode_phys_t structure.

You're right, it also describes the znode_phys_t. There should be a section break after the first paragraph, before we start talking about the znode_phys_t.

> Maybe I'm going down a dark alley here, but is there any reason this
> split still exists under zfs? IE, I assumed that the znode_phys_t would
> be located in the directory ZAP, because to my mind, that's where
> metadata belongs.

ZFS must support POSIX semantics, part of which is hard links. Hard links allow you to create multiple names (directory entries) for the same file. Therefore, all UNIX filesystems have chosen to store the file information separately from the directory entries (otherwise, you'd have multiple copies, and need pointers between all of them so you could update them all -- yuck). Hard links suck for FS designers because they constrain our implementation in this way. We'd love to have the flexibility to easily store metadata with the directory entry.
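The split described above (names live in the directory ZAP, per-file metadata lives once in the znode) can be illustrated with a small conceptual model. This is hypothetical Python, not ZFS code; the dictionaries and field names are invented for illustration:

```python
# Conceptual model (not ZFS code): a directory is a ZAP-like map from
# filename -> object number; per-file metadata (the znode) is stored once,
# keyed by object number, so a hard link is just another directory entry.
objects = {
    123: {"type": "file", "links": 0, "size": 1400},  # znode-like record
}
directories = {
    "/a": {},
    "/b": {},
}

def link(dirname, name, objnum):
    """Create a directory entry (a hard link) pointing at an object."""
    directories[dirname][name] = objnum
    objects[objnum]["links"] += 1

link("/a", "test.h", 123)
link("/b", "alias.h", 123)   # a second name for the same file

# Both names resolve to the same metadata, so updating it once is enough.
assert directories["/a"]["test.h"] == directories["/b"]["alias.h"]
objects[123]["size"] = 2800
print(objects[directories["/b"]["alias.h"]]["size"])  # -> 2800
print(objects[123]["links"])                          # -> 2
```

If the znode lived inside one directory's ZAP, the second link would either duplicate it or need a pointer back to the first directory, which is exactly the constraint described above.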
We've actually contemplated caching the metadata needed to do a stat(2) in the directory entry, to improve performance of directory traversals like find(1). Perhaps we'll be able to add this performance improvement in a future release.

--matt
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota
On Thu, May 18, 2006 at 12:46:28PM -0700, Charlie wrote:
> Traditional (amanda). I'm not seeing a way to dump zfs file systems to
> tape without resorting to 'zfs send' being piped through gtar or
> something. Even then, the only thing I could restore was an entire file
> system. (We frequently restore single files for users...) Perhaps,
> since zfs isn't limited to one snapshot per FS like fssnap is, I should
> be redesigning everything. It sounds like I should look at using many
> snapshots, and dumping to tape (each file system, somehow) less
> frequently.

That's right. With ZFS, there should never be a need to go to tape to recover an accidentally deleted file, because it's easy[*] to keep lots of snapshots around.

[*] Well, modulo 6373978 "want to take lots of snapshots quickly ('zfs snapshot -r')". I'm working on that...

--matt
Re: [zfs-discuss] tracking error to file
On Tue, May 23, 2006 at 11:49:47AM +0200, Wout Mertens wrote:
> Can that same method be used to figure out what files changed between
> snapshots?

To figure out what files changed, we need to (a) figure out what object numbers changed, and (b) do the object number to file name translation. The method I described (using zdb) will not be involved in either step. zdb is an undocumented interface, and using it for this purpose is only a workaround. However, the same algorithms implemented in zdb will be used to do step (b), the object number to file name translation.

--matt
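Step (b), translating an object number back to a name, amounts to a reverse lookup over directory entries. A rough sketch of the idea, as a hypothetical Python model (the directory layout and helper are invented, and real implementations avoid a full scan by walking parent pointers):

```python
# Hypothetical model: directories are maps of name -> object number.
# Translating an object number to a path means finding the directory
# entry that references it.
directories = {
    "/": {"home": 4},
    "/home": {"notes.txt": 7, "code": 9},
    "/home/code": {"main.c": 12},
}

def object_to_path(objnum):
    """Return the first path whose directory entry references objnum."""
    for dirpath, entries in directories.items():
        for name, num in entries.items():
            if num == objnum:
                prefix = "" if dirpath == "/" else dirpath
                return prefix + "/" + name
    return None  # object exists but has no (known) name

print(object_to_path(12))  # -> /home/code/main.c
```

Note that hard links make this translation one-to-many: a single object number may be reachable under several names.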
Re: [zfs-discuss] Misc questions
On Tue, May 23, 2006 at 02:34:30PM -0700, Jeff Victor wrote:
> * When you share a ZFS fs via NFS, what happens to files and
>   filesystems that exceed the limits of NFS?

What limits do you have in mind? I'm not an NFS expert, but I think that NFSv4 (and probably v3) supports 64-bit file sizes, so there would be no limit mismatch there.

> * Is there a recommendation or some guidelines to help answer the
>   question "how full should a pool be before deciding it's time to add
>   disk space to a pool?"

I'm not sure, but I'd guess around 90%.

> * Migrating pre-ZFS backups to ZFS backups: is there a better method
>   than "restore the old backup into a ZFS fs, then back it up using
>   zfs send"?

No.

> * Are ZFS quotas enforced assuming that compressed data is compressed,
>   or uncompressed?

Quotas apply to the amount of space used, after compression. This is the space reported by 'zfs list', 'zfs get used', 'df', 'du', etc.

> The former seems to imply that the following would create a mess:
>   1) Turn on compression
>   2) Store data in the pool until the pool is almost full
>   3) Turn off compression
>   4) Read and re-write every file (thus expanding each file)

Since this example doesn't involve quotas, their behavior is not applicable here. In this example, there will be insufficient space in the pool to store your data, so your write operation will fail with ENOSPC. Perhaps a messy situation, but I don't see any alternative. If this is a concern, don't use compression. If you filled up a filesystem's quota rather than a pool, the behavior would be the same except you would get EDQUOT rather than ENOSPC.

> * What block sizes will ZFS use? Is there an explanation somewhere
>   about its method of choosing a blocksize for a particular workload?

Files smaller than 128k will be stored in a single block, whose size is rounded up to the nearest sector (512 bytes). Files larger than 128k will be stored in multiple 128k blocks (unless the recordsize property has been set -- see the zfs(1m) manpage for an explanation of this).
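The block-sizing rule above can be sketched as a small function. This is a simplified model (it ignores compression and indirect blocks), not ZFS source:

```python
RECORDSIZE = 128 * 1024  # ZFS default recordsize
SECTOR = 512

def block_layout(filesize, recordsize=RECORDSIZE):
    """Return (number_of_blocks, block_size) for a file, per the rule
    above: small files get one block rounded up to a 512-byte sector;
    larger files get multiple recordsize blocks."""
    if filesize <= recordsize:
        bsize = max(SECTOR, -(-filesize // SECTOR) * SECTOR)  # ceil to sector
        return 1, bsize
    nblocks = -(-filesize // recordsize)  # ceiling division
    return nblocks, recordsize

print(block_layout(1000))        # -> (1, 1024): one block of two sectors
print(block_layout(300 * 1024))  # -> (3, 131072): three 128k blocks
```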
Thanks for using zfs!

--matt
Re: [zfs-discuss] ZFS and HSM
On Wed, May 24, 2006 at 03:43:54PM -0400, Scott Dickson wrote:
> I said I had several questions to start threads on... What about ZFS
> and various HSM solutions? Do any of them already work with ZFS? Are
> any going to? It seems like HSM solutions that access things at a file
> level would have little trouble integrating with ZFS. But ones that
> work at a block level would have a harder time.

Sun is working on getting SAM (an HSM which is currently wedded to QFS) working with ZFS.

--matt
Re: [zfs-discuss] ZFS mirror and read policy; kstat I/O values for zfs
On Fri, May 26, 2006 at 09:40:57PM +0200, Daniel Rock wrote:
> So you can see the second disk of each mirror pair (c4tXd0) gets almost
> no I/O. How does ZFS decide from which mirror device to read?

You are almost certainly running into this known bug:

  630 reads from mirror are not spread evenly

--matt
Re: [zfs-discuss] question about ZFS performance for webserving/java
On Thu, Jun 01, 2006 at 11:35:41AM -1000, David J. Orman wrote:
> 3 - App server would be running in one zone, with a (NFS) mounted ZFS
>     filesystem as storage.
> 4 - DB server (PgSQL) would be running in another zone, with a (NFS)
>     mounted ZFS filesystem as storage.

Why would you use NFS? These zones are on the same machine as the storage, right? You can simply export filesystems in your pool to the various zones (see the zfs(1m) and zonecfg(1m) manpages). This will result in better performance.

> 5 - Multiple disk redundancy is needed. So, I'm assuming two raid-z
>     pools of 3 drives each, mirrored is the solution. If people have a
>     better suggestion, tell me! :P

There is no need for multiple pools. Perhaps you meant two raid-z groups (aka vdevs) in a single pool? Also, wouldn't you want to use all 8 disks, therefore use two 4-disk raid-z groups? This way you would get 3 disks worth of usable space.

Depending on how much space you need, you should consider using a single double-parity RAID-Z group with your 8 disks. This would give you 6 disks worth of usable space. Given that you want to be able to tolerate two failures, that is probably your best solution. Other solutions would include three 3-way mirrors (if you can fit another drive in your machine), giving you 3 disks worth of usable space.

--matt
Re: [zfs-discuss] Proposal: delegated administration
On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote:
> So as administrator what do I need to do to set /export/home up for
> users to be able to create their own snapshots, create dependent
> filesystems (but still mounted underneath their /export/home/username)?
> In other words, is there a way to specify the rights of the owner of a
> filesystem rather than the individual - eg, delayed evaluation of the
> owner?

I think you're asking for the -c (Creator) flag. This allows permissions (eg, to take snapshots) to be granted to whoever creates the filesystem. The above example shows how this might be done.

--matt

> Actually, I think I mean owner. I want root to create a new filesystem
> for a new user under the /export/home filesystem, but then have that
> user get the right privs via inheritance rather than requiring root to
> run a set of zfs commands.

In that case, how should the system determine who the owner is? We toyed with the idea of figuring out the user based on the last component of the filesystem name, but that seemed too tricky, at least for the first version. FYI, here is how you can do it with an additional zfs command:

  # zfs create tank/home/barts
  # zfs allow barts create,snapshot,... tank/home/barts

--matt
Re: [zfs-discuss] ZFS needs a viable backup mechanism
On Fri, Jul 07, 2006 at 04:00:38PM -0400, Dale Ghent wrote:
> Add an option to zpool(1M) to dump the pool config as well as the
> configuration of the volumes within it to an XML file. This file could
> then be sucked in to zpool at a later date to recreate/replicate the
> pool and its volume structure in one fell swoop. After that, Just Add
> Data(tm).

Yep, this has been on our to-do list for quite some time:

  RFE #6276640 zpool config
  RFE #6276912 zfs config

--matt
Re: [zfs-discuss] metadata inconsistency?
On Thu, Jul 06, 2006 at 12:46:57AM -0700, Patrick Mauritz wrote:
> Hi, after some unscheduled reboots (to put it lightly), I've got an
> interesting setup on my notebook's zfs partition.
>
> Setup: simple zpool, no raid or mirror, a couple of zfs partitions, one
> zvol for swap. /foo is one such partition, /foo/bar the directory with
> the issue.
>
> Directly after the reboot happened:
>   $ ls /foo/bar
>   test.h
>   $ ls -l /foo/bar
>   total 0
> The file wasn't accessible with cat, etc.

This can happen when the file appears in the directory listing (ie. getdents(2)), but a stat(2) on the file fails. Why that stat would fail is a bit of a mystery, given that ls doesn't report the error. It could be that the underlying hardware has failed, and the directory is still intact but the file's metadata has been damaged. (Note, this would be a hardware error, not metadata inconsistency.) Another possibility is that the file's inode number is too large to be expressed in 32 bits, thus causing a 32-bit stat() to fail. However, I don't think that Sun's ls(1) should be issuing any 32-bit stats (even on a 32-bit system, it should be using stat64).

> Somewhat later (new data appeared on /foo, in /foo/baz):
>   $ ls -l /foo/bar
>   total 3
>   -rw-r--r-- 1 user group 1400 Jul 6 02:14 test.h
> The content of test.h is the same as the content of /foo/baz/quux now,
> but the refcount is 1!
>   $ chmod go-r /foo/baz/quux
>   $ ls -l /foo/bar
>   total 3
>   -rw------- 1 user group 1400 Jul 6 02:14 test.h

This behavior could also be explained if there is an unknown bug which causes the object representing the file to be deleted, but not the directory entry pointing to it.

> Anyway, how do I get rid of test.h now without making quux unreadable?
> (The brute force approach would be a new partition, moving data over
> with copying - instead of moving - the troublesome file, just in case;
> not sure if zfs allows for links that cross zfs partitions and thus
> optimizes such moves. Then zfs destroy data/test. But there might be a
> better way?)
Before trying to rectify the problem, could you email me the output of 'zpool status' and 'zdb -vvv foo'? FYI, there are no cross-filesystem links, even with ZFS.

--matt
Re: [zfs-discuss] How can I watch IO operations with dtrace on zfs?
On Thu, Jul 20, 2006 at 12:58:31AM -0700, Trond Norbye wrote:
> I have been using the iosnoop script (see
> http://www.opensolaris.org/os/community/dtrace/scripts/) written by
> Brendan Gregg to look at the IO operations of my application. ... So
> how can I get the same information from a ZFS file-system?

As you can see, ZFS is not yet fully integrated with the dtrace i/o provider. With ZFS, writes are (typically) deferred, so it is nontrivial to assign each write i/o to a particular application. If you are familiar with dtrace, you can use fbt to look at the zio_done() function, eg. with something like this:

  zio_done:entry
  /args[0]->io_type == 1 && args[0]->io_bp != NULL/
  {
          @bytes["read", args[0]->io_bookmark.zb_objset,
              args[0]->io_bookmark.zb_object,
              args[0]->io_bookmark.zb_level,
              args[0]->io_bookmark.zb_blkid != 0] =
              /* sum(args[0]->io_size); */ count();
  }

  zio_done:entry
  /args[0]->io_type == 2/
  {
          @bytes["write", args[0]->io_bookmark.zb_objset,
              args[0]->io_bookmark.zb_object,
              args[0]->io_bookmark.zb_level,
              args[0]->io_bookmark.zb_blkid != 0] =
              /* sum(args[0]->io_size); */ count();
  }

  END
  {
          printf("r/w objset object level blk0 i/os\n");
          printa("%5s %4d %7d %d %d %@d\n", @bytes);
  }

--matt
Re: [zfs-discuss] Re: Quotas and Snapshots
On Tue, Jul 25, 2006 at 11:13:16AM -0700, Brad Plecs wrote:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6431277
>
> What I'd really like to see is ... the ability for the snapshot space
> to *not* impact the filesystem space.

Yep, as Eric mentioned, that is the purpose of this RFE (want filesystem-only quotas). I imagine that this would be implemented as a quota against the space referenced (as currently reported by 'zfs list', 'zfs get refer', 'df', etc; see the zfs(1m) manpage for details).

> In fact, I think a lot of ZFS's hierarchical features would be more
> valuable if parent filesystems included their descendants (backups and
> NFS sharing, for example), but I'm sure there are just as many
> arguments against that as for it.

Yep, we're working on making more features work on "this and all descendents". For example, the recently implemented 'zfs snapshot -r' can create snapshots of a filesystem and all its descendents. This feature will be part of Solaris 10 update 3. We're also working on 'zfs send -r' (RFE 6421958).

--matt
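The difference between today's quota (charged against all space used, snapshots included) and the proposed referenced-space quota can be modeled roughly. This is a hypothetical sketch; the function and its arguments are invented for illustration:

```python
def over_quota(referenced, snapshot_only, quota, referenced_style=False):
    """Hypothetical model: a classic ZFS quota counts live data plus
    space held only by snapshots; a referenced-style (filesystem-only)
    quota counts live data alone."""
    used = referenced + snapshot_only
    charged = referenced if referenced_style else used
    return charged > quota

GB = 1024 ** 3
# 8 GB of live data, 3 GB held only by snapshots, 10 GB quota:
print(over_quota(8 * GB, 3 * GB, 10 * GB))                        # -> True
print(over_quota(8 * GB, 3 * GB, 10 * GB, referenced_style=True)) # -> False
```

Under the classic quota, taking snapshots can push a user over the limit even though their live data shrank; under the referenced-style quota it cannot.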
Re: [zfs-discuss] Re: Quotas and Snapshots
On Tue, Jul 25, 2006 at 07:24:51PM -0500, Mike Gerdts wrote:
> On 7/25/06, Brad Plecs wrote:
> > What I'd really like to see is ... the ability for the snapshot space
> > to *not* impact the filesystem space.
>
> The idea is that you have two storage pools - one for live data, one
> for backup data. Your live data is *probably* on faster disks than your
> backup data. The live data and backup data may or may not be on the
> same server. Whenever you need to perform backups you do something
> along the lines of:
>
>   yesterday=$1
>   today=$2
>   for user in $allusers ; do
>       zfs snapshot users/$user@$today
>       zfs snapshot backup/$user/$yesterday@$today
>       zfs clone backup/$user/$yesterday@$today backup/$user/$today
>       rsync -axuv /users/$user/.zfs/snapshot/$today /backup/$user/$today
>       zfs destroy users/$user@$yesterday
>       zfs destroy backup/$user/$lastweek
>   done

You can simplify and improve the performance of this considerably by using 'zfs send':

  for user in $allusers ; do
      zfs snapshot users/$user@$today
      zfs send -i $yesterday users/$user@$today | \
          ssh $host zfs recv -d $backpath
      ssh $host zfs destroy $backpath/$user/$lastweek
  done

You can send the backup to the same or different host, and the same or different pool, as your hardware needs dictate. 'zfs send' will be much faster than rsync because we can use ZFS metadata to determine which blocks were changed without traversing all files and directories.

--matt
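The metadata trick behind that last point can be sketched with a toy block tree. This is a hypothetical model of the idea, not ZFS code: every block records the transaction group (txg) in which it was born, so an incremental send can skip any subtree whose birth txg is no newer than the base snapshot, without reading it:

```python
# Toy block tree: each node has a birth txg and child blocks.
tree = {
    "birth": 90,
    "children": [
        {"birth": 40, "children": []},        # unchanged subtree: pruned
        {"birth": 90, "children": [
            {"birth": 85, "children": []},    # unchanged leaf
            {"birth": 90, "children": []},    # changed leaf
        ]},
    ],
}

def blocks_to_send(node, base_txg):
    """Count leaf blocks modified after the base snapshot's txg,
    pruning whole subtrees whose birth txg is not newer."""
    if node["birth"] <= base_txg:
        return 0                       # entire subtree unchanged: skip
    sent = 0 if node["children"] else 1
    for child in node["children"]:
        sent += blocks_to_send(child, base_txg)
    return sent

print(blocks_to_send(tree, base_txg=85))  # -> 1 (only the txg-90 leaf)
```

rsync, by contrast, must stat every file to discover the same single change.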
Re: [zfs-discuss] Supporting ~10K users on ZFS
On Thu, Jun 29, 2006 at 08:20:56PM +0200, Robert Milkowski wrote:
> btw: I believe it was discussed here before - it would be great if one
> could automatically convert a given directory on a zfs filesystem into
> a zfs filesystem (without actually copying all the data).

Yep, and an RFE has been filed:

  6400399 want zfs split, and vice versa (making a given zfs filesystem a directory)

But more filesystems is better! :-) (And this would be pretty nontrivial; we'd have to resolve conflicting inode (object) numbers, thus rewriting all the metadata.)

Back to slogging through old mail archives,
--matt
Re: [zfs-discuss] This may be a somewhat silly question ...
On Tue, Jun 27, 2006 at 06:30:46PM -0400, Dennis Clarke wrote:
> ... but I have to ask. How do I back this up?

The following two RFEs would help you out enormously:

  6421958 want recursive zfs send ('zfs send -r')
  6421959 want zfs send to preserve properties ('zfs send -p')

As far as RFEs go, these are pretty high priority...

--matt
Re: [zfs-discuss] ZFS compression best-practice?
On Thu, Jul 27, 2006 at 03:54:02PM -0400, Christine Tran wrote:
> - What is the compression algorithm used?

It is based on the Lempel-Ziv algorithm.

> - Is there a ZFS feature that will output the real uncompressed size of
>   the data? The scenario is if they had to move a compressed ZFS
>   filesystem back to UFS, say. 'ls' will give the file's real
>   uncompressed size, but the customer would rather not write a script
>   to sum everything up.

You can multiply the 'referenced' and 'compressratio' properties of a filesystem to find out how much space it would use if uncompressed.

> - Customer wants to do a diff between snapshots. Is there an RFE
>   already filed?

Two, in fact:

  6370738 zfs diffs filesystems
  6425091 want 'zfs diff' to list files that have changed between snapshots

> - Customer would like benchmarking numbers. I think there is a blog
>   item but do we have something more official?

No; we're working on some more unofficial benchmark numbers, though :-)

--matt
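The referenced-times-compressratio arithmetic is simple enough to show with made-up numbers (these property values are hypothetical, standing in for the output of 'zfs get referenced,compressratio'):

```python
# Hypothetical figures for one filesystem:
referenced = 10 * 1024 ** 3   # 10 GB actually stored after compression
compressratio = 1.55          # reported by zfs as "1.55x"

# Space the same data would need uncompressed (e.g. after moving to UFS):
uncompressed = referenced * compressratio
print(round(uncompressed / 1024 ** 3, 1))  # -> 15.5 (GB)
```

This is an estimate: the ratio is an aggregate over all blocks, so per-file results on UFS will vary slightly.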
Re: [zfs-discuss] ZFS Boot Disk
On Thu, Jul 27, 2006 at 08:17:03PM -0500, Malahat Qureshi wrote:
> Is there any way to boot from a zfs disk, or a workaround?

Yes, see http://blogs.sun.com/roller/page/tabriz?entry=are_you_ready_to_rumble

--matt
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
On Tue, Aug 08, 2006 at 06:11:09PM +0200, Robert Milkowski wrote:
> filebench/singlestreamread v440
>
>   1. UFS, noatime, HW RAID5 6 disks, S10U2                         70MB/s
>   2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (same lun as in #1)   87MB/s
>   3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2                     130MB/s
>   4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44                    133MB/s

FYI, streaming read performance is improved considerably by Mark's prefetch fixes, which are in build 45. (However, as mentioned, you will soon run into the bandwidth of a single fibre channel connection.)

--matt
Re: [zfs-discuss] ZFS RAID10
On Tue, Aug 08, 2006 at 09:54:16AM -0700, Robert Milkowski wrote:
> Hi. snv_44, v440. filebench/varmail results for ZFS RAID10 with 6 disks
> and 32 disks. What is surprising is that the results for both cases are
> almost the same!
>
> 6 disks:
>   IO Summary: 566997 ops 9373.6 ops/s, (1442/1442 r/w) 45.7mb/s, 299us cpu/op, 5.1ms latency
>   IO Summary: 542398 ops 8971.4 ops/s, (1380/1380 r/w) 43.9mb/s, 300us cpu/op, 5.4ms latency
>
> 32 disks:
>   IO Summary: 572429 ops 9469.7 ops/s, (1457/1457 r/w) 46.2mb/s, 301us cpu/op, 5.1ms latency
>   IO Summary: 560491 ops 9270.6 ops/s, (1426/1427 r/w) 45.4mb/s, 300us cpu/op, 5.2ms latency
>
> Using iostat I can see that with 6 disks in a pool I get about 100-200
> IO/s per disk, and with a 32 disk pool I get only 30-70 IO/s per disk.
> Each CPU is used at about 25% in SYS (there are 4 CPUs). Something is
> wrong here.

It's possible that you are CPU limited. I'm guessing that your test uses only one thread, so that may be the limiting factor. We can get a quick idea of where that CPU is being spent if you can run 'lockstat -kgIW sleep 60' while your test is running, and send us the first 100 lines of output. It would be nice to see the output of 'iostat -xnpc 3' while the test is running, too.

--matt
Re: [zfs-discuss] Re: ZFS RAID10
On Tue, Aug 08, 2006 at 10:42:41AM -0700, Robert Milkowski wrote:
> filebench in varmail by default creates 16 threads - I confirmed it
> with prstat, 16 threads are created and running.

Ah, OK. Looking at these results, it doesn't seem to be CPU bound, and the disks are not fully utilized either. However, because the test is doing so many synchronous writes (eg. by calling fsync()), we are continually writing out the intent log. Unfortunately, we are only able to issue a small number of concurrent i/os while doing the intent log writes. All the threads must wait for the intent log blocks to be written before they can enqueue more data. Therefore, we are essentially doing:

  1. many threads call fsync();
  2. one of them flushes the intent log, issuing a few writes to the disks;
  3. all of the threads wait for those writes to complete;
  4. repeat.

This test fundamentally requires waiting for lots of synchronous writes. Assuming no other activity on the system, the performance of synchronous writes does not scale with the number of drives; it is bounded by the drive's write latency. If you were to alter the test to not require everything to be done synchronously, then you would see much different behavior.

--matt
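The scaling argument in that reply can be reduced to a one-line model. This is a deliberately simplified, hypothetical sketch (it ignores batching and queueing), meant only to show that the result is independent of the number of spindles:

```python
def fsync_ops_per_sec(write_latency_ms, ndisks):
    """Hypothetical model: every thread waits on the shared intent-log
    write, so sustained fsync throughput tracks one disk's write
    latency, not the number of disks in the pool."""
    del ndisks  # extra spindles do not hide per-write latency
    return 1000.0 / write_latency_ms

# With ~5 ms per intent-log write, 6 disks and 32 disks model the same:
print(fsync_ops_per_sec(5.0, 6))   # -> 200.0
print(fsync_ops_per_sec(5.0, 32))  # -> 200.0
```

A latency-bound workload like this only speeds up with lower-latency log devices, not with wider stripes.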
Re: [zfs-discuss] Proposal: 'canmount' option
On Thu, Aug 10, 2006 at 10:23:20AM -0700, Eric Schrock wrote:
> A new option will be added, 'canmount', which specifies whether the
> given filesystem can be mounted with 'zfs mount'. This is a boolean
> property, and is not inherited.

Cool, looks good. Do you plan to implement this using the generic (inheritable) property infrastructure (eg. dsl_prop_set/get()), and just ignore the setting if it is inherited?

--matt
Re: [zfs-discuss] Proposal: 'canmount' option
On Thu, Aug 10, 2006 at 10:44:46AM -0700, Eric Schrock wrote:
> Right now I'm using the generic property mechanism, but have a special
> case in dsl_prop_get_all() to ignore searching parents for this
> particular property. I'm not thrilled about it, but I only see two
> other options:
>
> 1. Do not use the generic infrastructure. This requires much more
>    invasive changes that I'd rather avoid.
>
> 2. From the kernel's perspective have it be inheritable, but then fake
>    up the non-inherited state in libzfs. i.e. if the source is not
>    ZFS_SRC_LOCAL, then pretend like it isn't set at all.
>
> If the current hack is too offensive, moving it into libzfs seems like
> a reasonable option.

Yeah, I guess I was suggesting (2), but having a check in the dsl_prop code might be better. It would probably be better to base it off some value stored in the zfs_prop_t, though, rather than hard-coding canmount into dsl_prop.c.

--matt
Re: [zfs-discuss] Difficult to recursive-move ZFS filesystems to another server
On Fri, Aug 11, 2006 at 10:02:41AM -0700, Brad Plecs wrote:
> There doesn't appear to be a way to move zfspool/www and its
> descendants en masse to a new machine with those quotas intact. I have
> to script the recreation of all of the descendant filesystems by hand.

Yep, you need:

  6421959 want zfs send to preserve properties ('zfs send -p')
  6421958 want recursive zfs send ('zfs send -r')

--matt
Re: [zfs-discuss] in-kernel gzip compression
On Thu, Aug 17, 2006 at 02:53:09PM +0200, Robert Milkowski wrote:
> Hello zfs-discuss, Is someone actually working on it? Or any other
> algorithms? Any dates?

Not that I know of. Any volunteers? :-)

(Actually, I think that an RLE compression algorithm for metadata is a higher priority, but if someone from the community wants to step up, we won't turn your code away!)

--matt
Re: [zfs-discuss] in-kernel gzip compression
On Thu, Aug 17, 2006 at 10:28:10AM -0700, Adam Leventhal wrote:
> On Thu, Aug 17, 2006 at 10:00:32AM -0700, Matthew Ahrens wrote:
> > (Actually, I think that an RLE compression algorithm for metadata is
> > a higher priority, but if someone from the community wants to step
> > up, we won't turn your code away!)
>
> Is RLE likely to be more efficient for metadata?

No, it is not likely to achieve a higher compression ratio. However, it should use significantly less CPU time. We've seen some circumstances where the CPU usage caused by compressing metadata can be less trivial than we'd like.

--matt
Re: [zfs-discuss] problem with zfs receive -i
On Sat, Aug 19, 2006 at 07:21:52PM -0700, Frank Cusack wrote:
> On August 19, 2006 7:06:06 PM -0700 Matthew Ahrens wrote:
> > My guess is that the filesystem is not mounted. It should be
> > remounted after the 'zfs recv', but perhaps that is not happening
> > correctly. You can see if it's mounted by running 'df' or
> > 'zfs list -o name,mounted'.
>
> You are right, it's not mounted.
>
> > Did the 'zfs recv' print any error messages?
>
> nope.
>
> > Are you able to reproduce this behavior?
>
> easily.

Hmm, I think there must be something special about your filesystems or configuration; I'm not able to reproduce it. One possible cause for trouble is if you are doing the 'zfs receive' into a filesystem which has descendent filesystems (eg, you are doing 'zfs recv pool/fs@snap' and pool/fs/child exists). This isn't handled correctly now, but you should get an error message in that case. (This will be fixed by some changes Noel is going to putback next week.) Could you send me the output of 'truss zfs recv ...', and 'zfs list' and 'zfs get -r all pool' on both the source and destination systems?

> ah ok. Note that if I do zfs send; zfs send -i on the local side, then
> do zfs list; zfs mount -a on the remote side, I still show space used
> in the @7.1 snapshot, even though I didn't touch anything. I guess
> mounting accesses the mount point and updates the atime.

Hmm, maybe. I'm not sure if that's exactly what's happening, because mounting and unmounting a filesystem doesn't seem to update the atime for me. Does the @7.1 snapshot show used space before you do the 'zfs mount -a'?

> On the local side, how come after I take the 7.1 snapshot and then
> 'ls', the 7.1 snapshot doesn't start using up space? Shouldn't my ls of
> the mountpoint update the atime also?

I believe what's happening here is that although we update the in-core atime, we sometimes defer pushing it to disk. You can force the atime to be pushed to disk by unmounting the filesystem.
--matt
Re: [zfs-discuss] Niagara and ZFS compression?
On Sun, Aug 20, 2006 at 08:38:03PM -0700, Luke Lonergan wrote:
> Matthew, On 8/20/06 6:20 PM, Matthew Ahrens wrote:
> > This was not the design; we're working on fixing this bug so that
> > many threads will be used to do the compression.
>
> Is this also true of decompression?

I believe that decompression already runs in many threads. If you see differently, let us know.

--matt
Re: [zfs-discuss] ZFS Performance compared to UFS VxFS
On Tue, Aug 22, 2006 at 06:15:08AM -0700, Tony Galway wrote:
> A question (well, let's make it 3 really) - Is vdbench a useful tool
> when testing file system performance of a ZFS file system? Secondly, is
> ZFS write performance really much worse than UFS or VxFS? And third,
> what is a good benchmarking tool to test ZFS vs UFS vs VxFS? ...
>
>   sd=ZFS,lun=/pool/TESTFILE,size=10g,threads=8
>   wd=ETL,sd=ZFS,rdpct=0,seekpct=80
>   rd=ETL,wd=ETL,iorate=max,elapsed=1800,interval=5,forxfersize=(1k,4k,8k,32k)

ZFS write performance should be much better than UFS or VxFS. What exactly is the write workload? It sounds like it is doing effectively random writes of various (1k, 4k, 8k, 32k) record sizes. As these record sizes are all smaller than ZFS's default block size (128k), they will all require ZFS to read in the 128k block. Whereas UFS (on x86) uses a 4k block size by default, so the 4k, 8k, and 32k record-size writes will not require any reads; only the 1k records will require UFS to read the block in from disk.

When doing record-structured access (eg. databases), it is recommended that you do 'zfs set recordsize=XXX' to set ZFS's block size to match your application's record size. In this case perhaps you should set it to 4k to match UFS.

> I am seeing large periods of time where there is no reported activity,
> and if I am looking at zpool iostat I do see consistent writing,
> however.

How are you measuring this reported activity? If your application is trying to write faster than the storage can keep up with, then it will have to be throttled. So if you are measuring this at the application or syscall level, then this is the expected behavior and does not indicate a performance problem in and of itself.

--matt
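The read-modify-write cost described in that reply can be sketched numerically. This is a simplified, hypothetical model (it assumes the block is not already cached and ignores parity and metadata I/O):

```python
def bytes_read_per_write(write_size, blocksize):
    """Hypothetical read-modify-write model: a write smaller than the
    filesystem block must first read the whole block it lands in; a
    full-block write needs no read."""
    return blocksize if write_size < blocksize else 0

K = 1024
# Default 128k ZFS recordsize: each random 4k write reads 128k first.
print(bytes_read_per_write(4 * K, 128 * K))  # -> 131072
# recordsize set to match the application's 4k records: no read needed.
print(bytes_read_per_write(4 * K, 4 * K))    # -> 0
```

That 32x read amplification on every small write is why matching recordsize to the application's record size matters for this workload.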
Re: [zfs-discuss] ZFS compression
On Tue, Aug 22, 2006 at 08:43:32AM -0700, roland wrote:
> can someone tell how effective zfs compression and space-efficiency
> (regarding small files) is? since compression works at the block level,
> i assume compression may not come into effect as some may expect.
> (maybe i'm wrong here)

It's true that since we are compressing a block at a time, some of the
efficiencies of whole-large-file compression will be lost. However, since
ZFS uses 128k blocks on large files, the difference should be negligible.
For smaller files, ZFS uses a single block that exactly fits the file
(compressed or not), rounded up to the nearest sector size (512 bytes). So
I believe that ZFS's compression infrastructure permits good efficiency.

However, at this point we have only implemented one compression algorithm,
which is much faster than, but does not compress as much as, gzip. We plan
to implement a broader range of compression algorithms in the future.

--matt
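The block-at-a-time point can be demonstrated with ordinary zlib (a
stand-in only; this sketch does not model ZFS's actual compression
algorithm or on-disk layout): compressing per block gives up a little
versus compressing the whole file, and larger blocks give up less.

```python
import zlib

def compress_blocks(data: bytes, block_size: int) -> int:
    """Total compressed size when each block is compressed independently,
    as a block-level filesystem must do (simplified model)."""
    total = 0
    for off in range(0, len(data), block_size):
        total += len(zlib.compress(data[off:off + block_size]))
    return total

data = b"a highly repetitive line of text\n" * 8192   # ~256k sample
whole = len(zlib.compress(data))                      # whole-file baseline
per_128k = compress_blocks(data, 128 * 1024)          # ZFS-sized blocks
per_4k = compress_blocks(data, 4 * 1024)              # much smaller blocks
print(whole, per_128k, per_4k)
```

On data like this, per_128k lands close to the whole-file size while
per_4k is noticeably larger, since each small block pays its own header
and starts with no history.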
Re: [zfs-discuss] Issue with zfs snapshot replication from version2 to version3 pool.
Shane, I wasn't able to reproduce this failure on my system. Could you try
running Eric's D script below and send us the output while running 'zfs
list'?

thanks,
--matt

On Fri, Aug 18, 2006 at 09:47:45AM -0700, Eric Schrock wrote:
> Can you send the output of this D script while running 'zfs list'?
>
> #!/sbin/dtrace -s
>
> zfs_ioc_snapshot_list_next:entry
> {
>         trace(stringof(args[0]->zc_name));
> }
>
> zfs_ioc_snapshot_list_next:return
> {
>         trace(arg1);
> }
>
> - Eric
>
> On Fri, Aug 18, 2006 at 09:27:36AM -0700, Shane Milton wrote:
>> I did a little bit of digging and didn't turn up any known issues. Any
>> insight would be appreciated.
>>
>> Basically I replicated a zfs snapshot from a version2 storage pool into
>> a version3 pool and it seems to have corrupted the version3 pool. At
>> the time of the error both pools were running on the same system (amd64
>> build44). The command used was something similar to the following:
>>
>> zfs send [EMAIL PROTECTED] | zfs recv [EMAIL PROTECTED]
>>
>> 'zfs list', 'zfs list -r version3pool_name', and 'zpool destroy
>> version3pool_name' all end with a core dump. After a little digging
>> with mdb and truss, it seems to be dying around the function
>> ZFS_IOC_SNAPSHOT_LIST_NEXT. I'm away from the system at the moment, but
>> do have some of the core files and truss output for those interested.
>>
>> # truss zfs list
>> execve(/sbin/zfs, 0x08047E90, 0x08047E9C) argc = 2
>> resolvepath(/usr/lib/ld.so.1, /lib/ld.so.1, 1023) = 12
>> resolvepath(/sbin/zfs, /sbin/zfs, 1023) = 9
>> sysconfig(_CONFIG_PAGESIZE) = 4096
>> xstat(2, /sbin/zfs, 0x08047C48) = 0
>> open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
>> xstat(2, /lib/libzfs.so.1, 0x08047448) = 0
>> resolvepath(/lib/libzfs.so.1, /lib/libzfs.so.1, 1023) = 16
>> open(/lib/libzfs.so.1, O_RDONLY) = 3
>> ...
ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08045FBC) = 0
ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08046DFC) = 0
ioctl(3, ZFS_IOC_OBJSET_STATS, 0x080450BC) = 0
ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08045EFC) Err#3 ESRCH
ioctl(3, ZFS_IOC_SNAPSHOT_LIST_NEXT, 0x08045EFC) Err#22 EINVAL
fstat64(2, 0x08044EE0) = 0
internal error: write(2, i n t e r n a l   e r r.., 16) = 16
Invalid argumentwrite(2, I n v a l i d   a r g u.., 16) = 16
write(2, \n, 1) = 1
sigaction(SIGABRT, 0x, 0x08045E30) = 0
sigaction(SIGABRT, 0x08045D70, 0x08045DF0) = 0
schedctl() = 0xFEBEC000
lwp_sigmask(SIG_SETMASK, 0x, 0x) = 0xFFBFFEFF [0x]
lwp_kill(1, SIGABRT) = 0
Received signal #6, SIGABRT [default]
siginfo: SIGABRT pid=1444 uid=0 code=-1

Thanks
-Shane

This message posted from opensolaris.org

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Re: [zfs-discuss] zpool import: snv_33 to S10 6/06
On Wed, Aug 23, 2006 at 09:57:04AM -0400, James Foronda wrote:
> Hi,
>
> [EMAIL PROTECTED] cat /etc/release
> Solaris Nevada snv_33 X86
> Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
> Use is subject to license terms.
> Assembled 06 February 2006
>
> I have zfs running well on this box. Now I want to upgrade to the
> Solaris 10 6/06 release. Question: Will the 6/06 release recognize the
> zfs created by snv_33? I seem to recall something about being at a
> certain release level for 6/06 to be able to import without problems. I
> searched the archives but I can't find where I read that anymore.

Yes, new releases of Solaris can seamlessly access any ZFS pools created
with Solaris Nevada or 10 (but not pools from before ZFS was integrated
into Solaris, in October 2005). However, once you upgrade to build 35 or
later (including S10 6/06), do not downgrade back to build 34 or earlier,
per the following message:

  Summary: If you use ZFS, do not downgrade from build 35 or later to
  build 34 or earlier.

  This putback (into Solaris Nevada build 35) introduced a
  backwards-compatible change to the ZFS on-disk format. Old pools will
  be seamlessly accessed by the new code; you do not need to do anything
  special. However, do *not* downgrade from build 35 or later to build 34
  or earlier. If you do so, some of your data may be inaccessible with
  the old code, and attempts to access this data will result in an
  assertion failure in zap.c.

  We have fixed the version-checking code so that if a similar change
  needs to be made in the future, the old code will fail gracefully with
  an informative error message.

After upgrading, you should consider running 'zpool upgrade' to enable
the latest features of ZFS, including ditto blocks, hot spares, and
double-parity RAID-Z.

--matt
Re: [zfs-discuss] zpool import: snv_33 to S10 6/06
On Thu, Aug 24, 2006 at 08:12:34AM +1000, Boyd Adamson wrote:
> Isn't the whole point of the zpool upgrade process to allow users to
> decide when they want to remove the "fall back to old version" option?
> In other words, shouldn't any change that eliminates going back to an
> old rev require an explicit zpool upgrade?

Yes, that is exactly the case. Unfortunately, builds prior to 35 had some
latent bugs which made implementation of 'zpool upgrade' nontrivial. Thus
we issued this one-time "do not downgrade" message and promptly
implemented 'zpool upgrade'.

--matt
[zfs-discuss] space accounting with RAID-Z
I just realized that I forgot to send this message to zfs-discuss back in
May when I fixed this bug. Sorry for the delay.

The putback of the following bug fix to Solaris Nevada build 42 and
Solaris 10 update 3 build 3 (coinciding with the change to ZFS on-disk
version 3) changes the behavior of space accounting when using pools with
raid-z:

  6288488 du reports misleading size on RAID-Z

The old behavior is that on raidz vdevs, the space used and available
includes the space used to store the data redundantly (ie. the parity
blocks). On mirror vdevs, and in all other products' RAID-4/5
implementations, it does not, leading to confusion. Customers are
accustomed to the redundant space not being reported, so this change
makes zfs do that for raid-z vdevs as well.

The new behavior applies to:

(a) newly created pools (with version 3 or later)
(b) old (version 1 or 2) pools which, when 'zpool upgrade'-ed, did not
    have any raid-z vdevs (but have since 'zpool add'-ed a raid-z vdev)

Note that the space accounting behavior will never change on old raid-z
pools. If the new behavior is desired, these pools must be backed up,
destroyed, and re-'zpool create'-ed.

The 'zpool list' output is unchanged (ie. it still includes the space
used for parity information). This is bug 6308817 "discrepancy between
zfs and zpool space accounting".

The reported space used may be slightly larger than the parity-free size,
because the amount of space used to store parity with RAID-Z varies
somewhat with blocksize (eg. even small blocks need at least 1 sector of
parity). On most workloads[*], the overwhelming majority of space is
stored in 128k blocks, so this effect is typically not very pronounced.

--matt

[*] One workload where this effect can be noticeable is when the
'recordsize' property has been decreased, eg. for a database or zvol.
However, in this situation the rounding-error space can be completely
eliminated by using an appropriate number of disks in the raid-z group,
according to the following table:

                exact optimal number of disks
  recordsize      raidz1          raidz2
  8k+             3, 5 or 9       6, 10 or 18
  4k              3 or 5          6 or 10
  2k              3               6
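The "even small blocks need at least 1 sector of parity" effect can be
sketched with a simplified allocation model (my own illustration, not ZFS
code; it ignores the raid-z allocation rounding that the table above
accounts for):

```python
import math

# Simplified raid-z space model: each stripe row of (ndisks - nparity)
# data sectors gets `nparity` parity sectors, so even a one-sector
# block needs a full row's worth of parity.
SECTOR = 512

def raidz_sectors(block_bytes: int, ndisks: int, nparity: int = 1):
    """Return (data_sectors, parity_sectors) for one block."""
    data = math.ceil(block_bytes / SECTOR)
    rows = math.ceil(data / (ndisks - nparity))
    parity = rows * nparity
    return data, parity

# A 128k block on 5-disk raidz1: 256 data sectors, 64 parity -> 25%.
d, p = raidz_sectors(128 * 1024, ndisks=5)
print(d, p, p / d)

# A 512-byte block on the same group: 1 data sector still needs 1
# parity sector -> 100% overhead, which is why small-recordsize
# workloads see more parity space per byte of data.
d, p = raidz_sectors(512, ndisks=5)
print(d, p, p / d)
```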
Re: [zfs-discuss] Need Help: didn't create the pool as radiz but stripes
On Thu, Aug 24, 2006 at 10:12:12AM -0600, Arlina Goce-Capiral wrote:
> It does appear that the disk is filled up by 140G.

So this confirms what I was saying: they are only able to write
(ndisks-1) disks' worth of data (in this case, ~68GB * (3-1) == ~136GB).
So there is no unexpected behavior with respect to the size of their
raid-z pool, just the known (and now fixed) bug.

> I think I now know what happened. I created a raidz pool and I did not
> write any data to it before I just pulled out a disk. So I believe the
> zfs filesystem did not initialize yet, and this is why my zfs filesystem
> was unusable. Can you confirm this?

No, that should not be the case. As soon as the 'zfs' or 'zpool' command
completes, everything will be on disk for the requested action.

> But when I created a zfs filesystem and wrote data to it, it could now
> lose a disk and just be degraded. I tested this part by removing the
> disk partition in format.

Well, it sounds like you are testing two different things: first you
tried physically pulling out a disk, then you tried re-partitioning a
disk. It sounds like there was a problem when you pulled out the disk. If
you can describe the problem further (Did the machine panic? What was the
panic message?) then perhaps we can diagnose it.

> I will try this same test to reproduce my issue, but can you confirm for
> me whether my zfs raidz filesystem requires me to write data to it first
> before it's really ready?

No, that should not be the case.

> Any idea when Solaris 10 update 3 (11/06) will be released?

I'm not sure, but November or December sounds about right. And of course,
if they want the fix sooner they can always use Solaris Express or
OpenSolaris!

--matt
Re: [zfs-discuss] unaccounted for daily growth in ZFS disk space usage
On Thu, Aug 24, 2006 at 07:07:45AM -0700, Joe Little wrote:
> We finally flipped the switch on one of our ZFS-based servers, with
> approximately 1TB of 2.8TB used (3 stripes of 950GB or so, each of which
> is a RAID5 volume on the adaptec card). We have snapshots every 4 hours
> for the first few days. If you add up the snapshot references it appears
> somewhat high versus daily use (mostly mail boxes, spam, etc changing),
> but say an aggregate of no more than 400+MB a day. However, zfs list
> shows our daily pool as a whole, and per day we are growing by 0.08TB,
> or more specifically 80GB a day. That's a far cry different from the
> 400MB we can account for. Is it possible that metadata/ditto blocks, or
> the like, is truly growing that rapidly? By our calculations, we will
> triple our disk space (sitting still) in 6 months and use up the
> remaining 1.7TB. Of course, this is only with 2-3 days of churn, but
> it's an alarming rate, where before on the NetApp we didn't see anything
> close to this rate.

How are you calculating this 400MB/day figure? Keep in mind that the
space used by each snapshot is the amount of space unique to that
snapshot. Adding up the space used by all your snapshots is *not* the
amount of space that they are all taking up cumulatively. For leaf
filesystems (those with no descendents), you can calculate the space used
by all snapshots as (fs's used - fs's referenced).

How many filesystems do you have? Can you send me the output of 'zfs
list' and 'zfs get -r all pool'? How much space did you expect to be
using, and what data is that based on? Are you sure you aren't writing
80GB/day to your pool?

--matt
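The leaf-filesystem rule above amounts to a one-line subtraction. With
hypothetical numbers (my own illustration, not figures from this thread):

```python
# For a leaf filesystem (no descendents):
#   "used"       = everything charged to the filesystem, snapshots included
#   "referenced" = what the live filesystem currently points to
# so the space held by ALL snapshots together is used - referenced.
# Summing each snapshot's own "used" column would understate this,
# because a snapshot's "used" counts only blocks unique to it.
fs_used = 120 * 1024        # MB, hypothetical
fs_referenced = 100 * 1024  # MB, hypothetical

snapshot_space = fs_used - fs_referenced
print(f"space held by all snapshots: {snapshot_space} MB")
```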
Re: [zfs-discuss] ZFS and very large directories
On Thu, Aug 24, 2006 at 01:15:51PM -0500, Nicolas Williams wrote:
> I just tried creating 150,000 directories in a ZFS root directory. It
> was speedy. Listing individual directories (lookup) is fast.

Glad to hear that it's working well for you!

> Listing the large directory isn't, but that turns out to be either
> terminal I/O or collation in a UTF-8 locale (which is what I use; a
> simple DTrace script showed that to be my problem):
>
> % ptime ls
> ...
> real    9.850
> user    6.263   <- ouch, UTF-8 hurts
> sys     0.245

Yep, beware of using 'ls' on large directories! See also:

  6299769 'ls' memory usage is excessive
  6299767 'ls -f' should not buffer output

--matt
Re: Re: [zfs-discuss] unaccounted for daily growth in ZFS disk space usage
On Thu, Aug 24, 2006 at 02:21:33PM -0700, Joe Little wrote:
> Well, by deleting my 4-hourlies I reclaimed most of the space. To answer
> some of the questions: it's about 15 filesystems (descendents included).
> I'm aware of the space used by snapshots overlapping. I was looking at
> the total space (zpool iostat reports) and seeing the diff per day. The
> 400MB/day was by inspection and by looking at our nominal growth on a
> netapp. It would appear that if one takes many snapshots, there is an
> initial quick growth in disk usage, but once those snapshots meet their
> retention level (say 12), the growth would appear to match our typical
> 400MB/day. Time will prove this one way or the other. By simply getting
> rid of hourly snapshots and collapsing to dailies for two days' worth,
> I reverted to only ~1-2GB total growth, which is much more in line with
> expectations.

OK, so it sounds like there is no problem here, right? You were taking
snapshots every 4 hours, which took up no more space than was needed, but
more than you would like (and more than daily snapshots). Using daily
snapshots, the space usage is in line with daily snapshots on NetApp.

> For various reasons, I can't post the zfs list type results as yet. I'll
> need to get the ok for that first. Sorry.

It sounds like there is no problem here, so no need to post the output.

--matt
Re: [zfs-discuss] ZFS + rsync, backup on steroids.
James Dickens wrote:
>> Why not make snapshots on production and then send incremental backups
>> over the net? Especially with a lot of files it should be MUCH faster
>> than rsync.
>
> because it's a ZFS-limited solution; if the source is not ZFS it won't
> work, and I'm not sure how much faster incrementals would be than rsync,
> since rsync only shares checksums until it finds a block that has
> changed.

'zfs send' is *incredibly* faster than rsync. rsync needs to traverse all
the metadata, so it is fundamentally O(all metadata). It needs to read
every directory and stat every file to figure out what's been changed.
Then it needs to read all of every changed file to figure out what parts
of it have been changed.

In contrast, 'zfs send' essentially only needs to read the changed data,
so it is O(changed data). We can do this by leveraging our knowledge of
the zfs internal structure, eg. block birth times.

That said, there is still a bunch of low-hanging performance fruit in
'zfs send', which I'll be working on over the next few months. And of
course, if you need a cross-filesystem tool then 'zfs send' is not for
you. But give it a try if you can, and let us know how it works for you!

--matt
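The birth-time idea can be sketched in a few lines (an illustration of
the principle only; the names and structure here are made up and do not
reflect ZFS's actual on-disk format):

```python
# Every block pointer records the transaction group ("birth time") in
# which it was written.  An incremental send from snapshot at txg T can
# therefore skip any subtree whose root was born at or before T, making
# the traversal proportional to the changed data, not all metadata.

class Block:
    def __init__(self, birth_txg, children=(), data=None):
        self.birth_txg = birth_txg
        self.children = list(children)
        self.data = data

def changed_blocks(block, from_txg):
    """Yield only data blocks written after from_txg, pruning old subtrees."""
    if block.birth_txg <= from_txg:
        return                      # unchanged: skip the whole subtree
    if block.data is not None:
        yield block
    for child in block.children:
        yield from changed_blocks(child, from_txg)

old = Block(5, data=b"old")         # untouched since txg 5
new = Block(12, data=b"new")        # rewritten in txg 12
root = Block(12, children=[old, new])
print([b.data for b in changed_blocks(root, from_txg=10)])  # [b'new']
```

rsync, with no such bookkeeping available, has to stat and checksum its
way to the same answer.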
Re: [zfs-discuss] ZFS + rsync, backup on steroids.
Dick Davies wrote:
> On 30/08/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
>> 'zfs send' is *incredibly* faster than rsync.
>
> That's interesting. We had considered it as a replacement for a certain
> task (publishing a master docroot to multiple webservers), but a quick
> test with ~500MB of data showed the zfs send/recv to be about 5x slower
> than rsync for the initial copy. You're saying subsequent copies (zfs
> send -i?) should be faster?

Yes. The architectural benefits of 'zfs send' over rsync only apply to
sending incremental changes. When sending a full backup, both schemes
have to traverse all the metadata and send all the data, so they *should*
be about the same speed.

However, as I mentioned, there are still some low-hanging performance
issues with 'zfs send', although I'm surprised that it was 5x slower than
rsync! I'd like to look into that issue some more... What type of files
were you sending? Eg. approximately what size files, how many files, how
many files per directory?

--matt
Re: [zfs-discuss] ZFS with expanding LUNs
Theo Bongers wrote:
> Please can anyone tell me how to handle a LUN that is expanded (on a
> RAID array or SAN storage) and grow the filesystem without data loss?
> How does ZFS look at the volume? In other words, how can I grow the
> filesystem after LUN expansion? Do I need to
> format/type/autoconfigure/label the specific device?

I believe that if you have given ZFS the whole disk, then it will
automatically detect that the LUN has grown when it opens the device. You
can cause this to happen by rebooting the machine, or running
'zpool export poolname; zpool import poolname'.

--matt
Re: [zfs-discuss] Re: ZFS + rsync, backup on steroids.
Roch wrote:
> Matthew Ahrens writes:
>> Robert Milkowski wrote:
>>> IIRC unmounting a ZFS file system won't flush its caches - you've got
>>> to export the entire pool.
>>
>> That's correct. And I did ensure that the data was not cached before
>> each of my tests.
>
> Matt? It seems to me that (at least in the past) unmount would actually
> cause the data to not be accessible (a read would issue an I/O), even if
> potentially the memory associated with previously cached data was not
> quite reaped back to the OS.

Looks like you're right, we do (mostly) evict the data when a filesystem
is unmounted. The exception is if some of its cached data is being shared
with another filesystem (eg. via a clone fs); then that data will not be
evicted.

--matt
Re: [zfs-discuss] migrating data across boxes
John Beck wrote:
> % zfs snapshot -r [EMAIL PROTECTED]
> % zfs send space/[EMAIL PROTECTED] | ssh newbox zfs recv -d space
> % zfs send space/[EMAIL PROTECTED] | ssh newbox zfs recv -d space
> ...
> % zfs set mountpoint=/export/home space
> % zfs set mountpoint=/usr/local space/local
> % zfs set sharenfs=on space/jbeck space/local

I'm working on some enhancements to zfs send/recv that will simplify this
even further, especially in cases where you have many filesystems,
snapshots, or changed properties. In particular, you'll be able to simply
do:

  # zfs snapshot -r [EMAIL PROTECTED]
  # zfs send -r -b [EMAIL PROTECTED] | ssh newbox zfs recv -p -d newpool

The send -b flag means to send from the beginning. This will send a full
stream of the oldest snapshot, and incrementals up to the named snapshot
(eg. from @a to @b, from @b to @c, ... @j to @today). This way your new
pool will have all of the snapshots from your old pool.

The send -r flag means to do this for all the filesystem's descendants as
well (in this case, space/jbeck and space/local).

The recv -p flag means to preserve locally-set properties (in this case,
the mountpoint and sharenfs settings).

For more information, see RFEs 6421959 and 6421958, and watch for a
forthcoming formal interface proposal.

--matt
Re: [zfs-discuss] zfs clones
Marlanne DeLaSource wrote:
> As I understand it, the snapshot of a set is used as a reference by the
> clone. So the clone is initially a set of pointers to the snapshot.
> That's why it is so fast to create. How can I separate it from the
> snapshot? (so that df -k or zfs list will display, for a 48G drive,
>
>   pool/fs1     4G  40G
>   pool/clone   4G  40G
>
> instead of
>
>   pool/fs1     4G  44G
>   pool/clone   4G  44G
>
> ) I hope I am clear enough :/

There is no way to separate a clone from its origin snapshot.

I think the numbers you're posting are:

  FS          REFD  AVAIL
  pool/fs1    4G    40G
  pool/clone  4G    40G

So you want it to say that less space is available than really is?
Perhaps what you want is to set a reservation on the clone for its
initial size, so that you will be guaranteed to have enough space to
overwrite its initial contents with new contents of the same size?

--matt
Re: [zfs-discuss] Need Help: Getting error zfs:bad checksum (read on unknown off...)
Arlina Goce-Capiral wrote:
> Customer's main concern right now is to make the system bootable, but it
> seems they couldn't do that since the bad disk is part of the zfs
> filesystems. Is there a way to disable or clear out the bad zfs
> filesystem so the system can be booted?

Yes, see this FAQ: http://opensolaris.org/os/community/zfs/faq/#zfspanic

quote:

  What can I do if ZFS panics on every boot?

  ZFS is designed to survive arbitrary hardware failures through the use
  of redundancy (mirroring or RAID-Z). Unfortunately, certain failures in
  non-replicated configurations can cause ZFS to panic when trying to
  load the pool. This is a bug, and will be fixed in the near future
  (along with several other nifty features like background scrubbing and
  the ability to see a list of corrupted files). In the meantime, if you
  find yourself in the situation where you cannot boot due to a corrupt
  pool, do the following:

  1. boot using '-m milestone=none'
  2. # mount -o remount /
  3. # rm /etc/zfs/zpool.cache
  4. # reboot

  This will remove all knowledge of pools from your system. You will have
  to re-create your pool and restore from backup.

--matt
[zfs-discuss] Proposal: multiple copies of user data
Here is a proposal for a new 'copies' property which would allow
different levels of replication for different filesystems. Your comments
are appreciated!

--matt

A. INTRODUCTION

ZFS stores multiple copies of all metadata. This is accomplished by
storing up to three DVAs (Disk Virtual Addresses) in each block pointer.
This feature is known as "ditto blocks". When possible, the copies are
stored on different disks. See bug 6410698 "ZFS metadata needs to be more
highly replicated (ditto blocks)" for details on ditto blocks.

This case will extend this feature to allow system administrators to
store multiple copies of user data as well, on a per-filesystem basis.
These copies are in addition to any redundancy provided at the pool level
(mirroring, raid-z, etc).

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored. Its value must be 1, 2, or 3.
Like other properties (eg. checksum, compression), it only affects
newly-written data. As such, it is recommended that the 'copies' property
be set at filesystem-creation time (eg. 'zfs create -o copies=2
pool/fs'). The pool must be at least on-disk version 2 to use this
feature (see 'zpool upgrade').

By default (copies=1), only two copies of most filesystem metadata are
stored. However, if we are storing multiple copies of user data, then 3
copies (the maximum) of filesystem metadata will be stored.

This feature is similar to using mirroring, but differs in several
important ways:

* Different filesystems in the same pool can have different numbers of
  copies.
* The storage configuration is not constrained as it is with mirroring
  (eg. you can have multiple copies even on a single disk).
* Mirroring offers slightly better performance, because only one DVA
  needs to be allocated.
* Mirroring offers slightly better redundancy, because one disk from
  each mirror can fail without data loss.
It is important to note that the copies provided by this feature are in
addition to any redundancy provided by the pool configuration or the
underlying storage. For example:

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default)
  will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any
  1 disk failing without data loss.

* In a pool with 2-way mirrors, a filesystem with copies=3 will be
  stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks
  failing without data loss (assuming that there are at least ncopies=3
  mirror groups).

* In a pool with single-parity raid-z, a filesystem with copies=2 will
  be stored with 2 copies, each copy protected by its own parity block.
  The filesystem can tolerate any 3 disks failing without data loss
  (assuming that there are at least ncopies=2 raid-z groups).

C. MANPAGE CHANGES

*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
      they are inherited.

+     copies=1 | 2 | 3
+         Controls the number of copies of data stored for this dataset.
+         These copies are in addition to any redundancy provided by the
+         pool (eg. mirroring or raid-z). The copies will be stored on
+         different disks if possible.
+
+         Changing this property only affects newly-written data.
+         Therefore, it is recommended that this property be set at
+         filesystem creation time, using the '-o copies=' option.
+
      Temporary Mountpoint Properties

      When a file system is mounted, either through mount(1M) for
      legacy mounts or the zfs mount command for normal file

D. REFERENCES
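The multiplication behind the examples above is simple best-case
arithmetic; a tiny sketch (my own illustration, which assumes the copies
really do land on distinct disks):

```python
# Best-case model: total physical copies of a block
#   = pool-level redundancy (mirror ways) x the 'copies' property,
# and any (total - 1) of the disks holding them can fail.
def total_copies(mirror_ways: int, copies: int) -> int:
    return mirror_ways * copies

def tolerable_failures(mirror_ways: int, copies: int) -> int:
    return total_copies(mirror_ways, copies) - 1

print(total_copies(2, 1), tolerable_failures(2, 1))  # 2-way mirror, copies=1
print(total_copies(2, 3), tolerable_failures(2, 3))  # 2-way mirror, copies=3
```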
Re: [zfs-discuss] Proposal: multiple copies of user data
James Dickens wrote:
> On 9/11/06, Matthew Ahrens [EMAIL PROTECTED] wrote:
>> B. DESCRIPTION
>>
>> A new property will be added, 'copies', which specifies how many
>> copies of the given filesystem will be stored. Its value must be 1, 2,
>> or 3. Like other properties (eg. checksum, compression), it only
>> affects newly-written data. As such, it is recommended that the
>> 'copies' property be set at filesystem-creation time (eg. 'zfs create
>> -o copies=2 pool/fs').
>
> would the user be held accountable for the space used by the extra
> copies?

Doh! Sorry I forgot to address that. I'll amend the proposal and manpage
to include this information... Yes, the space used by the extra copies
will be accounted for, eg. in stat(2), ls -s, df(1m), du(1), and zfs
list, and will count against the user's quota.

> so if a user has a 1GB quota and stores one 512MB file with two copies
> activated, all his space will be used?

Yes, and as mentioned this will be reflected in all the space accounting
tools.

> what happens if the same user stores a file that is 756MB on the
> filesystem with multiple copies enabled and a 1GB quota? does the save
> fail?

Yes, they will get ENOSPC and see that their filesystem is full.

> How would the user tell that his filesystem is full, since all the
> tools he is used to report that he is using only 1/2 the space?

Any tool will report that in fact all space is being used.

> Is there a way for the sysadmin to get rid of the excess copies should
> disk space needs require it?

No, not without rewriting them. (This is the same behavior we have today
with the 'compression' and 'checksum' properties. It's a long-term goal
of ours to be able to go back and change these things after the fact
("scrub them in", so to say), but with snapshots, this is extremely
nontrivial to do efficiently and without increasing the amount of space
used.)

> If I start out with 2 copies and later change it to only 1 copy, do the
> files created before keep their 2 copies?

Yep, the property only affects newly-written data.
> what happens if root needs to store a copy of an important file and
> there is no space, but there is space if extra copies are reclaimed?

They will get ENOSPC.

> Will this be configurable behavior?

No.

--matt
Re: [zfs-discuss] Proposal: multiple copies of user data
Mike Gerdts wrote:
> Is there anything in the works to compress (or encrypt) existing data
> after the fact? For example, a special option to scrub that causes the
> data to be re-written with the new properties could potentially do
> this.

This is a long-term goal of ours, but with snapshots, it is extremely
nontrivial to do efficiently and without increasing the amount of space
used.

> If so, this feature should subscribe to any generic framework provided
> by such an effort.

Yep, absolutely.

>> * Mirroring offers slightly better redundancy, because one disk from
>>   each mirror can fail without data loss.
>
> Is this use of "slightly" based upon disk failure modes? That is, when
> disks fail, do they tend to get isolated areas of badness compared to
> complete loss? I would suggest that "complete loss" should include
> someone tripping over the power cord to the external array that houses
> the disk.

I'm basing this "slightly better" call on a model of random, whole-disk
failures; I know that this is only an approximation. With many mirrors,
most (but not all) 2-disk failures can be tolerated. With copies=2,
almost no 2-top-level-vdev failures will be tolerated, because it's
likely that *some* block will have both its copies on those 2 disks.
With mirrors, you can also arrange to mirror across cabinets, not within
them, which you can't do with copies.

>> It is important to note that the copies provided by this feature are
>> in addition to any redundancy provided by the pool configuration or
>> the underlying storage. For example:
>
> All of these examples seem to assume that there are six disks.

Not really. There could be any number of mirrors or raid-z groups
(although, I note, you need at least 'copies' groups to survive the
maximum number of whole-disk failures).

>> * In a pool with 2-way mirrors, a filesystem with copies=1 (the
>>   default) will be stored with 2 * 1 = 2 copies. The filesystem can
>>   tolerate any 1 disk failing without data loss.
>> * In a pool with 2-way mirrors, a filesystem with copies=3 will be
>>   stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5
>>   disks failing without data loss (assuming that there are at least
>>   ncopies=3 mirror groups).
>
> This one assumes the best-case scenario with 6 disks. Suppose you had
> 4 x 72 GB and 2 x 36 GB disks. You could end up with multiple copies on
> the 72 GB disks.

Yes, all these examples assume that our policy of putting the copies on
different disks when possible actually worked out. It will almost
certainly work out unless you have a small number of different-sized
devices, or are running with very little free space. If you need hard
guarantees, you need to use actual mirroring.

> Any statement about physical location on the disk? It would seem as
> though locating two copies sequentially on the disk would not provide
> nearly the amount of protection as having them fairly distant from
> each other.

Yep, if the copies can't be stored on different disks, they will be
stored spread out on the same disk if possible (I think we aim for one
on each quarter of the disk).

--matt
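A rough model of why *some* block is likely to be hit (my own
illustration, not part of the proposal): if each block's two copies land
on a uniformly random pair of n disks, the chance that a particular
2-disk failure destroys both copies of at least one of many blocks tends
to 1.

```python
from math import comb

# Probability that, after a specific pair of disks fails, at least one
# of `nblocks` blocks had both of its copies on exactly that pair
# (copies placed on an independent uniform-random pair per block).
def p_some_block_lost(ndisks: int, nblocks: int) -> float:
    p_block_on_failed_pair = 1 / comb(ndisks, 2)
    return 1 - (1 - p_block_on_failed_pair) ** nblocks

print(p_some_block_lost(8, 10))         # few blocks: often survivable
print(p_some_block_lost(8, 1_000_000))  # many blocks: data loss ~certain
```

This is why copies=2 survives "almost no" 2-top-level-vdev failures,
while a mirrored pool survives any 2-disk failure that doesn't take out
both halves of the same mirror.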
Re: [zfs-discuss] Proposal: multiple copies of user data
James Dickens wrote: though I think this is a cool feature, I think it needs more work. I think there should be an option to make extra copies expendable. So the extra copies are a request: if the space is available, make them; if not, complete the write, and log the event. Are you asking for the extra copies that have already been written to be dynamically freed up when we are running low on space? That could be useful, but it isn't the problem I'm trying to solve with the 'copies' property (not to mention it would be extremely difficult to implement). If the user really requires guaranteed extra copies, then use mirrored or raided disks. Right, if you want everything to have extra redundancy, that use case is handled just fine today by mirrors or RAIDZ. The case where 'copies' is useful is when you want some data to be stored with more redundancy than others, without the burden of setting up different pools. It seems just to be a nightmare for the administrator: you start with 3 copies and then change to 2 copies, and you will have phantom copies that are only known to exist by the OS; it won't show in any reports, and zfs list doesn't have an option to show which files have multiple clones and which don't. There is no way to destroy multiple clones without rewriting every file on the disk. (I'm assuming you mean copies, not clones.) So would you prefer that the property be restricted to only being set at filesystem creation time, and not changed later? That way the number of copies of all files in the filesystem is always the same. It seems like the issue of knowing how many copies there are would be much worse in the system you're asking for, where the extra copies are freed up as needed... --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and free space
Robert Milkowski wrote: Hello Mark, Monday, September 11, 2006, 4:25:40 PM, you wrote: MM Jeremy Teo wrote: Hello, how are writes distributed as the free space within a pool reaches a very small percentage? I understand that when free space is available, ZFS will batch writes and then issue them in sequential order, maximising write bandwidth. When free space reaches a minimum, what happens? Thanks! :) MM Just what you would expect to happen: MM As contiguous write space becomes unavailable, writes will become MM scattered and performance will degrade. More importantly: at this MM point ZFS will begin to heavily write-throttle applications in order MM to ensure that there is sufficient space on disk for the writes to MM complete. This means that there will be fewer writes to batch up MM in each transaction group for contiguous IO anyway. MM As with any file system, performance will tend to degrade at the MM limits. ZFS keeps a small overhead reserve (much like other file MM systems) to help mitigate this, but you will definitely see an MM impact. I hope it won't be a problem if space is getting low in a file system with a quota set, while in the pool the file system is in there's plenty of space, right? If you are running close to your quota, there will be a little bit of performance degradation, but not to the same degree as when running low on free space in the pool. The reason performance degrades when you're near your quota is that we aren't exactly sure how much space will be used until we actually get around to writing it out (due to compression, snapshots, etc). So we have to write things out in smaller batches (ie. flush out transaction groups more frequently than is optimal). --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
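The "smaller batches" behavior near a quota can be sketched with a toy model. This is purely illustrative (the function and the worst_case_ratio knob are invented, not ZFS internals): because the final on-disk size isn't known until write time, the open transaction group is flushed whenever its worst-case space estimate would overrun the remaining quota headroom, so tight headroom means more, smaller groups.

```python
def batch_writes(write_sizes, quota_headroom, worst_case_ratio=1.0):
    """Hypothetical write-throttle sketch: group writes into
    transaction groups, flushing whenever the worst-case space
    estimate for the open group would exceed the headroom left
    under the quota.  Smaller headroom => more, smaller groups."""
    groups, current, estimate = [], [], 0
    for size in write_sizes:
        worst = size * worst_case_ratio
        if current and estimate + worst > quota_headroom:
            groups.append(current)
            current, estimate = [], 0
        current.append(size)
        estimate += worst
    if current:
        groups.append(current)
    return groups

# Plenty of headroom: one big group.  Near the quota: many small flushes.
print(len(batch_writes([10] * 8, quota_headroom=1000)))  # 1
print(len(batch_writes([10] * 8, quota_headroom=25)))    # 4
```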
Re: [zfs-discuss] Proposal: multiple copies of user data
Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now. Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Dick Davies wrote: For the sake of argument, let's assume: 1. disk is expensive 2. someone is keeping valuable files on a non-redundant zpool 3. they can't scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*) Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1. Also note that using files to back vdevs is not a recommended solution. If the user wants to make sure the file is 'safer' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever. It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution. For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I'd hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently. The redundancy you're talking about is what you'd get from 'cp /foo/bar.jpg /foo/bar.jpg.ok', except it's hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future. Whether it's hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn't cause any trouble when extending or porting ZFS. I'm afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven't seen a convincing use case. Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
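Matt's point about manually merging two partially-damaged copies is the crux. Here is a minimal sketch of the per-block repair that per-block copies give you transparently (hypothetical code, not the ZFS implementation): with a checksum recorded per block, a reader can take each block from whichever copy is still intact, which is exactly the reconstruction the 'cp bar.jpg bar.jpg.ok' approach leaves to the user.

```python
import hashlib

def good(block, checksum):
    """A block is usable if it matches its recorded checksum."""
    return block is not None and hashlib.sha256(block).hexdigest() == checksum

def reconstruct(copy_a, copy_b, checksums):
    """Hypothetical per-block repair: for each block, take whichever
    copy still matches the checksum; fail only if both are damaged."""
    out = []
    for a, b, ck in zip(copy_a, copy_b, checksums):
        if good(a, ck):
            out.append(a)
        elif good(b, ck):
            out.append(b)
        else:
            raise IOError("block lost in both copies")
    return b"".join(out)

blocks = [b"alpha", b"bravo", b"charlie"]
cks = [hashlib.sha256(b).hexdigest() for b in blocks]
# Copy A lost block 0, copy B lost block 2 -- a manual `cp` backup
# would leave you splicing these two files together by hand.
copy_a = [None, b"bravo", b"charlie"]
copy_b = [b"alpha", b"bravo", None]
print(reconstruct(copy_a, copy_b, cks))  # b'alphabravocharlie'
```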
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: Matthew Ahrens wrote: The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data than another? If that's the case, I'm still confused as to what failure cases would still allow you to retrieve your data if there are more than one copy in the fs or pool. But I'll gladly take some enlightenment. :) (My apologies for the length of this response, I'll try to address most of the issues brought up recently...) When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure). One case where this feature would be useful is if you have a pool with no redundancy (ie. no mirroring or raid-z), because most of the data in the pool is not very important. However, the pool may have a bunch of disks in it (say, four). The administrator/user may realize (perhaps later on) that some of their data really *is* important and they would like some protection against losing it if a disk fails. They may not have the option of adding more disks to mirror all of their data (cost or physical space constraints may apply here). 
Their problem is solved by creating a new filesystem with copies=2 and putting the important data there. Now, if a disk fails, then the data in the copies=2 filesystem will not be lost. Approximately 1/4 of the data in other filesystems will be lost. (There is a small chance that some tiny fraction of the data in the copies=2 filesystem will still be lost if we were forced to put both copies on the disk that failed.) Another plausible use case would be where you have some level of redundancy, say you have a Thumper (X4500) with its 48 disks configured into 9 5-wide single-parity raid-z groups (with 3 spares). If a single disk fails, there will be no data loss. However, if two disks within the same raid-z group fail, data will be lost. In this scenario, imagine that this data loss probability is acceptable for most of the data stored here, but there is some extremely important data for which this is unacceptable. Rather than reconfiguring the entire pool for higher redundancy (say, double-parity raid-z) and less usable storage, you can simply create a filesystem with copies=2 within the raid-z storage pool. Data within that filesystem will not be lost even if any three disks fail. I believe that these use cases, while not being extremely common, do occur. The extremely low amount of engineering effort required to implement the feature (modulo the space accounting issues) seems justified. The fact that this feature does not solve all problems (eg, it is not intended to be a replacement for mirroring) is not a downside; not all features need to be used in all situations :-) The real problem with this proposal is the confusion surrounding disk space accounting with copies>1. While the same issues are present when using compression, people are understandably less upset when files take up less space than expected. Given the current lack of interest in this feature, the effort required to address the space accounting issue does not seem justified at this time. 
--matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
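The Thumper example above can be checked with a small model. This sketch assumes the best case where the two copies of every block land in two different raid-z groups (which, as noted earlier in the thread, is best-effort rather than guaranteed): no 3-disk failure loses data, but a 2+2 failure across two groups can.

```python
from itertools import combinations

def group_of(disk, width=5):
    """Which raid-z group a disk belongs to (disks numbered 0..44)."""
    return disk // width

def data_lost(failed, width=5, ngroups=9, copies=2):
    """Toy model of the Thumper example: 9 single-parity raid-z
    groups, 5 disks wide.  A group loses data once 2 or more of its
    disks fail; with copies=2 a block is lost only if the (assumed
    distinct) groups holding both its copies each lose data."""
    broken = {g for g in range(ngroups)
              if sum(group_of(d, width) == g for d in failed) >= 2}
    return len(broken) >= copies

disks = range(45)  # 9 groups x 5 disks (the 3 spares are ignored here)
# No combination of 3 failed disks can break two groups at once:
print(any(data_lost(set(c)) for c in combinations(disks, 3)))  # False
# But 2 + 2 failures across two groups can:
print(data_lost({0, 1, 5, 6}))                                 # True
```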
Re: [zfs-discuss] Snapshots and backing store
Nicolas Dorfsman wrote: Hi, There's something really bizarre in the ZFS snapshot specs: "Uses no separate backing store." Hum... if I want to share one physical volume somewhere in my SAN as THE snapshot backing store... it becomes impossible to do! Really bad. Is there any chance of having a backing-store-file option in a future release? Along the same lines, it would be great to have some sort of property to add a disk/LUN/physical space to a pool, reserved only for the backing store. Right now, the only thing I can see to prevent users from using my backing-store space for their own purposes is to set a quota. If you want to copy your filesystems (or snapshots) to other disks, you can use 'zfs send' to send them to a different pool (which may even be on a different machine!). --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Snapshots and backing store
Nicolas Dorfsman wrote: We need to think of ZFS as ZFS, and not as just another filesystem! I mean, the whole concept is different. Agreed. So. What could be the best architecture? What is the problem? With UFS, I used to have separate metadevices/LUNs for each application. With ZFS, I thought it would be nice to use a separate pool for each application. Ick. It would be much better to have one pool, and a separate filesystem for each application. But, it means multiply snapshot backing-store OR dynamically remove/add this space/LUN to pool where we need to do backups. I don't understand this statement. What problem are you trying to solve? If you want to do backups, simply take a snapshot, then point your backup program at it. If you want faster incremental backups, use 'zfs send -i' to generate the file to backup. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Access to ZFS checksums would be nice and very useful feature
Bady, Brant RBCM:EX wrote: Actually to clarify - what I want to do is to be able to read the associated checksums ZFS creates for a file and then store them in an external system e.g. an oracle database most likely Rather than storing the checksum externally, you could simply let ZFS verify the integrity of the data. Whenever you want to check it, just run 'zpool scrub'. Of course, if you don't trust ZFS to do that for you, you probably wouldn't trust it to tell you the checksum either! --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
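For the external-audit use case: since ZFS's block checksums are internal and not exposed through any public interface, an application that wants checksums in an external database has to hash the file contents itself. A minimal sketch (the function names are mine, and the catalog is a stand-in for the Oracle table the poster mentions):

```python
import hashlib

def file_digest(path, algo="sha256", bufsize=1 << 20):
    """Application-level integrity record: hash the file's contents.
    Unlike ZFS's per-block checksums, this survives moving the file
    to another filesystem entirely."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def verify(catalog):
    """Re-hash each file and compare against the recorded digest.
    `catalog` maps path -> digest recorded at ingest time."""
    return {p: file_digest(p) == d for p, d in catalog.items()}
```

Note that this duplicates work 'zpool scrub' already does end-to-end inside the pool; the only reason to do it is if the audit trail must live outside ZFS.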
Re: [zfs-discuss] Re: Re: zfs clones
Jan Hendrik Mangold wrote: I didn't ask the original question, but I have a scenario where I want to use clone as well and encounter a (designed?) behaviour I am trying to understand. I create a filesystem A with ZFS and modify it to a point where I create a snapshot [EMAIL PROTECTED] Then I clone that snapshot to create a new filesystem B. I seem to have two filesystem entities I can make independent modifications and snapshots with/on/from. The problem I am running into is that when modifying A and wanting to roll back to the snapshot [EMAIL PROTECTED] I can't do that as long as the clone B is mounted. Is this a case where I would benefit from the ability to separate the clone? Or is this something not possible with ZFS? Hmm, actually this is unexpected; you shouldn't have to unmount the clone to do the rollback on the origin filesystem. I think that our command-line tool is simply being a bit overzealous. I've filed bug 6472202 to track this issue; it should be pretty straightforward to fix. Thanks for bringing this to our attention! --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: zfs clones
Mike Gerdts wrote: A couple scenarios from environments that I work in, using legacy file systems and volume managers: 1) Various test copies need to be on different spindles to remove any perceived or real performance impact imposed by one or the other. Arguably by having the IO activity spread across all the spindles there would be fewer bottlenecks. However, if you are trying to simulate the behavior of X production spindles, doing so with 1.3 X or 2 X spindles is not a proper comparison. Hence being wasteful and getting suboptimal performance may be desirable. If you don't understand that logic, you haven't worked in a big enough company or studied Dilbert enough. :) Here it makes sense to be using X spindles. However, using a clone filesystem will perform the same as a non-clone filesystem. So if you have enough space on those X spindles for the clone, I don't think there's any need for additional separation. Of course, this may not eliminate imagined performance difference (eg, your Dilbert reference :-), in which case you can simply use 'zfs send | zfs recv' to send the snapshot to a suitably-isolated pool/machine. 2) One of the copies of the data needs to be portable to another system while the original stays put. This could be done to refresh non-production instances from production, to perform backups in such a way that it doesn't put load on the production spindles, networks, etc. This is a case where you should be using multiple pools (possibly on the same host), and using 'zfs send | zfs recv' between them. In some cases, you may be able to attach the storage to the destination machine and use the network to move the data, eg. 'zfs send | ssh dest zfs recv'. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Fastest way to send 100gb ( with ZFS send )
Anantha N. Srirama wrote: You most certainly are hitting the SSH limitation. Note that SSH/SCP sessions are single-threaded and won't utilize all of the system resources even if they are available. You may want to try 'ssh -c blowfish' to use the (faster) blowfish encryption algorithm rather than the default of triple-DES. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to make an extended LUN size known to ZFS and Solaris
Michael Phua - PTS wrote: Hi, Our customer has a Sun Fire X4100 with Solaris 10 using ZFS and a HW RAID array (STK D280). He has extended a LUN on the storage array and wants to make this new size known to ZFS and Solaris. Does anyone know if this can be done and how it can be done? Unfortunately, there's no good way to do this at the moment. When you give ZFS the whole disk, we put an EFI label on the disk and make one big slice for our use. However, when the LUN grows, that slice stays the same size. ZFS needs to write a new EFI label describing the new size before it can use the new space. I've filed bug 6475340 to track this issue. As a workaround, it *should* be possible to manually relabel the device with format(1M), but unfortunately bug 4967547 (a problem with format) prevents this from working correctly. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What's going to make it into 11/06?
Darren Dunham wrote: What about ZFS root? And compatibility with Live Upgrade? Any timetable estimation? ZFS root has been previously announced as targeted for update 4. ZFS root support will most likely not be available in Solaris 10 until update 5. (And of course this is subject to change...) --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: directory tree removal issue with zfs on Blade 1500/PC rack server IDE disk
Stefan Urbat wrote: By the way, I have to wait a few hours to umount and check mountpoint permissions, because an automated build is currently running on that zfs --- the performance of [EMAIL PROTECTED] is indeed rather poor (much worse than ufs), but this is another, already documented issue with a bug entry filed. Really? Are you allowing ZFS to use the entire disk (and thus turn on the disk's write cache)? Can you describe your workload and give numbers on both ZFS and UFS? What bug was filed? --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Unbootable system recovery
Ewen Chan wrote: However, in order for me to lift the unit, I needed to pull the drives out so that it would actually be moveable, and in doing so, I think that the drive-cable-port allocation/assignment has changed. If that is the case, then ZFS would automatically figure out the new mapping. (Of course, there could be an undiscovered bug in that code.) --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A versioning FS
[EMAIL PROTECTED] wrote: On Fri, Oct 06, 2006 at 01:14:23AM -0600, Chad Leigh -- Shire.Net LLC wrote: But I would dearly like to have a versioning capability. Me too. Example (real-life scenario): there is a samba server for about 200 concurrently connected users. They keep mainly doc/xls files on the server. From time to time they (somehow) corrupt their files (they share the files, so it is possible), and the files are recovered from backup. With versioning they could be told that if their main file is corrupted they can open the previous version and keep working. ZFS snapshots are not a solution in this case because we would have to create snapshots for 400 filesystems (yes, each user has his own filesystem, and I said that there are 200 concurrent connections but there are many more accounts on the server) each hour or so. I completely disagree. In this scenario (and almost all others), use of regular snapshots will solve the problem. 'zfs snapshot -r' is extremely fast, and I'm working on some new features that will make using snapshots for this even easier and better-performing. If you disagree, please tell us *why* you think snapshots don't solve the problem. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
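To make the "regular snapshots" alternative concrete: because 'zfs snapshot -r' covers every descendant filesystem in one atomic operation, an hourly job only has to pick one snapshot name and prune old ones, regardless of whether there are 400 filesystems or 4. A hypothetical rotation helper (the naming scheme and keep policy are invented for illustration):

```python
from datetime import datetime, timedelta

def rotate(existing, now, keep=24, prefix="hourly-"):
    """Hypothetical schedule around 'zfs snapshot -r': one recursive
    snapshot per hour, pruning all but the newest `keep`.  Returns
    (snapshot name to create, list of snapshot names to destroy)."""
    name = prefix + now.strftime("%Y%m%d%H")
    to_create = None if name in existing else name
    # fixed-width timestamps sort chronologically as strings
    snaps = sorted(s for s in existing | {name} if s.startswith(prefix))
    to_destroy = snaps[:-keep] if len(snaps) > keep else []
    return to_create, to_destroy

now = datetime(2006, 10, 6, 12)
have = {"hourly-" + (now - timedelta(hours=h)).strftime("%Y%m%d%H")
        for h in range(1, 30)}
create, destroy = rotate(have, now)
print(create)        # hourly-2006100612
print(len(destroy))  # 6 oldest snapshots pruned to keep 24
```

The driver would then run 'zfs snapshot -r tank@' + create, and 'zfs destroy -r' for each pruned name.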
Re: [zfs-discuss] A versioning FS
Jeremy Teo wrote: A couple of use cases I was considering offhand: 1. Oops, I truncated my file. 2. Oops, I saved over my file. 3. Oops, an app corrupted my file. 4. Oops, I rm -rf'ed the wrong directory. All of which can be solved by periodic snapshots, but versioning gives us immediacy. So is immediacy worth it to you folks? I'd rather not embark on writing and finishing code for something no one wants besides me. In my opinion, the marginal benefit of per-write(2) versions over snapshots (which can be per-transaction, ie. every ~5 seconds) does not outweigh the complexity of implementation and use/administration. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: [EMAIL PROTECTED]:~]# zfs send -i export/zone/www/[EMAIL PROTECTED] export/zone/www/[EMAIL PROTECTED] | ssh cookies zfs recv export/zone/www/html cannot receive: destination has been modified since most recent snapshot -- use 'zfs rollback' to discard changes I was going to try deleting all snaps and start over with a new snap but I thought someone might be interested in figuring out what's going on here. That should not be necessary! I assume that you already followed the suggestion of doing 'zfs rollback', and you got the same message after trying the incremental recv again. If not, try that first. There are a couple of things that could cause this. One is that some process is inadvertently modifying the destination (eg. by reading something, causing the atime to be updated). You can get around this by making the destination fs readonly=on. Another possibility is that you're hitting 6343779 ZPL's delete queue causes 'zfs restore' to fail. In either case, you can fix the problem by using zfs recv -F which will do the rollback for you and make sure nothing happens between the rollback and the recv. You need to be running build 48 or later to use 'zfs recv -F'. If you can't run build 48 or later, then you can workaround the problem by not mounting the filesystem in between the 'rollback' and the 'recv': cookies# zfs set mountpoint=none export/zone/www/html cookies# zfs rollback export/zone/www/[EMAIL PROTECTED] milk# zfs send -i @4 export/zone/www/[EMAIL PROTECTED] | ssh cookies zfs recv export/zone/www/html Let me know if one of those options works for you. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: If you can't run build 48 or later, then you can workaround the problem by not mounting the filesystem in between the 'rollback' and the 'recv': cookies# zfs set mountpoint=none export/zone/www/html cookies# zfs rollback export/zone/www/[EMAIL PROTECTED] milk# zfs send -i @4 export/zone/www/[EMAIL PROTECTED] | ssh cookies zfs recv export/zone/www/html Let me know if one of those options works for you. Setting mountpoint=none works, but once I set the mountpoint option back it fails again. That is, I successfully send the incremental, reset the mountpoint option, rollback and send and it fails. I don't follow... could you list the exact sequence of commands you used and their output? I think you're saying that you were able to successfully receive the @[EMAIL PROTECTED] incremental, but when you tried the @[EMAIL PROTECTED] incremental without doing mountpoint=none, the recv failed. So you're saying that you need mountpoint=none for any incremental recv's, not just @[EMAIL PROTECTED] So I guess there is a filesystem access somewhere somehow immediately after the rollback. I can't run b48 (any idea if -F will be in 11/06?). I don't think so. Look for it in Solaris 10 update 4. However, I really do this via a script which does a rollback then immediately does the send. This script always fails. It sounds like the mountpoint=none trick works for you, so can't you just incorporate it into your script? Eg: while (want to send snap) { zfs set mountpoint=none destfs zfs rollback [EMAIL PROTECTED] zfs send -i @bla [EMAIL PROTECTED] | ssh desthost zfs recv bla zfs inherit mountpoint destfs sleep ... } readonly=on doesn't help. That is, cookies# zfs set readonly=on export/zone/www/html cookies# zfs rollback export/zone/www/[EMAIL PROTECTED] milk# zfs send ... ... destination has been modified ... This implies that you are hitting 6343779 (or some other bug) which is causing your fs to be modified, rather than some spurious process. 
But I would expect that to be rare, so it would be surprising if you see this happening with many different snapshots. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: No, I just tried the @[EMAIL PROTECTED] incremental again. I didn't think to try another incremental. So I was basically doing the mountpoint=none trick, then trying @[EMAIL PROTECTED] again without doing mountpoint=none. Again, seeing the exact sequence of commands you ran would make it quicker for me to diagnose this. I think you're saying that you ran: zfs set mountpoint=none destfs zfs rollback [EMAIL PROTECTED] zfs send -i @4 [EMAIL PROTECTED] | zfs recv ... - success zfs inherit mountpoint destfs zfs rollback -r [EMAIL PROTECTED] zfs send -i @4 [EMAIL PROTECTED] | zfs recv ... - failure This would be consistent with hitting bug 6343779. It sounds like the mountpoint=none trick works for you, so can't you just incorporate it into your script? Eg: Sure. I was just trying to identify the problem correctly, in case this isn't just another instance of an already-known problem. mountpoint=none is really suboptimal for me though; it means I cannot have services running on the receiving host. I was hoping readonly=on would do the trick. Really? I find it hard to believe that mountpoint=none causes any more problems than 'zfs recv' by itself, since 'zfs recv' of an incremental stream always unmounts the destination fs while the recv is taking place. It's all existing snapshots on that one filesystem. If I take a new snapshot (@6) and send it, it works. Which seems weird to me. It seems to be something to do with the sending host, not the receiving host. From the information you've provided, my best guess is that the problem is associated with your @4 snapshot, and you are hitting 6343779. Here is the bug description: Even when not accessing a filesystem, it can become dirty due to the zpl's delete queue. This means that even if you are just 'zfs restore'-ing incremental backups into the filesystem, it may fail because the filesystem has been modified. 
One possible solution would be to make filesystems created by 'zfs restore' be readonly by default, and have the zpl not process the delete queue if it is mounted readonly. *** (#1 of 2): 2005-10-31 03:31:02 PST [EMAIL PROTECTED] Note, currently even if you manually set the filesystem to be readonly, the ZPL will still process the delete queue, making it particularly difficult to ensure there are no changes since a most recent snapshot which has entries in the delete queue. The only workaround I could find is to not mount the filesystem. *** (#2 of 2): 2005-10-31 03:34:56 PST [EMAIL PROTECTED] --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't recv incremental snapshot
Frank Cusack wrote: Really? I find it hard to believe that mountpoint=none causes any more problems than 'zfs recv' by itself, since 'zfs recv' of an incremental stream always unmounts the destination fs while the recv is taking place. You're right. I forgot I was having problems with this anyway. You'd probably be interested in RFE 6425096 want online (read-only) 'zfs recv'. Unfortunately this isn't a priority at the moment. It's all existing snapshots on that one filesystem. If I take a new snapshot (@6) and send it, it works. Which seems weird to me. It seems to be something to do with the sending host, not the receiving host. From the information you've provided, my best guess is that the problem is associated with your @4 snapshot, and you are hitting 6343779. Well, all existing snapshots (@0, @1 ... @4). I will add changing of the mountpoint property to my script. That's a bit surprising, but I'm glad we have a workaround for you. 'zfs recv -F' will make this a bit smoother once you have it. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool list No known data errors
ttoulliu2002 wrote: Hi: I have a zpool created:

# zpool list
NAME      SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
ktspool   34,5G  33,5K  34,5G  0%   ONLINE  -

However, zpool status shows no known data errors. May I know what is the problem?

# zpool status
  pool: ktspool
 state: ONLINE
 scrub: none requested
config:

        NAME       STATE   READ WRITE CKSUM
        ktspool    ONLINE     0     0     0
        c0t1d0s6   ONLINE     0     0     0

errors: No known data errors

Please do not crosspost to both zfs-discuss and zfs-code. zfs-code is a subset of zfs-discuss, so just post to zfs-discuss. To answer your question, there does not appear to be any problem. Why do you think there is a problem? --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: directory tree removal issue with zfs on Blade 1500/PC rack server IDE disk
Stefan Urbat wrote: What bug was filed? 6421427 is nfs related, but another forum member thought, that it is in fact a general IDE performance bottleneck behind, and was only made visible in this case. There is a report, that on an also with simple IDE equipped Blade 150 the same issue with low performance is visible: http://www.opensolaris.org/jive/thread.jspa?messageID=57201 Ah yes, good old 6421427. The fix for that should be putback into opensolaris any day now. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to carve up 8 disks
Brian Hechinger wrote: Ok, previous threads have led me to believe that I want to make raidz vdevs [0] either 3, 5 or 9 disks in size [1]. Let's say I have 8 disks. Do I want to create a zfs pool with a 5-disk vdev and a 3-disk vdev? Are there performance issues with mixing differently sized raidz vdevs in a pool? If there *is* a performance hit to mix like that, would it be greater or lesser than building an 8-disk vdev? Unless you are running a database (or other record-structured application), or have specific performance data for your workload that supports your choice, I wouldn't worry about using the power-of-two-plus-parity size stripes. I'd choose between (in order of decreasing available io/s): 4x 2-way mirrors (most io/s and most read bandwidth) 2x 4-way raidz1 1x 8-way raidz1 (most write bandwidth) 1x 8-way raidz2 (most redundant) [0] - Just for clarity, what are the sub-pools in a pool, the actual raidz/mirror/etc containers called? What is the correct term to refer to them? I don't want any extra confusion here. ;) We would usually just call them vdevs (or to be more specific, top-level vdevs). --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
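The io/s ordering in the list above follows from a crude model: for small random reads, each mirror disk can serve an independent read, while a raid-z group behaves roughly like one disk because every read touches the whole stripe. A back-of-the-envelope sketch under those assumptions (the function is mine, and disk_iops/disk_size are made-up round numbers, not measurements):

```python
def layout_estimate(ndisks, groups, parity_per_group, mirror=False,
                    disk_iops=100, disk_size=500):
    """Toy model: mirrored pools serve ~one random read per disk;
    each raid-z group serves ~one random read at a time.  Usable
    space excludes parity/mirror copies."""
    if mirror:
        # each 2-way mirror group stores one copy's worth of data
        return {"read_iops": ndisks * disk_iops,
                "usable_gb": groups * disk_size}
    data_disks = ndisks // groups - parity_per_group
    return {"read_iops": groups * disk_iops,
            "usable_gb": groups * data_disks * disk_size}

layouts = {
    "4x 2-way mirror": layout_estimate(8, 4, 0, mirror=True),
    "2x 4-way raidz1": layout_estimate(8, 2, 1),
    "1x 8-way raidz1": layout_estimate(8, 1, 1),
    "1x 8-way raidz2": layout_estimate(8, 1, 2),
}
for name, est in layouts.items():
    print(name, est)
```

Even this toy model reproduces the trade-off in the post: mirrors win heavily on random io/s, while the wide raidz1 maximizes usable space.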
Re: [zfs-discuss] Where is the ZFS configuration data stored?
Steven Goldberg wrote: Thanks Matt. So is the config/meta info for the pool that is stored within the pool kept in a file? Is the file user readable or binary? It is not user-readable. See the on-disk format document, linked here: http://www.opensolaris.org/os/community/zfs/docs/ --matt
Re: [zfs-discuss] Self-tuning recordsize
Jeremy Teo wrote: Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort? It would be really great to automatically select the proper recordsize for each file! How do you suggest doing so? --matt
Re: [zfs-discuss] zfs and zones
Roshan Perera wrote: Hi Jeff Robert, Thanks for the reply. Your interpretation is correct and the answer spot on. This is going to be at a VIP client's QA/production environment and their first introduction to Solaris 10, zones and zfs. Anything unsupported is not allowed. Hence I may have to wait for the fix. Do you know roughly when the fixes will be available, so that I can give the customer some time-related info? Thanks again. Roshan Using ZFS for a zone's root is currently planned to be supported in solaris 10 update 5, but we are working on moving it up to update 4. --matt
Re: [zfs-discuss] Thumper and ZFS
Robert Milkowski wrote: Hello Richard, Friday, October 13, 2006, 8:05:18 AM, you wrote: REP Do you want data availability, data retention, space, or performance? data availability, space, performance However we're talking about quite a lot of small IOs (r+w). Then you should seriously consider using mirrors. The real question was what do you think about creating each raid group only from disks from different controllers so controller failure won't affect data availability. On thumper, where the controllers (and cables, etc) are integrated into the system board, controller failure is extremely unlikely. These controllers are much more reliable than your traditional SCSI card in a PCI slot. In fact, most controller failures are due to SCSI bus negotiation problems (confused devices, bad cables, etc), which simply don't exist in the point-to-point (ie. SATA, SAS) world. So I wouldn't worry very much about spreading across controllers for the sake of controller failure. --matt
Re: [zfs-discuss] ZFS Usability issue : improve means of finding ZFS-physdevice(s) mapping
Robert Milkowski wrote: Hello Noel, Friday, October 13, 2006, 11:22:06 PM, you wrote: ND I don't understand why you can't use 'zpool status'? That will show ND the pools and the physical devices in each and is also a pretty basic ND command. Examples are given in the sysadmin docs and manpages for ND ZFS on the opensolaris ZFS community page. Showing physical devs in df output with ZFS is not right, and I cannot imagine how one would show them in df output for a pool with a dozen disks. But an option to the zpool command to display the config in such a way that it's easy (almost copy-paste) to recreate such a config would be useful. Something like metastat -p. Agreed, see 6276640 zpool config. --matt
Re: [zfs-discuss] Self-tuning recordsize
Jeremy Teo wrote: Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort? Here is one relatively straightforward way you could implement this. You can't (currently) change the recordsize once there are multiple blocks in the file. This shouldn't be too bad because by the time they've written 128k, you should have enough info to make the choice. In fact, that might make a decent algorithm: * Record the first write size (in the ZPL's znode) * If subsequent writes differ from that size, reset write size to zero * When a write comes in past 128k, see if the write size is still nonzero; if so, then read in the 128k, decrease the blocksize to the write size, fill in the 128k again, and finally do the new write. Obviously you will have to test this algorithm and make sure that it actually detects the recordsize on various databases. They may like to initialize their files with large writes, which would break this. If you have to change the recordsize once the file is big, you will have to rewrite everything[*], which would be time consuming. --matt [*] Or if you're willing to hack up the DMU and SPA, you'll just have to re-read everything to compute the new checksums and re-write all the indirect blocks.
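Matt's three-bullet detection heuristic can be sketched in a few lines. This is only a toy model of the decision logic (the class and method names are made up; in the real implementation the state would live in the ZPL's znode, as he describes):

```python
RECORDSIZE_MAX = 128 * 1024  # ZFS's current maximum/default recordsize

class WriteSizeDetector:
    """Remember the first write size, reset it to zero if later writes
    differ, and decide once a write comes in past one max-sized block."""

    def __init__(self):
        self.write_size = 0   # candidate recordsize (0 = no consistent size)
        self.offset = 0       # bytes written so far
        self.decided = False

    def on_write(self, size):
        """Feed one sequential write; returns the chosen recordsize, or None."""
        if self.decided:
            return None
        if self.offset == 0:
            self.write_size = size        # record the first write's size
        elif size != self.write_size:
            self.write_size = 0           # inconsistent sizes: give up detecting
        self.offset += size
        if self.offset > RECORDSIZE_MAX:  # a write came in past 128k: decide now
            self.decided = True
            if 0 < self.write_size < RECORDSIZE_MAX:
                return self.write_size    # shrink the blocksize to the write size
            return RECORDSIZE_MAX         # mixed or large writes: keep the default
        return None

# A database doing consistent 8k writes is detected once it passes 128k:
d = WriteSizeDetector()
results = [d.on_write(8192) for _ in range(17)]
assert results[-1] == 8192 and all(r is None for r in results[:-1])
```

As the thread notes, files initialized with a few large writes would defeat this, which is why the decision falls back to the default whenever the observed sizes are inconsistent.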
Re: [zfs-discuss] Re: Configuring a 3510 for ZFS
Torrey McMahon wrote: Richard Elling - PAE wrote: Anantha N. Srirama wrote: I'm glad you asked this question. We are currently expecting 3511 storage sub-systems for our servers. We were wondering about their configuration as well. This ZFS thing throws a wrench in the old line think ;-) Seriously, we now have to put on a new hat to figure out the best way to leverage both the storage sub-system as well as ZFS. [for the archives] There is *nothing wrong* with treating ZFS like UFS when configuring with LUNs hosted on RAID arrays. It is true that you will miss some of the self-healing features of ZFS, but at least you will know when the RAID array has munged your data -- a feature missing on UFS and most other file systems. Or you just offer ZFS multiple LUNs from the RAID array. The issue is putting ZFS on a single LUN, be it a disk in a JBOD or a LUN offered from a HW RAID array. If something goes wrong and the LUN becomes inaccessible then ... blamo! You're toasted. If ZFS detects a data inconsistency then it can't look to another block for a mirrored copy, ala ZFS mirror, or to a parity block, ala RAIDZ. Right, I think Richard's point is that even if you just give ZFS a single LUN, ZFS is still more reliable than other filesystems (eg, due to its checksums to prevent silent data corruption and multiple copies of metadata to lessen the hurt of small amounts of data loss). --matt
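The checksum point can be illustrated with a toy end-to-end verification. ZFS actually stores a fletcher or SHA-256 checksum in the parent block pointer; this sketch just uses Python's hashlib to show the read-path behavior that other filesystems lack:

```python
import hashlib

def write_block(data):
    """Store a block along with its checksum (ZFS keeps it in the block pointer)."""
    return {"data": bytearray(data), "checksum": hashlib.sha256(data).digest()}

def read_block(block):
    """Verify on read; a filesystem without checksums would just return bad data."""
    if hashlib.sha256(bytes(block["data"])).digest() != block["checksum"]:
        raise IOError("checksum mismatch: device returned corrupt data")
    return bytes(block["data"])

blk = write_block(b"important records")
assert read_block(blk) == b"important records"

blk["data"][0] ^= 0xFF          # simulate silent corruption by the device
try:
    read_block(blk)
except IOError:
    pass  # ZFS reports this -- and self-heals from a mirror or parity, if present
```

With a single LUN there is no second copy to heal from, but the corruption is at least detected and reported instead of being handed to the application.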
Re: [zfs-discuss] Snapshots impact on performance
Robert Milkowski wrote: If it happens again I'll try to get some more specific data - however it depends on when it happens as during peak hours I'll probably just destroy a snapshot to get it working. If it happens again, it would be great if you could gather some data before you destroy the snapshot so we have some chance of figuring out what's going on here. 'iostat -xnpc 1' will tell us if it's CPU or disk bound. 'lockstat -kgiw sleep 10' will tell us what functions are using CPU. 'echo ::walk thread|::findstack | mdb -k' will tell us where threads are stuck. Actually, if you could gather each of those both while you're observing the problem, and then after the problem goes away, that would be helpful. --matt
Re: [zfs-discuss] Re: Self-tuning recordsize
Jeremy Teo wrote: Heya Anton, On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote: No, the reason to try to match recordsize to the write size is so that a small write does not turn into a large read + a large write. In configurations where the disk is kept busy, multiplying 8K of data transfer up to 256K hurts. (Actually ZFS goes up to 128k not 256k (yet!)) Ah. I knew I was missing something. What COW giveth, COW taketh away... Yes, although actually most non-COW filesystems have this same problem, because they don't write partial blocks either, even though technically they could. (And FYI, checksumming would take away the ability to write partial blocks too.) 1) Set recordsize manually 2) Allow the blocksize of a file to be changed even if there are multiple blocks in the file. Or, as has been suggested, add an API for apps to tell us the recordsize before they populate the file. --matt
Re: [zfs-discuss] ENOSPC : No space on file deletion
Erblichs wrote: Now the stupid question.. If the snapshot is identical to the FS, I can't remove files from the FS because of the snapshot, and removing files from the snapshot only removes a reference to the file and leaves the data. So, how do I do atomic file removes on both the original and the snapshot(s)? Yes, I am assuming that I have backed up the file offline. Can I request a possible RFE to be able to force a file remove from the original FS, and if found elsewhere remove that location too, IFF an ENOSPC would fail the original rm? No, you cannot remove files from snapshots. Snapshots cannot be changed. If you are out of space because of snapshots, you can always 'zfs destroy' the snapshot :-) --matt
Re: [zfs-discuss] Mirrored Raidz
Richard Elling - PAE wrote: Anthony Miller wrote: Hi, I've searched the forums and not found any answer to the following. I have 2 JBOD arrays each with 4 disks. I want to create a raidz on one array and have it mirrored to the other array. Today, the top level raid sets are assembled using dynamic striping. There is no option to assemble the sets with mirroring. Perhaps the ZFS team can enlighten us on their intentions in this area? Our thinking is that if you want more redundancy than RAID-Z, you should use RAID-Z with double parity, which provides more reliability and more usable storage than a mirror of RAID-Zs would. (Also, expressing a mirror of RAID-Zs from the CLI would be a bit messy; you'd have to introduce parentheses in vdev descriptions or something.) --matt
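The usable-storage half of that argument is simple arithmetic. A quick sketch for the 8-disk case (assuming equal-size disks and ignoring metadata overhead):

```python
disks = 8

# Mirror of RAID-Zs: two 4-disk raidz1 halves, mirrored.
# Each half holds 3 disks of data; mirroring keeps only one copy's worth.
mirror_of_raidz1_data = (disks // 2) - 1   # 3 disks of usable data

# One 8-disk raidz2: two disks of parity.
raidz2_data = disks - 2                    # 6 disks of usable data

assert mirror_of_raidz1_data == 3
assert raidz2_data == 6
```

So for the same 8 disks, raidz2 yields twice the usable capacity while still surviving any two disk failures.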
Re: [zfs-discuss] Changing number of disks in a RAID-Z?
Robert Milkowski wrote: Hello Jeremy, Monday, October 23, 2006, 5:04:09 PM, you wrote: JT Hello, Shrinking the vdevs requires moving data. Once you move data, you've got to either invalidate the snapshots or update them. I think that will be one of the more difficult parts. JT Updating snapshots would be non-trivial, but doable. Perhaps some sort JT of reverse mapping or brute force search to relate snapshots to JT blocks. IMHO the ability to shrink/grow pools, even if restricted so that no snapshots and clones can be present in a pool during shrinking/growing, would still be a great feature. FYI, we're working on being able to shrink pools with no restrictions. Unfortunately I don't have an ETA for you on this, though. And as I'm sure you know, you can always grow pools :-) --matt
Re: [zfs-discuss] Re: zone with lofs zfs - why legacy
Jens Elkner wrote: Yes, I guessed that, but hopefully not that much ... Thinking about it, it would suggest to me (if I need abs. max. perf) that the best thing to do is to create a pool inside the zone and to use zfs on it? Using a ZFS filesystem within a zone will go just as fast as in the global zone, so there's no need to create multiple pools. --matt
Re: [zfs-discuss] Changing number of disks in a RAID-Z?
Erik Trimble wrote: Matthew Ahrens wrote: Erik Trimble wrote: The ability to expand (and, to a lesser extent, shrink) a RAIDZ or RAIDZ2 device is actually one of the more critical missing features from ZFS, IMHO. It is very common for folks to add an additional shelf or shelves into an existing array setup, and if you have created a pool which uses RAIDZ across the shelves (a good idea), then you want to add the new shelves into the existing RAIDZ setup. Out of curiosity, what software (filesystem and/or volume manager) and configuration are you using today to achieve this? --matt I can't speak for VxVM, since I can't remember if it has the capability, but most hardware RAID controllers and SAN controllers have had this ability for ages (which combines with VxVM or other FSs that can grow/shrink a FS when the underlying partition size changes). See: IBM's ServeRAID controllers, HP's MSA-series array heads, etc. Right, but those are volume managers or hardware devices that export LUNs. They can shrink the LUN by simply throwing away the end of it. ZFS's zvols can do this too. Shrinking the *filesystem* that sits on top of that LUN is a much more difficult problem! (But, as I've mentioned, it's one we're going to solve.) --matt
Re: [zfs-discuss] Re: Snapshots impact on performance
Robert Milkowski wrote: Hi. On nfs clients which are mounting file system f3-1/d611 I can see 3-5s periods of 100% busy (iostat) and almost no IOs issued to the nfs server; on the nfs server at the same time disk activity is almost 0 (both iostat and zpool iostat). However CPU activity increases in SYS during those periods. Different time period when disk activity is small: # lockstat -kgIw sleep 10 | less Did you happen to get 'lockstat -kgIW' output while the problem was occurring? (note the capital W) I'm not sure how to interpret the -w output... (and sorry I gave you the wrong flags before). Now during another period when disk activity is low and nfs clients see the problem: # dtrace -n fbt:::entry'{self->vt=vtimestamp;}' -n fbt:::return'/self->vt/[EMAIL PROTECTED](vtimestamp-self->vt);self->vt=0;}' -n tick-5s'{printa(@);exit(0);}' [...] page_next_scan_large 23648600 generic_idle_cpu 69234100 disp_getwork 139261800 avl_walk 669424900 Hmm, that's a possibility, but the method you're using to gather this information (tracing *every* function entry and exit) is a bit heavy-handed, and it may be distorting the results. Heh, I'm sure I have seen avl_walk consuming a lot of CPU before... So wait for another such period and (6-7 seconds): # dtrace -n fbt::avl_walk:entry'[EMAIL PROTECTED]()]=count();}' [...] zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e4 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_alloc_blk+0x70 zfs`zil_lwb_write_start+0x8c zfs`zil_lwb_commit+0x1ac zfs`zil_commit+0x1b0 zfs`zfs_fsync+0xa8 genunix`fop_fsync+0x14 nfssrv`rfs3_create+0x700 nfssrv`common_dispatch+0x444 rpcmod`svc_getreq+0x154 rpcmod`svc_run+0x198 nfs`nfssys+0x1c4 unix`syscall_trap32+0xcc 1415957 Hmm, assuming that avl_walk() is actually consuming most of our CPU (which the lockstat -kgIW will confirm), this seems to indicate that we're taking a long time trying to find free chunks of space.
This may happen if you have lots of small fragments of free space, but no chunks large enough to hold the block we're trying to allocate. We try to avoid this situation by trying to allocate from the metaslabs with the most free space, but it's possible that there's an error in this algorithm. So let's destroy the oldest snapshot: # zfs destroy f3-1/[EMAIL PROTECTED] [it took about 4 minutes!] After the snapshot was destroyed the problem is completely gone. FYI, destroying the snapshot probably helped simply by (a) returning some big chunks of space to the pool and/or (b) perturbing the system enough so that we try different metaslabs which aren't so fragmented. --matt
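A toy model shows why fragmented free space makes allocation expensive. This is not the actual metaslab code (the real first-fit allocator walks an AVL tree of free extents, which is where avl_walk() shows up in the stacks above), just a sketch of the search cost:

```python
def first_fit(free_extents, size):
    """Walk the free list until an extent is big enough; count the steps."""
    steps = 0
    for offset, length in free_extents:
        steps += 1
        if length >= size:
            return offset, steps
    return None, steps  # nothing fit: we walked the entire list

# Healthy metaslab: one big 1GB extent, found immediately.
_, steps = first_fit([(0, 1 << 30)], 128 * 1024)
assert steps == 1

# Fragmented metaslab: thousands of 64k holes, none fits a 128k block,
# so every allocation attempt walks the whole list before giving up.
fragments = [(i * 128 * 1024, 64 * 1024) for i in range(10000)]
_, steps = first_fit(fragments, 128 * 1024)
assert steps == 10000
```

Destroying a snapshot merges freed blocks back into large extents, collapsing the list and making the walk short again, which matches the observed recovery.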
Re: [zfs-discuss] Re: ZFS hangs systems during copy
Juergen Keil wrote: Sounds familiar. Yes, it is a small system, a Sun Blade 100 with 128MB of memory. Oh, 128MB... Btw, does anyone know if there are any minimum hardware (physical memory) requirements for using ZFS? It seems as if ZFS wasn't tested that much on machines with 256MB (or less) of memory... The minimum hardware requirement for Solaris 10 (including ZFS) is 256MB, and we did test with that :-) On small-memory systems, make sure that you are running with kmem_flags=0 (this is the default on non-debug builds, but debug builds default to kmem_flags=f and you will have to manually change it in /etc/system). --matt
Re: [zfs-discuss] Re: copying a large file..
Jeremy Teo wrote: This is the same problem described in 6343653 : want to quickly copy a file from a snapshot. Actually it's a somewhat different problem. Copying a file from a snapshot is a lot simpler than copying a file from a different filesystem. With snapshots, things are a lot more constrained, and we already have the infrastructure for a filesystem referencing the same blocks as its snapshots. --matt
Re: [zfs-discuss] ZFS thinks my 7-disk pool has imaginary disks
Rince wrote: Hi all, I recently created a RAID-Z1 pool out of a set of 7 SCSI disks, using the following command: # zpool create magicant raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0 c5t6d0 It worked fine, but I was slightly confused by the size yield (99 GB vs the 116 GB I had on my other RAID-Z1 pool of same-sized disks). This is probably because your old pool was hitting 6288488 du reports misleading size on RAID-Z Pools created with more recent bits won't hit this. (note, 99/116 == 6/7) --matt
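The parenthetical (99/116 == 6/7) is just the raidz1 parity fraction: one of the seven disks goes to parity, so only 6/7 of the raw capacity is usable data. A quick check with the GB figures from the thread:

```python
n = 7                              # disks in the raidz1 vdev
usable_fraction = (n - 1) / n      # one disk's worth goes to parity

# The older pool (with bug 6288488) reported close to raw capacity: ~116 GB.
# Pools created with the fixed bits report only the usable portion:
assert round(116 * usable_fraction) == 99
```

The fix did not change how much space the pool actually has, only which number gets reported to du and friends.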
Re: [zfs-discuss] zfs receive into zone?
Jeff Victor wrote: If I add a ZFS dataset to a zone, and then want to zfs send from another computer into a file system that the zone has created in that data set, can I zfs send to the zone, or can I send to that zone's global zone, or will either of those work? I believe that the 'zfs send' can be done from either the global or local zone just fine. You can certainly do it from the local zone. FYI, if you are doing a 'zfs recv' into a filesystem that's been designated to a zone, you should do the 'zfs recv' inside the zone. (I think it's possible to do the 'zfs recv' in the global zone, but I think you'll have to first make sure that it isn't mounted in the local zone. This is because the global zone doesn't know how to go into the local zone and unmount it.) --matt
Re: [zfs-discuss] Size of raidz
Vahid Moghaddasi wrote: I created a raidz from three 70GB disks and got a total of 200GB out of it. Isn't that supposed to give 140GB? You are hitting 6288488 du reports misleading size on RAID-Z which affects pools created before build 42 or s10u3. --matt
Re: [zfs-discuss] linux versus sol10
Robert Milkowski wrote: PvdZ This could be related to Linux trading reliability for speed by doing PvdZ async metadata updates. PvdZ If your system crashes before your metadata is flushed to disk your PvdZ filesystem might be hosed and a restore PvdZ from backups may be needed. you can achieve something similar with fastfs on ufs file systems and setting zil_disable to 1 on ZFS. No, zil_disable does not trade off consistency for performance the way 'fastfs' on ufs or async metadata updates on EXT do! Setting zil_disable causes ZFS to not push synchronous operations (eg, fsync(), O_DSYNC, NFS ops) to disk immediately, but it does not compromise filesystem integrity in any way. Unlike these other filesystems' fast modes, ZFS (even with zil_disable=1) will not corrupt itself and send you to backup tapes. To state it another way, setting 'zil_disable=1' on ZFS will at worst cause some operations which requested synchronous semantics to not actually be on disk in the event of a crash, whereas other filesystems can corrupt themselves and lose all your data. All that said, 'zil_disable' is a completely unsupported hack, and subject to change at any time. It will probably eventually be replaced by 6280630 zil synchronicity. --matt
Re: [zfs-discuss] Production ZFS Server Death (06/06)
Elizabeth Schwartz wrote: On 11/28/06, *David Dyer-Bennet* [EMAIL PROTECTED] wrote: Looks to me like another example of ZFS noticing and reporting an error that would go quietly by on any other filesystem. And if you're concerned with the integrity of the data, why not use some ZFS redundancy? (I'm guessing you're applying the redundancy further downstream; but, as this situation demonstrates, separating it too far from the checksum verification makes it less useful.) Well, this error meant that two files on the file system were inaccessible, and one user was completely unable to use IMAP, so I don't know about unnoticeable. David said, [the error] would go quietly by on any other filesystem. The point is that ZFS detected and reported the fact that your hardware corrupted the data. A different filesystem would have simply given your application the incorrect data. How would I use more redundancy? By creating a zpool with some redundancy, eg. 'zpool create poolname mirror disk1 disk2'. --matt
Re: [zfs-discuss] ZFS related kernel panic
Jason J. W. Williams wrote: Hi all, Having experienced this, it would be nice if there was an option to offline the filesystem instead of kernel panicking, on a per-zpool basis. If it's a system-critical partition like a database I'd prefer it to kernel-panic and thereby trigger a fail-over of the application. However, if it's a zpool hosting some fileshares I'd prefer it to stay online. Putting that level of control in would alleviate a lot of the complaints, it seems to me... or at least give less of a leg to stand on. ;-) Agreed, and we are working on this. --matt
Re: [zfs-discuss] zpool mirror
Gino Ruopolo wrote: Hi All, we have some ZFS pools in production with hundreds of filesystems and thousands of snapshots on them. Now we do backups with zfs send/receive with some scripting, but I'm searching for a way to mirror each zpool to another one for backup purposes (so including all snapshots!). Is that possible? Not right now (without a bunch of shell-scripting). I'm working on being able to send a whole tree of filesystems and their snapshots. Would that do what you want? --matt
Re: [zfs-discuss] Re: Sol10u3 -- is du bug fixed?
Jeb Campbell wrote: After upgrade you did actually re-create your raid-z pool, right? No, but I did zpool upgrade -a. Hmm, I guess I'll try re-writing the data first. I know you have to do that if you change compression options. Ok -- rewriting the data doesn't work ... I'll create a new temp pool and see what that does ... then I'll investigate options for recreating my big pool ... Unfortunately, this bug is only fixed when you create the pool on the new bits. --matt
Re: [zfs-discuss] Performance problems during 'destroy' (and bizzare Zone problem as well)
Anantha N. Srirama wrote: - Why is the destroy phase taking so long? Destroying clones will be much faster with build 53 or later (or the unreleased s10u4 or later) -- see bug 6484044. - What can explain the unduly long snapshot/clone times? - Why didn't the Zone start up? - More surprisingly, why did the Zone start up after an hour? Perhaps there was so much activity on the system that we couldn't push out transaction groups in the usual 5 seconds. 'zfs snapshot' and 'zfs clone' take at least 1 transaction group to complete, so this could explain it. We've seen this problem as well and are working on a fix... --matt
Re: [security-discuss] Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto
Bill Sommerfeld wrote: On Tue, 2006-12-19 at 16:19 -0800, Matthew Ahrens wrote: Darren J Moffat wrote: I believe that ZFS should provide a method of bleaching a disk or part of it that works without crypto having ever been involved. I see two use cases here: I agree with your two, but I think I see a third use case in Darren's example: bleaching disks as they are removed from a pool. That sounds plausible too. (And you could implement it as 'zfs destroy -r pool; zpool bleach pool'.) We may need a second dimension controlling *how* to bleach... You mean whether we do a single overwrite with zeros, multiple overwrites with some crazy government-mandated patterns, etc, right? That's what I meant by the value of the property can specify what type of bleach to use (perhaps taking the metaphor a bit too far), for example, 'zfs set bleach=how fs'. Like other properties, we would provide bleach=on which would choose a reasonable default. --matt We'd need something similar with 'zpool bleach' (eg 'zpool bleach [-o how] pool'). --matt
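A userland sketch of what single-pass zero bleaching amounts to (the `how` values here are hypothetical stand-ins for the proposed property; the real feature would have to run inside the pool so it can also chase freed blocks):

```python
import os
import tempfile

def bleach_file(path, how="zero", passes=1):
    """Overwrite a file's contents in place, then unlink it."""
    size = os.path.getsize(path)
    pattern = b"\x00" if how == "zero" else b"\xff"  # stand-in for fancier patterns
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(pattern * size)
            f.flush()
            os.fsync(f.fileno())   # push each pass out to stable storage
    os.unlink(path)

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"sensitive payload")
bleach_file(path, how="zero")
assert not os.path.exists(path)
```

Note that on a copy-on-write filesystem like ZFS, an in-place userland overwrite does not actually touch the old blocks; the superseded copies linger on disk until reallocated, which is precisely why the thread argues bleaching must be a filesystem/pool-level feature rather than a tool like this.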
Re: [zfs-discuss] The size of a storage pool
Nathalie Poulet (IPSL) wrote: Hello, After an export and an import, the size of the pool remains unchanged. As there were no data on this partition, I destroyed and recreated the pool. The size was indeed taken into account. The correct size is indicated by the zpool list command. The df -k command shows a size higher than the real size. The zfs list command shows a lower size. Why? As Tomas pointed out, zfs list and df -k show the same size. zpool list shows slightly more, because it does its accounting differently, taking into account only actual blocks allocated, whereas the others show usable space, taking into account the small amount of space we reserve for allocation efficiency (as well as quotas or reservations, if you have them). The fact that 'zpool list' shows the raw values is bug 6308817 discrepancy between zfs and zpool space accounting. --matt
Re: [zfs-discuss] ZFS in a SAN environment
Jason J. W. Williams wrote: INFORMATION: If a member of this striped zpool becomes unavailable or develops corruption, Solaris will kernel panic and reboot to protect your data. This is a bug, not a feature. We are currently working on fixing it. --matt