[zfs-discuss] oddity of slow zfs destroy
I ran into something odd today: "zfs destroy -r random/filesystem" is mind-bogglingly slow, and it seems to me it shouldn't be. It's slow because the filesystem has two snapshots on it; presumably it's busy rolling back the snapshots. But I've already declared, by my command line, that I DON'T CARE about the contents of the filesystem! Why doesn't zfs simply do:

1. unmount the filesystem, if possible (it was possible)
   (1.5 possibly note the intent to delete somewhere in the pool records)
2. zero out/free the in-kernel memory in one go
3. update the pool: "hey, I deleted the filesystem, all these blocks are now clear"

Having this kind of operation take more than even 10 seconds seems like a huge bug to me, yet it can take many minutes. An order of magnitude off. Yuck.
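For anyone who wants to see the effect for themselves, here is a minimal sketch of the test (the dataset name is just a placeholder for a throwaway filesystem):

  # list the snapshots that will be destroyed along with the filesystem
  zfs list -t snapshot -r random/filesystem
  # time the recursive destroy; with snapshots present, this is where the minutes go
  time zfs destroy -r random/filesystem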
[zfs-discuss] checking/fixing busy locks for zfs send/receive
It was suggested to me by Ian Collins that doing zfs sends and receives can render a filesystem busy. If there isn't a process visibly doing this via ps, I'm wondering how one might check whether a zfs filesystem or snapshot has been rendered busy in this way, interfering with an unmount or destroy?

I'm also wondering if this sort of thing can mean interference between some combination of multiple send/receives running at the same time, on the same filesystem?
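A few quick checks that may help narrow this down; this is only a sketch, with pool/myfs standing in for the filesystem in question (and it assumes the filesystem still has snapshots and its default mountpoint):

  # any stray send/receive processes still running?
  pgrep -lf 'zfs (send|receive|recv)'
  # anything holding files open under the mountpoint?
  fuser -c /pool/myfs
  # any user holds left on the filesystem's snapshots?
  zfs list -H -o name -t snapshot -r pool/myfs | xargs -n1 zfs holds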
Re: [zfs-discuss] checking/fixing busy locks for zfs send/receive
On Fri, Mar 16, 2012 at 3:06 PM, Brandon High bh...@freaks.com wrote:
> On Fri, Mar 16, 2012 at 2:35 PM, Philip Brown p...@bolthole.com wrote:
>> If there isn't a process visibly doing this via ps, I'm wondering how one might check whether a zfs filesystem or snapshot has been rendered busy in this way, interfering with an unmount or destroy? I'm also wondering if this sort of thing can mean interference between some combination of multiple send/receives running at the same time, on the same filesystem?
>
> Look at 'zfs hold', 'zfs holds', and 'zfs release'. Sends and receives will place holds on snapshots to prevent them from being changed.

Yup, I know about holds. It wasn't those. The reason for my question is that I recently ran into a situation where there was a single orphaned zfs filesystem: no snapshots (therefore no holds), no sub-filesystems, no clones... and, as far as I'm aware, no send or receive active for it. There were a bunch before that time, but I believe they had all completed. So I'm trying to figure out whether there was any kind of left-over lock, and how I might see that. Is there some zdb magic?
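Not an answer, but a couple of places one might poke for left-over state; this is speculative, and pool/orphanfs is just a placeholder:

  # dump dataset-level details; leftover temporary receive state sometimes shows up here
  zdb -dd pool/orphanfs
  # and double-check nothing still has files open under its mountpoint
  fuser -c /pool/orphanfs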
[zfs-discuss] zrep initial release (replication with failover)
I'm happy to announce the first release of zrep (v0.1): http://www.bolthole.com/solaris/zrep/

This is a self-contained, single-executable tool to implement synchronization *and* failover of an active/passive zfs filesystem pair. No configuration files are needed: configuration is stored in the zfs filesystem properties.

Setting up replication is a simple 2-step process (presuming you already have root ssh trust set up):

1. zrep init pool/myfs remotehost remotepool/remotefs
   (This will create, and sync, the remote filesystem)
2. zrep sync pool/myfs   (or, if you prefer, zrep sync all)

Do this manually, or crontab it. You can then, in theory, set up "zrep sync -q SOME_SEC all" as a cronjob on both sides and forget about it (although you should note that it is currently single-threaded).

Failover is equally simple:

   zrep failover pool/myfs

This automatically switches roles, making the src the destination, and vice versa. zrep uses an internal locking mechanism to avoid problems with overlapping operations on a filesystem.

zrep automatically handles serialization of snapshots. It uses a 6-digit hex serial number, of the form @zrep_######. It can thus handle running once a minute, every minute, for 11650 days -- over 30 years. By default it only keeps the last 5 snapshots, but that's tunable via a property.

Simple usage summary:

 zrep (init|-i) ZFS/fs remotehost remoteZFSpool/fs
 zrep (sync|-S) ZFS/fs
 zrep (sync|-S) all
 zrep (status|-s) [ZFS/fs]
 zrep (list|-l) [-v] [ZFS/fs]
 zrep (expire|-e) [-L] (ZFS/fs ...)|(all)|()
 zrep (changeconfig|-C) ZFS/fs remotehost remoteZFSpool/fs
 zrep failover [-L] ZFS/fs
 zrep takeover [-L] ZFS/fs
 zrep clear ZFS/fs    -- REMOVE ZREP CONFIG AND SNAPS FROM FILESYSTEM
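As a concrete illustration of the "crontab it" suggestion, something along these lines should work; the install path and the seconds value passed to -q are placeholders:

  # run zrep every minute on both hosts, quietly, syncing all initialized filesystems
  * * * * * /usr/local/bin/zrep sync -q 300 all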
[zfs-discuss] RFC for new zfs replication tool
Please note: this is a cross-posting of sorts, from a post I made:
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/a8bd4aab3918b7a0/528dacb05c970748
It was suggested that I mention it here, so I am doing so. For convenience, here is mostly a duplicate of what I posted, with a couple of minor updates.

Please note: I'm not asking for technical help on how to do it. I'm soliciting feature requests now, so I can incorporate appropriate ones into the initial user interface design. This will be a single-executable program.

WIP: design doc for zrep, a zfs-based replication program. This goes one step beyond other replication utils I've seen, in that it explicitly targets the concept of production failover. This is meant to be enterprise-product quality, rather than merely a sysadmin's tool.

# Design goals:
# 1. Easy to configure
# 2. Easy to use
# 3. As robust as possible
#    3.1 Will not be harmful to run every minute, even when the WAN is down.
#        (Will need safety limits on # of snapshots and filesystem space free?)
# 4. Well documented

# Limitations (mostly for ease-of-use reasons):
#   Uses short hostname, not FQDN, in snapshot names; automatically truncates.
#   Only one copy destination per filesystem-remotehost combination allowed.
#   Stores configuration in filesystem properties of snapshots.
#   Need to figure out some sort of locking, for during sync and changes.
##    Possibly via filesystem properties?? or other zfs commands

Usage:

zrep -i/init ZFSfs remotehost destfs
    == create initial snapshot. Should do lots of sanity checks, both local and remote.
    SHOULD it actually do the first sync as well?
    Should it allow a hand-created snapshot? If so, specify the snap as the ZFSfs arg.
    Extra options: SHOULD IT SET READ-ONLY ON THE REMOTE SIDE??! Should it DEFAULT to read-only? (probably?)
    (Should it CREATE the fs in the pool? or just leave that to sync?)

zrep -S/sync ZFSfs remote destfs             # copy/sync after initial snapshot created
zrep -S/sync all                             # special case, copies all zfs fs's that have been initialized
zrep -C/changedest ZFSfs remotehost destfs   # changes configs for given ZFS
zrep -l/list (ZFSfs ...)                     # list existing configured filesystems, and their config
                                             # Should it also somehow list INCOMING zrep-synced stuff?
                                             # or use a separate option for that? Possibly -L
zrep -s/status (ZFSfs) ?
zrep clear ZFSfs                             # clear all configured replication for that fs
zrep clear ZFSfs remotehost                  # clear configs for just that remotehost
zrep failover ZFSfs@snapname                 # Changes sync direction to non-master
                                             # Can be run from EITHER side? or should it be context-sensitive?

Initial concept of failover:
Ensure, first of all, that the snapshot exists on both sides (should it allow hand-created snapshots?). Then configure the snapshot on the non-master side, with proper naming/properties. Rename the snapshot pair to reflect the new direction. REMOVE other snapshots for the old outgoing direction. At completion of this operation, there will be only 1 zrep-recognized snapshot on either side, which will serve as the initial point of sync.

###
# zrep fs properties
#
# zrep:dest-fs    where this gets zsent to
#
# zrep:lock ?     no, use zfs hold instead?
#
###
# snapshot format:
#
# fs@zrep_host1_host2_#seq#
# fs@zrep_host1_host2_#seq#_sent
#
# A snapshot will be one or the other of the above.
# Once a snapshot has been successfully copied, it should be auto-renamed,
# so you can know, without seeing the other side, whether something has been
# synced.
# After initialization, when normal operations have started, there should
# always be at least TWO snapshots:
# the latest full, and the most recently sent incremental.
# There can also be some number of "just in case" incrementals.
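To make the property-based configuration idea above a little more concrete, here is a hedged sketch using ordinary zfs user properties; the property name follows the draft, everything else is a placeholder:

  # record the replication destination for this filesystem as a user property
  zfs set zrep:dest-fs=remotepool/remotefs pool/myfs
  # read it back when deciding where to send
  zfs get -H -o value zrep:dest-fs pool/myfs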
[zfs-discuss] reliable, enterprise worthy JBODs?
So, another hardware question :)

ZFS has been touted as taking maximal advantage of disk hardware, to the point where it can be used efficiently and cost-effectively on JBODs, rather than having to throw more expensive RAID arrays at it. Only trouble is... JBODs seem to have disappeared :( Sun/Oracle has discontinued its J4000 line, with no replacement that I can see. IBM seems to have some nice-looking hardware in the form of its EXP3500 expansion trays... but they only support it connected to an IBM (SAS) controller... which is only supported when plugged into IBM server hardware :(

Any other suggestions for (large-)enterprise-grade, supported JBOD hardware for ZFS these days? Either fibre or SAS would be okay.
[zfs-discuss] How well does zfs mirror handle temporary disk offlines?
Sorry if this is well known... I tried a bunch of googles, but didn't get anywhere useful. The closest I came was http://mail.opensolaris.org/pipermail/zfs-discuss/2009-April/028090.html but that doesn't answer my question, below, regarding zfs mirror recovery.

Details of our needs follow. We normally are very into redundancy. Pretty much all our SAN storage is dual-ported, along with all our production hosts: two completely redundant paths to storage, two independent SANs. However, now we are encountering a need for tier 3 storage, a.k.a. "not that important, we're going to go cheap on it" ;-) That being said, we'd still like to make it as reliable and robust as possible. So I was wondering just how robust it would be to do ZFS mirroring across the 2 SANs.

My specific question is: how easily does ZFS handle *temporary* SAN disconnects to one side of the mirror? What if the outage is only 60 seconds? 3 minutes? 10 minutes? An hour? If we have 2 x 1TB drives in a simple zfs mirror and one side goes temporarily offline, will zfs attempt to resync **1 TB** when it comes back? Or does it have enough intelligence to say "oh hey, I know this disk... and I know [these bits] are still good, so I just need to resync [that bit]"?
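For reference, a hedged sketch of the configuration being asked about (pool and device names are placeholders, one LUN from each SAN):

  # mirror one LUN from SAN A against one LUN from SAN B
  zpool create tier3 mirror c3t0d0 c4t0d0
  # after a temporary outage on the SAN B path, tell zfs the device is usable again
  zpool online tier3 c4t0d0
  # and watch the resilver
  zpool status tier3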
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon wrote:
> ZFS's ability to handle short-term interruptions depends heavily on the underlying device driver. If the device driver reports the device as dead/missing/etc at any point, then ZFS is going to require a zpool replace action before it re-accepts the device. If the underlying driver simply stalls, then it's more graceful (and no user interaction is required).
> As far as what the resync does: ZFS does smart resilvering, in that it compares what the good side of the mirror has against what the bad side has, and only copies the differences over to sync them up.

Hmm. Well, we're talking fibre, so we're very concerned with the recovery mode when the fibre drivers have marked the device as failed (except it hasn't really failed; we've just had a switch drop out).

I THINK what you are saying is that we could, in this situation, do:

  zpool replace (old drive) (new drive)

and then your "smart recovery" should do the limited resilvering only, even for potentially long outages. Is that what you are saying?
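To spell out what I mean, a hedged sketch (pool and device names are placeholders; a single argument to zpool replace means "replace the device with itself in the same location"):

  # ask zfs to re-accept the device the fibre driver had marked as failed
  zpool replace tier3 c4t0d0
  # watch the resilver progress
  zpool status -v tier3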
Re: [zfs-discuss] Liveupgrade'd to U8 and now can't boot previous U6 BE :(
Quote (cindys):
> 3. Boot failure from a previous BE if either #1 or #2 failure occurs.

#1 or #2 were not relevant in my case. I just found I could not boot into the old U7 BE. I am happy with the workaround, as shinsui points out, so this is purely for your information.

Quote (renil82):
> U7 did not encounter such problems.

My problem occurred going from LU on U7 to U8. Again, this is only for information purposes, as the workaround is sufficient.
Re: [zfs-discuss] Liveupgrade'd to U8 and now can't boot previous U6 BE :(
Same problem here on a Sun x2100 amd64.

I started with a core installation of U7, with the only patches applied as outlined in live upgrade doco 206844 ( http://sunsolve.sun.com/search/document.do?assetkey=1-61-206844-1 ). Also, as stated in the doco: pkgrm SUNWlucfg SUNWluu SUNWlur, and then from the 10/09 dvd: pkgadd -d SUNWlucfg SUNWlur SUNWluu

More info in the attached zfsinfo.txt:

Last login: Fri Oct 16 14:47:14 2009 from 192.168.1.64
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
[phi...@unknown] [3:16pm] [~] zpool status
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s0  ONLINE       0     0     0
            c0t1d0s0  ONLINE       0     0     0

errors: No known data errors

[phi...@unknown] [3:17pm] [~] # lufslist -n s10x_u7wos_08
               boot environment name: s10x_u7wos_08

Filesystem                fstype  device size    Mounted on    Mount Options
------------------------  ------  -------------  ------------  -------------
/dev/zvol/dsk/rpool/swap  swap        536870912  -             -
rpool/ROOT/s10x_u7wos_08  zfs         522009600  /             -
rpool                     zfs      155414159360  /rpool        -
rpool/export              zfs      152577344512  /export       -
rpool/export/home         zfs      152577325056  /export/home  -

[phi...@unknown] [3:17pm] [~] # luactivate s10x_u7wos_08
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE sol-10-u8-x86
Setting failsafe console to ttya.
Generating boot-sign for ABE s10x_u7wos_08
Generating partition and slice information for ABE s10x_u7wos_08
Copied boot menu from top level dataset.
Generating multiboot menu entries for PBE.
Generating multiboot menu entries for ABE.
Disabling splashimage
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**
The target boot environment has been activated. It will be used when you
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You
MUST USE either the init or the shutdown command when you reboot. If you
do not use either init or shutdown, the system will not boot using the
target BE.
**

In case of a failure while booting to the target BE, the following process
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris
   Install CD or Network.
2. Mount the Parent boot environment root slice to some directory (like /mnt).
   You can use the following command to mount:
   mount -Fzfs /dev/dsk/c0t0d0s0 /mnt
3. Run the luactivate utility without any arguments from the Parent boot
   environment root slice, as shown below:
   /mnt/sbin/luactivate
4. luactivate activates the previous working boot environment and indicates
   the result.
5. Exit Single User mode and reboot the machine.

**

Modifying boot archive service
Propagating findroot GRUB for menu conversion.
File /etc/lu/installgrub.findroot propagation successful
File /etc/lu/stage1.findroot propagation successful
File /etc/lu/stage2.findroot propagation successful
File /etc/lu/GRUB_capability propagation successful
Deleting stale GRUB loader from all BEs.
File /etc/lu/installgrub.latest deletion successful
File /etc/lu/stage1.latest deletion successful
File /etc/lu/stage2.latest deletion successful
Activation of boot environment s10x_u7wos_08 successful.
[phi...@unknown] [3:17pm] [~] # lufslist -n s10x_u7wos_08
               boot environment name: s10x_u7wos_08
This boot environment will be active on next system boot.

Filesystem                fstype  device size    Mounted on    Mount Options
------------------------  ------  -------------  ------------  -------------
/dev/zvol/dsk/rpool/swap  swap        536870912  -             -
rpool/ROOT/s10x_u7wos_08  zfs         522009600  /             -
rpool                     zfs      155414215168  /rpool        -
rpool/export              zfs      152577344512  /export       -
rpool/export/home         zfs      152577325056  /export/home  -

[phi...@unknown] [3:18pm] [~] # lustatus
Boot Environment           Is       Active Active    Can    Copy
Name
Re: [zfs-discuss] questions on zfs send,receive,backups
> If I'm interpreting correctly, you're talking about a couple of features, neither of which is in ZFS yet, ...
> 1. The ability to restore individual files from a snapshot, in the same way an entire snapshot is restored - simply using the blocks that are already stored.
> 2. The ability to store (and restore from) snapshots on external media.

Those sound useful, particularly the ability to restore a single file, even if it was only from a full send instead of a snapshot. But I don't think that's what I'm asking for :-) Let me try again.

Let's say that you have a mega source tree, in one huge zfs filesystem (let's say, the entire ON distribution or something :-). Let's say that you had a full zfs send done Nov 1st. Then, between then and today, there were assorted things done to the source tree. Major things. Things that people suddenly realized were bad. But they weren't sure exactly how or why; they just knew things worked Nov 1st, but are broken now. Pretend there's no such thing as tags, etc.

So: they want to get things up and running, maybe even only in read-only mode, from the Nov 1st full send. But they also want to take a look at the changes. And they want to do it in a very space-efficient manner.

It would be REALLY REALLY NICE to be able to take a full send of /zfs/srctree, and restore it to /zfs/srctree@nov1, or something like that. Given that [making up numbers] out of 1 million src files only 1000 have changed, it would be really nice to have those 999,000 files that have NOT changed not be doubly allocated in both /zfs/srctree and /zfs/srctree@nov1. They would actually be hardlinked/snapshot-duped/whatever the terminology is.

I guess you might refer to what I'm talking about as taking a "synthetic snapshot". Kind of like how veritas backup, etc. can synthesize full dumps from a sequence of full + incrementals, and then write out a real full dump onto a single tape, as if a full dump happened on the date of a particular incremental. Except that in what I'm talking about for zfs, it would be synthesizing a zfs snapshot of a filesystem, as of when the full zsend was made (even though the original snapshot has since been deleted).
Re: [zfs-discuss] questions on zfs send,receive,backups
> Ok, I think I understand. You're going to be told that ZFS send isn't a backup (and for these purposes I definitely agree), ...

Hmph. Well, even for 'replication'-type purposes, what I'm talking about is quite useful.

Picture two remote systems which happen to have mostly identical data. Perhaps they were manually synced at one time with tar, or something. Now the company wants to bring them both into full sync... but first analyze the small differences that may be present. In that scenario, it would be very useful to be able to do the following:

  hostA# zfs snapshot /zfs/prod@A
  hostA# zfs send /zfs/prod@A | ssh hostB zfs receive /zfs/prod@A
  hostB# diff -r /zfs/prod /zfs/prod/.zfs/snapshot/A > /tmp/prod.diffs

One could otherwise find files that are different with rsync -avn. But doing it with zfs in this way adds value, by allowing you to locally compare old and new files on the same machine, without having to do some ghastly manual copy of each different file to a new place and doing the compare there.
Re: [zfs-discuss] questions on zfs send,receive,backups
relling wrote:
> This question makes no sense to me. Perhaps you can rephrase?

To take a really obnoxious case: let's say I have a filesystem with 1 gigabyte of data on it, and 1.5 gigabytes of physical disk allocated to it (so it is 66% full). It has 10 x 100meg files in it.

Something bad happens, and I need to do a restore. The most recent zsend data has all 10 files in it. 9 of them have not been touched since the zsend was done. Now, since zfs has data integrity checks, yadda yadda yadda, it should be able to determine relatively easily that "the file in the zfs send is the exact same file on disk."

So, when I do a zfs receive, it would be really nice if there were some way for zfs to figure out, let's say, "receive to a snapshot of the filesystem," and then take advantage of the fact that it is a snapshot to NOT write to disk the 9 unaltered files that are in the snapshot: just allocate for the altered one.

It would be really nice for zfs to have the smarts to do this WITHOUT having to potentially throw a laaarge amount of extra hard disk space at snapshots. I want the snapshot space to be allocated on TAPE, not hard disk, if you see what I mean. If one 100meg file gets replaced every 2 days, I wouldn't want to use snapshots on the filesystem if there were a disk space limitation.

(I know there are solutions such as samfs for this, but I'm looking for a zfs solution, if possible, please?)

And help with the other parts of my original email would still be appreciated :)
Re: [zfs-discuss] questions on zfs send,receive,backups
> So, when I do a zfs receive, it would be really nice if there were some way for zfs to figure out, let's say, "receive to a snapshot of the filesystem," and then take advantage of the fact that it is a snapshot to NOT write to disk the 9 unaltered files that are in the snapshot: just allocate for the altered one.

To follow up on my own question a bit :-)

I would presume that the requirement that incrementals MUST have a snapshot in common with the target zfs filesystem being restored to is basically just a shortcut that guarantees files are identical, without having to do any actual calculation.

What about some kind of rsync-like capability, though? To have zfs receive be able to judge sameness by "well, the timestamps and file sizes are identical: treat them as identical!" without a common snapshot.

And, for the truly paranoid, have a binary-compare option, where it says "hmm... timestamp and file sizes are the same... they MIGHT be identical... lemme read from disk, and compare with what I'm reading from the zfs send stream. If I find a difference, then write it as a new file. Otherwise, just create [hardlink/whatever] in the destination receive snapshot, since they really are the same!"
Re: [zfs-discuss] questions on zfs send,receive,backups
> Ah, there is a cognitive disconnect... more below.
> The cognitive disconnect is that snapshots are blocks, not files. Therefore, the snapshot may contain only changed portions of files, and blocks from a single file may be spread across many different snapshots.

I was referring to restoring TO a snapshot. However, I didn't mandate that the incoming stream WAS a snapshot :-}

Your point about snapshots being blocks, not files, is well taken. However, the limitation that a receive of a full send can only be done to an automatically created new filesystem is overly burdensome. Wouldn't it be more useful if it had the capability to restore to a newly created snapshot of an existing zfs filesystem, rsync style?

Thanks for the ADM reference. I'll check that out.
Re: [zfs-discuss] ZFS and Storage
Roch wrote:
> And, if the load can accommodate a reorder, to get top per-spindle read-streaming performance, a cp(1) of the file should do wonders on the layout.

But there may not be filesystem space for double the data. Sounds like there is a need for a zfs-defragment-file utility, perhaps? Or, if you want to be politically cagey about the naming choice, perhaps zfs-seq-read-optimize-file? :-)
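For reference, a hedged sketch of the copy-based re-layout Roch describes (file names are placeholders, and it obviously needs enough free space for a second copy while it runs):

  # rewrite the file's blocks by copying it, then swap the copy into place
  cp -p /tank/video/bigfile /tank/video/bigfile.relayout
  mv /tank/video/bigfile.relayout /tank/video/bigfile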
Re: [Fwd: Re: [zfs-discuss] Re: disk write cache, redux]
Dana H. Myers wrote:
> Phil Brown wrote:
>> hmm. well I hope sun will fix this bug, and add in the long-missing write_cache control for regular ata drives too.
> Actually, I believe such ata drives by default enable the write cache.

Some do, some don't. Regardless, the toggle functionality belongs in the ata driver as well as the scsi driver.
Re: [zfs-discuss] Re: disk write cache, redux
Roch wrote:
> Check here:
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/vdev_disk.c#157

Distilled version:

  vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
  /* ... */
        /*
         * If we own the whole disk, try to enable disk write caching.
         * We ignore errors because it's OK if we can't do it.
         */

Which to me implies: when a disk pool is mounted/created, enable the write cache (and presumably leave it on indefinitely).

The interesting thing is, dtrace with

  fbt::ldi_ioctl:entry { printf("ldi_ioctl called with %x\n", args[1]); }

says that some kind of ldi_ioctl IS called when I create a test zpool with these sata disks. The specific ioctls called would seem to be 0x422, 0x425 and 0x42a, and I believe DKIOCSETWCE is 0x425.

HOWEVER... checking with format -e on those disks says that write cache is NOT ENABLED after this happens. And interestingly, if I augment the dtrace with

  fbt::sata_set_cache_mode:entry, fbt::sata_init_write_cache_mode:entry { printf("%s called\n", probefunc); }

the sata-specific set-cache routines are NOT getting called. According to dtrace, anyway?
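For anyone who wants to repeat the experiment, this is roughly how the probes above can be combined into a single invocation (nothing here beyond the probes already shown; run it while creating the test pool):

  dtrace -q \
    -n 'fbt::ldi_ioctl:entry { printf("ldi_ioctl called with %x\n", args[1]); }' \
    -n 'fbt::sata_set_cache_mode:entry, fbt::sata_init_write_cache_mode:entry { printf("%s called\n", probefunc); }'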
[zfs-discuss] Re: disk write cache, redux
I previously wrote about my scepticism regarding the claims that zfs selectively enables and disables the write cache, to improve throughput over the usual solaris defaults. I posted my observations that this did not seem to be happening in any meaningful way, for my zfs, on build nv33. I was told, "oh, you just need the more modern drivers."

Well, I'm now running S10u2, with SUNWzfsr 11.10.0,REV=2006.05.18.01.46, and I don't see much of a difference. By default, iostat shows the disks grinding along at 10MB/sec during the transfer. However, if I manually enable write_cache on the drives (SATA drives, FWIW), the drive throughput zips up to 30MB/sec during the transfer.

Test case:

  # zpool status philpool
    pool: philpool
   state: ONLINE
   scrub: none requested
  config:

          NAME        STATE     READ WRITE CKSUM
          philpool    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0

  # dd if=/dev/zero of=/philpool/testfile bs=256k count=1
  # [run iostat]

The wall clock time for the i/o to quiesce is as expected: without write cache manually enabled, it takes 3 times as long to finish as with it enabled (1:30 vs 30sec).

[Approximately a 2 gig file is generated. A side note of interest to me is that, in both cases, the dd returns to the user relatively quickly, but the write goes on for quite a long time in the background... without apparently reserving 2 gigabytes of extra kernel memory, according to swap -s.]
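For completeness, this is roughly how I flip the write cache by hand; a sketch of the interactive format session, and the exact menu wording may vary between releases:

  # format -e            (expert mode; pick the disk from the menu)
  format> cache
  cache> write_cache
  write_cache> display
  write_cache> enable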
Re: [zfs-discuss] New Feature Idea: ZFS Views ?
Nicolas Williams wrote:
> ... Also, why shouldn't lofs grow similar support?

Aha! This to me sounds much, much better. Put all the funky, potentially disastrous code in lofs, not in zfs, please :-) Plus, that way any filesystem will potentially get the benefit of views.
[zfs-discuss] disk write cache, redux
Hi folks... I've just been exposed to zfs directly, since I'm trying it out on a certain 48-drive box with 4 cpus :-)

I read in the archives the recent hard drive write cache thread, in which someone at Sun made the claim that zfs takes advantage of the disk write cache, selectively enabling and disabling it. However, that does not seem to be at all true on the system I am testing on (or if it does, it isn't doing it in any kind of effective way).

SunOS test-t[xx](ahem) 5.11 snv_33 i86pc i386 i86pc

On the following RAIDZ pool:

  # zpool status rzpool
    pool: rzpool
   state: ONLINE
   scrub: none requested
  config:

          NAME         STATE     READ WRITE CKSUM
          rzpool       ONLINE       0     0     0
            raidz      ONLINE       0     0     0
              c0t4d0   ONLINE       0     0     0
              c0t5d0   ONLINE       0     0     0
              c1t4d0   ONLINE       0     0     0
              c1t5d0   ONLINE       0     0     0
              c5t4d0   ONLINE       0     0     0
              c5t5d0   ONLINE       0     0     0
              c9t4d0   ONLINE       0     0     0
              c9t5d0   ONLINE       0     0     0
              c10t4d0  ONLINE       0     0     0
              c10t5d0  ONLINE       0     0     0

write performance for large files appears to top out at around 15-20MB/sec, according to zpool iostat.

However, when I manually enable write cache on all the drives involved, performance for the pathological case of

  dd if=/dev/zero of=/rzpool/testfile bs=128k

jumps to 40-60MB/sec (with an initial spike to 80MB/sec; I was very disappointed to see that was not sustained ;-) ). This kind of performance differential also shows up with real load, doing a tar | tar copy of large video files over NFS to the filesystem.

As a comparison, a single disk's dd write performance is around 6MB/sec with no cache, and 30MB/sec with write cache enabled. So the 40-50MB/sec result is kind of disappointing with a **10** disk pool.

Comments?