Re: [zfs-discuss] clones bound too tightly to its origin
This should work just fine with the latest bits (Nevada 77 and later) via:

http://bugs.opensolaris.org/view_bug.do?bug_id=6425096

Its backport is currently targeted for an early build of s10u6.

eric

On Jan 8, 2008, at 7:13 AM, Andreas Koppenhoefer wrote:

> [I apologise for reposting this... but no one replied to my post
> from Dec 4th.]
>
> Hello all,
>
> while experimenting with "zfs send" and "zfs receive" mixed with
> cloning on the receiver side I found the following...
>
> [...]
>
> Meanwhile another zfs-send-ssh-zfs-receive command gets launched to
> copy a new snapshot from A to B. If the receiving pool of a
> zfs-receive command has busy clones, the receive command will fail.
> For some unknown reason the receive command tries to umount my
> cloned filesystem and fails with "Device busy".
>
> The question is: why?
>
> Since the clone is (or should be) independent of its origin, "zfs
> receive" should not umount cloned data of older snapshots.
>
> [...]
>
> - Andreas

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS on OS X port now on macosforge
Hey everyone,

This is just a quick announcement to say that the ZFS on OS X port is now posted for your viewing fun at:

http://zfs.macosforge.org/

The page is also linked off of the ZFS OpenSolaris page under "ZFS Ports":

http://opensolaris.org/os/community/zfs/porting/

This page holds the status for the ZFS on OS X port and includes a small FAQ, some known bugs, and announcements, and will include more as time goes on. It also holds the latest source code and binaries that you can download to your heart's content.

So if you have a Mac, are running Leopard, and are feeling bleeding edge, please try it out. Comments, questions, suggestions and feedback are all very welcome.

I also want to point out that this is BETA. We're still working on getting some features going, as well as fleshing out issues with Finder, Disk Utility, iTunes, and other parts of the system. So when I say bleeding, I'm not kidding :) However, I'm excited to say that I'm happily running ZFS as my home directory on my MacBook Pro, which is what I work off of every day, and am taking weekly snapshots which I 'zfs send' to my external drive. Oh happy day.

thanks!
Noel Dellofano
Re: [zfs-discuss] Intent logs vs Journaling
> But it seems that when we're talking about full block writes (such as
> sequential file writes) ZFS could do a bit better.
>
> And as long as there is bandwidth left to the disk and the controllers, it
> is difficult to argue that the work is redundant. If it's free in that
> sense, it doesn't matter whether it is redundant. But if it turns out NOT
> to have been redundant you save a lot.

I think this is why an adaptive algorithm makes sense ... in situations where frequent, progressive small writes are issued by an application, the amount of redundant disk access can be significant, and longer consolidation times may make sense ... larger writes (>= the FS block size) would benefit less from longer consolidation times, and shorter thresholds could provide more usable bandwidth.

To get a sense of the issue here, I've done some write testing to previously written files in a ZFS file system, and the choice of write element size shows some big swings in actual vs data-driven bandwidth.

When I launch a set of threads, each of which writes 4KB buffers sequentially to its own file, I observe that for 60GB of application writes, the disks see 230+GB of I/O (reads and writes):

  data-driven BW =~ 41 MB/sec  (my 60GB in ~1500 sec)
  actual BW      =~ 157 MB/sec (the 230+GB in ~1500 sec)

If I do the same writes with 128KB buffers (the block size of my pool), the same 60GB of writes only generates 95GB of disk I/O (reads and writes):

  data-driven BW =~ 85 MB/sec    (my 60GB in ~700 sec)
  actual BW      =~ 134.6 MB/sec (the 95+GB in ~700 sec)

In the first case, longer consolidation times would have led to less total I/O and better data-driven BW, while in the second case shorter consolidation times would have worked better.

As far as redundant writes possibly occupying free bandwidth (and thus costing nothing), I think you also have to consider the related costs of additional block scavenging, and less available free space at any specific instant, possibly limiting the sequentiality of the next write ... of course there's also the additional device stress.

In any case, I agree with you that ZFS could do a better job in this area, but it's not as simple as just looking for large or small IOs ... sequential vs random access patterns also play a big role (as you point out). I expect (hope) the adaptive algorithms will mature over time, eventually providing better behavior over a broader set of operating conditions.

... Bill

This message posted from opensolaris.org
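[A quick sanity check of the bandwidth figures above - this is just arithmetic on the reported numbers, treating a GB as 2^30 bytes and a MB as 2^20 bytes, and taking the quoted elapsed times at face value:]

```python
def mb_per_sec(gigabytes, seconds):
    """Convert a transfer of `gigabytes` (GiB) over `seconds` into MiB/sec."""
    return gigabytes * 1024 / seconds

# 4KB-buffer run: 60GB of application writes, 230GB of total disk I/O, ~1500 sec
print(round(mb_per_sec(60, 1500), 1))   # data-driven BW -> 41.0 MB/sec
print(round(mb_per_sec(230, 1500), 1))  # actual BW      -> 157.0 MB/sec

# 128KB-buffer run: 60GB of application writes, 95GB of total disk I/O, ~700 sec
print(round(mb_per_sec(60, 700), 1))    # data-driven BW -> 87.8 MB/sec
print(round(mb_per_sec(95, 700), 1))    # actual BW      -> 139.0 MB/sec
```

[The first pair matches the quoted ~41 and ~157 MB/sec exactly; the second pair comes out slightly above the quoted ~85 and ~134.6 MB/sec, which is consistent with the "~700 sec" being rounded down from an actual elapsed time closer to ~720 sec.]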
[zfs-discuss] clones bound too tightly to its origin
[I apologise for reposting this... but no one replied to my post from Dec 4th.]

Hello all,

while experimenting with "zfs send" and "zfs receive" mixed with cloning on the receiver side, I found the following...

On server A there is a zpool with snapshots created on a regular basis via cron. Server B gets updated by a zfs-send-ssh-zfs-receive command pipe. Both servers are running Solaris 10 update 4 (08/07).

Sometimes I want to do some testing on server B without corrupting data on server A. To do so, I create a clone of the filesystem. Up to here everything is OK. As long as the mounted clone filesystem is NOT busy, any further zfs-send-ssh-zfs-receive will work properly, updating my pool on B.

But there are some long-running test jobs on server B which keep the clone's filesystem busy, or just a single login shell with its cwd within the clone's filesystem, which makes the filesystem busy from umount's point of view.

Meanwhile another zfs-send-ssh-zfs-receive command gets launched to copy a new snapshot from A to B. If the receiving pool of a zfs-receive command has busy clones, the receive command will fail. For some unknown reason the receive command tries to umount my cloned filesystem and fails with "Device busy".

The question is: why?

Since the clone is (or should be) independent of its origin, "zfs receive" should not umount cloned data of older snapshots.

If you want to reproduce this - below (and attached) you find a simple test script. The script will bump out at the last zfs-receive command. If you comment out the line "cd /mnt/copy", the script will run as expected.

Disclaimer: Before running my script, make sure you do not have zpools named "copy" or "origin". Use this script only on a test machine! Use it at your own risk.
Here is the script:

#!/usr/bin/bash
# cleanup before test
cd /
set -ex
for pool in origin copy; do
    zpool destroy $pool || :
    rm -f /var/tmp/zpool.$pool
    [ -d /mnt/$pool ] && rmdir /mnt/$pool

    mkfile -nv 64m /var/tmp/zpool.$pool
    zpool create -m none $pool /var/tmp/zpool.$pool
done

zfs create -o mountpoint=/mnt/origin origin/test

update () {
    # create/update a log file
    date >>/mnt/origin/log
}

snapnum=0
make_snap () {
    snapnum=$(($snapnum+1))
    zfs snapshot origin/[EMAIL PROTECTED]
}

update
make_snap
update
make_snap
update
make_snap
update

zfs send origin/[EMAIL PROTECTED] | zfs receive -v -d copy
zfs clone copy/[EMAIL PROTECTED] copy/clone
zfs send -i origin/[EMAIL PROTECTED] origin/[EMAIL PROTECTED] | zfs receive -v -d copy
zfs set mountpoint=/mnt/copy copy/clone
zfs list -r origin copy
cd /mnt/copy    # make filesystem busy
zfs send -i origin/[EMAIL PROTECTED] origin/[EMAIL PROTECTED] | zfs receive -v -d copy
ls -l /mnt/{origin,copy}/log
exit

Clean up with

zpool destroy copy; zpool destroy origin; rm /var/tmp/zpool.*

after running tests.

- Andreas

This message posted from opensolaris.org

test-zfs-clone.sh
Description: Bourne shell script
Re: [zfs-discuss] Intent logs vs Journaling
> consolidating these writes in host cache eliminates some redundant disk
> writing, resulting in more productive bandwidth ... providing some ability to
> tune the consolidation time window and/or the accumulated cache size may
> seem like a reasonable thing to do, but I think that it's typically a moving
> target, and depending on an adaptive, built-in algorithm to dynamically set
> these marks (as ZFS claims it does) seems like a better choice

But it seems that when we're talking about full block writes (such as sequential file writes) ZFS could do a bit better.

And as long as there is bandwidth left to the disk and the controllers, it is difficult to argue that the work is redundant. If it's free in that sense, it doesn't matter whether it is redundant. But if it turns out NOT to have been redundant, you save a lot.

Casper
Re: [zfs-discuss] Intent logs vs Journaling
> I have a question that is related to this topic: Why is there only a
> (tunable) 5 second threshold and not also an additional threshold for
> the buffer size (e.g. 50MB)?
>
> Sometimes I see my system writing huge amounts of data to a zfs, but
> the disks staying idle for 5 seconds, although the memory consumption
> is already quite big and it really would make sense (from my uneducated
> point of view as an observer) to start writing all the data to disks.
> I think this leads to the pumping effect that has been previously
> mentioned in one of the forums here.
>
> Can anybody comment on this?

Because ZFS always writes to a new location on the disk, premature writing can often result in redundant work ... a single host write to a ZFS object results in the need to rewrite all of the changed data and metadata leading to that object.

If a subsequent follow-up write to the same object occurs quickly, this entire path, once again, has to be recreated, even though only a small portion of it is actually different from the previous version. If both versions were written to disk, the result would be to physically write potentially large amounts of nearly duplicate information over and over again, resulting in logically vacant bandwidth.

Consolidating these writes in host cache eliminates some redundant disk writing, resulting in more productive bandwidth ... providing some ability to tune the consolidation time window and/or the accumulated cache size may seem like a reasonable thing to do, but I think that it's typically a moving target, and depending on an adaptive, built-in algorithm to dynamically set these marks (as ZFS claims it does) seems like a better choice.

... Bill

This message posted from opensolaris.org
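[A toy model of the effect Bill describes - my own illustration, not ZFS code: in a copy-on-write block tree of some depth, every flush rewrites the changed data block plus all of its ancestor (indirect/meta) blocks, so flushing each small update separately multiplies the disk traffic that a consolidated flush would avoid:]

```python
def blocks_written(num_updates, flush_every, tree_depth=3):
    """Count total blocks written when updates to one object are flushed
    in groups of `flush_every`.

    Each flush rewrites the data block plus `tree_depth` ancestor blocks,
    regardless of how many cached updates the flush absorbs.
    """
    flushes = -(-num_updates // flush_every)  # ceiling division
    return flushes * (1 + tree_depth)

# 100 rapid small writes to the same object:
print(blocks_written(100, flush_every=1))    # flush every write -> 400 blocks
print(blocks_written(100, flush_every=100))  # one consolidated flush -> 4 blocks
```

[The tree depth and group sizes are made up for illustration; the point is only that the metadata path is paid per flush, not per write, which is why premature flushing produces "logically vacant bandwidth".]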
Re: [zfs-discuss] Intent logs vs Journaling
> the ZIL is always there in host memory, even when no synchronous writes
> are being done, since the POSIX fsync() call could be made on an open
> write channel at any time, requiring all to-date writes on that channel
> to be committed to persistent store before it returns to the application
> ... it's cheaper to write the ZIL at this point than to force the entire
> 5 sec buffer out prematurely

I have a question that is related to this topic: Why is there only a (tunable) 5 second threshold and not also an additional threshold for the buffer size (e.g. 50MB)?

Sometimes I see my system writing huge amounts of data to a zfs, but the disks staying idle for 5 seconds, although the memory consumption is already quite big and it really would make sense (from my uneducated point of view as an observer) to start writing all the data to disks. I think this leads to the pumping effect that has been previously mentioned in one of the forums here.

Can anybody comment on this?

TIA,
Thomas

This message posted from opensolaris.org
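[For what it's worth, the dual threshold Thomas is asking about can be sketched in a few lines - this is a hypothetical illustration of the proposal, not how ZFS schedules its transaction groups: flush whenever EITHER the age of the oldest buffered write OR the accumulated size crosses its limit.]

```python
import time

class WriteBuffer:
    """Toy write buffer with two flush triggers (hypothetical policy):
    an age threshold (e.g. 5 sec) and a size threshold (e.g. 50 MB)."""

    def __init__(self, max_age=5.0, max_bytes=50 * 1024 * 1024):
        self.max_age = max_age
        self.max_bytes = max_bytes
        self.buffered = 0       # bytes currently cached
        self.oldest = None      # timestamp of oldest unflushed write
        self.flushes = 0

    def write(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        if self.oldest is None:
            self.oldest = now
        self.buffered += nbytes
        # flush on whichever threshold is crossed first
        if now - self.oldest >= self.max_age or self.buffered >= self.max_bytes:
            self.flush()

    def flush(self):
        self.flushes += 1
        self.buffered = 0
        self.oldest = None

buf = WriteBuffer()
for _ in range(10):                 # ten quick 10MB writes, all at t=0
    buf.write(10 * 1024 * 1024, now=0.0)
print(buf.flushes)                  # size threshold fired twice -> 2
```

[With only the time threshold, all 100MB above would sit in memory until the 5 seconds expired - which is the burst-then-idle "pumping" behavior described in the question.]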
Re: [zfs-discuss] Does block allocation for small writes work over iSCSI?
Although it looks possible, it is a very complex architecture. If you can wait, please explore pNFS:

http://opensolaris.org/os/project/nfsv41/

What is pNFS?

* The pNFS protocol allows us to separate an NFS file system's data and metadata paths. With a separate data path we are free to lay file data out in interesting ways, like striping it across multiple different file servers.

For more information, see the NFSv4.1 specification.

Gilberto Mautner wrote:

> Hello list,
>
> I'm thinking about this topology:
>
> NFS Client <---> ZFS Host <---iSCSI---> ZFS Node 1, 2, 3 etc.
>
> The idea here is to create a scalable NFS server by plugging in more
> nodes as more space is needed, striping data across them.
>
> A question is: we know from the docs that ZFS optimizes random write
> speed by consolidating what would be many random writes into a single
> sequential operation. I imagine that for ZFS to be able to do that it
> has to have some knowledge about the hard disk geometry. Now, if this
> geometry is being abstracted by iSCSI, is that optimization still valid?
>
> Thanks
>
> Gilberto
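[The data/metadata split can be pictured with a toy layout function - this is my own sketch of the idea, not the pNFS protocol, and the server names and stripe unit are made up: a metadata service hands out a striped layout, and the client then moves block data directly against the data servers named in it.]

```python
DATA_SERVERS = ["node1", "node2", "node3"]  # hypothetical data-server names
STRIPE_UNIT = 128 * 1024                    # hypothetical 128KB stripe unit

def layout(file_size, servers=DATA_SERVERS, unit=STRIPE_UNIT):
    """Map each stripe unit of a file round-robin across the data servers.

    Returns (byte offset, server) pairs: the client sends each unit's
    I/O straight to its server, bypassing the metadata path entirely.
    """
    nunits = -(-file_size // unit)          # ceiling division
    return [(i * unit, servers[i % len(servers)]) for i in range(nunits)]

# A 300KB file spans three 128KB stripe units across the three nodes:
for offset, server in layout(300 * 1024):
    print(offset, server)
```

[This is the sense in which pNFS scales by "plugging in more nodes": adding a server to the list widens the stripe without changing the client's view of the namespace.]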