Re: [zfs-discuss] clones bound too tightly to their origin

2008-01-08 Thread eric kustarz
This should work just fine with the latest bits (Nevada build 77 and later) via:
http://bugs.opensolaris.org/view_bug.do?bug_id=6425096

Its backport is currently targeted for an early build of s10u6.
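
For reference, a quick way to see whether a machine is already on such a
build (assuming the usual Nevada build strings) is the release banner:

    cat /etc/release
    uname -v        # e.g. "snv_77" on Nevada build 77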

eric

On Jan 8, 2008, at 7:13 AM, Andreas Koppenhoefer wrote:

> [I apologise for reposting this... but no one replied to my post
> from Dec 4th.]
>
> Hello all,
>
> while experimenting with "zfs send" and "zfs receive" mixed with
> cloning on the receiver side I found the following...
>
> On server A there is a zpool with snapshots created on a regular
> basis via cron.
> Server B gets updated by a zfs-send-ssh-zfs-receive command pipe.
> Both servers are running Solaris 10 update 4 (08/07).
>
> Sometimes I want to do some testing on server B without corrupting
> data on server A. To do so I create a clone of the filesystem.
> Up to here everything is ok. As long as the mounted clone filesystem
> is NOT busy, any further zfs-send-ssh-zfs-receive will work
> properly, updating my pool on B.
>
> But there are some long-running test jobs on server B which keep
> the clone's filesystem busy, or just a single login shell with its
> cwd within the clone's filesystem, which makes the filesystem busy
> from umount's point of view.
>
> Meanwhile another zfs-send-ssh-zfs-receive command gets launched to
> copy a new snapshot from A to B. If the receiving pool of a
> zfs-receive command has busy clones, the receive command will fail.
> For some unknown reason the receive command tries to umount my
> cloned filesystem and fails with "Device busy".
>
> The question is: why?
>
> Since the clone is (or should be) independent of its origin, "zfs
> receive" should not umount cloned data of older snapshots.
>
> If you want to reproduce this, below (and attached) you will find a
> simple test script. The script will bail out at the last zfs-receive
> command. If you comment out the line "cd /mnt/copy", the script will
> run as expected.
>
> Disclaimer: Before running my script, make sure you do not have
> zpools named "copy" or "origin". Use this script only on a test
> machine! Use it at your own risk.
> Here is the script:
> 
> #!/usr/bin/bash
> # cleanup before test
> cd /
> set -ex
> for pool in origin copy; do
>     zpool destroy $pool || :
>     rm -f /var/tmp/zpool.$pool
>     [ -d /mnt/$pool ] && rmdir /mnt/$pool
>
>     mkfile -nv 64m /var/tmp/zpool.$pool
>     zpool create -m none $pool /var/tmp/zpool.$pool
> done
>
> zfs create -o mountpoint=/mnt/origin origin/test
>
> update () {
>     # create/update a log file
>     date >>/mnt/origin/log
> }
>
> snapnum=0
> make_snap () {
>     snapnum=$(($snapnum+1))
>     zfs snapshot origin/test@$snapnum
> }
>
> update
> make_snap
> update
> make_snap
> update
> make_snap
> update
>
> zfs send origin/test@1 | zfs receive -v -d copy
> zfs clone copy/test@1 copy/clone
> zfs send -i origin/test@1 origin/test@2 | zfs receive -v -d copy
> zfs set mountpoint=/mnt/copy copy/clone
> zfs list -r origin copy
> cd /mnt/copy    # make filesystem busy
> zfs send -i origin/test@2 origin/test@3 | zfs receive -v -d copy
> ls -l /mnt/{origin,copy}/log
> exit
> 
> -
> Cleanup with
> 
> zpool destroy copy; zpool destroy origin; rm /var/tmp/zpool.*
> 
> after running tests.
>
> - Andreas
>
>
> This message posted from opensolaris.org
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS on OS X port now on macosforge

2008-01-08 Thread Noël Dellofano
Hey everyone,

This is just a quick announcement to say that the ZFS on OS X port is  
now posted for your viewing fun at:

http://zfs.macosforge.org/

The page is also linked off of the ZFS Open Solaris page under "ZFS  
Ports":
http://opensolaris.org/os/community/zfs/porting/

This page holds the status of the ZFS on OS X port and includes a  
small FAQ, some known bugs, and announcements, and will include more as  
time goes on.  It also holds the latest source code and binaries that  
you can download to your heart's content.  So if you have a Mac, are  
running Leopard, and are feeling bleeding edge, please try it out.   
Comments, questions, suggestions and feedback are all very welcome.
I also want to point out that this is BETA.  We're still working on getting  
some features going, as well as fleshing out issues with Finder, Disk  
Utility, iTunes, and other parts of the system.  So when I say bleeding,  
I'm not kidding :)  However, I'm excited to say that I'm happily  
running ZFS as my home directory on my MacBook Pro, which is what I  
work off of every day, and am taking weekly snapshots which I 'zfs  
send' to my external drive.  Oh happy day.
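
The weekly routine is roughly the following (the dataset and pool names
are made up for illustration; the external drive here holds a pool
called extpool):

    zfs snapshot home/noel@weekly-20080108
    zfs send -i home/noel@weekly-20080101 home/noel@weekly-20080108 | \
        zfs receive -d extpool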

thanks!
Noel Dellofano
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
> But it seems that when we're talking about full block writes (such as
> sequential file writes) ZFS could do a bit better.
>
> And as long as there is bandwidth left to the disk and the controllers,
> it is difficult to argue that the work is redundant.  If it's free in
> that sense, it doesn't matter whether it is redundant.  But if it turns
> out NOT to have been redundant you save a lot.
> 

I think this is why an adaptive algorithm makes sense ... in situations where
an application issues frequent, progressive small writes, the amount
of redundant disk access can be significant, and longer consolidation times
may make sense ... larger writes (>= the FS block size) would benefit less 
from longer consolidation times, and shorter thresholds could provide more
usable bandwidth

to get a sense of the issue here, I've done some write testing to previously
written files in a ZFS file system, and the choice of write element size
shows some big swings in actual vs data-driven bandwidth

when I launch a set of threads, each of which writes 4KB buffers 
sequentially to its own file, I observe that for 60GB of application 
writes the disks see 230+GB of IO (reads and writes): 
data-driven BW =~ 41 MB/Sec (my 60GB in ~1500 Sec)
actual BW =~ 157 MB/Sec (the 230+GB in ~1500 Sec)

if I do the same writes with 128KB buffers (the block size of my pool),
the same 60GB of writes only generates 95GB of disk IO (reads and writes):
data-driven BW =~ 85 MB/Sec (my 60GB in ~700 Sec)
actual BW =~ 134.6 MB/Sec (the 95+GB in ~700 Sec)
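
the arithmetic behind these numbers is simple enough to script ... the
helper below assumes 1GB = 1024MB and uses the rounded times above, so
its (integer) output only approximates the figures quoted:

    #!/usr/bin/bash
    # bw <app_GB> <total_disk_GB> <seconds>: data-driven vs actual bandwidth
    bw () {
        echo "data-driven =~ $(( $1 * 1024 / $3 )) MB/Sec, actual =~ $(( $2 * 1024 / $3 )) MB/Sec"
    }
    bw 60 230 1500    # 4KB-buffer run:   ~40 vs ~157 MB/Sec
    bw 60  95  700    # 128KB-buffer run: ~87 vs ~138 MB/Sec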

in the first case, longer consolidation times would have led to less total IO
and better data-driven BW, while in the second case shorter consolidation
times would have worked better

as far as redundant writes possibly occupying free bandwidth (and thus
costing nothing), I think you also have to consider the related costs of
additional block scavenging, and less available free space at any specific 
instant, possibly limiting the sequentiality of the next write ... of
course there's also the additional device stress

in any case, I agree with you that ZFS could do a better job in this area,
but it's not as simple as just looking for large or small IOs ...
sequential vs random access patterns also play a big role (as you point out)

I expect  (hope) the adaptive algorithms will mature over time, eventually
providing better behavior over a broader set of operating conditions
... Bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] clones bound too tightly to their origin

2008-01-08 Thread Andreas Koppenhoefer
[I apologise for reposting this... but no one replied to my post from Dec 4th.]

Hello all,

while experimenting with "zfs send" and "zfs receive" mixed with cloning on the 
receiver side I found the following...

On server A there is a zpool with snapshots created on a regular basis via cron.
Server B gets updated by a zfs-send-ssh-zfs-receive command pipe. Both servers 
are running Solaris 10 update 4 (08/07).
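
The command pipe itself looks roughly like this (the pool, dataset and host 
names here are only placeholders):

    zfs send -i tank/data@2007-12-03 tank/data@2007-12-04 | \
        ssh serverB zfs receive -v -d copy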

Sometimes I want to do some testing on server B without corrupting data on 
server A. To do so I create a clone of the filesystem. Up to here everything 
is ok. As long as the mounted clone filesystem is NOT busy, any further 
zfs-send-ssh-zfs-receive will work properly, updating my pool on B.

But there are some long-running test jobs on server B which keep the clone's 
filesystem busy, or just a single login shell with its cwd within the clone's 
filesystem, which makes the filesystem busy from umount's point of view.

Meanwhile another zfs-send-ssh-zfs-receive command gets launched to copy a new 
snapshot from A to B. If the receiving pool of a zfs-receive command has busy 
clones, the receive command will fail.
For some unknown reason the receive command tries to umount my cloned 
filesystem and fails with "Device busy".

The question is: why?

Since the clone is (or should be) independent of its origin, "zfs receive" 
should not umount cloned data of older snapshots.
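
For what it's worth, the linkage is visible on the receiving side: a clone 
carries an "origin" property naming the snapshot it was created from (dataset 
names as in the test script below):

    zfs get origin copy/clone    # the snapshot the clone still depends on
    zfs list -r copy             # the received snapshots and the clone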

If you want to reproduce this, below (and attached) you will find a simple test 
script. The script will bail out at the last zfs-receive command. If you 
comment out the line "cd /mnt/copy", the script will run as expected.

Disclaimer: Before running my script, make sure you do not have zpools named 
"copy" or "origin". Use this script only on a test machine! Use it at your own 
risk.
Here is the script:

#!/usr/bin/bash
# cleanup before test
cd /
set -ex
for pool in origin copy; do
    zpool destroy $pool || :
    rm -f /var/tmp/zpool.$pool
    [ -d /mnt/$pool ] && rmdir /mnt/$pool

    mkfile -nv 64m /var/tmp/zpool.$pool
    zpool create -m none $pool /var/tmp/zpool.$pool
done

zfs create -o mountpoint=/mnt/origin origin/test

update () {
    # create/update a log file
    date >>/mnt/origin/log
}

snapnum=0
make_snap () {
    snapnum=$(($snapnum+1))
    zfs snapshot origin/test@$snapnum
}

update
make_snap
update
make_snap
update
make_snap
update

zfs send origin/test@1 | zfs receive -v -d copy
zfs clone copy/test@1 copy/clone
zfs send -i origin/test@1 origin/test@2 | zfs receive -v -d copy
zfs set mountpoint=/mnt/copy copy/clone
zfs list -r origin copy
cd /mnt/copy    # make filesystem busy
zfs send -i origin/test@2 origin/test@3 | zfs receive -v -d copy
ls -l /mnt/{origin,copy}/log
exit

-
Cleanup with

zpool destroy copy; zpool destroy origin; rm /var/tmp/zpool.*

after running tests.

- Andreas
 
 
This message posted from opensolaris.org

test-zfs-clone.sh
Description: Bourne shell script
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Casper . Dik


>consolidating these writes in host cache eliminates some redundant disk
>writing, resulting in more productive bandwidth ... providing some ability to
>tune the consolidation time window and/or the accumulated cache size may
>seem like a reasonable thing to do, but I think that it's typically a moving
>target, and depending on an adaptive, built-in algorithm to dynamically set
>these marks (as ZFS claims it does) seems like a better choice


But it seems that when we're talking about full block writes (such as 
sequential file writes) ZFS could do a bit better.

And as long as there is bandwidth left to the disk and the controllers, it 
is difficult to argue that the work is redundant.  If it's free in that
sense, it doesn't matter whether it is redundant.  But if it turns out NOT
to have been redundant you save a lot.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
> I have a question that is related to this topic: Why
> is there only a (tunable) 5 second threshold and not
> also an additional threshold for the buffer size
> (e.g. 50MB)?
> 
> Sometimes I see my system writing huge amounts of
> data to a zfs, but the disks staying idle for 5
> seconds, although the memory consumption is already
> quite big and it really would make sense (from my
> uneducated point of view as an observer) to start
> writing all the data to disks. I think this leads to
> the pumping effect that has been previously mentioned
> in one of the forums here.
> 
> Can anybody comment on this?
> 
> TIA,
> Thomas

because ZFS always writes to a new location on the disk, premature writing
can often result in redundant work ... a single host write to a ZFS object
results in the need to rewrite all of the changed data and meta-data leading
to that object

if a subsequent follow-up write to the same object occurs quickly,
this entire path, once again, has to be recreated, even though only a small 
portion of it is actually different from the previous version

if both versions were written to disk, the result would be to physically write 
potentially large amounts of nearly duplicate information over and over
again, resulting in logically vacant bandwidth

consolidating these writes in host cache eliminates some redundant disk
writing, resulting in more productive bandwidth ... providing some ability to
tune the consolidation time window and/or the accumulated cache size may
seem like a reasonable thing to do, but I think that it's typically a moving
target, and depending on an adaptive, built-in algorithm to dynamically set
these marks (as ZFS claims it does) seems like a better choice

...Bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Thomas Maier-Komor
> 
> the ZIL is always there in host memory, even when no
> synchronous writes
> are being done, since the POSIX fsync() call could be
> made on an open 
> write channel at any time, requiring all to-date
> writes on that channel
> to be committed to persistent store before it returns
> to the application
> ... it's cheaper to write the ZIL at this point than
> to force the entire 5 sec
> buffer out prematurely
> 

I have a question that is related to this topic: Why is there only a (tunable) 
5 second threshold and not also an additional threshold for the buffer size 
(e.g. 50MB)?

Sometimes I see my system writing huge amounts of data to a ZFS filesystem, but 
the disks staying idle for 5 seconds, although the memory consumption is already 
quite big and it really would make sense (from my uneducated point of view as an 
observer) to start writing the data to the disks. I think this leads to the 
pumping effect that has been previously mentioned in one of the forums here.
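
For what it's worth, the bursty pattern is easy to watch while such a write is 
running (the pool name is just a placeholder):

    zpool iostat tank 1    # long idle intervals followed by large write bursts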

Can anybody comment on this?

TIA,
Thomas
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does block allocation for small writes work over iSCSI?

2008-01-08 Thread Andre Wenas

Although it looks possible, it would be a very complex architecture.

If you can wait, please explore pNFS: 
http://opensolaris.org/os/project/nfsv41/


What is pNFS?

   * The pNFS protocol allows us to separate an NFS file system's data
     and metadata paths. With a separate data path we are free to lay
     file data out in interesting ways, like striping it across multiple
     different file servers. For more information, see the NFSv4.1
     specification.



Gilberto Mautner wrote:

Hello list,
 
 
I'm thinking about this topology:
 
NFS Client <---NFS---> zFS Host <---iSCSI---> zFS Node 1, 2, 3 etc.
 
The idea here is to create a scalable NFS server by plugging in more 
nodes as more space is needed, striping data across them.
 
A question is: we know from the docs that zFS optimizes random write 
speed by consolidating what would be many random writes into a single 
sequential operation.
 
I imagine that for zFS to be able to do that it has to have some 
knowledge of the hard disk geometry. Now, if this geometry is 
being abstracted by iSCSI, is that optimization still valid?
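
At its simplest (and with made-up device names for the LUNs the nodes would 
export), the striped layer on the zFS host would just be a pool built from 
those iSCSI LUNs:

    # iSCSI initiator configuration omitted; once the node LUNs appear as
    # local devices, a pool with no redundancy groups stripes across them
    zpool create tank c2t1d0 c3t1d0 c4t1d0    # one LUN per zFS node
    zfs create -o sharenfs=on tank/export     # served to the NFS client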
 
 
Thanks
 
Gilberto
 



  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss