[zfs-discuss] oddity of slow zfs destroy

2012-06-25 Thread Philip Brown

I ran into something odd today:

zfs destroy -r  random/filesystem

is mind-bogglingly slow. But it seems to me that it shouldn't be.
It's slow because the filesystem has two snapshots on it. Presumably, it's
busy rolling back the snapshots.
But I've already declared, by my command line, that I DON'T CARE about the
contents of the filesystem!
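
(One way to confirm where the time goes, with hypothetical snapshot names, is
to time the pieces separately:)

  time zfs destroy random/filesystem@snap1
  time zfs destroy random/filesystem@snap2
  time zfs destroy random/filesystem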

Why doesn't zfs simply do:

1. unmount the filesystem, if possible (it was possible)
(1.5. possibly note the intent to delete somewhere in the pool records)
2. zero out/free the in-kernel memory in one go
3. update the pool: hey, I deleted the filesystem, all these blocks are now
clear



Having this kind of operation take more than 10 seconds seems like a huge bug
to me, yet it can take many minutes. An order of magnitude off. Yuck.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] checking/fixing busy locks for zfs send/receive

2012-03-16 Thread Philip Brown
It was suggested to me by Ian Collins that doing zfs sends and
receives can render a filesystem busy.

If there isn't a process visible doing this via ps, I'm wondering how
one might check whether a zfs filesystem or snapshot has been rendered busy in
this way, interfering with an unmount or destroy.

I'm also wondering whether this sort of thing can cause interference between
some combination of multiple sends/receives running at the same time on the
same filesystem.
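
(For what it's worth, the obvious checks I know of would be something like the
following; the dataset and snapshot names are hypothetical, and 'zfs holds'
only exists on releases new enough to have user holds:)

  fuser -c /pool/myfs                        # processes with files open under the mountpoint
  zfs holds pool/myfs@somesnap               # user holds on a particular snapshot
  zfs list -H -t snapshot -o name,userrefs   # hold counts across all snapshots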
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] checking/fixing busy locks for zfs send/receive

2012-03-16 Thread Philip Brown
On Fri, Mar 16, 2012 at 3:06 PM, Brandon High bh...@freaks.com wrote:
 On Fri, Mar 16, 2012 at 2:35 PM, Philip Brown p...@bolthole.com wrote:
 if there isnt a process visible doing this via ps, I'm wondering how
 one might check if a zfs filesystem or snapshot is rendered busy in
 this way, interfering with an unmount or destroy?

 I'm also wondering if this sort of thing can mean interference between
 some combination of multiple send/receives at the same time, on the
 same filesystem?

 Look at 'zfs hold', 'zfs holds', and 'zfs release'. Sends and receives
 will place holds on snapshots to prevent them from being changed.


Yup, I know about holds. It wasn't those.
The reason for my question is that I recently ran into a situation where
there was a single orphaned zfs filesystem: no snapshots (therefore no
holds), no sub-filesystems, no clones... and, as far as I'm aware, no
send or receive active for it.
There had been a bunch before that time, but they had all completed, I believe.

So I'm trying to figure out whether there was any kind of left-over lock,
and how I might see that.
Is there some zdb magic?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zrep initial release (replication with failover)

2012-03-12 Thread Philip Brown
I'm happy to announce the first release of zrep (v0.1)

http://www.bolthole.com/solaris/zrep/

This is a self-contained, single-executable tool to implement
synchronization *and* failover of an active/passive zfs filesystem
pair.
No configuration files are needed: configuration is stored in zfs
filesystem properties.

Setting up replication is a simple 2-step process
(presuming you already have root ssh trust set up):


  1. zrep init pool/myfs remotehost remotepool/remotefs
  (This will create, and sync, the remote filesystem)

  2. zrep sync pool/myfs
  (or if you prefer, zrep sync all)
   Do this manually, crontab it, or whatever you prefer.

Failover is equally simple:

  zrep failover pool/myfs

This will automatically switch roles, making the source the destination, and
vice versa.

You can then in theory set up "zrep sync -q SOME_SEC all"
as a cronjob on both sides, and then forget about it.
(although you should note that it is currently only single-threaded)
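
(For example, the cron entry on each side might look something like this; the
path and the 900-second quiet window are placeholders:)

  * * * * * /usr/local/bin/zrep sync -q 900 all >/dev/null 2>&1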


zrep uses an internal locking mechanism to avoid problems with
overlapping operations on a filesystem.

zrep automatically handles serialization of snapshots. It uses a 6-digit
hex serial number, of the form
  @zrep_######
It can thus handle running once a minute, every minute, for 11,650 days,
i.e. over 30 years (16^6 = 16,777,216 minutes).

By default it only keeps the last 5 snapshots, but that's tunable via
a property.



Simple usage summary:
zrep (init|-i) ZFS/fs remotehost remoteZFSpool/fs
zrep (sync|-S) ZFS/fs
zrep (sync|-S) all
zrep (status|-s) [ZFS/fs]
zrep (list|-l) [-v] [ZFS/fs]
zrep (expire|-e) [-L] (ZFS/fs ...)|(all)|()
zrep (changeconfig|-C) ZFS/fs remotehost remoteZFSpool/fs
zrep failover [-L] ZFS/fs
zrep takeover [-L] ZFS/fs
zrep clear ZFS/fs  -- REMOVE ZREP CONFIG AND SNAPS FROM FILESYSTEM
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] RFC for new zfs replication tool

2012-02-22 Thread Philip Brown
Please note: this is a cross-posting of sorts, from a post I made at
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/a8bd4aab3918b7a0/528dacb05c970748
It was suggested that I mention it here, so I am doing so.

For convenience, here is mostly a duplicate of what I posted, with a
couple of minor updates.

Please note: I'm not asking for technical help on how to do it. I'm
soliciting feature requests now, so I can incorporate appropriate ones
into the initial user interface design.
This will be a single-executable program.

WIP: design doc for zrep, a zfs-based replication program.
This goes one step beyond other replication utils I've seen, in that
it explicitly targets the concept of production failover.
This is meant to be of enterprise product quality, rather than merely
a sysadmin's tool.

# Design goals:
# 1. Easy to configure
# 2. Easy to use
# 3. As robust as possible
#   3.1 will not be harmful to run every minute, even when WAN is down.
#   (Will need safety limits on # of snapshots and filesystem space free?)
# 4. Well documented

# Limitations (mostly for ease-of-use reasons):
#  Uses the short hostname, not the FQDN, in snapshot names; automatically truncates.
#  Only one copy destination per filesystem-remotehost combination allowed.
#  Stores configuration in filesystem properties of snapshots.

# Need to figure out some sort of locking for during sync and changes.
## Possibly via filesystem properties? Or other zfs commands?
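## A minimal sketch of the property-based idea (hypothetical property name;
## note that zfs set/get is not an atomic test-and-set, so this is
## illustrative only, not race-free):
##
##   if [ "$(zfs get -H -o value zrep:lock pool/myfs)" = "-" ]; then
##       zfs set zrep:lock=$$ pool/myfs
##       ... do the sync ...
##       zfs inherit zrep:lock pool/myfs    # clears the lock property
##   fi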

Usage:

zrep -i/init ZFSfs remotehost destfs   == create initial snapshot.
 Should do lots of sanity checks, both local and remote.
 SHOULD it actually do the first sync as well?
Should it allow a hand-created snapshot?
If so, specify the snap as the ZFSfs arg.
   Extra options:
SHOULD IT SET READ-ONLY ON THE REMOTE SIDE??!
Should it DEFAULT to read-only? (probably?)
(Should it CREATE the fs in the pool? Or just leave that to sync?)


zrep -S/sync ZFSfs remote destfs   # copy/sync after initial snapshot created
zrep -S/sync all   #special case, copies all zfs fs's that have been
   # initialized.

zrep -C/changedest ZFSfs remotehost destfs  # changes configs for the given ZFS
zrep -l/list (ZFSfs ...)  # list existing configured filesystems, and their config
  # Should it also somehow list INCOMING zrep-synced stuff?
  # or use a separate option for that? Possibly -L

zrep -s/status (ZFSfs) ?

zrep clear ZFSfs   #clear all configured replication for that fs.
zrep clear ZFSfs remotehost #clear configs for just that remotehost

zrep failover ZFSfs@snapname  # Changes sync direction to non-master
  # Can be run from EITHER side? Or should it be made context-sensitive?

  Initial concept of failover:
  Ensures, first of all, that the snapshot exists on both sides.
  (should it allow hand-created snapshots?)
  Then configures the snapshot on the non-master side, with proper naming/properties.
  Renames the snapshot pair to reflect the new direction.
  REMOVES other snapshots for the old outgoing direction.
  At completion of this operation, there will be only 1 zrep-recognized
  snapshot on either side, which will serve as the initial point of sync.

###
#  zrep fs properties
#
# zrep:dest-fs    where this gets zfs-sent to
#
# zrep:lock ? no, use zfs hold instead?
#
#
###
# snapshot format:
#
# fs@zrep_host1_host2_#seq#
# fs@zrep_host1_host2_#seq#_sent
#  a snapshot will be one or the other of the above.
# Once a snapshot has been successfully copied, it should be auto-renamed,
# so you can know without seeing the other side, whether something has been
# synced.
# After initialization, when normal operation has started, there should
#  always be at least TWO snapshots:
#  the latest full, and the most recently sent incremental.
#  There can also be some number of "just in case" incrementals.
#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] reliable, enterprise worthy JBODs?

2011-01-25 Thread Philip Brown
So, another hardware question :)

ZFS has been touted as taking maximal advantage of disk hardware, to the point 
where it can be used efficiently and cost-effectively on JBODs, rather than 
having to throw more expensive RAID arrays at it.

Only trouble is... JBODs seem to have disappeared :(
Sun/Oracle has discontinued its J4000 line, with no replacement that I can see.

IBM seems to have some nice-looking hardware in the form of its EXP3500
expansion trays... but they only support it connected to an IBM (SAS)
controller... which is only supported when plugged into IBM server hardware :(

Any other suggestions for (large-)enterprise-grade, supported JBOD hardware for 
ZFS these days?
Either fibre or SAS would be okay.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How well does zfs mirror handle temporary disk offlines?

2011-01-18 Thread Philip Brown
Sorry if this is well known... I tried a bunch of Google searches, but didn't get
anywhere useful. The closest I came was
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-April/028090.html, but
that doesn't answer my question below regarding zfs mirror recovery.
Details of our needs follow.


We normally are very into redundancy. Pretty much all our SAN storage is
dual-ported, along with all our production hosts. Two completely redundant
paths to storage. Two independent SANs.

However, now we are encountering a need for tier 3 storage, a.k.a. "not that
important, we're going to go cheap on it" ;-)
That being said, we'd still like to make it as reliable and robust as possible,
so I was wondering just how robust it would be to do ZFS mirroring across the
2 SANs.

My specific question is: how easily does ZFS handle *temporary* SAN
disconnects to one side of the mirror?
What if the outage is only 60 seconds?
3 minutes?
10 minutes?
An hour?

If we have 2x1TB drives in a simple zfs mirror and one side goes
temporarily offline, will zfs attempt to resync **1 TB** when it comes back?
Or does it have enough intelligence to say, "oh hey, I know this disk... and I
know [these bits] are still good, so I just need to resync [that bit]"?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?

2011-01-18 Thread Philip Brown
 On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon wrote:

 ZFS's ability to handle short-term interruptions depend heavily on the
 underlying device driver.

 If the device driver reports the device as dead/missing/etc at any
 point, then ZFS is going to require a zpool replace action before it
 re-accepts the device.  If the underlying driver simply stalls, then
 it's more graceful (and no user interaction is required).

 As far as what the resync does:  ZFS does smart resilvering, in that
 it compares what the good side of the mirror has against what the
 bad side has, and only copies the differences over to sync them up.


Hmm. Well, we're talking fibre, so we're very concerned with the recovery mode
when the fibre drivers have marked it as failed (except it hasn't really
failed; we've just had a switch drop out).

I THINK what you are saying is that we could, in this situation, do:

zpool replace (old drive) (new drive)

and then your smart recovery should do the limited resilvering only, even
for potentially long outages.

Is that what you are saying?
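
(For concreteness, the sequence being discussed would look roughly like the
following; the pool and device names are placeholders, and which command
applies depends on how the driver reported the outage:)

  zpool status tier3pool             # does the device show UNAVAIL/FAULTED?
  zpool clear tier3pool c6t0d0       # if it was only marked faulted, clear the errors
  zpool replace tier3pool c6t0d0     # if the driver declared it dead, re-add it in place
  zpool status -v tier3pool          # watch the resilver progress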
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Liveupgrade'd to U8 and now can't boot previous U6 BE :(

2009-10-20 Thread Philip Brown
Quote: cindys 
3. Boot failure from a previous BE if either #1 or #2 failure occurs.

#1 or #2 were not relevant in my case. I just found that I could not boot into
the old U7 BE. I am happy with the workaround shinsui points out, so this is
purely for your information.

Quote: renil82
U7 did not encounter such problems.

My problem occurred going from U7 to U8 via Live Upgrade.
Again, this is only for information purposes, as the workaround is sufficient.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Liveupgrade'd to U8 and now can't boot previous U6 BE :(

2009-10-17 Thread Philip Brown
Same problem here on a Sun X2100, amd64.

I started with a core installation of U7, with the only patches applied being
those outlined in the Live Upgrade doc 206844 (
http://sunsolve.sun.com/search/document.do?assetkey=1-61-206844-1 ).

Also, as stated in the doc:
pkgrm SUNWlucfg SUNWluu SUNWlur
and then, from the 10/09 DVD:
pkgadd -d  SUNWlucfg SUNWlur SUNWluu

More info in the attached zfsinfo.txt.
-- 
This message posted from opensolaris.org

Last login: Fri Oct 16 14:47:14 2009 from 192.168.1.64
Sun Microsystems Inc.   SunOS 5.10  Generic January 2005
[phi...@unknown] [3:16pm] [~]  zpool status
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.

 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c0t0d0s0  ONLINE   0 0 0
c0t1d0s0  ONLINE   0 0 0

errors: No known data errors
[phi...@unknown] [3:17pm] [~] # lufslist -n s10x_u7wos_08
   boot environment name: s10x_u7wos_08

Filesystem                fstype  device size   Mounted on     Mount Options
------------------------  ------  ------------  -------------  -------------
/dev/zvol/dsk/rpool/swap  swap    536870912     -              -
rpool/ROOT/s10x_u7wos_08  zfs     522009600     /              -
rpool                     zfs     155414159360  /rpool         -
rpool/export              zfs     152577344512  /export        -
rpool/export/home         zfs     152577325056  /export/home   -
[phi...@unknown] [3:17pm] [~] # luactivate s10x_u7wos_08
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE sol-10-u8-x86

Setting failsafe console to ttya.
Generating boot-sign for ABE s10x_u7wos_08
Generating partition and slice information for ABE s10x_u7wos_08
Copied boot menu from top level dataset.
Generating multiboot menu entries for PBE.
Generating multiboot menu entries for ABE.
Disabling splashimage
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**

The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.

**

In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

 mount -Fzfs /dev/dsk/c0t0d0s0 /mnt

3. Run luactivate utility with out any arguments from the Parent boot 
environment root slice, as shown below:

 /mnt/sbin/luactivate

4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.

**

Modifying boot archive service
Propagating findroot GRUB for menu conversion.
File /etc/lu/installgrub.findroot propagation successful
File /etc/lu/stage1.findroot propagation successful
File /etc/lu/stage2.findroot propagation successful
File /etc/lu/GRUB_capability propagation successful
Deleting stale GRUB loader from all BEs.
File /etc/lu/installgrub.latest deletion successful
File /etc/lu/stage1.latest deletion successful
File /etc/lu/stage2.latest deletion successful
Activation of boot environment s10x_u7wos_08 successful.
[phi...@unknown] [3:17pm] [~] # lufslist -n s10x_u7wos_08
   boot environment name: s10x_u7wos_08
   This boot environment will be active on next system boot.

Filesystem                fstype  device size   Mounted on     Mount Options
------------------------  ------  ------------  -------------  -------------
/dev/zvol/dsk/rpool/swap  swap    536870912     -              -
rpool/ROOT/s10x_u7wos_08  zfs     522009600     /              -
rpool                     zfs     155414215168  /rpool         -
rpool/export              zfs     152577344512  /export        -
rpool/export/home         zfs     152577325056  /export/home   -

[phi...@unknown] [3:18pm] [~] # lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status

Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Philip Brown
   If
 I'm interpreting correctly, you're talking about a
 couple of features, neither of which is in ZFS yet,
...
 1.  The ability to restore individual files from a
 snapshot, in the same way an entire snapshot is
 restored - simply using the blocks that are already
 stored.
 
 2.  The ability to store (and restore from) snapshots
 on external media.

Those sound useful, particularly the ability to restore a single file, even if
it was only from a full send instead of a snapshot. But I don't think that's
what I'm asking for :-)



Lemme try again.

Let's say you have a mega source tree in one huge zfs filesystem
(let's say the entire ON distribution, or something :-)
Let's say you had a full zfs send done on Nov 1st.
Then, between then and today, assorted things were done to the source
tree. Major things.
Things that people suddenly realized were bad, but they weren't sure exactly
how or why. They just knew things worked Nov 1st but are broken now. Pretend
there's no such thing as tags, etc.
So: they want to get things up and running, maybe even only in read-only mode,
from the Nov 1st full send.
But they also want to take a look at the changes. And they want to do it in a
very space-efficient manner.

It would be REALLY REALLY NICE to be able to take a full send of /zfs/srctree
and restore it to /zfs/[EMAIL PROTECTED], or something like that.
Given that [making up numbers] out of 1 million src files, only 1000 have
changed, it would be really nice to have the 999,000 files that have NOT
changed not be doubly allocated in both /zfs/srctree and
/zfs/[EMAIL PROTECTED]; they would actually be hardlinked/snapshot-duped/whatever
the terminology is.
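
(For comparison, the closest thing that works today is receiving a saved full
send stream into a brand-new dataset, which is exactly the double allocation
being complained about here; the dataset name and stream-file path are
hypothetical:)

  zfs receive zfs/srctree_nov1 < /backups/srctree_full_nov1.zsend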

I guess you might refer to what I'm talking about as taking a "synthetic
snapshot". Kinda like how Veritas backup, etc., can synthesize full dumps from
a sequence of full + incrementals, and then write out a "real" full dump onto
a single tape, as if a full dump had happened on the date of a particular
incremental.
Except that in what I'm talking about for zfs, it would be synthesizing the
zfs snapshot of a filesystem that was made for the full zsend (even though the
original snapshot has since been deleted).
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Philip Brown
 Ok, I think I understand.  You're going to be told
 that ZFS send isn't a backup (and for these purposes
 I definately agree),  ...

Hmph. Well, even for 'replication'-type purposes, what I'm talking about is
quite useful.
Picture two remote systems which happen to have mostly identical data.
Perhaps they were manually synced at one time with tar, or something.
Now the company wants to bring them both into full sync... but first analyze
the small differences that may be present.

In that scenario, it would then be very useful to be able to do the following:

hostA# zfs snapshot /zfs/prod@A
hostA# zfs send /zfs/prod@A | ssh hostB zfs receive /zfs/prod@A

hostB# diff -r /zfs/prod /zfs/prod/.zfs/snapshot/A > /tmp/prod.diffs


One could otherwise find the files that are different with rsync -avn. But
doing it with zfs in this way adds value by allowing you to locally compare
the old and new files on the same machine, without having to do some ghastly
manual copy of each different file to a new place and doing the compare there.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-10-31 Thread Philip Brown
 relling wrote:
 This question makes no sense to me.  Perhaps you can
 rephrase?
 

To take a really obnoxious case:
Let's say I have a filesystem with 1 gigabyte of data and 1.5 gigabytes of
physical disk allocated to it (so it is 66% full).
It has 10x100MB files in it.

Something bad happens, and I need to do a restore.
The most recent zsend data has all 10 files in it; 9 of them have not been
touched since the zsend was done.

Now, since zfs has data integrity checks, yadda yadda yadda, it should be able
to determine relatively easily that the file in the zfs send is the exact same
file on disk.
So, when I do a zfs receive, it would be really nice if there were some way
for zfs to, let's say, receive to a snapshot of the filesystem, and then take
advantage of the fact that it is a snapshot to NOT write to disk the 9
unaltered files that are in the snapshot; just allocate for the altered one.

It would be really nice for zfs to have the smarts to do this WITHOUT having
to potentially throw a laaarge amount of extra hard disk space at snapshots. I
want the snapshot space to be allocated on TAPE, not hard disk, if you see
what I mean.
If one 100MB file gets replaced every 2 days, I wouldn't want to use snapshots
on the filesystem if there were a disk space limitation.

(I know there are solutions such as SAM-FS for this, but I'm looking for a zfs
solution, if possible, please?)

And help with the other parts of my original email would still be appreciated
:)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-10-31 Thread Philip Brown
 So, when I do a zfs receive, it would be really
 nice, if there were some way for zfs to figure out,
 lets say, recieve to a snapshot of the filesystem;
 then take advantage of the fact that it is a
 snapshot, to NOT write on disk, the 9 unaltered files
 that are in the snapshot; just allocate for the
 altered one.


To follow up on my own question a bit :-)
I would presume that the mandate that incrementals MUST have a common snapshot
with the target zfs filesystem being restored to is basically just a shortcut
that guarantees the files are identical without having to do any actual
calculation.

What about some kind of rsync-like capability, though? Have zfs receive be
able to judge sameness by "well, the timestamp and file sizes are identical:
treat them as identical!", without a common snapshot. And, for the truly
paranoid, have a binary-compare option, where it says "hmm... the timestamp
and file sizes are the same... they MIGHT be identical... lemme read from disk
and compare against what I'm reading from the zfs send stream. If I find a
difference, then write it as a new file. Otherwise, just create
[a hardlink/whatever] in the destination receive snapshot, since they really
are the same!"
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-10-31 Thread Philip Brown
 Ah, there is a cognitive disconnect... more below.
 
 The cognitive disconnect is that snapshots are
 blocks, not files.
 Therefore, the snapshot may contain only changed
 portions of
 files and blocks from a single file may be spread
 across many
 different snapshots. 


I was referring to restoring TO a snapshot. However, I didn't mandate that the
incoming stream WAS a snapshot :-}

Your point about snapshots being blocks, not files, is well taken. However,
the limitation that a receive of a full send can only be done to an
automatically created new filesystem is overly burdensome.
Wouldn't it be more useful if it had the capability to restore to a newly
created snapshot of an existing zfs filesystem, rsync style?

Thanks for the ADM reference. I'll check that out.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Storage

2006-06-26 Thread Philip Brown

Roch wrote:

And, if the load can accommodate a reorder, to get top per-spindle
read-streaming performance, a cp(1) of the file should do wonders on the
layout.



But there may not be filesystem space for double the data.
Sounds like there is a need for a zfs-defragment-file utility, perhaps?

Or, if you want to be politically cagey about the naming choice, perhaps

zfs-seq-read-optimize-file ?  :-)
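
(For reference, the cp(1) trick being discussed is just a rewrite of the file
in place, which is why it needs room for a second copy; the file name here is
hypothetical:)

  cp /tank/video/bigfile /tank/video/bigfile.new && \
      mv /tank/video/bigfile.new /tank/video/bigfile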

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [Fwd: Re: [zfs-discuss] Re: disk write cache, redux]

2006-06-16 Thread Philip Brown

Dana H. Myers wrote:

Phil Brown wrote:


hmm. well I hope sun will fix this bug, and add in the long-missing
write_cache control for regular ata drives too.



Actually, I believe such ata drives by default enable the write cache.



Some do, some don't. Regardless, the toggle functionality belongs in the ata
driver as well as the scsi driver.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: disk write cache, redux

2006-06-15 Thread Philip Brown

Roch wrote:

Check here:

 
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/vdev_disk.c#157



distilled version:
  vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)
  /*...*/
  /*
   * If we own the whole disk, try to enable disk write caching.
   * We ignore errors because it's OK if we can't do it.
   */


Which to me implies: when a disk pool is mounted/created, enable the write cache

(and presumably leave it on indefinitely).

The interesting thing is, dtrace with

fbt::ldi_ioctl:entry { printf("ldi_ioctl called with %x\n", args[1]); }


says that some kind of ldi_ioctl IS called when I create a test zpool with
these sata disks.

Specific ioctls called would seem to be:
0x422
0x425
0x42a

and I believe DKIOCSETWCE is 0x425.


HOWEVER... checking with format -e on those disks says that the write cache is
NOT ENABLED after this happens.


And interestingly, if I augment the dtrace with
fbt::sata_set_cache_mode:entry,
fbt::sata_init_write_cache_mode:entry
{
printf("%s called\n", probefunc);
}


the sata-specific set-cache routines are NOT getting called... according to
dtrace, anyway.
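
(A follow-up probe that might help confirm whether anything at all issues the
write-cache ioctl, assuming DKIOCSETWCE really is 0x425 here:)

  dtrace -n 'fbt::ldi_ioctl:entry /args[1] == 0x425/ { printf("DKIOCSETWCE"); stack(); }'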

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: disk write cache, redux

2006-06-14 Thread Philip Brown
I previously wrote about my scepticism of the claims that zfs selectively
enables and disables the write cache, to improve throughput over the usual
Solaris defaults up to this point.


I posted my observations that this did not seem to be happening in any
meaningful way for my zfs, on build nv33.


I was told, "oh, you just need the more modern drivers."

Well, I'm now running S10u2, with
SUNWzfsr  11.10.0,REV=2006.05.18.01.46

I don't see much of a difference.
By default, iostat shows the disks grinding along at 10MB/sec during the
transfer.
However, if I manually enable write_cache on the drives (SATA drives, FWIW),
the drive throughput zips up to 30MB/sec during the transfer.



Test case:

# zpool status philpool
  pool: philpool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
philpoolONLINE   0 0 0
  c5t1d0ONLINE   0 0 0
  c5t4d0ONLINE   0 0 0
  c5t5d0ONLINE   0 0 0

# dd if=/dev/zero of=/philpool/testfile bs=256k count=1

# [run iostat]
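
(The "[run iostat]" step would typically be something like the following,
watching per-device throughput at 5-second intervals:)

  iostat -xnz 5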

The wall-clock time for the I/O to quiesce is as expected: without the write
cache manually enabled, it takes 3 times as long to finish as with it
enabled (1:30 vs. 30 sec).


[Approximately a 2 GB file is generated. A side note of interest to me is
that in both cases the dd returns to the user relatively quickly, but the
write goes on for quite a long time in the background... without apparently
reserving 2 gigabytes of extra kernel memory, according to swap -s.]





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New Feature Idea: ZFS Views ?

2006-06-07 Thread Philip Brown

Nicolas Williams wrote:

...
Also, why shouldn't lofs grow similar support?



Aha!
This to me sounds much, much better. Put all the funky, potentially
disastrous code in lofs, not in zfs, please :-) Plus, that way any
filesystem will potentially get the benefit of views.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] disk write cache, redux

2006-06-02 Thread Philip Brown

Hi folks...
I've just been exposed to zfs directly, since I'm trying it out on
a certain 48-drive box with 4 CPUs :-)

I read in the archives the recent "hard drive write cache" thread, in which
someone at Sun made the claim that zfs takes advantage of the disk write
cache, selectively enabling and disabling it.


However, that does not seem to be true at all on the system I am testing
on (or if it does, it isn't doing it in any kind of effective way).



SunOS test-t[xx](ahem) 5.11 snv_33 i86pc i386 i86pc


On the following RAIDZ pool:

# zpool status rzpool
  pool: rzpool
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
rzpool   ONLINE   0 0 0
  raidz  ONLINE   0 0 0
c0t4d0   ONLINE   0 0 0
c0t5d0   ONLINE   0 0 0
c1t4d0   ONLINE   0 0 0
c1t5d0   ONLINE   0 0 0
c5t4d0   ONLINE   0 0 0
c5t5d0   ONLINE   0 0 0
c9t4d0   ONLINE   0 0 0
c9t5d0   ONLINE   0 0 0
c10t4d0  ONLINE   0 0 0
c10t5d0  ONLINE   0 0 0


Write performance for large files appears to top out at around 15-20MB/sec,
according to zpool iostat.



However, when I manually enable the write cache on all the drives involved...
performance for the pathological case of


dd if=/dev/zero of=/rzpool/testfile bs=128k


jumps to 40-60MB/sec (with an initial spike to 80MB/sec; I was very
disappointed to see that was not sustained ;-) )
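
(For reference, the manual enabling here was done per disk via the interactive
format -e cache menu, roughly:)

  # format -e
    (select the disk)
    format> cache
    cache> write_cache
    write_cache> enable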


This kind of performance differential also shows up with real load:
doing a tar | tar copy of large video files over NFS to the filesystem.


As a comparison, a single disk's dd write performance is around 6MB/sec with
no cache, and 30MB/sec with the write cache enabled.


So the 40-50MB/sec result is kind of disappointing for a **10**-disk pool.



Comments?



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss