Re: [zfs-discuss] This is the scrub that never ends...

2009-09-10 Thread Jonathan Edwards


On Sep 9, 2009, at 9:29 PM, Bill Sommerfeld wrote:



On Wed, 2009-09-09 at 21:30 +, Will Murnane wrote:

Some hours later, here I am again:
scrub: scrub in progress for 18h24m, 100.00% done, 0h0m to go
Any suggestions?


Let it run for another day.

A pool on a build server I manage takes about 75-100 hours to scrub,
but typically starts reporting 100.00% done, 0h0m to go at about the
50-60 hour point.

I suspect the combination of frequent time-based snapshots and a
pretty active set of users causes the progress estimate to be off..



out of curiosity - do you have a lot of small files in the filesystem?

zdb -s pool might be interesting to observe too

---
.je

(oh, and thanks for the subject line .. now i've had this song stuck  
in my head for a couple days :P)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Books on File Systems and File System Programming

2009-08-15 Thread Jonathan Edwards


On Aug 14, 2009, at 11:14 AM, Peter Schow wrote:

On Thu, Aug 13, 2009 at 05:02:46PM -0600, Louis-Frédéric Feuillette  
wrote:

I saw this question on another mailing list, and I too would like to
know. And I have a couple questions of my own.

== Paraphrased from other list ==
Does anyone have any recommendations for books on File Systems and/or
File Systems Programming?
== end ==


Going back ten years, but still a good tutorial:

  Practical File System Design with the Be File System
  by Dominic Giampaolo

  http://www.nobius.org/~dbg/practical-file-system-design.pdf


I think he's still at Apple now working on Spotlight .. his fs-kit is a
good study too:

http://www.nobius.org/~dbg/fs-kit-0.4.tgz

for understanding the vnode/vfs interface - you might want to take a  
look at:

- Solaris Internals (2nd edition) - chapter 14
- Zadok's FiST paper:
http://www.fsl.cs.sunysb.edu/docs/zadok-thesis-proposal/

UFS:
- Solaris Internals (2nd edition) - chapter 15
HFS+:
- Amit Singh's Mac OS X Internals chapter 11 (see http://osxbook.com/)

then opensolaris src of course for:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/
http://opensolaris.org/os/community/zfs/source/
http://opensolaris.org/os/project/samqfs/sourcecode/
http://opensolaris.org/os/project/ext3/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Jonathan Edwards


On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:

This brings me to the absurd conclusion that the system must be  
rebooted immediately prior to each use.


see Phil's later email .. an export/import of the pool or a remount of
the filesystem should clear the page cache - with mmap'd files you're
essentially keeping the data both in the page cache and in the ARC ..
so invalidations in the page cache are going to have effects on
dirty data in the cache
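
a minimal sketch of the export/import route - tank is just a placeholder
for whatever the pool is really called:

# zpool export tank
# zpool import tank

(a remount of a single filesystem would be zfs umount / zfs mount on
that dataset instead)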



/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x28000
set zfs:zfs_write_limit_override = 0xea60
set zfs:zfs_vdev_max_pending = 5



if you're on x86 - i'd also increase maxphys to 128K .. we still have
a 56KB default value in there, which is a bad thing (IMO)
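
for reference, that's a one-liner in /etc/system (0x20000 = 128KB;
takes a reboot to pick up):

set maxphys=0x20000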


---
.je

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot mount '/tank/home': directory is not empty

2009-06-10 Thread Jonathan Edwards
i've seen a problem where periodically a 'zfs mount -a' and sometimes  
a 'zpool import pool' can create what appears to be a race condition  
on nested mounts .. that is .. let's say that i have:


FS          mountpoint
pool        /export
pool/fs1    /export/home
pool/fs2    /export/home/bob
pool/fs3    /export/home/bob/stuff

if pool is imported (or a mount -a is done) and somehow pool/fs3  
mounts first - then it will create /export/home and /export/home/bob  
and pool/fs1 and pool/fs2 will fail to mount .. this seems to be  
happening on more recent builds, but not predictably - so i'm still  
trying to track down what's going on
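
a rough recovery sketch for the example layout above (only if the stray
directories really are just empty leftovers - look first):

# zfs unmount pool/fs3
# ls -lA /export/home /export/home/bob
# rmdir /export/home/bob/stuff /export/home/bob
# zfs mount -a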


On Jun 10, 2009, at 1:01 PM, Richard Elling wrote:


Something is bothering me about this thread.  It seems to me that
if the system provides an error message such as cannot mount
'/tank/home': directory is not empty then the first plan of action
should be to look and see what is there, no?
The issue of overlaying mounts has existed for about 30 years and
invariably one discovers that events which lead to different data in
overlapping directories is the result of some sort of procedural  
issue.


Perhaps once again, ZFS is a singing canary?
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Jonathan Edwards


On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote:


Jim Dunham wrote:
ZFS the filesystem is always on disk consistent, and ZFS does  
maintain filesystem consistency through coordination between the  
ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately  
for SNDR, ZFS caches a lot of an applications filesystem data in  
the ZIL, therefore the data is in memory, not written to disk, so  
SNDR does not know this data exists. ZIL flushes to disk can be  
seconds behind the actual application writes completing, and if  
SNDR is running asynchronously, these replicated writes to the SNDR  
secondary can be additional seconds behind the actual application  
writes.


Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no  
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?

In either case, creating a snapshot should get both flushed to disk,  
I think?
(If you don't actually need a snapshot, simply destroy it  
immediately afterwards.)


not sure if there's another way to trigger a full flush or lockfs, but
to make sure you do have all transactions that may not have been
flushed from the ARC you could just unmount the filesystem or export
the zpool .. with the latter, you wouldn't have to worry about
the -f on the import
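
something like the snapshot trick Andrew mentions above might look like
this (a sketch - pool/fs is a placeholder):

# zfs snapshot pool/fs@flush
# zfs destroy pool/fs@flush

(or the heavier hammer: zpool export pool followed by zpool import pool)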


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-22 Thread Jonathan Edwards
not quite .. it's 16KB at the front and 8MB at the back of the disk
(16384 sectors) for the Solaris EFI label - so you need to zero out both
of these
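
a rough sketch with dd (destructive, so triple-check the device first;
c1d1p0 and DISK_SECTORS are placeholders - get the real sector count
from format or prtvtoc):

# dd if=/dev/zero of=/dev/rdsk/c1d1p0 bs=1024 count=16
# dd if=/dev/zero of=/dev/rdsk/c1d1p0 bs=512 seek=`expr DISK_SECTORS - 16384` count=16384

the first line clears the 16KB at the front, the second clears the 8MB
(16384 sector) reserved area at the end of the disk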

of course since these drives are 1TB i find it's easier to format
to SMI (vtoc) .. with format -e (choose SMI, label, save, validate -
then choose EFI)

but to Casper's point - you might want to make sure that fdisk is  
using the whole disk .. you should probably reinitialize the fdisk  
sectors either with the fdisk command or run fdisk from format (delete  
the partition, create a new partition using 100% of the disk, blah,  
blah) ..

finally - glancing at the format output - there appears to be a mix of
labels on these disks as you've got a mix of c#d# entries and c#t#d#
entries, so i suspect fdisk might not be consistent across the
various disks here .. also noticed that you dumped the vtoc for c3d0
and c4d0, but you're replacing c2d1 (of unknown size/layout) with c1d1
(never dumped in your emails) .. so while this has been an animated
(slightly trollish) discussion on right-sizing (odd - I've typically
only seen that term as an ONTAPism) with some short-stroking digs ..
it's a little unclear what the c1d1s0 slice looks like here or what
the cylinder count is - i agree it should be the same - but it would
be nice to see from my armchair here

On Jan 22, 2009, at 3:32 AM, Dale Sears wrote:

 Would this work?  (to get rid of an EFI label).

   dd if=/dev/zero of=/dev/dsk/thedisk bs=1024k count=1

 Then use

   format

 format might complain that the disk is not labeled.  You
 can then label the disk.

 Dale



 Antonius wrote:
 can you recommend a walk-through for this process, or a bit more of  
 a description? I'm not quite sure how I'd use that utility to  
 repair the EFI label
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Largest (in number of files) ZFS instance tested

2008-07-11 Thread Jonathan Edwards

On Jul 11, 2008, at 4:59 PM, Bob Friesenhahn wrote:


 Has anyone tested a ZFS file system with at least 100 million +  
 files?
 What were the performance characteristics?

 I think that there are more issues with file fragmentation over a long
 period of time than the sheer number of files.

actually it's a similar problem .. with a maximum blocksize of 128KB
and the COW nature of the filesystem you get indirect block pointers
pretty quickly on a large ZFS filesystem as the size of your tree
grows .. in this case a large constantly modified file (eg:
/u01/data/*.dbf) is going to behave over time like a lot of random
access to files spread across the filesystem .. the only real
difference is that you won't walk it every time someone does a
getdirent() or an lstat64()

so ultimately the question could be framed as what's the maximum  
manageable tree size you can get to with ZFS while keeping in mind  
that there's no real re-layout tool (by design) .. the number i'm  
working with until i hear otherwise is probably about 20M, but in the  
relativistic sense - it *really* does depend on how balanced your tree  
is and what your churn rate is .. we know on QFS we can go up to 100M,  
but i trust the tree layout a little better there, can separate the  
metadata out if i need to and have planned on it, and know that we've  
got some tools to relayout the metadata or dump/restore for a tape  
backed archive

jonathan

(oh and btw - i believe this question is a query for field data ..  
architect != crash test dummy .. but some days it does feel like it)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?

2008-04-09 Thread Jonathan Edwards

On Apr 9, 2008, at 11:46 AM, Bob Friesenhahn wrote:
 On Wed, 9 Apr 2008, Ross wrote:

 Well the first problem is that USB cables are directional, and you
 don't have the port you need on any standard motherboard.  That

 Thanks for that info.  I did not know that.

 Adding iSCSI support to ZFS is relatively easy since Solaris already
 supported TCP/IP and iSCSI.  Adding USB support is much more
 difficult and isn't likely to happen since afaik the hardware to do
 it just doesn't exist.

 I don't believe that Firewire is directional but presumably the
 Firewire support in Solaris only expects to support certain types of
 devices.  My workstation has Firewire but most systems won't have it.

 It seemed really cool to be able to put your laptop next to your
 Solaris workstation and just plug it in via USB or Firewire so it can
 be used as a removable storage device.  Or Solaris could be used on
 appropriate hardware to create a more reliable portable storage
 device.  Apparently this is not to be and it will be necessary to deal
 with iSCSI instead.

 I have never used iSCSI so I don't know how difficult it is to use as
 temporary removable storage under Windows or OS-X.

i'm not so sure what you're really after, but i'm guessing one of two  
things:

1) a global filesystem?  if so - ZFS will never be globally accessible
from 2 hosts at the same time without an interposer layer such as NFS
or Lustre .. zvols could be exported to multiple hosts via iSCSI or
FC-target but that's only 1/2 the story ..
2) an easy way to export volumes?  agree - there should be some sort
of semantics that would signal a filesystem is removable and trap on
USB events when the media is unplugged .. of course you'll have
problems with uncommitted transactions that would have to roll back on
the next plug, or somehow be query-able

iSCSI will get you block/character device level sharing from a zvol
(pseudo device) or the equivalent of a blob filestore .. you'd have to
format it with a filesystem, but that filesystem could be a global one
(eg: QFS) and you could multi-host natively that way.
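
fwiw - a minimal sketch of the zvol-over-iSCSI route on a recent nevada
build (the size and names are placeholders):

# zfs create -V 32g pool/lapvol
# zfs set shareiscsi=on pool/lapvol
# iscsitadm list target

the laptop side then just logs in to that target with its initiator and
formats it with whatever it likes (a shared filesystem like QFS for the
multi-host case)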

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Jonathan Edwards

On Mar 20, 2008, at 11:07 AM, Bob Friesenhahn wrote:
 On Thu, 20 Mar 2008, Mario Goebbels wrote:

 Similarly, read block size does not make a
 significant difference to the sequential read speed.

 Last time I did a simple bench using dd, supplying the record size as
 blocksize to it instead of no blocksize parameter bumped the mirror  
 pool
 speed from 90MB/s to 130MB/s.

 Indeed.  However, as an interesting twist to things, in my own
 benchmark runs I see two behaviors.  When the file size is smaller
 than the amount of RAM the ARC can reasonably grow to, the write block
 size does make a clear difference.  When the file size is larger than
 RAM, the write block size no longer makes much difference and
 sometimes larger block sizes actually go slower.

in that case .. try fixing the ARC size .. the dynamic resizing on the  
ARC can be less than optimal IMHO
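
something like this in /etc/system pins it (a sketch - 0x200000000 is
8GB, pick whatever fits the box; needs a reboot):

set zfs:zfs_arc_max = 0x200000000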

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Jonathan Edwards

On Mar 20, 2008, at 2:00 PM, Bob Friesenhahn wrote:
 On Thu, 20 Mar 2008, Jonathan Edwards wrote:

 in that case .. try fixing the ARC size .. the dynamic resizing on  
 the ARC
 can be less than optimal IMHO

 Is a 16GB ARC size not considered to be enough? ;-)

 I was only describing the behavior that I observed.  It seems to me
 that when large files are written very quickly, that when the file
 becomes bigger than the ARC, that what is contained in the ARC is
 mostly stale and does not help much any more.  If the file is smaller
 than the ARC, then there is likely to be more useful caching.

sure i got that - it's not the size of the arc in this case since
caching is going to be a lost cause .. but explicitly setting a
zfs_arc_max should result in fewer calls to arc_shrink() when you hit
memory pressure from the application's page buffer competing with
the arc

in other words, as soon as the arc is 50% full of dirty pages (8GB)  
it'll start evicting pages .. you can't avoid that .. but what you can  
avoid is the additional weight of constantly growing and shrinking the  
cache as it tries to keep up with your constantly changing blocks in a  
large file

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs backups to tape

2008-03-16 Thread Jonathan Edwards

On Mar 14, 2008, at 3:28 PM, Bill Shannon wrote:
 What's the best way to backup a zfs filesystem to tape, where the size
 of the filesystem is larger than what can fit on a single tape?
 ufsdump handles this quite nicely.  Is there a similar backup program
 for zfs?  Or a general tape management program that can take data from
 a stream and split it across tapes reliably with appropriate headers
 to ease tape management and restore?

for now you could send snapshots to files in a file hierarchy on a
SAM-QFS archive .. then you've got all the feature functionality there
to be able to proactively back up the snapshots and possibly segment  
them if they're big enough (non-shared-qfs - might make sense if  
you've got multiple drives you want to take advantage of) .. I believe  
the goal is to provide this sort of functionality through a DMAPI HSM  
with ADM at some point in the near future:
http://opensolaris.org/os/project/adm/
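
a bare-bones sketch of the send-to-file idea (pool/fs and the archive
path are placeholders):

# zfs snapshot pool/fs@20080316
# zfs send pool/fs@20080316 > /sam1/zfsdumps/pool_fs_20080316.zfs

incrementals would be zfs send -i against the previous snapshot into a
second file, and restore is just zfs receive fed from the file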

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic ZFS disk accesses

2008-03-01 Thread Jonathan Edwards

On Mar 1, 2008, at 3:41 AM, Bill Shannon wrote:
 Running just plain iosnoop shows accesses to lots of files, but none
 on my zfs disk.  Using iosnoop -d c1t1d0 or iosnoop -m /export/ 
 home/shannon
 shows nothing at all.  I tried /usr/demo/dtrace/iosnoop.d too, still  
 nothing.

hi Bill

this came up sometime last year .. io:::start won't work since ZFS  
doesn't call bdev_strategy() directly .. you'll want to use something  
more like zfs_read:entry, zfs_write:entry and zfs_putpage or  
zfs_getpage for mmap'd ZFS files

here's one i hacked from our discussion back then to track some  
timings on files:

  # cat zfs_iotime.d

#!/usr/sbin/dtrace -s

#pragma D option quiet

zfs_write:entry,
zfs_read:entry,
zfs_putpage:entry,
zfs_getpage:entry
{
        self->ts = timestamp;
        self->filepath = args[0]->v_path;
}

zfs_write:return,
zfs_read:return,
zfs_putpage:return,
zfs_getpage:return
/self->ts && self->filepath/
{
        printf("%s on %s took %d nsecs\n", probefunc,
            stringof(self->filepath), timestamp - self->ts);
        self->ts = 0;
        self->filepath = 0;
}

---
.je

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can ZFS be event-driven or not?

2008-02-27 Thread Jonathan Edwards

On Feb 27, 2008, at 8:36 AM, Uwe Dippel wrote:
 As much as ZFS is revolutionary, it is far away from being the  
 'ultimate file system', if it doesn't know how to handle event- 
 driven snapshots (I don't like the word), backups, versioning. As  
 long as a high-level system utility needs to be invoked by a  
 scheduler for these features (CDP), and - this is relevant - *ZFS  
 does not support these functionalities essentially different from  
 FAT or UFS*, the days of ZFS are numbered. Sooner or later, and I bet  
 it is sooner, someone will design a file system (hardware, software,  
 Cairo) to which the tasks of retiring files, as well as creating  
 versions of modified files, can be passed down, together with the  
 file handles.

meh .. don't believe all the marketing hype you hear - it's good at
what it's good at, and is a constant WIP for many of the other
features that people would like to see .. but the one ring to rule
them all - not quite yet ..

as for the CDP issue - i believe the event driving would really have  
to happen below ZFS at the vnode or znode layer .. keep in mind that  
with the ZPL we're still dealing with 30+ year old structures and  
methods (which is fine btw) in the VFS/Vnode layers .. a couple of  
areas i would look at (that i haven't seen mentioned in this  
discussion) might be:

- fop_vnevent .. or the equivalent (if we have one yet) for a znode
- filesystem - door interface for event handling
- auditing

if you look at what some of the other vendors (eg: apple/timemachine)  
are doing - it's essentially a tally of file change events that get  
dumped into a database and rolled up at some point .. if you plan on  
taking more immediate action on the file changes then i believe that  
you'll run into latency (race) issues for synchronous semantics

anyhow - just a thought from another who is constantly learning (being  
corrected, learning some more, more correction, etc ..)

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-29 Thread Jonathan Edwards

On Dec 29, 2007, at 2:33 AM, Jonathan Loran wrote:

 Hey, here's an idea:  We snapshot the file as it exists at the time of
 the mv in the old file system until all referring file handles are
 closed, then destroy the single file snap.  I know, not easy to
 implement, but that is the correct behavior, I believe.

 All this said, I would love to have this feature introduced.  Moving
 large file stores between zfs file systems would be so handy!  From my
 own sloppiness, I've suffered dearly from the the lack of it.

since in the current implementation a mv between filesystems would
have to assign new st_ino values (fsids in NFS should also be
different), all you should need to do is assign new block pointers in
the new side of the filesystem .. that would be handy for cp as well

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread Jonathan Edwards

On Dec 5, 2007, at 17:50, can you guess? wrote:

 my personal-professional data are important (this is
 my valuation, and it's an assumption you can't
 dispute).

 Nor was I attempting to:  I was trying to get you to evaluate ZFS's  
 incremental risk reduction *quantitatively* (and if you actually  
 did so you'd likely be surprised at how little difference it makes  
 - at least if you're at all rational about assessing it).

ok .. i'll bite since there's no ignore feature on the list yet:

what are you terming as ZFS' incremental risk reduction? .. (seems
like a leading statement toward a particular assumption) .. are you
just trying to say that without multiple copies of data in multiple
physical locations you're not really accomplishing a more complete
risk reduction?

yes i have read this thread, as well as many of your other posts  
around usenet and such .. in general i find your tone to be somewhat  
demeaning (slightly rude too - but - eh, who's counting?  i'm none to  
judge) - now, you do know that we are currently in an era of  
collaboration instead of deconstruction right? .. so i'd love to see  
the improvements on the many shortcomings you're pointing to and  
passionate about written up, proposed, and freely implemented :)

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread Jonathan Edwards
apologies in advance for prolonging this thread .. i had considered  
taking this completely offline, but thought of a few people at least  
who might find this discussion somewhat interesting .. at the least i  
haven't seen any mention of Merkle trees yet as the nerd in me yearns  
for

On Dec 5, 2007, at 19:42, bill todd - aka can you guess? wrote:

 what are you terming as ZFS' incremental risk reduction? ..  
 (seems like a leading statement toward a particular assumption)

 Primarily its checksumming features, since other open source  
 solutions support simple disk scrubbing (which given its ability to  
 catch most deteriorating disk sectors before they become unreadable  
 probably has a greater effect on reliability than checksums in any  
 environment where the hardware hasn't been slapped together so  
 sloppily that connections are flaky).

ah .. okay - at first reading incremental risk reduction seems to
imply an incomplete approach to risk .. putting various creators' and
marketing organizations' pride issues aside for a moment, ZFS isn't a
complete risk reduction - nor should it be billed as such.  However i do
believe that an interesting use of the merkle tree with a sha256 hash
is somewhat of an improvement over conventional volume based data
scrubbing techniques since there can be a unique integration between
the hash tree for the filesystem block layout and a hierarchical data
validation method.  In addition to finding unknown areas with the
scrub, you're also doing relatively inexpensive data validation
checks on every read.

 Aside from the problems that scrubbing handles (and you need  
 scrubbing even if you have checksums, because scrubbing is what  
 helps you *avoid* data loss rather than just discover it after it's  
 too late to do anything about it), and aside from problems deriving  
 from sloppy assembly (which tend to become obvious fairly quickly,  
 though it's certainly possible for some to be more subtle),  
 checksums primarily catch things like bugs in storage firmware and  
 otherwise undetected disk read errors (which occur orders of  
 magnitude less frequently than uncorrectable read errors).

sure - we've seen many transport errors, as well as firmware  
implementation errors .. in fact with many arrays we've seen data  
corruption issues with the scrub (particularly if the checksum is  
singly stored along with the data block) -  just like spam you really  
want to eliminate false positives that could indicate corruption  
where there isn't any.  if you take some time to read the on disk  
format for ZFS you'll see that there's a tradeoff that's done in  
favor of storing more checksums in many different areas instead of  
making more room for direct block pointers.

 Robert Milkowski cited some sobering evidence that mid-range arrays  
 may have non-negligible firmware problems that ZFS could often  
 catch, but a) those are hardly 'consumer' products (to address that  
 sub-thread, which I think is what applies in Stefano's case) and b)  
 ZFS's claimed attraction for higher-end (corporate) use is its  
 ability to *eliminate* the need for such products (hence its  
 ability to catch their bugs would not apply - though I can  
 understand why people who needed to use them anyway might like to  
 have ZFS's integrity checks along for the ride, especially when  
 using less-than-fully-mature firmware).

actually on this list we've seen a number of consumer level products  
including sata controllers, and raid cards (which are also becoming  
more commonplace in the consumer realm) that can be confirmed to  
throw data errors.  Code maturity issues aside, there aren't very  
many array vendors that are open-sourcing their array firmware - and  
if you consider zfs as a feature-set that could function as a multi- 
purpose storage array (systems are cheap) - i find it refreshing that  
everything that's being done under the covers is really out in the open.

 And otherwise undetected disk errors occur with negligible  
 frequency compared with software errors that can silently trash  
 your data in ZFS cache or in application buffers (especially in PC  
 environments:  enterprise software at least tends to be more stable  
 and more carefully controlled - not to mention their typical use of  
 ECC RAM).

 So depending upon ZFS's checksums to protect your data in most PC  
 environments is sort of like leaving on a vacation and locking and  
 bolting the back door of your house while leaving the front door  
 wide open:  yes, a burglar is less likely to enter by the back  
 door, but thinking that the extra bolt there made you much safer is  
 likely foolish.

granted - it's not an all-in-one solution, but by combining the  
merkle tree approach with the sha256 checksum along with periodic  
data scrubbing - it's a darn good approach .. particularly since it  
also tends to cost a lot less than what you might have to pay  
elsewhere for something you 

Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread Jonathan Edwards

On Dec 6, 2007, at 00:03, Anton B. Rang wrote:

 what are you terming as ZFS' incremental risk reduction?

 I'm not Bill, but I'll try to explain.

 Compare a system using ZFS to one using another file system -- say,  
 UFS, XFS, or ext3.

 Consider which situations may lead to data loss in each case, and  
 the probability of each such situation.

 The difference between those two sets is the 'incremental risk  
 reduction' provided by ZFS.

ah .. thanks Anton - so the next step would be to calculate the  
probability of occurrence, the impact to operation, and the return to  
service for each anticipated risk in a given environment in order to  
determine the size of the increment that constitutes the risk  
reduction that ZFS is providing.  Without this there's just a lot of  
hot air blowing around in here ..

snip

excellent summary of risks - perhaps we should also consider the  
availability and transparency of the code to potentially mitigate  
future problems .. that's currently where i'm starting to see  
tremendous value in open and free raid controller solutions to help  
drive down the cost of implementation for this sort of data  
protection instead of paying through the nose for a closed hardware  
based solutions (which is still a great margin in licensing for  
dedicated storage vendors)

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover

2007-11-12 Thread Jonathan Edwards

On Nov 10, 2007, at 23:16, Carson Gaspar wrote:

 Mattias Pantzare wrote:

 As the fsid is created when the file system is created it will be the
 same when you mount it on a different NFS server. Why change it?

 Or are you trying to match two different file systems? Then you also
 have to match all inode-numbers on your files. That is not  
 possible at
 all.

 It is, if you do block replication between the servers (drbd on Linux,
 or the Sun product whose name I'm blanking on at the moment).

AVS (or Availability Suite) ..

http://www.opensolaris.org/os/project/avs/

Jim Dunham does a nice demo here for block replication on zfs (see  
sidebar)

 What isn't clear is if zfs send/recv retains inode numbers... if it
 doesn't that's a really sad thing, as we won't be able to use ZFS to
 replace NetApp snapmirrors.

zfs send/recv comes out of the DSL which i believe will generate a  
unique fsid_guid .. for mirroring you'd really want to use AVS.
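
for completeness, the send/recv form usually looks something like this
(hostnames and dataset names are placeholders) - just don't expect the
fsid to carry over:

# zfs snapshot pool/fs@rep1
# zfs send pool/fs@rep1 | ssh otherhost zfs receive backup/fs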

btw - you can also look at the Cluster SUNWnfs agent in the ohac  
community:
http://opensolaris.org/os/community/ha-clusters/ohac/downloads/

hth
---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Count objects/inodes

2007-11-10 Thread Jonathan Edwards
Hey Bill:

what's an object here? or do we have a mapping between objects and  
block pointers?

for example a zdb -bb might show:
th37 # zdb -bb rz-7

Traversing all blocks to verify nothing leaked ...

 No leaks (block sum matches space maps exactly)

        bp count:              47
        bp logical:        518656    avg:  11035
        bp physical:        64512    avg:   1372    compression:   8.04
        bp allocated:      249856    avg:   5316    compression:   2.08
        SPA allocated:     249856    used:  0.00%

but do we maintain any sort of mapping between the object  
instantiation and how many block pointers an object or file might  
consume on disk?

---
.je

On Nov 9, 2007, at 15:18, Bill Moore wrote:

 You can just do something like this:

 # zfs list tank/home/billm
 NAMEUSED  AVAIL  REFER  MOUNTPOINT
 tank/home/billm83.9G  5.56T  74.1G  /export/home/billm
 # zdb tank/home/billm
 Dataset tank/home/billm [ZPL], ID 83, cr_txg 541, 74.1G, 111066  
 objects

 Let me know if that causes any trouble.


 --Bill

 On Fri, Nov 09, 2007 at 12:14:07PM -0700, Jason J. W. Williams wrote:
 Hi Guys,

 Someone asked me how to count the number of inodes/objects in a ZFS
 filesystem and I wasn't exactly sure. zdb -dv filesystem seems
 like a likely candidate but I wanted to find out for sure. As to why
 you'd want to know this, I don't know their reasoning but I assume it
 has to do with the maximum number of files a ZFS filesystem can
 support (2^48 no?). Thank you in advance for your help.

 Best Regards,
 Jason
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] df command in ZFS?

2007-10-18 Thread Jonathan Edwards

On Oct 18, 2007, at 11:57, Richard Elling wrote:

 David Runyon wrote:
 I was presenting to a customer at the EBC yesterday, and one of the
 people at the meeting said using df in ZFS really drives him crazy  
 (no,
 that's all the detail I have).  Any ideas/suggestions?

 Filter it.  This is UNIX after all...

err - no .. i can understand that when I put my old SA helmet on ..  
if you look at the avail capacity number below we've really got an  
overprovisioned number if you're not doing quotas - this kind of  
thing can drive you batty particularly when you're used to looking at  
df to quickly see how much space you've got left on the system ..  
it's like asking how many seats are available on this plane, and they
tell you the number of available seats on the whole airline

[EMAIL PROTECTED] # df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c5t0d0s0      454G    12G   437G     3%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   8.4G   876K   8.4G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
/usr/lib/libc/libc_hwcap2.so.1
                       454G    12G   437G     3%    /lib/libc.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                   8.4G    40K   8.4G     1%    /tmp
swap                   8.4G    24K   8.4G     1%    /var/run
/dev/dsk/c5t0d0s5      3.9G   1.8G   2.1G    46%    /var/crash2
log-pool               457G   120M   447G     1%    /log-pool
thumper-pool/n01_oraadmin1
                        16T   1.4G    13T     1%    /n01/oraadmin1
thumper-pool/n01_oraarch1
                        16T   159M    13T     1%    /n01/oraarch1
thumper-pool/n01_oradata1
                        16T    98G    13T     1%    /n01/oradata1
thumper-pool/tst08a_ctl1
                        16T    17M    13T     1%    /s01/controlfile1
thumper-pool/tst08a_ctl2
                        16T    17M    13T     1%    /s01/controlfile2
thumper-pool/tst08a_ctl3
                        16T    17M    13T     1%    /s01/controlfile3
thumper-pool/tst32a_data
                        16T   135G    13T     1%    /s01/oradata1/tst32
thumper-pool            16T   1.1T    13T     8%    /thumper-pool
thumper-pool/home       16T    45K    13T     1%    /thumper-pool/home
thumper-pool/home/db2inst1
                        16T   163G    13T     2%    /thumper-pool/home/db2inst1
thumper-pool/home/kurt
                        16T   223K    13T     1%    /thumper-pool/home/kurt
thumper-pool/home/mahadev
                        16T    40K    13T     1%    /thumper-pool/home/mahadev
thumper-pool/mrd-data
                        16T    75G    13T     1%    /thumper-pool/mrd-data
thumper-pool/software
                        16T   6.3G    13T     1%    /thumper-pool/software
thumper-pool/u01        16T   5.2G    13T     1%    /u01
thumper-pool/tst08a_data
                        16T   761G    13T     6%    /s01/oradata1/tst08
log-pool/swim           50G    24K    50G     1%    /log-pool/swim
log-pool/butterfinger
                       457G    24K   457G     1%    /log-pool/butterfinger


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] df command in ZFS?

2007-10-18 Thread Jonathan Edwards

On Oct 18, 2007, at 13:26, Richard Elling wrote:


 Yes. It is true that ZFS redefines the meaning of available space.   
 But
 most people like compression, snapshots, clones, and the pooling  
 concept.
 It may just be that you want zfs list instead, df is old-school :-)

exactly - i'm not complaining .. just understanding the confusion

I don't anticipate deprecating df in favor of zfs list, but
df_zfs or additional flags to df might be helpful .. perhaps a pool
option, and some sort of easy visual to say that the avail number
you're looking at is shared .. perhaps something like this (sorted
output would be nice too by default):

# df -F zfs -xh
Filesystem             size   used  resv   avail capacity  Mounted on
...
log-pool              (457G)  120M   ---  (447G)     1%    /log-pool
log-pool/butterfinger
                      (457G)   24K   10G  (447G)     1%    /log-pool/butterfinger
log-pool/swim          [50G]   24K   ---   [50G]     1%    /log-pool/swim
thumper-pool           (16T)  1.1T   ---   (13T)     8%    /thumper-pool
thumper-pool/home      (16T)   46K   ---   (13T)     1%    /thumper-pool/home

essentially just some way to tell at a glance that the capacity is  
either (shared) or a [quota]


 OTOH, df does have a notion of file system specific options.  It  
 might be
 useful to have a df_zfs option which would effectively show the zfs  
 list-like
 data.

yeah - i'm thinking it might be helpful to see reserved capacity here  
by default, or at least have a switch for it instead of having to  
alias zfs list -o  
name,used,reservation,available,refer,mountpoint .. i'm always  
thrown at first glance by that one:

NAME                    USED  RESERV  AVAIL  REFER  MOUNTPOINT
log-pool               10.1G    none   447G   120M  /log-pool
log-pool/butterfinger  24.5K     10G   457G  24.5K  /log-pool/butterfinger
log-pool/swim          24.5K    none  50.0G  24.5K  /log-pool/swim
thumper-pool           2.63T    none  12.9T  1.11T  /thumper-pool
thumper-pool/home       163G    none  12.9T  45.7K  /thumper-pool/home
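
(for reference, the alias being described is roughly:

alias zfsls='zfs list -o name,used,reservation,available,refer,mountpoint'

- the column names are the standard zfs property names)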

 BTW, airlines also overprovision seats, which is why you might  
 sometimes
 get bumped.  Hotels do this as well.

my point as well - meaning you're never sure if you're going to get a  
seat especially if there's a rush .. sorry looking back it's kind of  
a bad analogy

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun 6120 array again

2007-10-01 Thread Jonathan Edwards
SCSI based, but solid and cheap enclosures if you don't care about  
support:
http://search.ebay.com/search/search.dll?satitle=Sun+D1000

On Oct 1, 2007, at 12:15, Andy Lubel wrote:

 I gave up.

 The 6120 I just ended up not doing zfs.  And for our 6130 since we  
 don't
 have santricity or the sscs command to set it, I just decided to  
 export each
 disk and create an array with zfs (and a RAMSAN zil), which made  
 performance
 acceptable for us.

 I wish there was a firmware that just made these things dumb jbods!

 -Andy


 On 9/28/07 7:37 PM, Marion Hakanson [EMAIL PROTECTED] wrote:

 Greetings,

 Last April, in this discussion...
 http://www.opensolaris.org/jive/thread.jspa?messageID=143517

 ...we never found out how (or if) the Sun 6120 (T4) array can be  
 configured
 to ignore cache flush (sync-cache) requests from hosts.  We're  
 about to
 reconfigure a 6120 here for use with ZFS (S10U4), and the evil  
 tuneable
 zfs_nocacheflush is not going to serve us well (there is a ZFS  
 pool on
 slices of internal SAS drives, along with UFS boot/OS slices).

 Any pointers would be appreciated.

 Thanks and regards,

 Marion


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 -- 


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS array NVRAM cache?

2007-09-26 Thread Jonathan Edwards

On Sep 25, 2007, at 19:57, Bryan Cantrill wrote:


 On Tue, Sep 25, 2007 at 04:47:48PM -0700, Vincent Fox wrote:
 It seems like ZIL is a separate issue.

 It is very much the issue:  the seperate log device work was done  
 exactly
 to make better use of this kind of non-volatile memory.  To use  
 this, setup
 one LUN that has all of the NVRAM on the array dedicated to it, and  
 then
 use that device as a separate log device.  Works like a champ...


on the 3310/3510 you can't really do this, in the same way that you
can't create a zfs filesystem or zvol and disable the ARC for it ..
i mean we can dance around the issue and create a really big log
device on a 3310/3510 and use JBOD for the data, but i don't think
that's the point - the bottom line is that there are 2 competing cache
strategies that aren't very complementary.
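
on arrays where you can carve out an NVRAM-backed LUN, the setup Bryan
describes is just a separate log vdev (the device name here is a
placeholder):

# zpool add pool log c4t0d0

(or log <device> at zpool create time)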

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS array NVRAM cache?

2007-09-26 Thread Jonathan Edwards

On Sep 26, 2007, at 14:10, Torrey McMahon wrote:

 You probably don't have to create a LUN the size of the NVRAM  
 either. As
 long as its dedicated to one LUN then it should be pretty quick. The
 3510 cache, last I checked, doesn't do any per LUN segmentation or
 sizing. Its a simple front end for any LUN that is using cache.

yep - the policy gets set on the controller for everything served by  
it .. you could put the ZIL LUN on one controller and change the  
other controller from write back to write through, but then you  
essentially waste a controller just for the log device and controller  
failover would be a mess .. we might as well just redo the fcode for  
these arrays to be a minimized optimized zfs build, but then again -  
i don't know what does to our OEM relationships for the controllers  
or if it's even worth it in the long run .. seems like it might be  
easier to just roll our own or release a spec for the hardware  
vendors to implement.

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The ZFS-Man.

2007-09-21 Thread Jonathan Edwards

On Sep 21, 2007, at 14:57, eric kustarz wrote:

 Hi.

 I gave a talk about ZFS during EuroBSDCon 2007, and because it won  
 the
 the best talk award and some find it funny, here it is:

  http://youtube.com/watch?v=o3TGM0T1CvE

 a bit better version is here:

  http://people.freebsd.org/~pjd/misc/zfs/zfs-man.swf

 Looks like Jeff has been working out :)

my first thought too:
http://blogs.sun.com/bonwick/resource/images/bonwick.portrait.jpg

funny - i always pictured this as UFS-man though:
http://www.benbakerphoto.com/business/47573_8C-after.jpg

but what's going on with the sheep there?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/WAFL lawsuit

2007-09-06 Thread Jonathan Edwards

On Sep 6, 2007, at 14:48, Nicolas Williams wrote:

 Exactly the article's point -- rulings have consequences outside of  
 the
 original case.  The intent may have been to store logs for web server
 access (logical and prudent request) but the ruling states that RAM,
 albeit working memory, is no different than other storage and must be
 kept for discovery.  This is generalized because (as I understand) the
 defense was arguing "logs are not turned on -- they do not exist" and
 that was met with "of course the running program has this information
 in RAM and you are disposing of it" ad nauseam.  The only saving grace
 for the ruling is that it is not a higher court.

 Allowing for technical illiteracy in judges I think the obvious
 interpretation is that discoverable data should be retained and that
 but it exists only in RAM is not a defense, and rightly so.

hang on .. let me take it out and give it to you ..

I'm thinking this seems to get into v-chip territory, or otherwise  
providing a means for agencies to track information that might have  
passed through a system .. err, for the safety of our children and  
such :P
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Samba with ZFS ACL

2007-09-04 Thread Jonathan Edwards

On Sep 4, 2007, at 12:09, MC wrote:

 For everyone else:

 http://blogs.sun.com/timthomas/entry/ 
 samba_and_swat_in_solaris#comments

 It looks like nevada 70b will be the next Solaris Express  
 Developer Edition (SXDE) which should also drop shortly and should  
 also have the ZFS ACL fix, but to find the full source integration  
 you have to look in snv_72

 I wonder what is missing from 70b that is included in the full  
 source integration :)

that was my comment - 70b was a respin of snv_70 with some extra  
stuff added - meaning that the zfsacl.so.0 is released in binary form  
in the SXDE (70b) in /usr/sfw/lib/vfs, but if you want to browse the  
source consolidation for sfw you should really look here:
http://dlc.sun.com/osol/sfw/downloads/20070822/
instead of here:
http://dlc.sun.com/osol/sfw/downloads/20070724/

in S10u4 you'll need a patch that hasn't been released yet ..  
(according to Jiri some of this has to do with prioritization on  
samba.org's  releases as the zfsacl code got pushed to 3.0.26 which  
is becoming the 3.2 branch complete with the GPLv3)

to implement, you'll need the following in the smb.conf [public]  
section:

vfs objects = zfsacl
nfs4: mode = special


and for other issues around samba and the zfs_acl patch you should  
really watch jurasek's blog:
http://blogs.sun.com/jurasek/

jonathan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS raid is very slow???

2007-07-07 Thread Jonathan Edwards


On Jul 7, 2007, at 06:14, Orvar Korvar wrote:


When I copy that file from ZFS to /dev/null I get this output:
real0m0.025s
user0m0.002s
sys 0m0.007s
which can't be correct. Is it wrong of me to use time cp fil fil2  
when measuring disk performance?


well you're reading and writing to the same disk so that's going to  
affect performance, particularly as you're seeking to different areas  
of the disk both for the files and for the uberblock updates .. in  
the above case it looks like the file is already cached (buffer cache  
being what is probably consuming most of your memory here) - so  
you're just looking at a memory to memory transfer here .. if you  
want to see a simple write performance test many people use dd like so:


# timex dd if=/dev/zero of=file bs=128k count=8192

which will give you a measure of an efficient 1GB file write of  
zeros .. or use an opensource tool like iozone to get a better
fix on single thread vs multi-thread, read/write mix, and block size
differences for your given filesystem and storage layout
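
an iozone run along these lines gives a more rounded picture (the path
is a placeholder; -a walks the record sizes automatically and -g caps
the maximum file size):

# iozone -a -g 2g -f /tank/fs/iozone.tmp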


jonathan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: shareiscsi is cool, but what about sharefc or sharescsi?

2007-06-01 Thread Jonathan Edwards


On Jun 1, 2007, at 18:37, Richard L. Hamilton wrote:


Can one use a spare SCSI or FC controller as if it were a target?


we'd need an FC or SCSI target mode driver in Solaris .. let's just  
say we

used to have one, and leave it mysteriously there.  smart idea though!

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-15 Thread Jonathan Edwards


On May 15, 2007, at 13:13, Jürgen Keil wrote:


Would you mind also doing:

ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=1

to see the raw performance of underlying hardware.


This dd command is reading from the block device,
which might cache dataand probably splits requests
into maxphys pieces (which happens to be 56K on an
x86 box).


to increase this to say 8MB, add the following to /etc/system:

set maxphys=0x800000

and you'll probably want to increase sd_max_xfer_size as
well (should be 256K on x86/x64) .. add the following to
/kernel/drv/sd.conf:

sd_max_xfer_size=0x800000;

then reboot to get the kernel and sd tunings to take.

---
.je

btw - the defaults on sparc:
maxphys = 128K
ssd_max_xfer_size = maxphys
sd_max_xfer_size = maxphys


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with adding existing EFI disks to a zpool

2007-05-05 Thread Jonathan Edwards


On May 5, 2007, at 09:34, Mario Goebbels wrote:

I spend yesterday all day evading my data of one of the Windows  
disks, so that I can add it to the pool. Using mount-ntfs, it's a  
pain due to its slowness. But once I finished, I thought Cool,  
let's do it. So I added the disk using the zero slice notation  
(c0d0s0), as suggested for performance reasons. I checked the pool  
status and noticed however that the pool size didn't raise.


After a short panic (myself, not the kernel), I remembered that I  
partitioned this disk as EFI disk in Windows (mostly just because).  
c0d0s0 was the emergency, boot or whatever partition automatically  
created according to the recommended EFI partitioning scheme. So it  
added the minimal space of that partition to the pool. The real  
whole disk partition was c0d0s1. Since there's no device removal in  
ZFS yet, I had to replace slice 0 with slice 1 since destroying the  
pool was out of the question.


Two things now:
a) ZFS would have added EFI labels anyway. Will ZFS figure things  
out for itself, or did I lose write cache control because I didn't  
explicitely specify s0 though this is an EFI disk already?


yes - if you add the whole device to the pool .. that is, use c0t0d0 instead  
of c0t0d0s0 .. in this case, ZFS creates a large partition on s0  
starting at sector 34 and encompassing the entire disk.  If you need  
to check the write_cache use format -e, cache, write_cache, display.


b) I don't remember it mentioned anywhere in the documentation. If  
a) is indeed an issue, it should be mentioned that you have to  
unlabel EFI disks before adding.


Removing an EFI label is a little trickier .. you can replace the EFI  
label with an SMI label if it's below 1TB (format -e then l) and then  
dd if=/dev/zero of=/dev/dsk/c0t0d0s2 bs=512 count=1 to remove the  
SMI label .. or you could also attempt to access the entire disk  
(c0t0d0) with dd and zero out the first 17KB and the last 8MB, but  
you'd have to get the 8MB offset from the VTOC.  You know you've got  
an empty label if you get stderr entries at the top of the format  
output, or syslog messages around corrupt label - bad magic number


Jonathan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 6410 expansion shelf

2007-03-27 Thread Jonathan Edwards
right on for optimizing throughput on solaris .. a couple of notes  
though (also mentioned in the QFS manuals):


- on x86/x64 you're just going to have an sd.conf so just increase  
the max_xfer_size for all with a line at the bottom like:

sd_max_xfer_size=0x800000;
(note: if you look at the source the ssd driver is built from the sd  
source .. it got collapsed back down to sd in S10 x86)


- ssd_max_throttle or sd_max_throttle is typically a point of  
contention that has had many years of history with storage vendors ..  
this will limit the maximum queue depth across the board for all sd  
or ssd devices (read all disks) .. if you're using the native  
Leadville stack, there is a dynamic throttle that should adjust per  
target, so you really shouldn't have to set this unless you're seeing  
command timeouts either on the port or on the host.  By tuning this  
down you can affect performance on the root drives as well as  
external storage making solaris appear slower than it may or may not be.


- ZFS has a maximum block size of 128KB - so i don't think that
tuning up maxphys and the max transfer sizes to 8MB is going to
make that much difference here .. if you want larger block transfers
(possibly matching a full stripe width) you'd have to either go
with QFS or raw (but note that with larger block transfers you can
get into higher cache latency response times depending on the storage
controller .. and that's a whole other discussion)



On Mar 27, 2007, at 08:24, Rayson Ho wrote:


BTW, did anyone try this??

http://blogs.sun.com/ValdisFilks/entry/improving_i_o_throughput_for

Rayson



On 3/27/07, Wee Yeh Tan [EMAIL PROTECTED] wrote:

As promised.  I got my 6140 SATA delivered yesterday and I hooked it
up to a T2000 on S10u3.  The T2000 saw the disks straight away and is
working for the last 1 hour.  I'll be running some benchmarks on  
it.

 I'll probably have a week with it until our vendor comes around and
steals it from me.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Perforce on ZFS

2007-02-20 Thread Jonathan Edwards

Roch

what's the minimum allocation size for a file in zfs?  I get 1024B by  
my calculation (1 x 512B block allocation (minimum) + 1 x 512B inode/ 
znode allocation) since we never pack file data in the inode/znode.   
Is this a problem?  Only if you're trying to pack a lot of small
files in a limited amount of space, or if you're concerned about
trying to access many small files quickly.


VxFS has a 96B immediate area for file, symlink, or directory data;  
NTFS can store small files in the MFT record; NetApp WAFL can also  
store small files in the 4KB inode (16 Block pointers = 128B?) .. if  
you look at some of the more recent OSD papers and some of the Lustre/ 
BlueArc work you'll see that this topic comes into play for  
performance in pre-fetching file data and locality issues for  
optimizing heavy access of many small files.


---
.je

On Feb 20, 2007, at 05:12, Roch - PAE wrote:



Sorry to insist  but I am not  aware of a small file problem
with  ZFS (which doesn't mean there   isn't one, nor that we
agree on definition of 'problem'). So  if anyone has data on
this topic, I'm interested.

Also note, ZFS does a lot more than VxFS.

-r

Claude Teissedre writes:

Hello Roch,

Thanks for your reply. According to Iozone and Filebench
(http://blogs.sun.com/dom/), ZFS is less performant than VxFS for  
smalll

files and more performant for large files. In you blog, I don't see
specific infos related to small files -but it's a very interesting  
blog.


Any help from CC: people related to Perforce benchmark (not in
techtracker) is welcome.

Thanks,
Claude

Roch - PAE wrote:

Salut Claude.
For this kind of query, try zfs-discuss@opensolaris.org;
Looks like a common workload to me.
I know of no small file problem with ZFS.
You might want to state your metric of success ?

-r

Claude Teissedre writes:

Hello,

I am looking for any benchmark of Perforce on ZFS.
My need here is specifically for Perforce, a source manager. At  
my ISV, it handles 250 users simustaneously (15 instances on  
average)
and 16 Millions (small) files. That's an area not covered in the  
benchmaks I have seen.


Thanks, Claude









___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Perforce on ZFS

2007-02-20 Thread Jonathan Edwards


On Feb 20, 2007, at 15:05, Krister Johansen wrote:


what's the minimum allocation size for a file in zfs?  I get 1024B by
my calculation (1 x 512B block allocation (minimum) + 1 x 512B inode/
znode allocation) since we never pack file data in the inode/znode.
 Is this a problem?  Only if you're trying to pack a lot of small
 files in a limited amount of space, or if you're concerned about
trying to access many small files quickly.


 This is configurable on a per-dataset basis.  Look in zfs(1m) for
 recordsize.


the minimum is still 512B .. (try creating a bunch of 10B files - they
show up as ZFS plain files each with a 512B data block in zdb)
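
easy enough to check for yourself (a sketch - pool/fs is a placeholder
and the zdb output format moves around between builds):

# for i in 1 2 3 4 5; do echo 0123456789 > /pool/fs/tiny$i; done
# sync
# zdb -dddd pool/fs

the object dump should show each of the tiny files holding a single
512B data block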


VxFS has a 96B immediate area for file, symlink, or directory data;
NTFS can store small files in the MFT record; NetApp WAFL can also
store small files in the 4KB inode (16 Block pointers = 128B?) .. if
you look at some of the more recent OSD papers and some of the  
Lustre/

BlueArc work you'll see that this topic comes into play for
performance in pre-fetching file data and locality issues for
optimizing heavy access of many small files.


ZFS has something similar.  It's called a bonus buffer.


i see .. but currently we're only storing symbolic links there since  
given the
bufsize of 320B - the znode_phys struct of 264B, we've only got 56B  
left for
data in the 512B dnode_phys struct .. i'm thinking we might want to  
trade
off some of the uint64_t meta attributes with something smaller and  
maybe
eat into the pad to get a bigger data buffer .. of course that will  
also affect

the reporting end of things, but should be easily fixable.

just my 2p
---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] se3510 and ZFS

2007-02-06 Thread Jonathan Edwards


On Feb 6, 2007, at 06:55, Robert Milkowski wrote:


Hello zfs-discuss,

  It looks like when zfs issues write cache flush commands se3510
  actually honors it. I do not have right now spare se3510 to be 100%
  sure but comparing nfs/zfs server with se3510 to another nfs/ufs
  server with se3510 with Periodic Cache Flush Time set to disable
  or so longer time I can see that cache utilization on nfs/ufs stays
  about 48% while on nfs/zfs it's hardly reaches 20% and every few
  seconds goes down to 0 (I guess every txg_time).

  nfs/zfs also has worse performance than nfs/ufs.

  Does anybody know how to tell se3510 not to honor write cache flush
  commands?


I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array
to flush the cache.  Gauging from the amount of calls that zfs makes to
this vs ufs (fsck, lockfs, mount?) - i think you'll see the performance
diff, particularly when you hit an NFS COMMIT.  (If you don't use vdevs
you may see another difference in zfs as the only place you'll hit is on
the zil)


btw - you may already know, but you'll also fall to write-through on  
the cache
if your battery charge drops and we also recommend setting to write- 
through
when you only have a single controller since a power event could  
result in

data loss.  Of course there's a big performance difference between
write-back and write-through cache
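
fwiw - if the cache really is battery backed and you trust it, the
other way around this is to stop ZFS from issuing the flush in the
first place rather than trying to change the 3510's behaviour.  newer
bits have a tunable along these lines (i haven't checked exactly which
builds carry it, so verify on yours before leaning on it):

set zfs:zfs_nocacheflush = 1      (in /etc/system, then reboot)

only sane when *every* pool on the box sits behind non-volatile cache,
so treat it as something to test rather than a blanket recommendation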

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] se3510 and ZFS

2007-02-06 Thread Jonathan Edwards


On Feb 6, 2007, at 11:46, Robert Milkowski wrote:

  Does anybody know how to tell se3510 not to honor write cache  
flush

  commands?


JE I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the  
array
JE to flush the cache.  Gauging from the amount of calls that zfs  
makes to

JE this vs ufs (fsck, lockfs, mount?)


correction .. UFS uses _FIOFFS which is a file ioctl not a device  
ioctl which makes
sense given the difference in models .. hence UFS doesn't care if the  
device write
cache is turned on or off as it only makes dkio calls for geometry,  
info and such.


you can poke through the code to see what other dkio ioctls are being  
made by z ..
i believe it's due to the design of a closer tie between the  
underlying devices and
the file system that there's a big difference.  The DKIOCFLUSH PSARC  
is here:

http://www.opensolaris.org/os/community/arc/caselog/2004/652/spec/

however I'm not sure if the 3510 maintains a difference between the  
entire array cache
and the cache for a single LUN/device .. we'd have to dig up one of  
the firmware
engineers for a more definitive answer.  Point well taken on shared  
storage if we're

flushing an array cache here :)

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which label a ZFS/ZPOOL device has ? VTOC or EFI ?

2007-02-04 Thread Jonathan Edwards


On Feb 3, 2007, at 02:31, dudekula mastan wrote:

After creating the ZFS file system on a VTOC labeled disk, I am  
seeing the following warning messages.


Feb  3 07:47:00 scoobyb Corrupt label; wrong magic number
Feb  3 07:47:00 scoobyb scsi: [ID 107833 kern.warning] WARNING: / 
scsi_vhci/[EMAIL PROTECTED] (ssd156):


Any idea on this ?


This generally means that this device doesn't have a label - and this
particular device would be the multipathed device identified by the
GUID 600508b400102eb70001204b or the old BSD style driver
enumeration ssd156 .. (take a look at
http://access1.sun.com/codesamples/disknames.html to see an example on
how to use libdevinfo to convert this to the SVR4 c#t#d# style name)


Now with ZFS if you don't specify a slice, you're essentially asking  
ZFS to use and autolabel the entire disk which will put an EFI style  
label on since the older sun style VTOC labels have an upper limit of  
1TB per disk (EFI should work up to 2^64 LBAs.)  The older sun VTOC  
labels typically use slice 2 as a backup to show the entire disk and  
will store the label in the first 512B, whereas the EFI labels will  
use 34 sectors at the start of the disk to store the label, and will  
also reserve a portion at the tail end of the disk for a backup label.


With the older sun style VTOC labels, if you ever overwrite the first
512B on cylinder 0 of the disk (eg: dd if=/dev/zero of=/dev/rdsk/c1t1d0s2
where s2 is the typical backup label starting at cylinder 0) you'll
overwrite the label, whereas with the EFI label you have to overwrite
both protected sections of the disk.


So to reiterate what Robert and Tomas have already gone into .. if  
you plan on using the entire disk and want the vdev benefits (the  
ability to import/export pools, write caching, etc) you should  
probably not specify a slice and allow ZFS to autolabel the disk as  
it sees fit.
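
to make that concrete (device names made up):

# zpool create tank c1t1d0     (whole disk - ZFS writes the EFI label
                                and can safely turn on the write cache)
# zpool create tank c1t1d0s0   (slice - the existing VTOC label stays)
# prtvtoc /dev/rdsk/c1t1d0s0   (prints the partition table either way)
# format -e                    (the expert menu lets you relabel back
                                to SMI/VTOC if you ever need to)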


hth

.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Project Proposal: Availability Suite

2007-02-02 Thread Jonathan Edwards


On Feb 2, 2007, at 15:35, Nicolas Williams wrote:


Unlike traditional journalling replication, a continuous ZFS send/recv
scheme could deal with resource constraints by taking a snapshot and
throttling replication until resources become available again.
Replication throttling would mean losing some transaction history, but
since we don't expose that right now, nothing would be lost.

Scoreboarding (what SNDR does) should perform better in general,  
but in
the case of COW filesystems and databases ISTM that it should be a  
wash

unless it's properly integrated with the COW system, and that's what
makes me think scoreboarding and journalling approach each other at  
the

limit when integrated with ZFS.


hmm .. a COW scoreboard .. visions of Clustra with the notion of  
each node
is an atomic failure unit spring to mind .. of course in this light,  
there's not
much of a difference between just replication and global  
synchronization ..


very interesting ..

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS or UFS - what to do?

2007-01-29 Thread Jonathan Edwards

On Jan 26, 2007, at 09:16, Jeffery Malloch wrote:


Hi Folks,

I am currently in the midst of setting up a completely new file  
server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM)  
connected to an Engenio 6994 product (I work for LSI Logic so  
Engenio is a no brainer).  I have configured a couple of zpools  
from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB.  I  
then created sub zfs systems below that and set quotas and  
sharenfs'd them so that it appears that these file systems are  
dynamically shrinkable and growable.


ah - the 6994 is the controller we use in the 6140/6540 if i'm not  
mistaken .. i guess this thread will go down in a flaming JBOD vs  
RAID controller religious war again .. oops, too late :P


yes - the dynamic LUN expansion bits in ZFS is quite nice and handy  
for managing dynamic growth of a pool or file system.  so going back  
to Jeffery's original questions:




1.  How stable is ZFS?  The Engenio box is completely configured  
for RAID5 with hot spares and write cache (8GB) has battery backup  
so I'm not too concerned from a hardware side.  I'm looking for an  
idea of how stable ZFS itself is in terms of corruptability, uptime  
and OS stability.


I think the stability issue has already been answered pretty well ..

8GB battery backed cache is nice .. performance wise you might find
some odd interactions with the ZFS adaptive cache integration and the
way in which the intent log operates (O_DSYNC writes can potentially
impose a lot of in flight commands for relatively little work) -
there's a max blocksize of 128KB (also maxphys), so you might want to
experiment with tuning back the stripe width .. i seem to recall the
6994 controller seemed to perform best with a 256KB or 512KB stripe
width .. so there may be additional tuning on the read-ahead or
write-behind algorithms.
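
on the zfs side, recordsize is the knob for that experiment (the
dataset name is just an example, and it only affects files written
after the change):

# zfs get recordsize tank/fs
# zfs set recordsize=64k tank/fs

the idea being to keep the zfs block size and the array stripe width
evenly divisible into one another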


2.  Recommended config.  Above, I have a fairly simple setup.  In  
many of the examples the granularity is home directory level and  
when you have many many users that could get to be a bit of a  
nightmare administratively.  I am really only looking for high  
level dynamic size adjustability and am not interested in its built  
in RAID features.  But given that, any real world recommendations?


Not being interested in the RAID functionality as Roch points out  
eliminates the self-healing functionality and reconstruction bits in  
ZFS .. but you still get other nice benefits like dynamic LUN expansion


As i see it, since we seem to have excess CPU and bus capacity on  
newer systems (most applications haven't quite caught up to impose  
enough of a load yet) .. we're back to the mid '90s where host based  
volume management and caching makes sense and is being proposed  
again.  Being proactive, we might want to consider putting an  
embedded Solaris/ZFS on a RAID controller to see if we've really got  
something novel in the caching and RAID algorithms for when the  
application load really does catch up and impose more of a load on  
the host.  Additionally - we're seeing that there's a big benefit in  
moving the filesystem closer to the storage array since most users  
care more about their consistency of their data (upper level) than  
the reliability of the disk subsystem or RAID controller.   
Implementing a RAID controller that's more intimately aware of the  
upper data levels seems like the next logical evolutionary step.


3.  Caveats?  Anything I'm missing that isn't in the docs that  
could turn into a BIG gotchya?


I would say be careful of the ease at which you can destroy file  
systems and pools .. while convenient - there's typically no warning  
if you or an administrator does a zfs or zpool destroy .. so i could  
see that turning into an issue.  Also if a LUN goes offline, you may  
not see this right away and you would have the potential to corrupt  
your pool or panic your system.  Hence the self-healing and scrub  
options to detect and repair failure a little bit faster.  People on  
this forum have been finding RAID controller inconsistencies .. hence  
the religious JBOD vs RAID ctlr disruptive paradigm shift
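
to give a flavour of how little stands between you and an empty pool
(pool name hypothetical):

# zpool destroy tank       (no confirmation prompt at all)
# zpool import -D          (lists recently destroyed pools)
# zpool import -D tank     (can bring it back, provided the devices
                            haven't been reused in the meantime)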


4.  Since all data access is via NFS we are concerned that 32 bit  
systems (Mainly Linux and Windows via Samba) will not be able to  
access all the data areas of a 2TB+ zpool even if the zfs quota on  
a particular share is less then that.  Can anyone comment?


Doing 2TB+ shouldn't be a problem for the NFS or Samba mounted  
filesystem regardless if the host is 32bit or not.  The only place  
where you can run into a problem is if the size of an individual file  
crosses 2 or 4TB on a 32bit system.  I know we've implemented file  
systems (QFS in this case) that were samba shared to 32bit windows  
hosts in excess of 40-100TB without any major issues.  I'm sure  
there's similar cases with ZFS and thumper .. i just don't have that  
data.


a little late to the discussion, but hth
---
.je

Re: [zfs-discuss] Re: ZFS or UFS - what to do?

2007-01-29 Thread Jonathan Edwards


On Jan 29, 2007, at 14:17, Jeffery Malloch wrote:


Hi Guys,

SO...

From what I can tell from this thread ZFS if VERY fussy about  
managing writes,reads and failures.  It wants to be bit perfect.   
So if you use the hardware that comes with a given solution (in my  
case an Engenio 6994) to manage failures you risk a) bad writes  
that don't get picked up due to corruption from write cache to  
disk b) failures due to data changes that ZFS is unaware of that  
the hardware imposes when it tries to fix itself.


So now I have a $70K+ lump that's useless for what it was designed  
for.  I should have spent $20K on a JBOD.  But since I didn't do  
that, it sounds like a traditional model works best (ie. UFS et al)  
for the type of hardware I have.  No sense paying for something and  
not using it.  And by using ZFS just as a method for ease of file  
system growth and management I risk much more corruption.


The other thing I haven't heard is why NOT to use ZFS.  Or people  
who don't like it for some reason or another.


Comments?


I put together this chart a while back .. i should probably update it  
for RAID6 and RAIDZ2


#   ZFS  ARRAY HW     CAPACITY  COMMENTS
--  ---  --------     --------  --------
1   R0   R1           N/2       hw mirror - no zfs healing
2   R0   R5           N-1       hw R5 - no zfs healing
3   R1   2 x R0       N/2       flexible, redundant, good perf
4   R1   2 x R5       (N/2)-1   flexible, more redundant, decent perf
5   R1   1 x R5       (N-1)/2   parity and mirror on same drives (XXX)
6   RZ   R0           N-1       standard RAID-Z no mirroring
7   RZ   R1 (tray)    (N/2)-1   RAIDZ+1
8   RZ   R1 (drives)  (N/2)-1   RAID1+Z (highest redundancy)
9   RZ   3 x R5       N-4       triple parity calculations (XXX)
10  RZ   1 x R5       N-2       double parity calculations (XXX)

(note: I included the cases where you have multiple arrays with a
single lun per vdisk (say) and where you only have a single array
split into multiple LUNs.)
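
for reference, rows 3 and 6 come out to something like this on the
command line (controller/target numbers made up):

# zpool create tank mirror c2t0d0 c3t0d0                  (row 3)
# zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0     (row 6)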


The way I see it, you're better off picking either controller parity  
or zfs parity .. there's no sense in computing parity multiple times  
unless you have cycles to spare and don't mind the performance hit ..  
so the questions you should really answer before you choose the  
hardware is what level of redundancy to capacity balance do you want?  
and whether or not you want to compute RAID in ZFS host memory or out  
on a dedicated blackbox controller?  I would say something about  
double caching too, but I think that's moot since you'll always cache  
in the ARC if you use ZFS the way it's currently written.


Other feasible filesystem options for Solaris - UFS, QFS, or vxfs  
with SVM or VxVM for volume mgmt if you're so inclined .. all depends  
on your budget and application.  There's currently tradeoffs in each  
one, and contrary to some opinions, the death of any of these has  
been grossly exaggerated.


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Thumper Origins Q

2007-01-25 Thread Jonathan Edwards


On Jan 25, 2007, at 10:16, Torrey McMahon wrote:


Albert Chin wrote:

On Wed, Jan 24, 2007 at 10:19:29AM -0800, Frank Cusack wrote:

On January 24, 2007 10:04:04 AM -0800 Bryan Cantrill  
[EMAIL PROTECTED] wrote:



On Wed, Jan 24, 2007 at 09:46:11AM -0800, Moazam Raja wrote:


Well, he did say fairly cheap. the ST 3511 is about $18.5k. That's
about the same price for the low-end NetApp FAS250 unit.


Note that the 3511 is being replaced with the 6140:

Which is MUCH nicer but also much pricier.  Also, no non-RAID  
option.




So there's no way to treat a 6140 as JBOD? If you wanted to use a  
6140

with ZFS, and really wanted JBOD, your only choice would be a RAID 0
config on the 6140?


Why would you want to treat a 6140 like a JBOD? (See the previous  
threads about JBOD vs HW RAID...)


I was trying to see if we sold the CSM2 trays without the controller,  
but I don't think that's commonly asked for .. reminds me of the old  
D1000 days - i seem to recall putting in more of those as the A1000  
controllers weren't the greatest and people tended to opt for s/w  
mirrors instead.  Then as the system application load went higher and  
the data became more critical the push was towards offloading this  
onto better storage controllers .. so since it seems like we now have  
more processing and bus speed on the system that applications aren't  
taking advantage of yet, it looks like the pendulum might be swinging  
back towards host-based RAID again.


not a verdict .. just a thought
---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Thumper Origins Q

2007-01-25 Thread Jonathan Edwards


On Jan 25, 2007, at 14:34, Bill Sommerfeld wrote:


On Thu, 2007-01-25 at 10:16 -0500, Torrey McMahon wrote:

So there's no way to treat a 6140 as JBOD? If you wanted to use a  
6140

with ZFS, and really wanted JBOD, your only choice would be a RAID 0
config on the 6140?


Why would you want to treat a 6140 like a JBOD? (See the previous
threads about JBOD vs HW RAID...)


Let's turn this around.  Assume I want a FC JBOD.  What should I get?


perhaps something coming real soon .. (stall)

---
.je

btw - I've also said you could do a FC target in a thumper a la  
FalconStor .. but
i'm not sure if they've got that going on S10, and their target  
multipathing
was less than stellar .. we did have a target mode driver at one  
point, but i

think that project got scrapped a while back.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Thumper Origins Q

2007-01-25 Thread Jonathan Edwards


On Jan 25, 2007, at 17:30, Albert Chin wrote:


On Thu, Jan 25, 2007 at 02:24:47PM -0600, Al Hopper wrote:

On Thu, 25 Jan 2007, Bill Sommerfeld wrote:


On Thu, 2007-01-25 at 10:16 -0500, Torrey McMahon wrote:

So there's no way to treat a 6140 as JBOD? If you wanted to use  
a 6140
with ZFS, and really wanted JBOD, your only choice would be a  
RAID 0

config on the 6140?


Why would you want to treat a 6140 like a JBOD? (See the previous
threads about JBOD vs HW RAID...)


Let's turn this around.  Assume I want a FC JBOD.  What should I  
get?


Many companies make FC expansion boxes to go along with their FC  
based
hardware RAID arrays.  Often, the expansion chassis is identical  
to the
RAID equipped chassis - same power supplies, same physical chassis  
and
disk drive carriers - the only difference is that the slots used  
to house
the (dual) RAID H/W controllers have been blanked off.  These  
expansion
chassis are designed to be daisy chained back to the box with  
the H/W

RAID.  So you simply use one of the expansion chassis and attach it
directly to a system equipped with an FC HBA and ... you've got an FC
JBOD.  Nearly all of them will support two FC connections to allow  
dual
redundant connections to the FC RAID H/W.  So if you equip your  
ZFS host
with either a dual-port FC HBA or two single-port FC HBAs - you  
have a

pretty good redundant FC JBOD solution.

An example of such an expansion box is the DS4000 EXP100 from  
IBM.  It's
also possible to purchase a 3510FC box from Sun with no RAID  
controllers -
but their nearest equivalent of an empty box comes with 6  
(overpriced)

disk drives pre-installed. :(

Perhaps you could use your vast influence at Sun to persuade them  
to sell
an empty 3510FC box?  Or an empty box bundled with a single or  
dual-port

FC card (Qlogic based please).  Well - there's no harm in making the
suggestion ... right?


Well, when you buy disk for the Sun 5320 NAS Appliance, you get a
Controller Unit shelf and, if you expand storage, an Expansion Unit
shelf that connects to the Controller Unit. Maybe the Expansion Unit
shelf is a JBOD 6140?


that's the CSM200 - the IOMs in that should just take a 2Gb or 4Gb SFP
(copper or fibre) and the tray should run switched loop so you can mix
FC and SATA as it connects back to the 6140 or 6540 controller head.

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Thumper Origins Q

2007-01-24 Thread Jonathan Edwards


On Jan 24, 2007, at 09:25, Peter Eriksson wrote:

too much of our future roadmap, suffice it to say that one should  
expect
much, much more from Sun in this vein: innovative software and  
innovative
hardware working together to deliver world-beating systems with  
undeniable

economics.


Yes please. Now give me a fairly cheap (but still quality) FC- 
attached JBOD utilizing SATA/SAS disks and I'll be really happy! :-)


Could you outline why FC attached instead of network attached (iSCSI  
say) makes more sense to you?  It might help to illustrate the demand  
for an FC target I'm hearing instead of just a network target ..


.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread Jonathan Edwards


On Jan 24, 2007, at 06:54, Roch - PAE wrote:


[EMAIL PROTECTED] writes:
Note also that for most applications, the size of their IO  
operations

would often not match the current page size of the buffer, causing
additional performance and scalability issues.


Thanks for mentioning this, I forgot about it.

Since ZFS's default block size is configured to be larger than a  
page,

the application would have to issue page-aligned block-sized I/Os.
Anyone adjusting the block size would presumably be responsible for
ensuring that the new size is a multiple of the page size.  (If they
would want Direct I/O to work...)

I believe UFS also has a similar requirement, but I've been wrong
before.



I believe the UFS requirement is that the I/O be sector
aligned for DIO to be attempted. And Anton did mention that
one of the benefit of DIO is the ability to direct-read a
subpage block. Without UFS/DIO the OS is required to read and
cache the full page and the extra amount of I/O may lead to
data channel saturation (I don't see latency as an issue in
here, right ?).


In QFS there are mount options to do automatic type switching
depending on whether or not the IO is sector aligned or not.  You
essentially set a trigger to switch to DIO if you receive a tunable
number of well aligned IO requests.  This helps tremendously in
certain streaming workloads (particularly write) to reduce overhead.


This is where I said that such a feature would translate
for ZFS into the ability to read parts of a filesystem block
which would only make sense if checksums are disabled.


would it be possible to do checksums a posteriori? .. i suspect that
the checksum portion of the transaction may not be atomic though
and this leads us back towards the older notion of a DIF.


And for RAID-Z that could mean avoiding I/Os to each disks but
one in a group, so that's a nice benefit.

So  for the  performance  minded customer that can't  afford
mirroring, is not  much a fan  of data integrity, that needs
to do subblock reads to an  uncacheable workload, then I can
see a feature popping up. And this feature is independant on
whether   or not the data  is  DMA'ed straight into the user
buffer.


certain streaming write workloads that are time dependent can
fall into this category .. if i'm doing a DMA read directly from a
device's buffer that i'd like to stream - i probably want to avoid
some of the caching layers of indirection that will probably impose
more overhead.

The idea behind allowing an application to advise the filesystem
of how it plans on doing its IO (or the state of its own cache or
buffers or stream requirements) is to prevent the one cache fits
all sort of approach that we currently seem to have in the ARC.


The  other  feature,  is to  avoid a   bcopy by  DMAing full
filesystem block reads straight into user buffer (and verify
checksum after). The I/O is high latency, bcopy adds a small
amount. The kernel memory can  be freed/reuse straight after
the user read  completes. This is  where I ask, how much CPU
is lost to the bcopy in workloads that benefit from DIO ?


But isn't the cost more than just the bcopy?  Isn't there additional
overhead in the TLB/PTE from the page invalidation that needs
to occur when you do actually go to write the page out or flush
the page?


At this point, there are lots of projects  that will lead to
performance improvements.  The DIO benefits seems like small
change in the context of ZFS.

The quickest return on  investement  I see for  the  directio
hint would be to tell ZFS to not grow the ARC when servicing
such requests.


How about the notion of multiple ARCs that could be referenced
or fine tuned for various types of IO workload profiles to provide a
more granular approach?  Wouldn't this also keep the page tables
smaller and hopefully more contiguous for atomic operations? Not
sure what this would break ..

.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thumper Origins Q

2007-01-24 Thread Jonathan Edwards


On Jan 24, 2007, at 12:41, Bryan Cantrill wrote:




well, Thumper is actually a reference to Bambi


You'd have to ask Fowler, but certainly when he coined it, Bambi  
was the
last thing on anyone's mind.  I believe Fowler's intention was one  
that

thumps (or, in the unique parlance of a certain Commander-in-Chief,
one that gives a thumpin').


You can take your pick of things that thump here:
http://en.wikipedia.org/wiki/Thumper

given the other name is the X4500 .. it does seem like it should be a  
weapon


---
.je

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread Jonathan Edwards

Roch

I've been chewing on this for a little while and had some thoughts

On Jan 15, 2007, at 12:02, Roch - PAE wrote:



Jonathan Edwards writes:


On Jan 5, 2007, at 11:10, Anton B. Rang wrote:


DIRECT IO is a set of performance optimisations to circumvent
shortcomings of a given filesystem.


Direct I/O as generally understood (i.e. not UFS-specific) is an
optimization which allows data to be transferred directly between
user data buffers and disk, without a memory-to-memory copy.

This isn't related to a particular file system.



true .. directio(3) is generally used in the context of *any* given
filesystem to advise it that an application buffer to system buffer
copy may get in the way or add additional overhead (particularly if
the filesystem buffer is doing additional copies.)  You can also look
at it as a way of reducing more layers of indirection particularly if
I want the application overhead to be higher than the subsystem
overhead.  Programmatically .. less is more.


Direct IO makes good sense when the target disk sectors are
set a priori. But in the context of ZFS, would you rather
have 10 direct disk I/Os or 10 bcopies and 2 I/O (say that
was possible).


sure, but in a well designed filesystem this is essentially the
same as efficient buffer cache utilization .. coalescing IO
operations to commit on a more efficient and larger disk
allocation unit.  However, paged IO (and in particular ZFS
paged IO) is probably a little more than simply a bcopy()
in comparison to Direct IO (at least in the QFS context)


As for read, I  can see that when  the load is cached in the
disk array and we're running  100% CPU, the extra copy might
be noticeable. Is this the   situation that longs for DIO  ?
What % of a system is spent in the copy  ? What is the added
latency that comes from the copy ? Is DIO the best way to
reduce the CPU cost of ZFS ?


To achieve maximum IO rates (in particular if you have a flexible
blocksize and know the optimal stripe width for the best raw disk
or array logical volume performance) you're going to do much
better if you don't have to pass through buffered IO strategies
with the added latencies and kernel space dependencies.

Consider the case where you're copying or replicating from one
disk device to another in a one-time shot.  There's tremendous
advantage in bypassing the buffer and reading and writing full
stripe passes.  The additional buffer copy is also going to add
latency and affect your run queue, particularly if you're working
on a shared system as the buffer cache might get affected by
memory pressure, kernel interrupts, or other applications.

Another common case could be line speed network data capture
if the frame size is already well aligned for the storage device.
Being able to attach one device to another with minimal kernel
intervention should be seen as an advantage for a wide range
of applications that need to stream data from device A to device
B and already know more than you might about both devices.


The  current Nevada  code base  has  quite nice  performance
characteristics  (and  certainly   quirks); there are   many
further efficiency gains to be reaped from ZFS. I just don't
see DIO on top of  that list for now.   Or at least  someone
needs to  spell out what  is ZFS/DIO and  how much better it
is expected to be (back of the envelope calculation accepted).


the real benefit is measured more in terms of memory consumption
for a given application and the type of balance between application
memory space and filesystem memory space.  when the filesystem
imposes more pressure on the application due to it's mapping you're
really measuring the impact of doing an application buffer read and
copy for each write.  In other words you're imposing more of a limit
on how the application should behave with respect to it's notion of
the storage device.

DIO should not be seen as a catchall for the notion of "more
efficiency will be gotten by bypassing the filesystem buffers" but
rather as "please don't buffer this since you might push back on
me and I don't know if I can handle a push back" advice


Reading RAID-Z  subblocks on filesystems that  have checksum
disabled might be interesting.   That would avoid  some disk
seeks.To served  the  subblocks directly   or  not is  a
separate matter; it's  a small deal  compared to the feature
itself.  How about disabling the  DB  checksum (it can't fix
the block anyway) and do mirroring ?


Basically speaking - there needs to be some sort of strategy for
bypassing the ARC or even parts of the ARC for applications that
may need to advise the filesystem of either:
1) the delicate nature of imposing additional buffering for their
data flow
2) already well optimized applications that need more adaptive
cache in the application instead of the underlying filesystem or
volume manager

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Re: ZFS direct IO

2007-01-05 Thread Jonathan Edwards


On Jan 5, 2007, at 11:10, Anton B. Rang wrote:

DIRECT IO is a set of performance optimisations to circumvent  
shortcomings of a given filesystem.


Direct I/O as generally understood (i.e. not UFS-specific) is an  
optimization which allows data to be transferred directly between  
user data buffers and disk, without a memory-to-memory copy.


This isn't related to a particular file system.



true .. directio(3) is generally used in the context of *any* given  
filesystem to advise it that an application buffer to system buffer  
copy may get in the way or add additional overhead (particularly if  
the filesystem buffer is doing additional copies.)  You can also look  
at it as a way of reducing more layers of indirection particularly if  
I want the application overhead to be higher than the subsystem  
overhead.  Programmatically .. less is more.
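
for a concrete illustration of the advise vs enforce distinction on an
existing filesystem - the same hint can come from the application via
directio(3C), or be forced for a whole UFS filesystem at mount time
(device and mountpoint below are just examples):

# mount -F ufs -o forcedirectio /dev/dsk/c0t1d0s6 /data
# mount -F ufs -o remount,noforcedirectio /data

zfs has no equivalent switch today, which is really the point of this
thread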

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re[2]: ZFS in a SAN environment

2006-12-20 Thread Jonathan Edwards


On Dec 20, 2006, at 00:37, Anton B. Rang wrote:

INFORMATION: If a member of this striped zpool becomes  
unavailable or
develops corruption, Solaris will kernel panic and reboot to  
protect your data.


OK, I'm puzzled.

Am I the only one on this list who believes that a kernel panic,  
instead of EIO, represents a bug?


I agree as well - did you file a bug on this yet?

Inducing kernel panics (like we also do on certain sun cluster  
failure types) to prevent corruption can often lead to more  
corruption elsewhere, and usually ripples to throw admins, managers,  
and users in a panic as well - typically resulting in  more corrupted  
opinions and perceptions of reliability and usability.  :)


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto

2006-12-20 Thread Jonathan Edwards


On Dec 20, 2006, at 04:41, Darren J Moffat wrote:


Bill Sommerfeld wrote:

There also may be a reason to do this when confidentiality isn't
required: as a sparse provisioning hack..
If you were to build a zfs pool out of compressed zvols backed by
another pool, then it would be very convenient if you could run in a
mode where freed blocks were overwritten by zeros when they were  
freed,
because this would permit the underlying compressed zvol to free  
*its*

blocks.


A very interesting observation.  Particularly given that I have  
just created such a configuration - with iSCSI in the middle.


over ipsec?  wow - how many layers is that before you start talking
to the real (non-pseudo) block storage device?


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in a SAN environment

2006-12-19 Thread Jonathan Edwards


On Dec 18, 2006, at 17:52, Richard Elling wrote:

In general, the closer to the user you can make policy decisions,  
the better
decisions you can make.  The fact that we've had 10 years of RAID  
arrays
acting like dumb block devices doesn't mean that will continue for  
the next
10 years :-)  In the interim, we will see more and more  
intelligence move

closer to the user.


I thought this is what the T10 OSD spec was set up to address.  We've  
already

got device manufacturers beginning to design and code to the spec.

---
.je

(ps .. actually it's closer to 20+ years of RAID and dumb block  
devices ..)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in a SAN environment

2006-12-19 Thread Jonathan Edwards

On Dec 19, 2006, at 07:17, Roch - PAE wrote:



Shouldn't there be a big warning when configuring a pool
with no redundancy and/or should that not require a -f flag ?


why?  what if the redundancy is below the pool .. should we
warn that ZFS isn't directly involved in redundancy decisions?

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto

2006-12-19 Thread Jonathan Edwards


On Dec 19, 2006, at 08:59, Darren J Moffat wrote:


Darren Reed wrote:

If/when ZFS supports this then it would be nice to also be able
to have Solaris bleach swap on ZFS when it shuts down or reboots.
Although it may be that this option needs to be put into how we
manage swap space and not specifically zomething for ZFS.
Doing this to swap space has been a kernel option on another very
widely spread operating system for at least 2 major OS releases...


Which ones ?  I know that MacOS X and OpenBSD both support  
encrypted swap which for swap IMO is a better way to solve this  
problem.


You can get that today with OpenSolaris by using the stuff in the  
loficc project.   You will also get encrypted swap when we have ZFS  
crypto and you swap on a ZVOL that is encrypted.


Note though that that isn't quite the same way as OpenBSD solves  
the encrypted swap problem, and I'm not familiar with the technical  
details of what Apple did in MacOS X.


there's an encryption option in the dynamic_pager to write out  
encrypted paging files (/var/vm/swapfile*) .. it gets turned on with  
an environment variable that gets set at boot (what happens when you  
choose secure virtual memory.)  Before this was implemented there was  
a workaround using an encrypted dmg that held the swap files .. but  
that was an incomplete solution.


Bleaching is a time consuming task, not something I'd want to do at  
system boot/halt.


particularly if we choose to do a 35 pass Gutmann algorithm .. :)

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto

2006-12-19 Thread Jonathan Edwards


On Dec 18, 2006, at 11:54, Darren J Moffat wrote:


[EMAIL PROTECTED] wrote:
Rather than bleaching which doesn't always remove all stains, why  
can't
we use a word like erasing (which is hitherto unused for  
filesystem use

in Solaris, AFAIK)


and this method doesn't remove all stains from the disk anyway it  
just reduces them so they can't be easily seen ;-)


and if you add the right amount of ammonia is should remove  
everything .. (ahh - fun with trichloramine)


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in a SAN environment

2006-12-19 Thread Jonathan Edwards


On Dec 19, 2006, at 10:15, Torrey McMahon wrote:


Darren J Moffat wrote:

Jonathan Edwards wrote:

On Dec 19, 2006, at 07:17, Roch - PAE wrote:



Shouldn't there be a big warning when configuring a pool
with no redundancy and/or should that not require a -f flag ?


why?  what if the redundancy is below the pool .. should we
warn that ZFS isn't directly involved in redundancy decisions?


Yes because if ZFS doesn't know about it then ZFS can't use it to  
do corrections when the checksums (which always work) detect  
problems.





We do not have the intelligent end-to-end management to make these  
judgments. Trying to make one layer of the stack {stronger,  
smarter, faster, bigger,} while ignoring the others doesn't help.  
Trying to make educated guesses as to what the user intends doesn't  
help either.


Hi! It looks like you're writing a block
 Would you like help?
- Get help writing the block
- Just write the block without help
- (Don't show me this tip again)

somehow I think we all know on some level that letting a system  
attempt to guess your intent will get pretty annoying after a while ..

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in a SAN environment

2006-12-18 Thread Jonathan Edwards


On Dec 18, 2006, at 16:13, Torrey McMahon wrote:


Al Hopper wrote:

On Sun, 17 Dec 2006, Ricardo Correia wrote:



On Friday 15 December 2006 20:02, Dave Burleson wrote:


Does anyone have a document that describes ZFS in a pure
SAN environment?  What will and will not work?

 From some of the information I have been gathering
it doesn't appear that ZFS was intended to operate
in a SAN environment.


This might answer your question:
http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid



The section entitled Does ZFS work with SAN-attached devices?  
does not

make it clear the (some would say) dire effects of not having pool
redundancy.  I think that FAQ should clearly spell out the  
downside; i.e.,

where ZFS will say (Sorry Charlie) pool is corrupt.

A FAQ should always emphasize the real-world downsides to poor  
decisions

made by the reader.   Not delivering bad news does the reader a
dis-service IMHO.



I'd say that it's clearly described in the FAQ.  If you push to  
hard people will infer that SANs are broken if you use ZFS on top  
of them or vice versa. The only bit that looks a little  
questionable to my eyes is ...


   Overall, ZFS functions as designed with SAN-attached devices,  
but if

   you expose simpler devices to ZFS, you can better leverage all
   available features.

What are simpler devices?  (I could take a guess ... )


stone tablets in a room full of monkeys with chisels?

The bottom line is ZFS wants to ultimately function as the controller
cache and eventually eliminate the blind data algorithms that the
arrays incorporate .. the problem is that we can't really say that
explicitly since we sell - and much of the enterprise operates with -
enterprise class arrays and integrated data cache.  The trick is in
balancing who does what since you've really got duplicate
Virtualization, RAID, and caching options open to you.

.je


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Vanity ZVOL paths?

2006-12-09 Thread Jonathan Edwards


On Dec 8, 2006, at 05:20, Jignesh K. Shah wrote:



Hello ZFS Experts

I have two ZFS pools zpool1 and zpool2

I am trying to create a bunch of zvols such that their paths are
similar except for a consistent number scheme without reference to the
zpools they actually belong to. (This will allow me to have common
references in my setup scripts)




If I create
zfs create -V 100g zpool1/tablespace1
zfs create -V 100g zpool2/tablespace2
zfs create -V 100g zpool1/tablespace3
zfs create -V 100g zpool2/tablespace4

Then I get
/dev/zvol/rdsk/zpool1/tablespace1
/dev/zvol/rdsk/zpool1/tablespace2
/dev/zvol/rdsk/zpool1/tablespace3
/dev/zvol/rdsk/zpool2/tablespace4

As you  notice I have two series zpool and tablespace.. I am trying  
to eliminate 1 series. So I tried



zfs create zpool1/dbdata1
zfs create zpool2/dbdata2
zfs create zpool1/dbdata3
zfs create zpool2/dbdata4

And changed their mount point as follows
zfs set mountpoint=/tablespace1 zpool1/dbdata1
zfs set mountpoint=/tablespace2 zpool1/dbdata2
zfs set mountpoint=/tablespace3 zpool2/dbdata3
zfs set mountpoint=/tablespace4 zpool2/dbdata4


And then created a common zvol name for all pools:

zfs create -V 100g zpool1/dbdata1/data
zfs create -V 100g zpool2/dbdata2/data
zfs create -V 100g zpool1/dbdata3/data
zfs create -V 100g zpool2/dbdata4/data

I was expecting I will get
/dev/zvol/rdsk/tablespace1/data
/dev/zvol/rdsk/tablespace2/data
/dev/zvol/rdsk/tablespace3/data
/dev/zvol/rdsk/tablespace4/data

Instead I got

/dev/zvol/rdsk/zpool1/dbdata1/data
/dev/zvol/rdsk/zpool2/dbdata2/data
/dev/zvol/rdsk/zpool1/dbdata3/data
/dev/zvol/rdsk/zpool2/dbdata4/data


Any idea how do I get my abstracted zvol paths like I can do with  
my mountpoints in regular ZFS.


setting the mountpoint isn't going to affect the volume name .. for
vanity zvol paths you'll have to use symlinks .. try:

mkdir /dev/zvol/rdsk/tablespace1 /dev/zvol/dsk/tablespace1
ln -s /dev/zvol/rdsk/zpool1/dbdata1/data /dev/zvol/rdsk/tablespace1/data
ln -s /dev/zvol/dsk/zpool1/dbdata1/data /dev/zvol/dsk/tablespace1/data
... etc ...

or better yet, simply link to the underlying /devices entry and you  
don't even have to keep it in the /dev/zvol tree since everything in  
the /dev tree is a symlink anyhow ..
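
eg (the actual minor node name under /devices will differ on your box):

# ls -l /dev/zvol/rdsk/zpool1/dbdata1/data     (shows the /devices node
                                                it points at)
# ln -s <that /devices path> /dev/zvol/rdsk/tablespace1/data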


.je

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: system wont boot after zfs

2006-11-30 Thread Jonathan Edwards

Dave

which BIOS manufacturers and revisions?   that seems to be more of  
the problem
as choices are typically limited across vendors .. and I take it  
you're running 6/06 u2


Jonathan

On Nov 30, 2006, at 12:46, David Elefante wrote:


Just as background:

I attempted this process on the following:

1.  Jetway amd socket 734 (vintage 2005)
2.  Asus amd socket 939 (vintage 2005)
3. Gigabyte amd socket am2 (vintage 2006)

All with the same problem.  I disabled the onboard nvidia nforce  
410/430
raid bios in the bios in all cases.  Now whether it actually does  
not look
for a signature, I do not know. I'm attempting to make this box  
into an
iSCSI target for my ESX environments.  I can put W3K and SanMelody  
on there,

but it is not as interesting and I am attempting to help the Solaris
community.

I am simply making the business case that over three major vendors  
boards

and the absolute latest (gigabyte), the effect was the same.

As a workaround I can make slice 0 1 cyl and slice 1 1-x, and the  
zpool on
the rest of the disk and be fine with that.  So on a PC with zpool  
create
there should be a warning for pc users that most likely if they use  
the
entire disk, the resultant EFI label is likely to cause lack of  
bootability.


I attempted to hotplug the sata drives after booting, and Nevada 51  
came up
with scratch space errors and did not recognize the drive.  In any  
case I'm

not hotplugging my drives every time.

The given fact is that PC vendors are not readily adopting EFI bios  
at this
time, the millions of PC's out there are vulnerable to this.  And  
if x86
Solaris is to be really viable, this community needs to be  
addressed.  Now I
was at Sun 1/4 of my entire life and I know the politics, but the  
PC area is
different.  If you tell the customer to go to the mobo vendor to  
fix the
bios, they will have to find some guy in a bunker in Taiwan.  Not  
likely.

Now I'm at VMware actively working on consolidating companies into x86
platforms.  The simple fact that the holy war between AMD and Intel  
has
created processors that a cheap enough and fast enough to cause  
disruption
in the enterprise space.  My new dual core AMD processor is  
incredibly fast

and the entire box cost me $500 to assemble.

The latest Solaris 10 documentation (thx Richard) has use the  
entire disk
all over it.  I don't see any warning in here about EFI labels, in  
fact

these statements discourage putting ZFS in a slice.:


ZFS applies an EFI label when you create a storage pool with whole  
disks.
Disks can be labeled with a traditional Solaris VTOC label when you  
create a

storage pool with a disk slice.

Slices should only be used under the following conditions:

*

  The device name is nonstandard.
*

  A single disk is shared between ZFS and another file system,  
such as

UFS.
*

  A disk is used as a swap or a dump device.

Disks can be specified by using either the full path, such as
/dev/dsk/c1t0d0, or a shorthand name that consists of the device  
name within
the /dev/dsk directory, such as c1t0d0. For example, the following  
are valid

disk names:

*

  c1t0d0
*

  /dev/dsk/c1t0d0
*

  c0t0d6s2
*

  /dev/foo/disk

ZFS works best when given whole physical disks. Although constructing
logical devices using a volume manager, such as Solaris Volume Manager
(SVM), Veritas Volume Manager (VxVM), or a hardware volume manager  
(LUNs or
hardware RAID) is possible, these configurations are not  
recommended. While
ZFS functions properly on such devices, less-than-optimal  
performance might

be the result.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On  
Behalf Of

[EMAIL PROTECTED]
Sent: Wednesday, November 29, 2006 1:24 PM
To: Jonathan Edwards
Cc: David Elefante; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Re: system wont boot after zfs



I suspect a lack of an MBR could cause some BIOS implementations to
barf ..


Why?

Zeroed disks don't have that issue either.

What appears to be happening is more that raid controllers attempt
to interpret the data in the EFI label as the proprietary
hardware raid labels.  At least, it seems to be a problem
with internal RAIDs only.

In my experience, removing the disks from the boot sequence was
not enough; you need to disable the disks in the BIOS.

The SCSI disks with EFI labels in the same system caused no
issues at all; but the disks connected to the on-board RAID
did have issues.

So what you need to do is:

- remove the controllers from the probe sequence
- disable the disks

Casper


Re: [zfs-discuss] Re: ZFS ACLs and Samba

2006-10-25 Thread Jonathan Edwards


On Oct 25, 2006, at 15:38, Roger Ripley wrote:

IBM has contributed code for NFSv4 ACLs under AIX's JFS; hopefully  
Sun will not tarry in following their lead for ZFS.


http://lists.samba.org/archive/samba-cvs/2006-September/070855.html


I thought this was still in draft:
http://ietf.org/internet-drafts/draft-ietf-nfsv4-acl-mapping-05.txt

.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirrored Raidz

2006-10-24 Thread Jonathan Edwards


On Oct 24, 2006, at 04:19, Roch wrote:



Michel Kintz writes:

Matthew Ahrens wrote:


Richard Elling - PAE wrote:


Anthony Miller wrote:


Hi,

I've search the forums and not found any answer to the following.

I have 2 JBOD arrays each with 4 disks.

I want to create create a raidz on one array and have it  
mirrored to

the other array.



Today, the top level raid sets are assembled using dynamic  
striping.

There
is no option to assemble the sets with mirroring.  Perhaps the ZFS
team can
enlighten us on their intentions in this area?



Our thinking is that if you want more redundancy than RAID-Z, you
should use RAID-Z with double parity, which provides more  
reliability

and more usable storage than a mirror of RAID-Zs would.

(Also, expressing mirror of RAID-Zs from the CLI would be a bit
messy; you'd have to introduce parentheses in vdev descriptions or
something.)


It is not always a matter of more redundancy.
In my customer's case, they have storage in 2 different rooms of  
their
datacenter and want to mirror from one storage unit in one room to  
the

other.
So having in this case a combination of RAID-Z + Mirror makes  
sense in

my mind   or ?

Michel.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


you may let the storage export RAID-5 luns and let ZFS
mirror those. Would that work ?

-r


they're JBOD arrays, so unless you're proposing the use of another
volume manager i don't think that would work.  as for the maximum
redundancy in configurations, i think that Frank hit it with the  
mirroring

of each drive component across the arrays and doing a simple stripe

I just think it would be good to add the flexibility in zpool to:
1) raidz a set of mirrors
2) mirror a couple of raidz
in certain environments you care more about multiple drive or array
failures than anything else.

Today you can do this with zvols, but I'm a little worried about how  
this

would perform given the nested layering you have to introduce .. eg:
# zpool create a1pool raidz c0t0d0 c0t1d0 c0t2d0 ..
# zpool create a2pool raidz c1t0d0 c1t1d0 c1t2d0 ..
# zfs create -V size a1pool/vol
# zfs create -V size a2pool/vol
# zpool create mzdata mirror /dev/zvol/dsk/a1pool/vol /dev/zvol/dsk/ 
a2pool/vol


.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Mirrored Raidz

2006-10-24 Thread Jonathan Edwards

there's 2 approaches:

1) RAID 1+Z where you mirror the individual drives across trays and  
then RAID-Z the whole thing

2) RAID Z+1 where you RAIDZ each tray and then mirror them

I would argue that you can lose the most drives in configuration 1  
and stay alive:


With a simple mirrored stripe you lose if you lose 1 drives in each  
tray.

With configuration 2 this takes it 2 drives in each tray.
With configuration 1 you have to lose both sides of a 2 mirrored sets  
to fail.


so it's not a space or performance model .. simply an availability  
model with failing disk
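
to make it concrete with 2 trays of 4 drives (c2 and c3 are made up
controller numbers):

# zpool create tank mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0 \
    mirror c2t2d0 c3t2d0 mirror c2t3d0 c3t3d0

that's the plain mirrored stripe across trays .. configurations 1 and 2
can't be expressed directly in a single zpool today, so you'd have to
go through the zvol nesting trick from my earlier post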


Jonathan

On Oct 24, 2006, at 12:46, Richard Elling - PAE wrote:


Pedantic question, what would this gain us other than better data
retention?
Space and (especially?) performance would be worse with RAID-Z+1
than 2-way mirrors.
 -- richard

Frank Cusack wrote:
On October 24, 2006 9:19:07 AM -0700 Anton B. Rang  
[EMAIL PROTECTED] wrote:
Our thinking is that if you want more redundancy than RAID-Z,  
you should
use RAID-Z with double parity, which provides more reliability  
and more

usable storage than a mirror of RAID-Zs would.


This is only true if the drives have either independent or identical
failure modes, I think.  Consider two boxes, each containing ten  
drives.
Creating RAID-Z within each box protects against single-drive  
failures.

Mirroring the boxes together protects against single-box failures.

But mirroring also protects against single-drive failures.
-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: [osol-discuss] Cloning a disk w/ ZFS in it

2006-10-22 Thread Jonathan Edwards
you don't really need to do the prtvtoc and fmthard with the old Sun
labels if you start at cylinder 0 since you're doing a bit-for-bit
copy with dd .. but, keep in mind:

- The Sun VTOC is the first 512B and s2 *typically* should start at
cylinder 0 (unless it's been redefined .. check!)
- The EFI label though, reserves the first 17KB (34 blocks) and for a
dd to work, you need to either:
1) dd without the slice (eg: dd if=/dev/rdsk/c0t0d0 of=/dev/rdsk/c1t0d0 bs=128k)
or
2) prtvtoc / fmthard (eg: prtvtoc /dev/rdsk/c0t0d0s0 > /tmp/vtoc.out ; fmthard -s /tmp/vtoc.out /dev/rdsk/c1t0d0s0)


.je

On Oct 22, 2006, at 12:45, Krzys wrote:

yeah disks need to be identical but why do you need to do prtvtoc  
and fmthard to duplicate the disk label (before the dd), I thought  
that dd would take care of all of that... whenever I used dd I used  
it on slice 2 and I never had to do prtvtoc and fmthard... Just
make sure disks are identical and that is the key.


Regards,

Chris

On Fri, 20 Oct 2006, Richard Elling - PAE wrote:


minor adjustments below...

Darren J Moffat wrote:

Asif Iqbal wrote:

Hi
I have a X2100 with two 74G disks. I build the OS on the first disk
with slice0 root 10G ufs, slice1 2.5G swap, slice6 25MB ufs and  
slice7

62G zfs. What is the fastest way to clone it to the second disk. I
have to build 10 of those in 2 days. Once I build the disks I slam
them to the other X2100s and ship it out.

if clone really means make completely identical then do this:
boot of cd or network.
dd if=/dev/dsk/sourcedisk  of=/dev/dsk/destdisk
Where sourcedisk and destdisk are both localally attached.


I use prtvtoc and fmthard to duplicate the disk label (before the dd)
Note: the actual disk geometry may change between vendors or disk
firmware revs.  You will first need to verify that the geometries are
similar, especially the total number of blocks.

For dd, I'd use a larger block size than the default.  Something  
like:

dd bs=1024k if=/dev/dsk/sourcedisk  of=/dev/dsk/destdisk

The copy should go at media speed, approximately 50-70 MBytes/s for
the X2100 disks.
-- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A versioning FS

2006-10-09 Thread Jonathan Edwards


On Oct 8, 2006, at 23:54, Nicolas Williams wrote:


On Sun, Oct 08, 2006 at 11:16:21PM -0400, Jonathan Edwards wrote:

On Oct 8, 2006, at 22:46, Nicolas Williams wrote:

You're arguing for treating FV as extended/named attributes :)


kind of - but one of the problems with EAs is the increase/bloat in
the inode/dnode structures and corresponding incompatibilities with
other applications or tools.


This in a thread where folks [understandably] claim that storage is
cheap and abundant.  And I agree that it is.

Plus, I think you may be jumping to conclusions about the bloat of
extended attributes:


  Another approach might be to put it all
into the block storage rather than trying to stuff it into the
metadata on top.  If we look at the zfs on-disk structure instead and
simply extend the existing block pointer mappings to handle the diffs
along with a header block to handle the version numbers - this might
be an easier way out rather than trying to redefine or extend the
dnode structure.   Of course you'd still need a single attribute to
flag reading the version block header and corresponding diff blocks,
but this could go anywhere - even a magic acl perhaps .. i would
argue that the overall goal should be aimed toward the reduction of
complexity in the metadata nodes rather than attempting to extend
them and increase the seek/parse time.


Wait a minute -- the extended attribute idea is about *interfaces*,  
not
internal implementation.  I certainly did not argue that a file  
version

should be copied into an EA.


true, but I just find that the EA discussion is just as loaded as the FV
discussion that too often focuses on improvements in the metadata
space rather than the block data space.  I'm not talking about the file
version data .. rather the bplist for the file version data and possibly
causing this to live in the block data space instead of the dnode
DMU.  This way the FV will be completely accessible within the
filesystem block data structure instead of being abstracted back out
of the dnode DMU.  I would hold that the version data space
consumption should also be readily apparent on the filesystem level
and that versioned access should not impede the regular file
lookup or attribute caching.  It's a slight deviation from the typical
EA approach, but an important distinction to make to keep the
metadata structures relatively lean.

Let's keep interface and implementation details separate.  Most of  
this

thread has been about interfaces precisely because that's what users
will interact with; users won't care one bit about how it's all
implemented under the hood.


I'm not so sure you can separate the two without creating a hack.  I
would also argue that users (particularly the ones creating the
interfaces) will care about the implementation details since those
are the real underlying issues they'll be wrestling with.

.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A versioning FS

2006-10-08 Thread Jonathan Edwards


On Oct 8, 2006, at 21:40, Wee Yeh Tan wrote:


On 10/7/06, Ben Gollmer [EMAIL PROTECTED] wrote:

On Oct 6, 2006, at 6:15 PM, Nicolas Williams wrote:
 What I'm saying is that I'd like to be able to keep multiple
 versions of
 my files without echo * or ls showing them to me by default.

Hmm, what about file.txt - ._file.txt.1, ._file.txt.2, etc? If you
don't like the _ you could use @ or some other character.


You missed Nicolas's point.

It does not matter which delimiter you use.  I still want my for i in
*; do ... to work as per now.

We want to differentiate files that are created intentionally from
those that are just versions.  If files starts showing up on their
own, a lot of my scripts will break.  Still, an FV-aware
shell/program/API can accept an environment setting that may quiesce
the version output. E.g. export show-version=off/on.



if we're talking implementation - i think it would make more sense to
store the block version differences in the base dnode itself rather than
creating new dnode structures to handle the different versions.  You'd
then structure different tools or flags to handle the versions (copy  
them
to a new file/dnode, etc) - standard or existing tools don't need to  
know

about the underlying versions.

.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: A versioning FS

2006-10-06 Thread Jonathan Edwards
On Oct 6, 2006, at 23:42, Anton B. Rang wrote:

I don't agree that version control systems solve the same problem as file
versioning. I don't want to check *every change* that I make into version
control -- it makes the history unwieldy. At the same time, if I make a
change that turns out to work really poorly, I'd like to revert to the
previous code -- not necessarily the code which is checked in. (I suspect
there may be some versioning systems which allow intermediate versions to
be deleted, and I just haven't used them, but this still seems complex
compared to only checking in known-good code.)

The use cases are somewhat different here.  I would venture to say that a
*personal* file versioning system needs to be thought of differently from
a *group* co-ordination formal version control system.  Of course there is
a fair amount of overlap in both use cases, particularly when you consider
a global namespace and concurrent access problems as you can see in the
cedar or plan9 systems (fossil/venti):

http://portal.acm.org/citation.cfm?doid=42392.42398
http://cm.bell-labs.com/plan9/

And if we were also to consider dynamic linking and versioning for
deprecated functions, there's another whole level of parallel backwards
compatibility interface problems that become much easier to approach.

While this is an FV discussion, I do believe that we need some sort of
clearer distinction between FV, VC, DR, CDP, and snapshotting structured
around the usability cases and close/sync vs a forced version mark/branch
.. there's too much confusion in this space, often with conflicting goals
misapplied to solve similar problems.

.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] Re: Recommendation ZFS on StorEdge 3320

2006-09-05 Thread Jonathan Edwards
On Sep 5, 2006, at 06:45, Robert Milkowski wrote:

Hello Wee,

Tuesday, September 5, 2006, 10:58:32 AM, you wrote:

WYT On 9/5/06, Torrey McMahon [EMAIL PROTECTED] wrote:
    This is simply not true. ZFS would protect against the same type of
    errors seen on an individual drive as it would on a pool made of HW
    raid LUN(s). It might be overkill to layer ZFS on top of a LUN that is
    already protected in some way by the devices internal RAID code but it
    does not "make your data susceptible to HW errors caused by the storage
    subsystem's RAID algorithm, and slow down the I/O".

WYT Roch's recommendation to leave at least 1 layer of redundancy to ZFS
WYT allows the extension of ZFS's own redundancy features for some truly
WYT remarkable data reliability.

WYT Perhaps, the question should be how one could mix them to get the best
WYT of both worlds instead of going to either extreme.

Depends on your data but sometimes it could be useful to create HW RAID
and then do just striping on the ZFS side between at least two LUNs.  That
way you do not get data protection but fs/pool protection with ditto
blocks.  Of course each LUN is HW RAID made of different physical disks.

i remember working up a chart on this list about 2 months ago:

Here's 10 options I can think of to summarize combinations of zfs with hw
redundancy:

#   ZFS     ARRAY HW        CAPACITY    COMMENTS
--  ---     --------        --------    --------
1   R0      R1              N/2         hw mirror - no zfs healing (XXX)
2   R0      R5              N-1         hw R5 - no zfs healing (XXX)
3   R1      2 x R0          N/2         flexible, redundant, good perf
4   R1      2 x R5          (N/2)-1     flexible, more redundant, decent perf
5   R1      1 x R5          (N-1)/2     parity and mirror on same drives (XXX)
6   RZ      R0              N-1         standard RAIDZ - no array RAID (XXX)
7   RZ      R1 (tray)       (N/2)-1     RAIDZ+1
8   RZ      R1 (drives)     (N/2)-1     RAID1+Z (highest redundancy)
9   RZ      2 x R5          N-3         triple parity calculations (XXX)
10  RZ      1 x R5          N-2         double parity calculations (XXX)

If you've invested in a RAID controller on an array, you might as well
take advantage of it, otherwise you could probably get an old D1000
chassis somewhere and just run RAIDZ on JBOD.  If you're more concerned
about redundancy than space, with the SUN/STK 3000 series dual controller
arrays I would either create at least 2 x RAID5 luns balanced across
controllers and zfs mirror, or create at least 4 x RAID1 luns balanced
across controllers and use RAIDZ.  RAID0 isn't going to make that much
sense since you've got a 128KB txg commit on zfs which isn't going to be
enough to do a full stripe in most cases.

.je
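As a rough sketch, option 4 above (a zfs mirror across two HW RAID5 LUNs,
one per controller) would be set up something like this -- the device
names are made up for illustration:

    # c2t0d0 and c3t0d0 stand in for the two R5 LUNs exported by the array
    zpool create tank mirror c2t0d0 c3t0d0
    zpool status tank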
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 3510 JBOD ZFS vs 3510 HW RAID

2006-08-02 Thread Jonathan Edwards


On Aug 1, 2006, at 22:23, Luke Lonergan wrote:


Torrey,

On 8/1/06 10:30 AM, Torrey McMahon [EMAIL PROTECTED] wrote:


http://www.sun.com/storagetek/disk_systems/workgroup/3510/index.xml

Look at the specs page.


I did.

This is 8 trays, each with 14 disks and two active Fibre channel
attachments.

That means that 14 disks, each with a platter rate of 80MB/s will  
be driven
over a 400MB/s pair of Fibre Channel connections, a slowdown of  
almost 3 to

1.

This is probably the most expensive, least efficient way to get disk
bandwidth available to customers.

WRT the discussion about blow the doors, etc., how about we see some
bonnie++ numbers to back it up.



actually .. there's SPC-2 vdbench numbers out at:
http://www.storageperformance.org/results

see the full disclosure report here:
http://www.storageperformance.org/results/b5_Sun_SPC2_full- 
disclosure_r1.pdf


of course that's a 36GB 15K FC system with 2 expansion trays, 4 HBAs
and 3 yrs maintenance in the quote that was spec'd at $72K list (or
$56/GB) .. (i'll use list numbers for comparison since they're the
easiest)


if you've got a copy of the vdbench tool you might want to try the  
profiles in the appendix on a thumper - I believe the bonnie/bonnie++  
numbers tend to skew more on single threaded low blocksize memory  
transfer issues.


now to bring the thread full circle to the original question of price/ 
performance and increasing the scope to include the X4500 .. for  
single attached low cost systems, thumper is *very* compelling  
particularly when you factor in the density .. for example using list  
prices from http://store.sun.com/


X4500 (thumper) w/ 48 x 250GB SATA drives = $32995 = $2.68/GB
X4500 (thumper) w/ 48 x 500GB SATA drives = $69995 = $2.84/GB
SE3511 (dual controller) w/ 12 x 500GB SATA drives = $36995 = $6.17/GB
SE3510 (dual controller) w/ 12 x 300GB FC drives = $48995 = $13.61/GB

So a 250GB SATA drive configured thumper (server attached with 16GB  
of cache .. err .. RAM) is 5x less in cost/GB than a 300GB FC drive  
configured 3510 (dual controllers w/ 2 x 1GB typically mirrored  
cache) and a 500GB SATA drive configured thumper (server attached) is  
2.3x less in cost/GB than a 500GB SATA drive configured 3511 (again  
dual controllers w/ 2 x 1GB typically mirrored cache)


For a single attached system - you're right - 400MB/s is your  
effective throttle (controller speeds actually) on the 3510 and your  
realistic throughput on the 3511 is probably going to be less than  
1/2 that number if we factor in the back pressure we'll get on the  
cache against the back loop  .. your bonnie ++ block transfer numbers  
on a 36 drive thumper were showing about 424MB/s on 100% write and  
about 1435MB/s on 100% read .. it'd be good to see the vdbench  
numbers as well (but i've have a hard time getting my hands on one  
since most appear to be out at customer sites)


Now with thumper - you are SPoF'd on the motherboard and operating  
system - so you're not really getting the availability aspect from  
dual controllers .. but given the value - you could easily buy 2 and  
still come out ahead .. you'd have to work out some sort of timely  
replication of transactions between the 2 units and deal with failure  
cases with something like a cluster framework.  Then for multi- 
initiator cross system access - we're back to either some sort of NFS  
or CIFS layer or we could always explore target mode drivers and  
virtualization .. so once again - there could be a compelling  
argument coming in that arena as well.  Now, if you already have a  
big shared FC infrastructure - throwing dense servers in the middle  
of it all may not make the most sense yet - but on the flip side, we  
could be seeing a shrinking market for single attach low cost arrays.


Lastly (for this discussion anyhow) there's the reliability and  
quality issues with SATA vs FC drives (bearings, platter materials,  
tolerances, head skew, etc) .. couple that with the fact that dense  
systems aren't so great when they fail .. so I guess we're right back  
to choosing the right systems for the right purposes (ZFS does some  
great things around failure detection and workaround) .. but i think  
we've beat that point to death ..


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Best Practices for StorEdge 3510 Array and ZFS

2006-08-02 Thread Jonathan Edwards


On Aug 2, 2006, at 17:03, prasad wrote:


Torrey McMahon [EMAIL PROTECTED] wrote:

Are any other hosts using the array? Do you plan on carving LUNs  
out of

the RAID5 LD and assigning them to other hosts?


There are no other hosts using the array. We need all the available  
space (2.45TB) on just one host. One option was to create 2 LUN's  
and use raidz.


raidz on RAID5 isn't very efficient and you'll want at least 3 LUNs
to do it .. you're calculating double parity and tying up too much of
your drive bandwidth.


if you're going to some variation of RAID5 the best throughput you'll  
see is to *either* pick the HW RAID characteristics *or* ZFS raidz ..  
but not both .. if you want a *lot* of redundancy you could create a  
bunch of RAID10 volumes and then do a raidz on the zpool - but you're  
really going to lose a lot of capacity that way.


What you really want to do is make efficient use of the array cache  
*and* the copy on write zfs cache so you're doing mostly memory to  
memory transfers.  so that leaves us with 2 options (each with slight  
variations)


option 1 - raidz:
I would use all the disks in the 3510 to make either 4 x 3 disk or 6  
x 2 disk R0 volumes and balance them across the controllers (assuming  
you have 2) .. then create your raidz zpool out of all the disks ..  
the disadvantage (or advantage depending on how you look at it) here  
is that you're not using the parity engine in the 3510 and you can't  
really hot spare  from the array.. the advantage though is the  
software based error correction you'll be able to do.
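A minimal sketch of that raidz layout, assuming six 2-disk R0 LUNs
balanced across the two controllers (the device names are hypothetical):

    # three LUNs from each controller go into a single raidz vdev
    zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c3t0d0 c3t1d0 c3t2d0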


option 2 - RAID5
either use the volume you already have or make 2 R5 volumes if you  
have 2 controllers to balance the LUNs .. it won't matter if they're  
the same size or not, and you should only really need 1 global hot  
spare .. then create a standard zpool with these .. the disadvantage  
is that you won't get the lovely raidz features .. but the possible  
advantage is that you've offloaded the parity calculation and  
workload from the host
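And a sketch of the RAID5 option, striping the zpool across two R5 LUNs
(again, hypothetical device names):

    # one R5 LUN per controller; zfs dynamically stripes across them
    # and redundancy is handled entirely by the array
    zpool create tank c2t0d0 c3t0d0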


Keep in mind that zfs was originally designed with JBOD in mind ..  
there's still ongoing discussions on how hw RAID fits into the  
picture with the new and lovely sw raidz and whether or not socks  
will be worn when testing one vs the other ..


---
.je

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vs. Apple XRaid

2006-08-01 Thread Jonathan Edwards


On Aug 1, 2006, at 03:43, [EMAIL PROTECTED] wrote:




So what does this exercise leave me thinking? Is Linux 2.4.x really
screwed up in NFS-land? This Solaris NFS replaces a Linux-based NFS
server that the clients (linux and IRIX) liked just fine.



Yes; the Linux NFS server and client work together just fine but  
generally

only because the Linux NFS server replies that writes are done before
they are committed to disk (async operation).

The Linux NFS client is not optimized for servers which do not do this
and it appears to write little before waiting for the commit replies.


Well .. linux clients with linux servers tend to be slightly better  
behaved since
the server essentially fudges on the commit and the async cluster  
count is
generally higher (it won't switch on every operation like Solaris  
will by

default)

Additionally there's a VM issue in the page-writeback code that seems to
affect write performance and RPC socket performance when there's a high
dirty page count.  Essentially as pages are flushed there's a higher  
number

of NFS commit operations which will tend to slow down the Solaris NFS
server (and probably the txgs or zil as well with the increase in  
synchronous
behaviour.)  On the linux 2.6 VM - the number of commits has been  
seen to
rise dramatically when the dirty page count is between 40-90% of the  
overall

system memory .. by tuning the dirtypage_ratio back down to 10% there's
typically less time spent in page-writeback and the overall async  
throughput
should rise .. this wasn't really addressed until 2.6.15 or 2.6.16 so  
you might
also get better results on a later kernel.  Watching performance  
between a
linux client and a linux server - the linux server seems to buffer  
the NFS commit
operations .. of course the clients will also buffer as much as they  
can - so you

can end up with some unbelievable performance numbers both on the
filesystem layers (before you do a sync) and on the NFS client layers  
as well

(until you unmount/remount.)


Overall, I find that the Linux VM suffers from many of the same sorts  
of large
memory performance problems that Solaris used to face before priority  
paging
in 2.6 and subsequent page coloring schemes.  Based on my  
unscientific mac
powerbook performance observations - i suspect that there could be  
similar
issues with various iterations of the BSD or Darwin kernels - but I  
haven't taken

the initiative to really study any of this.

So to wrap up:

When doing linux client / solaris server NFS .. I'll typically tune
the client for 32KB async tcp transfers (you have to dig into the
kernel source to increase this and it's not really worth it), tune the
VM to reduce time spent in the kludgy page-writeback (typically a
sysctl setting for the dirty page ratio or some such), and then
increase the nfs:nfs3_async_clusters and nfs:nfs4_async_clusters to
something higher than 1 .. say 32 x 32KB transfers to get you to
1MB .. you can also increase the number of threads and the read ahead
on the server to eke out some more performance
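As a rough sketch of those knobs (the values are illustrative, not
recommendations):

    # linux client (2.6): the dirty page ratio sysctl mentioned above --
    # pull it down so writeback starts earlier
    sysctl -w vm.dirty_ratio=10

    # solaris side, /etc/system: the async cluster tunables mentioned
    # above, i.e. 32 x 32KB transfers per cluster
    set nfs:nfs3_async_clusters = 32
    set nfs:nfs4_async_clusters = 32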

I'd also look at tuning the volblocksize and recordsize as well as  
the stripe width
on your array to 32K or reasonable multiples .. but I'm not sure how  
much of the
issue is in misaligned I/O blocksizes between the various elements vs  
mandatory

pauses or improper behaviour incurred from miscommunication ..
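On the zfs side those settings would look something like this (the pool
and dataset names are just placeholders):

    # recordsize applies to new writes on an existing filesystem
    zfs set recordsize=32K tank/export
    # volblocksize is fixed at zvol creation time
    zfs create -V 100g -o volblocksize=32K tank/vol1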

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 3510 JBOD ZFS vs 3510 HW RAID

2006-08-01 Thread Jonathan Edwards


On Aug 1, 2006, at 14:18, Torrey McMahon wrote:


(I hate when I hit the Send button when trying to change windows)

Eric Schrock wrote:

On Tue, Aug 01, 2006 at 01:31:22PM -0400, Torrey McMahon wrote:

The correct comparison is done when all the factors are taken
into account. Making blanket statements like "ZFS & JBODs are
always ideal" or "ZFS on top of a raid controller is a bad idea"
or "SATA drives are good enough" without taking into account the
amount of data, access patterns, numbers of hosts, price,
performance, data retention policies, audit requirements ... is
where I take issue.





Then how are blanket statements like:

That said a 3510 with a raid controller is going to blow the
door, drive brackets, and skin off a JBOD in raw performance.

Not offensive as well?





Who said anything about offensive? I just said I take issue with such
statements in the general sense of trying to compare boxes to boxes
or when making blanket statements such as "X always works better on
Y".


The specific question was around a 3510 JBOD having better
performance than a 3510 with a raid controller. That's where I said
the raid controller performance was going to be better.


just to be clear .. we're talking about a 3510 JBOD with ZFS (i guess  
you could run pass through on the controller or just fail the  
batteries on the cache) vs a 3510 with the raid controller turned  
on .. I'd tend to agree with Torrey, mainly since well designed RAID  
controllers will generally do a better job with their own back-end on  
aligning I/O for efficient full-stripe commits .. without battery  
backed memory on the host, CoW is still going to need synchronous I/O  
somewhere for guaranteed writes - and there's a fraction of your gain.


Don't get me wrong .. CoW is key for a lot of the cool features and  
amazing functionality in ZFS and I like it .. it's just not generally  
considered a high performance I/O technique for many cases when we're  
talking about committing bits to spinning rust.  And while it may be  
great for asynchronous behaviour, unless we want to reveal some  
amazing discovery that reverses years of I/O development - it seems  
to me that when we fall to synchronous behaviour the invalidation of  
the filesystem's page cache will always play a factor in the overall  
reduction of throughput.  OK .. I can see that we can eliminate the  
read/modify/write penalty and write hole problem at the storage  
layer .. but so does battery backed array cache with the real  
limiting factor ultimately being the latency between the cache  
through the back-end loops to the spinning disk.  (I would argue that  
low cache latency and under-saturated drive channels matter more than  
the sheer amount of coherent cache)


Speaking in high generalities, the problem almost always works its  
way down to how well an array solution balances properly aligned I/O  
with the response time between cache across the back-end loops to the  
spindles and any inherent latency there or in between.  OK .. I can  
see that ZFS is a nice arbitrator and is working its way into some  
of the drive mechanics, but there is still some reliance on the  
driver stack for determining the proper transport saturation and back- 
off.  And great - we're making more inroads with transaction groups  
and an intent log that's wonderful .. and we've done a lot of cool  
things along the way .. maybe by the time we're done we can move the  
code to a minimized Solaris build on dedicated hardware .. and build  
an array solution (with a built in filesystem) .. that's big .. and  
round .. and rolls fast .. and then we can call it .. (thump thump  
thump) .. the zwheel :)


---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs vs. vxfs

2006-07-31 Thread Jonathan Edwards


On Jul 30, 2006, at 23:44, Malahat Qureshi wrote:
Is any one have a comparison between zfs vs. vxfs, I'm working on a  
presentation for my management on this ---


That can be a tough question to answer depending on what you're  
looking for .. you could take the feature comparison approach like  
you'll find on wikipedia and i think has already been mentioned here:

http://en.wikipedia.org/wiki/File_system_comparison

agreed it's only a small subset, and generally feature comparisons  
get heavily used in marketing campaigns for some sort of mudslinging  
or feature bashing.  Of course there's always something that doesn't  
really get addressed when you take a spreadsheet or bullet point  
approach.   Or you could take the microbenchmark approach with  
something like Richard's filebench project:

http://opensolaris.org/os/community/performance/filebench/

IMO the latter is more of a step in the right direction but the  
problem sets may be very different depending on your applications -  
it can be a tough decision to determine which numbers matter the most  
when you have to make tradeoffs .. your best approach is typically to  
try and decide some form of CTQs for your applications or  
organizations that take into account the relevant factors  
(administration, volume management, storage platforms, performance,  
recovery, operating systems, etc) and match up features and  
performance considerations concurrently.


I think you'll find that ZFS is an amazing fit for most applications,  
but in cases where you may think you need directio or non-buffered  
sorts of behaviour .. you could be at a slight disadvantage.  Of  
course Sun also offer QFS as another high performance alternative ..  
but like the old mantra we've all heard too many times now ..  
(everyone together) .. It all depends on what you're trying to do ..


---
.je
(* disappears back into the mist *)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS questions (hybrid HDs)

2006-07-28 Thread Jonathan Edwards


On Jun 21, 2006, at 11:05, Anton B. Rang wrote:



My guess from reading between the lines of the Samsung/Microsoft  
press release is that there is a mechanism for the operating system  
to pin particular blocks into the cache (e.g. to speed boot) and  
the rest of the cache is used for write buffering. (Using it as a  
read cache doesn't buy much compared to using the normal drive  
cache RAM for that, and might also contribute to wear, which is why  
read caching appears to be under OS control rather than automatic.)


Actually, Microsoft has been posting a bit about this for the  
upcoming Vista release .. WinHEC '06 had a few interesting papers and  
it looks like Microsoft is going to be introducing SuperFetch,  
ReadyBoost, and ReadyDrive .. mentioned here:


http://www.microsoft.com/whdc/system/sysperf/accelerator.mspx

The ReadyDrive paper seems to outline their strategy on the industry  
Hybrid Drive push and the recent t13.org adoption of the ATA-ACS8  
command set:


http://www.microsoft.com/whdc/device/storage/hybrid.mspx

It also looks like they're aiming at some sort of driver level  
PriorityIO scheme which should play nicely into lower level tiered  
hardware in an attempt for more intelligent read/write caching:


http://www.microsoft.com/whdc/driver/priorityio.mspx

---
.je


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Storage

2006-06-28 Thread Jonathan Edwards
On Jun 28, 2006, at 12:32, Erik Trimble wrote:The main reason I don't see ZFS mirror / HW RAID5 as useful is this:  ZFS mirror/ RAID5:      capacity =  (N / 2) -1                                     speed   N / 2 -1                                     minimum # disks to lose before loss of data:  4                                     maximum # disks to lose before loss of data:  (N / 2) + 2shouldn't that be capacity = ((N -1) / 2) ?loss of a single disk would cause a rebuild on the R5 stripe which could affect performance on that side of the mirror.  Generally speaking good RAID controllers will dedicate processors and channels to calculate the parity and write it out so you're not impacted from the host access PoV.  There is a similar sort of CoW behaviour that can happen between the array cache and the drives, but in the ideal case you're dealing with this in dedicated hw instead of shared hw.  ZFS mirror / HW Stripe   capacity =  (N / 2)                                     speed =  N / 2                                     minimum # disks to lose before loss of data:  2                                     maximum # disks to lose before loss of data:  (N / 2) + 1  Given a reasonable number of hot-spares, I simply can't see the (very) marginal increase in safety give by using HW RAID5 as out balancing the considerable speed hit using RAID5 takes.  I think you're comparing this to software R5 or at least badly implemented array code and divining that there is a considerable speed hit when using R5.  In practice this is not always the case provided that the response time and interaction between the array cache and drives is sufficient for the incoming stream.  By moving your operation to software you're now introducing more layers between the CPU, L1/L2 cache, memory bus, and system bus before you get to the interconnect and further latencies on the storage port and underlying device (virtualized or not.)  Ideally it would be nice to see ZFS style improvements in array firmware, but given the state of embedded Solaris and the predominance of 32bit controllers - I think we're going to have some issues.  We'd also need to have some sort of client mechanism to interact with the array if we're talking about moving the filesystem layer out there .. just a thoughtJon E
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: disk write cache, redux

2006-06-15 Thread Jonathan Edwards


On Jun 15, 2006, at 06:23, Roch Bourbonnais - Performance Engineering  
wrote:



Naively I'd think a write_cache  should not help throughput
test since the cache should fill  up after which you should still be
throttled by the physical drain rate. You clearly show that
it helps; Anyone knows why/how a cache helps throughput ?


7200 RPM disks are typically IOP bound - so the write cache (which
can be up to 16MB on some drives) should be able to buffer enough
IO to deliver more efficiently on each IOP and also reduce head seek.
Not sure which vendors implement write through when the cache fills,
or how detailed the drive cache algos on SATA can go ..

Take a look at PSARC 2004/652:
http://www.opensolaris.org/os/community/arc/caselog/2004/652/
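For reference, checking or toggling a drive's write cache by hand on
Solaris usually goes through format's expert mode; the exact menu entries
can vary by drive type:

    # run format -e, select the disk, then:
    format> cache
    cache> write_cache
    write_cache> display
    write_cache> enable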

.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss