Re: [zfs-discuss] This is the scrub that never ends...
On Sep 9, 2009, at 9:29 PM, Bill Sommerfeld wrote:
> On Wed, 2009-09-09 at 21:30 +, Will Murnane wrote:
>> Some hours later, here I am again:
>>   scrub: scrub in progress for 18h24m, 100.00% done, 0h0m to go
>> Any suggestions?
> Let it run for another day. A pool on a build server I manage takes about 75-100 hours to scrub, but typically starts reporting "100.00% done, 0h0m to go" at about the 50-60 hour point. I suspect the combination of frequent time-based snapshots and a pretty active set of users causes the progress estimate to be off.

out of curiosity - do you have a lot of small files in the filesystem? 'zdb -s pool' might be interesting to observe too

--- .je

(oh, and thanks for the subject line .. now i've had this song stuck in my head for a couple of days :P)
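[if you want a rough answer to the small-file question, a quick census along these lines works - the path is hypothetical, and note that Solaris find takes -size in 512-byte blocks, so -32 means "under 16KB":

    # count files under ~16KB, then the total file count, for comparison
    find /export/home -type f -size -32 | wc -l
    find /export/home -type f | wc -l
]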
Re: [zfs-discuss] Books on File Systems and File System Programming
On Aug 14, 2009, at 11:14 AM, Peter Schow wrote:
> On Thu, Aug 13, 2009 at 05:02:46PM -0600, Louis-Frédéric Feuillette wrote:
>> I saw this question on another mailing list, and I too would like to know. And I have a couple questions of my own.
>> == Paraphrased from other list ==
>> Does anyone have any recommendations for books on File Systems and/or File Systems Programming?
>> == end ==
> Going back ten years, but still a good tutorial: "Practical File System Design with the Be File System" by Dominic Giampaolo
> http://www.nobius.org/~dbg/practical-file-system-design.pdf

I think he's still at apple now working on spotlight .. his fs-kit is good study too: http://www.nobius.org/~dbg/fs-kit-0.4.tgz

for understanding the vnode/vfs interface - you might want to take a look at:
- Solaris Internals (2nd edition) - chapter 14
- Zadok's FiST paper: http://www.fsl.cs.sunysb.edu/docs/zadok-thesis-proposal/

UFS:
- Solaris Internals (2nd edition) - chapter 15

HFS+:
- Amit Singh's Mac OS X Internals - chapter 11 (see http://osxbook.com/)

then opensolaris src of course for:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/
http://opensolaris.org/os/community/zfs/source/
http://opensolaris.org/os/project/samqfs/sourcecode/
http://opensolaris.org/os/project/ext3/
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:
> This brings me to the absurd conclusion that the system must be rebooted immediately prior to each use.

see Phil's later email .. an export/import of the pool or a remount of the filesystem should clear the page cache - with mmap'd files you're essentially keeping them both in the page cache and in the ARC .. then invalidations in the page cache are going to have effects on dirty data in the cache

/etc/system tunables are currently:

    set zfs:zfs_arc_max = 0x28000
    set zfs:zfs_write_limit_override = 0xea60
    set zfs:zfs_vdev_max_pending = 5

if you're on x86 - i'd also increase maxphys to 128K .. we still have a 56KB default value in there which is still a bad thing (IMO)

--- .je
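[for reference, a minimal /etc/system sketch of that maxphys bump - 128K = 0x20000 is my reading of the intended value, and it takes effect at the next reboot:

    * raise the maximum physical I/O size from the 56KB x86 default to 128KB
    set maxphys=0x20000
]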
Re: [zfs-discuss] cannot mount '/tank/home': directory is not empty
i've seen a problem where periodically a 'zfs mount -a' and sometimes a 'zpool import pool' can create what appears to be a race condition on nested mounts .. that is .. let's say that i have:

    FS          mountpoint
    pool        /export
    pool/fs1    /export/home
    pool/fs2    /export/home/bob
    pool/fs3    /export/home/bob/stuff

if pool is imported (or a mount -a is done) and somehow pool/fs3 mounts first - then it will create /export/home and /export/home/bob, and pool/fs1 and pool/fs2 will fail to mount .. this seems to be happening on more recent builds, but not predictably - so i'm still trying to track down what's going on

On Jun 10, 2009, at 1:01 PM, Richard Elling wrote:
> Something is bothering me about this thread. It seems to me that if the system provides an error message such as
>   cannot mount '/tank/home': directory is not empty
> then the first plan of action should be to look and see what is there, no? The issue of overlaying mounts has existed for about 30 years and invariably one discovers that events which lead to different data in overlapping directories are the result of some sort of procedural issue. Perhaps once again, ZFS is a singing canary?
> -- richard
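[one manual way out when the race bites, as a sketch using the example names above - unmount the child that grabbed the path, clear the stray directories it left, then mount parents before children:

    zfs unmount pool/fs3
    rmdir /export/home/bob/stuff /export/home/bob /export/home
    zfs mount pool/fs1
    zfs mount pool/fs2
    zfs mount pool/fs3
]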
Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.
On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote:
> Jim Dunham wrote:
>> ZFS the filesystem is always on-disk consistent, and ZFS does maintain filesystem consistency through coordination between the ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for SNDR, ZFS caches a lot of an application's filesystem data in the ZIL; the data is therefore in memory, not written to disk, so SNDR does not know this data exists. ZIL flushes to disk can be seconds behind the actual application writes completing, and if SNDR is running asynchronously, these replicated writes to the SNDR secondary can be additional seconds behind the actual application writes. Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no 'supported' way to get ZFS to empty the ZIL to disk on demand.
> I'm wondering if you really meant ZIL here, or ARC? In either case, creating a snapshot should get both flushed to disk, I think? (If you don't actually need a snapshot, simply destroy it immediately afterwards.)

not sure if there's another way to trigger a full flush or lockfs, but to make sure you do have all transactions that may not have been flushed from the ARC you could just unmount the filesystem or export the zpool .. with the latter you then wouldn't have to worry about the -f on the import

--- .je
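[a sketch of the snapshot trick Andrew describes - pool and dataset names hypothetical; taking the snapshot forces the pending transaction group to disk, and the snapshot itself can be thrown away immediately:

    zfs snapshot tank/data@flush
    zfs destroy tank/data@flush
]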
Re: [zfs-discuss] replace same sized disk fails with too small error
not quite .. it's 16KB at the front and 8MB at the back of the disk (16384 sectors) for the Solaris EFI - so you need to zero out both of these .. of course since these drives are 1TB i find it's easier to format to SMI (vtoc) .. with format -e (choose SMI, label, save, validate - then choose EFI)

but to Casper's point - you might want to make sure that fdisk is using the whole disk .. you should probably reinitialize the fdisk sectors either with the fdisk command or run fdisk from format (delete the partition, create a new partition using 100% of the disk, blah, blah) ..

finally - glancing at the format output - there appears to be a mix of labels on these disks as you've got a mix of c#d# entries and c#t#d# entries so i might suspect fdisk might not be consistent across the various disks here .. also noticed that you dumped the vtoc for c3d0 and c4d0, but you're replacing c2d1 (of unknown size/layout) with c1d1 (never dumped in your emails) .. so while this has been an animated (slightly trollish) discussion on right-sizing (odd - i've typically only seen that term as an ONTAPism) with some short-stroking digs .. it's a little unclear what the c1d1s0 slice looks like here or what the cylinder count is - i agree it should be the same - but it would be nice to see from my armchair here

On Jan 22, 2009, at 3:32 AM, Dale Sears wrote:
> Would this work? (to get rid of an EFI label).
>   dd if=/dev/zero of=/dev/dsk/thedisk bs=1024k count=1
> Then use format. format might complain that the disk is not labeled. You can then label the disk.
> Dale
>
> Antonius wrote:
>> can you recommend a walk-through for this process, or a bit more of a description? I'm not quite sure how I'd use that utility to repair the EFI label
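[spelling out the "zero both of these" step as a sketch - WARNING: this destroys the label. the 16KB front is 32 sectors; the backup EFI label lives in the last 8MB (16384 sectors). SECTORS is the whole-disk size in 512B sectors (get it from format or prtvtoc); the value below is hypothetical:

    SECTORS=1953525168
    dd if=/dev/zero of=/dev/rdsk/c1d1p0 bs=512 count=32
    dd if=/dev/zero of=/dev/rdsk/c1d1p0 bs=512 seek=`expr $SECTORS - 16384` count=16384
]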
Re: [zfs-discuss] Largest (in number of files) ZFS instance tested
On Jul 11, 2008, at 4:59 PM, Bob Friesenhahn wrote:
>> Has anyone tested a ZFS file system with at least 100 million+ files? What were the performance characteristics?
> I think that there are more issues with file fragmentation over a long period of time than the sheer number of files.

actually it's a similar problem .. with a maximum blocksize of 128KB and the COW nature of the filesystem you get indirect block pointers pretty quickly on a large ZFS filesystem as the size of your tree grows .. in this case a large constantly modified file (eg: /u01/data/*.dbf) is going to behave over time like a lot of random access to files spread across the filesystem .. the only real difference is that you won't walk it every time someone does a getdirent() or an lstat64()

so ultimately the question could be framed as: what's the maximum manageable tree size you can get to with ZFS, keeping in mind that there's no real re-layout tool (by design) .. the number i'm working with until i hear otherwise is probably about 20M, but in the relativistic sense - it *really* does depend on how balanced your tree is and what your churn rate is .. we know on QFS we can go up to 100M, but i trust the tree layout a little better there, can separate the metadata out if i need to and have planned on it, and know that we've got some tools to relayout the metadata or dump/restore for a tape backed archive

jonathan

(oh and btw - i believe this question is a query for field data .. architect != crash test dummy .. but some days it does feel like it)
Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?
On Apr 9, 2008, at 11:46 AM, Bob Friesenhahn wrote:
> On Wed, 9 Apr 2008, Ross wrote:
>> Well the first problem is that USB cables are directional, and you don't have the port you need on any standard motherboard. That
> Thanks for that info. I did not know that. Adding iSCSI support to ZFS is relatively easy since Solaris already supported TCP/IP and iSCSI. Adding USB support is much more difficult and isn't likely to happen since afaik the hardware to do it just doesn't exist. I don't believe that Firewire is directional but presumably the Firewire support in Solaris only expects to support certain types of devices. My workstation has Firewire but most systems won't have it. It seemed really cool to be able to put your laptop next to your Solaris workstation and just plug it in via USB or Firewire so it can be used as a removable storage device. Or Solaris could be used on appropriate hardware to create a more reliable portable storage device. Apparently this is not to be and it will be necessary to deal with iSCSI instead. I have never used iSCSI so I don't know how difficult it is to use as temporary removable storage under Windows or OS-X.

i'm not so sure what you're really after, but i'm guessing one of two things:

1) a global filesystem? if so - ZFS will never be globally accessible from 2 hosts at the same time without an interposer layer such as NFS or Lustre .. zvols could be exported to multiple hosts via iSCSI or FC-target but that's only 1/2 the story ..

2) an easy way to export volumes? agree - there should be some sort of semantics that would signal a filesystem is removable and trap on USB events when the media is unplugged .. of course you'll have problems with uncommitted transactions that would have to roll back on the next plug, or somehow be query-able

iSCSI will get you block/character device level sharing from a zvol (pseudo device) or the equivalent of a blob filestore .. you'd have to format it with a filesystem, but that filesystem could be a global one (eg: QFS) and you could multi-host natively that way.

--- .je
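[as a concrete sketch of the zvol-over-iSCSI route, as the feature worked in that era - pool and volume names hypothetical:

    # create a 10GB zvol and export it as an iSCSI target
    zfs create -V 10g tank/vol1
    zfs set shareiscsi=on tank/vol1
    iscsitadm list target    # confirm the target was created
]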
Re: [zfs-discuss] ZFS I/O algorithms
On Mar 20, 2008, at 11:07 AM, Bob Friesenhahn wrote:
> On Thu, 20 Mar 2008, Mario Goebbels wrote:
>>> Similarly, read block size does not make a significant difference to the sequential read speed.
>> Last time I did a simple bench using dd, supplying the record size as blocksize to it instead of no blocksize parameter bumped the mirror pool speed from 90MB/s to 130MB/s.
> Indeed. However, as an interesting twist to things, in my own benchmark runs I see two behaviors. When the file size is smaller than the amount of RAM the ARC can reasonably grow to, the write block size does make a clear difference. When the file size is larger than RAM, the write block size no longer makes much difference and sometimes larger block sizes actually go slower.

in that case .. try fixing the ARC size .. the dynamic resizing on the ARC can be less than optimal IMHO

--- .je
Re: [zfs-discuss] ZFS I/O algorithms
On Mar 20, 2008, at 2:00 PM, Bob Friesenhahn wrote:
> On Thu, 20 Mar 2008, Jonathan Edwards wrote:
>> in that case .. try fixing the ARC size .. the dynamic resizing on the ARC can be less than optimal IMHO
> Is a 16GB ARC size not considered to be enough? ;-) I was only describing the behavior that I observed. It seems to me that when large files are written very quickly, that when the file becomes bigger than the ARC, what is contained in the ARC is mostly stale and does not help much any more. If the file is smaller than the ARC, then there is likely to be more useful caching.

sure i got that - it's not the size of the arc in this case since caching is going to be a lost cause .. but explicitly setting a zfs_arc_max should result in fewer calls to arc_shrink() when you hit memory pressure as the application's page buffer competes with the arc

in other words, as soon as the arc is 50% full of dirty pages (8GB) it'll start evicting pages .. you can't avoid that .. but what you can avoid is the additional weight of constantly growing and shrinking the cache as it tries to keep up with your constantly changing blocks in a large file

--- .je
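[a minimal /etc/system sketch of pinning the ARC as described - the 16GB value just mirrors the size mentioned above, and it takes effect at the next reboot:

    * cap (and effectively fix) the ARC at 16GB to avoid grow/shrink churn
    set zfs:zfs_arc_max = 0x400000000
]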
Re: [zfs-discuss] zfs backups to tape
On Mar 14, 2008, at 3:28 PM, Bill Shannon wrote:
> What's the best way to backup a zfs filesystem to tape, where the size of the filesystem is larger than what can fit on a single tape? ufsdump handles this quite nicely. Is there a similar backup program for zfs? Or a general tape management program that can take data from a stream and split it across tapes reliably with appropriate headers to ease tape management and restore?

for now you could send snapshots to files in a file hierarchy on a SAM-QFS archive .. then you've got all the feature functionality there to be able to proactively back up the snapshots and possibly segment them if they're big enough (non-shared-qfs - might make sense if you've got multiple drives you want to take advantage of) .. I believe the goal is to provide this sort of functionality through a DMAPI HSM with ADM at some point in the near future: http://opensolaris.org/os/project/adm/

--- .je
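[a sketch of the "snapshots to files" step - dataset, snapshot, and archive path are all hypothetical; the stream file then gets archived to tape by SAM-QFS:

    zfs snapshot tank/data@20080314
    zfs send tank/data@20080314 > /sam1/zfsdumps/tank_data.20080314.zsend
]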
Re: [zfs-discuss] periodic ZFS disk accesses
On Mar 1, 2008, at 3:41 AM, Bill Shannon wrote:
> Running just plain iosnoop shows accesses to lots of files, but none on my zfs disk. Using iosnoop -d c1t1d0 or iosnoop -m /export/home/shannon shows nothing at all. I tried /usr/demo/dtrace/iosnoop.d too, still nothing.

hi Bill

this came up sometime last year .. io:::start won't work since ZFS doesn't call bdev_strategy() directly .. you'll want to use something more like zfs_read:entry, zfs_write:entry and zfs_putpage or zfs_getpage for mmap'd ZFS files

here's one i hacked from our discussion back then to track some timings on files:

# cat zfs_iotime.d
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* record entry time and file path for each ZFS read/write/page op */
zfs_write:entry, zfs_read:entry, zfs_putpage:entry, zfs_getpage:entry
{
        self->ts = timestamp;
        self->filepath = args[0]->v_path;
}

/* on return, print the elapsed time for the operation */
zfs_write:return, zfs_read:return, zfs_putpage:return, zfs_getpage:return
/self->ts && self->filepath/
{
        printf("%s on %s took %d nsecs\n", probefunc,
            stringof(self->filepath), timestamp - self->ts);
        self->ts = 0;
        self->filepath = 0;
}

--- .je
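[to use it - make the script executable and leave it running while you exercise the filesystem; each ZFS read/write/page operation prints with its latency:

    # chmod +x zfs_iotime.d
    # ./zfs_iotime.d
]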
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Feb 27, 2008, at 8:36 AM, Uwe Dippel wrote:
> As much as ZFS is revolutionary, it is far away from being the 'ultimate file system', if it doesn't know how to handle event-driven snapshots (I don't like the word), backups, versioning. As long as a high-level system utility needs to be invoked by a scheduler for these features (CDP), and - this is relevant - *ZFS does not support these functionalities essentially different from FAT or UFS*, the days of ZFS are numbered. Sooner or later, and I bet it is sooner, someone will design a file system (hardware, software, Cairo) to which the tasks of retiring files, as well as creating versions of modified files, can be passed down, together with the file handles.

meh .. don't believe all the marketing hype you hear - it's good at what it's good at, and is a constant WIP for many of the other features that people would like to see .. but the one ring to rule them all - not quite yet ..

as for the CDP issue - i believe the event driving would really have to happen below ZFS at the vnode or znode layer .. keep in mind that with the ZPL we're still dealing with 30+ year old structures and methods (which is fine btw) in the VFS/vnode layers .. a couple of areas i would look at (that i haven't seen mentioned in this discussion) might be:

- fop_vnevent .. or the equivalent (if we have one yet) for a znode/filesystem
- door interface for event handling
- auditing

if you look at what some of the other vendors (eg: apple/timemachine) are doing - it's essentially a tally of file change events that get dumped into a database and rolled up at some point .. if you plan on taking more immediate action on the file changes then i believe that you'll run into latency (race) issues for synchronous semantics

anyhow - just a thought from another who is constantly learning (being corrected, learning some more, more correction, etc ..)

--- .je
Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Dec 29, 2007, at 2:33 AM, Jonathan Loran wrote:
> Hey, here's an idea: We snapshot the file as it exists at the time of the mv in the old file system until all referring file handles are closed, then destroy the single file snap. I know, not easy to implement, but that is the correct behavior, I believe. All this said, I would love to have this feature introduced. Moving large file stores between zfs file systems would be so handy! From my own sloppiness, I've suffered dearly from the lack of it.

since in the current implementation a mv between filesystems would have to assign new st_ino values (fsids in NFS should also be different), all you should need to do is assign new block pointers in the new side of the filesystem .. that would be handy for cp as well

--- .je
Re: [zfs-discuss] Yager on ZFS
On Dec 5, 2007, at 17:50, can you guess? wrote:
>> my personal-professional data are important (this is my valuation, and it's an assumption you can't dispute).
> Nor was I attempting to: I was trying to get you to evaluate ZFS's incremental risk reduction *quantitatively* (and if you actually did so you'd likely be surprised at how little difference it makes - at least if you're at all rational about assessing it).

ok .. i'll bite since there's no ignore feature on the list yet:

what are you terming as ZFS' incremental risk reduction? .. (seems like a leading statement toward a particular assumption) .. are you just trying to say that without multiple copies of data in multiple physical locations you're not really accomplishing a more complete risk reduction?

yes i have read this thread, as well as many of your other posts around usenet and such .. in general i find your tone to be somewhat demeaning (slightly rude too - but - eh, who's counting? i'm not one to judge) - now, you do know that we are currently in an era of collaboration instead of deconstruction, right? .. so i'd love to see the improvements on the many shortcomings you're pointing to and passionate about written up, proposed, and freely implemented :)

--- .je
Re: [zfs-discuss] Yager on ZFS
apologies in advance for prolonging this thread .. i had considered taking this completely offline, but thought of a few people at least who might find this discussion somewhat interesting .. at the least i haven't seen any mention of Merkle trees yet as the nerd in me yearns for

On Dec 5, 2007, at 19:42, bill todd - aka can you guess? wrote:
>> what are you terming as ZFS' incremental risk reduction? .. (seems like a leading statement toward a particular assumption)
> Primarily its checksumming features, since other open source solutions support simple disk scrubbing (which given its ability to catch most deteriorating disk sectors before they become unreadable probably has a greater effect on reliability than checksums in any environment where the hardware hasn't been slapped together so sloppily that connections are flaky).

ah .. okay - at first reading "incremental risk reduction" seems to imply an incomplete approach to risk .. putting various creators' and marketing organizations' pride issues aside for a moment - ZFS is not a complete risk reduction, nor should it be billed as such. However i do believe that an interesting use of the merkle tree with a sha256 hash is somewhat of an improvement over conventional volume based data scrubbing techniques since there can be a unique integration between the hash tree for the filesystem block layout and a hierarchical data validation method. In addition to finding unknown problem areas with the scrub, you're also doing relatively inexpensive data validation checks on every read.

> Aside from the problems that scrubbing handles (and you need scrubbing even if you have checksums, because scrubbing is what helps you *avoid* data loss rather than just discover it after it's too late to do anything about it), and aside from problems deriving from sloppy assembly (which tend to become obvious fairly quickly, though it's certainly possible for some to be more subtle), checksums primarily catch things like bugs in storage firmware and otherwise undetected disk read errors (which occur orders of magnitude less frequently than uncorrectable read errors).

sure - we've seen many transport errors, as well as firmware implementation errors .. in fact with many arrays we've seen data corruption issues with the scrub (particularly if the checksum is singly stored along with the data block) - just like spam you really want to eliminate false positives that could indicate corruption where there isn't any. if you take some time to read the on-disk format for ZFS you'll see that there's a tradeoff that's done in favor of storing more checksums in many different areas instead of making more room for direct block pointers.

> Robert Milkowski cited some sobering evidence that mid-range arrays may have non-negligible firmware problems that ZFS could often catch, but a) those are hardly 'consumer' products (to address that sub-thread, which I think is what applies in Stefano's case) and b) ZFS's claimed attraction for higher-end (corporate) use is its ability to *eliminate* the need for such products (hence its ability to catch their bugs would not apply - though I can understand why people who needed to use them anyway might like to have ZFS's integrity checks along for the ride, especially when using less-than-fully-mature firmware).

actually on this list we've seen a number of consumer level products including sata controllers, and raid cards (which are also becoming more commonplace in the consumer realm) that can be confirmed to throw data errors.
Code maturity issues aside, there aren't very many array vendors that are open-sourcing their array firmware - and if you consider zfs as a feature-set that could function as a multi-purpose storage array (systems are cheap) - i find it refreshing that everything that's being done under the covers is really out in the open.

> And otherwise undetected disk errors occur with negligible frequency compared with software errors that can silently trash your data in ZFS cache or in application buffers (especially in PC environments: enterprise software at least tends to be more stable and more carefully controlled - not to mention their typical use of ECC RAM). So depending upon ZFS's checksums to protect your data in most PC environments is sort of like leaving on a vacation and locking and bolting the back door of your house while leaving the front door wide open: yes, a burglar is less likely to enter by the back door, but thinking that the extra bolt there made you much safer is likely foolish.

granted - it's not an all-in-one solution, but by combining the merkle tree approach with the sha256 checksum along with periodic data scrubbing - it's a darn good approach .. particularly since it also tends to cost a lot less than what you might have to pay elsewhere for something you
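[to make the hash-tree idea concrete, here's a toy two-level sketch in shell - purely illustrative, not ZFS's on-disk layout: hash each 128KB "block" of a file, then hash the list of block hashes to get a single root that changes if any bit below it flips:

    # split a file into 128KB blocks and hash each one (Solaris digest(1))
    split -b 131072 bigfile /tmp/blk.
    for b in /tmp/blk.*; do digest -a sha256 $b; done > /tmp/leaf.hashes
    # the "root" hash covers all the leaf hashes
    digest -a sha256 /tmp/leaf.hashes
]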
Re: [zfs-discuss] Yager on ZFS
On Dec 6, 2007, at 00:03, Anton B. Rang wrote:
>> what are you terming as ZFS' incremental risk reduction?
> I'm not Bill, but I'll try to explain. Compare a system using ZFS to one using another file system -- say, UFS, XFS, or ext3. Consider which situations may lead to data loss in each case, and the probability of each such situation. The difference between those two sets is the 'incremental risk reduction' provided by ZFS.

ah .. thanks Anton - so the next step would be to calculate the probability of occurrence, the impact to operation, and the return to service for each anticipated risk in a given environment in order to determine the size of the increment that constitutes the risk reduction that ZFS is providing. Without this there's just a lot of hot air blowing around in here ..

[snip excellent summary of risks]

perhaps we should also consider the availability and transparency of the code to potentially mitigate future problems .. that's currently where i'm starting to see tremendous value in open and free raid controller solutions to help drive down the cost of implementation for this sort of data protection instead of paying through the nose for closed hardware-based solutions (which are still a great margin in licensing for dedicated storage vendors)

--- .je
Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover
On Nov 10, 2007, at 23:16, Carson Gaspar wrote:
> Mattias Pantzare wrote:
>> As the fsid is created when the file system is created it will be the same when you mount it on a different NFS server. Why change it? Or are you trying to match two different file systems? Then you also have to match all inode-numbers on your files. That is not possible at all.
> It is, if you do block replication between the servers (drbd on Linux, or the Sun product whose name I'm blanking on at the moment).

AVS (or Availability Suite) .. http://www.opensolaris.org/os/project/avs/

Jim Dunham does a nice demo here for block replication on zfs (see sidebar)

> What isn't clear is if zfs send/recv retains inode numbers... if it doesn't that's a really sad thing, as we won't be able to use ZFS to replace NetApp snapmirrors.

zfs send/recv comes out of the DSL which i believe will generate a unique fsid_guid .. for mirroring you'd really want to use AVS.

btw - you can also look at the Cluster SUNWnfs agent in the ohac community: http://opensolaris.org/os/community/ha-clusters/ohac/downloads/

hth

--- .je
Re: [zfs-discuss] Count objects/inodes
Hey Bill:

what's an object here? or do we have a mapping between objects and block pointers? for example a zdb -bb might show:

    th37 # zdb -bb rz-7

    Traversing all blocks to verify nothing leaked ...

            No leaks (block sum matches space maps exactly)

            bp count:              47
            bp logical:        518656    avg:  11035
            bp physical:        64512    avg:   1372    compression:  8.04
            bp allocated:      249856    avg:   5316    compression:  2.08
            SPA allocated:     249856   used:  0.00%

but do we maintain any sort of mapping between the object instantiation and how many block pointers an object or file might consume on disk?

--- .je

On Nov 9, 2007, at 15:18, Bill Moore wrote:
> You can just do something like this:
>
>   # zfs list tank/home/billm
>   NAME              USED  AVAIL  REFER  MOUNTPOINT
>   tank/home/billm  83.9G  5.56T  74.1G  /export/home/billm
>   # zdb tank/home/billm
>   Dataset tank/home/billm [ZPL], ID 83, cr_txg 541, 74.1G, 111066 objects
>
> Let me know if that causes any trouble.
> --Bill
>
> On Fri, Nov 09, 2007 at 12:14:07PM -0700, Jason J. W. Williams wrote:
>> Hi Guys,
>> Someone asked me how to count the number of inodes/objects in a ZFS filesystem and I wasn't exactly sure. zdb -dv filesystem seems like a likely candidate but I wanted to find out for sure. As to why you'd want to know this, I don't know their reasoning but I assume it has to do with the maximum number of files a ZFS filesystem can support (2^48 no?).
>> Thank you in advance for your help.
>> Best Regards,
>> Jason
Re: [zfs-discuss] df command in ZFS?
On Oct 18, 2007, at 11:57, Richard Elling wrote:
> David Runyon wrote:
>> I was presenting to a customer at the EBC yesterday, and one of the people at the meeting said using df in ZFS really drives him crazy (no, that's all the detail I have). Any ideas/suggestions?
> Filter it. This is UNIX after all...

err - no .. i can understand that when i put my old SA helmet on .. if you look at the avail capacity numbers below we've really got an overprovisioned number if you're not doing quotas - this kind of thing can drive you batty, particularly when you're used to looking at df to quickly see how much space you've got left on the system .. it's like asking how many seats are available on this plane, and being told the number of available seats on the whole airline

    [EMAIL PROTECTED] # df -h
    Filesystem                      size   used  avail capacity  Mounted on
    /dev/dsk/c5t0d0s0               454G    12G   437G     3%    /
    /devices                          0K     0K     0K     0%    /devices
    ctfs                              0K     0K     0K     0%    /system/contract
    proc                              0K     0K     0K     0%    /proc
    mnttab                            0K     0K     0K     0%    /etc/mnttab
    swap                            8.4G   876K   8.4G     1%    /etc/svc/volatile
    objfs                             0K     0K     0K     0%    /system/object
    /usr/lib/libc/libc_hwcap2.so.1  454G    12G   437G     3%    /lib/libc.so.1
    fd                                0K     0K     0K     0%    /dev/fd
    swap                            8.4G    40K   8.4G     1%    /tmp
    swap                            8.4G    24K   8.4G     1%    /var/run
    /dev/dsk/c5t0d0s5               3.9G   1.8G   2.1G    46%    /var/crash2
    log-pool                        457G   120M   447G     1%    /log-pool
    thumper-pool/n01_oraadmin1       16T   1.4G    13T     1%    /n01/oraadmin1
    thumper-pool/n01_oraarch1        16T   159M    13T     1%    /n01/oraarch1
    thumper-pool/n01_oradata1        16T    98G    13T     1%    /n01/oradata1
    thumper-pool/tst08a_ctl1         16T    17M    13T     1%    /s01/controlfile1
    thumper-pool/tst08a_ctl2         16T    17M    13T     1%    /s01/controlfile2
    thumper-pool/tst08a_ctl3         16T    17M    13T     1%    /s01/controlfile3
    thumper-pool/tst32a_data         16T   135G    13T     1%    /s01/oradata1/tst32
    thumper-pool                     16T   1.1T    13T     8%    /thumper-pool
    thumper-pool/home                16T    45K    13T     1%    /thumper-pool/home
    thumper-pool/home/db2inst1       16T   163G    13T     2%    /thumper-pool/home/db2inst1
    thumper-pool/home/kurt           16T   223K    13T     1%    /thumper-pool/home/kurt
    thumper-pool/home/mahadev        16T    40K    13T     1%    /thumper-pool/home/mahadev
    thumper-pool/mrd-data            16T    75G    13T     1%    /thumper-pool/mrd-data
    thumper-pool/software            16T   6.3G    13T     1%    /thumper-pool/software
    thumper-pool/u01                 16T   5.2G    13T     1%    /u01
    thumper-pool/tst08a_data         16T   761G    13T     6%    /s01/oradata1/tst08
    log-pool/swim                    50G    24K    50G     1%    /log-pool/swim
    log-pool/butterfinger           457G    24K   457G     1%    /log-pool/butterfinger
Re: [zfs-discuss] df command in ZFS?
On Oct 18, 2007, at 13:26, Richard Elling wrote:
> Yes. It is true that ZFS redefines the meaning of available space. But most people like compression, snapshots, clones, and the pooling concept. It may just be that you want zfs list instead, df is old-school :-)

exactly - i'm not complaining .. just understanding the confusion. I don't anticipate deprecating df in favor of zfs list, but df_zfs or additional flags to df might be helpful .. perhaps a pool option, and some sort of easy visual to say that the avail number you're looking at is shared .. perhaps something like this (sorted output would be nice too by default):

    # df -F zfs -xh
    Filesystem                size   used   resv   avail capacity  Mounted on
    ...
    log-pool                 (457G)  120M    ---  (447G)     1%    /log-pool
    log-pool/butterfinger    (457G)   24K    10G  (457G)     1%    /log-pool/butterfinger
    log-pool/swim             [50G]   24K    ---   [50G]     1%    /log-pool/swim
    thumper-pool              (16T)  1.1T    ---   (13T)     8%    /thumper-pool
    thumper-pool/home         (16T)   46K    ---   (13T)     1%    /thumper-pool/home

essentially just some way to tell at a glance that the capacity is either (shared) or a [quota]

> OTOH, df does have a notion of file system specific options. It might be useful to have a df_zfs option which would effectively show the zfs list-like data.

yeah - i'm thinking it might be helpful to see reserved capacity here by default, or at least have a switch for it instead of having to alias zfs list -o name,used,reservation,available,refer,mountpoint .. i'm always thrown at first glance by that one:

    NAME                    USED  RESERV  AVAIL  REFER  MOUNTPOINT
    log-pool               10.1G    none   447G   120M  /log-pool
    log-pool/butterfinger  24.5K     10G   457G  24.5K  /log-pool/butterfinger
    log-pool/swim          24.5K    none  50.0G  24.5K  /log-pool/swim
    thumper-pool           2.63T    none  12.9T  1.11T  /thumper-pool
    thumper-pool/home       163G    none  12.9T  45.7K  /thumper-pool/home

> BTW, airlines also overprovision seats, which is why you might sometimes get bumped. Hotels do this as well.

my point as well - meaning you're never sure if you're going to get a seat, especially if there's a rush .. sorry, looking back it's kind of a bad analogy

--- .je
Re: [zfs-discuss] Sun 6120 array again
SCSI based, but solid and cheap enclosures if you don't care about support: http://search.ebay.com/search/search.dll?satitle=Sun+D1000

On Oct 1, 2007, at 12:15, Andy Lubel wrote:
> I gave up. On the 6120 I just ended up not doing zfs. And for our 6130, since we don't have santricity or the sscs command to set it, I just decided to export each disk and create an array with zfs (and a RAMSAN zil), which made performance acceptable for us. I wish there was a firmware that just made these things dumb jbods!
> -Andy
>
> On 9/28/07 7:37 PM, Marion Hakanson [EMAIL PROTECTED] wrote:
>> Greetings,
>> Last April, in this discussion...
>>   http://www.opensolaris.org/jive/thread.jspa?messageID=143517
>> ...we never found out how (or if) the Sun 6120 (T4) array can be configured to ignore cache flush (sync-cache) requests from hosts. We're about to reconfigure a 6120 here for use with ZFS (S10U4), and the evil tuneable zfs_nocacheflush is not going to serve us well (there is a ZFS pool on slices of internal SAS drives, along with UFS boot/OS slices).
>> Any pointers would be appreciated. Thanks and regards,
>> Marion
Re: [zfs-discuss] ZFS array NVRAM cache?
On Sep 25, 2007, at 19:57, Bryan Cantrill wrote:
> On Tue, Sep 25, 2007 at 04:47:48PM -0700, Vincent Fox wrote:
>> It seems like ZIL is a separate issue.
> It is very much the issue: the separate log device work was done exactly to make better use of this kind of non-volatile memory. To use this, setup one LUN that has all of the NVRAM on the array dedicated to it, and then use that device as a separate log device. Works like a champ...

on the 3310/3510 you can't really do this, in the same way that you can't create a zfs filesystem or zvol and disable the ARC for it .. i mean we can dance around the issue and create a really big log device on a 3310/3510 and use JBOD for the data, but i don't think that's the point - the bottom line is that there's 2 competing cache strategies that aren't very complementary.

--- .je
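[for reference, a minimal sketch of the separate log device setup Bryan describes - the device name is hypothetical, and this needs a build with slog support:

    # dedicate the NVRAM-backed LUN as a separate intent log for the pool
    zpool add tank log c5t0d0
]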
Re: [zfs-discuss] ZFS array NVRAM cache?
On Sep 26, 2007, at 14:10, Torrey McMahon wrote:
> You probably don't have to create a LUN the size of the NVRAM either. As long as it's dedicated to one LUN then it should be pretty quick. The 3510 cache, last I checked, doesn't do any per-LUN segmentation or sizing. It's a simple front end for any LUN that is using cache.

yep - the policy gets set on the controller for everything served by it .. you could put the ZIL LUN on one controller and change the other controller from write-back to write-through, but then you essentially waste a controller just for the log device and controller failover would be a mess .. we might as well just redo the fcode for these arrays to be a minimized, optimized zfs build, but then again - i don't know what that does to our OEM relationships for the controllers or if it's even worth it in the long run .. seems like it might be easier to just roll our own or release a spec for the hardware vendors to implement.

--- .je
Re: [zfs-discuss] The ZFS-Man.
On Sep 21, 2007, at 14:57, eric kustarz wrote:
>> Hi. I gave a talk about ZFS during EuroBSDCon 2007, and because it won the best talk award and some find it funny, here it is: http://youtube.com/watch?v=o3TGM0T1CvE
>> a bit better version is here: http://people.freebsd.org/~pjd/misc/zfs/zfs-man.swf
> Looks like Jeff has been working out :)

my first thought too: http://blogs.sun.com/bonwick/resource/images/bonwick.portrait.jpg

funny - i always pictured this as UFS-man though: http://www.benbakerphoto.com/business/47573_8C-after.jpg

but what's going on with the sheep there?
Re: [zfs-discuss] ZFS/WAFL lawsuit
On Sep 6, 2007, at 14:48, Nicolas Williams wrote:
>> Exactly the article's point -- rulings have consequences outside of the original case. The intent may have been to store logs for web server access (logical and prudent request) but the ruling states that RAM, albeit working memory, is no different than other storage and must be kept for discovery. This is generalized because (as I understand) the defense was arguing logs are not turned on -- they do not exist, and that was met with "of course the running program has this information in RAM and you are disposing of it" ad nauseam. The only saving grace for the ruling is that it is not a higher court.
> Allowing for technical illiteracy in judges, I think the obvious interpretation is that discoverable data should be retained and that "but it exists only in RAM" is not a defense, and rightly so.

hang on .. let me take it out and give it to you ..

I'm thinking this seems to get into v-chip territory, or otherwise providing a means for agencies to track information that might have passed through a system .. err, for the safety of our children and such :P
Re: [zfs-discuss] Samba with ZFS ACL
On Sep 4, 2007, at 12:09, MC wrote:
> For everyone else: http://blogs.sun.com/timthomas/entry/samba_and_swat_in_solaris#comments
> It looks like nevada 70b will be the next Solaris Express Developer Edition (SXDE) which should also drop shortly and should also have the ZFS ACL fix, but to find the full source integration you have to look in snv_72
> I wonder what is missing from 70b that is included in the full source integration :)

that was my comment - 70b was a respin of snv_70 with some extra stuff added - meaning that the zfsacl.so.0 is released in binary form in the SXDE (70b) in /usr/sfw/lib/vfs, but if you want to browse the source consolidation for sfw you should really look here: http://dlc.sun.com/osol/sfw/downloads/20070822/ instead of here: http://dlc.sun.com/osol/sfw/downloads/20070724/

in S10u4 you'll need a patch that hasn't been released yet .. (according to Jiri some of this has to do with prioritization on samba.org's releases as the zfsacl code got pushed to 3.0.26 which is becoming the 3.2 branch complete with the GPLv3)

to implement, you'll need the following in the smb.conf [public] section:

    vfs objects = zfsacl
    nfs4: mode = special

and for other issues around samba and the zfs_acl patch you should really watch jurasek's blog: http://blogs.sun.com/jurasek/

jonathan
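[pulling that together, a minimal smb.conf share block - the path and the "read only" line are hypothetical additions; the two option lines are the ones from the post:

    [public]
        path = /tank/public
        read only = no
        vfs objects = zfsacl
        nfs4: mode = special
]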
Re: [zfs-discuss] ZFS raid is very slow???
On Jul 7, 2007, at 06:14, Orvar Korvar wrote:
> When I copy that file from ZFS to /dev/null I get this output:
>   real    0m0.025s
>   user    0m0.002s
>   sys     0m0.007s
> which can't be correct. Is it wrong of me to use "time cp fil fil2" when measuring disk performance?

well you're reading and writing to the same disk so that's going to affect performance, particularly as you're seeking to different areas of the disk both for the files and for the uberblock updates .. in the above case it looks like the file is already cached (buffer cache being what is probably consuming most of your memory here) - so you're just looking at a memory to memory transfer .. if you want to see a simple write performance test many people use dd like so:

    # timex dd if=/dev/zero of=file bs=128k count=8192

which will give you a measure of an efficient 1GB file write of zeros .. or use an opensource tool like iozone to get a better fix on single thread vs multi-thread, read/write mix, and block size differences for your given filesystem and storage layout

jonathan
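[a sketch of an iozone run along those lines - the exact sizes and thread count are arbitrary; run it from a directory on the pool under test:

    # sequential write (-i 0) and read (-i 1), 128k records, 1GB files, 4 threads
    iozone -i 0 -i 1 -r 128k -s 1g -t 4
]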
Re: [zfs-discuss] Re: shareiscsi is cool, but what about sharefc or sharescsi?
On Jun 1, 2007, at 18:37, Richard L. Hamilton wrote:
> Can one use a spare SCSI or FC controller as if it were a target?

we'd need an FC or SCSI target mode driver in Solaris .. let's just say we used to have one, and leave it mysteriously there. smart idea though!

--- .je
Re: [zfs-discuss] Re: Re: Lots of overhead with ZFS - what am I doing wrong?
On May 15, 2007, at 13:13, Jürgen Keil wrote:
>> Would you mind also doing:
>>   ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=1
>> to see the raw performance of underlying hardware.
> This dd command is reading from the block device, which might cache data and probably splits requests into maxphys pieces (which happens to be 56K on an x86 box).

to increase this to say 8MB, add the following to /etc/system:

    set maxphys=0x800000

and you'll probably want to increase sd_max_xfer_size as well (should be 256K on x86/x64) .. add the following to /kernel/drv/sd.conf:

    sd_max_xfer_size=0x800000;

then reboot to get the kernel and sd tunings to take.

--- .je

btw - the defaults on sparc:

    maxphys = 128K
    ssd_max_xfer_size = maxphys
    sd_max_xfer_size = maxphys
Re: [zfs-discuss] Issue with adding existing EFI disks to a zpool
On May 5, 2007, at 09:34, Mario Goebbels wrote:
> I spent all day yesterday evacuating my data off one of the Windows disks, so that I can add it to the pool. Using mount-ntfs, it's a pain due to its slowness. But once I finished, I thought "Cool, let's do it". So I added the disk using the zero slice notation (c0d0s0), as suggested for performance reasons. I checked the pool status and noticed however that the pool size didn't rise. After a short panic (myself, not the kernel), I remembered that I partitioned this disk as an EFI disk in Windows (mostly just because). c0d0s0 was the emergency, boot or whatever partition automatically created according to the recommended EFI partitioning scheme. So it added the minimal space of that partition to the pool. The real whole disk partition was c0d0s1. Since there's no device removal in ZFS yet, I had to replace slice 0 with slice 1 since destroying the pool was out of the question. Two things now:
> a) ZFS would have added EFI labels anyway. Will ZFS figure things out for itself, or did I lose write cache control because I didn't explicitly specify s0 though this is an EFI disk already?

yes - if you add the whole device to the pool .. that is, use c0t0d0 instead of c0t0d0s0 .. in this case, ZFS creates a large partition on s0 starting at sector 34 and encompassing the entire disk. If you need to check the write_cache use format -e, then cache, write_cache, display.

> b) I don't remember it mentioned anywhere in the documentation. If a) is indeed an issue, it should be mentioned that you have to unlabel EFI disks before adding.

Removing an EFI label is a little trickier .. you can replace the EFI label with an SMI label if the disk is below 1TB (format -e, then l) and then dd if=/dev/zero of=/dev/dsk/c0t0d0s2 bs=512 count=1 to remove the SMI label .. or you could also attempt to access the entire disk (c0t0d0) with dd and zero out the first 17KB and the last 8MB, but you'd have to get the 8MB offset from the VTOC. You know you've got an empty label if you get stderr entries at the top of the format output, or syslog messages around "corrupt label - bad magic number"

Jonathan
Re: [zfs-discuss] 6410 expansion shelf
right on for optimizing throughput on solaris .. a couple of notes though (also mentioned in the QFS manuals):

- on x86/x64 you're just going to have an sd.conf so just increase the max_xfer_size for all with a line at the bottom like: sd_max_xfer_size=0x800000; (note: if you look at the source the ssd driver is built from the sd source .. it got collapsed back down to sd in S10 x86)

- ssd_max_throttle or sd_max_throttle is typically a point of contention that has had many years of history with storage vendors .. this will limit the maximum queue depth across the board for all sd or ssd devices (read: all disks) .. if you're using the native Leadville stack, there is a dynamic throttle that should adjust per target, so you really shouldn't have to set this unless you're seeing command timeouts either on the port or on the host. By tuning this down you can affect performance on the root drives as well as external storage, making solaris appear slower than it may or may not be.

- ZFS has a maximum block size of 128KB - so i don't think tuning up maxphys and the max transfer sizes to 8MB is going to make that much difference here .. if you want larger block transfers (possibly matching a full stripe width) you'd have to either go with QFS or raw (but note that with larger block transfers you can get into higher cache latency response times depending on the storage controller .. and that's a whole other discussion)

On Mar 27, 2007, at 08:24, Rayson Ho wrote:
> BTW, did anyone try this?? http://blogs.sun.com/ValdisFilks/entry/improving_i_o_throughput_for
> Rayson
>
> On 3/27/07, Wee Yeh Tan [EMAIL PROTECTED] wrote:
>> As promised. I got my 6140 SATA delivered yesterday and I hooked it up to a T2000 on S10u3. The T2000 saw the disks straight away and it has been working for the last 1 hour. I'll be running some benchmarks on it. I'll probably have a week with it until our vendor comes around and steals it from me.
Re: [zfs-discuss] Re: Perforce on ZFS
Roch -

what's the minimum allocation size for a file in zfs? I get 1024B by my calculation (1 x 512B block allocation (minimum) + 1 x 512B inode/znode allocation) since we never pack file data in the inode/znode.

Is this a problem? Only if you're trying to pack a lot of small files in a limited amount of space, or if you're concerned about trying to access many small files quickly. VxFS has a 96B immediate area for file, symlink, or directory data; NTFS can store small files in the MFT record; NetApp WAFL can also store small files in the 4KB inode (16 block pointers = 128B?) .. if you look at some of the more recent OSD papers and some of the Lustre/BlueArc work you'll see that this topic comes into play for performance in pre-fetching file data and locality issues for optimizing heavy access of many small files.

--- .je

On Feb 20, 2007, at 05:12, Roch - PAE wrote:
> Sorry to insist but I am not aware of a small file problem with ZFS (which doesn't mean there isn't one, nor that we agree on definition of 'problem'). So if anyone has data on this topic, I'm interested. Also note, ZFS does a lot more than VxFS.
> -r
>
> Claude Teissedre writes:
>> Hello Roch, Thanks for your reply. According to Iozone and Filebench (http://blogs.sun.com/dom/), ZFS is less performant than VxFS for small files and more performant for large files. In your blog, I don't see specific infos related to small files - but it's a very interesting blog. Any help from CC: people related to the Perforce benchmark (not in techtracker) is welcome. Thanks, Claude
>>
>> Roch - PAE a écrit :
>>> Salut Claude. For this kind of query, try zfs-discuss@opensolaris.org; Looks like a common workload to me. I know of no small file problem with ZFS. You might want to state your metric of success? -r
>>>
>>> Claude Teissedre writes:
>>>> Hello, I am looking for any benchmark of Perforce on ZFS. My need here is specifically for Perforce, a source manager. At my ISV, it handles 250 users simultaneously (15 instances on average) and 16 million (small) files. That's an area not covered in the benchmarks I have seen. Thanks, Claude
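[a quick way to see the per-file overhead empirically - a sketch with hypothetical paths; create a pile of 10-byte files and compare logical bytes against allocated space:

    mkdir /tank/tiny && cd /tank/tiny
    i=0
    while [ $i -lt 1000 ]; do
        printf 0123456789 > f.$i
        i=`expr $i + 1`
    done
    ls -l | awk '{s += $5} END {print s, "logical bytes"}'
    du -sk .    # allocated KB; expect roughly a 512B block per file, not 10B
]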
Re: [zfs-discuss] Re: Perforce on ZFS
On Feb 20, 2007, at 15:05, Krister Johansen wrote:
>> what's the minimum allocation size for a file in zfs? I get 1024B by my calculation (1 x 512B block allocation (minimum) + 1 x 512B inode/znode allocation) since we never pack file data in the inode/znode. Is this a problem? Only if you're trying to pack a lot of small files in a limited amount of space, or if you're concerned about trying to access many small files quickly.
> This is configurable on a per-dataset basis. Take a look in zfs(1m) for recordsize.

the minimum is still 512B .. (try creating a bunch of 10B files - they show up as ZFS plain files each with a 512B data block in zdb)

>> VxFS has a 96B immediate area for file, symlink, or directory data; NTFS can store small files in the MFT record; NetApp WAFL can also store small files in the 4KB inode (16 block pointers = 128B?) .. if you look at some of the more recent OSD papers and some of the Lustre/BlueArc work you'll see that this topic comes into play for performance in pre-fetching file data and locality issues for optimizing heavy access of many small files.
> ZFS has something similar. It's called a bonus buffer.

i see .. but currently we're only storing symbolic links there since given the bufsize of 320B minus the znode_phys struct of 264B, we've only got 56B left for data in the 512B dnode_phys struct .. i'm thinking we might want to trade off some of the uint64_t meta attributes with something smaller and maybe eat into the pad to get a bigger data buffer .. of course that will also affect the reporting end of things, but should be easily fixable.

just my 2p

--- .je
Re: [zfs-discuss] se3510 and ZFS
On Feb 6, 2007, at 06:55, Robert Milkowski wrote:
> Hello zfs-discuss,
> It looks like when zfs issues write cache flush commands the se3510 actually honors them. I do not have a spare se3510 right now to be 100% sure, but comparing an nfs/zfs server with a se3510 to another nfs/ufs server with a se3510 with Periodic Cache Flush Time set to disable (or some longer time), I can see that cache utilization on nfs/ufs stays at about 48% while on nfs/zfs it hardly reaches 20% and every few seconds goes down to 0 (I guess every txg_time). nfs/zfs also has worse performance than nfs/ufs.
> Does anybody know how to tell the se3510 not to honor write cache flush commands?

I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array to flush the cache. Gauging from the amount of calls that zfs makes to this vs ufs (fsck, lockfs, mount?) - i think you'll see the performance diff, particularly when you hit an NFS COMMIT. (If you don't use vdevs you may see another difference in zfs as the only place you'll hit is on the zil)

btw - you may already know, but you'll also fall back to write-through on the cache if your battery charge drops, and we also recommend setting write-through when you only have a single controller since a power event could result in data loss. Of course there's a big performance difference between write-back and write-through cache

--- .je
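[for completeness - on the host side, later builds grew a tunable to stop ZFS from issuing the flushes at all (the same "evil" zfs_nocacheflush knob that comes up elsewhere on the list); only safe when every pool device sits behind battery-backed cache:

    * stop ZFS from sending SYNCHRONIZE CACHE to its devices
    * (use only with battery-backed array cache)
    set zfs:zfs_nocacheflush = 1
]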
Re: Re[2]: [zfs-discuss] se3510 and ZFS
On Feb 6, 2007, at 11:46, Robert Milkowski wrote:
>>> Does anybody know how to tell the se3510 not to honor write cache flush commands?
> JE> I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array
> JE> to flush the cache. Gauging from the amount of calls that zfs makes to
> JE> this vs ufs (fsck, lockfs, mount?)

correction .. UFS uses _FIOFFS, which is a file ioctl, not a device ioctl - which makes sense given the difference in models .. hence UFS doesn't care if the device write cache is turned on or off, as it only makes dkio calls for geometry, info and such.

you can poke through the code to see what other dkio ioctls are being made by zfs .. i believe it's due to the design of a closer tie between the underlying devices and the filesystem that there's a big difference. The DKIOCFLUSH PSARC case is here:

http://www.opensolaris.org/os/community/arc/caselog/2004/652/spec/

however I'm not sure if the 3510 maintains a difference between the entire array cache and the cache for a single LUN/device .. we'd have to dig up one of the firmware engineers for a more definitive answer. Point well taken on shared storage if we're flushing an array cache here :)

--- .je
Re: [zfs-discuss] Which label a ZFS/ZPOOL device has ? VTOC or EFI ?
On Feb 3, 2007, at 02:31, dudekula mastan wrote:
> After creating the ZFS file system on a VTOC labeled disk, I am seeing the following warning messages.
>   Feb 3 07:47:00 scoobyb Corrupt label; wrong magic number
>   Feb 3 07:47:00 scoobyb scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/[EMAIL PROTECTED] (ssd156):
> Any idea on this?

This generally means that this device doesn't have a label - and this particular device would be the multipathed device identified by the GUID 600508b400102eb70001204b, or the old BSD style driver enumeration ssd156 .. (take a look at http://access1.sun.com/codesamples/disknames.html to see an example of how to use libdevinfo to convert this to the SVR4 c#t#d# style name)

Now with ZFS, if you don't specify a slice you're essentially asking ZFS to use and autolabel the entire disk, which will put an EFI style label on, since the older sun style VTOC labels have an upper limit of 1TB per disk (EFI should work up to 2^64 LBAs.) The older sun VTOC labels typically use slice 2 as a backup to show the entire disk and will store the label in the first 512B, whereas the EFI labels will use 34 sectors at the start of the disk to store the label, and will also reserve a portion at the tail end of the disk for a backup label.

With the older sun style VTOC labels, if you ever overwrite the first 512B on cylinder 0 of the disk (eg: dd if=/dev/zero of=/dev/rdsk/c1t1d0s2 where s2 is the typical backup label starting at cylinder 0) you'll overwrite the label, whereas with the EFI label you have to overwrite both protected sections of the disk.

So to reiterate what Robert and Tomas have already gone into .. if you plan on using the entire disk and want the vdev benefits (the ability to import/export pools, write caching, etc) you should probably not specify a slice and allow ZFS to autolabel the disk as it sees fit.

hth

.je
Re: [zfs-discuss] Project Proposal: Availability Suite
On Feb 2, 2007, at 15:35, Nicolas Williams wrote: Unlike traditional journalling replication, a continuous ZFS send/recv scheme could deal with resource constraints by taking a snapshot and throttling replication until resources become available again. Replication throttling would mean losing some transaction history, but since we don't expose that right now, nothing would be lost. Scoreboarding (what SNDR does) should perform better in general, but in the case of COW filesystems and databases ISTM that it should be a wash unless it's properly integrated with the COW system, and that's what makes me think scoreboarding and journalling approach each other at the limit when integrated with ZFS. hmm .. a COW scoreboard .. visions of Clustra, with the notion of each node as an atomic failure unit, spring to mind .. of course in this light, there's not much of a difference between just replication and global synchronization .. very interesting .. --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS or UFS - what to do?
On Jan 26, 2007, at 09:16, Jeffery Malloch wrote: Hi Folks, I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these file systems are dynamically shrinkable and growable. ah - the 6994 is the controller we use in the 6140/6540 if i'm not mistaken .. i guess this thread will go down in a flaming JBOD vs RAID controller religious war again .. oops, too late :P yes - the dynamic LUN expansion bits in ZFS are quite nice and handy for managing dynamic growth of a pool or file system. so going back to Jeffery's original questions: 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and write cache (8GB) has battery backup so I'm not too concerned from a hardware side. I'm looking for an idea of how stable ZFS itself is in terms of corruptability, uptime and OS stability. I think the stability issue has already been answered pretty well .. 8GB battery backed cache is nice .. performance wise you might find some odd interactions with the ZFS adaptive cache integration and the way in which the intent log operates (O_DSYNC writes can potentially impose a lot of in flight commands for relatively little work) - there's a max blocksize of 128KB (also maxphys), so you might want to experiment with tuning back the stripe width .. i seem to recall the 6994 controller performing best with a 256KB or 512KB stripe width .. so there may be additional tuning on the read-ahead or write-behind algorithms. 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations? Not being interested in the RAID functionality, as Roch points out, eliminates the self-healing functionality and reconstruction bits in ZFS .. but you still get other nice benefits like dynamic LUN expansion. As i see it, since we seem to have excess CPU and bus capacity on newer systems (most applications haven't quite caught up to impose enough of a load yet) .. we're back to the mid '90s where host based volume management and caching makes sense and is being proposed again. Being proactive, we might want to consider putting an embedded Solaris/ZFS on a RAID controller to see if we've really got something novel in the caching and RAID algorithms for when the application load really does catch up and impose more of a load on the host. Additionally - we're seeing that there's a big benefit in moving the filesystem closer to the storage array since most users care more about the consistency of their data (upper level) than the reliability of the disk subsystem or RAID controller. Implementing a RAID controller that's more intimately aware of the upper data levels seems like the next logical evolutionary step. 3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotchya? I would say be careful of the ease with which you can destroy file systems and pools .. 
while convenient - there's typically no warning if you or an administrator does a zfs or zpool destroy .. so i could see that turning into an issue. Also if a LUN goes offline, you may not see this right away and you would have the potential to corrupt your pool or panic your system. Hence the self-healing and scrub options to detect and repair failure a little bit faster. People on this forum have been finding RAID controller inconsistencies .. hence the religious JBOD vs RAID ctlr disruptive paradigm shift. 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less than that. Can anyone comment? Doing 2TB+ shouldn't be a problem for the NFS or Samba mounted filesystem regardless of whether the host is 32bit or not. The only place where you can run into a problem is if the size of an individual file crosses 2 or 4GB on a 32bit system. I know we've implemented file systems (QFS in this case) that were samba shared to 32bit windows hosts in excess of 40-100TB without any major issues. I'm sure there are similar cases with ZFS and thumper .. i just don't have that data. a little late to the discussion, but hth --- .je
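(A concrete sketch of the kind of layout Jeffery describes - one pool per Engenio LUN, sub-filesystems with quotas, shared over NFS; pool, dataset and device names are hypothetical:

  # zpool create engpool c4t0d0
  # zfs create engpool/home
  # zfs set quota=500g engpool/home
  # zfs set sharenfs=rw engpool/home

growing or shrinking a group's allotment is then just another zfs set quota, which is the dynamic sizing behavior he's after.)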
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 29, 2007, at 14:17, Jeffery Malloch wrote: Hi Guys, SO... From what I can tell from this thread ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself. So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (ie. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management I risk much more corruption. The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another. Comments? I put together this chart a while back .. i should probably update it for RAID6 and RAIDZ2

 #   ZFS  ARRAY HW     CAPACITY  COMMENTS
 --  ---  --------     --------  --------
 1   R0   R1           N/2       hw mirror - no zfs healing
 2   R0   R5           N-1       hw R5 - no zfs healing
 3   R1   2 x R0       N/2       flexible, redundant, good perf
 4   R1   2 x R5       (N/2)-1   flexible, more redundant, decent perf
 5   R1   1 x R5       (N-1)/2   parity and mirror on same drives (XXX)
 6   RZ   R0           N-1       standard RAID-Z, no mirroring
 7   RZ   R1 (tray)    (N/2)-1   RAIDZ+1
 8   RZ   R1 (drives)  (N/2)-1   RAID1+Z (highest redundancy)
 9   RZ   3 x R5       N-4       triple parity calculations (XXX)
 10  RZ   1 x R5       N-2       double parity calculations (XXX)

(note: I included the cases where you have multiple arrays with a single lun per vdisk (say) and where you only have a single array split into multiple LUNs.) The way I see it, you're better off picking either controller parity or zfs parity .. there's no sense in computing parity multiple times unless you have cycles to spare and don't mind the performance hit .. so the questions you should really answer before you choose the hardware are: what level of redundancy to capacity balance do you want? and do you want to compute RAID in ZFS host memory or out on a dedicated blackbox controller? I would say something about double caching too, but I think that's moot since you'll always cache in the ARC if you use ZFS the way it's currently written. Other feasible filesystem options for Solaris - UFS, QFS, or vxfs with SVM or VxVM for volume mgmt if you're so inclined .. all depends on your budget and application. There are currently tradeoffs in each one, and contrary to some opinions, the death of any of these has been grossly exaggerated. --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
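(As a sketch, row 4 above - a zfs mirror over two hardware RAID5 LUNs, ideally one from each controller - is just the following, with hypothetical device names:

  # zpool create tank mirror c3t0d0 c4t0d0

you keep the controller's parity engine and battery-backed cache, and zfs still has a redundant copy to heal from when a checksum fails.)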
Re: [zfs-discuss] Re: Thumper Origins Q
On Jan 25, 2007, at 10:16, Torrey McMahon wrote: Albert Chin wrote: On Wed, Jan 24, 2007 at 10:19:29AM -0800, Frank Cusack wrote: On January 24, 2007 10:04:04 AM -0800 Bryan Cantrill [EMAIL PROTECTED] wrote: On Wed, Jan 24, 2007 at 09:46:11AM -0800, Moazam Raja wrote: Well, he did say fairly cheap. the ST 3511 is about $18.5k. That's about the same price for the low-end NetApp FAS250 unit. Note that the 3511 is being replaced with the 6140: Which is MUCH nicer but also much pricier. Also, no non-RAID option. So there's no way to treat a 6140 as JBOD? If you wanted to use a 6140 with ZFS, and really wanted JBOD, your only choice would be a RAID 0 config on the 6140? Why would you want to treat a 6140 like a JBOD? (See the previous threads about JBOD vs HW RAID...) I was trying to see if we sold the CSM2 trays without the controller, but I don't think that's commonly asked for .. reminds me of the old D1000 days - i seem to recall putting in more of those as the A1000 controllers weren't the greatest and people tended to opt for s/w mirrors instead. Then as the system application load went higher and the data became more critical the push was towards offloading this onto better storage controllers .. so since it seems like we now have more processing and bus speed on the system that applications aren't taking advantage of yet, it looks like the pendulum might be swinging back towards host-based RAID again. not a verdict .. just a thought --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Thumper Origins Q
On Jan 25, 2007, at 14:34, Bill Sommerfeld wrote: On Thu, 2007-01-25 at 10:16 -0500, Torrey McMahon wrote: So there's no way to treat a 6140 as JBOD? If you wanted to use a 6140 with ZFS, and really wanted JBOD, your only choice would be a RAID 0 config on the 6140? Why would you want to treat a 6140 like a JBOD? (See the previous threads about JBOD vs HW RAID...) Let's turn this around. Assume I want a FC JBOD. What should I get? perhaps something coming real soon .. (stall) --- .je btw - I've also said you could do a FC target in a thumper a la FalconStor .. but i'm not sure if they've got that going on S10, and their target multipathing was less than stellar .. we did have a target mode driver at one point, but i think that project got scrapped a while back. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Thumper Origins Q
On Jan 25, 2007, at 17:30, Albert Chin wrote: On Thu, Jan 25, 2007 at 02:24:47PM -0600, Al Hopper wrote: On Thu, 25 Jan 2007, Bill Sommerfeld wrote: On Thu, 2007-01-25 at 10:16 -0500, Torrey McMahon wrote: So there's no way to treat a 6140 as JBOD? If you wanted to use a 6140 with ZFS, and really wanted JBOD, your only choice would be a RAID 0 config on the 6140? Why would you want to treat a 6140 like a JBOD? (See the previous threads about JBOD vs HW RAID...) Let's turn this around. Assume I want a FC JBOD. What should I get? Many companies make FC expansion boxes to go along with their FC based hardware RAID arrays. Often, the expansion chassis is identical to the RAID equipped chassis - same power supplies, same physical chassis and disk drive carriers - the only difference is that the slots used to house the (dual) RAID H/W controllers have been blanked off. These expansion chassis are designed to be daisy chained back to the box with the H/W RAID. So you simply use one of the expansion chassis and attach it directly to a system equipped with an FC HBA and ... you've got an FC JBOD. Nearly all of them will support two FC connections to allow dual redundant connections to the FC RAID H/W. So if you equip your ZFS host with either a dual-port FC HBA or two single-port FC HBAs - you have a pretty good redundant FC JBOD solution. An example of such an expansion box is the DS4000 EXP100 from IBM. It's also possible to purchase a 3510FC box from Sun with no RAID controllers - but their nearest equivalent of an empty box comes with 6 (overpriced) disk drives pre-installed. :( Perhaps you could use your vast influence at Sun to persuade them to sell an empty 3510FC box? Or an empty box bundled with a single or dual-port FC card (Qlogic based please). Well - there's no harm in making the suggestion ... right? Well, when you buy disk for the Sun 5320 NAS Appliance, you get a Controller Unit shelf and, if you expand storage, an Expansion Unit shelf that connects to the Controller Unit. Maybe the Expansion Unit shelf is a JBOD 6140? that's the CSM200 - the IOMs in that should just take a 2Gb or 4Gb SFP (copper or fibre) and the tray should run switched loop so you can mix FC and SATA as it connects back to the 6140 or 6540 controller head. --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Thumper Origins Q
On Jan 24, 2007, at 09:25, Peter Eriksson wrote: too much of our future roadmap, suffice it to say that one should expect much, much more from Sun in this vein: innovative software and innovative hardware working together to deliver world-beating systems with undeniable economics. Yes please. Now give me a fairly cheap (but still quality) FC- attached JBOD utilizing SATA/SAS disks and I'll be really happy! :-) Could you outline why FC attached instead of network attached (iSCSI say) makes more sense to you? It might help to illustrate the demand for an FC target I'm hearing instead of just a network target .. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
On Jan 24, 2007, at 06:54, Roch - PAE wrote: [EMAIL PROTECTED] writes: Note also that for most applications, the size of their IO operations would often not match the current page size of the buffer, causing additional performance and scalability issues. Thanks for mentioning this, I forgot about it. Since ZFS's default block size is configured to be larger than a page, the application would have to issue page-aligned block-sized I/Os. Anyone adjusting the block size would presumably be responsible for ensuring that the new size is a multiple of the page size. (If they would want Direct I/O to work...) I believe UFS also has a similar requirement, but I've been wrong before. I believe the UFS requirement is that the I/O be sector aligned for DIO to be attempted. And Anton did mention that one of the benefits of DIO is the ability to direct-read a subpage block. Without UFS/DIO the OS is required to read and cache the full page and the extra amount of I/O may lead to data channel saturation (I don't see latency as an issue in here, right ?). In QFS there are mount options to do automatic type switching depending on whether or not the IO is sector aligned. You essentially set a trigger to switch to DIO if you receive a tunable number of well aligned IO requests. This helps tremendously in certain streaming workloads (particularly write) to reduce overhead. This is where I said that such a feature would translate for ZFS into the ability to read parts of a filesystem block which would only make sense if checksums are disabled. would it be possible to do checksums a posteriori? .. i suspect that the checksum portion of the transaction may not be atomic though and this leads us back towards the older notion of a DIF. And for RAID-Z that could mean avoiding I/Os to every disk but one in a group, so that's a nice benefit. So for the performance minded customer that can't afford mirroring, is not much of a fan of data integrity, and needs to do subblock reads on an uncacheable workload, I can see a feature popping up. And this feature is independent of whether or not the data is DMA'ed straight into the user buffer. certain streaming write workloads that are time dependent can fall into this category .. if i'm doing a DMA read directly from a device's buffer that i'd like to stream - i probably want to avoid some of the caching layers of indirection that will probably impose more overhead. The idea behind allowing an application to advise the filesystem of how it plans on doing its IO (or the state of its own cache or buffers or stream requirements) is to prevent the one-cache-fits-all sort of approach that we currently seem to have in the ARC. The other feature is to avoid a bcopy by DMAing full filesystem block reads straight into the user buffer (and verifying the checksum after). The I/O is high latency, bcopy adds a small amount. The kernel memory can be freed/reused straight after the user read completes. This is where I ask, how much CPU is lost to the bcopy in workloads that benefit from DIO ? But isn't the cost more than just the bcopy? Isn't there additional overhead in the TLB/PTE from the page invalidation that needs to occur when you do actually go to write the page out or flush the page? At this point, there are lots of projects that will lead to performance improvements. The DIO benefits seem like small change in the context of ZFS. The quickest return on investment I see for the directio hint would be to tell ZFS to not grow the ARC when servicing such requests. 
How about the notion of multiple ARCs that could be referenced or fine tuned for various types of IO workload profiles to provide a more granular approach? Wouldn't this also keep the page tables smaller and hopefully more contiguous for atomic operations? Not sure what this would break .. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thumper Origins Q
On Jan 24, 2007, at 12:41, Bryan Cantrill wrote: well, Thumper is actually a reference to Bambi You'd have to ask Fowler, but certainly when he coined it, Bambi was the last thing on anyone's mind. I believe Fowler's intention was one that thumps (or, in the unique parlance of a certain Commander-in-Chief, one that gives a thumpin'). You can take your pick of things that thump here: http://en.wikipedia.org/wiki/Thumper given the other name is the X4500 .. it does seem like it should be a weapon --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
Roch I've been chewing on this for a little while and had some thoughts On Jan 15, 2007, at 12:02, Roch - PAE wrote: Jonathan Edwards writes: On Jan 5, 2007, at 11:10, Anton B. Rang wrote: DIRECT IO is a set of performance optimisations to circumvent shortcomings of a given filesystem. Direct I/O as generally understood (i.e. not UFS-specific) is an optimization which allows data to be transferred directly between user data buffers and disk, without a memory-to-memory copy. This isn't related to a particular file system. true .. directio(3) is generally used in the context of *any* given filesystem to advise it that an application buffer to system buffer copy may get in the way or add additional overhead (particularly if the filesystem buffer is doing additional copies.) You can also look at it as a way of reducing more layers of indirection particularly if I want the application overhead to be higher than the subsystem overhead. Programmatically .. less is more. Direct IO makes good sense when the target disk sectors are set a priori. But in the context of ZFS, would you rather have 10 direct disk I/Os or 10 bcopies and 2 I/O (say that was possible). sure, but in a well designed filesystem this is essentially the same as efficient buffer cache utilization .. coalescing IO operations to commit on a more efficient and larger disk allocation unit. However, paged IO (and in particular ZFS paged IO) is probably a little more than simply a bcopy() in comparison to Direct IO (at least in the QFS context) As for read, I can see that when the load is cached in the disk array and we're running 100% CPU, the extra copy might be noticeable. Is this the situation that longs for DIO ? What % of a system is spent in the copy ? What is the added latency that comes from the copy ? Is DIO the best way to reduce the CPU cost of ZFS ? To achieve maximum IO rates (in particular if you have a flexible blocksize and know the optimal stripe width for the best raw disk or array logical volume performance) you're going to do much better if you don't have to pass through buffered IO strategies with the added latencies and kernel space dependencies. Consider the case where you're copying or replicating from one disk device to another in a one-time shot. There's tremendous advantage in bypassing the buffer and reading and writing full stripe passes. The additional buffer copy is also going to add latency and affect your run queue, particularly if you're working on a shared system as the buffer cache might get affected by memory pressure, kernel interrupts, or other applications. Another common case could be line speed network data capture if the frame size is already well aligned for the storage device. Being able to attach one device to another with minimal kernel intervention should be seen as an advantage for a wide range of applications that need to stream data from device A to device B and already know more than you might about both devices. The current Nevada code base has quite nice performance characteristics (and certainly quirks); there are many further efficiency gains to be reaped from ZFS. I just don't see DIO on top of that list for now. Or at least someone needs to spell out what is ZFS/DIO and how much better it is expected to be (back of the envelope calculation accepted). the real benefit is measured more in terms of memory consumption for a given application and the type of balance between application memory space and filesystem memory space. 
when the filesystem imposes more pressure on the application due to its mapping you're really measuring the impact of doing an application buffer read and copy for each write. In other words you're imposing more of a limit on how the application should behave with respect to its notion of the storage device. DIO should not be seen as a catchall for the notion of "more efficiency will be gotten by bypassing the filesystem buffers" but rather as "please don't buffer this since you might push back on me and I don't know if I can handle a push back" advice. Reading RAID-Z subblocks on filesystems that have checksum disabled might be interesting. That would avoid some disk seeks. To serve the subblocks directly or not is a separate matter; it's a small deal compared to the feature itself. How about disabling the DB checksum (it can't fix the block anyway) and doing mirroring? Basically speaking - there needs to be some sort of strategy for bypassing the ARC or even parts of the ARC for applications that may need to advise the filesystem of either: 1) the delicate nature of imposing additional buffering for their data flow 2) already well optimized applications that need more adaptive cache in the application instead of the underlying filesystem or volume manager --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
On Jan 5, 2007, at 11:10, Anton B. Rang wrote: DIRECT IO is a set of performance optimisations to circumvent shortcomings of a given filesystem. Direct I/O as generally understood (i.e. not UFS-specific) is an optimization which allows data to be transferred directly between user data buffers and disk, without a memory-to-memory copy. This isn't related to a particular file system. true .. directio(3) is generally used in the context of *any* given filesystem to advise it that an application buffer to system buffer copy may get in the way or add additional overhead (particularly if the filesystem buffer is doing additional copies.) You can also look at it as a way of reducing more layers of indirection particularly if I want the application overhead to be higher than the subsystem overhead. Programmatically .. less is more. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re[2]: ZFS in a SAN environment
On Dec 20, 2006, at 00:37, Anton B. Rang wrote: INFORMATION: If a member of this striped zpool becomes unavailable or develops corruption, Solaris will kernel panic and reboot to protect your data. OK, I'm puzzled. Am I the only one on this list who believes that a kernel panic, instead of EIO, represents a bug? I agree as well - did you file a bug on this yet? Inducing kernel panics (like we also do on certain sun cluster failure types) to prevent corruption can often lead to more corruption elsewhere, and usually ripples to throw admins, managers, and users in a panic as well - typically resulting in more corrupted opinions and perceptions of reliability and usability. :) --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
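(For reference: later ZFS builds added a pool-level failmode property that makes this behavior a choice rather than a hardwired panic - failmode=wait blocks I/O until the device returns, failmode=continue returns EIO to new I/O, and failmode=panic keeps the old behavior:

  # zpool set failmode=continue tank
  # zpool get failmode tank

on the builds being discussed here, the panic was effectively the only policy.)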
Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto
On Dec 20, 2006, at 04:41, Darren J Moffat wrote: Bill Sommerfeld wrote: There also may be a reason to do this when confidentiality isn't required: as a sparse provisioning hack.. If you were to build a zfs pool out of compressed zvols backed by another pool, then it would be very convenient if you could run in a mode where freed blocks were overwritten by zeros when they were freed, because this would permit the underlying compressed zvol to free *its* blocks. A very interesting observation. Particularly given that I have just created such a configuration - with iSCSI in the middle. over ipsec? wow - how many layers is that before you start talking to the real (non-pseudo) block storage device? --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
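(A sketch of that kind of stack - dataset names hypothetical, and shareiscsi assumes a build with the zvol iSCSI target support:

  # zfs create -V 100g outer/backing
  # zfs set compression=on outer/backing
  # zfs set shareiscsi=on outer/backing

the initiator host then builds its own pool on the imported LUN; zero-filled freed blocks in the inner pool would compress away to nearly nothing in outer/backing, which is the sparse-provisioning effect Bill describes.)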
Re: [zfs-discuss] ZFS in a SAN environment
On Dec 18, 2006, at 17:52, Richard Elling wrote: In general, the closer to the user you can make policy decisions, the better decisions you can make. The fact that we've had 10 years of RAID arrays acting like dumb block devices doesn't mean that will continue for the next 10 years :-) In the interim, we will see more and more intelligence move closer to the user. I thought this is what the T10 OSD spec was set up to address. We've already got device manufacturers beginning to design and code to the spec. --- .je (ps .. actually it's closer to 20+ years of RAID and dumb block devices ..) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
On Dec 19, 2006, at 07:17, Roch - PAE wrote: Shouldn't there be a big warning when configuring a pool with no redundancy and/or should that not require a -f flag ? why? what if the redundancy is below the pool .. should we warn that ZFS isn't directly involved in redundancy decisions? --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto
On Dec 19, 2006, at 08:59, Darren J Moffat wrote: Darren Reed wrote: If/when ZFS supports this then it would be nice to also be able to have Solaris bleach swap on ZFS when it shuts down or reboots. Although it may be that this option needs to be put into how we manage swap space and not specifically zomething for ZFS. Doing this to swap space has been a kernel option on another very widely spread operating system for at least 2 major OS releases... Which ones ? I know that MacOS X and OpenBSD both support encrypted swap which for swap IMO is a better way to solve this problem. You can get that today with OpenSolaris by using the stuff in the loficc project. You will also get encrypted swap when we have ZFS crypto and you swap on a ZVOL that is encrypted. Note though that that isn't quite the same way as OpenBSD solves the encrypted swap problem, and I'm not familiar with the technical details of what Apple did in MacOS X. there's an encryption option in the dynamic_pager to write out encrypted paging files (/var/vm/swapfile*) .. it gets turned on with an environment variable that gets set at boot (what happens when you choose secure virtual memory.) Before this was implemented there was a workaround using an encrypted dmg that held the swap files .. but that was an incomplete solution. Bleaching is a time consuming task, not something I'd want to do at system boot/halt. particularly if we choose to do a 35 pass Gutmann algorithm .. :) --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
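(The zvol half of the swap idea above works today, crypto aside - a sketch with a hypothetical pool name:

  # zfs create -V 4g tank/swapvol
  # swap -a /dev/zvol/dsk/tank/swapvol
  # swap -l

once the zvol can live in an encrypted dataset, the same two commands would give you encrypted swap.)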
Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto
On Dec 18, 2006, at 11:54, Darren J Moffat wrote: [EMAIL PROTECTED] wrote: Rather than bleaching which doesn't always remove all stains, why can't we use a word like erasing (which is hitherto unused for filesystem use in Solaris, AFAIK) and this method doesn't remove all stains from the disk anyway it just reduces them so they can't be easily seen ;-) and if you add the right amount of ammonia is should remove everything .. (ahh - fun with trichloramine) --- .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
On Dec 19, 2006, at 10:15, Torrey McMahon wrote: Darren J Moffat wrote: Jonathan Edwards wrote: On Dec 19, 2006, at 07:17, Roch - PAE wrote: Shouldn't there be a big warning when configuring a pool with no redundancy and/or should that not require a -f flag ? why? what if the redundancy is below the pool .. should we warn that ZFS isn't directly involved in redundancy decisions? Yes because if ZFS doesn't know about it then ZFS can't use it to do corrections when the checksums (which always work) detect problems. We do not have the intelligent end-to-end management to make these judgments. Trying to make one layer of the stack {stronger, smarter, faster, bigger,} while ignoring the others doesn't help. Trying to make educated guesses as to what the user intends doesn't help either.

  Hi! It looks like you're writing a block. Would you like help?
   - Get help writing the block
   - Just write the block without help
   - (Don't show me this tip again)

somehow I think we all know on some level that letting a system attempt to guess your intent will get pretty annoying after a while .. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
On Dec 18, 2006, at 16:13, Torrey McMahon wrote: Al Hopper wrote: On Sun, 17 Dec 2006, Ricardo Correia wrote: On Friday 15 December 2006 20:02, Dave Burleson wrote: Does anyone have a document that describes ZFS in a pure SAN environment? What will and will not work? From some of the information I have been gathering it doesn't appear that ZFS was intended to operate in a SAN environment. This might answer your question: http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid The section entitled Does ZFS work with SAN-attached devices? does not make it clear the (some would say) dire effects of not having pool redundancy. I think that FAQ should clearly spell out the downside; i.e., where ZFS will say (Sorry Charlie) pool is corrupt. A FAQ should always emphasize the real-world downsides to poor decisions made by the reader. Not delivering bad news does the reader a dis-service IMHO. I'd say that it's clearly described in the FAQ. If you push too hard people will infer that SANs are broken if you use ZFS on top of them or vice versa. The only bit that looks a little questionable to my eyes is ... Overall, ZFS functions as designed with SAN-attached devices, but if you expose simpler devices to ZFS, you can better leverage all available features. What are simpler devices? (I could take a guess ... ) stone tablets in a room full of monkeys with chisels? The bottom line is that ZFS ultimately wants to function as the controller cache and eventually eliminate the blind data algorithms that controllers incorporate .. the problem is that we can't really say that explicitly since we sell - and much of the enterprise operates with - enterprise class arrays and integrated data cache. The trick is in balancing who does what since you've really got duplicate Virtualization, RAID, and caching options open to you. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Vanity ZVOL paths?
On Dec 8, 2006, at 05:20, Jignesh K. Shah wrote: Hello ZFS Experts I have two ZFS pools zpool1 and zpool2 I am trying to create a bunch of zvols such that their paths are similar except for a consistent number scheme, without reference to the zpools they actually belong to. (This will allow me to have common references in my setup scripts) If I create zfs create -V 100g zpool1/tablespace1 zfs create -V 100g zpool2/tablespace2 zfs create -V 100g zpool1/tablespace3 zfs create -V 100g zpool2/tablespace4 Then I get /dev/zvol/rdsk/zpool1/tablespace1 /dev/zvol/rdsk/zpool2/tablespace2 /dev/zvol/rdsk/zpool1/tablespace3 /dev/zvol/rdsk/zpool2/tablespace4 As you notice I have two series, zpool and tablespace.. I am trying to eliminate 1 series. So I tried zfs create zpool1/dbdata1 zfs create zpool2/dbdata2 zfs create zpool1/dbdata3 zfs create zpool2/dbdata4 And changed their mount points as follows zfs set mountpoint=/tablespace1 zpool1/dbdata1 zfs set mountpoint=/tablespace2 zpool2/dbdata2 zfs set mountpoint=/tablespace3 zpool1/dbdata3 zfs set mountpoint=/tablespace4 zpool2/dbdata4 And then created a common zvol name for all pools: zfs create -V 100g zpool1/dbdata1/data zfs create -V 100g zpool2/dbdata2/data zfs create -V 100g zpool1/dbdata3/data zfs create -V 100g zpool2/dbdata4/data I was expecting I would get /dev/zvol/rdsk/tablespace1/data /dev/zvol/rdsk/tablespace2/data /dev/zvol/rdsk/tablespace3/data /dev/zvol/rdsk/tablespace4/data Instead I got /dev/zvol/rdsk/zpool1/dbdata1/data /dev/zvol/rdsk/zpool2/dbdata2/data /dev/zvol/rdsk/zpool1/dbdata3/data /dev/zvol/rdsk/zpool2/dbdata4/data Any idea how I get my abstracted zvol paths like I can do with my mountpoints in regular ZFS? setting the mountpoint isn't going to affect the volume name .. for vanity zvol paths you'll have to use symlinks .. try: mkdir /dev/zvol/rdsk/tablespace1 /dev/zvol/dsk/tablespace1 ln -s /dev/zvol/rdsk/zpool1/dbdata1/data /dev/zvol/rdsk/tablespace1/data ln -s /dev/zvol/dsk/zpool1/dbdata1/data /dev/zvol/dsk/tablespace1/data ... etc ... or better yet, simply link to the underlying /devices entry and you don't even have to keep it in the /dev/zvol tree since everything in the /dev tree is a symlink anyhow .. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
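(To script the whole set from the message above - a small sketch, untested:

  #!/bin/sh
  # create vanity /dev/zvol/{dsk,rdsk}/tablespaceN/data symlinks
  i=1
  for ds in zpool1/dbdata1 zpool2/dbdata2 zpool1/dbdata3 zpool2/dbdata4
  do
      for t in dsk rdsk
      do
          mkdir -p /dev/zvol/$t/tablespace$i
          ln -s /dev/zvol/$t/$ds/data /dev/zvol/$t/tablespace$i/data
      done
      i=`expr $i + 1`
  done

the inner loop keeps the block and raw vanity paths in step with each other.)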
Re: [zfs-discuss] Re: system wont boot after zfs
Dave which BIOS manufacturers and revisions? that seems to be more of the problem as choices are typically limited across vendors .. and I take it you're running 6/06 u2 Jonathan On Nov 30, 2006, at 12:46, David Elefante wrote: Just as background: I attempted this process on the following: 1. Jetway amd socket 734 (vintage 2005) 2. Asus amd socket 939 (vintage 2005) 3. Gigabyte amd socket am2 (vintage 2006) All with the same problem. I disabled the onboard nvidia nforce 410/430 raid bios in the bios in all cases. Now whether it actually does not look for a signature, I do not know. I'm attempting to make this box into an iSCSI target for my ESX environments. I can put W3K and SanMelody on there, but it is not as interesting and I am attempting to help the Solaris community. I am simply making the business case that over three major vendors boards and the absolute latest (gigabyte), the effect was the same. As a workaround I can make slice 0 1 cyl and slice 1 1-x, and the zpool on the rest of the disk and be fine with that. So on a PC with zpool create there should be a warning for pc users that most likely if they use the entire disk, the resultant EFI label is likely to cause lack of bootability. I attempted to hotplug the sata drives after booting, and Nevada 51 came up with scratch space errors and did not recognize the drive. In any case I'm not hotplugging my drives every time. The given fact is that PC vendors are not readily adopting EFI bios at this time, the millions of PC's out there are vulnerable to this. And if x86 Solaris is to be really viable, this community needs to be addressed. Now I was at Sun 1/4 of my entire life and I know the politics, but the PC area is different. If you tell the customer to go to the mobo vendor to fix the bios, they will have to find some guy in a bunker in Taiwan. Not likely. Now I'm at VMware actively working on consolidating companies into x86 platforms. The simple fact that the holy war between AMD and Intel has created processors that a cheap enough and fast enough to cause disruption in the enterprise space. My new dual core AMD processor is incredibly fast and the entire box cost me $500 to assemble. The latest Solaris 10 documentation (thx Richard) has use the entire disk all over it. I don't see any warning in here about EFI labels, in fact these statements discourage putting ZFS in a slice.: ZFS applies an EFI label when you create a storage pool with whole disks. Disks can be labeled with a traditional Solaris VTOC label when you create a storage pool with a disk slice. Slices should only be used under the following conditions: * The device name is nonstandard. * A single disk is shared between ZFS and another file system, such as UFS. * A disk is used as a swap or a dump device. Disks can be specified by using either the full path, such as /dev/dsk/c1t0d0, or a shorthand name that consists of the device name within the /dev/dsk directory, such as c1t0d0. For example, the following are valid disk names: * c1t0d0 * /dev/dsk/c1t0d0 * c0t0d6s2 * /dev/foo/disk ZFS works best when given whole physical disks. Although constructing logical devices using a volume manager, such as Solaris Volume Manager (SVM), Veritas Volume Manager (VxVM), or a hardware volume manager (LUNs or hardware RAID) is possible, these configurations are not recommended. While ZFS functions properly on such devices, less-than-optimal performance might be the result. 
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Wednesday, November 29, 2006 1:24 PM To: Jonathan Edwards Cc: David Elefante; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Re: system wont boot after zfs I suspect a lack of an MBR could cause some BIOS implementations to barf .. Why? Zeroed disks don't have that issue either. What appears to be happening is more that raid controllers attempt to interpret the data in the EFI label as the proprietary hardware raid labels. At least, it seems to be a problem with internal RAIDs only. In my experience, removing the disks from the boot sequence was not enough; you need to disable the disks in the BIOS. The SCSI disks with EFI labels in the same system caused no issues at all; but the disks connected to the on-board RAID did have issues. So what you need to do is: - remove the controllers from the probe sequence - disable the disks Casper
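(The slice workaround David describes looks like this in practice - a sketch; device name and cylinder layout are hypothetical:

  # format c1t0d0        write an SMI (VTOC) label; make s0 cylinder 0 only
                         and s1 the remaining cylinders
  # zpool create tank c1t0d0s1

because a slice is specified, zfs keeps the VTOC label, so the BIOS never sees an EFI label at cylinder 0.)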
Re: [zfs-discuss] Re: ZFS ACLs and Samba
On Oct 25, 2006, at 15:38, Roger Ripley wrote: IBM has contributed code for NFSv4 ACLs under AIX's JFS; hopefully Sun will not tarry in following their lead for ZFS. http://lists.samba.org/archive/samba-cvs/2006-September/070855.html I thought this was still in draft: http://ietf.org/internet-drafts/draft-ietf-nfsv4-acl-mapping-05.txt .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirrored Raidz
On Oct 24, 2006, at 04:19, Roch wrote: Michel Kintz writes: Matthew Ahrens wrote: Richard Elling - PAE wrote: Anthony Miller wrote: Hi, I've searched the forums and not found any answer to the following. I have 2 JBOD arrays each with 4 disks. I want to create a raidz on one array and have it mirrored to the other array. Today, the top level raid sets are assembled using dynamic striping. There is no option to assemble the sets with mirroring. Perhaps the ZFS team can enlighten us on their intentions in this area? Our thinking is that if you want more redundancy than RAID-Z, you should use RAID-Z with double parity, which provides more reliability and more usable storage than a mirror of RAID-Zs would. (Also, expressing mirror of RAID-Zs from the CLI would be a bit messy; you'd have to introduce parentheses in vdev descriptions or something.) It is not always a matter of more redundancy. In my customer's case, they have storage in 2 different rooms of their datacenter and want to mirror from one storage unit in one room to the other. So having in this case a combination of RAID-Z + Mirror makes sense in my mind or ? Michel. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss you may let the storage export RAID-5 luns and let ZFS mirror those. Would that work ? -r they're JBOD arrays, so unless you're proposing the use of another volume manager i don't think that would work. as for the maximum redundancy in configurations, i think that Frank hit it with the mirroring of each drive component across the arrays and doing a simple stripe. I just think it would be good to add the flexibility in zpool to: 1) raidz a set of mirrors 2) mirror a couple of raidz in certain environments you care more about multiple drive or array failures than anything else. Today you can do this with zvols, but I'm a little worried about how this would perform given the nested layering you have to introduce .. eg:

  # zpool create a1pool raidz c0t0d0 c0t1d0 c0t2d0 ..
  # zpool create a2pool raidz c1t0d0 c1t1d0 c1t2d0 ..
  # zfs create -V size a1pool/vol
  # zfs create -V size a2pool/vol
  # zpool create mzdata mirror /dev/zvol/dsk/a1pool/vol /dev/zvol/dsk/a2pool/vol

.je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Mirrored Raidz
there's 2 approaches: 1) RAID 1+Z where you mirror the individual drives across trays and then RAID-Z the whole thing 2) RAID Z+1 where you RAIDZ each tray and then mirror them I would argue that you can lose the most drives in configuration 1 and stay alive: With a simple mirrored stripe you lose if you lose 1 drive in each tray. With configuration 2 it takes 2 drives in each tray. With configuration 1 you have to lose both sides of 2 mirrored sets to fail. so it's not a space or performance model .. simply an availability model with failing disks Jonathan On Oct 24, 2006, at 12:46, Richard Elling - PAE wrote: Pedantic question, what would this gain us other than better data retention? Space and (especially?) performance would be worse with RAID-Z+1 than 2-way mirrors. -- richard Frank Cusack wrote: On October 24, 2006 9:19:07 AM -0700 Anton B. Rang [EMAIL PROTECTED] wrote: Our thinking is that if you want more redundancy than RAID-Z, you should use RAID-Z with double parity, which provides more reliability and more usable storage than a mirror of RAID-Zs would. This is only true if the drives have either independent or identical failure modes, I think. Consider two boxes, each containing ten drives. Creating RAID-Z within each box protects against single-drive failures. Mirroring the boxes together protects against single-box failures. But mirroring also protects against single-drive failures. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
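(Neither nesting is expressible in a single zpool command today, but the cross-tray stripe of mirrors that approximates configuration 1 is - a sketch where c1 is tray 1 and c2 is tray 2, device names hypothetical:

  # zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 mirror c1t2d0 c2t2d0

each mirror pair spans the trays, so a whole-tray failure leaves every vdev with one good side.)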
Re: [zfs-discuss] Re: [osol-discuss] Cloning a disk w/ ZFS in it
you don't really need to do the prtvtoc and fmthard with the old Sun labels if you start at cylinder 0 since you're doing a bit-for-bit copy with dd .. but, keep in mind: - The Sun VTOC is the first 512B and s2 *typically* should start at cylinder 0 (unless it's been redefined .. check!) - The EFI label though, reserves the first 17KB (34 blocks) and for a dd to work, you need to either: 1) dd without the slice (eg: dd if=/dev/rdsk/c0t0d0 of=/dev/rdsk/c1t0d0 bs=128K) or 2) prtvtoc / fmthard (eg: prtvtoc /dev/rdsk/c0t0d0s0 > /tmp/vtoc.out ; fmthard -s /tmp/vtoc.out /dev/rdsk/c1t0d0s0) .je On Oct 22, 2006, at 12:45, Krzys wrote: yeah disks need to be identical but why do you need to do prtvtoc and fmthard to duplicate the disk label (before the dd), I thought that dd would take care of all of that... whenever I used dd I used it on slice 2 and I never had to do prtvtoc and fmthard... Just make sure disks are identical and that is the key. Regards, Chris On Fri, 20 Oct 2006, Richard Elling - PAE wrote: minor adjustments below... Darren J Moffat wrote: Asif Iqbal wrote: Hi I have a X2100 with two 74G disks. I build the OS on the first disk with slice0 root 10G ufs, slice1 2.5G swap, slice6 25MB ufs and slice7 62G zfs. What is the fastest way to clone it to the second disk. I have to build 10 of those in 2 days. Once I build the disks I slam them into the other X2100s and ship them out. if clone really means make completely identical then do this: boot off cd or network. dd if=/dev/dsk/sourcedisk of=/dev/dsk/destdisk Where sourcedisk and destdisk are both locally attached. I use prtvtoc and fmthard to duplicate the disk label (before the dd) Note: the actual disk geometry may change between vendors or disk firmware revs. You will first need to verify that the geometries are similar, especially the total number of blocks. For dd, I'd use a larger block size than the default. Something like: dd bs=1024k if=/dev/dsk/sourcedisk of=/dev/dsk/destdisk The copy should go at media speed, approximately 50-70 MBytes/s for the X2100 disks. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
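(Richard's steps chained together - disk names hypothetical, and both disks must have compatible geometry:

  # prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
  # dd bs=1024k if=/dev/rdsk/c0t0d0s2 of=/dev/rdsk/c1t0d0s2

fmthard -s - takes the label from stdin, so no temp file is needed.)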
Re: [zfs-discuss] A versioning FS
On Oct 8, 2006, at 23:54, Nicolas Williams wrote: On Sun, Oct 08, 2006 at 11:16:21PM -0400, Jonathan Edwards wrote: On Oct 8, 2006, at 22:46, Nicolas Williams wrote: You're arguing for treating FV as extended/named attributes :) kind of - but one of the problems with EAs is the increase/bloat in the inode/dnode structures and corresponding incompatibilities with other applications or tools. This in a thread where folks [understandably] claim that storage is cheap and abundant. And I agree that it is. Plus, I think you may be jumping to conclusions about the bloat of extended attributes: Another approach might be to put it all into the block storage rather than trying to stuff it into the metadata on top. If we look at the zfs on-disk structure instead and simply extend the existing block pointer mappings to handle the diffs along with a header block to handle the version numbers - this might be an easier way out rather than trying to redefine or extend the dnode structure. Of course you'd still need a single attribute to flag reading the version block header and corresponding diff blocks, but this could go anywhere - even a magic acl perhaps .. i would argue that the overall goal should be aimed toward the reduction of complexity in the metadata nodes rather than attempting to extend them and increase the seek/parse time. Wait a minute -- the extended attribute idea is about *interfaces*, not internal implementation. I certainly did not argue that a file version should be copied into an EA. true, but I just find that the EA discussion is just as loaded as the FV discussion that too often focuses on improvements in the metadata space rather than the block data space. I'm not talking about the file version data .. rather the bplist for the file version data and possibly causing this to live in the block data space instead of the dnode DMU. This way the FV will be completely accessible within the filesystem block data structure instead of being abstracted back out of the dnode DMU. I would hold that the version data space consumption should also be readily apparent on the filesystem level and that versioned access should not impede the regular file lookup or attribute caching. It's a slight deviation from the typical EA approach, but an important distinction to make to keep the metadata structures relatively lean. Let's keep interface and implementation details separate. Most of this thread has been about interfaces precisely because that's what users will interact with; users won't care one bit about how it's all implemented under the hood. I'm not so sure you can separate the two without creating a hack. I would also argue that users (particularly the ones creating the interfaces) will care about the implementation details since those are the real underlying issues they'll be wrestling with. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A versioning FS
On Oct 8, 2006, at 21:40, Wee Yeh Tan wrote: On 10/7/06, Ben Gollmer [EMAIL PROTECTED] wrote: On Oct 6, 2006, at 6:15 PM, Nicolas Williams wrote: What I'm saying is that I'd like to be able to keep multiple versions of my files without echo * or ls showing them to me by default. Hmm, what about file.txt - ._file.txt.1, ._file.txt.2, etc? If you don't like the _ you could use @ or some other character. You missed Nicolas's point. It does not matter which delimiter you use. I still want my for i in *; do ... to work as per now. We want to differentiate files that are created intentionally from those that are just versions. If files starts showing up on their own, a lot of my scripts will break. Still, an FV-aware shell/program/API can accept an environment setting that may quiesce the version output. E.g. export show-version=off/on. if we're talking implementation - i think it would make more sense to store the block version differences in the base dnode itself rather than creating new dnode structures to handle the different versions. You'd then structure different tools or flags to handle the versions (copy them to a new file/dnode, etc) - standard or existing tools don't need to know about the underlying versions. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: A versioning FS
On Oct 6, 2006, at 23:42, Anton B. Rang wrote: I don't agree that version control systems solve the same problem as file versioning. I don't want to check *every change* that I make into version control -- it makes the history unwieldy. At the same time, if I make a change that turns out to work really poorly, I'd like to revert to the previous code -- not necessarily the code which is checked in. (I suspect there may be some versioning systems which allow intermediate versions to be deleted, and I just haven't used them, but this still seems complex compared to only checking in known-good code.) The use cases are somewhat different here. I would venture to say that a *personal* file versioning system needs to be thought of differently from a *group* co-ordination formal version control system. Of course there is a fair amount of overlap in both use cases, particularly when you consider a global namespace and concurrent access problems as you can see in the cedar or plan9 systems (fossil/venti): http://portal.acm.org/citation.cfm?doid=42392.42398 http://cm.bell-labs.com/plan9/ And if we were to also consider dynamic linking and versioning for deprecated functions, there's another whole level of parallel backwards compatibility interface problems that become much easier to approach. While this is an FV discussion, I do believe that we need some sort of clearer distinction between FV, VC, DR, CDP, and Snapshotting structured around the usability cases and close/sync vs a forced version mark/branch .. there's too much confusion in this space, often with conflicting goals misapplied to solve similar problems. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
On Sep 5, 2006, at 06:45, Robert Milkowski wrote: Hello Wee, Tuesday, September 5, 2006, 10:58:32 AM, you wrote: WYT On 9/5/06, Torrey McMahon [EMAIL PROTECTED] wrote: This is simply not true. ZFS would protect against the same type of errors seen on an individual drive as it would on a pool made of HW raid LUN(s). It might be overkill to layer ZFS on top of a LUN that is already protected in some way by the device's internal RAID code but it does not "make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and slow down the I/O". WYT Roch's recommendation to leave at least 1 layer of redundancy to ZFS WYT allows the extension of ZFS's own redundancy features for some truly WYT remarkable data reliability. WYT Perhaps the question should be how one could mix them to get the best WYT of both worlds instead of going to either extreme. Depends on your data but sometimes it could be useful to create HW RAID and then do just striping on the ZFS side between at least two LUNs. That way you do not get data protection but fs/pool protection with ditto blocks. Of course each LUN is HW RAID made of different physical disks. i remember working up a chart on this list about 2 months ago: Here's 10 options I can think of to summarize combinations of zfs with hw redundancy:

 #   ZFS  ARRAY HW     CAPACITY  COMMENTS
 --  ---  --------     --------  --------
 1   R0   R1           N/2       hw mirror - no zfs healing (XXX)
 2   R0   R5           N-1       hw R5 - no zfs healing (XXX)
 3   R1   2 x R0       N/2       flexible, redundant, good perf
 4   R1   2 x R5       (N/2)-1   flexible, more redundant, decent perf
 5   R1   1 x R5       (N-1)/2   parity and mirror on same drives (XXX)
 6   RZ   R0           N-1       standard RAIDZ - no array RAID (XXX)
 7   RZ   R1 (tray)    (N/2)-1   RAIDZ+1
 8   RZ   R1 (drives)  (N/2)-1   RAID1+Z (highest redundancy)
 9   RZ   2 x R5       N-3       triple parity calculations (XXX)
 10  RZ   1 x R5       N-2       double parity calculations (XXX)

If you've invested in a RAID controller on an array, you might as well take advantage of it, otherwise you could probably get an old D1000 chassis somewhere and just run RAIDZ on JBOD. If you're more concerned about redundancy than space, with the SUN/STK 3000 series dual controller arrays I would either create at least 2 x RAID5 luns balanced across controllers and zfs mirror, or create at least 4 x RAID1 luns balanced across controllers and use RAIDZ. RAID0 isn't going to make that much sense since you've got a 128KB txg commit on zfs which isn't going to be enough to do a full stripe in most cases. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
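(Robert's striping suggestion as commands, with hypothetical LUN names - and, on builds that have the copies property, ditto blocks can cover user data as well as metadata:

  # zpool create tank c3t0d0 c4t0d0      two hw RAID5 LUNs, zfs stripes across them
  # zfs set copies=2 tank                data ditto blocks, spread across both LUNs

without copies=2 you still get ditto-block protection for pool metadata, which is the fs/pool protection he mentions.)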
Re: [zfs-discuss] 3510 JBOD ZFS vs 3510 HW RAID
On Aug 1, 2006, at 22:23, Luke Lonergan wrote: Torrey, On 8/1/06 10:30 AM, Torrey McMahon [EMAIL PROTECTED] wrote: http://www.sun.com/storagetek/disk_systems/workgroup/3510/index.xml Look at the specs page. I did. This is 8 trays, each with 14 disks and two active Fibre channel attachments. That means that 14 disks, each with a platter rate of 80MB/s will be driven over a 400MB/s pair of Fibre Channel connections, a slowdown of almost 3 to 1. This is probably the most expensive, least efficient way to get disk bandwidth available to customers. WRT the discussion about blow the doors, etc., how about we see some bonnie++ numbers to back it up. actually .. there's SPC-2 vdbench numbers out at: http://www.storageperformance.org/results see the full disclosure report here: http://www.storageperformance.org/results/b5_Sun_SPC2_full-disclosure_r1.pdf of course that's a 36GB 15K FC system with 2 expansion trays, 4 HBAs and 3 yrs maintenance in the quote that was spec'd at $72K list (or $56/GB) .. (i'll use list numbers for comparison since they're the easiest) if you've got a copy of the vdbench tool you might want to try the profiles in the appendix on a thumper - I believe the bonnie/bonnie++ numbers tend to skew more on single threaded low blocksize memory transfer issues. now to bring the thread full circle to the original question of price/performance and increasing the scope to include the X4500 .. for single attached low cost systems, thumper is *very* compelling particularly when you factor in the density .. for example using list prices from http://store.sun.com/

  X4500 (thumper) w/ 48 x 250GB SATA drives = $32995 = $2.68/GB
  X4500 (thumper) w/ 48 x 500GB SATA drives = $69995 = $2.84/GB
  SE3511 (dual controller) w/ 12 x 500GB SATA drives = $36995 = $6.17/GB
  SE3510 (dual controller) w/ 12 x 300GB FC drives = $48995 = $13.61/GB

So a 250GB SATA drive configured thumper (server attached with 16GB of cache .. err .. RAM) is 5x less in cost/GB than a 300GB FC drive configured 3510 (dual controllers w/ 2 x 1GB typically mirrored cache) and a 500GB SATA drive configured thumper (server attached) is 2.3x less in cost/GB than a 500GB SATA drive configured 3511 (again dual controllers w/ 2 x 1GB typically mirrored cache) For a single attached system - you're right - 400MB/s is your effective throttle (controller speeds actually) on the 3510 and your realistic throughput on the 3511 is probably going to be less than 1/2 that number if we factor in the back pressure we'll get on the cache against the back loop .. your bonnie++ block transfer numbers on a 36 drive thumper were showing about 424MB/s on 100% write and about 1435MB/s on 100% read .. it'd be good to see the vdbench numbers as well (but i've had a hard time getting my hands on one since most appear to be out at customer sites) Now with thumper - you are SPoF'd on the motherboard and operating system - so you're not really getting the availability aspect from dual controllers .. but given the value - you could easily buy 2 and still come out ahead .. you'd have to work out some sort of timely replication of transactions between the 2 units and deal with failure cases with something like a cluster framework. Then for multi-initiator cross system access - we're back to either some sort of NFS or CIFS layer or we could always explore target mode drivers and virtualization .. so once again - there could be a compelling argument coming in that arena as well. 
Now, if you already have a big shared FC infrastructure - throwing dense servers in the middle of it all may not make the most sense yet - but on the flip side, we could be seeing a shrinking market for single attach low cost arrays.

Lastly (for this discussion anyhow) there are the reliability and quality issues with SATA vs FC drives (bearings, platter materials, tolerances, head skew, etc) .. couple that with the fact that dense systems aren't so great when they fail .. so I guess we're right back to choosing the right systems for the right purposes (ZFS does some great things around failure detection and workarounds) .. but i think we've beat that point to death ..

--- .je
Re: [zfs-discuss] Re: Best Practices for StorEdge 3510 Array and ZFS
On Aug 2, 2006, at 17:03, prasad wrote:

Torrey McMahon [EMAIL PROTECTED] wrote: Are any other hosts using the array? Do you plan on carving LUNs out of the RAID5 LD and assigning them to other hosts?

There are no other hosts using the array. We need all the available space (2.45TB) on just one host. One option was to create 2 LUNs and use raidz.

raidz on RAID5 isn't very efficient, and you'll want at least 3 LUNs to do it .. you're calculating double parity and tying up too much of your drive bandwidth. if you're going to some variation of RAID5, the best throughput you'll see is to *either* pick the HW RAID characteristics *or* ZFS raidz .. but not both .. if you want a *lot* of redundancy you could create a bunch of RAID10 volumes and then do a raidz on the zpool - but you're really going to lose a lot of capacity that way. What you really want to do is make efficient use of the array cache *and* the copy on write zfs cache so you're doing mostly memory to memory transfers.

so that leaves us with 2 options (each with slight variations - see the sketch below):

option 1 - raidz: I would use all the disks in the 3510 to make either 4 x 3-disk or 6 x 2-disk R0 volumes and balance them across the controllers (assuming you have 2) .. then create your raidz zpool out of those volumes .. the disadvantage (or advantage, depending on how you look at it) here is that you're not using the parity engine in the 3510 and you can't really hot spare from the array .. the advantage though is the software based error correction you'll be able to do.

option 2 - RAID5: either use the volume you already have, or make 2 R5 volumes if you have 2 controllers to balance the LUNs .. it won't matter if they're the same size or not, and you should only really need 1 global hot spare .. then create a standard zpool with these .. the disadvantage is that you won't get the lovely raidz features .. but the possible advantage is that you've offloaded the parity calculation and workload from the host.

Keep in mind that zfs was originally designed with JBOD in mind .. there's still ongoing discussion on how hw RAID fits into the picture with the new and lovely sw raidz, and whether or not socks will be worn when testing one vs the other ..

--- .je
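roughly, the two options as zpool commands - the LUN names are placeholders for whatever the 3510 actually presents, not from prasad's config:

   # option 1: raidz across 4 x 3-disk HW R0 volumes,
   # balanced 2 per controller
   zpool create tank raidz c2t40d0 c2t40d1 c3t41d0 c3t41d1

   # option 2: plain striped zpool across 2 HW RAID5 LUNs - parity
   # and global hot sparing handled by the array controllers
   zpool create tank c2t40d0 c3t41d0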
Re: [zfs-discuss] ZFS vs. Apple XRaid
On Aug 1, 2006, at 03:43, [EMAIL PROTECTED] wrote:

So what does this exercise leave me thinking? Is Linux 2.4.x really screwed up in NFS-land? This Solaris NFS replaces a Linux-based NFS server that the clients (linux and IRIX) liked just fine.

Yes; the Linux NFS server and client work together just fine, but generally only because the Linux NFS server replies that writes are done before they are committed to disk (async operation). The Linux NFS client is not optimized for servers which do not do this, and it appears to write little before waiting for the commit replies.

Well .. linux clients with linux servers tend to be slightly better behaved since the server essentially fudges on the commit, and the async cluster count is generally higher (it won't switch on every operation like Solaris will by default). Additionally there's a VM issue in the page-writeback code that seems to affect write performance and RPC socket performance when there's a high dirty page count. Essentially, as pages are flushed there's a higher number of NFS commit operations, which will tend to slow down the Solaris NFS server (and probably the txgs or zil as well, with the increase in synchronous behaviour.)

On the linux 2.6 VM - the number of commits has been seen to rise dramatically when the dirty page count is between 40-90% of the overall system memory .. by tuning the dirty page ratio (vm.dirty_ratio) back down to 10% there's typically less time spent in page-writeback and the overall async throughput should rise .. this wasn't really addressed until 2.6.15 or 2.6.16, so you might also get better results on a later kernel.

Watching performance between a linux client and a linux server - the linux server seems to buffer the NFS commit operations .. of course the clients will also buffer as much as they can - so you can end up with some unbelievable performance numbers both on the filesystem layers (before you do a sync) and on the NFS client layers as well (until you unmount/remount.)

Overall, I find that the Linux VM suffers from many of the same sorts of large memory performance problems that Solaris used to face before priority paging in Solaris 2.6 and the subsequent page coloring schemes. Based on my unscientific mac powerbook performance observations - i suspect that there could be similar issues with various iterations of the BSD or Darwin kernels - but I haven't taken the initiative to really study any of this.

So to wrap up: when doing linux client / solaris server NFS .. I'll typically tune the client for 32KB async tcp transfers (you have to dig into the kernel source to increase this, and it's not really worth it), tune the VM to reduce time spent in the kludgy page-writeback (typically a sysctl setting for the dirty page ratio or some such), and then increase the nfs:nfs3_async_clusters and nfs:nfs4_async_clusters to something higher than 1 .. say 32 x 32KB transfers to get you to 1MB .. you can also increase the number of threads and the read-ahead on the server to eke out some more performance. (a sketch of all that is below)

I'd also look at tuning the volblocksize and recordsize as well as the stripe width on your array to 32K or reasonable multiples .. but I'm not sure how much of the issue is in misaligned I/O blocksizes between the various elements vs mandatory pauses or improper behaviour incurred from miscommunication ..

--- .je
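for reference, the knobs above end to end - a sketch only, with illustrative values, and names that can vary by kernel and Solaris release:

   # linux client: 32KB NFSv3 transfers over tcp
   mount -t nfs -o nfsvers=3,tcp,rsize=32768,wsize=32768 \
       server:/export/data /mnt/data

   # linux client: shrink the dirty page ratio to cut page-writeback stalls
   sysctl -w vm.dirty_ratio=10

   # solaris: /etc/system - raise the async cluster counts
   set nfs:nfs3_async_clusters = 32
   set nfs:nfs4_async_clusters = 32

   # solaris server: more nfsd threads (/etc/default/nfs)
   NFSD_SERVERS=256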
Re: [zfs-discuss] 3510 JBOD ZFS vs 3510 HW RAID
On Aug 1, 2006, at 14:18, Torrey McMahon wrote:

(I hate when I hit the Send button when trying to change windows)

Eric Schrock wrote: On Tue, Aug 01, 2006 at 01:31:22PM -0400, Torrey McMahon wrote: The correct comparison is done when all the factors are taken into account. Making blanket statements like "ZFS JBODs are always ideal" or "ZFS on top of a raid controller is a bad idea" or "SATA drives are good enough" without taking into account the amount of data, access patterns, numbers of hosts, price, performance, data retention policies, audit requirements ... is where I take issue.

Then how are blanket statements like "That said a 3510 with a raid controller is going to blow the door, drive brackets, and skin off a JBOD in raw performance" not offensive as well?

Who said anything about offensive? I just said I take issue with such statements in the general sense of trying to compare boxes to boxes, or when making blanket statements such as "X always works better on Y". The specific question was around a 3510 JBOD having better performance than a 3510 with a raid controller. That's where I said the raid controller performance was going to be better.

just to be clear .. we're talking about a 3510 JBOD with ZFS (i guess you could run pass-through on the controller or just fail the batteries on the cache) vs a 3510 with the raid controller turned on .. I'd tend to agree with Torrey, mainly since well designed RAID controllers will generally do a better job with their own back-end on aligning I/O for efficient full-stripe commits .. without battery backed memory on the host, CoW is still going to need synchronous I/O somewhere for guaranteed writes - and there goes a fraction of your gain.

Don't get me wrong .. CoW is key for a lot of the cool features and amazing functionality in ZFS and I like it .. it's just not generally considered a high performance I/O technique for many cases when we're talking about committing bits to spinning rust. And while it may be great for asynchronous behaviour, unless we want to reveal some amazing discovery that reverses years of I/O development - it seems to me that when we fall to synchronous behaviour, the invalidation of the filesystem's page cache will always play a factor in the overall reduction of throughput.

OK .. I can see that we can eliminate the read/modify/write penalty and write hole problem at the storage layer .. but so does battery backed array cache, with the real limiting factor ultimately being the latency between the cache through the back-end loops to the spinning disk. (I would argue that low cache latency and under-saturated drive channels matter more than the sheer amount of coherent cache.) Speaking in high generalities, the problem almost always works its way down to how well an array solution balances properly aligned I/O with the response time between cache, across the back-end loops, to the spindles and any inherent latency there or in between.

OK .. I can see that ZFS is a nice arbitrator and is working its way into some of the drive mechanics, but there is still some reliance on the driver stack for determining the proper transport saturation and back-off. And great - we're making more inroads with transaction groups and an intent log - that's wonderful .. and we've done a lot of cool things along the way .. maybe by the time we're done we can move the code to a minimized Solaris build on dedicated hardware .. and build an array solution (with a built in filesystem) .. that's big .. and round .. and rolls fast .. and then we can call it ..
(thump thump thump) .. the zwheel :)

--- .je
Re: [zfs-discuss] zfs vs. vxfs
On Jul 30, 2006, at 23:44, Malahat Qureshi wrote:

Does anyone have a comparison between zfs and vxfs? I'm working on a presentation for my management on this ---

That can be a tough question to answer, depending on what you're looking for .. you could take the feature comparison approach, like the one you'll find on wikipedia, which i think has already been mentioned here:

http://en.wikipedia.org/wiki/File_system_comparison

agreed, it's only a small subset, and generally feature comparisons get heavily used in marketing campaigns for some sort of mudslinging or feature bashing. Of course there's always something that doesn't really get addressed when you take a spreadsheet or bullet point approach.

Or you could take the microbenchmark approach with something like Richard's filebench project (a quick example is below):

http://opensolaris.org/os/community/performance/filebench/

IMO the latter is more of a step in the right direction, but the problem sets may be very different depending on your applications - it can be a tough decision to determine which numbers matter the most when you have to make tradeoffs .. your best approach is typically to try to decide on some form of CTQs (critical-to-quality requirements) for your applications or organizations that take into account the relevant factors (administration, volume management, storage platforms, performance, recovery, operating systems, etc) and match up features and performance considerations concurrently.

I think you'll find that ZFS is an amazing fit for most applications, but in cases where you may think you need directio or non-buffered sorts of behaviour .. you could be at a slight disadvantage. Of course Sun also offers QFS as another high performance alternative .. but like the old mantra we've all heard too many times now .. (everyone together) .. It all depends on what you're trying to do ..

--- .je (* disappears back into the mist *)
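if you do go the filebench route, a minimal interactive run looks something like this - varmail is one of the stock personalities, and the target directory here is just an example:

   filebench> load varmail
   filebench> set $dir=/tank/fbtest
   filebench> run 60

that gives you a mixed small-file, mail-server style workload for 60 seconds; swap in other stock personalities (oltp, webserver, fileserver, etc) to get closer to your own applications.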
Re: [zfs-discuss] Re: ZFS questions (hybrid HDs)
On Jun 21, 2006, at 11:05, Anton B. Rang wrote:

My guess from reading between the lines of the Samsung/Microsoft press release is that there is a mechanism for the operating system to pin particular blocks into the cache (e.g. to speed boot) and the rest of the cache is used for write buffering. (Using it as a read cache doesn't buy much compared to using the normal drive cache RAM for that, and might also contribute to wear, which is why read caching appears to be under OS control rather than automatic.)

Actually, Microsoft has been posting a bit about this for the upcoming Vista release .. WinHEC '06 had a few interesting papers, and it looks like Microsoft is going to be introducing SuperFetch, ReadyBoost, and ReadyDrive .. mentioned here:

http://www.microsoft.com/whdc/system/sysperf/accelerator.mspx

The ReadyDrive paper seems to outline their strategy on the industry Hybrid Drive push and the recent t13.org adoption of the ATA-ACS8 command set:

http://www.microsoft.com/whdc/device/storage/hybrid.mspx

It also looks like they're aiming at some sort of driver level PriorityIO scheme, which should play nicely into lower level tiered hardware in an attempt at more intelligent read/write caching:

http://www.microsoft.com/whdc/driver/priorityio.mspx

--- .je
Re: [zfs-discuss] Re: ZFS and Storage
On Jun 28, 2006, at 12:32, Erik Trimble wrote:

The main reason I don't see ZFS mirror / HW RAID5 as useful is this:

ZFS mirror / RAID5:
  capacity = (N/2) - 1
  speed = (N/2) - 1
  minimum # disks to lose before loss of data: 4
  maximum # disks to lose before loss of data: (N/2) + 2

shouldn't that be capacity = ((N-1)/2) ?

loss of a single disk would cause a rebuild on the R5 stripe, which could affect performance on that side of the mirror. Generally speaking, good RAID controllers will dedicate processors and channels to calculate the parity and write it out, so you're not impacted from the host access PoV. There is a similar sort of CoW behaviour that can happen between the array cache and the drives, but in the ideal case you're dealing with this in dedicated hw instead of shared hw.

ZFS mirror / HW Stripe:
  capacity = N/2
  speed = N/2
  minimum # disks to lose before loss of data: 2
  maximum # disks to lose before loss of data: (N/2) + 1

Given a reasonable number of hot-spares, I simply can't see the (very) marginal increase in safety given by using HW RAID5 as outweighing the considerable speed hit RAID5 takes.

I think you're comparing this to software R5, or at least badly implemented array code, and divining that there is a considerable speed hit when using R5. In practice this is not always the case, provided that the response time and interaction between the array cache and drives is sufficient for the incoming stream. By moving your operation to software you're now introducing more layers between the CPU, L1/L2 cache, memory bus, and system bus before you get to the interconnect, and further latencies on the storage port and underlying device (virtualized or not.)

Ideally it would be nice to see ZFS style improvements in array firmware, but given the state of embedded Solaris and the predominance of 32bit controllers - I think we're going to have some issues. We'd also need to have some sort of client mechanism to interact with the array if we're talking about moving the filesystem layer out there .. just a thought (a worked example of the two capacity formulas is below)

Jon E
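to see why both capacity formulas can be right (a worked example, with N = 12 drives picked purely for illustration): mirroring two 6-drive HW RAID5 LUNs leaves (12/2) - 1 = 5 drives of usable capacity - that's (N/2)-1, row 4 in the chart earlier in this digest. Building one 12-drive RAID5 and then mirroring on the same drives leaves (12-1)/2 = 5.5 drives - that's (N-1)/2, row 5, the "parity and mirror on same drives" case. The two expressions describe two different layouts rather than contradicting each other.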
Re: [zfs-discuss] Re: disk write cache, redux
On Jun 15, 2006, at 06:23, Roch Bourbonnais - Performance Engineering wrote:

Naively I'd think a write cache should not help a throughput test, since the cache should fill up, after which you should still be throttled by the physical drain rate. You clearly show that it helps; does anyone know why/how a cache helps throughput?

7200 RPM disks are typically IOP bound - so the write cache (which can be up to 16MB on some drives) should be able to buffer enough I/O to deliver more efficiently on each IOP and also reduce head seeking. Not sure which vendors implement write-through when the cache fills, or how detailed the drive cache algorithms on SATA can get ..

Take a look at PSARC 2004/652:
http://www.opensolaris.org/os/community/arc/caselog/2004/652/

(a quick way to inspect a drive's write cache is below)

.je
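if you want to poke at a drive's write cache state on Solaris, format's expert mode exposes it on disks whose firmware supports the cache mode pages - a sketch only, since the exact menus vary by drive and transport:

   # format -e
   (select the disk, then)
   format> cache
   cache> write_cache
   write_cache> display
   write_cache> enable
   write_cache> disable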