[zfs-discuss] ZFS: A general question
Hello everyone,

I'm new to ZFS and OpenSolaris, and I've been reading the docs on ZFS (the PDF "ZFS: The Last Word in File Systems" and Wikipedia, of course), and I'm trying to understand something.

So ZFS is self-healing, correct? This is accomplished via parity and/or metadata of some sort on the disk, right? So it protects against data corruption, but not against disk failure. Or is it the case that ZFS intelligently puts the parity and/or metadata on alternate disks to protect against disk failure, even without a RAID array?

Anyway, you can add mirrored, striped, raidz, or raidz2 arrays to the pool, right? But you can't effortlessly grow or shrink such a protected array if you wanted to add a disk or two to increase your protected storage capacity. My understanding is that if you want to add storage to a RAID array, you must copy all your data off the array, destroy the array, recreate it with your extra disk(s), then copy all your data back.

I like the idea of a protected storage pool that can grow and shrink effortlessly, but if protecting your data against drive failure is not as effortless, then honestly, what's the point? In my opinion, the ease of use should be nearly that of the Drobo product.

Which brings me to my final question: is there a GUI tool available? I can use the command line just like the next guy, but GUIs sure are convenient...

Thanks for your help!
-Steve
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
On Sat, May 24, 2008 at 3:21 AM, Richard Elling [EMAIL PROTECTED] wrote:
> Consider a case where you might use large, slow SATA drives (1 TByte,
> 7,200 rpm) for the main storage, and a single small, fast (36 GByte,
> 15k rpm) drive for the L2ARC. This might provide a reasonable
> cost/performance trade-off.

In this case (or in any other case where a cache device is used), does the cache improve write performance or only reads? I presume it cannot increase write performance, as the cache is considered volatile, so a write couldn't be committed until the data had left the cache device?

From the ZFS admin guide [1]: "Using cache devices provide the greatest performance improvement for random read-workloads of mostly static content."

I'm not sure if that means no performance increase for writes, or just not very much?

[1] http://docs.sun.com/app/docs/doc/817-2271/gaynr?a=view

-- Hugh Saunders
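For reference, Richard's example configuration would be set up with something like the following at pool-creation time (the pool name and device names here are invented for illustration):

  # zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 cache c2t0d0

or a cache device can be added to an existing pool later with "zpool add tank cache c2t0d0".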
Re: [zfs-discuss] help with a BIG problem,
No, this is a 64-bit system (Athlon 64) with a 64-bit kernel, of course.
Re: [zfs-discuss] help with a BIG problem,
So, I think I've narrowed it down to two things:

* ZFS tries to destroy the dataset every time the pool is imported, because the last time it didn't finish destroying it.
* In this process, ZFS makes the kernel run out of memory and die.

So I thought of two options, but I'm not sure if I'm right:

Option 1: Destroy is an atomic operation

If destroy is atomic, then I guess what it's trying to do is look up all the blocks that need to be deleted/unlinked/released/freed (not sure which is the word). After it has that list, it will write it to the ZIL (remember, this is just what I suppose -- correct me if I'm wrong!) and start to physically delete the blocks, until the operation is done and finally committed. If this is the case, then the process will be restarted from scratch every time the system is rebooted. But I read that apparently in previous versions, rebooting while destroying a clone that was taking too long made the clone reappear intact the next time. This, and the fact that zpool iostat shows only reads and no (or very few) writes, is what led me to think this is how it works.

So if this is the case, I'd like to abort this destroy. After importing the pool, I will have everything as it was, and maybe I can delete snapshots before the clone's parent snapshot -- maybe that will speed up the destroy process -- or just leave the clone.

Option 2: Destroy is not atomic

By this I don't mean that if the operation is canceled it will finish in an incomplete state, but that if the system is rebooted, the operation will RESUME at the point it had reached when the machine died. If this is the case, maybe I can write a script that reboots the computer after a fixed amount of time, and run it on boot:

  zpool import xx
  sleep 20
  rm /etc/zfs/zpool.cache
  sleep 1800
  reboot

This will work under the assumption that the list of blocks to be deleted is flushed to the ZIL or something before boot, to allow the operation to restart at the same point. This is a very nasty hack, but it may do the trick -- only in a very slow fashion: zpool iostat shows 1 MB/s of reads while it's doing the destroy. The dataset in question has 450 GB, which means the operation will take 5 days to finish if it needs to read the whole dataset to destroy it, or 7 days if it also needs to go through the other snapshots (600 GB total).

So, my only viable option seems to be to abort this. How can I do that? Disable the ZIL, maybe? Delete the ZIL? Scrub after this?

Thanks,
Hernán
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
On Fri, May 23, 2008 at 05:26:34PM -0500, Bob Friesenhahn wrote:
> On Fri, 23 May 2008, Bill McGonigle wrote:
>> The remote-disk cache makes perfect sense. I'm curious if there are
>> measurable benefits for caching local disks as well? NAND-flash SSD
>> drives have good 'seek' and slow transfer, IIRC, but that might still
>> be useful for lots of small reads where seek is everything.
>
> NAND-flash SSD drives also wear out. They are not very useful as a cache
> device which is written to repetitively. A busy server could likely wear
> one out in just a day or two, unless the drive contains aggressive
> hardware-based wear leveling, in which case it might survive a few more
> days, depending on how large the device is. Cache devices are usually
> much smaller and run a lot hotter than a normal filesystem.

Someone (Gigabyte, are you listening?) needs to make something like the iRAM, only with more capacity, and bump it up to 3.0 Gbps. SAS would be nice, since you could load a nice controller up with them.

Does anyone make a 3.5" HDD-format RAM disk system that isn't horribly expensive? Backing to disk wouldn't matter to me, but a battery that could hold at least 30 minutes of data would be nice.

-brian
--
Coding in C is like sending a 3 year old to do groceries. You gotta tell them exactly what you want or you'll end up with a cupboard full of pop tarts and pancake mix. -- IRC User (http://www.bash.org/?841435)
Re: [zfs-discuss] ZFS: A general question
On Sat, May 24, 2008 at 3:12 AM, Steve Hull [EMAIL PROTECTED] wrote:
> Hello everyone, I'm new to ZFS and OpenSolaris [...] But you can't
> effortlessly grow/shrink this protected array if you wanted to add a disk
> or two to increase your protected storage capacity. [...] Which brings me
> to my final question: is there a gui tool available?

You're thinking in terms of a home user. ZFS was designed for an enterprise environment. When they add disks, they don't add one disk at a time; it's a tray at a time at the very least. Because of this, they aren't ever copying data off of the array and back on, and no destruction is needed. You just add a raidz/raidz2 at a time, striped across your 14 disks (or however large the tray of disks is).

The GUI is a web interface. Just point your browser at https://localhost:6789

--Tim
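For example, adding a whole tray as a single raidz2 vdev looks something like this (the pool name and device names here are purely illustrative):

  # zpool add tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0

The pool stays online while the new vdev is added; nothing has to be copied off and back.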
Re: [zfs-discuss] ZFS: A general question
> Anyway you can add mirrored, [...], raidz, or raidz2 arrays to the pool,
> right?

Correct.

> add a disk or two to increase your protected storage capacity.

If it's a protected vdev, like a mirror or raidz, sure... One can force-add a single disk, but then the pool isn't protected until you attach a mirror to that single disk. One can't (currently) remove a vdev (shrink a pool), but one can increase each element of a vdev, increasing the size of the pool while maintaining the number of elements (disk count).

Rob
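For illustration (pool and device names are hypothetical), force-adding a lone disk and then restoring redundancy by attaching a mirror to it would look roughly like:

  # zpool add -f tank c3t0d0
  # zpool attach tank c3t0d0 c4t0d0

The first command leaves the pool with an unprotected single-disk vdev; the second mirrors that disk so the pool is protected again.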
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
> cache improve write performance or only reads?

The L2ARC cache device is for reads... for writes you want the Intent Log:

  The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
  transactions. For instance, databases often require their transactions
  to be on stable storage devices when returning from a system call. NFS
  and other applications can also use fsync() to ensure data stability.
  By default, the intent log is allocated from blocks within the main
  pool. However, it might be possible to get better performance using
  separate intent log devices such as NVRAM or a dedicated disk. For
  example:

    # zpool create pool c0d0 c1d0 log c2d0

  Multiple log devices can also be specified, and they can be mirrored.
  See the EXAMPLES section for an example of mirroring multiple log
  devices. Log devices can be added, replaced, attached, detached, and
  imported and exported as part of the larger pool.

But don't underestimate the speed of several slow vdevs vs one fast vdev.

> Does anyone make a 3.5 HDD format RAM disk system that isn't horribly
> expensive?

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041956.html

Perhaps adding RAM to the system would be more flexible?

Rob
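As the excerpt says, log devices can be mirrored; assuming a pool named tank and illustrative device names, a mirrored log would be added with something like:

  # zpool add tank log mirror c2d0 c3d0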
Re: [zfs-discuss] ZFS: A general question
Hi Steve,

On 24.05.2008 at 10:17, [EMAIL PROTECTED] wrote:

> Hello everyone, I'm new to ZFS and OpenSolaris, and I've been reading the
> docs on ZFS (the pdf The Last Word on Filesystems and wikipedia of
> course), and I'm trying to understand something. So ZFS is self-healing,
> correct? This is accomplished via parity and/or metadata of some sort on
> the disk, right? So it protects against data corruption, but not against
> disk failure.

This is not entirely true, but possible. You can use the copies attribute to have some sort of redundancy on a single disk. But obviously, if you only use a single disk and it breaks completely, data loss cannot be avoided. Even without redundancy features, ZFS provides very good detection of block failures, and snapshots that can be used to avoid accidental deletion/unwanted changes of data.

> Or is it the case that ZFS intelligently puts the parity and/or metadata
> on alternate disks to protect against disk failure, even without a raid
> array?

You do not need a hardware RAID array to get these features, and you can theoretically even use partitions/slices on a single disk, but to get good protection and acceptable performance you will need multiple drives, since a drive can always fail in a way that leaves it completely unusable (i.e. it does not spin up anymore).

> Anyway you can add mirrored, striped, raidz, or raidz2 arrays to the pool,
> right? But you can't effortlessly grow/shrink this protected array if you
> wanted to add a disk or two to increase your protected storage capacity.

A group of redundant disks is called a vdev -- this is probably what you call an array. A vdev can be built from disks, files, iSCSI targets or partitions. Several vdevs form a storage pool. You can increase the size of a pool by adding extra vdevs or by replacing all members of a vdev with bigger ones.

> My understanding is that if you want to add storage to a raid array, you
> must copy all your data off the array, destroy the array, recreate it
> with your extra disk(s), then copy all your data back.

This is currently true for shrinking a pool and for changing the number of devices in a raidz1/2 vdev. Some efforts have been made to change that -- see http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Theoretically it should also be possible to evacuate vdevs (and remove them from a pool), but I do not think any code has been written to do so. The main reason is that Sun's paying customers are probably reasonably happy to just add a vdev to increase storage, so other features are much higher on their priority list.

> I like the idea of a protected storage pool that can grow and shrink
> effortlessly, but if protecting your data against drive failure is not as
> effortless, then honestly, what's the point? In my opinion, the ease of
> use should be nearly that of the Drobo product. Which brings me to my
> final question: is there a gui tool available? I can use command line
> just like the next guy, but gui's sure are convenient...

I'd say the point is "first things first". Sun provides a free, reasonably manageable, very robust storage concept that does not have all desirable features (yet). For a nice GUI tool you might have to wait for Mac OS X 10.6 ;-)

Hope this helps,
ralf
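For reference, the copies attribute Ralf mentions is set per dataset, e.g. (the dataset name is illustrative):

  # zfs set copies=2 tank/home

Note that it only applies to data written after the property is set, and it still doesn't help if the whole disk dies.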
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
On Sat, May 24, 2008 at 4:00 PM, [EMAIL PROTECTED] wrote:
>> cache improve write performance or only reads?
>
> L2ARC cache device is for reads... for write you want Intent Log

Thanks for answering my question; I had seen mention of intent log devices, but wasn't sure of their purpose.

If only one significantly faster disk is available, would it make sense to slice it and use one slice for L2ARC and another slice for the ZIL? Or would that cause horrible thrashing?

-- Hugh Saunders
Re: [zfs-discuss] ZFS: A general question
I like the link you sent along... They did a nice job with that (but it does show that mixing and matching vastly different drive sizes is not exactly optimal...): http://www.drobo.com/drobolator/index.html

Doing something like this for ZFS -- letting people create pools by mixing/matching single drives, mirrors (raid1), and raidz/raidz2 vdevs in a zpool -- would make for a pretty cool page. If one of the statistical gurus could add MTBF, mean time to data loss, etc. as a calculator at the bottom, that would be even better. (Someone did some static graphs for different Thumper configurations for this in the past... This would just make that more general-purpose/GUI-driven... Sounds like a cool project.)

No mention anywhere of removing drives and thereby reducing capacity, though... RAID re-striping isn't all that much fun, especially with larger drives (and even ZFS lacks some features in this area for now).

See the answer to your other question below (from their FAQ).

-- MikeE

  What file systems does Drobo support?
  RESOLUTION: Drobo is a USB external disk array that is formatted by the
  host operating system (Windows or OS X). We currently support NTFS,
  HFS+, and FAT32 file systems with firmware revision 1.0.2. Drobo is not
  a ZFS file system.
  STATUS: Current specification 1.0.2
  Applies to: Drobo DRO4D-U

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Steve Hull
Sent: Saturday, May 24, 2008 7:00 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS: A general question

[original message snipped]
Re: [zfs-discuss] ZFS: A general question
OK, so in my (admittedly basic) understanding of raidz and raidz2, these technologies are very similar to raid5 and raid6. BUT if you set up one disk as a raidz vdev, you (obviously) can't maintain data after a disk failure, but you are protected against data corruption that is NOT a result of disk failure. Right?

So is there a resource somewhere that I could look at that clearly spells out how many disks I could have vs. how much resulting space I would have that would still protect me against disk failure (a la the Drobolator, http://www.drobo.com/drobolator/index.html)? I mean, if I have a raidz vdev with one disk, then I add a disk, am I protected from disk failure? Is it the case that I need to have disks in groups of 4 to maintain protection against single disk failure with raidz, and in groups of 5 for raidz2? It gets even more confusing if I wanted to add disks of varying sizes...

And you said I could add a disk (or disks) to a mirror -- can I force-add a disk (or disks) to a raidz or raidz2? Without destroying and rebuilding, as I read would be required somewhere else?

And if I create a zpool and add various single disks to it (without creating raidz/mirror/etc), is it the case that the zpool is essentially functioning like a spanning RAID? I.e., no protection at all??

Please either point me to an existing resource that spells this out a little clearer or give me a little more explanation around it.

And... do you think that the Drobo (www.drobo.com) product is essentially just a box with OpenSolaris and ZFS on it?
Re: [zfs-discuss] ZFS: A general question
Sooo... I've been reading a lot in various places. The conclusion I've drawn is this: I can create raidz vdevs in groups of 3 disks and add them to my zpool to be protected against 1 drive failure. This is the current status of growing protected space in raidz.

Am I correct here?
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
Hugh Saunders wrote:
> If only one significantly faster disk is available, would it make sense
> to slice it and use a slice for L2ARC and a slice for ZIL? or would that
> cause horrible thrashing?

I wouldn't recommend this configuration. As you say, it would thrash the head. Log devices mainly need to write fast, as they are only ever read once, on reboot, if there are uncommitted transactions. Cache devices, on the other hand, require fast reads, as their writes can be done slowly and asynchronously. So a common device sliced for both purposes wouldn't work well unless it was fast for both reads and writes and had minimal seek times (NVRAM, solid-state disk).

Neil.
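With two separate fast devices, the split Neil describes would look roughly like this (pool and device names are illustrative):

  # zpool add tank log c3d0
  # zpool add tank cache c4d0

The log device only needs fast writes and the cache device only needs fast reads, so neither fights the other for the disk head.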
Re: [zfs-discuss] ZFS: A general question
Steve Hull wrote:
> Sooo... I've been reading a lot in various places. The conclusion I've
> drawn is this: I can create raidz vdevs in groups of 3 disks and add them
> to my zpool to be protected against 1 drive failure. This is the current
> status of growing protected space in raidz. Am I correct here?

Correct. Here's some quick summary information:

A POOL is made of 1 or more VDEVs. Pools consisting of more than 1 vdev stripe data across all the vdevs. Vdevs may be freely added to any pool, but cannot currently be removed from a pool.

When a vdev is added to a pool, data on the existing vdevs is not automatically re-distributed. That is, say you have 3 vdevs of 1GB each and add another vdev of 1GB: the system does not immediately attempt to re-distribute the data on the original 3 vdevs. It will re-balance the data as you WRITE to the pool. Thus, if you expand a pool like this, it is a good idea to copy the data around, i.e.:

  cp -r /zpool/olddir /zpool/newdir
  rm -rf /zpool/olddir

If there is more than 1 vdev in a pool, the pool's capacity is determined by the smallest device. Thus, if you have a 2GB, a 3GB, and a 5GB device in a pool, the pool's capacity is 3 x 2GB = 6GB, as ZFS will only do full stripes. Thus, there really is no equivalent to concatenation in other RAID solutions. However, if you replace ALL devices in a pool with larger ones, ZFS will automatically expand the pool size. Thus, if you replaced the 2GB and 3GB devices in the above case with 4GB (or larger) devices, the pool would automatically appear to be 3 x 4GB = 12GB.

A VDEV can consist of:
  - any file
  - any disk slice/partition
  - a whole disk (preferred!)
  - a special sub-device: raidz/raidz1/raidz2/mirror/cache/log/spare

For the special sub-devices, here's a summary:

raidz (synonym raidz1):
  You must provide at LEAST 3 storage devices (where a file, slice, or
  disk is a storage device). 1 device's capacity is consumed by parity;
  however, parity is scattered around the devices, so this is roughly
  analogous to RAID-5. Currently, devices CANNOT be added to or removed
  from a raidz. It is possible to increase the size of a raidz by
  replacing each drive, ONE AT A TIME, with a larger drive, but altering
  the NUMBER of drives is not possible.

raidz2:
  You must have at LEAST 4 storage devices. 2 devices' capacity is
  consumed by parity. Like raidz, parity is scattered around the devices,
  improving I/O performance. Roughly analogous to RAID-6. Altering a
  raidz2 works exactly like altering a raidz.

mirror:
  You must provide at LEAST 2 storage devices. All data is replicated
  across all devices, acting as a normal mirror. You can attach or detach
  devices from a mirror at will, so long as they are at least as big as
  the original mirror.

spare:
  Indicates a device which can be used as a hot spare.

log:
  Indicates an Intent Log, which is basically a transactional log of
  filesystem operations. Generally speaking, this is used only for
  certain high-performance cases and tends to be used in association with
  enterprise-level devices, such as solid-state drives.

cache:
  Similar to an Intent Log, this provides a place to cache filesystem
  internals (metadata such as directory/file attributes); usually used in
  situations similar to log devices.

All pools store redundant metadata, so they can automatically detect and repair most faults in metadata.
If your vdev is a raidz, raidz2 or mirror, it stores redundant data (which allows it to recover from losing a disk), so it can automatically detect AND repair block-level faults.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
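Tying this back to the original question, the growth paths Erik describes translate into commands roughly like the following (pool and device names are illustrative):

  # zpool create tank raidz c1t0d0 c1t1d0 c1t2d0
  # zpool add tank raidz c2t0d0 c2t1d0 c2t2d0
  # zpool replace tank c1t0d0 c3t0d0

The first command builds a pool from one 3-disk raidz vdev, the second grows the pool by striping in a second raidz vdev, and the third is the per-disk replacement used (one disk at a time, waiting for each resilver) to grow an existing vdev onto larger drives.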