Re: [zfs-discuss] ZFS and Storage
The vdev can handle dynamic LUN growth, but the underlying VTOC or EFI label may need to be zeroed and reapplied if you set up the initial vdev on a slice. If you introduced the entire disk to the pool you should be fine, but I believe you'll still need to offline/online the pool.

Fine, at least the vdev can handle this... I asked about this feature in October and hoped that it would be implemented when integrating ZFS into Sol10U2: http://www.opensolaris.org/jive/thread.jspa?messageID=11646 Does anybody know when this feature is finally coming? This would keep the number of LUNs low on the host, especially as device names can be really ugly (long!). //Mika
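In case it's useful, the offline/online cycle mentioned above, sketched for a hypothetical pool named tank; a pool-wide export/import works even without redundancy:

# zpool export tank
# zpool import tank

For a single device inside a mirrored vdev, 'zpool offline tank c2t1d0' followed by 'zpool online tank c2t1d0' is the per-device equivalent (device name hypothetical). Note that neither makes the pool grow automatically in current bits, which is exactly the missing feature discussed here.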
[zfs-discuss] Re: ZFS and Storage
I'm a little confused by the first poster's message as well, but you lose some benefits of ZFS if you don't create your pools with either RAID1 or RAIDZ, such as data corruption detection. The array isn't going to detect that because all it knows about are blocks.

That's the dilemma: the array provides nice features like RAID1 and RAID5, but those are of no real use when using ZFS. The advantages of using ZFS on such an array are, e.g., the sometimes huge write cache available, the use of consolidated storage, and, in SAN configurations, cloning and sharing storage between hosts. The price of course is additional administrative overhead (lots of microcode updates, more components that can fail in between, etc.). Also, in bigger companies there usually is a team of storage specialists who mostly do not know about the applications running on top of the storage, or do not care... (as in: here you have your bunch of gigabytes...) //Mika
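A minimal sketch of what "create your pools with RAID1" looks like on top of two array LUNs, assuming hypothetical device names c2t0d0 and c3t0d0:

# zpool create tank mirror c2t0d0 c3t0d0

With a plain stripe ('zpool create tank c2t0d0 c3t0d0') ZFS still detects a block whose checksum doesn't match, but it has no second copy to repair it from.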
[zfs-discuss] ZFS + NFS performance?
Hi, I've just started using ZFS + NFS, and I was wondering if there is anything I can do to optimise it for use as a mailstore? (small files, lots of them, with lots of directories and high concurrent access) So any ideas guys? P
Re: [zfs-discuss] ZFS and Storage
but there may not be filesystem space for double the data. Sounds like there is a need for a zfs-defragment-file utility perhaps? Or if you want to be politically cagey about naming choice, perhaps zfs-seq-read-optimize-file? :-)

For data warehouse and streaming applications a seq-read optimization could bring additional performance. For normal databases this should be benchmarked...

This brings me back to another question. We have a production database that is cloned at every end of month for end-of-month processing (currently with a feature on our storage array). I'm thinking about a ZFS version of this task. Requirements: the production database should not suffer performance degradation whilst running the clone in parallel. As ZFS does not clone all the blocks, I wonder how much the production database will suffer from sharing most of the data with the clone (concurrent access vs. caching). Maybe we need a feature in ZFS to do a full clone (that is, copy all blocks) inside the pool if performance is an issue, just like the Quick Copy vs. Shadow Image features on HDS arrays...
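For reference, the ZFS version of the month-end task would start from a snapshot and clone; a minimal sketch, assuming the database lives in a hypothetical filesystem tank/proddb:

# zfs snapshot tank/proddb@eom200606
# zfs clone tank/proddb@eom200606 tank/eomdb

The clone shares all unmodified blocks with tank/proddb, which is precisely the concurrent-access concern raised above. There is no one-step full-copy clone today, although sending the snapshot into a new filesystem with zfs send/receive would get the same effect at the cost of copying everything.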
Re: [zfs-discuss] Re: ZFS and Storage
That's the dilemma: the array provides nice features like RAID1 and RAID5, but those are of no real use when using ZFS.

RAID5 is not a nice feature when it breaks. A RAID controller cannot guarantee that all bits of a RAID5 stripe are written when power fails; then you have data corruption and no one can tell you what data was corrupted. ZFS RAIDZ can.

The advantages of using ZFS on such an array are, e.g., the sometimes huge write cache available, the use of consolidated storage, and, in SAN configurations, cloning and sharing storage between hosts.

Are huge write caches really an advantage? Or are you talking about huge write caches with non-volatile storage?

The price of course is additional administrative overhead (lots of microcode updates, more components that can fail in between, etc.). Also, in bigger companies there usually is a team of storage specialists who mostly do not know about the applications running on top of the storage, or do not care... (as in: here you have your bunch of gigabytes...)

True enough. Casper
Re: [zfs-discuss] ZFS + NFS performance?
On Tue, Jun 27, 2006 at 10:14:06AM +0200, Patrick wrote: Hi, I've just started using ZFS + NFS, and I was wondering if there is anything I can do to optimise it for use as a mailstore? (small files, lots of them, with lots of directories and high concurrent access) So any ideas guys?

check out this thread, which may answer some of your questions: http://www.opensolaris.org/jive/thread.jspa?messageID=40617 sounds like your workload is very similar to mine. is all public access via NFS? also, check out this blog entry from Roch: http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs for small file workloads, setting recordsize to a value lower than the default (128k) may prove useful. grant.
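A minimal sketch of the recordsize suggestion, assuming the maildirs live in a hypothetical filesystem tank/mail and typical messages are a few KB:

# zfs set recordsize=8k tank/mail
# zfs get recordsize tank/mail

The best value depends on the message size distribution, so it is worth benchmarking a few settings against a copy of the real mail spool.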
Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS
Chris Csanady writes: On 6/26/06, Neil Perrin [EMAIL PROTECTED] wrote: Robert Milkowski wrote On 06/25/06 04:12,: Hello Neil, Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP Chris, NP The data will be written twice on ZFS using NFS. This is because NFS NP on closing the file internally uses fsync to cause the writes to be NP committed. This causes the ZIL to immediately write the data to the intent log. NP Later the data is also committed as part of the pool's transaction group NP commit, at which point the intent log blocks are freed. NP It does seem inefficient to doubly write the data. In fact for blocks NP larger than zfs_immediate_write_sz (was 64K but now 32K after 6440499 was fixed) NP we write the data block and also an intent log record with the block pointer. NP During txg commit we link this block into the pool tree. By experimentation NP we found 32K to be the (current) cutoff point. As the nfsds write at most 32K NP they do not benefit from this.

Is 32KB easily tuned (mdb?)? I'm not sure. NFS folk?

I think he is referring to the zfs_immediate_write_sz variable, but NFS will support larger block sizes as well. Unfortunately, since the maximum IP datagram size is 64k, after headers are taken into account, the largest useful value is 60k. If this is to be laid out as an indirect write, will it be written as 32k+16k+8k+4k blocks? If so, this seems like it would be quite inefficient for RAID-Z, and writes would best be left at 32k. Chris

I think the 64K issue refers to UDP. That limits the max block size NFS may use. But with TCP mounts, NFS is not bounded by this. It should be possible to adjust the NFS blocksize up. For this I think you need to adjust nfs4_bsize on the client:

# echo nfs4_bsize/W 0t131072 | mdb -kw

And it could also help to tune up the transfer size:

# echo nfs4_max_transfer_size/W 0t131072 | mdb -kw

I also wonder if general-purpose NFS exports should not have their recordsize set to 32K in order to match the default NFS bsize. But I have not really looked at this perf yet. -r
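If anyone wants to read the current values back before poking in new ones, a minimal sketch (variables as named above; availability depends on the kernel build):

# echo nfs4_bsize/D | mdb -k
# echo zfs_immediate_write_sz/D | mdb -k

Note that mdb treats a bare numeric operand as hexadecimal, hence the 0t prefix for decimal in the /W writes above. The writes patch the live kernel only and will not survive a reboot without a matching /etc/system entry.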
Re: [zfs-discuss] ZFS + NFS performance?
grant beattie wrote: On Tue, Jun 27, 2006 at 10:14:06AM +0200, Patrick wrote: Hi, I've just started using ZFS + NFS, and I was wondering if there is anything I can do to optimise it for use as a mailstore? (small files, lots of them, with lots of directories and high concurrent access) So any ideas guys?

check out this thread, which may answer some of your questions: http://www.opensolaris.org/jive/thread.jspa?messageID=40617 sounds like your workload is very similar to mine. is all public access via NFS? also, check out this blog entry from Roch: http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs for small file workloads, setting recordsize to a value lower than the default (128k) may prove useful.

So what about software development (like Solaris :-) where we've got lots of small files that we might be editing (the biggest might be 128k), but when it comes time to compile, we can be writing out megabytes of data? Has anyone done a build of OpenSolaris over NFS served by ZFS and compared it with a local ZFS build? How do both of those compare with UFS? Darren
Re: [zfs-discuss] ZFS + NFS performance?
Hi, sounds like your workload is very similar to mine. is all public access via NFS?

Well it's not 'public' directly; courier-imap/pop3/postfix/etc... but the maildirs are accessed directly by some programs for certain things.

for small file workloads, setting recordsize to a value lower than the default (128k) may prove useful.

When changing things like recordsize, can I do it on the fly on a volume? (and if I can, what happens to the data already on the volume?) Also, another question: when turning compression on, does the data already on the volume become compressed in the background? or is it writes from then on? P
Re: [zfs-discuss] ZFS + NFS performance?
On Tue, Jun 27, 2006 at 11:16:40AM +0200, Patrick wrote: sounds like your workload is very similar to mine. is all public access via NFS? Well it's not 'public' directly; courier-imap/pop3/postfix/etc... but the maildirs are accessed directly by some programs for certain things.

yes, that's what I meant. a notable characteristic of most MTAs is that they are fsync() intensive, which can have an impact on ZFS performance. you will probably want to benchmark your IO pattern with various different configurations.

for small file workloads, setting recordsize to a value lower than the default (128k) may prove useful. When changing things like recordsize, can I do it on the fly on a volume? (and if I can, what happens to the data already on the volume?)

yes, as with most (all?) ZFS properties, the recordsize can be changed on the fly. existing data is unchanged - the modified settings only affect new writes.

Also, another question: when turning compression on, does the data already on the volume become compressed in the background? or is it writes from then on?

as above, existing data remains unchanged. it may be desirable to do such things in the background, because it might be impractical or impossible to do it using regular filesystem access without interrupting applications. the same applies to adding new disks to an existing pool. I think there's an RFE for this sort of operation. grant.
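A minimal sketch of flipping these properties on a live filesystem, assuming a hypothetical tank/mail:

# zfs set compression=on tank/mail
# zfs get compression,compressratio tank/mail

As grant notes, only blocks written after the change are compressed, so compressratio drifts upward gradually as new data arrives.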
Re: [zfs-discuss] ZFS and Storage
Philip Brown writes: Roch wrote: And, if the load can accommodate a reorder, to get top per-spindle read-streaming performance, a cp(1) of the file should do wonders on the layout.

but there may not be filesystem space for double the data. Sounds like there is a need for a zfs-defragment-file utility perhaps? Or if you want to be politically cagey about naming choice, perhaps zfs-seq-read-optimize-file? :-)

Possibly, or maybe using fcntl? Now the goal is to take a file with scattered blocks and order them in contiguous chunks. So this is contingent on the existence of regions of free contiguous disk space. This will get more difficult as we get close to full on the storage. -r
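The cp(1) trick in minimal, hedged form, assuming the application has the file closed and the pool has enough free space (paths hypothetical):

# cp /tank/db/bigfile /tank/db/bigfile.new
# mv /tank/db/bigfile.new /tank/db/bigfile

The copy is rewritten through the normal allocator, so its blocks come out roughly sequential only if large contiguous free regions still exist, which is the caveat Roch raises above.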
Re: [zfs-discuss] Re: ZFS and Storage
Mika Borner writes: RAID5 is not a nice feature when it breaks. Let me correct myself... RAID5 is a nice feature for systems without ZFS... Are huge write caches really an advantage? Or are you talking about huge write caches with non-volatile storage? Yes, you are right. The huge cache is needed mostly because of poor write performance for RAID5 (battery backed, of course)... // Mika

Having a certain amount of non-volatile cache is great for the latency of ZIL operations, which directly impacts some application performance. -r
Re: [zfs-discuss] ZFS + NFS performance?
On Tue, Jun 27, 2006 at 12:07:47PM +0200, Roch wrote: for small file workloads, setting recordsize to a value lower than the default (128k) may prove useful. When changing things like recordsize, can I do it on the fly on a volume? (and if I can, what happens to the data already on the volume?)

You do it on the fly for a given FS (recordsize is not a property of a zvol). Files that were larger than the previous recordsize will not change. Files that were smaller, and thus were stored as a single record, will continue to be stored as a single record until a write makes the file bigger than the current value of recordsize, at which point they are stored as multiple records of the new recordsize. Performance-wise I don't worry too much about these things.

ah, yes. the key here is "until a write makes the file bigger", which would ~never happen given Maildir format mail, as the files are not modified after they are written. they may be unlinked, renamed, or rewritten under a new name - but not modified. grant.
Re: [zfs-discuss] Re: ZFS and Storage
Hello Nathanael,

NB I'm a little confused by the first poster's message as well, but NB you lose some benefits of ZFS if you don't create your pools with NB either RAID1 or RAIDZ, such as data corruption detection. The NB array isn't going to detect that because all it knows about are blocks.

Actually ZFS will detect data corruption even if the pool is not redundant, but it won't repair the data (metadata is protected with 2 and/or 3 copies anyway). -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
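Either way the detection is easy to exercise; a minimal sketch, assuming a pool named tank:

# zpool scrub tank
# zpool status -v tank

A scrub walks every allocated block and verifies its checksum. On an unreplicated pool the errors are only reported; as Robert says, user data cannot be repaired, though metadata (stored in multiple copies) can.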
Re: [zfs-discuss] assertion failure when destroy zpool on tmpfs
Hi, looks like the same stack as 6413847, although that bug points more towards hardware failure. The stack below is from 5.11 snv_38, but this also seems to affect update 2 as per the above bug. Enda

Thomas Maier-Komor wrote: Hi, my colleague is just testing ZFS and created a zpool which had a backing store file on a TMPFS filesystem. After deleting the file everything still worked normally. But destroying the pool caused an assertion failure and a panic. I know this is neither a real-life scenario nor a good idea. The assertion failure occurred on Solaris 10 update 2. Below is some mdb output, in case someone is interested in this. BTW: great to have Solaris 10 update 2 with ZFS. I can't wait to deploy it. Cheers, Tom

> ::panicinfo
     cpu                1
  thread      2a100ea7cc0
 message assertion failed: vdev_config_sync(rvd, txg) == 0, file: ../../common/fs/zfs/spa.c, line: 2149
  tstate       4480001601
      g1      30037505c40
      g2               10
      g3                2
      g4                2
      g5                3
      g6               16
      g7      2a100ea7cc0
      o0          11eb1e8
      o1      2a100ea7928
      o2          306f5b0
      o3      30037505c50
      o4          3c7a000
      o5               15
      o6      2a100ea6ff1
      o7          105e560
      pc          104220c
     npc          1042210
       y               10
> ::stack
vpanic(11eb1e8, 13f01d8, 13f01f8, 865, 600026d4ef0, 60002793ac0)
assfail+0x7c(13f01d8, 13f01f8, 865, 183e000, 11eb000, 0)
spa_sync+0x190(60001f244c0, 3dd9, 600047f3500, 0, 2a100ea7cc4, 2a100ea7cbc)
txg_sync_thread+0x130(60001f9c580, 3dd9, 2a100ea7ab0, 60001f9c6a0, 60001f9c692, 60001f9c690)
thread_start+4(60001f9c580, 0, 0, 0, 0, 0)
> ::status
debugging crash dump vmcore.0 (64-bit) from ai
operating system: 5.11 snv_38 (sun4u)
panic message: assertion failed: vdev_config_sync(rvd, txg) == 0, file: ../../common/fs/zfs/spa.c, line: 2149
dump content: kernel pages only
Re: [zfs-discuss] Re: ZFS and Storage
Does it make sense to solve these problems piecemeal?
- Performance: ZFS algorithms and NVRAM
- Error detection: ZFS checksums
- Error correction: ZFS RAID1 or RAIDZ

Nathanael Burton wrote: If you've got hardware raid-5, why not just run regular (non-raid) pools on top of the raid-5? I wouldn't go back to JBOD. Hardware arrays offer a number of advantages over JBOD:
- disk microcode management
- optimized access to storage
- large write caches
- RAID computation can be done in specialized hardware
- SAN-based hardware products allow sharing of storage among multiple hosts. This allows storage to be utilized more effectively.

I'm a little confused by the first poster's message as well, but you lose some benefits of ZFS if you don't create your pools with either RAID1 or RAIDZ, such as data corruption detection. The array isn't going to detect that because all it knows about are blocks.

-- Jeff Victor, Sun Microsystems; OS Ambassador, Sr. Technical Specialist. Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
Re: [zfs-discuss] Re: ZFS and Storage
Yes, but the idea of using software raid on a large server doesn't make sense in modern systems. If you've got a large database server that runs a large Oracle instance, using CPU cycles for RAID is counterproductive. Add to that the need to manage the hardware directly (drive microcode, drive brownouts/restarts, etc.) and the idea of using JBOD in modern systems starts to lose value in a big way. You will detect any corruption when doing a scrub. It's not end-to-end, but it's no worse than today with VxVM.

On Jun 26, 2006, at 6:09 PM, Nathanael Burton wrote: If you've got hardware raid-5, why not just run regular (non-raid) pools on top of the raid-5? [...] -Nate

- Gregory Shaw, IT Architect, Sun Microsystems Inc.
Re: [zfs-discuss] ZFS and Storage
Most controllers support a background scrub that will read a volume and repair any bad stripes. This addresses the bad block issue in most cases. It still doesn't help when a double failure occurs. Luckily, that's very rare. Usually, in that case, you need to evacuate the volume and try to restore what was damaged.

On Jun 26, 2006, at 6:40 PM, Eric Schrock wrote: On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote: You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid.

True for total drive failure, but there are more failure modes than that. With hardware RAID, there is no way for the RAID controller to know which block was bad, and therefore it cannot repair the block. With RAID-Z, we have the integrated checksum and can do combinatorial analysis to know not only which drive was bad, but what the data _should_ be, and can repair it to prevent more corruption in the future. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock

- Gregory Shaw, IT Architect, Sun Microsystems Inc.
Re: [zfs-discuss] ZFS and Storage
Bart Smaalders wrote: Gregory Shaw wrote: On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote: How would ZFS self-heal in this case?

You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid. If you've got requirements for surviving an array failure, the recommended solution in that case is to mirror between volumes on multiple arrays. I've always liked software raid (mirroring) in that case, as no manual intervention is needed in the event of an array failure. Mirroring between discrete arrays is usually reserved for mission-critical applications that cost thousands of dollars per hour in downtime.

In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array) all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not.

Careful here when you say "wrong place". There are many scenarios where mirroring in the hardware is the correct way to go even when running ZFS on top of it.
Re: [zfs-discuss] Re: ZFS and Storage
Peter Rival wrote: storage arrays with the same arguments over and over without providing an answer to the customer problem doesn't do anyone any good. So. I'll restate the question. I have a 10TB database that's spread over 20 storage arrays that I'd like to migrate to ZFS. How should I configure the storage array? Let's at least get that conversation moving...

I'll answer your question with more questions: What do you use just now: ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other? What about it doesn't work for you? What functionality of ZFS is it that you want to leverage? -- Darren J Moffat
Re: [zfs-discuss] ZFS and Storage
Unfortunately, a storage-based RAID controller cannot detect errors which occurred between the filesystem layer and the RAID controller, in either direction - in or out. ZFS will detect them through its use of checksums. But ZFS can only fix them if it can access redundant bits. It can't tell a storage device to provide the redundant bits, so it must use its own data protection system (RAIDZ or RAID1) in order to correct errors it detects.

Gregory Shaw wrote: Most controllers support a background scrub that will read a volume and repair any bad stripes. This addresses the bad block issue in most cases. It still doesn't help when a double failure occurs. Luckily, that's very rare. Usually, in that case, you need to evacuate the volume and try to restore what was damaged. [...]

-- Jeff Victor, Sun Microsystems
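A minimal sketch of giving ZFS its own redundant bits on top of array LUNs, assuming three hypothetical RAID5 LUNs c4t0d0, c4t1d0 and c4t2d0:

# zpool create tank raidz c4t0d0 c4t1d0 c4t2d0

Any block that fails its checksum on read can then be reconstructed from the remaining LUNs and rewritten, which is the repair path described above; a two-LUN ZFS mirror achieves the same at a higher space cost.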
Re: [zfs-discuss] Re: ZFS and Storage
Peter Rival wrote: See, telling folks you should just use JBOD when they don't have JBOD and have invested millions to get to the state they're in, where they're efficiently utilizing their storage via a SAN infrastructure, is just plain one big waste of everyone's time. Shouting down the advantages of storage arrays with the same arguments over and over without providing an answer to the customer problem doesn't do anyone any good. So. I'll restate the question. I have a 10TB database that's spread over 20 storage arrays that I'd like to migrate to ZFS. How should I configure the storage array? Let's at least get that conversation moving...

In general, I'd say that if the storage has battery-backed cache, use RAID5 on the storage device - limit the amount of redundant data, but improve performance and achieve some data protection in fast special-purpose hardware. Just my $.02. -- Jeff Victor, Sun Microsystems
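One hedged reading of Jeff's suggestion for the 20-array case: let each array build RAID5 LUNs in hardware and stripe ZFS across them, e.g. one LUN per array (device names hypothetical):

# zpool create dbpool c4t0d0 c5t0d0 c6t0d0

This keeps the space overhead on the array side but leaves ZFS with detection only; if self-healing of user data matters more than capacity, the same LUNs could instead be paired into ZFS mirrors across arrays.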
Re: [zfs-discuss] ZFS and Storage
Not at all. ZFS is a quantum leap in Solaris filesystem/VM functionality. However, I don't see a lot of use for RAID-Z (or Z2) in large enterprise customer situations. For instance, does ZFS enable Sun to walk into an account and say "You can now replace all of your high-end (EMC) disk with JBOD"? I don't think many customers would bite on that. RAID-Z is an excellent feature; however, it doesn't address many of the reasons for using high-end arrays:
- Exporting snapshots to alternate systems (for live database or backup purposes)
- Remote replication
- Sharing of storage among multiple systems (LUN masking and equivalent)
- Storage management (migration between tiers of storage)
- No-downtime failure replacement (the system doesn't even know)
- Clustering

I know that ZFS is still a work in progress, so some of the above may arrive in future versions of the product. I see the RAID-Z[2] value in small-to-mid size systems where the storage is relatively small and you don't have high availability requirements.

On Jun 27, 2006, at 8:48 AM, Darren J Moffat wrote: So everything you are saying seems to suggest you think ZFS was a waste of engineering time since hardware raid solves all the problems? I don't believe it does, but I'm no storage expert and maybe I've drunk too much Kool-Aid. I'm a software person, and for me ZFS is brilliant: it is so much easier than managing any of the hardware raid systems I've dealt with. -- Darren J Moffat

- Gregory Shaw, IT Architect, Sun Microsystems Inc.
Re: [zfs-discuss] ZFS and Storage
This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted. If you don't trust your storage subsystem, you're going to encounter issues regardless of the software used to store the data. We'll have to see if ZFS can 'save' customers in this situation. I've found that regardless of the storage solution in question, you can't anticipate all issues, and when a brownout or other ugly loss-of-service occurs, you may or may not be intact, ZFS or no. I've never seen a product that can deal with all possible situations.

On Jun 27, 2006, at 9:01 AM, Jeff Victor wrote: Unfortunately, a storage-based RAID controller cannot detect errors which occurred between the filesystem layer and the RAID controller, in either direction - in or out. ZFS will detect them through its use of checksums. But ZFS can only fix them if it can access redundant bits. It can't tell a storage device to provide the redundant bits, so it must use its own data protection system (RAIDZ or RAID1) in order to correct errors it detects. [...]

- Gregory Shaw, IT Architect, Sun Microsystems Inc.
Re: [zfs-discuss] ZFS and Storage
This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted.

But there's a big difference between the time ZFS gets the data and the time your typical storage system gets it. And your typical storage system does not store any information which allows it to detect all but the most simple errors. Storage systems are complicated and have many failure modes at many different levels:
- disks not writing data, or writing data to an incorrect location
- disks not reporting failures when they occur
- bit errors in disk write buffers causing data corruption
- storage array software with bugs
- storage arrays with undetected hardware errors
- data corruption in the path (such as switches which mangle packets but keep the TCP checksum intact)

If you don't trust your storage subsystem, you're going to encounter issues regardless of the software used to store data. We'll have to see if ZFS can 'save' customers in this situation. I've found that regardless of the storage solution in question you can't anticipate all issues, and when a brownout or other ugly loss-of-service occurs, you may or may not be intact, ZFS or no. I've never seen a product that can deal with all possible situations.

ZFS attempts to deal with more problems than any of the currently existing solutions by giving end-to-end verification of the data. One of the reasons why ZFS was created was a particular large customer who had data corruption which occurred two years (!) before it was detected. The bad data had migrated and been corrupted further; the good data was no longer available on backups (which weren't very relevant anyway after such a long time).

ZFS tries to give one important guarantee: if the data is bad, we will not return it.

One case in point is the person in MPK with a SATA controller which corrupts memory; he didn't discover this using UFS (except for perhaps a few strange events he noticed). After switching to ZFS he started to find corruption, so now he uses a self-healing ZFS mirror (or RAIDZ). ZFS helps at the low end as much as it does at the high end. I'll bet that ZFS will generate more calls about broken hardware, and fingers will be pointed at ZFS at first because it's the new kid; it will be some time before people realize that the data was rotting all along. Casper
Re: [zfs-discuss] ZFS and Storage
On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote: This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted.

There will always be some place where errors can be introduced and go on undetected. But some parts of the system are more error prone than others, and ZFS targets the most error prone of them: rotating rust. For the rest, make sure you have ECC memory, that you're using secure NFS (with krb5i or krb5p), and the probability of undetectable data corruption errors should be much closer to zero than what you'd get with other systems. That said, there's a proposal to add end-to-end data checksumming over NFSv4 (see the IETF NFSv4 WG list archives). That proposal can't protect meta-data, and it doesn't remove any one type of data corruption error on the client side, but it does on the server side. Nico
Re: Re: [zfs-discuss] Re: ZFS and Storage
On 6/27/06, Erik Trimble [EMAIL PROTECTED] wrote: Darren J Moffat wrote: Peter Rival wrote: storage arrays with the same arguments over and over without providing an answer to the customer problem doesn't do anyone any good. So. I'll restate the question. I have a 10TB database that's spread over 20 storage arrays that I'd like to migrate to ZFS. How should I configure the storage array? Let's at least get that conversation moving... I'll answer your question with more questions: What do you use just now: ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other? What about it doesn't work for you? What functionality of ZFS is it that you want to leverage?

It seems that the big thing we all want (relative to the discussion of moving HW RAID to ZFS) from ZFS is the block checksumming (i.e. how to reliably detect that a given block is bad, and have ZFS compensate). Now, how do we get this when using HW arrays, and not just treat them like JBODs (which is impractical for large SAN and similar arrays that are already configured)? Since the best way to get this is to use a mirror or RAIDZ vdev, I'm assuming that the proper way to get benefits from both ZFS and HW RAID is the following:
(1) ZFS mirror of HW stripes, i.e. zpool create tank mirror hwStripe1 hwStripe2
(2) ZFS RAIDZ of HW mirrors, i.e. zpool create tank raidz hwMirror1 hwMirror2
(3) ZFS RAIDZ of HW stripes, i.e. zpool create tank raidz hwStripe1 hwStripe2

Mirrors of mirrors and raidz of raid5 are also possible, but I'm pretty sure they're considerably less useful than the 3 above. Personally, I can't think of a good reason to use ZFS with HW RAID5; case (3) above seems to me to provide better performance with roughly the same amount of redundancy (not quite true, but close). I'd vote for (1) if you need high performance, at the cost of disk space, (2) for maximum redundancy, and (3) as maximum space with reasonable performance. I'm making a couple of assumptions here:
(a) you have the spare cycles on your hosts to allow for using ZFS RAIDZ, which is a non-trivial cost (though not that big, folks)
(b) your HW RAID controller uses NVRAM (or battery-backed cache), which you'd like to be able to use to speed up writes
(c) your HW RAID's NVRAM speeds up ALL writes, regardless of the configuration of arrays in the HW
(d) having your HW controller present individual disks to the machines is a royal pain (way too many; the HW does other nice things with arrays, etc.)

The case for HW RAID5 with ZFS is easy: when you use iscsi. You get major performance degradation over iscsi when trying to coordinate writes and reads serially over iscsi using RAIDZ. The sweet spot in the iscsi world is to let your targets do RAID5 or whatnot (RAID10, RAID50, RAID6), and combine those into ZFS pools, mirrored or not. There are other benefits to ZFS, including snapshots, easily managed storage pools, and, with iscsi, ease of switching head nodes with a simple export/import.

Erik Trimble, Java System Support, Santa Clara, CA
Re: [zfs-discuss] Re: ZFS and Storage
[EMAIL PROTECTED] wrote: That's the dilemma: the array provides nice features like RAID1 and RAID5, but those are of no real use when using ZFS. RAID5 is not a nice feature when it breaks. A RAID controller cannot guarantee that all bits of a RAID5 stripe are written when power fails; then you have data corruption and no one can tell you what data was corrupted. ZFS RAIDZ can.

That depends on the raid controller. Some implementations use a log *and* a battery backup. In some cases the battery is an embedded UPS of sorts, to make sure the power stays up long enough to take the writes from the host and get them to disk.
Re: [zfs-discuss] Re: ZFS and Storage
Your example would prove more effective if you added: I've got ten databases. Five on AIX, five on Solaris 8.

Peter Rival wrote: I don't like to top-post, but there's no better way right now. This issue has recurred several times and there have been no answers to it that cover the bases. The question is: say I as a customer have a database, let's say it's around 8 TB, all built on a series of high-end storage arrays that _don't_ support the JBOD everyone seems to want - what is the preferred configuration for my storage arrays to present LUNs to the OS for ZFS to consume? Let's say our choices are RAID0, RAID1, RAID0+1 (or 1+0) and RAID5 - that spans the breadth of about as good as it gets. What should I as a customer do? Should I create RAID0 sets and let ZFS self-heal via its own mirroring or RAIDZ when a disk blows in the set? Should I use RAID1 and eat the disk space used? RAID5 and be thankful I have a large write cache - and then which type of ZFS pool should I create over it?

See, telling folks you should just use JBOD when they don't have JBOD and have invested millions to get to the state they're in, where they're efficiently utilizing their storage via a SAN infrastructure, is just plain one big waste of everyone's time. Shouting down the advantages of storage arrays with the same arguments over and over without providing an answer to the customer problem doesn't do anyone any good. So. I'll restate the question. I have a 10TB database that's spread over 20 storage arrays that I'd like to migrate to ZFS. How should I configure the storage array? Let's at least get that conversation moving... - Pete

Gregory Shaw wrote: Yes, but the idea of using software raid on a large server doesn't make sense in modern systems. [...]
Re: [zfs-discuss] ZFS and Storage
Torrey McMahon wrote: ZFS is great for the systems that can run it. However, any enterprise datacenter is going to be made up of many, many hosts running many, many OSes. In that world you're going to consolidate on large arrays and use the features of those arrays where they cover the most ground. For example, if I've got 100 hosts all running different OSes and apps, and I can perform my data replication and redundancy algorithms (in most cases RAID) in one spot, then it will be much more cost-efficient to do it there.

Exactly what I'm pondering. In the near to mid term, Solaris with ZFS can be seen as sort of a storage virtualizer, where it takes disks into ZFS pools and volumes and then presents them to other hosts and OSes via iSCSI, NFS, SMB and so on. At that point, those other OSes can enjoy the benefits of ZFS. In the long term, it would be nice to see ZFS (or its concepts) integrated as the LUN provisioning and backing-store mechanism on hardware RAID arrays themselves, supplanting the traditional RAID paradigms that have been in use for years. /dale
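A minimal sketch of that virtualizer idea with what ships today, assuming a hypothetical pool named tank:

# zfs create -V 50g tank/vols/host1
# zfs create tank/export/home
# zfs set sharenfs=rw tank/export/home

The first command creates a zvol that could be handed out as a block device; an iSCSI or FC target layer in front of it is still outside ZFS itself in this release, whereas the NFS share works out of the box.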
[zfs-discuss] Root password fix on zfs root filesystem
Currently, when the root password is forgotten / munged, I boot from the cdrom into a shell, mount the root filesystem on /mnt and edit /mnt/etc/shadow, blowing away the root password. What is going to happen when the root filesystem is ZFS? Hopefully the same mechanism will be available. Ron Halstead
Re: [zfs-discuss] Root password fix on zfs root filesystem
Ron Halstead wrote: Currently, when the root password is forgotten / munged, I boot from the cdrom into a shell, mount the root filesystem on /mnt and edit /mnt/etc/shadow, blowing away the root password. What is going to happen when the root filesystem is ZFS? Hopefully the same mechanism will be available.

A similar mechanism will do what you want. The only difference is that while booted from the cdrom, you would have to use the zpool import command to import the root pool. Then you can mount the root dataset and modify it as needed. Lori
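A hedged sketch of what that recovery might look like once ZFS root ships, assuming a hypothetical root pool named rootpool (ZFS boot is not yet available, so the exact steps may differ):

# zpool import -R /mnt rootpool
# vi /mnt/etc/shadow
# zpool export rootpool

The -R option imports the pool under an alternate root, so its filesystems mount beneath /mnt rather than over the live CD's own paths.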
Re: [zfs-discuss] ZFS and Storage
Jason Schroeder wrote: Torrey McMahon wrote: [EMAIL PROTECTED] wrote: I'll bet that ZFS will generate more calls about broken hardware and fingers will be pointed at ZFS at first because it's the new kid; it will be some time before people realize that the data was rotting all along.

Ehhh... I don't think so. Most of our customers have HW arrays that have been scrubbing data for years and years, as well as apps on top that have been verifying the data (Oracle, for example). Not to mention there will be a bit of time before people move over to ZFS in the high end.

Ahh... but there is the rub. Today you/we don't *really* know, do we? Maybe there are bad juju blocks, maybe not. Running ZFS, whether in a redundant vdev or not, will certainly turn the big spotlight on and give us the data that checksums matched, or they didn't.

A spotlight on what? How is that data going to get into ZFS? The more I think about this, the more I realize it's going to do little for existing data sets. You're going to have to migrate that data from filesystem X into ZFS first. From that point on, ZFS has no idea if the data was bad to begin with. If you can do an in-place migration then you might be able to weed out some bad physical blocks/drives over time, but I assert that the current disk scrubbing methodologies catch most of those. Yes, it's great for new data sets where you started with ZFS. Sorry if I sound like I'm raining on the parade here, folks. That's not the case, really, and I'm all for the great new features and EAU ZFS gives where applicable.
Re: [zfs-discuss] Solaris 10 6/06 now available for download
Solaris 10u2 was released today. You can now download it from here: http://www.sun.com/software/solaris/get.jsp Does anyone know if ZFS is included in this release? One of my local Sun reps said it did not make it into the u2 release, though I have heard for ages that 6/06 would include it. Thanks!
Re: [zfs-discuss] Solaris 10 6/06 now available for download
Indeed. ZFS is included in Solaris 10 U2. -- Prabahar.

Shannon Roddy wrote: Solaris 10u2 was released today. You can now download it from here: http://www.sun.com/software/solaris/get.jsp Does anyone know if ZFS is included in this release? One of my local Sun reps said it did not make it into the u2 release, though I have heard for ages that 6/06 would include it. Thanks!
Re: [zfs-discuss] Solaris 10 6/06 now available for download
Yup, it's there!

Shannon Roddy said the following on 06/27/06 12:57: Solaris 10u2 was released today. You can now download it from here: http://www.sun.com/software/solaris/get.jsp Does anyone know if ZFS is included in this release? One of my local Sun reps said it did not make it into the u2 release, though I have heard for ages that 6/06 would include it. Thanks!

-- Gary Combs, Technical Marketing, Sun Microsystems, Inc.
Re: [zfs-discuss] Solaris 10 6/06 now available for download
Shannon Roddy wrote: Solaris 10u2 was released today. You can now download it from here: http://www.sun.com/software/solaris/get.jsp Does anyone know if ZFS is included in this release? One of my local Sun reps said it did not make it into the u2 release, though I have heard for ages that 6/06 would include it.

Yes.

[EMAIL PROTECTED]:/home/pwags% more /etc/release
                       Solaris 10 6/06 s10s_u2wos_09a SPARC
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 09 June 2006
[EMAIL PROTECTED]:/home/pwags% zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH   ALTROOT
sse     1.06T    455G    633G    41%  ONLINE   -

Regards, Phil
Re: [zfs-discuss] ZFS and Storage
Nicolas Williams wrote: On Tue, Jun 27, 2006 at 09:41:10AM -0600, Gregory Shaw wrote: This is getting pretty picky. You're saying that ZFS will detect any errors introduced after ZFS has gotten the data. However, as stated in a previous post, that doesn't guarantee that the data given to ZFS wasn't already corrupted. There will always be some place where errors can be introduced and go on undetected. But some parts of the system are more error prone than others, and ZFS targets the most error prone of them: rotating rust. For the rest, make sure you have ECC memory, that you're using secure NFS (with krb5i or krb5p), and the probability of undetectable data corruption errors should be much closer to zero than what you'd get with other systems.

Another alternative is using IPsec with just AH. For the benefit of those outside of Sun MPK17: both krb5i and IPsec AH were used to diagnose and prove that we have a faulty router in a lab that was causing very strange build errors. TCP/IP alone didn't catch the problems, and sometimes they showed up with simple SCCS checksums and sometimes we had compile errors. -- Darren J Moffat
Re: [zfs-discuss] ZFS and Storage
Torrey McMahon wrote: Darren J Moffat wrote: So everything you are saying seems to suggest you think ZFS was a waste of engineering time since hardware raid solves all the problems? I don't believe it does, but I'm no storage expert and maybe I've drunk too much Kool-Aid. I'm a software person, and for me ZFS is brilliant: it is so much easier than managing any of the hardware raid systems I've dealt with.

ZFS is great for the systems that can run it. However, any enterprise datacenter is going to be made up of many, many hosts running many, many OSes. In that world you're going to consolidate on large arrays and use the features of those arrays where they cover the most ground. For example, if I've got 100 hosts all running different OSes and apps, and I can perform my data replication and redundancy algorithms (in most cases RAID) in one spot, then it will be much more cost-efficient to do it there.

but you still need a local file system on those systems in many cases. So back to where we started, I guess: how to effectively use ZFS to benefit Solaris (and the other platforms it gets ported to) while still using hardware RAID, because you have no choice but to use it. -- Darren J Moffat
Re: [zfs-discuss] disk evacuate
Just wondered if there'd been any progress in this area? Correct me if I'm wrong, but as it stands, there's no way to remove a device you accidentally 'zpool add'ed without destroying the pool.

On 12/06/06, Gregory Shaw [EMAIL PROTECTED] wrote: Yes, if zpool remove works like you describe, it does the same thing. Is there a time frame for that feature? Thanks!

On Jun 11, 2006, at 10:21 AM, Eric Schrock wrote: This only seems valuable in the case of an unreplicated pool. We already have 'zpool offline' to take a device and prevent ZFS from talking to it (because it's in the process of failing, perhaps). This gives you what you want for mirrored and RAID-Z vdevs, since there's no data to migrate anyway. We are also planning on implementing 'zpool remove' (for more than just hot spares), which would allow you to remove an entire toplevel vdev, migrating the data off of it in the process. This would give you what you want for the case of an unreplicated pool. Does this satisfy the usage scenario you described? - Eric

On Sun, Jun 11, 2006 at 07:52:37AM -0600, Gregory Shaw wrote: Pardon me if this scenario has been discussed already, but I haven't seen anything as yet. I'd like to request a 'zpool evacuate pool device' command. 'zpool evacuate' would migrate the data from a disk device to other disks in the pool. Here's the scenario: Say I have a small server with 6x146g disks in a jbod configuration. If I mirror the system disk with SVM (currently) and allocate the rest as a non-raidz pool, I end up with 4x146g in a pool of approximately 584GB capacity. If one of the disks is starting to fail, I would need to use 'zpool replace pool old-disk new-disk'. However, since I have no more slots in the machine to add a replacement disk, I'm stuck. This is where a 'zpool evacuate pool device' would come in handy. It would allow me to evacuate the failing device so that it could be replaced and re-added with 'zpool add pool device'. What does the group think?

-- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
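For the failing-disk case that does work today, a minimal sketch, assuming the pool is tank and a spare slot or LUN exists (device names hypothetical):

# zpool replace tank c0t4d0 c0t6d0
# zpool status tank

The first command resilvers the old device's data onto the new one; the second shows resilver progress. What the thread is asking for, shrinking the pool by evacuating a top-level vdev with no replacement, is the part that 'zpool remove' does not yet cover.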
Re: [zfs-discuss] Re: ZFS and Storage
On Tue, 27 Jun 2006, Gregory Shaw wrote: Yes, but the idea of using software RAID on a large server doesn't make sense in modern systems. If you've got a large database server that runs a large Oracle instance, using CPU cycles for RAID is counterproductive. Add to that the need to manage the hardware directly (drive microcode, drive brownouts/restarts, etc.) and the idea of using JBOD in modern systems starts to lose value in a big way. You will detect any corruption when doing a scrub. It's not end-to-end, but it's no worse than today with VxVM. The initial impression I got, after reading the original post, is that its author[1] does not grok something fundamental about ZFS and/or how it works! Or does not understand that there are many CPU cycles in a modern Unix box that are never taken advantage of. It's clear to me that ZFS provides considerable, never before available, features and facilities, and that any impact that ZFS may have on CPU or memory utilization will become meaningless over time, as the # of CPU cores increases - along with their performance. And that average system memory size will continue to increase over time. Perhaps the author is taking a narrow view that ZFS will *replace* existing systems. I don't think that this will be the general case. Especially in a large organization where politics and turf wars will dominate any technical discussions and implementation decisions will be made by senior management who are 100% buzzword compliant (and have questionable technical/engineering skills). Rather it will provide the system designer with a hugely powerful *new* tool to apply in system design. And will challenge the designer to use it creatively and effectively. There is no such thing as the universal screwdriver. Every toolbox has tens of screwdrivers, and tool designers will continue to dream about replacing them all with _one_ tool. [1] Sorry Gregory. Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
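For the scrub point above, the mechanics are a single command (pool name hypothetical):

# zpool scrub tank
# zpool status -v tank

A scrub reads every allocated block and verifies its checksum; with mirror or RAID-Z vdevs ZFS repairs a bad copy from a good one, while on an unreplicated vdev it can only report the damage. 'zpool status -v' shows scrub progress and any errors found.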
[zfs-discuss] This may be a somewhat silly question ...
... but I have to ask. How do I back this up? Here is my definition of a backup: (1) I can copy all data and metadata onto some media in a manner that verifies the integrity of the data and metadata written. (1.1) By verify I mean that the data written onto the media is read back, compared to the source, and accuracy is assured. (2) I can walk away with the media and be able to restore the data onto bare metal with nothing other than Solaris 10 Update 2 (or Nevada) CDROM sets and reasonable hardware. I have a copy of the Solaris ZFS Administration Guide (document 817-2271) - 158 pages, and well worth printing out, I think. Let's suppose that I have a pile of disks arranged in mirrors and everything seems to be going along swimmingly, thus:

# zpool status zfs0
  pool: zfs0
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        zfs0         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c1t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c1t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c1t13d0  ONLINE       0     0     0

errors: No known data errors
#
# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
zfs0                  95.3G  70.8G  27.5K  /export/zfs
zfs0/backup           91.2G  70.8G  88.4G  /export/zfs/backup
zfs0/backup/pasiphae  2.77G  24.2G  2.77G  /export/zfs/backup/pasiphae
zfs0/lotus             786M  70.8G   786M  /opt/lotus
zfs0/zone             3.40G  70.8G  24.5K  /export/zfs/zone
zfs0/zone/common      24.5K  8.00G  24.5K  legacy
zfs0/zone/domino      24.5K  70.8G  24.5K  /opt/zone/domino
zfs0/zone/sugar       3.40G  12.6G  3.40G  /opt/zone/sugar

At this point I attach a tape drive to the machine:

# devfsadm -v -C -c tape
devfsadm[24247]: verbose: symlink /dev/rmt/0 - ../../devices/[EMAIL PROTECTED],0/SUNW,[EMAIL PROTECTED],880/[EMAIL PROTECTED],0:
. . .
devfsadm[24247]: verbose: symlink /dev/rmt/0ubn - ../../devices/[EMAIL PROTECTED],0/SUNW,[EMAIL PROTECTED],880/[EMAIL PROTECTED],0:ubn
# mt -f /dev/rmt/0lbn status
DLT4000 tape drive:
   sense key(0x6)= Unit Attention   residual= 0   retries= 0
   file no= 0   block no= 0
#

I then create a snapshot as per the documentation:

# zfs list zfs0
NAME   USED  AVAIL  REFER  MOUNTPOINT
zfs0  95.3G  70.8G  27.5K  /export/zfs
# date
Tue Jun 27 18:10:36 EDT 2006
# zfs snapshot [EMAIL PROTECTED]
# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
zfs0                  95.3G  70.8G  27.5K  /export/zfs
[EMAIL PROTECTED]          0      -  27.5K  -
zfs0/backup           91.2G  70.8G  88.4G  /export/zfs/backup
zfs0/backup/pasiphae  2.77G  24.2G  2.77G  /export/zfs/backup/pasiphae
zfs0/lotus             786M  70.8G   786M  /opt/lotus
zfs0/zone             3.40G  70.8G  24.5K  /export/zfs/zone
zfs0/zone/common      24.5K  8.00G  24.5K  legacy
zfs0/zone/domino      24.5K  70.8G  24.5K  /opt/zone/domino
zfs0/zone/sugar       3.40G  12.6G  3.40G  /opt/zone/sugar
#

And then I send that snapshot to tape:

# zfs send [EMAIL PROTECTED] > /dev/rmt/0mbn
#

That command ran for maybe 15 seconds. I seriously doubt that 95GB of data was written to tape and verified in that time, although I'd like to see the device and bus that can do it! :-) I'll destroy that snapshot and try something else here:

# zfs destroy [EMAIL PROTECTED]

Now perhaps the mystery is to try a different ZFS filesystem:

# date
Tue Jun 27 18:17:33 EDT 2006
# zfs snapshot zfs0/lotus@18:17Hrs

I'll check the tape drive that did something above, although I have no idea what.
# mt -f /dev/rmt/0mbn status
DLT4000 tape drive:
   sense key(0x0)= No Additional Sense   residual= 0   retries= 0
   file no= 1   block no= 0
#

Now I will send that stream to the tape:

# zfs send zfs0/lotus@18:17Hrs > /dev/rmt/0mbn

The tape is now doing something again and I don't know what. I would like to think that when it is done I can walk to a totally new machine and restore the ZFS filesystem zfs0/lotus with no issue, but I don't see a verify step here anywhere, and I really have no idea what will happen when I hit the end of that tape. I am very bothered that my 95GB zfs0 did not go to tape and I don't know why not. I think that my itty bitty 786MB zfs0/lotus is actually going to tape right now (lights are flashing)
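A plausible explanation for the 15-second send: 'zfs send' of a snapshot of zfs0 captures only what zfs0 itself refers to (27.5K here), not descendant filesystems such as zfs0/backup; each descendant needs its own snapshot and send. As for verification, nothing is verified automatically, but a checksum can be carried alongside the stream. A rough sketch, assuming the zfs0/lotus snapshot above, a rewound tape with this as its first file, and digest(1) from Solaris 10:

# mt -f /dev/rmt/0mbn rewind
# zfs send zfs0/lotus@18:17Hrs | tee /dev/rmt/0mbn | digest -a md5 > /tmp/sent.md5
# mt -f /dev/rmt/0mbn rewind
# dd if=/dev/rmt/0mbn bs=1024k 2>/dev/null | digest -a md5 > /tmp/read.md5
# cmp /tmp/sent.md5 /tmp/read.md5 && echo stream verified

Restoring on another machine would then be reading the same file back into 'zfs receive', e.g. dd if=/dev/rmt/0mbn bs=1024k | zfs receive newpool/lotus (pool name hypothetical).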
Re: [zfs-discuss] Re: ZFS and Storage
Al Hopper wrote: On Tue, 27 Jun 2006, Gregory Shaw wrote: Yes, but the idea of using software RAID on a large server doesn't make sense in modern systems. If you've got a large database server that runs a large Oracle instance, using CPU cycles for RAID is counterproductive. Add to that the need to manage the hardware directly (drive microcode, drive brownouts/restarts, etc.) and the idea of using JBOD in modern systems starts to lose value in a big way. You will detect any corruption when doing a scrub. It's not end-to-end, but it's no worse than today with VxVM. The initial impression I got, after reading the original post, is that its author[1] does not grok something fundamental about ZFS and/or how it works! Or does not understand that there are many CPU cycles in a modern Unix box that are never taken advantage of. Just because there are idle CPU cycles does not mean it is OK for the operating system to use them. If there is a valid reason for the OS to consume those cycles, then that is fine. But every cycle that the OS consumes is one less cycle that is available for the customer's apps (be it Oracle or whatever, and I spend a lot of my time trying to squeeze those cycles out of the high-end systems). The job of the operating system is to get the hell out of the way as quickly as possible so the user apps can do their work. That can mean offloading some of the work onto smart arrays. As someone once said to me, a customer does not buy hardware to run an OS on; they buy it to accomplish some given piece of work. It's clear to me that ZFS provides considerable, never before available, features and facilities, and that any impact that ZFS may have on CPU or memory utilization will become meaningless over time, as the # of CPU cores increases - along with their performance. And that average system memory size will continue to increase over time. This is true, will probably be true forever, and has been going on ever since the first chip. There has always been demand for more power from the end users. However, just because we have available cycles does not mean the OS should consume them. Perhaps the author is taking a narrow view that ZFS will *replace* existing systems. I don't think that this will be the general case. Especially in a large organization where politics and turf wars will dominate any technical discussions and implementation decisions will be made by senior management who are 100% buzzword compliant (and have questionable technical/engineering skills). Rather it will provide the system designer with a hugely powerful *new* tool to apply in system design. And will challenge the designer to use it creatively and effectively. It all depends on your needs. The idea of ZFS providing RAID capabilities is very appealing for systems that are desktop units or small servers. But where we are talking petabyte+ storage with 30+ gig/sec of IO bandwidth capacity, I believe we will find the CPUs are going to consume way too much to handle the IO rate in such an environment, at which time the work needs to be offloaded to smart arrays (I have yet to do that experimentation). You do not buy an 18-wheel tractor-trailer simply to move a lawnmower from job site to job site; you buy an SUV, pickup truck or trailer. Vice versa, you do not buy a pickup truck to move a tracked excavator; you use a tractor-trailer. There is no such thing as the universal screwdriver.
Every toolbox has tens of screwdrivers, and tool designers will continue to dream about replacing them all with _one_ tool. How true. ZFS is one of many tools available. However, the impression I have been picking up here at various times is that a lot of people view ZFS as the only tool in the toolbox; thus everything is looking like a nail because all you have is a hammer. If ZFS is providing better data integrity than the current storage arrays, that sounds to me like an opportunity for the next generation of intelligent arrays to become better. Dave Valin [1] Sorry Gregory. Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS
Robert Milkowski wrote On 06/27/06 03:00,: Hello Chris, Tuesday, June 27, 2006, 1:07:31 AM, you wrote: CC On 6/26/06, Neil Perrin [EMAIL PROTECTED] wrote: Robert Milkowski wrote On 06/25/06 04:12,: Hello Neil, Saturday, June 24, 2006, 3:46:34 PM, you wrote: NP Chris, NP The data will be written twice on ZFS using NFS. This is because NFS NP on closing the file internally uses fsync to cause the writes to be NP committed. This causes the ZIL to immediately write the data to the intent log. NP Later the data is also committed as part of the pool's transaction group NP commit, at which point the intent log blocks are freed. NP It does seem inefficient to doubly write the data. In fact for blocks NP larger than zfs_immediate_write_sz (was 64K but now 32K after 6440499 was fixed) NP we write the data block and also an intent log record with the block pointer. NP During txg commit we link this block into the pool tree. By experimentation NP we found 32K to be the (current) cutoff point. As the nfsds write at most 32K NP they do not benefit from this. Is 32KB easily tuned (mdb?)? I'm not sure. NFS folk? CC I think he is referring to the zfs_immediate_write_sz variable, but Exactly, I was asking about this, not NFS. Sorry for the confusion. The zfs_immediate_write_sz variable was meant for internal use and not really intended for public tuning. However, yes, it could be tuned dynamically anytime using mdb, or set in /etc/system. -- Neil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
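For completeness, a sketch of the two tuning mechanisms Neil mentions, using the 32K value (this is an unstable internal variable, so the name and default may change between builds):

# echo 'zfs_immediate_write_sz/Z 0x8000' | mdb -kw

or, to persist across reboots, a line in /etc/system:

set zfs:zfs_immediate_write_sz = 0x8000

The mdb form patches the running kernel immediately; the /etc/system form takes effect at the next boot.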
Re: [zfs-discuss] Supporting ~10K users on ZFS
Steve Bennett wrote: OK, I know that there's been some discussion on this before, but I'm not sure that any specific advice came out of it. What would the advice be for supporting a largish number of users (10,000 say) on a system that supports ZFS? We currently use VxFS and assign a user quota, and backups are done via Legato Networker. Using lots of filesystems is definitely encouraged - as long as doing so makes sense in your environment. From what little I currently understand, the general advice would seem to be to assign a filesystem to each user, and to set a quota on that. I can see this being OK for small numbers of users (up to 1000 maybe), but I can also see it being a bit tedious for larger numbers than that. I just tried a quick test on Sol10u2:

for x in 0 1 2 3 4 5 6 7 8 9; do
  for y in 0 1 2 3 4 5 6 7 8 9; do
    zfs create testpool/$x$y
    zfs set quota=1024k testpool/$x$y
  done
done

[apologies for the formatting - is there any way to preformat text on this forum?] It ran OK for a minute or so, but then I got a slew of errors: cannot mount '/testpool/38': unable to create mountpoint filesystem successfully created, but not mounted. So, OOTB there's a limit that I need to raise to support more than approx 40 filesystems (I know that this limit can be raised, I've not checked to see exactly what I need to fix). It does beg the question of why there's a limit like this when ZFS is encouraging use of large numbers of filesystems. There is no 40 filesystem limit. You most likely had a pre-existing file/directory in testpool with the same name as the filesystem you tried to create:

fsh-hake# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
testpool   77K  7.81G  24.5K  /testpool
fsh-hake# echo hmm > /testpool/01
fsh-hake# zfs create testpool/01
cannot mount 'testpool/01': Not a directory
filesystem successfully created, but not mounted
fsh-hake#

If I have 10,000 filesystems, is the mount time going to be a problem? I tried:

for x in 0 1 2 3 4 5 6 7 8 9; do
  for y in 0 1 2 3 4 5 6 7 8 9; do
    zfs umount testpool/001
    zfs mount testpool/001
  done
done

This took 12 seconds, which is OK until you scale it up - even if we assume that mount and unmount take the same amount of time, so 100 mounts will take 6 seconds, this means that 10,000 mounts will take 5 minutes. Admittedly, this is on a test system without fantastic performance, but there *will* be a much larger delay on mounting a ZFS pool like this over a comparable UFS filesystem. So this really depends on why and when you're unmounting filesystems. I suspect it won't matter much since you won't be unmounting/remounting your filesystems. I currently use Legato Networker, which (not unreasonably) backs up each filesystem as a separate session - if I continue to use this I'm going to have 10,000 backup sessions on each tape backup. I'm not sure what kind of challenges restoring this kind of beast will present. Others have already been through the problems with standard tools such as 'df' becoming less useful. Is there a specific problem you had in mind regarding 'df'? One alternative is to ditch quotas altogether - but even though disk is cheap, it's not free, and regular backups take time (and tapes are not free either!). In any case, 10,000 undergraduates really will be able to fill more disks than we can afford to provision.
We tried running a Windows fileserver back in the days when it had no support for per-user quotas; we did some ad-hockery that helped to keep track of the worst offenders (albeit after the event), but what really killed us was the uncertainty over whether some idiot would decide to fill all available space with vital research data (or junk, depending on your point of view). I can see the huge benefits that ZFS quotas and reservations can bring, but I can also see that there are situations where ZFS could be useful, but the lack of 'legacy' user-based quotas makes it impractical. If the ZFS developers really are not going to implement user quotas, is there any advice on what someone like me could do - at the moment I'm presuming that I'll just have to leave ZFS alone. I wouldn't give up that easily... it looks like 1 filesystem per user, and 1 quota per filesystem, does exactly what you want:

fsh-hake# zfs get -r -o name,value quota testpool
NAME           VALUE
testpool       none
testpool/ann   10M
testpool/bob   10M
testpool/john  10M
fsh-hake#

I'm assuming that you decided against 1 filesystem per user due to the supposed 40 filesystem limit, which isn't true. eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
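To make that concrete, a small sketch (hypothetical user names; quota values are illustrative) of scripting one filesystem plus quota per user, which scales to the 10,000-user case with a loop over the account list:

# for u in ann bob john; do zfs create testpool/$u; zfs set quota=10M testpool/$u; done
# zfs get -r -o name,value quota testpool

Each user then has an independent filesystem that can also be snapshotted, backed up, or given a reservation individually.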
Re: [zfs-discuss] Supporting ~10K users on ZFS
On Tue, 2006-06-27 at 23:07, Steve Bennett wrote: From what little I currently understand, the general advice would seem to be to assign a filesystem to each user, and to set a quota on that. I can see this being OK for small numbers of users (up to 1000 maybe), but I can also see it being a bit tedious for larger numbers than that. I've seen this discussed; even recommended. I don't think, though - given that ZFS has been available in a supported version of Solaris for about 24 hours or so - that we've yet got to the point of best practice or recommendation. That said, the idea of one filesystem per user does have its attractions. With ZFS - unlike other filesystems - it's feasible. Whether it's sensible is another matter. Still, you could give them a zone each as well... (One snag is that for undergraduates there isn't really an intermediate level - department or research grant, for example - that can be used as the allocation unit.) I just tried a quick test on Sol10u2: for x in 0 1 2 3 4 5 6 7 8 9; do for y in 0 1 2 3 4 5 6 7 8 9; do zfs create testpool/$x$y; zfs set quota=1024k testpool/$x$y; done; done [apologies for the formatting - is there any way to preformat text on this forum?] It ran OK for a minute or so, but then I got a slew of errors: cannot mount '/testpool/38': unable to create mountpoint filesystem successfully created, but not mounted So, OOTB there's a limit that I need to raise to support more than approx 40 filesystems (I know that this limit can be raised, I've not checked to see exactly what I need to fix). It does beg the question of why there's a limit like this when ZFS is encouraging use of large numbers of filesystems. Works fine for me. I've done this up to 16000 or so (not with current bits, that was last year). If I have 10,000 filesystems, is the mount time going to be a problem? I tried: for x in 0 1 2 3 4 5 6 7 8 9; do for y in 0 1 2 3 4 5 6 7 8 9; do zfs umount testpool/001; zfs mount testpool/001; done; done This took 12 seconds, which is OK until you scale it up - even if we assume that mount and unmount take the same amount of time, It's not quite symmetric; I think umount is a fraction slower (it has to check if the filesystem is in use, amongst other things), but the estimate is probably accurate enough. so 100 mounts will take 6 seconds, this means that 10,000 mounts will take 5 minutes. Admittedly, this is on a test system without fantastic performance, but there *will* be a much larger delay on mounting a ZFS pool like this over a comparable UFS filesystem. My test last year got to 16000 filesystems on a server with 1GB of memory before it went ballistic and all operations took infinitely long; I had clearly run out of physical memory. 5 minutes doesn't sound too bad to me. It's an order of magnitude quicker than it took to initialize UFS quotas before UFS logging was introduced. One alternative is to ditch quotas altogether - but even though disk is cheap, it's not free, and regular backups take time (and tapes are not free either!). In any case, 10,000 undergraduates really will be able to fill more disks than we can afford to provision. Last year, before my previous employer closed down, we switched off user disk quotas for 20,000 researchers. The world didn't end. The disks didn't fill up. All the work we had to do managing user quotas vanished. The number of calls to the helpdesk to sort out stupid problems due to applications running out of disk space plummeted to zero.
-- -Peter Tribble L.I.S., University of Hertfordshire - http://www.herts.ac.uk/ http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supporting ~10K users on ZFS
[EMAIL PROTECTED] wrote On 06/27/06 17:17,: We have over 1 filesystems under /home in strongspace.com and it works fine. I forget, but there was a bug or an improvement made around Nevada build 32 (we're currently at 41) that made the initial mount on reboot significantly faster. Before that it was around 10-15 minutes. I wonder if that improvement didn't make it into Sol10U2? That fix (bug 6377670) made it into build 34 and S10_U2. -Jason Sent via BlackBerry from Cingular Wireless -Original Message- From: eric kustarz [EMAIL PROTECTED] Date: Tue, 27 Jun 2006 15:55:45 To: Steve Bennett [EMAIL PROTECTED] Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Supporting ~10K users on ZFS [eric kustarz's reply quoted in full; see the message above]
Re: [zfs-discuss] Re: ZFS and Storage
On Jun 27, 2006, at 3:30 PM, Al Hopper wrote: [Al Hopper's message quoted in full; see above] No insult taken. I was trying to point out that many customers don't have 'free' CPU cycles, and that every little bit you take from their machine for subsystem control is that much real work the system will not be doing. I think of the statement "many CPU cycles in modern Unix boxes that are never taken advantage of" in the same vein as the monitoring vendors: "It's just another agent. It won't take more than 5% of the box." I think we'll let the customer decide on the above. I've encountered both situations: customers with large boxes with plenty of headroom, and customers that run at 100% all day, every day, and have no cycles that aren't dedicated to real work. When I read, as an ex-customer (i.e. not with Sun), that I've got to sacrifice CPU cycles in a software upgrade, it says to me that the upgrade will result in a slower system. -Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive MS 4382, Louisville, CO 80028-4382 [EMAIL PROTECTED] (work) [EMAIL PROTECTED] (home) "When Microsoft writes an application for Linux, I've won." - Linus Torvalds ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss