[zfs-discuss] Why RAID 5 stops working in 2009
Has anyone here read the article "Why RAID 5 stops working in 2009" at http://blogs.zdnet.com/storage/?p=162 ? Does RAID-Z have the same chance of hitting an unrecoverable read error as RAID 5 on Linux when the array has to be rebuilt because of a faulty disk? I imagine so, because of the physical constraints that plague our hard disks. Granted, the chance of failure in my case shouldn't be nearly as high, as I will most likely recruit three or four 750 GB drives, not something on the order of 10 TB. With my OpenSolaris NAS, I will be scrubbing every week for consumer-grade drives (every month for enterprise-grade), as recommended in the ZFS Best Practices Guide. If I run zpool status and see that scrubs are fixing an increasing number of errors, would that mean that the disk is in fact headed toward failure, or could the natural growth of disk usage be to blame?
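For illustration, a weekly scrub can be driven from root's crontab; the pool name "tank" below is hypothetical:

# crontab -e    (add an entry to scrub every Sunday at 03:00)
0 3 * * 0 /usr/sbin/zpool scrub tank

# afterwards, check whether the scrub repaired anything:
# zpool status -v tank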
[zfs-discuss] Cannot replace a replacing device
I had a drive fail and replaced it with a new drive. During the resilvering process the new drive had write faults and was taken offline. These faults were caused by a broken SATA cable (the drive checked out fine with the manufacturer's software). A new cable fixed the failure. However, the drive now shows as faulted. I know the drive is healthy, so I want to force a rescrub, but this won't happen while it is showing FAULTED. I tried to force a replace, but this gives the error "cannot replace a replacing device". So I seem to be in a stuck state, where the replace won't complete. Please help - screen output below.

C3P0# zpool status
  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            ad4                    ONLINE       0     0     0
            ad6                    ONLINE       0     0     0
            replacing              UNAVAIL      0 1.06K     0  insufficient replicas
              1796873336336467178  UNAVAIL      0 1.23K     0  was /dev/ad7/old
              4407623704004485413  FAULTED      0 1.22K     0  was /dev/ad7

errors: No known data errors
C3P0# zpool replace -f tank 4407623704004485413 ad7
cannot replace 4407623704004485413 with ad7: cannot replace a replacing device
C3P0#
Re: [zfs-discuss] Cannot replace a replacing device
Yes - but it does nothing. The drive remains FAULTED.
Re: [zfs-discuss] Cannot replace a replacing device
Thanks for the suggestion, but I have tried detaching and it refuses, reporting "no valid replicas". Capture below.

C3P0# zpool status
  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            ad4                    ONLINE       0     0     0
            ad6                    ONLINE       0     0     0
            replacing              UNAVAIL      0 9.77K     0  insufficient replicas
              1796873336336467178  UNAVAIL      0 11.6K     0  was /dev/ad7/old
              4407623704004485413  FAULTED      0 10.4K     0  was /dev/ad7

errors: No known data errors
C3P0# zpool detach tank 1796873336336467178
cannot detach 1796873336336467178: no valid replicas
C3P0# zpool detach tank 4407623704004485413
cannot detach 4407623704004485413: no valid replicas
Re: [zfs-discuss] Cannot replace a replacing device
Thanks - have run it and it returns pretty quickly. Given the output (attached), what action can I take? Thanks, James

Dirty time logs:

tank
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
raidz
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
/dev/ad4
/dev/ad6
replacing
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
/dev/ad7/old
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6
/dev/ad7
    outage [300718,301073] length 356
    outage [301138,301139] length 2
    outage [301149,301149] length 1
    outage [301151,301153] length 3
    outage [301155,301155] length 1
    outage [301157,301158] length 2
    outage [301182,301182] length 1
    outage [301262,301262] length 1
    outage [301911,301916] length 6
    outage [304063,304063] length 1
    outage [304791,304796] length 6

Metaslabs:
    vdev 0

    offset   spacemap     free
    ------   --------     ----
         0         26    20.0M
         4         52     166M
         8         56    2.66G
         c         65    12.4M
        10         66    20.7M
        14         69    29.1M
        18         73    29.7M
        1c         77    29.6M
        20         81    79.2M
        24         91    87.9M
        28         92    63.2M
        2c         94    94.2M
        30         99     123M
        34        103     523M
        38        107    50.9M
        3c        111     117M
        40        116    54.3M
        44        119    60.2M
        48        123    97.4M
        4c        126    1.20G
        50        129    48.5M
        54        132     106M
        58        137    27.4M
        5c        140    39.6M
        60        146    45.3M
        64        149    34.9M
        68        151     544M
        6c        154    36.6M
        70        156    19.4M
        74        160    35.7M
        78        162    41.2M
        7c        166    23.1M
        9c         74    14.1M
        a0         78    15.2M
        a4         88    28.1M
        a8        174    23.3M
        ac        178    24.2M
        b0        181    26.3M
        b4        100    43.4M
        b8        104    33.6M
        bc        108    30.6M
        c0        113    59.8M
        c4        115    53.9M
        c8        120    30.8M
        cc        124    82.2M
        d0        127    36.9M
        d4        130    76.2M
        d8        133    39.7M
Re: [zfs-discuss] howto: make a pool with ashift=X
Well, for the sake of completeness (and perhaps to help users of snv_151a), there should also be links to the alternative methods:

1) Using a patched-and-recompiled, or already precompiled, zpool binary, i.e.:
http://www.solarismen.de/archives/4-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-1.html
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
http://www.solarismen.de/archives/6-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-3.html
http://www.solarismen.de/archives/9-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-4.html
http://www.kuehnke.de/christian/solaris/zpool-s10u8

2) Making the pool in an alternate OS, such as a FreeBSD LiveCD with its tricks, and then importing/upgrading it in Solaris. See www.zfsguru.org and numerous posts on the internet by its author, sub_mesa (or sub.mesa).

I am not promoting either of these methods. I've used (1) successfully on my OI_148a box with a precompiled binary, and I haven't gotten around to trying (2). Just my 2c :) //Jim
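For illustration, once a pool has been created with one of these methods, the resulting ashift can be inspected with zdb; the pool name "tank" is hypothetical and the exact output layout may differ between builds:

# zdb tank | grep ashift
            ashift: 12

A value of 12 corresponds to 4 KB sectors (2^12), while the default of 9 corresponds to 512-byte sectors.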
[zfs-discuss] Re: zfs panic when unpacking open solaris source
Looks like CR 6411261 "busy intent log runs out of space on small pools". I found this one. I just bumped up the priority. Jim

When unpacking the Solaris source onto a local disk on a system running build 39, I got the following panic:

panic[cpu0]/thread=d2c8ade0: really out of space

d2c8a7b4 zfs:zio_write_allocate_gang_members+3e6 (e4385ac0)
d2c8a7d0 zfs:zio_dva_allocate+81 (e4385ac0)
d2c8a7e8 zfs:zio_next_stage+66 (e4385ac0)
d2c8a800 zfs:zio_checksum_generate+5e (e4385ac0)
d2c8a81c zfs:zio_next_stage+66 (e4385ac0)
d2c8a83c zfs:zio_wait_for_children+46 (e4385ac0, 1, e4385c)
d2c8a850 zfs:zio_wait_children_ready+18 (e4385ac0)
d2c8a864 zfs:zio_next_stage_async+ac (e4385ac0, f8def9d0,)
d2c8a874 zfs:zio_nowait+e (e4385ac0)
d2c8a8d4 zfs:zio_write_allocate_gang_members+341 (e120e0c0)
d2c8a8f0 zfs:zio_dva_allocate+81 (e120e0c0)
d2c8a908 zfs:zio_next_stage+66 (e120e0c0)
d2c8a920 zfs:zio_checksum_generate+5e (e120e0c0)
d2c8a93c zfs:zio_next_stage+66 (e120e0c0)
d2c8a95c zfs:zio_wait_for_children+46 (e120e0c0, 1, e120e2)
d2c8a970 zfs:zio_wait_children_ready+18 (e120e0c0)
d2c8a984 zfs:zio_next_stage_async+ac (e120e0c0, f8def9d0,)
d2c8a994 zfs:zio_nowait+e (e120e0c0)
d2c8a9f4 zfs:zio_write_allocate_gang_members+341 (e3c0a580)
d2c8aa10 zfs:zio_dva_allocate+81 (e3c0a580)
d2c8aa28 zfs:zio_next_stage+66 (e3c0a580)
d2c8aa40 zfs:zio_checksum_generate+5e (e3c0a580)
d2c8aa54 zfs:zio_next_stage+66 (e3c0a580)
d2c8aaa0 zfs:zio_write_compress+236 (e3c0a580)
d2c8aabc zfs:zio_next_stage+66 (e3c0a580)
d2c8aadc zfs:zio_wait_for_children+46 (e3c0a580, 1, e3c0a7)
d2c8aaf0 zfs:zio_wait_children_ready+18 (e3c0a580)
d2c8ab04 zfs:zio_next_stage_async+ac (e3c0a580, 0, f8dbfe)
d2c8ab1c zfs:zio_nowait+e (e3c0a580)
d2c8ab3c zfs:arc_write+7b (e44c9780, d895e8c0,)
d2c8abec zfs:dbuf_sync+5f3 (dbd6ef00, e44c9780,)
d2c8ac4c zfs:dnode_sync+33a (d34fbb30, 1, e44c97)
d2c8ac80 zfs:dmu_objset_sync_dnodes+7e (d2380240, d23802fc,)
d2c8acd0 zfs:dmu_objset_sync+5d (d2380240, e96f1e80)
d2c8ad1c zfs:dsl_pool_sync+121 (d244a180, 15e234, 0)
d2c8ad6c zfs:spa_sync+10a (d895e8c0, 15e234, 0)
d2c8adc8 zfs:txg_sync_thread+1df (d244a180, 0)
d2c8add8 unix:thread_start+8 ()

I now have a chicken-and-egg problem: I need to unpack the source to work out what is going on, but I can't, as the system crashes unless I put it on my external USB drive, and there are some issues with that! Is this a known issue?
Some more data on the file systems:

: sigma IA 4 $; zfs list -r home/cjg
NAME                            USED  AVAIL  REFER  MOUNTPOINT
home/cjg                       7.81G   138M  1.99G  /export/home/cjg
home/[EMAIL PROTECTED]         1.91M      -  1.97G  -
home/[EMAIL PROTECTED]:53:46   2.38M      -  1.97G  -
home/[EMAIL PROTECTED]          433K      -  1.97G  -
home/[EMAIL PROTECTED]          492K      -  1.97G  -
home/[EMAIL PROTECTED]          409K      -  1.97G  -
home/[EMAIL PROTECTED]          474K      -  1.97G  -
home/[EMAIL PROTECTED]          314K      -  1.97G  -
home/[EMAIL PROTECTED]          314K      -  1.97G  -
home/[EMAIL PROTECTED]             0      -  1.97G  -
home/[EMAIL PROTECTED]             0      -  1.97G  -
home/[EMAIL PROTECTED]             0      -  1.97G  -
home/[EMAIL PROTECTED]          253K      -  1.97G  -
home/[EMAIL PROTECTED]          342K      -  1.97G  -
home/[EMAIL PROTECTED]          624K      -  1.98G  -
home/[EMAIL PROTECTED]          429K      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]          146K      -  1.98G  -
home/[EMAIL PROTECTED]          282K      -  1.98G  -
home/[EMAIL PROTECTED]          218K      -  1.98G  -
home/[EMAIL PROTECTED]          300K      -  1.98G  -
home/[EMAIL PROTECTED]          232K      -  1.98G  -
home/[EMAIL PROTECTED]          458K      -  1.98G  -
home/[EMAIL PROTECTED]          462K      -  1.98G  -
home/[EMAIL PROTECTED]          576K      -  1.98G  -
home/[EMAIL PROTECTED]          147K      -  1.98G  -
home/[EMAIL PROTECTED]          147K      -  1.98G  -
home/[EMAIL PROTECTED]          448K      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]          354K      -  1.98G  -
home/[EMAIL PROTECTED]          258K      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]             0      -  1.98G  -
home/[EMAIL PROTECTED]          522K      -  1.98G  -
home/[EMAIL PROTECTED]          615K      -  1.98G  -
home/[EMAIL PROTECTED]          766K      -  1.98G  -
home/[EMAIL PROTECTED]          625K      -  1.98G  -
home/[EMAIL PROTECTED]          565K      -  1.98G  -
home/[EMAIL PROTECTED]          470K      -  1.98G  -
home/[EMAIL PROTECTED]          495K      -  1.98G  -
home/[EMAIL PROTECTED]          305K      -  1.98G  -
home/[EMAIL PROTECTED]          314K      -  1.98G  -
home
[zfs-discuss] Re: RE: [Security-discuss] Proposal for new basic privileges related with
I am also interested in writing some test cases that will check the correct semantics of access checks on files with different permissions and with different privileges set/unset by the process. Are there already file-access test cases at Sun I might expand? Do test suites for OpenSolaris have to be written in a special kind of programming language?

We do extensive file-access testing as part of the ZFS test suite. The test suite is mostly written in ksh scripts with some C code. We should have the test suite available externally via OpenSolaris.org sometime in July or August. In the meantime I would code up your unit tests in ksh so they can be more easily integrated. We'll keep you posted as progress is made in releasing the test suite. Cheers, Jim
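For illustration, a minimal ksh unit test in that spirit might look like the sketch below; the scratch path and the use of the unprivileged "nobody" account are assumptions, not part of the actual suite:

#!/bin/ksh
# Verify that mode 000 blocks an unprivileged read.
f=/var/tmp/acctest.$$
trap 'rm -f $f' EXIT
echo data > $f
chmod 000 $f
if su nobody -c "cat $f" > /dev/null 2>&1; then
        echo "FAIL: read succeeded despite mode 000"
        exit 1
fi
echo "PASS: mode 000 blocked unprivileged read"
exit 0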
[zfs-discuss] Let's get cooking...
http://www.tech-recipes.com/solaris_system_administration_tips1446.html
[zfs-discuss] ZFS components for a minimal Solaris 10 U2 install?
For an embedded application, I'm looking at creating a minimal Solaris 10 U2 image which would include ZFS functionality. In quickly taking a look at the opensolaris.org site under pkgdefs, I see three packages that appear to be related to ZFS: SUNWzfskr, SUNWzfsr, and SUNWzfsu. Is it naive to think that this would be all that is needed for ZFS? Thanks, -- Jim C
Re: [zfs-discuss] Big JBOD: what would you do?
I agree with Greg - for ZFS, I'd recommend a larger number of RAID-Z LUNs, with a smaller number of disks per LUN, up to 6 disks per RAID-Z LUN. This will more closely align with performance best practices, so it would be cool to find common ground in terms of a sweet spot for performance and RAS. /jim

Gregory Shaw wrote:
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns. Using that configuration, a full-width stripe write should be a single operation for each controller. In production, the application needs would probably dictate the resulting disk layout. If the application doesn't need tons of i/o, you could bind more disks together for larger luns...

On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:
ZFS fans, I'm preparing some analyses on RAS for large JBOD systems such as the Sun Fire X4500 (aka Thumper). Since there are zillions of possible permutations, I need to limit the analyses to some common or desirable scenarios. Naturally, I'd like your opinions. I've already got a few scenarios in analysis, and I don't want to spoil the brainstorming, so feel free to think outside of the box. If you had 46 disks to deploy, what combinations would you use? Why? Examples:

46-way RAID-0 (I'll do this just to show why you shouldn't do this)
22x2-way RAID-1+0 + 2 hot spares
15x3-way RAID-Z2+0 + 1 hot spare
...

Because some people get all wrapped up with the controllers, assume 5 8-disk SATA controllers plus 1 6-disk controller. Note: the reliability of the controllers is much greater than the reliability of the disks, so the data availability and MTTDL analysis will be dominated by the disks themselves. In part, this is due to using SATA/SAS (point-to-point disk connections) rather than a parallel bus or FC-AL, where we would also have to worry about bus or loop common-cause failures. I will be concentrating on data availability and MTTDL as two views of RAS. The intention is that the interesting combinations will also be analyzed for performance, and we can complete a full performability analysis on them. Thanks -- richard

- Gregory Shaw, IT Architect
Phone: (303) 673-8273  Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382, Louisville, CO 80028-4382
[EMAIL PROTECTED] (work)  [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
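For illustration, the 8x5-disk raid-z layout Greg describes could be created along these lines; the controller/target names are hypothetical and only the first two vdevs plus a spare are shown:

# zpool create tank \
    raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c1t0d0 \
    raidz c1t1d0 c1t2d0 c1t3d0 c2t0d0 c2t1d0 \
    spare c5t5d0

With each 5-disk raidz vdev spread across controllers like this, a full-width stripe write touches each controller roughly once.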
Re: [zfs-discuss] zfs sucking down my memory!?
I need to read through this more thoroughly to get my head around it, but on my first pass, what jumps out at me is that something significant _changed_ in terms of application behavior with the introduction of ZFS. I'm not saying that is a bad thing, or a good thing, but it is an important thing, and we should try to understand whether application behavior will, in general, change with the introduction of ZFS, so we can advise users accordingly. Joe appears to have been a user of Sun systems for some time, with a lot of experience deploying Solaris 8 and Solaris 9. He has successfully deployed systems without physical swap, and I understand his reason for doing so. If the introduction of Solaris 10 and ZFS means we need to change a system parameter when transitioning from S8 or S9, such as configured swap, we need to understand why, and make sure we understand the performance implications.

Why do you think your performance *improves* if you don't use swap? It is much more likely it *deteriorates*, because your swap accumulates stuff you do not use.

I'm not sure what this is saying, but I don't think it came out right. As I said, I need to do another pass on the information in the messages to get a better handle on the observed behaviour, but this certainly seems like something we should explore further. Watch this space. /jim

At any rate, I don't think adding swap will fix the problem I am seeing, in that ZFS is not releasing its unused cache when applications need it. Adding swap might allow the kernel to move it out of memory, but when the system needs it again it will have to swap it back in, and only performance suffers, no?

Well, you have decided that all application data needs to be memory resident all of the time; but executables don't need to be (they are now tossed out on memory shortage) and ZFS can use less cache than it wants to.

FWIW, here's the current ::memstat and swap output for my system. The reserved number is only about 46M, or about 2% of RAM. Considering the box has 3G, I'm willing to sacrifice 2% in the interest of performance.

Page Summary            Pages      MB   %Tot
Kernel                 249927    1952    64%
Anon                    34719     271     9%
Exec and libs            2415      18     1%
Page cache               1676      13     0%
Free (cachelist)        11796      92     3%
Free (freelist)         88288     689    23%
Total                  388821    3037
Physical               382802    2990

[EMAIL PROTECTED]: swap -s
total: 260008k bytes allocated + 47256k reserved = 307264k used, 381072k available

So there's 47MB of memory which is not used at all. (Adding swap will give you 47MB of additional free memory without anything being written to disk.) Execs are also pushed out on shortfall. There is 265 MB of anon memory and we have no clue how much of it is used at all; a large percentage is likely unused. But OTOH, you have sufficient memory on the freelist, so there is not much of an issue. Casper
Re: [zfs-discuss] ZFS components for a minimal Solaris 10 U2 install?
Included below is a thread which dealt with trying to find the packages necessary for a minimal Solaris 10 U2 install with ZFS functionality. In addition to SUNWzfskr, SUNWzfsr and SUNWzfsu, the SUNWsmapi package needs to be installed: its libdiskmgt.so.1 library is required by the zpool(1M) command. Having found this out via trial and error, I see there is no dependency mentioned for SUNWsmapi in the SUNWzfsr depend file. Apologies if this is nitpicking, but is this missing dependency worthy of submitting a P5 CR? -- Jim C

Jason Schroeder wrote:
Dale Ghent wrote:
On Jun 28, 2006, at 4:27 PM, Jim Connors wrote:
For an embedded application, I'm looking at creating a minimal Solaris 10 U2 image which would include ZFS functionality. In quickly taking a look at the opensolaris.org site under pkgdefs, I see three packages that appear to be related to ZFS: SUNWzfskr, SUNWzfsr, and SUNWzfsu. Is it naive to think that this would be all that is needed for ZFS?

Those packages, as well as what's listed in the depend files for those packages. Ahh, don't you love climbing the dependency tree? /dale

Glenn Brunette wrote a nifty little tool ... you have to assume that all of the dependencies are appropriately doc'ed, of course, cough. http://blogs.sun.com/roller/page/gbrunett?entry=solaris_package_companion /jason
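For illustration, installing the minimal set by hand might look like the following; the install media path is hypothetical:

# pkgadd -d /cdrom/Solaris_10/Product SUNWsmapi SUNWzfskr SUNWzfsr SUNWzfsu

Including SUNWsmapi in the same invocation avoids the undeclared libdiskmgt.so.1 dependency biting zpool(1M) at runtime.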
[zfs-discuss] ZFS state between reboots for RAM rsident OS?
Guys, Thanks for the help so far; now come the more interesting questions ... Piggybacking off of some work being done to minimize Solaris for embedded use, I have a version of Solaris 10 U2 with ZFS functionality with a disk footprint of about 60MB. Creating a miniroot based upon this image, it can be compressed to under 30MB. Currently, I load this image onto a USB keyring and boot from the USB device, running the Solaris miniroot out of RAM. Note: the USB keyring is a hideously slow device, but for the sake of this proof of concept it works fine. In addition, some more packages will need to be added later on (i.e. NFS, Samba?) which will increase the footprint. My ultimate goal here is to demonstrate a network storage appliance using ZFS, where the OS is effectively stateless, or as stateless as possible. ZFS goes a long way in assisting here since, for example, mount and NFS share information can be managed by ZFS. But I suppose it's not as stateless as I thought. Upon booting from the USB device into memory, I can do a `zpool create poo1 c1d0', but a subsequent reboot does not remember this work. Doing a `zpool list' yields 'no pools available'. So the question is, what sort of state is required between reboots for ZFS? Regards, -- Jim C
[zfs-discuss] Re: ZFS state between reboots for RAM rsident OS?
I understand. Thanks. Just curious: ZFS manages NFS shares. Have you given any thought to what might be involved for ZFS to manage SMB shares in the same manner? This all goes towards my stateless OS theme. -- Jim C

Eric Schrock wrote:
You need the following file:

/etc/zfs/zpool.cache

This file 'knows' about all the pools on the system. These pools can typically be discovered via 'zpool import', but we can't do this at boot because:

a. It can be really, really expensive (tasting every disk on the system)
b. Pools can be comprised of files or devices not in /dev/dsk

So, we have the cache file, which must be editable if you want to remember newly created pools. Note this only affects configuration changes to pools - everything else is stored within the pool itself. - Eric

On Tue, Jul 25, 2006 at 12:18:07PM -0400, Jim Connors wrote: [...]

-- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
[zfs-discuss] Re: ZFS state between reboots for RAM rsident OS?
Eric Schrock wrote:
You need the following file: /etc/zfs/zpool.cache

So as a workaround (or, more appropriately, a kludge) would it be possible to:

1. At boot time, do a 'zpool import' of some pool guaranteed to exist; for the sake of this discussion, call it 'system'.
2. Have /etc/zfs/zpool.cache be symbolically linked to /system/ZPOOL.CACHE?

-- Jim C

[...]
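For illustration, an untested sketch of that kludge, run from an early boot script; the pool and path names are hypothetical, and whether later pool creates update the cache correctly through a symlink would need verifying:

# zpool import system
# ln -sf /system/ZPOOL.CACHE /etc/zfs/zpool.cache

Pools recorded in the persistent copy on the 'system' pool would then survive the RAM-resident root being thrown away on reboot.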
Re: [zfs-discuss] Assertion raised during zfs share?
Eric Schrock wrote:
This indicates that share(1M) didn't produce any output, but returned a non-zero exit status. I'm not sure why this would happen - can you run the following by hand?

# share /export
# echo $?

bash-3.00# share
bash-3.00# share /export
bash-3.00# echo $?
0

Looks like the NFS server is not completely configured yet, and that it requires this zfs share stuff to work first.

bash-3.00# svcs -a | grep nfs/server
disabled        6:24:31  svc:/network/nfs/server:default
bash-3.00# more /var/svc/log/network-nfs-server\:default.log
[ Aug 4 06:15:31 Executing start method (/lib/svc/method/nfs-server start) ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 399, function zfs_share
Abort - core dumped
[ Aug 4 06:15:32 Method start exited with status 0 ]
[ Aug 4 06:15:32 Stopping because process dumped core. ]
[ Aug 4 06:15:32 Executing stop method (/lib/svc/method/nfs-server stop 30) ]
[ Aug 4 06:15:32 Method stop exited with status 0 ]
[ Aug 4 06:15:32 Executing start method (/lib/svc/method/nfs-server start) ]
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 399, function zfs_share
Abort - core dumped

-- Jim C

Incidentally, the explicit 'zfs share' isn't needed, as we automatically share the filesystem when the options are set (which did succeed). - Eric

On Fri, Aug 04, 2006 at 12:42:02PM -0400, Jim Connors wrote:
Working to get ZFS to run on a minimal Solaris 10 U2 configuration. In this scenario, ZFS is included in the miniroot which is booted into RAM. When trying to share one of the filesystems, an assertion is raised - see below. If the version of source on OpenSolaris.org matches Solaris 10 U2, then it looks like it's associated with a popen of /usr/sbin/share. Can anyone shed any light on this? Thanks, -- Jim C

# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
SYS            83K   163M  30.5K  /SYS
export        110K  72.8G  25.5K  /export
export/home  24.5K  72.8G  24.5K  /export/home
# zpool list
NAME      SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
SYS       195M    90K    195M    0%  ONLINE  -
export     74G   114K   74.0G    0%  ONLINE  -
# zfs set sharenfs=on export
# zfs share export
Assertion failed: pclose(fp) == 0, file ../common/libzfs_mount.c, line 399, function zfs_share
Abort - core dumped

-- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
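For illustration, once the zfs share assertion is resolved, the NFS service would be brought up through SMF in the usual way; a sketch with standard commands:

# svcadm enable -r svc:/network/nfs/server:default
# svcs -x nfs/server

The -r flag also enables the service's dependencies (rpc/bind and friends), which a stripped-down 47-package image may not have running by default.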
Re: [zfs-discuss] Assertion raised during zfs share?
Richard Elling wrote:
Jim Connors wrote:
Working to get ZFS to run on a minimal Solaris 10 U2 configuration.

What does minimal mean? Most likely, you are missing something. -- richard

Yeah. Looking at package and SMF dependencies, plus a whole lot of trial and error, I've currently got Solaris down to 47 packages. The nfs/server service for Solaris 10 U2 will first try to do a zfs share. For the next step, I'll probably comment out that stuff and see if I can bring up the NFS server code and share a UFS filesystem using the traditional methods. Once that's OK, I'll move on to the ZFS portion and investigate. Thanks, -- Jim C
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
Roch - PAE wrote:
The hard part is getting a set of simple requirements. As you go into more complex data center environments you get hit with older Solaris revs, other OSs, SOX compliance issues, etc. etc. etc. The world where most of us seem to be playing with ZFS is on the lower end of the complexity scale.

I've been watching this thread and unfortunately fit this model. I'd hoped that ZFS might scale enough to solve my problem, but you seem to be saying that it's mostly untested in large-scale environments. About 7 years ago we ran out of inodes on our UFS file systems. We used bFile as middleware for a while to distribute the files across multiple disks, and then switched to VFS on SAN about 5 years ago. Distribution across file systems and inode depletion continued to be a problem, so we switched middleware to another vendor that essentially compresses about 200 files into a single 10 MB archive and uses a DB to find the file within the archive on the correct disk. An expensive, complex and slow, but effective, solution until the latest license renewal, when we got hit with a huge bill. I'd love to go back to a pure file system model, and have looked at Reiser4, JFS, NTFS and now ZFS for a way to support over 100 million small documents and 16 TB. We average 2 file reads and 1 file write per second, 24/7, with expected growth to 24 TB. I'd be willing to scrap everything we have to find a non-proprietary long-term solution. ZFS looked like it might provide an answer. Are you saying it's not really suitable for this type of application?
[zfs-discuss] Re: zfs hot spare not automatically getting used
So is there a command to make the spare get used, or do I have to remove it as a spare and re-add it if it doesn't get used automatically? Is this a bug to be fixed, or will this always be the case when the disks aren't exactly the same size?
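For illustration, a spare can be pressed into service by hand with zpool replace; the pool and device names below are hypothetical:

# zpool replace tank c1t2d0 c1t5d0    (failed device first, then the spare)
# zpool status tank

The spare then shows as INUSE under the spares section while it resilvers.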
[zfs-discuss] Re: zfs hot spare not automatically getting used
I know this isn't necessarily ZFS-specific, but after I reboot I spin the drives back up, yet nothing I do (devfsadm, disks, etc.) can get them seen again until the next reboot. I've got some older SCSI drives in an old Andataco Gigaraid enclosure, which I thought supported hot-swap, but I seem unable to hot-swap them in. The PC has an Adaptec 39160 card in it and I'm running Nevada b51. Is this not a setup that can support hot swap? Or is there something I have to do other than devfsadm to get the SCSI bus rescanned?
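For illustration, on controllers whose HBA driver supports dynamic reconfiguration, the bus can usually be rescanned without a reboot; the controller id c3 is hypothetical:

# cfgadm -al                 (list attachment points and their state)
# cfgadm -c configure c3     (reconfigure devices on that controller)
# devfsadm -Cv               (rebuild /dev links, pruning stale ones)

Whether this works here depends on the Adaptec driver in question actually supporting cfgadm-style reconfiguration.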
[zfs-discuss] Managed to corrupt my pool
Platform:
- old Dell workstation with an Andataco Gigaraid enclosure plugged into an Adaptec 39160
- Nevada b51

Current zpool config:
- one two-disk mirror with two hot spares

In my ferocious pounding of ZFS I've managed to corrupt my data pool. This is what I've been doing to test it:
- set zil_disable to 1 in /etc/system
- continually untar a couple of files into the filesystem
- manually spin down a drive in the mirror by holding down the button on the enclosure
- for any system hangs, reboot with a nasty reboot -dnq

I've gotten different results after the spindown:
- works properly: short or no hang, hot spare successfully added to the mirror
- system hangs, and after a reboot the spare is not added
- tar hangs, but after running zpool status the hot spare is added properly and tar continues
- tar continues, but hangs on zpool status

The last is what happened just prior to the corruption. Here's the output of zpool status:

nextest-01# zpool status -v
  pool: zmir
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 1 errors on Thu Nov 30 11:37:21 2006
config:

        NAME        STATE     READ WRITE CKSUM
        zmir        DEGRADED     8     0     4
          mirror    DEGRADED     8     0     4
            c3t3d0  ONLINE       0     0    24
            c3t4d0  UNAVAIL      0     0     0  cannot open
        spares
          c0t0d0    AVAIL
          c3t1d0    AVAIL

errors: The following persistent errors have been detected:

        DATASET  OBJECT  RANGE
        15       0       lvl=4294967295 blkid=0

So the questions are:
- is this fixable? I don't see an inum I could run find on to remove, and I can't even do a zfs volinit anyway:

nextest-01# zfs volinit
cannot iterate filesystems: I/O error

- would not enabling zil_disable have prevented this?
- should I have been doing a 3-way mirror?
- is there a more optimal configuration to help prevent this kind of corruption?

Ultimately, I want to build a ZFS server with performance and reliability comparable to, say, a Netapp, but the fact that I appear to have been able to nuke my pool by simulating a hardware error gives me pause. I'd love to know if I'm off-base in my worries. Jim
[zfs-discuss] Re: Managed to corrupt my pool
So the questions are:
- is this fixable? I don't see an inum I could run find on to remove, and I can't even do a zfs volinit anyway:

nextest-01# zfs volinit
cannot iterate filesystems: I/O error

- would not enabling zil_disable have prevented this?
- should I have been doing a 3-way mirror?
- is there a more optimal configuration to help prevent this kind of corruption?

Anyone have any thoughts on this? I'd really like to be able to build a nice ZFS box for file service, but if a hardware failure can corrupt a disk pool I'll have to try to find another solution, I'm afraid.
[zfs-discuss] Re: Managed to corrupt my pool
Anyone have any thoughts on this? I'd really like to be able to build a nice ZFS box for file service but if a hardware failure can corrupt a disk pool I'll have to try to find another solution, I'm afraid.

Sorry, I worded this poorly -- if the loss of a disk in a mirror can corrupt the pool it's going to give me pause in implementing a ZFS solution. Jim
[zfs-discuss] Netapp to Solaris/ZFS issues
We have two aging Netapp filers and can't afford to buy new Netapp gear, so we've been looking with a lot of interest at building NFS fileservers running ZFS as a possible future approach. Two issues have come up in the discussion:

- Adding new disks to a RAID-Z pool (Netapps handle adding new disks very nicely). Mirroring is an alternative, but when you're on a tight budget, losing N/2 disk capacity is painful.

- The default scheme of one filesystem per user runs into problems with Linux NFS clients; on one Linux system, with 1300 logins, we already have to do symlinks with amd because Linux systems can't mount more than about 255 filesystems at once. We can of course just have one filesystem exported and make /home/student a subdirectory of that, but then we run into problems with quotas -- and on an undergraduate fileserver, quotas aren't optional!

Neither of these problems is necessarily a showstopper, but both make the transition more difficult. Any progress that could be made with them would help sites like us make the switch sooner.
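For illustration, the per-user-filesystem model that triggers the mount-count problem is precisely what lets ZFS enforce quotas today; dataset names are hypothetical:

# zfs create home/student/jdoe
# zfs set quota=5G home/student/jdoe

With a single exported filesystem there is currently no per-user quota mechanism in ZFS, which is exactly the tension described above.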
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hey Ben - I need more time to look at this and connect some dots, but real quick... Some nfsstat data that we could use to potentially correlate to the local server activity would be interesting. zfs_create() seems to be the heavy hitter, but a periodic kernel profile (especially if we can catch a 97% SYS period) would help:

# lockstat -i997 -Ik -s 10 sleep 60

Alternatively:

# dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()] = count(); }'

It would also be interesting to see what the zfs_create()'s are doing. Perhaps a quick:

# dtrace -n 'zfs_create:entry { printf("ZFS Create: %s\n", stringof(args[0]->v_path)); }'

It would also be interesting to see the network stats. Grab Brendan's nicstat and collect some samples. Your reference to low traffic is in bandwidth, which, as you indicate, is really, really low. But the data, at least up to this point, suggests the workload is not data/bandwidth intensive, but more attribute intensive. Note again zfs_create() is the heavy ZFS function, along with zfs_getattr. Perhaps it's the attribute-intensive nature of the load that is at the root of this. I can spend more time on this tomorrow (traveling today). Thanks, /jim

Ben Rockwood wrote:
I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP, shared via ZFS sharenfs, on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr.

===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed, dns was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF.
Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc    12/07/2006

10:38:05    %usr    %sys    %wio   %idle
10:38:06       0      27       0      73
10:38:07       0      27       0      73
10:38:09       0      27       0      73
10:38:10       1      26       0      73
10:38:11       0      26       0      74
10:38:12       0      26       0      74
10:38:13       0      24       0      76
10:38:14       0       6       0      94
10:38:15       0       7       0      93
10:38:22       0      99       0       1   --
10:38:23       0      94       0       6   --
10:38:24       0      28       0      72
10:38:25       0      27       0      73
10:38:26       0      27       0      73
10:38:27       0      27       0      73
10:38:28       0      27       0      73
10:38:29       1      30       0      69
10:38:30       0      27       0      73

And so we consider whether or not there is a pattern to the frequency. The following is sar output from any lines in which sys is above 90%:

10:40:04    %usr    %sys    %wio   %idle    Delta
10:40:11       0      97       0       3
10:40:45       0      98       0       2    34 seconds
10:41:02       0      94       0       6    17 seconds
10:41:26       0     100       0       0    24 seconds
10:42:00       0     100       0       0    34 seconds
10:42:25  (end of sample)                   25 seconds

Looking
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper?

It's not clear to me why NFS-enforced synchronous semantics would induce different behavior than the same load applied to a local ZFS. File creates are metadata-intensive, right? And these operations need to be synchronous to guarantee file system consistency (yes, I am familiar with the ZFS COW model). Anyway... I'm feeling rather naive here, but I've seen the "NFS-enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance of metadata-intensive operations when ZFS is the underlying file system, and I never really understood what is "unsynchronous" about doing the same thing to a local ZFS. And yes, there is certainly a network latency component to the NFS configuration, so for any synchronous operation I would expect things to be slower when done over NFS. Awaiting enlightenment :^) /jim
[zfs-discuss] Can't destroy corrupted pool
Ok, so I'm planning on wiping my test pool that seems to have problems with non-spare disks being marked as spares, but I can't destroy it:

# zpool destroy -f zmir
cannot iterate filesystems: I/O error

Anyone know how I can nuke this for good? Jim
[zfs-discuss] Re: Can't destroy corrupted pool
BTW, I'm also unable to export the pool -- same error. Jim
[zfs-discuss] Re: Can't destroy corrupted pool
Nevermind:

# zfs destroy [EMAIL PROTECTED]:28
cannot open '[EMAIL PROTECTED]:28': I/O error

Jim
[zfs-discuss] Re: Can't destroy corrupted pool
You are likely hitting:

6397052 unmounting datasets should process /etc/mnttab instead of traverse DSL

which was fixed in build 46 of Nevada. In the meantime, you can remove /etc/zfs/zpool.cache manually and reboot, which will remove all your pools (which you can then re-import on an individual basis).

I'm running b51, but I'll try deleting the cache. Jim
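For illustration, the suggested workaround looks like this; the pool name "tank" is hypothetical, and the damaged pool is simply never re-imported:

# rm /etc/zfs/zpool.cache
# reboot
(after boot, no pools are configured)
# zpool import            (lists pools visible on attached devices)
# zpool import tank       (re-import each pool you want, leaving the damaged one out)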
[zfs-discuss] Re: Can't destroy corrupted pool
This worked. I've restarted my testing, but I've been fdisking each drive before I add it to the pool, and so far the system is behaving as expected when I spin a drive down, i.e., the hot spare gets used automatically. This makes me wonder if it's possible to ensure that the forced addition of a drive to a pool wipes the drive of any previous data, especially any ZFS metadata. I'll keep the list posted as I continue my tests. Jim
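For illustration, stale ZFS labels can also be cleared by hand before reusing a disk; the device name is hypothetical, and note that ZFS keeps two label copies at the front of the device and two at the end, so zeroing only the start may not be enough:

# dd if=/dev/zero of=/dev/rdsk/c3t4d0s0 bs=1024k count=2

A subsequent zpool create -f then finds no leftover metadata at the front of the device.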
[zfs-discuss] zfs exported a live filesystem
By mistake, I just exported my test filesystem while it was up and being served via NFS, causing my tar over NFS to start throwing stale file handle errors. Should I file this as a bug, or should I just not do that? :-)
[zfs-discuss] Re: zfs exported a live filesystem
For the record, this happened with a new filesystem. I didn't muck about with an old filesystem while it was still mounted; I created a new one, mounted it, and then accidentally exported it.

Except that it doesn't:
# mount /dev/dsk/c1t1d0s0 /mnt
# share /mnt
# umount /mnt
umount: /mnt busy
# unshare /mnt
# umount /mnt
If you umount -f it will though!

Well, sure, but I was still surprised that it happened anyway.

The system is working as designed; the NFS client did what it was supposed to do. If you brought the pool back in again with zpool import, things should have picked up where they left off.

Yep -- an import/shareall made the FS available again.

What's more, you were probably running as root when you did that, so you got what you asked for - there is only so much protection we can give without being annoying!

Sure, but there are still safeguards in place even when running things as root, such as requiring umount -f as above, or warning you when running format on a disk with mounted partitions. Since this appeared to be an operation that may warrant such a safeguard, I thought I'd check and see if this was to be expected or if a safeguard should be put in. Annoying isn't always bad :-)

Now having said that, I personally wouldn't have expected that zpool export should work as easily as that while there were shared filesystems. I would have expected that exporting the pool should attempt to unmount all the ZFS filesystems first - which would have failed without a -f flag because they were shared. So IMO it is a bug, or at least an RFE.

Ok, where should I file an RFE? Jim
[zfs-discuss] Kickstart hot spare attachment
For my latest test I set up a stripe of two mirrors with one hot spare, like so:

zpool create -f -m /export/zmir zmir mirror c0t0d0 c3t2d0 mirror c3t3d0 c3t4d0 spare c3t1d0

I spun down c3t2d0 and c3t4d0 simultaneously, and while the system kept running (my tar over NFS barely hiccuped), the zpool command hung again. I rebooted the machine with -dnq, and although the system didn't come up the first time, it did after an fsck and a second reboot. However, once again the hot spare isn't getting used:

# zpool status -v
  pool: zmir
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Tue Dec 12 09:15:49 2006
config:

        NAME        STATE     READ WRITE CKSUM
        zmir        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c0t0d0  ONLINE       0     0     0
            c3t2d0  UNAVAIL      0     0     0  cannot open
          mirror    DEGRADED     0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  UNAVAIL      0     0     0  cannot open
        spares
          c3t1d0    AVAIL

A few questions:
- I know I can attach it via the zpool commands, but is there a way to kickstart the attachment process if it fails to attach automatically upon disk failure?
- In this instance the spare is twice as big as the other drives -- does that make a difference?
- Is there something inherent to an old SCSI bus that causes spun-down drives to hang the system in some way, even if it's just hanging the zpool/zfs system calls? Would a thumper be more resilient to this?

Jim
Re: [zfs-discuss] Project Proposal: Availability Suite
Jason J. W. Williams wrote:
Could the replication engine eventually be integrated more tightly with ZFS?

Not in its present form. The architecture and implementation of Availability Suite are driven off block-based replication at the device level (/dev/rdsk/...), something that allows the product to replicate any Solaris file system, database, etc., without any knowledge of what it is actually replicating. To pursue ZFS replication in the manner of Availability Suite, one needs to see what replication looks like from an abstract point of view. Simplistically, remote replication is like the letter 'h', where the left side of the letter is the complete I/O path on the primary node, the horizontal part of the letter is the remote replication network link, and the right side of the letter is only the bottom half of the complete I/O path on the secondary node. ZFS would have to have its functional I/O path split into two halves, a top and a bottom piece. We would then configure replication, the letter 'h', between two given nodes, running both the top and bottom pieces of ZFS on the source node, and just the bottom half of ZFS on the secondary node.

Today, the SNDR component of Availability Suite works like the letter 'h', where we split the Solaris I/O stack into a top and bottom half. The top half is that software (file system, database or application I/O) that directs its I/Os to the bottom half (raw device, volume manager or block device). So all that needs to be done is to design and build a new variant of the letter 'h', and find the place to separate ZFS into two pieces. - Jim Dunham

That would be a slick alternative to send/recv. Best Regards, Jason

On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote:
Project Overview: I propose the creation of a project on opensolaris.org, to bring to the community two Solaris host-based data services; namely volume snapshot and volume replication. These two data services exist today as the Sun StorageTek Availability Suite, a Solaris 8, 9 and 10 unbundled product set, consisting of Instant Image (II) and Network Data Replicator (SNDR).

Project Description: Although Availability Suite is typically known as just two data services (II and SNDR), there is an underlying Solaris I/O filter driver framework which supports these two data services. This framework provides the means to stack one or more block-based, pseudo device drivers on to any pre-provisioned cb_ops structure [ http://www.opensolaris.org/os/article/2005-03-31_inside_opensolaris__solaris_driver_programming/#datastructs ], thereby shunting all cb_ops I/O into the top of a developed filter driver (for driver-specific processing), then out the bottom of this filter driver, back into the original cb_ops entry points. Availability Suite was developed to interpose itself on the I/O stack of a block device, providing a filter driver framework with the means to intercept any I/O originating from an upstream file system, database or application layer. This framework provided the means for Availability Suite to support snapshot and remote replication data services for UFS, QFS, VxFS, and more recently the ZFS file system, plus various databases like Oracle, Sybase and PostgreSQL, and also application I/Os. By providing a filter driver at this point in the Solaris I/O stack, it allows for any number of data services to be implemented, without regard to the underlying block storage that they will be configured on.
Today, as a snapshot and/or replication solution, the framework allows both the source and destination block storage devices to differ not only in physical characteristics (DAS, Fibre Channel, iSCSI, etc.), but also in logical characteristics such as RAID type, volume-managed storage (i.e., SVM, VxVM), lofi, zvols, even ram disks.

Community Involvement: By providing this filter-driver framework, two working filter drivers (II and SNDR), and an extensive collection of supporting software and utilities, it is envisioned that those individuals and companies that adopt OpenSolaris as a viable storage platform will also utilize and enhance the existing II and SNDR data services, plus have offered to them the means with which to develop their own block-based filter driver(s), further enhancing the use and adoption of OpenSolaris. A very timely example that is very applicable to Availability Suite and the OpenSolaris community is the recent announcement of the Project Proposal: lofi [ compression & encryption ] - http://www.opensolaris.org/jive/click.jspa?messageID=26841. By leveraging both the Availability Suite and the lofi OpenSolaris projects, it would be highly probable to not only offer compression and encryption for lofi devices (as already proposed), but, by collectively leveraging these two projects, to create the means to support file systems, databases and applications across all block-based storage devices. Since Availability
[zfs-discuss] Re: ZFS panics system during boot, after 11/06 upgrade
There are ZFS file systems. There are no zones. Any help would be greatly appreciated; this is my everyday computer.

Take a look at page 167 of the admin guide: http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf You need to delete /etc/zfs/zpool.cache, and then use zpool import to recover. Cheers, Jim
Re: [zfs-discuss] Project Proposal: Availability Suite
Jason, Thank you very much for the detailed explanation. It is very helpful to understand the issue. Is anyone successfully using SNDR with ZFS yet?

Of the opportunities I've been involved with, the answer is yes, but so far I've not seen SNDR with ZFS in a production environment - although that does not mean such deployments don't exist. It was not until late June '06 that AVS 4.0, Solaris 10 and ZFS were generally available, and to date AVS has not been made available for the Solaris Express Community Release, but it will be real soon.

While I have your attention, there are two issues between ZFS and AVS that need mentioning.

1) When ZFS is given an entire LUN to place in a ZFS storage pool, ZFS detects this, enabling SCSI write caching on the LUN, and also opens the LUN with exclusive access, preventing other data services (like AVS) from accessing this device. The workaround is to manually format the LUN, typically placing all the blocks into a single partition, and then just place this partition into the ZFS storage pool. ZFS detects that it does not own the entire LUN, so it doesn't enable write caching, which means it also doesn't open the LUN with exclusive access; therefore AVS and ZFS can share the same LUN. I thought about submitting an RFE to have ZFS provide a means to override this restriction, but I am not 100% certain that a ZFS filesystem directly accessing a write-cache-enabled LUN is the same thing as a replicated ZFS filesystem accessing a write-cache-enabled LUN. Even though AVS is write-order consistent, there are disaster recovery scenarios which, when enacted, issue block-ordered rather than write-ordered I/Os.

2) One has to be very cautious in using zpool import -f (forced import), especially on a LUN or LUNs into which SNDR is actively replicating. If ZFS complains that the storage pool was not cleanly exported when issuing a zpool import, and one attempts a zpool import -f without checking the active replication state, they are sure to panic Solaris. Of course this failure scenario is no different from accessing a LUN or LUNs on dual-ported or SAN-based storage while another Solaris host is still accessing the ZFS filesystem, or from controller-based replication; these are all just different operational scenarios of the same issue: data blocks changing out from underneath the ZFS filesystem and its checksum-checking mechanisms.

Jim

Best Regards, Jason
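For illustration, the workaround in (1) boils down to giving ZFS a slice instead of the whole disk; the device names are hypothetical:

# format c1t0d0      (in the partition menu, put all blocks in slice 0 and label the disk)
# zpool create tank c1t0d0s0

Because the pool member is c1t0d0s0 rather than c1t0d0, ZFS leaves the write cache alone and does not take exclusive access, so SNDR can be configured on the same LUN.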
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Ben, I've been playing with replication of a ZFS Zpool using the recently released AVS. I'm pleased with things, but just replicating the data is only part of the problem. The big question is: can I have a zpool open in 2 places? No. The ability to have a zpool open in two places would require a shared ZFS. The semantics of remote replication can be viewed as those of two Solaris hosts looking at the same SAN or dual-ported storage. Today, ZFS detects this with both SNDR and shared storage, as part of zpool import, warning that the pool is active elsewhere. What I really want is a Zpool on node1 open and writable (production storage) and replicated to node2 where it's open for read-only access (standby storage). The best you can do for this is to use the II portion of Availability Suite to take a snapshot of the active SNDR replica on the remote node, getting a snapshot of the ZFS filesystem being replicated. Without this, ZFS on the remote node will see and detect replicated disk blocks changing in the zpool it is reading from. This is an old problem. I'm not sure it's remotely possible. It's bad enough with UFS, but ZFS maintains a hell of a lot more meta-data. How is node2 supposed to know that a snapshot has been created, for instance? With UFS you can at least get by some of these problems using directio, but that's not an option with a zpool. I know this is a fairly remedial issue to bring up... but if I think about what I want Thumper-to-Thumper replication to look like, I want 2 usable storage systems. As I see it now the secondary storage (node2) is useless until you break replication and import the pool, do your thing, and then re-sync storage to re-enable replication. Am I missing something? I'm hoping there is an option I'm not aware of. No. Also just to be clear, after you ... do your thing, and then re-sync storage ..., the re-sync either keeps all of the data on the SNDR primary OR keeps all the data on the SNDR secondary. There is no means to combine writes that occurred in two separate ZFS filesystems back into one filesystem. The remote ZFS filesystem is essentially a clone of the original filesystem, and once a write I/O occurs to either side, the two filesystems take on a life of their own. Of course this is not unique to the ZFS filesystem, as the same is true for all others, and this underlying storage behavior is not unique to SNDR, as it happens with other host-based replication and controller-based replication. Jim benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
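A sketch of the II approach on the remote node, assuming the standard AVS iiadm usage (the volume paths and set layout here are hypothetical):
# iiadm -e ind /dev/rdsk/c2t0d0s0 /dev/rdsk/c2t1d0s0 /dev/rdsk/c2t2d0s0
  (master = the SNDR secondary volume, then shadow, then bitmap)
# ... read point-in-time consistent data from the shadow volume ...
# iiadm -u s /dev/rdsk/c2t1d0s0   (later, refresh the shadow from the master)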
Re: [zfs-discuss] Project Proposal: Availability Suite
Frank, On Fri, 2 Feb 2007, Torrey McMahon wrote: Jason J. W. Williams wrote: Hi Jim, Thank you very much for the heads up. Unfortunately, we need the write-cache enabled for the application I was thinking of combining this with. Sounds like SNDR and ZFS need some more soak time together before you can use both to their full potential together? Well...there is the fact that SNDR works with filesystems other than ZFS. (Yes, I know this is the ZFS list.) Working around architectural issues for ZFS and ZFS alone might cause issues for others. SNDR has some issues with logging UFS as well. If you start a SNDR live copy on an active logging UFS (not _writelocked_), the UFS log state may not be copied consistently. Treading very carefully, UFS logging may have issues with being replicated, not the other way around. SNDR replication (after synchronizing) maintains a write-order consistent volume; thus if there is an issue with UFS logging being able to access an SNDR secondary, then UFS logging will also have issues with accessing a volume after Solaris crashes. The end result of Solaris crashing, or SNDR replication stopping, is a write-ordered, crash-consistent volume. Given that both UFS logging and SNDR are (near) perfect (or there would be a flood of escalations), the cause of this issue, in all cases I've seen to date, is that the SNDR primary volume being replicated is mounted with UFS logging enabled, but the SNDR secondary is not mounted with UFS logging enabled. Once this condition happens, the problem can be resolved by fixing /etc/vfstab to correct the inconsistent mount options, and then performing an SNDR update sync. If you want a live remote replication facility, it _NEEDS_ to talk to the filesystem somehow. There must be a callback mechanism that the filesystem could use to tell the replicator: from exactly now on, you start replicating. The only entity which can truly give this signal is the filesystem itself. There is an RFE against SNDR for something called in-line PIT. I hope that this work will get done soon. And no, that's _not_ when the filesystem does a flush write cache ioctl, or when the user has just issued a sync command or similar. For ZFS, it'd be when a ZIL transaction is closed (as I understand it); for UFS, it'd be when the UFS log is fully rolled. There's no notification to external entities when these two events happen. Because ZFS is always on-disk consistent, this is not an issue. So far in ALL my testing with replicating ZFS with SNDR, I have not seen ZFS fail! Of course be careful to not confuse my stated position with another closely related scenario: that being accessing ZFS on the remote node via a forced import (zpool import -f name) with active SNDR replication, as ZFS is sure to panic the system. ZFS, unlike other filesystems, has 0% tolerance for corrupted metadata. Jim SNDR tries its best to achieve this detection, but without actually _stopping_ all I/O (on UFS: writelocking), there's a window of vulnerability still open. And SNDR/II don't stop filesystem I/O - by basic principle. That's how they're sold/advertised/intended to be used. I'm all willing to see SNDR/II go open - we could finally work these issues! FrankH. I think the best-of-both-worlds approach would be to let SNDR plug in to ZFS along the same lines the crypto stuff will be able to plug in, different compression types, etc. There once was a slide that showed how that worked... or I'm hallucinating again.
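Before any import on the remote node, the replication state can be checked and quiesced first; a minimal sketch using the standard sndradm commands (the pool name is a placeholder, and the exact output is illustrative):
# sndradm -P          (print the state of each SNDR set; look for logging vs. replicating)
# sndradm -l          (if a set is still replicating, place it into logging mode first)
# zpool import tank   (only then attempt the import)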
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Robert, Hello Ben, Monday, February 5, 2007, 9:17:01 AM, you wrote: BR I've been playing with replication of a ZFS Zpool using the BR recently released AVS. I'm pleased with things, but just BR replicating the data is only part of the problem. The big BR question is: can I have a zpool open in 2 places? BR What I really want is a Zpool on node1 open and writable BR (production storage) and replicated to node2 where it's open for BR read-only access (standby storage). BR This is an old problem. I'm not sure it's remotely possible. It's BR bad enough with UFS, but ZFS maintains a hell of a lot more BR meta-data. How is node2 supposed to know that a snapshot has been BR created, for instance? With UFS you can at least get by some of BR these problems using directio, but that's not an option with a zpool. BR I know this is a fairly remedial issue to bring up... but if I BR think about what I want Thumper-to-Thumper replication to look BR like, I want 2 usable storage systems. As I see it now the BR secondary storage (node2) is useless until you break replication BR and import the pool, do your thing, and then re-sync storage to re-enable replication. BR Am I missing something? I'm hoping there is an option I'm not aware of. You can't mount rw on one node and ro on another (not to mention that zfs doesn't let you import pools read-only right now). You can mount the same file system, like UFS, in RO on both nodes, but not ZFS (no ro import). One cannot just mount a filesystem in RO mode if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, except of course shared-reader QFS, this will fail given time. Even if one has the means to mount a filesystem with DIRECTIO (no caching) and READ-ONLY (no writes), it does not prevent a filesystem from looking at the contents of block A and then acting on block B. The reason is that during replication at time T1, both blocks A and B could be written and be consistent with each other. Next the file system reads block A. Now replication at time T2 updates blocks A and B, also consistent with each other. Next the file system reads block B and panics due to an inconsistency only it sees between old A and new B. I know this for a fact, since a forced zpool import -f name is a common instance of this exact failure, most likely due to checksum failures between metadata blocks A and B. Of course using an instantly accessible II snapshot of an SNDR secondary volume would work just fine, since the data being read is now point-in-time consistent, and static. - Jim I believe what you really need is a continuous 'zfs send' feature. We are developing something like this right now. I expect we can give more details really soon now. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Ben Rockwood wrote: Jim Dunham wrote: Robert, Hello Ben, Monday, February 5, 2007, 9:17:01 AM, you wrote: BR I've been playing with replication of a ZFS Zpool using the BR recently released AVS. I'm pleased with things, but just BR replicating the data is only part of the problem. The big BR question is: can I have a zpool open in 2 places? BR What I really want is a Zpool on node1 open and writable BR (production storage) and replicated to node2 where it's open for BR read-only access (standby storage). BR This is an old problem. I'm not sure it's remotely possible. It's BR bad enough with UFS, but ZFS maintains a hell of a lot more BR meta-data. How is node2 supposed to know that a snapshot has been BR created, for instance? With UFS you can at least get by some of BR these problems using directio, but that's not an option with a zpool. BR I know this is a fairly remedial issue to bring up... but if I BR think about what I want Thumper-to-Thumper replication to look BR like, I want 2 usable storage systems. As I see it now the BR secondary storage (node2) is useless until you break replication BR and import the pool, do your thing, and then re-sync storage to re-enable replication. BR Am I missing something? I'm hoping there is an option I'm not aware of. You can't mount rw on one node and ro on another (not to mention that zfs doesn't let you import pools read-only right now). You can mount the same file system, like UFS, in RO on both nodes, but not ZFS (no ro import). One cannot just mount a filesystem in RO mode if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, except of course shared-reader QFS, this will fail given time. Even if one has the means to mount a filesystem with DIRECTIO (no caching) and READ-ONLY (no writes), it does not prevent a filesystem from looking at the contents of block A and then acting on block B. The reason is that during replication at time T1, both blocks A and B could be written and be consistent with each other. Next the file system reads block A. Now replication at time T2 updates blocks A and B, also consistent with each other. Next the file system reads block B and panics due to an inconsistency only it sees between old A and new B. I know this for a fact, since a forced zpool import -f name is a common instance of this exact failure, most likely due to checksum failures between metadata blocks A and B. Ya, that bit me last night.
'zpool import' shows the pool fine, but when you force the import you panic: Feb 5 07:14:10 uma ^Mpanic[cpu0]/thread=fe8001072c80: Feb 5 07:14:10 uma genunix: [ID 809409 kern.notice] ZFS: I/O failure (write on unknown off 0: zio fe80c54ed380 [L0 unallocated] 400L/200P DVA[0]=0:36000:200 DVA[1]=0:9c0003800:200 DVA[2]=0:20004e00:200 fletcher4 lzjb LE contiguous birth=57416 fill=0 cksum=de2e56ffd:5591b77b74b:1101a91d58dfc:252efdf22532d0): error 5 Feb 5 07:14:11 uma unix: [ID 10 kern.notice] Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a40 zfs:zio_done+140 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ab0 zfs:zio_wait_for_children+5d () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ad0 zfs:zio_wait_children_done+20 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072af0 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b40 zfs:zio_vdev_io_assess+129 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bb0 zfs:vdev_mirror_io_done+2af () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bd0 zfs:zio_vdev_io_done+26 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c60 genunix:taskq_thread+1a7 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c70 unix:thread_start+8 () Feb 5 07:14:11 uma unix: [ID 10 kern.notice] So without using II, what's the best method of bringing up the secondary storage? Is just dropping the primary into logging acceptable? Yes, placing SNDR in logging mode stops the replication of writes. Also, performing a zpool export on the primary node and waiting (sndradm -w) until all writes are replicated means that on the SNDR secondary node a zpool import can be done without using the -f, as a forced import is not needed, since the zpool export operation got replicated. Be sure to remember to zpool export on the remote node before resuming replication on the primary node, or another panic will likely occur. Jim benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
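That clean hand-off, end to end, as a sketch (the pool name tank is a placeholder; sndradm -w is the wait described above, and the concluding update sync assumes you intend to keep the primary's data):
On the primary node:
# zpool export tank
# sndradm -w             (wait until all queued writes reach the secondary)
On the secondary node:
# zpool import tank      (no -f needed; the export was replicated)
# ... do your thing ...
# zpool export tank      (export again before replication resumes)
Back on the primary node:
# sndradm -u             (update re-sync to resume replication)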
[zfs-discuss] FROSUG February Meeting Announcement (2/22/2007)
This month's FROSUG (Front Range OpenSolaris User Group) meeting is on Thursday, February 22, 2007. Our presentation is ZFS as a Root File System by Lori Alt. In addition, Jon Bowman will be giving an OpenSolaris Update, and we will also be doing an InstallFest. So, if you want help installing an OpenSolaris distribution, back up your laptop and bring it to the meeting! About the presentation(s): One of the next steps in the evolution of ZFS is to enable its use as a root file system. This presentation will focus on how booting from ZFS will work, how installation will be affected by ZFS's feature set, and the many advantages that will result from being able to use ZFS as a root file system. The presentation(s) will be posted here prior to the meeting: http://www.opensolaris.org/os/community/os_user_groups/frosug/ About our presenter(s): Lori Alt is a Staff Engineer at Sun Microsystems, where she has worked since 1991. Lori worked on Solaris install and upgrade and then on UFS, where she led the multi-terabyte UFS project. She has Bachelor's and Master's degrees in computer science from Washington University in St. Louis, MO. - Meeting Details: When: Thursday, February 22, 2007 Times: 6:00pm - 6:30pm Doors open and Pizza 6:30pm - 6:45pm OpenSolaris Update (Jon Bowman) 6:45pm - 8:30pm ZFS as a Root File System (Lori Alt) Where: Sun Broomfield Campus Building 1 - Conference Center 500 Eldorado Blvd. Broomfield, CO 80021 Note: The location of this meeting may change. We will send out an additional email prior to the meeting if this happens. Pizza and soft drinks will be served at the beginning of the meeting. Please RSVP to frosug-rsvp(AT)opensolaris(DOT)org in order to help us plan for food and setup access to the Sun campus. We hope to see you there! Thanks, FROSUG +++ Future Meeting Plans: March 29, 2007: Doug McCallum presents sharemgr This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] UPDATE: FROSUG February Meeting (2/22/2007)
***Meeting Update*** We will be having this month's meeting at the Omni Interlocken Resort in Broomfield, and a conference call number is being provided for those who cannot make the meeting in person; see Meeting Details below for more information. In addition, we will be discussing Solaris Express Developer Edition during the OpenSolaris Update and providing free SXDE DVDs. Hope to see you there. This month's meeting is getting a lot of interest! ***Meeting Update*** This month's FROSUG (Front Range OpenSolaris User Group) meeting is on Thursday, February 22, 2007. Our presentation is ZFS as a Root File System by Lori Alt. In addition, Jon Bowman will be giving an OpenSolaris Update, and we will also be doing an InstallFest. So, if you want help installing Solaris Express Developer Edition, back up your laptop and bring it to the meeting! About the presentation: One of the next steps in the evolution of ZFS is to enable its use as a root file system. This presentation will focus on how booting from ZFS will work, how installation will be affected by ZFS's feature set, and the many advantages that will result from being able to use ZFS as a root file system. The presentation will be posted here prior to the meeting: http://www.opensolaris.org/os/community/os_user_groups/frosug/ About our presenter: Lori Alt is a Staff Engineer at Sun Microsystems, where she has worked since 1991. Lori worked on Solaris install and upgrade and then on UFS, where she led the multi-terabyte UFS project. She has Bachelor's and Master's degrees in computer science from Washington University in St. Louis, MO. - Meeting Details When: Thursday, February 22, 2007 Times: 6:00pm - 6:30pm Food and Drinks 6:30pm - 6:45pm OpenSolaris Update (Jon Bowman) 6:45pm - 8:30pm ZFS as a Root File System (Lori Alt) Where: Omni Interlocken Resort (Fir Conference Room) 500 Interlocken Blvd. Broomfield, CO 80021 Conference Call Information US: 866-545-5198 INTL: 865-521-8904 Access Code: 5518835 - The meeting is free and open to the public. Snacks and soft drinks will be served at the beginning of the meeting. Please RSVP to frosug-rsvp(AT)opensolaris(DOT)org in order to help us plan for food. We hope to see you there! Thanks, FROSUG - Future Meeting Plans: March 29, 2007: Doug McCallum presents sharemgr If you have ideas for meeting topics, send them to: ug-frosug(AT)opensolaris(DOT)org This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why number of NFS threads jumps to the max value?
You don't honestly, really, reasonably, expect someone, anyone, to look at the stack trace of a few hundred threads, and post something along the lines of This is what is wrong with your NFS server. Do you? Without any other information at all? We're here to help, but please reset your expectations around our abilities to root-cause pathological behavior based on almost no information. What size and type of server? What size and type of storage? What release of Solaris? How many networks, and what type? What is being used to generate the load for the testing? What is the zpool configuration? What do the system stats look like while under load (e.g. mpstat), and how do they change when you see this behavior? What does the zpool iostat zpool_name 1 output look like while under load? Are you collecting nfsstat data - what is the rate of incoming NFS ops? Can you characterize the load - read/write data intensive, metadata intensive? Are the client machines Solaris, or something else? Does this last for seconds, minutes, tens-of-minutes? Does the system remain in this state indefinitely until reboot, or does it normalize? Can you consistently reproduce this problem? /jim Leon Koll wrote: Hello, gurus, I need your help. During the benchmark test of NFS-shared ZFS file systems, at some moment the number of NFS threads jumps to the maximal value, 1027 (NFSD_SERVERS was set to 1024). The latency also grows and the number of IOPS is going down. I've collected the output of echo "::pgrep nfsd | ::walk thread | ::findstack -v" | mdb -k that can be seen here: http://tinyurl.com/yrvn4z Could you please look at it and tell me what's wrong with my NFS server. Appreciate, -- Leon This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
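As a starting point, a sketch of the kind of data worth collecting while the benchmark runs (these are the standard Solaris utilities; the pool name is a placeholder):
# mpstat 5                   (per-CPU utilization; watch for changes when the thread count jumps)
# zpool iostat mypool 5      (pool-level IOPS and bandwidth)
# iostat -xnz 5              (per-device latency and queueing)
# nfsstat -s                 (sample before and after to compute the rate of incoming NFS ops)
Capturing each of these across the moment the thread count jumps answers most of the questions above.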
Re: [zfs-discuss] C'mon ARC, stay small...
FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . . c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
How/when did you configure arc_c_max? Immediately following a reboot, I set arc.c_max using mdb, then verified by reading the arc structure again. arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem? Yep. I'm using an x4500 in-house to sort out performance of a customer test case that uses mmap. We acquired the new DIMMs to bring the x4500 to 32GB, since the workload has a 64GB working set size, and we were clobbering a 16GB thumper. We wanted to see how doubling memory may help. I'm trying to clamp the ARC size because for mmap-intensive workloads, it seems to hurt more than help (although, based on experiments up to this point, it's not hurting a lot). I'll do another reboot, and run it all down for you serially... /jim Thanks, -j On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: ARC_mru::print -d size lsize size = 0t10224433152 lsize = 0t10218960896 ARC_mfu::print -d size lsize size = 0t303450112 lsize = 0t289998848 ARC_anon::print -d size size = 0 So it looks like the MRU is running at 10GB... What does this tell us? Thanks, /jim [EMAIL PROTECTED] wrote: This seems a bit strange. What's the workload, and also, what's the output for: ARC_mru::print size lsize ARC_mfu::print size lsize and ARC_anon::print size For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive. -j On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . . c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
Following a reboot: arc::print -tad { . . . c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t33176457216 . . . } c02e2a08 /Z 0x20000000 --- set c_max to 512MB arc+0x48: 0x7b9789000 = 0x20000000 arc::print -tad { . . . c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 - c_max is 512MB . . . } ARC_mru::print -d size lsize size = 0t294912 lsize = 0t32768 Run the workload a couple times... c02e29e8 uint64_t size = 0t27121205248 --- ARC size is 27GB c02e29f0 uint64_t p = 0t10551351442 c02e29f8 uint64_t c = 0t27121332576 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 - c_max is 512MB ARC_mru::print -d size lsize size = 0t223985664 lsize = 0t221839360 ARC_mfu::print -d size lsize size = 0t26897219584 -- MFU list is almost 27GB ... lsize = 0t26869121024 Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
Will try that now... /jim [EMAIL PROTECTED] wrote: I suppose I should have been more forward about making my last point. If the arc_c_max isn't set in /etc/system, I don't believe that the ARC will initialize arc.p to the correct value. I could be wrong about this; however, next time you set c_max, set c to the same value as c_max and set p to half of c. Let me know if this addresses the problem or not. -j How/when did you configure arc_c_max? Immediately following a reboot, I set arc.c_max using mdb, then verified by reading the arc structure again. arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem? Yep. I'm using an x4500 in-house to sort out performance of a customer test case that uses mmap. We acquired the new DIMMs to bring the x4500 to 32GB, since the workload has a 64GB working set size, and we were clobbering a 16GB thumper. We wanted to see how doubling memory may help. I'm trying to clamp the ARC size because for mmap-intensive workloads, it seems to hurt more than help (although, based on experiments up to this point, it's not hurting a lot). I'll do another reboot, and run it all down for you serially... /jim Thanks, -j On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: ARC_mru::print -d size lsize size = 0t10224433152 lsize = 0t10218960896 ARC_mfu::print -d size lsize size = 0t303450112 lsize = 0t289998848 ARC_anon::print -d size size = 0 So it looks like the MRU is running at 10GB... What does this tell us? Thanks, /jim [EMAIL PROTECTED] wrote: This seems a bit strange. What's the workload, and also, what's the output for: ARC_mru::print size lsize ARC_mfu::print size lsize and ARC_anon::print size For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive. -j On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . . c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max.
Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
All righty...I set c_max to 512MB, c to 512MB, and p to 256MB... arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t33176457216 ... } c02e2a08 /Z 0x20000000 arc+0x48: 0x7b9789000 = 0x20000000 c02e29f8 /Z 0x20000000 arc+0x38: 0x7b9789000 = 0x20000000 c02e29f0 /Z 0x10000000 arc+0x30: 0x3dcbc4800 = 0x10000000 arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t268435456 -- p is 256MB c02e29f8 uint64_t c = 0t536870912 -- c is 512MB c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 --- c_max is 512MB ... } After a few runs of the workload ... arc::print -d size size = 0t536788992 Ah - looks like we're out of the woods. The ARC remains clamped at 512MB. Thanks! /jim [EMAIL PROTECTED] wrote: I suppose I should have been more forward about making my last point. If the arc_c_max isn't set in /etc/system, I don't believe that the ARC will initialize arc.p to the correct value. I could be wrong about this; however, next time you set c_max, set c to the same value as c_max and set p to half of c. Let me know if this addresses the problem or not. -j How/when did you configure arc_c_max? Immediately following a reboot, I set arc.c_max using mdb, then verified by reading the arc structure again. arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem? Yep. I'm using an x4500 in-house to sort out performance of a customer test case that uses mmap. We acquired the new DIMMs to bring the x4500 to 32GB, since the workload has a 64GB working set size, and we were clobbering a 16GB thumper. We wanted to see how doubling memory may help. I'm trying to clamp the ARC size because for mmap-intensive workloads, it seems to hurt more than help (although, based on experiments up to this point, it's not hurting a lot). I'll do another reboot, and run it all down for you serially... /jim Thanks, -j On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote: ARC_mru::print -d size lsize size = 0t10224433152 lsize = 0t10218960896 ARC_mfu::print -d size lsize size = 0t303450112 lsize = 0t289998848 ARC_anon::print -d size size = 0 So it looks like the MRU is running at 10GB... What does this tell us? Thanks, /jim [EMAIL PROTECTED] wrote: This seems a bit strange. What's the workload, and also, what's the output for: ARC_mru::print size lsize ARC_mfu::print size lsize and ARC_anon::print size For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive. -j On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote: FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max: arc::print -tad { . . . c02e29e8 uint64_t size = 0t10527883264 c02e29f0 uint64_t p = 0t16381819904 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . Perhaps c_max does not do what I think it does? Thanks, /jim Jim Mauro wrote: Running an mmap-intensive workload on ZFS on an X4500, Solaris 10 11/06 (update 3). All file IO is mmap(file), read memory segment, unmap, close. Tweaked the arc size down via mdb to 1GB. I used that value because c_min was also 1GB, and I was not sure if c_max could be larger than c_min... Anyway, I set c_max to 1GB. After a workload run: arc::print -tad { . . .
c02e29e8 uint64_t size = 0t3099832832 c02e29f0 uint64_t p = 0t16540761088 c02e29f8 uint64_t c = 0t1070318720 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t1070318720 . . . size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code now, but was under the impression c_max would limit ARC growth. Granted, it's not a factor of 10, and it's certainly much better than the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly ARC growth is being limited, but it still grew to 3X c_max. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS with raidz
(I'm probably not the best person to answer this, but that has never stopped me before, and I need to give Richard Elling a little more time to get the Goats, Cows and Horses fed, sip his morning coffee, and offer a proper response...) Would it benefit us to have the disk be setup as a raidz along with the hardware raid 5 that is already setup too? Way back when, we called such configurations plaiding, which described a host-based RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had potentially better data availability with a configuration that could survive more failure modes. Alternatively, we used the hardware RAID for the availability configuration (hardware RAID 5), and used host-based RAID to stripe across hardware RAID5 LUNs for performance. Seemed to work pretty well. In theory, a raidz pool spread across some number of underlying hardware raid 5 LUNs would offer protection against more failure modes, such as the loss of an entire raid5 LUN. So from a failure protection/data availability point of view, it offers some benefit. Now, as to whether or not you experience a real, measurable benefit over time is hard to say. Each additional level of protection/redundancy has a diminishing return, oftentimes at a dramatic incremental cost (e.g. getting from four nines to five nines). Or with this double raid slow our performance with both a software and hardware raid setup? You will certainly pay a performance penalty - using raidz across the raid5 luns will reduce deliverable IOPS from the raid 5 luns. Whether or not the performance trade-off is worth the RAS gain varies based on your RAS and data availability requirements. Or would raidz setup be better than the hardware raid5 setup? Assuming a robust raid5 implementation with battery-backed nvram (protecting against the write hole and partial stripe writes), I think a raidz zpool covers more of the datapath than a hardware raid 5 LUN, but I'll wait for Richard to elaborate here (or tell me I'm wrong). Also if we do set the disks as a raidz would it benefit us more if we specified each disk in the raidz or create them as LUNs then specify the setup in raidz. Isn't this the same question as the first question? I'm not sure what you're asking here... The questions you're asking are good ones, and date back to the decades-old struggle around configuration tradeoffs for performance / availability / cost. My knee-jerk reaction is that one level of RAID, either hardware raid5 or ZFS raidz, is sufficient for availability, and keeps things relatively simple (and simple also improves RAS). The advantage host-based RAID has always had over hardware RAID is the ability to create software LUNs (like a raidz1 or raidz2 zpool) across physical disk controllers, which may also cross SAN switches, etc. So, 'twas me, I'd go with non-hardware-RAID5 devices from the storage frame, and create raidz1 or raidz2 zpools across controllers. But, that's me... :^) /jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
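For concreteness, a sketch of the two layouts being weighed (the device names are hypothetical; each cNt0d0 below stands for a hardware RAID-5 LUN exported by the array, ideally spread across controllers):
# raidz across the hardware RAID-5 LUNs (the plaiding layout; survives the loss of a whole LUN)
# zpool create tank raidz c2t0d0 c3t0d0 c4t0d0 c5t0d0
# or a plain stripe across the same LUNs (availability left entirely to the array)
# zpool create tank c2t0d0 c3t0d0 c4t0d0 c5t0d0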
[zfs-discuss] The value of validating your backups...
http://www.cnn.com/2007/US/03/20/lost.data.ap/index.html ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS with raidz
Hi Kory - Your problem came our way through other Sun folks a few days ago, and I wish I had that magic setting to help, but the reality is that I'm not aware of anything that will improve the time required to mount 12k file systems. I would add (not that this helps) that I'm not convinced this problem is unique to ZFS, but I do not have experience or empirical data on mount time for 12k UFS, QFS, ext4, etc., file systems. There is an RFE filed on this: http://bugs.opensolaris.org/view_bug.do?bug_id=6478980 As I said, I wish I had a better answer. Thanks, /jim Kory Wheatley wrote: Currently we are trying to set up ZFS file systems for all our user accounts under /homea /homec /homef /homei /homem /homep /homes and /homet. Right now on our Sun Fire v890 with 4 dual-core processors and 16gb of memory we have 12,000 ZFS file systems set up, which Sun has promised will work, but we didn't know that it would take over an hour to do a reboot on this machine to mount and umount all these file systems. What we're trying to accomplish is the best performance along with the best data protection. Sun says that ZFS supports millions of file systems, but what they left out is how long it takes to do a reboot when you have thousands of file systems. Currently we have three LUNs on our EMC disk array from which we've created one ZFS storage pool, and we've created these 12,000 ZFS file systems in this pool. We really don't want to have to go to UFS to create our student accounts. We like the flexibility of ZFS, but the slow boot process will kill us when we have to implement patches that require a reboot. These ZFS file systems will contain all the student data, so reliability and performance are key for us. Do you know a way, or a different setup for ZFS, to allow our system to boot up faster? I know each mount takes up memory, so that's part of the slowness when mounting and umounting. We know when the system is up that the kernel is using 3gb of memory out of the 16gb, and there's nothing else on this box right now but ZFS. There's no data in those thousands of file systems yet. Richard Elling wrote: Jim Mauro wrote: (I'm probably not the best person to answer this, but that has never stopped me before, and I need to give Richard Elling a little more time to get the Goats, Cows and Horses fed, sip his morning coffee, and offer a proper response...) chores are done, wading through the morning e-mail... Would it benefit us to have the disk be setup as a raidz along with the hardware raid 5 that is already setup too? Way back when, we called such configurations plaiding, which described a host-based RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had potentially better data availability with a configuration that could survive more failure modes. Alternatively, we used the hardware RAID for the availability configuration (hardware RAID 5), and used host-based RAID to stripe across hardware RAID5 LUNs for performance. Seemed to work pretty well. Yep, there are various ways to do this and, in general, the more copies of the data you have, the better reliability you have. Space is also fairly easy to calculate. Performance can be tricky, and you may need to benchmark with your workload to see which is better, due to the difficulty in modeling such systems. In theory, a raidz pool spread across some number of underlying hardware raid 5 LUNs would offer protection against more failure modes, such as the loss of an entire raid5 LUN.
So from a failure protection/data availability point of view, it offers some benefit. Now, as to whether or not you experience a real, measurable benefit over time is hard to say. Each additional level of protection/redundancy has a diminishing return, oftentimes at a dramatic incremental cost (e.g. getting from four nines to five nines). If money were no issue, I'm sure we could come up with an awesome solution :-) Or with this double raid slow our performance with both a software and hardware raid setup? You will certainly pay a performance penalty - using raidz across the raid5 luns will reduce deliverable IOPS from the raid 5 luns. Whether or not the performance trade-off is worth the RAS gain varies based on your RAS and data availability requirements. Fast, inexpensive, reliable: pick two. Or would raidz setup be better than the hardware raid5 setup? Assuming a robust raid5 implementation with battery-backed nvram (protecting against the write hole and partial stripe writes), I think a raidz zpool covers more of the datapath than a hardware raid 5 LUN, but I'll wait for Richard to elaborate here (or tell me I'm wrong). In general, you want the data protection in the application, or as close to the application as you can get. Since programmers tend to be lazy (Gosling said it, not me! :-), most rely on the file system and underlying constructs to ensure data protection. So
[zfs-discuss] REMINDER: FROSUG March Meeting Announcement (3/29/2007)
== Reminder: this meeting is tomorrow == Also, we will briefly talk about the Project Blackbox tour that is coming to the Denver area April 12-13. More information is at: http://www.sun.com/emrkt/blackbox == Reminder: this meeting is tomorrow == This month's FROSUG (Front Range OpenSolaris User Group) meeting is on Thursday, March 29, 2007. Our presentation is on Sharemgr by Doug McCallum. In addition, we will be giving an OpenSolaris Update, and will be having an InstallFest. So, if you want help installing an OpenSolaris distribution, back up your laptop and bring it to the meeting! !! We will be providing FREE Solaris Express Developer Edition DVDs. !! About the presentation: The sharemgr project is a framework for managing file sharing servers. It provides a mechanism to manage groups of shares as a single object and integrates share and group configuration into the Service Management Facility (SMF). The presentation has been posted on the frosug web page: http://www.opensolaris.org/os/community/os_user_groups/frosug/ About our presenter: Doug McCallum has been an engineer at Sun for more than 15 years. He has worked on a variety of Solaris projects including the original Solaris x86 port, networking, device support and volume management. More recently he has been working on improving the manageability of file sharing. - Meeting Details: When: Thursday, March 29, 2007 Times: 6:00pm - 6:30pm Doors open and Pizza 6:30pm - 6:45pm OpenSolaris Update (Jim Walker) 6:45pm - 8:30pm Sharemgr (Doug McCallum) Where: Sun Broomfield Campus Building 1 - Conference Center 500 Eldorado Blvd. Broomfield, CO 80021 The meeting is free and open to the public. Pizza and soft drinks will be served at the beginning of the meeting. Please RSVP to frosug-rsvp(AT)opensolaris(DOT)org in order to help us plan for food and setup access to the Sun campus. We hope to see you there! Thanks, FROSUG - Future Meeting Plans: April 2007: Dave McLoughlin (OpenLogic) presents Open Source Management May 2007: SunStudio Compiler If you have ideas for meeting topics, send them to: ug-frosug(AT)opensolaris(DOT)org This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
So you're not really sure it's the ARC growing, but only that the kernel is growing to 6.8GB. Print the arc values via mdb: # mdb -k Loading modules: [ unix krtld genunix specfs dtrace uppc scsi_vhci ufs ip hook neti sctp arp usba nca lofs zfs random sppp crypto ptm ipc ] arc::print -t size c p c_max uint64_t size = 0x2a8000 uint64_t c = 0x1cdfe800 uint64_t p = 0xe707400 uint64_t c_max = 0x1cdfe800 Is size = c_max? Assuming it is, you need to look through the kmastats and see where the kernel memory is being used (again, inside mdb): ::kmastat The above generates a LOT of output that's not completely painless to parse, but it's not too bad either. If you think it's DNLC related, you can monitor the number of entries with: # kstat -p unix:0:dnlcstats:dir_entries_cached_current unix:0:dnlcstats:dir_entries_cached_current 9374 # You can also monitor kernel memory for the dnlc (just using grep with the kmastat in mdb): ::kmastat ! grep dnlc dnlc_space_cache 16 104 254 4096 104 0 The 5th column starting from the left is mem in use, in this example 4096. I'm not sure if the dnlc_space_cache represents all of the kernel memory used for the dnlc. It might, but I need to look at the code to be sure... Let's start with this... /jim Jason J. W. Williams wrote: Hi Guys, Rather than starting a new thread I thought I'd continue this thread. I've been running Build 54 on a Thumper since mid-January and wanted to ask a question about the zfs_arc_max setting. We set it to 0x100000000 # 4GB, however it's creeping over that till our kernel memory usage is nearly 7GB (::memstat inserted below). This is a database server, so I was curious if the DNLC would have this effect over time, as it does quite quickly when dealing with small files? Would it be worth upgrading to Build 59? Thank you in advance! Best Regards, Jason
Page Summary       Pages     MB   %Tot
Kernel           1750044   6836    42%
Anon             1211203   4731    29%
Exec and libs       7648     29     0%
Page cache        220434    861     5%
Free (cachelist)  318625   1244     8%
Free (freelist)   659607   2576    16%
Total            4167561  16279
Physical         4078747  15932
On 3/23/07, Roch - PAE [EMAIL PROTECTED] wrote: With latest Nevada, setting zfs_arc_max in /etc/system is sufficient. Playing with mdb on a live system is more tricky and is what caused the problem here. -r [EMAIL PROTECTED] writes: Jim Mauro wrote: All righty...I set c_max to 512MB, c to 512MB, and p to 256MB... arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t16588228608 c02e29f8 uint64_t c = 0t33176457216 c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t33176457216 ... } c02e2a08 /Z 0x20000000 arc+0x48: 0x7b9789000 = 0x20000000 c02e29f8 /Z 0x20000000 arc+0x38: 0x7b9789000 = 0x20000000 c02e29f0 /Z 0x10000000 arc+0x30: 0x3dcbc4800 = 0x10000000 arc::print -tad { ... c02e29e8 uint64_t size = 0t299008 c02e29f0 uint64_t p = 0t268435456 -- p is 256MB c02e29f8 uint64_t c = 0t536870912 -- c is 512MB c02e2a00 uint64_t c_min = 0t1070318720 c02e2a08 uint64_t c_max = 0t536870912 --- c_max is 512MB ... } After a few runs of the workload ... arc::print -d size size = 0t536788992 Ah - looks like we're out of the woods. The ARC remains clamped at 512MB. Is there a way to set these fields using /etc/system? Or does this require a new or modified init script to run and do the above with each boot?
Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
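Following up on Roch's point above, a minimal sketch of the boot-time setting (the value is in bytes; 0x20000000 clamps the ARC to the 512MB used in the experiments in this thread):
* in /etc/system, followed by a reboot:
set zfs:zfs_arc_max = 0x20000000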
Re: [zfs-discuss] Re: zfs boot image conversion kit is posted
I'm not sure I understand the question. Virtual machines are built by either running a virtualization technology in a host operating system, such as running VMware Workstation in Linux, running Parallels in Mac OS X, Linux or Windows, etc. These are sometimes referred to as Type II VMMs, where the VMM (Virtual Machine Monitor - the chunk of software responsible for running the guest operating system) is hosted by a traditional operating system. In Type I VMMs, the VMM runs on the hardware. VMware ESX Server is an example of this (although some argue it is not, since technically there's an ESX kernel that runs on the hardware in support of the VMM). So building a virtual machine on a zpool would require that the host operating system supports ZFS. An example here would be our forthcoming (no, I do not know when) Solaris/Xen integration, assuming there is support for putting Xen domU's on ZFS. It may help to point out that when a virtual machine is created, it includes defining a virtual hard drive, which is typically just a file in the file system space of the hosting operating system. Given that, a hosting operating system that supports ZFS can allow for configuring virtual hard drives in the ZFS space. So I guess the answer to your question is theoretically yes, but I'm not aware of an implementation that would allow for such a configuration that exists today. I think I just confused the issue...ah well... /jim PS - FWIW, I have a zpool configured in nv62 running in a Parallels virtual machine on Mac OS X. The nv62 system disk is a virtual hard disk that exists as a file in Mac OS X HFS+, thus this particular zpool is a partition on that virtual hard drive. Lori Alt wrote: I was hoping that someone more well-versed in virtual machines would respond to this so I wouldn't have to show my ignorance, but no such luck, so here goes: Is it even possible to build a virtual machine out of a zfs storage pool? Note that it isn't just zfs as a root file system we're trying out. It's the whole concept of booting from a dataset within a storage pool. I don't know enough about how one sets up a virtual machine to know whether it's possible or even meaningful to talk about generating a b62-on-zfs virtual machine. Lori MC wrote: If the goal is to test ZFS as a root file system, could I suggest making a virtual machine of b62-on-zfs available for download? This would reduce duplicated effort and encourage new people to try it out. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
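To make the "virtual hard drives in the ZFS space" idea concrete, a hedged sketch (the pool, dataset name and size are hypothetical; whether a given VMM accepts a zvol as a disk depends entirely on that VMM):
# zfs create -V 8g tank/guest0-disk0
# ...then point the VMM's disk configuration at /dev/zvol/dsk/tank/guest0-disk0
A plain file in a ZFS filesystem works the same way for Type II VMMs that expect file-backed virtual disks.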
Re: Fwd: [zfs-discuss] Re: Mac OS X Leopard to use ZFS
Hello - I think L4 still needs to evolve. BTW, I believe microkernels are the _right_ way and L4 is a first step in that direction. Perhaps you could elaborate on this? I thought the microkernel debate ended in the 1990s, in terms of being a compelling technology direction for kernel development targeting general purpose computing. Sure, there may be a niche market for microkernels (which depends, in part, on your definition of what a microkernel is), but in terms of broad applicability, I thought the jury was in. CMU's Mach was the last run at this that had any momentum. Thank you. /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: [storage-discuss] Performance expectations of iscsi targets?
Paul, While testing iscsi targets exported from thumpers via 10GbE and imported via 10GbE on T2000s, I am not seeing the throughput I expect, and more importantly there is a tremendous amount of read IO happening on a purely sequential write workload. (Note all systems have Sun 10GbE cards and are running Nevada b65.) The read IO activity you are seeing is a direct result of re-writes on the ZFS storage pool. If you were to recreate the test from scratch, you would notice that on the very first pass of write I/Os from 'dd', there would be no reads. This is an artifact of using zvols as backing store for iSCSI Targets. The iSCSI Target software supports raw SCSI disks, Solaris raw devices (/dev/rdsk/), Solaris block devices (/dev/dsk/...), zvols, SVM volumes, and files in file systems, including tmpfs. Simple write workload (from T2000): # time dd if=/dev/zero of=/dev/rdsk/c6t01144F210ECC2A004675E957d0 bs=64k count=100 A couple of things, maybe missing here, or the commands are not a true cut-n-paste of what is being tested. 1). From the iSCSI initiator, there is no device at /dev/rdsk/c6t01144F210ECC2A004675E957d0; note the missing slice (s0, s1, s2, etc). 2). Even if one were to specify a slice, as in /dev/rdsk/c6t01144F210ECC2A004675E957d0s2, it is unlikely that the LUN has been formatted. When I run format the first time, I get the error message Please run fdisk first. Of course this does not have to be the case, because if the ZFS storage pool that backed this LUN had previously been formatted with either a Solaris VTOC or Intel EFI label, then the disk would show up correctly. Performance of iscsi target pool on new blocks: bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 17.4G 2.70T 0 526 0 63.6M thumper1-vdev0 17.5G 2.70T 0 564 0 60.5M thumper1-vdev0 17.5G 2.70T 0 0 0 0 thumper1-vdev0 17.5G 2.70T 0 0 0 0 thumper1-vdev0 17.5G 2.70T 0 0 0 0 Configuration of zpool/iscsi target: # zpool status thumper1-vdev0 pool: thumper1-vdev0 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM thumper1-vdev0 ONLINE 0 0 0 c0t7d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c6t7d0 ONLINE 0 0 0 c7t7d0 ONLINE 0 0 0 c8t7d0 ONLINE 0 0 0 errors: No known data errors The first thing is that for this pool I was expecting 200-300MB/s throughput, since it is a simple stripe across six 500G disks. In fact, a direct local workload (directly on thumper1) of the same type confirms what I expected: bash-3.00# dd if=/dev/zero of=/dev/zvol/rdsk/thumper1-vdev0/iscsi bs=64k count=100 bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 20.4G 2.70T 0 2.71K 0 335M thumper1-vdev0 20.4G 2.70T 0 2.92K 0 374M thumper1-vdev0 20.4G 2.70T 0 2.88K 0 368M thumper1-vdev0 20.4G 2.70T 0 2.84K 0 363M thumper1-vdev0 20.4G 2.70T 0 2.57K 0 327M The second thing is that when overwriting already-written blocks via the iscsi target (from the T2000), I see a lot of read bandwidth for blocks that are being completely overwritten. This does not seem to slow down the write performance, but 1) it is not seen in the direct case; and 2) it consumes channel bandwidth unnecessarily.
bash-3.00# zpool iostat thumper1-vdev0 1 thumper1-vdev0 8.90G 2.71T 279 783 31.7M 95.9M thumper1-vdev0 8.90G 2.71T 281 318 31.7M 29.1M thumper1-vdev0 8.90G 2.71T 139 0 15.8M 0 thumper1-vdev0 8.90G 2.71T 279 0 31.7M 0 thumper1-vdev0 8.90G 2.71T 139 0 15.8M 0 Can anyone help to explain what I am seeing, or give me some guidance on diagnosing the cause of the following: - The bottleneck in accessing the iscsi target from the T2000 From the iSCSI Initiator's point of view, there are various (Negotiated) Login Parameters, which may have a direct effect on performance. Take a look at iscsiadm list target --verbose, then consult the iSCSI man pages, or documentation online at docs.sun.com. Remember to keep track of what you change on a per-target basis, only change one parameter at a time, and measure your results. - The cause of the extra read bandwidth when overwriting blocks on the iscsi target from the T2000. ZFS is the backing store, and it COWs (copy-on-write) when maintaining the ZFS zvols within the storage pool. Any help is much appreciated, paul ___ storage-discuss mailing list [EMAIL PROTECTED] http://mail.opensolaris.org/mailman/listinfo/storage-discuss Jim Dunham Solaris, Storage Software
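A quick way to confirm Jim's first-pass observation on a scratch zvol (the dataset name and sizes are hypothetical; the point is only the read column of zpool iostat):
# zfs create -V 10g thumper1-vdev0/iscsi-test
# dd if=/dev/zero of=/dev/zvol/rdsk/thumper1-vdev0/iscsi-test bs=64k count=10000
# zpool iostat thumper1-vdev0 1   (first pass over fresh blocks: writes only)
# dd if=/dev/zero of=/dev/zvol/rdsk/thumper1-vdev0/iscsi-test bs=64k count=10000
# zpool iostat thumper1-vdev0 1   (second pass: read traffic appears as existing zvol blocks are copy-on-written)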
[zfs-discuss] ZFS test suite released on OpenSolaris.org
The ZFS test suite is being released today on OpenSolaris.org along with the Solaris Test Framework (STF), Checkenv and Runwattr test tools. The source tarball, binary package and baseline can be downloaded from the test consolidation download center at http://dlc.sun.com/osol/test/downloads/current. And, the source code can be viewed in the Solaris Test Collection (STC) 2.0 source tree at: http://cvs.opensolaris.org/source/xref/test/ontest-stc2/src/suites/zfs. The STF, Checkenv and Runwattr packages must be installed prior to executing a ZFS test run. More information is available in the ZFS README file and on the ZFS test suite webpage at: http://opensolaris.org/os/community/zfs/zfstestsuite. Any questions about the ZFS test suite can be sent to zfs discuss at: http://www.opensolaris.org/os/community/zfs/discussions. Any questions about STF, and the test tools can be sent to testing discuss at: http://www.opensolaris.org/os/community/testing/discussions. Happy Hunting, Jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Sharemgr Test Suite Released on OpenSolaris.org
The Sharemgr test suite is available on OpenSolaris.org. The source tarball, binary package and baseline can be downloaded from the test consolidation download center at: http://dlc.sun.com/osol/test/downloads/current The source code can be viewed in the Solaris Test Collection (STC) 2.0 source tree at: http://cvs.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share The SUNWstc-tetlite package must be installed prior to executing a Sharemgr test run. More information on the Sharemgr test suite is available in the Sharemgr README file at: http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share/README Any questions about the Sharemgr test suite can be sent to testing discuss at: http://www.opensolaris.org/os/community/testing/discussions Cheers, Jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does iSCSI target support SCSI-3 PGR reservation ?
A quick look through the source would seem to indicate that the PERSISTENT RESERVE commands are not supported by the Solaris iSCSI target at all. Correct. There is an RFE outstanding for the iSCSI Target to implement PGR for both raw SCSI-3 devices and block devices. http://bugs.opensolaris.org/view_bug.do?bug_id=6415440 http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/iscsi/iscsitgtd/t10_spc.c This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Solaris, Storage Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 Email: [EMAIL PROTECTED] http://blogs.sun.com/avs ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] New version of the ZFS test suite released
Version 1.8 of the ZFS test suite was released today on opensolaris.org. The ZFS test suite source tarballs, packages and baseline can be downloaded at: http://dlc.sun.com/osol/test/downloads/current/ The ZFS test suite source can be browsed at: http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/zfs/ More information on the ZFS test suite is at: http://opensolaris.org/os/community/zfs/zfstestsuite/ Questions about the ZFS test suite can be sent to zfs-discuss at: http://www.opensolaris.org/jive/forum.jspa?forumID=80 Cheers, Jim This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] do zfs filesystems isolate corruption?
Chris, In the old days of UFS, on occasion one might create multiple file systems (using multiple partitions) on a large LUN if filesystem corruption was a concern. It didn't happen often, but filesystem corruption has happened. So, if filesystem X was corrupt, filesystem Y would be just fine. With ZFS, does the same logic hold true for two filesystems coming from the same pool? For the purposes of isolating corruption, the separation of two or more filesystems coming from the same ZFS storage pool does not help. An entire ZFS storage pool is the unit of I/O consistency, as all ZFS filesystems created within this single storage pool share the same physical storage. When configuring a ZFS storage pool, the [poor] decision of choosing a non-redundant (single disk or concatenation of disks) versus redundant (mirror, raidz, raidz2) storage pool offers no means for ZFS to automatically recover from some forms of corruption. Even when using a redundant storage pool, there are scenarios in which this is not good enough. This is when filesystem needs transition into availability needs, such as when the loss or inaccessibility of two or more disks causes mirroring or raidz to be ineffective. As of Solaris Express build 68, Availability Suite [http://www.opensolaris.org/os/project/avs/] is part of base Solaris, offering both local snapshots and remote mirrors, both of which work with ZFS. Locally, on a single Solaris host, snapshots of the entire ZFS storage pool can be taken at intervals of one's choosing, and with multiple snapshots of a single master, collections of snapshots, say at intervals of one hour, can be retained. Options allow for 100% independent snapshots (much like your UFS analogy above), dependent snapshots where only the copy-on-write data is retained, or compact dependent snapshots where the snapshot's physical storage is some percentage of the master. Remotely, between two or more Solaris hosts, remote mirrors of the entire ZFS storage pool can be configured, where synchronous replication can offer zero data loss, or asynchronous replication can offer near zero data loss, both offering write-order, on-disk consistency. A key aspect of remote replication with Availability Suite is that the replicated ZFS storage pool can be quiesced on the remote node and accessed, or in a disaster recovery scenario, take over instantly where the primary left off. When the primary site is restored, the MTTR (Mean Time To Recovery) is essentially zero, since Availability Suite supports on-demand pull, so yet-to-be-replicated blocks are retrieved synchronously, allowing the ZFS filesystem and applications to be resumed without waiting for a potentially lengthy resynchronization. Said slightly differently, I'm assuming that if the pool becomes mangled somehow then all filesystems will be toast ... but is it possible to have one filesystem be corrupted while the other filesystems are fine? Hmmm, does the answer depend on whether the filesystems are nested? ex 1: /my_fs_1 /my_fs_2 ex 2: /home_dirs /home_dirs/chris TIA! This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Solaris, Storage Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 Email: [EMAIL PROTECTED] http://blogs.sun.com/avs ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
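As a concrete illustration of the AVS features described above, a minimal sketch of enabling a local point-in-time snapshot and a remote mirror; all device paths and host names are hypothetical, and the option syntax should be checked against iiadm(1M) and sndradm(1M):

# independent point-in-time snapshot (master, shadow, bitmap volumes)
iiadm -e ind /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t2d0s0
# synchronous, write-order consistent remote mirror of the same master to hostB
sndradm -n -e hostA /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t3d0s0 \
              hostB /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t3d0s0 ip sync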
Re: [zfs-discuss] ZFS Under the Hood Presentation Slides
Is the referenced Laminated Handout on slide 3 available anywhere in any form electronically? If not, I'd be happy to create an electronic copy and make it publicly available. Thanks, /jim Joy Marshall wrote: It's taken a while but at last we have been able to post the ZFS Under the Hood presentation slides from the session back at May's LOSUG. You can view both the presentation slides and a layered overview here: Presentation: http://www.opensolaris.org/os/community/os_user_groups/losug/ZFS-UTH_3_v1.1_LOSUG.pdf Overview: http://www.opensolaris.org/os/community/os_user_groups/losug/ZFS-UTH_LayeredOverview_v2.3.pdf Joy This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Single SAN Lun presented to 4 Hosts
Rainer, If you are looking for a means to safely READ any filesystem, please take a look at Availability Suite. One can safely take Point-in-Time copies of any Solaris-supported filesystem, including ZFS, at any snapshot interval of one's choosing, and then access the shadow volume on any system within the SAN, be it Fibre Channel or iSCSI. If the node wanting access to the data is distant, Availability Suite also offers Remote Replication. http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ Jim Ronald, thanks for your comments. I was thinking about this scenario: Host w continuously has a UFS mounted with read/write access. Host w writes to the file f/ff/fff. Host w ceases to touch anything under f. Three hours later, host r mounts the file system read-only, reads f/ff/fff, and unmounts the file system. My assumption was: a1) This scenario won't hurt w, a2) this scenario won't damage the data on the file system, a3) this scenario won't hurt r, and a4) the read operation will succeed, even if w continues with arbitrary I/O, except that it doesn't touch anything under f until after r has unmounted the file system. Of course everything that you and Tim and Casper said is true, but I'm still inclined to try that scenario. Rainer ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Solaris, Storage Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 Email: [EMAIL PROTECTED] http://blogs.sun.com/avs ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
I'll take a look at this. ZFS provides outstanding sequential IO performance (both read and write). In my testing, I can essentially sustain hardware speeds with ZFS on sequential loads. That is, assuming 30-60MB/sec per disk sequential IO capability (depending on hitting inner or outer cylinders), I get linear scale-up on sequential loads as I add disks to a zpool, e.g. I can sustain 250-300MB/sec on a 6 disk zpool, and it's pretty consistent for raidz and raidz2. Your numbers are in the 50-90MB/second range, or roughly 1/2 to 1/4 of what was measured on the other 2 file systems for the same test. Very odd. Still looking... Thanks, /jim Jeffrey W. Baker wrote: I have a lot of people whispering zfs in my virtual ear these days, and at the same time I have an irrational attachment to xfs based entirely on its lack of the 32000 subdirectory limit. I'm not afraid of ext4's newness, since really a lot of that stuff has been in Lustre for years. So a-benchmarking I went. Results at the bottom: http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html Short version: ext4 is awesome. zfs has absurdly fast metadata operations but falls apart on sequential transfer. xfs has great sequential transfer but really bad metadata ops, like 3 minutes to tar up the kernel. It would be nice if mke2fs would copy xfs's code for optimal layout on a software raid. The mkfs defaults and the mdadm defaults interact badly. Postmark is a somewhat bogus benchmark with some obvious quantization problems. Regards, jwb ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
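For anyone wanting to reproduce this kind of sequential measurement, the pattern used throughout this thread is simply (pool and file names are illustrative):

# stream sequential writes into the pool and watch the sustained bandwidth
dd if=/dev/zero of=/tank/seqtest bs=128k count=80000 &
zpool iostat tank 1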
Re: [zfs-discuss] (politics) Sharks in the waters
About 2 years ago I was able to get a little closer to the patent litigation process, by way of giving a deposition in litigation that was filed against Sun and Apple (and has since been settled). Apparently, there's an entire sub-economy built on patent litigation among the technology players. Suits, counter-suits, counter-counter-suits, etc., are just part of everyday business. And the money that gets poured down the drain! Here's an example. During my deposition, the lawyer questioning me opened a large box and removed 3 sets of a 500+ slide deck created by myself and Richard McDougall for seminars and tutorials on Solaris. Each set was color print on heavy, glossy paper. That represented color printing of about 1600 pages total. All so the attorney could question me about 2 of the slides. I almost fell off my chair. /jim Rob Windsor wrote: http://news.com.com/NetApp+files+patent+suit+against+Sun/2100-1014_3-6206194.html I'm curious how many of those patent filings cover technologies that they carried over from Auspex. While it is legal for them to do so, it is a bit shady to inherit technology (two paths; employees departing Auspex and the Auspex bankruptcy asset buyout), file patents against that technology, and then open suits against other companies based on (patents covering) that technology. (No, I'm not defending Sun in its apparent patent-growling, either; it all sucks IMO.) Rob++ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about uberblock blkptr
Hey Max - Check out the on-disk specification document at http://opensolaris.org/os/community/zfs/docs/. The illustration on page 32 shows the rootbp pointing to a dnode_phys_t object (the first member of an objset_phys_t data structure). The source code indicates ub_rootbp is a blkptr_t, which contains a 3-member array of dva_t's called blk_dva (blk_dva[3]). Each dva_t is a 2-member array of 64-bit unsigned ints (dva_word[2]). So it looks like each blk_dva contains 3 128-bit DVAs. You probably figured all this out already... did you try using an objset_phys_t to format the data? Thanks, /jim [EMAIL PROTECTED] wrote: Hi All, I have modified mdb so that I can examine data structures on disk using ::print. This works fine for disks containing ufs file systems. It also works for zfs file systems, but... I use the dva block number from the uberblock_t to print what is at the block on disk. The problem I am having is that I can not figure out what (if any) structure to use. All of the xxx_phys_t types that I try do not look right. So, the question is, just what is the structure that the uberblock_t dva's refer to on the disk? Here is an example: First, I use zdb to get the dva for the rootbp (should match the value in the uberblock_t(?)).

# zdb - usbhard | grep -i dva
Dataset mos [META], ID 0, cr_txg 4, 1003K, 167 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=0:111f79000:200 DVA[1]=0:506bde00:200 DVA[2]=0:36a286e00:200 fletcher4 lzjb LE contiguous birth=621838 fill=167 cksum=84daa9667:365cb5b02b0:b4e531085e90:197eb9d99a3beb
bp = [L0 DMU objset] 400L/200P DVA[0]=0:111f6ae00:200 DVA[1]=0:502efe00:200 DVA[2]=0:36a284e00:200 fletcher4 lzjb LE contiguous birth=621838 fill=34026 cksum=cd0d51959:4fef8f217c3:10036508a5cc4:2320f4b2cde529
Dataset usbhard [ZPL], ID 5, cr_txg 4, 15.7G, 34026 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=0:111f6ae00:200 DVA[1]=0:502efe00:200 DVA[2]=0:36a284e00:200 fletcher4 lzjb LE contiguous birth=621838 fill=34026 cksum=cd0d51959:4fef8f217c3:10036508a5cc4:2320f4b2cde529
first block: [L0 ZIL intent log] 9000L/9000P DVA[0]=0:36aef6000:9000 zilog uncompressed LE contiguous birth=263950 fill=0 cksum=97a624646cebdadb:fd7b50f37b55153b:5:1
^C
#

Then I run my modified mdb on the vdev containing the usbhard pool:

# ./mdb /dev/rdsk/c4t0d0s0

I am using the DVA[0] for the META dataset above. Note that I have tried all of the xxx_phys_t structures that I can find in the zfs source, but none of them look right. Here is example output dumping the data as an objset_phys_t. (The shift by 9 and adding 0x400000 is from the zfs on-disk format paper; I have tried without the addition, without the shift, in all combinations, but the output still does not make sense.)

> (111f79000<<9)+400000::print zfs`objset_phys_t
{
    os_meta_dnode = {
        dn_type = 0x4f
        dn_indblkshift = 0x75
        dn_nlevels = 0x82
        dn_nblkptr = 0x25
        dn_bonustype = 0x47
        dn_checksum = 0x52
        dn_compress = 0x1f
        dn_flags = 0x82
        dn_datablkszsec = 0x5e13
        dn_bonuslen = 0x63c1
        dn_pad2 = [ 0x2e, 0xb9, 0xaa, 0x22 ]
        dn_maxblkid = 0x20a34fa97f3ff2a6
        dn_used = 0xac2ea261cef045ff
        dn_pad3 = [ 0x9c2b4541ab9f78c0, 0xdb27e70dce903053, 0x315efac9cb693387, 0x2d56c54db5da75bf ]
        dn_blkptr = [ {
            blk_dva = [ {
                dva_word = [ 0x87c9ed7672454887, 0x760f569622246efe ]
            } {
                dva_word = [ 0xce26ac20a6a5315c, 0x38802e5d7cce495f ]
            } {
                dva_word = [ 0x9241150676798b95, 0x9c6985f95335742c ]
            } ]

None of this looks believable. So, just what is the rootbp in the uberblock_t referring to?
thanks, max ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] io:::start and zfs filenames?
Hi Neel - Thanks for pushing this out. I've been tripping over this for a while. You can instrument zfs_read() and zfs_write() to reliably track filenames:

#!/usr/sbin/dtrace -s
#pragma D option quiet
zfs_read:entry, zfs_write:entry
{
    printf("%s of %s\n", probefunc, stringof(args[0]->v_path));
}

I'm not sure why io:::start does not work for ZFS. I didn't spend any real time on this, but it appears none of the ZFS code calls bdev_strategy() directly, and instrumenting bdev_strategy:entry (which is where io:::start lives) to track filenames via stringof(args[0]->b_vp->v_path) does not work either. Use the zfs r/w function entry points for now. What sayeth the ZFS team regarding the use of a stable DTrace provider with their file system? Thanks, /jim Neelakanth Nadgir wrote: The io:::start probe does not seem to get zfs filenames in args[2]->fi_pathname. Any ideas how to get this info? -neel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] io:::start and zfs filenames?
What sayeth the ZFS team regarding the use of a stable DTrace provider with their file system? For the record, the above has a tone to it that I really did not intend (antagonistic?), so I had a good chat with Roch about this. The file pathname is derived via a translator from the vnode v_path structure member, and thus requires an instantiated vnode when the probe fires - this is why instrumenting bdev_strategy:entry and tracing args[0]->b_vp->v_path has the same problem; no vnode. An alternative approach to tracking filenames with IOs is using the fsinfo provider (Solaris 10 Update 2). This is a handy place to start:

#!/usr/sbin/dtrace -s
#pragma D option quiet
fsinfo:::
/ execname != "dtrace" /
{
    @[execname, args[0]->fi_pathname, args[0]->fi_fs, probename] = count();
}
END
{
    printf("%-16s %-24s %-8s %-16s %-8s\n", "EXEC", "PATH", "FS", "NAME", "COUNT");
    printa("%-16s %-24s %-8s %-16s %@8d\n", @);
}

Which yields...

EXEC             PATH                     FS       NAME             COUNT
gnome-panel      /zp                      ufs      lookup           1
gnome-panel      /zp/home                 zfs      lookup           1
gnome-panel      /zp/home/mauroj          zfs      lookup           1
gnome-panel      /zp/home/mauroj/.recently-used.xbel.HKF3YT  zfs  getattr  1
gnome-panel      /zp/home/mauroj/.recently-used.xbel.HKF3YT  zfs  lookup   1
<snip>
metacity         <unknown>                sockfs   poll             1031
vmware-user      <unknown>                sockfs   poll             1212
Xorg             <unknown>                sockfs   rwlock           1573
Xorg             <unknown>                sockfs   rwunlock         1573
gnome-terminal   <unknown>                sockfs   poll             2084
dbwriter         /zp/space                zfs      realvp           4254
dbwriter         /zp/space                zfs      remove           4254
dbwriter         /zp/space/f33            zfs      close            4254
dbwriter         /zp/space/f33            zfs      lookup           4254
dbwriter         /zp/space/f33            zfs      read             4254
dbwriter         /zp/space/f33            zfs      realvp           4254
dbwriter         /zp/space/f33            zfs      seek             4254
dbwriter         /zp/space/f33            zfs      write            4254
dbwriter         /zp/space                zfs      getsecattr       4255
dbwriter         /zp/space/f33            zfs      ioctl            4255
dbwriter         /zp/space/f33            zfs      open             4255
dbwriter         <unknown>                zfs      create           4255
dbwriter         /zp/space/f33            zfs      rwunlock         8508
dbwriter         /zp/space                zfs      lookup           8509
dbwriter         /zp/space/f33            zfs      rwlock           8509
dbwriter         /zp                      ufs      lookup           8515

Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] io:::start and zfs filenames?
Hey Neel - Try this:

nv70b# cat zfs_page.d
#!/usr/sbin/dtrace -s
#pragma D option quiet
zfs_putpage:entry
{
    printf("zfs write to %s\n", stringof(args[0]->v_path));
}
zfs_getpage:entry
{
    printf("zfs read from %s\n", stringof(args[0]->v_path));
}

I did some quick tests with mmap'd ZFS files, and it seems to work. /jim Neelakanth Nadgir wrote: Jim, I can't use zfs_read/write as the file is mmap()'d, so no read/write! -neel On Sep 26, 2007, at 5:07 AM, Jim Mauro [EMAIL PROTECTED] wrote: Hi Neel - Thanks for pushing this out. I've been tripping over this for a while. You can instrument zfs_read() and zfs_write() to reliably track filenames: #!/usr/sbin/dtrace -s #pragma D option quiet zfs_read:entry, zfs_write:entry { printf("%s of %s\n", probefunc, stringof(args[0]->v_path)); } I'm not sure why io:::start does not work for ZFS. I didn't spend any real time on this, but it appears none of the ZFS code calls bdev_strategy() directly, and instrumenting bdev_strategy:entry (which is where io:::start lives) to track filenames via stringof(args[0]->b_vp->v_path) does not work either. Use the zfs r/w function entry points for now. What sayeth the ZFS team regarding the use of a stable DTrace provider with their file system? Thanks, /jim Neelakanth Nadgir wrote: The io:::start probe does not seem to get zfs filenames in args[2]->fi_pathname. Any ideas how to get this info? -neel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolar ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Hey Roch - We do not retain 2 copies of the same data. If the DB cache is made large enough to consume most of memory, the ZFS copy will quickly be evicted to stage other I/Os on their way to the DB cache. What problem does that pose? Can't answer that question empirically, because we can't measure this, but I imagine there's some overhead to ZFS cache management in evicting and replacing blocks, and that overhead could be eliminated if ZFS could be told not to cache the blocks at all. Now, obviously, whether this overhead would be in the noise level, or something that actually hurts sustainable performance, will depend on several things, but I can envision scenarios where it's overhead I'd rather avoid if I could. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
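Worth noting for later readers: more recent ZFS builds added a per-dataset cache control that approximates this "don't cache the blocks" request. A minimal sketch, assuming a build where the primarycache property exists (it is not in all Solaris 10 updates; dataset name is illustrative):

# keep only metadata in the ARC for the database's dataset
zfs set primarycache=metadata tank/db
# or cache nothing at all for that dataset
zfs set primarycache=none tank/db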
Re: [zfs-discuss] Direct I/O ability with zfs?
Where does the win come from with direct I/O? Is it 1), 2), or some combination? If it's a combination, what's the percentage of each towards the win? That will vary based on workload (I know, you already knew that ... :^). Decomposing the performance win between what is gained as a result of single-writer lock breakup and no caching is something we can only guess at, because, at least for UFS, you can't do just one - it's all or nothing. We need to tease 1) and 2) apart to have a full understanding. We can't. We can only guess (for UFS). My opinion - it's a must-have for ZFS if we're going to get serious attention in the database space. I'll bet dollars-to-donuts that, over the next several years, we'll burn many tens of millions of dollars on customer support escalations that come down to memory utilization issues and contention between database-specific buffering and the ARC. This is entirely my opinion (not that of Sun), and I've been wrong before. Thanks, /jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS File system and Oracle raw files compatibility
If the question is whether Oracle files (datafiles, log files, etc.) can exist on ZFS, the answer is absolutely yes. More simply put, can you configure your Oracle database on ZFS - absolutely. The question, as stated, is confusing, because the term compatible can have pretty broad meaning, so I answered the question I think you wanted to ask. Thanks, /jim Dale Pannell wrote: I have a customer that would like to know if the ZFS file system is compatible with Oracle raw files. Any help you can provide is greatly appreciated. Please respond directly to me since I am not part of the zfs-discuss email alias. //Dale Pannell// SR Systems Engineer Office: 972.546.4111 Mobile: 214.284.6057 Email: [EMAIL PROTECTED] *Sun Storage Group* ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
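One practical, hedged note when laying out Oracle datafiles on ZFS (the dataset name is illustrative; 8 KB matches a typical Oracle db_block_size, and recordsize only affects files written after it is set):

# match the ZFS recordsize to the database block size before creating datafiles
zfs create tank/oradata
zfs set recordsize=8k tank/oradata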
Re: [zfs-discuss] ZFS mirroring
Mertol, Hi; Do any of you know when ZFS remote mirroring will be available? Host-based replication of ZFS, and all other Solaris filesystems, is available using Sun StorageTek Availability Suite. AVS has been part of OpenSolaris since build 68. http://www.opensolaris.org/os/project/avs/ Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 http://blogs.sun.com/avs regards Mertol Ozyoney Storage Practice - Sales Manager Sun Microsystems, TR Istanbul TR Phone +902123352200 Mobile +905339310752 Fax +90212335 Email [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iSCSI target using ZFS filesystem as backing
John, I'm working on a Sun Ultra 80 M2 workstation. It has eight 750 GB SATA disks installed. I've tried the following on both ON build 72, Solaris 10 update 4, and Indiana with the same results. If I create a ZFS filesystem using 1-7 hard drives (I've tried 1 and 7), and then try to make an iSCSI target on that pool, when a client machine tries to access the iSCSI volume, the memory usage on the Ultra 80 grows to the same size as the ZFS filesystem. For example: I'm creating a RaidZ ZFS pool:

zpool create -f telephone raidz c9d0 c10d0 c11d0 c12d0 c13d0 c14d0 c15d0

I then create a two terabyte ZFS volume (zvol) in that pool:

zfs create -V 2000g telephone/jelley

And make it into an iSCSI target:

iscsitadm create target -b /dev/zvol/dsk/telephone/jelley jelley

Try changing from a cached ZVOL to a raw ZVOL:

iscsitadm create target -b /dev/zvol/rdsk/telephone/jelley jelley

You can also try:

zfs set shareiscsi=on telephone/jelley

- Jim

Now if I perform an 'iscsitadm list target', the iSCSI target appears like it should:

Target: jelley
 iSCSI Name: iqn.1986-03.com.sun:02:fcaa1650-f202-4fef-b44b-b9452a237511.jelley
 Connections: 0

Now when I try to connect to it with my Windows 2003 server running the MS iSCSI initiator, I see the memory usage climb to the point that it totally exhausts all available physical memory (prstat):

  PID USERNAME  SIZE   RSS  STATE PRI NICE     TIME  CPU PROCESS/NLWP
  511 root     2000G  106M  sleep  59    0  0:02:58 1.1% iscsitgtd/15
 2139 root     8140K 4204K  sleep  59    0  0:00:00 0.0% sshd/1
 2164 root     3276K 2740K  cpu1   49    0  0:00:00 0.0% prstat/1
 2144 root     2672K 1752K  sleep  49    0  0:00:00 0.0% bash/1
  574 noaccess  173M   92M  sleep  59    0  0:03:18 0.0% java/25

Do you see the iscsitgtd process trying to use 2000 gigabytes of RAM? I can sit there and hold down spacebar while the Windows workstation is trying to access it, and the memory usage climbs at an astronomical rate, until it exhausts all the available memory on the box (several hundred megabytes per minute). The total RAM it tries to allocate depends totally on the size of the iSCSI volume. If it's a 1000 megabyte volume, then it only allocates a gig... if it's 600 gigs, it tries to allocate 600 gigs. Now here is the real kicker. I took this down to as simple a configuration as possible--one single drive with a ZFS filesystem on it. The memory utilization was the same. I then tried creating the iSCSI target on a UFS filesystem. Everything worked beautifully, and memory utilization was no longer directly proportional to the size of the iSCSI volume. If I create something small, like a 100 gig iSCSI target, the system does eventually get around to finishing and releases the RAM. What's really strange is when I try to access the iSCSI volume, the memory usage then climbs megabyte by megabyte until it is exhausted, and then access to the iSCSI volume is terribly slow. I can copy a 300 meg file in just six seconds when the memory utilization on the iscsitgtd process is low. But if I try a 2.5 gig file, once it gets about 1500 megs into it, performance drops about 99.9% and it's incredibly slow... again, until it's done and iscsitgtd releases the RAM, then it's plenty zippy for small IO operations. Has anybody else been making iSCSI targets on ZFS pools? I've had a case open with Sun since Oct 3, if any Sun folks want to look at the details (case #65684887). I'm getting very desperate to get this fixed, as this massive amount of storage was the only reason I got this M80... Any pointers would be greatly appreciated.
Thanks- John Tracy This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. 1617 Southwood Drive Nashua, NH 03063 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Would you two please SHUT THE F$%K UP. Dear God, my kids don't go on like this. Please - let it die already. Thanks very much. /jim can you guess? wrote: Hello can, Thursday, December 13, 2007, 12:02:56 AM, you wrote: cyg> On the other hand, there's always the possibility that someone cyg> else learned something useful out of this. And my question about To be honest - there's basically nothing useful in the thread, perhaps except one thing - it doesn't make any sense to listen to you. I'm afraid you don't qualify to have an opinion on that, Robert - because you so obviously *haven't* really listened. Until it became obvious that you never would, I was willing to continue to attempt to carry on a technical discussion with you, while ignoring the morons here who had nothing whatsoever in the way of technical comments to offer (but continued to babble on anyway). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What does dataset is busy actually mean?
I've hit the problem myself recently, and mounting the filesystem cleared something in the brains of ZFS and allowed me to snapshot. http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg00812.html PS: I'll use Google before asking some questions, a'la (C) Bart Simpson. That's how I found your question ;) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
With Point-in-Time Copy software, the software can be configured to automatically take a snapshot prior to re-synchronization, and automatically delete the snapshot if it completed successfully. The use of I/O consistency groups assures not only that the replicas are write-order consistent during replication, but also that snapshots taken prior to re-synchronization are consistent too. Thanks Steve This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Auto backup and auto restore of ZFS via Firewire drive
It's good he didn't mail you, now we all know some under-the-hood details via Googling ;) Thanks to both of you for this :) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backup/replication system
Łukasz K wrote: Hi, I'm using ZFS on a few X4500s and I need to back them up. The data on the source pool keeps changing, so online replication would be the best solution. As I understand it, AVS doesn't support ZFS - there is a problem with mounting the backup pool. This is not true, if replication is configured correctly. Where are you getting information about the aforementioned problem? Have you looked at the following? http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ Other backup systems (disk-to-disk or block-to-block) have the same problem with mounting a ZFS pool. I hope I'm wrong? In case of any problem I want the backup pool to be operational within 1 hour. Do you know any solution? --Lukas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backup/replication system
Eric, On Jan 10, 2008, at 4:50 AM, Łukasz K wrote: Hi I'm using ZFS on few X4500 and I need to backup them. The data on source pool keeps changing so the online replication would be the best solution. As I know AVS doesn't support ZFS - there is a problem with mounting backup pool. Other backup systems (disk-to-disk or block-to-block) have the same problem with mounting ZFS pool. I hope I'm wrong ? In case of any problem I want the backup pool to be operational within 1 hour. Do you know any solution ? If it doesn't need to be synchronous, then you can use 'zfs send -R'. The prior statement could lead one to believe that 'zfs send -R' is asynchronous replication, which it is not. The functionality ZFS provides via send/recv is known as time-fixed, or snapshot replication. Here, a non-changing data source, the snapshot, is synchronized from the source to destination node based on either a full or differential set of changes. Unlike synchronous or asynchronous replication, where data is continuously replicated in a write-order consistent manner, time-fixed replication is discontinuous, often driven by taking periodic snapshots of the changing data, performing the differential synchronization of the non-changing source data to the remote host, then waiting until the next interval. The most common problem with time-fixed replication is trying to determine, or calculate the periodic interval to use, since its optimal value is based on many variables, most of which are changing over time and usage patterns. eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
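To make the time-fixed (snapshot) model described above concrete, a minimal sketch of one replication interval with zfs send/recv; the pool, snapshot, and host names are hypothetical:

# initial full replication of the pool
zfs snapshot -r tank@rep1
zfs send -R tank@rep1 | ssh backuphost zfs recv -d backup
# each interval thereafter: snapshot again, send only the differences
zfs snapshot -r tank@rep2
zfs send -R -I tank@rep1 tank@rep2 | ssh backuphost zfs recv -d backup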
Re: [zfs-discuss] Break a ZFS mirror and concatenate the disks
Kory, Yes, I get it now. You want to detach one of the disks and then re-add the same disk, but lose the redundancy of the mirror. Just as long as you realize you're losing the redundancy. I'm wondering if zpool add will complain. I don't have a system to try this on at the moment. The correct, just-verified steps are as follows:

zpool detach moodle c2t3d0
zpool add moodle c2t3d0

I performed these steps while the zpool was online, under heavy I/O, with an I/O tool that does data validation. When done, I then performed a final zpool scrub moodle, with no issues, and then revalidated all the data. As stated earlier, sacrificing redundancy (RAID 1 mirroring) for double the storage (RAID 0 concatenation) is being penny wise and pound foolish. Jim Cindy Kory Wheatley wrote: Currently c2t2d0 and c2t3d0 are set up in a mirror. I want to break the mirror and save the data on c2t2d0 (both drives are 73 GB). Then I want to concatenate c2t2d0 to c2t3d0 so I have a pool of 146 GB, no longer in a mirror, just concatenated. But since they're mirrored right now I need the data saved on one disk so I don't lose everything. I don't need to add new disks - that's not an option. I want to break the mirror so I can expand the disks together in a pool but save the data. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iscsi on zvol
Jan, I'm wondering if it's possible to import a zpool on an iscsi-device LOCALLY. Following scenario: HostA (Sol10u4): - Pool-1 (a striped-raidz-pool) - iscsi-zvol on Pool-1 HostB (Sol10u3): - Pool-2 is a mirror of one local device and the iscsi-vol of HostA. Is it possible to mount the iscsi-vol (or import Pool-2) on HostA? No, due to a common misconception in the iSCSI space concerning an iSCSI Target's backing store versus the resulting LUN as seen by iSCSI Initiators. On HostA, the ZVOL called iscsi-zvol has a volume size, a size specified via zfs create -V <size> Pool-1/iscsi-zvol. When an iSCSI Target is created out of this ZVOL, the iSCSI Initiator discovers and enables this LUN on HostB, but this LUN is unformatted. In other words, this LUN does not contain a Solaris VTOC or an Intel EFI disk label, as it's just a bunch of blocks. When issuing the zpool create Pool-2 mirror local-disk iscsi-vol, an Intel EFI disk label is placed on the disk (consuming some of the blocks), then all the remaining space is placed in partition 0 (slice 0), after which ZFS lays down its filesystem metadata in the space occupied by partition 0. Now back on HostA, the ZVOL ( /dev/zvol/rdsk/Pool-1/iscsi-zvol ) looks like a bunch of blocks. Since this is a ZVOL, not a SCSI or iSCSI LUN, Solaris does not see the Intel EFI disk label, thus ZFS will not be able to see the ZFS filesystem metadata. So even though the ZVOL contains all the right data, from the point of view of Solaris, this disk is not a LUN, and thus can not be accessed as such. Jim I know, this is (also) iSCSI-related, but mostly a ZFS question. Thanks for your answers, Jan Dreyer ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iscsi on zvol
After posting my reply to the initial note on this thread, and then reading it again, I have some followup comments: The following statement should have said "... this ZVOL is not a LUN, ...". So even though the ZVOL contains all the right data, from the point of view of Solaris, this disk is not a LUN, and thus can not be accessed as such. But then could it be? On HostA, where the ZVOL (iscsi-zvol) is served out as an iSCSI Target, there is nothing to prevent the iSCSI Initiator on HostA from discovering the iSCSI Target on its own node. Doing so will create an iSCSI LUN, which will be seen by Solaris. This is an example of iSCSI loopback, which works quite well. This raises a key point that you should be aware of. ZFS does not support shared access to the same ZFS filesystem. If the ZFS storage pool Pool-2 is currently imported on HostB, an attempt to zpool import the iSCSI LUN on HostA will cause ZFS to report that this zpool is being accessed on another host, which it is: HostB. Do not try to force a zpool import of this iSCSI LUN, or a Solaris panic will soon follow. (See key point above.) If the ZFS storage pool Pool-2 is currently exported on HostB, an attempt to zpool import the iSCSI LUN on HostA will work, except that now 1/2 of the mirrored zpool will not be accessible, since it's a local device on HostB, and therefore not accessible. Maybe the local device on HostB should also be an iSCSI Target too. One more thing. ZFS and iSCSI start and stop at different times during Solaris boot and shutdown, so I would recommend using legacy mount points, or manual zpool import / exports, when trying configurations at this level. Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
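A short sketch of the loopback discovery described above, run on HostA against its own target (the discovery address is illustrative; commands per iscsiadm(1M)):

# point the initiator at the local target and enable SendTargets discovery
iscsiadm add discovery-address 127.0.0.1:3260
iscsiadm modify discovery --sendtargets enable
# create the device nodes for the newly discovered LUN
devfsadm -i iscsi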
Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]
Enrico, Hello, I'm offering a solution based on our disks where replication and storage management should be done using only ZFS... The test changes a few bytes in one file (10 bytes) and checks how many bytes the source sends to the target. The customer tried the replication between 2 volumes... They compared ZFS replication with TrueCopy replication and came to the following conclusions: 1. ZFS uses a bigger block than HDS TrueCopy. 2. TrueCopy sends 32Kbytes and ZFS 100K and more when changing only 10 file bytes. Can we configure ZFS to improve replication efficiency? The solution should consider 5 remote sites replicating to one central data-center. Considering the ZFS block overhead, the customer is thinking of buying a solution based on traditional storage arrays like HDS entry-level arrays (our 2530/2540). If so, with ZFS the network traffic and storage space become big problems for the customer's infrastructure. Is there any documentation explaining the internal ZFS replication mechanism to address the customer's doubts? Thanks. Do we need AVS in our solution to solve the problem? AVS, not unlike HDS, does block-based replication based on actual write I/Os to configured devices. Therefore if the means for ZFS to change 10 bytes results in ZFS writing 100KB or more, AVS will essentially be no different from HDS in this specific area. Of course this begs the question: is the measure of a 10 byte change to a given file a viable metric for choosing one form of replication over another? I think, or would hope, not. What is needed is a characterization of the application(s) write-rate to one or more ZFS filesystems, weighed against the customer's requirements for data replication. A good place to start is: http://www.sun.com/storagetek/white-papers/data_replication_strategies.pdf http://www.sun.com/storagetek/white-papers/enterprise_continuity.pdf Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29 The above link shows how to disable the ZIL for testing purposes (it's not generally recommended to keep it disabled in production). As to the putback schedule of recent ZFS features into Solaris 10, I'm afraid I don't have the information. Hopefully, someone else will know... Thanks, /jim Jonathan Loran wrote: Is it true that Solaris 10 u4 does not have any of the nice ZIL controls that exist in the various recent Open Solaris flavors? I would like to move my ZIL to solid state storage, but I fear I can't do it until I have another update. Heck, I would be happy to just be able to turn the ZIL off to see how my NFS on ZFS performance is affected before spending the $'s. Anyone know when will we see this in Solaris 10? Thanks, Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
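For convenience, what the Evil Tuning Guide page linked above describes boils down to the following (test environments only; the tunable name is from that guide and applies to this era of ZFS):

# /etc/system - takes effect on the next boot
set zfs:zil_disable = 1

# or dynamically, on a live system
echo zil_disable/W0t1 | mdb -kw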
Re: [zfs-discuss] ZFS replication strategies
Erast, Take a look at NexentaStor - it's a complete 2nd-tier solution: http://www.nexenta.com/products and AVS is nicely integrated via a management RPC interface which connects multiple NexentaStor nodes together and greatly simplifies AVS usage with ZFS... See demo here: http://www.nexenta.com/demos/auto-cdp.html Very nice job. It's refreshing to see something I know all too well, with an updated management interface, and a good portion of the plumbing hidden away. - Jim On Fri, 2008-02-01 at 10:15 -0800, Vincent Fox wrote: Does anyone have any particularly creative ZFS replication strategies they could share? I have 5 high-performance Cyrus mail-servers, with about a terabyte of storage each, of which only 200-300 gigs is used, even including 14 days of snapshot space. I am thinking about setting up a single 3511 with 4 terabytes of storage at a remote site as a backup device for the content. Struggling with how to organize the idea of wedging 5 servers into the one array though. Simplest way that occurs to me is one big RAID-5 storage pool with all disks. Then slice out 5 LUNs, each as its own ZFS pool. Then use zfs send/receive to replicate the pools. Ideally I'd love it if ZFS directly supported the idea of rolling snapshots out into slower secondary storage disks on the SAN, but in the meanwhile it looks like we have to roll our own solutions. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. wk: 781.442.4042 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mounting a copy of a zfs pool /file system while orginal is still active
Darren J Moffat wrote: Dave Lowenstein wrote: Nope, doesn't work. Try presenting one of those lun snapshots to your host, run cfgadm -al, then run zpool import.

# zpool import
no pools available to import

Does format(1M) see the luns? If format(1M) can't see them it is unlikely that ZFS will either. It would make my life so much simpler if you could do something like this: zpool import --import-as yourpool.backup yourpool

zpool import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile] [-D] [-f] [-R root] pool | id [newpool]

Imports a specific pool. A pool can be identified by its name or the numeric identifier. If newpool is specified, the pool is imported using the name newpool. Otherwise, it is imported with the same name as its exported name. Given that the pool is a snapshot of one or more vdevs in an existing ZFS storage pool, not only is the name identical, so is the numeric identifier. It can be determined that when using zpool import, duplicates are suppressed, even if those duplicates are entirely separate vdevs containing block-based snapshots, physical copies, remote mirrors or iSCSI Targets. The steps to reproduce this behavior on a single node, using files and standard Solaris utilities, are as follows:

# mkfile 500m /var/tmp/pool_file
# zpool create pool /var/tmp/pool_file
# zpool status pool
  pool: pool
 state: ONLINE
 scrub: none requested
config:
        NAME                 STATE  READ WRITE CKSUM
        pool                 ONLINE    0     0     0
        /var/tmp/pool_file   ONLINE    0     0     0
errors: No known data errors
# zpool export pool
# dd if=/var/tmp/pool_file of=/var/tmp/pool_snapshot
{ wait, wait, wait, ... more on this later ...}
1024000+0 records in
1024000+0 records out
# zpool import -d /var/tmp
  pool: pool
    id: 14424098069460077054
 state: ONLINE
action: The pool can be imported using its name or numeric identifier
config:
        pool                 ONLINE
        /var/tmp/pool_file   ONLINE

Question: What happened to the other ZFS storage pool called pool_snapshot? Answer: Its presence is suppressed by zpool import. If one was to rename /var/tmp/pool_file into some other directory, /var/tmp/pool_snapshot will now appear.

# mv /var/tmp/pool_file /var/pool_file
# zpool import -d /var/tmp
  pool: pool
    id: 14424098069460077054
 state: ONLINE
action: The pool can be imported using its name or numeric identifier
config:
        pool                     ONLINE
        /var/tmp/pool_snapshot   ONLINE

At this point, if one was to go ahead with the import of pool (which would work), then rename /var/pool_file back to /var/tmp/pool_file, its presence would now be suppressed. Conversely, if the rename was done first, then a zpool import was attempted, again only one storage pool would exist at any given time. Clearly there is some explicit suppressing of duplicate storage pools going on here. Browsing the ZFS code looking for an answer, the logic surrounding vdev_inuse() seems to cause this behavior, expected or not. http://cvs.opensolaris.org/source/search?q=vdev_inuse&project=%2Fonnv As mentioned earlier, the {wait, wait, wait, ...} can be eliminated by using Availability Suite Point-in-Time Copy, by itself, or in combination with Availability Suite Remote Copy or iSCSI Target, all of which are present in OpenSolaris today, and all are much faster than the dd utility. As one that supports both Availability Suite and iSCSI Target, not suppressing duplicate pool names and pool identifiers, in combination with a rename on import (zpool import <pool> <newpool>), would provide a means to support various copies, or nearly identical copies, of a ZFS storage pool on the same Solaris host.
While browsing the ZFS source code, I noticed that usr/src/cmd/ztest/ztest.c includes ztest_spa_rename(), a ZFS test which renames a ZFS storage pool to a different name, tests the pool under its new name, and then renames it back. I wonder why this functionality was not exposed as part of zpool support? - Jim # zpool import foopool barpool -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. work: 781.442.4042 cell: 603-724-3972 http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
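For what it's worth, the rename-on-import that zpool already supports covers part of this wish: importing a pool under a new name, per the synopsis quoted above (the new name below is hypothetical):

# import the pool from /var/tmp, giving it a different name on this host
zpool import -d /var/tmp pool pool_backup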
Re: [zfs-discuss] [Fwd: Re: Presales support on ZFS]
Enrico, Is there any forecast to improve the efficiency of the replication mechanisms of ZFS? Fishworks - the new NAS release. I would take some time to talk with the customer and understand exactly what their expectations are for replication. I would not base my decision on the cost of replicating 10 bytes, regardless of how inefficient it may be. These two documents should help: http://www.sun.com/storagetek/white-papers/data_replication_strategies.pdf http://www.sun.com/storagetek/white-papers/enterprise_continuity.pdf Two key metrics of replication are: Recovery Point Objective (RPO), the amount of data lost (or less), measured as a unit of time. Once-a-day backups yield a 24 hour RPO, once-an-hour snapshots yield a ~1 hour RPO, asynchronous replication yields zero seconds to a few minutes RPO, and synchronous replication means zero seconds RPO. Recovery Time Objective (RTO), the amount of time after a failure until normal operations are restored. Tape backups could be minutes to hours; local snapshots could be nearly instantaneous, assuming the local site survived the failure; remote snapshots or replicas could be minutes, hours or days, depending on the amount of data to resynchronize, impacted by network bandwidth and latency. Availability Suite has a unique feature in this last area, called on-demand pull. Assuming that the primary site's volumes are lost, after they have been re-provisioned, a reverse update can be initiated. Besides the background resilvering in the reverse direction being active, eventually restoring all lost data, on-demand pull performs synchronous replication of data blocks on demand, as needed by the filesystem, database or application. Although the performance will be less than synchronous replication, the RTO is quite low. This type of recovery is analogous to losing one's entire email account, having recovery initiated, but also having selected email opened as needed before the entire volume is restored, using on-demand requests to satisfy data blocks for relevant email requests. Jim Considering the solution we are offering to our customer (5 remote sites replicating to one central data-center) with ZFS (the cheapest solution), should I consider 3 times the network load of a solution based on SNDR-AVS and 3 times the storage space too... correct? Is there any documentation on that? Thanks Richard Elling wrote: Enrico Rampazzo wrote: Hello, I'm offering a solution based on our disks where replication and storage management should be done using only ZFS... The test changes a few bytes in one file (10 bytes) and checks how many bytes the source sends to the target. The customer tried the replication between 2 volumes... They compared ZFS replication with TrueCopy replication and realized the following: 1. ZFS uses a block bigger than HDS TrueCopy. ZFS uses dynamic block sizes. Depending on the configuration and workload, just a few disk blocks will change, or a bunch of redundant metadata might change. In either case, changing the ZFS recordsize will make little, if any, difference. 2. TrueCopy sends 32Kbytes and ZFS 100K and more when changing only 10 file bytes. Can we configure ZFS to improve replication efficiency? By default, ZFS writes two copies of metadata. I would not recommend reducing this because it will increase your exposure to faults. What may be happening here is that a 10 byte write may cause a metadata change resulting in a minimum of three 512 byte physical blocks being changed.
The metadata copies are spatially diverse, so you may see these three blocks starting at non-contiguous boundaries. If TrueCopy sends only 32kByte blocks (speculation), then the remote transfer will be 96kBytes for 3 local, physical block writes. OTOH, ZFS will coalesce writes. So you may be able to update a number of files yet still only replicate 96kBytes through TrueCopy. YMMV. Since the customer is performing replication, I'll assume they are very interested in data protection, so keeping the redundant metadata is a good idea. The customer should also be aware that replication at the application level is *always* more efficient than replicating somewhere down the software stack, where you lose data context. -- richard The solution should consider 5 remote sites replicating to one central data-center. Considering the ZFS block overhead the customer is thinking of buying a solution based on traditional storage arrays like HDS entry-level arrays (our 2530/2540). If so, with ZFS the network traffic and storage space become big problems for the customer's infrastructure. Is there any documentation explaining the internal ZFS replication mechanism to address the customer's doubts? Thanks. Do we need AVS in our solution to solve the problem? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http
Re: [zfs-discuss] iscsi core dumps when under IO
Stephen,

I am getting a strange issue when using zfs/iscsi shares out of it. When I have attached a CentOS 5 initiator to the ZFS target, it works fine normally until I start doing heavy 100MB/s+ copies to a separate CIFS/NFS export on the same ZFS pool. The error I am getting is:

[ Feb 20 10:41:07 Stopping because process dumped core. ]
[ Feb 20 10:41:07 Executing stop method (/lib/svc/method/svc-iscsitgt stop 143) ]

I was wondering if anyone had any ideas. I am running 10U4 with all of the latest and greatest patches. Thank you.

There is a set of issues, recently resolved in Nevada, regarding the iSCSI Target under load. We are looking at back-porting these changes to S10. The nature of the failure appears to be an iSCSI initiator seeing long service times (in seconds) and triggering a LUN reset. The LUN reset causes all I/O specific to that LUN to be cleaned up. Given the multi-threaded nature of the iSCSI Target, the odds are pretty high that cleanup could be attempted from every possible I/O state, and some of those states were not handled correctly. The following commands are likely to show the reason the process dumped core, namely an assert in the T10 state machine:

# mdb /core
> ::status
> ::quit

This message posted from opensolaris.org

Jim Dunham Storage Platform Software Group Sun Microsystems, Inc. http://blogs.sun.com/avs http://www.opensolaris.org/os/project/avs/ http://www.opensolaris.org/os/project/iscsitgt/ http://www.opensolaris.org/os/community/storage/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
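If it helps while waiting for the back-port, a short sketch for confirming the diagnosis - the service name and core-file locations are assumptions for a stock S10 iscsitgt setup:

# svcs -xv iscsitgt        (shows the maintenance/restart history of the target service)
# ls /core /var/core       (locate the core file the SMF log refers to)
# mdb /core
> ::status                 (should report the assert in the T10 state machine)
> ::stack
> ::quit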
Re: [zfs-discuss] Copying between pools
Vahid,

We need to move about 1T of data from one zpool on an EMC DMX-3000 to another storage device (a DMX-3). The DMX-3 can be made visible on the same host where the DMX-3000 is in use, or on another host. What is the best way to transfer the data from the DMX-3000 to the DMX-3? Is it possible to add the new DMX as a submirror of the old DMX and, after the sync is finished, remove the old DMX from the mirror?

See: zpool replace [-f] pool old_device [new_device]

- Jim

Thank you,

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
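That attach-sync-detach cycle is exactly what zpool replace automates: it mirrors the new device onto the old one and detaches the old device once resilvering completes. A minimal sketch, with hypothetical pool and device names:

# zpool replace tank c1t0d0 c2t0d0
# zpool status tank     (watch the resilver; the old device drops off when it finishes)

Note this works device by device, so each DMX-3 LUN must be at least as large as the DMX-3000 LUN it replaces.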
Re: [zfs-discuss] iSCSI targets mapped to a VMWare ESX server
Mertol Ozyoney wrote:

Hi All;

There is a set of issues being looked at that prevent the VMware ESX server from working with the Solaris iSCSI Target: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6597310 At this time there is no target date for when these issues will be resolved.

Jim

We are running the latest Solaris 10 on an X4500 Thumper. We defined a test iSCSI LUN. Output below:

Target: AkhanTemp/VM
    iSCSI Name: iqn.1986-03.com.sun:02:72406bf8-2f5f-635a-f64c-cb664935f3d1
    Alias: AkhanTemp/VM
    Connections: 0
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 01144fa709302a0047fa50e6
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size: 100G
            Backing store: /dev/zvol/rdsk/AkhanTemp/VM
            Status: online

We tried to access the LUN from a Windows laptop, and it worked without any problems. However, the VMware ESX 3.2 server is unable to access the LUNs. We checked that the virtual interface can ping the X4500. Sometimes it sees the LUN, but then 200+ LUNs with the same properties are listed and we can't add them as storage. Then, after a rescan, they vanish. Any help appreciated.

Mertol

Mertol Ozyoney Storage Practice - Sales Manager Sun Microsystems, TR Istanbul TR Phone +902123352200 Mobile +905339310752 Fax +90212335 Email [EMAIL PROTECTED]
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
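For anyone reproducing this, a listing of that shape can be regenerated on the target side - a sketch, assuming the stock Solaris 10 iscsitgt tooling:

# iscsitadm list target -v

Comparing that output before and after an ESX rescan at least shows whether the phantom 200+ LUNs are a target-side or an initiator-side artifact.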
Re: [zfs-discuss] Moving zfs pool to new machine?
Steve,

Can someone tell me, or point me to links that describe, how to do the following? I had a machine that crashed, and I want to move to a newer machine anyway. The boot disk on the old machine is fried. The two disks I was using for a ZFS pool on that machine need to be moved to a newer machine, now running 2008.05 OpenSolaris. What is the procedure for getting back the pool on the new machine without losing any of the files I had in that pool? I searched the docs but did not find a clear answer, and experimenting with various zfs and zpool commands did not show the two disks or their contents.

To see all pools available to import:

# zpool import

The list should include your prior storage pool's name. Then:

# zpool import pool-name

- Jim

The new disks are c6t0d0s0 and c6t1d0s0. They are identical disks that were set up in a mirrored pool on the old machine.

Thanks, Steve Christensen
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
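One wrinkle worth noting: since the old machine crashed rather than cleanly exporting the pool, zpool import may refuse with a warning that the pool was last accessed by another system. In that case the import must be forced - a sketch, with a hypothetical pool name:

# zpool import
  (lists importable pools, e.g. pool: mypool ... state: ONLINE)
# zpool import -f mypool

The -f only overrides the stale ownership recorded by the dead host; the mirrored data itself is untouched.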
Re: [zfs-discuss] Image with DD from ZFS partition
Hans,

hello, can I create an image from ZFS with the dd command?

Yes, with restrictions. First, a ZFS storage pool must be in the exported state (zpool export) while it is copied, so that a write-order-consistent set of data exists in the copy. ZFS does an excellent job of detecting inconsistencies in the volumes making up a single ZFS storage pool, so a copy of an imported storage pool is sure to be inconsistent, and thus unusable by ZFS.

Although there are various means to copy ZFS (actually, to copy the individual vdevs in a single ZFS storage pool), one cannot zpool import this copy on the same node as the original ZFS storage pool. Unlike other Solaris filesystems, ZFS maintains metadata on each vdev that is used to reconstruct a ZFS storage pool at zpool import time. The logic within zpool import processing will correctly find all constituent volumes (vdevs) of a single ZFS storage pool, but ultimately hides/excludes the other volumes (the copies) from being considered as part of the current or any other zpool import operation. Only the original, not its copy, can be seen or utilized by zpool import.

If possible, the ZFS copy can be moved to, or accessed from, another host (using dual-ported disks, FC SAN, iSCSI SAN, Availability Suite, etc.), and only there can the ZFS copy undergo a successful zpool import.

As a slight segue, Availability Suite (AVS) can create an instantly accessible copy of the constituent volumes (vdevs) of a ZFS storage pool, in lieu of using dd, which can take minutes or hours. This is the Point-in-Time Copy, or II (Instant Image), part of AVS. This copy can also be replicated to a remote Solaris host, where it can be imported. This is the Remote Copy, or SNDR (Network Data Replicator), part of AVS. AVS also supports synchronously or asynchronously replicating the actual ZFS storage pool to another host (no local copy needed), where the replica can then be zpool imported. See: opensolaris.org/os/project/avs/, plus the demos.

when I work with Linux I use partimage to create an image from one partition and store it on another, so I can restore it if an error occurs. partimage does not work with ZFS, so I must use the dd command. I think so:

dd if=/dev/sda1 of=/backup/image

can I create an image this way, and restore it the other way:

dd if=/backup/image of=/dev/sda1

when I have two partitions with ZFS, can I boot from the live CD and mount one partition to use it as a backup target? Or is it possible to create an ext2 partition and use a Linux rescue CD to back up the ZFS partition with dd?

This message posted from opensolaris.org

-- Jim Dunham Engineering Manager Storage Platform Software Group Sun Microsystems, Inc.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
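To make the restriction concrete, a minimal sketch of a consistent dd copy on Solaris - the pool and device names are hypothetical, and every vdev in the pool must be copied:

# zpool export tank
# dd if=/dev/rdsk/c1t0d0s0 of=/backup/c1t0d0s0.img bs=1048576
# (repeat for each vdev in tank)
# zpool import tank

Restoring is each dd reversed; as noted above, the images can only be zpool imported on a host that does not also see the original vdevs.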
[zfs-discuss] SMC Webconsole 3.1 and ZFS Administration 1.0 - stacktraces in snv_b89
I've installed SXDE (snv_89) and found that the web console only listens on https://localhost:6789/ now, and the module for ZFS administration doesn't work. When I open the link, the left frame lists a stacktrace (below) and the right frame is plain empty. Any suggestions?

I tried substituting different SUNWzfsgr and SUNWzfsgu packages from older Solarises (x86/sparc, snv_77/84/89, sol10u3/u4), and directly substituting the zfs.jar file, but these actions resulted in either the same error or a crash-and-restart of the SMC webserver. I didn't yet try installing older SUNWmco* packages (a 10u4 system with SMC 3.0.2 works ok); I'm not sure it's a good idea ;) The system has JDK 1.6.0_06 by default - maybe that's the culprit? I tried setting it to JDK 1.5.0_15 and the zfs web-module refused to start and register itself...

===
Application Error
com.iplanet.jato.NavigationException: Exception encountered during forward
Root cause = [java.lang.IllegalArgumentException: No enum const class com.sun.zfs.common.model.AclInheritProperty$AclInherit.restricted]
Notes for application developers:
* To prevent users from seeing this error message, override the onUncaughtException() method in the module servlet and take action specific to the application
* To see a stack trace from this error, see the source for this page
Generated Thu May 29 17:39:50 MSD 2008
===

In fact, the traces in the logs are quite long (several screenfuls) and nearly the same; this one starts as:

===
com.iplanet.jato.NavigationException: Exception encountered during forward
Root cause = [java.lang.IllegalArgumentException: No enum const class com.sun.zfs.common.model.AclInheritProperty$AclInherit.restricted]
    at com.iplanet.jato.view.ViewBeanBase.forward(ViewBeanBase.java:380)
    at com.iplanet.jato.view.ViewBeanBase.forwardTo(ViewBeanBase.java:261)
    at com.iplanet.jato.ApplicationServletBase.dispatchRequest(ApplicationServletBase.java:981)
    at com.iplanet.jato.ApplicationServletBase.processRequest(ApplicationServletBase.java:615)
    at com.iplanet.jato.ApplicationServletBase.doGet(ApplicationServletBase.java:459)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
...
===

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
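Tangentially, for the localhost-only listening: builds in this range bind the console to 127.0.0.1 unless told otherwise. A sketch of re-enabling remote access, assuming the SMF webconsole service and its tcp_listen option behave as on Solaris 10/Nevada:

# svccfg -s svc:/system/webconsole setprop options/tcp_listen=true
# svcadm refresh svc:/system/webconsole
# /usr/sbin/smcwebserver restart

This does not address the ZFS-module stacktrace itself; from the enum name in the root cause, that looks like the console's zfs.jar not recognizing the newer aclinherit "restricted" property value.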
[zfs-discuss] Liveupgrade snv_77 with a ZFS root to snv_89
We have a test machine installed with a ZFS root (snv_77/x86 and rootpool/rootfs with grub support). Recently we tried to update it to snv_89, which (in the Flag Days list) claimed more support for ZFS boot roots, but the installer disk didn't find any previously installed operating system to upgrade.

Then we tried to install the SUNWlu* packages from the snv_89 disk onto the snv_77 system. That worked in terms of package updates, but lucreate fails:

# lucreate -n snv_89
ERROR: The system must be rebooted after applying required patches. Please reboot and try again.

Apparently we rebooted a lot and it did not help... How can we upgrade the system? In particular, how does LU do it? :)

Now working on an idea to update all existing packages in the cloned root, using pkgrm/pkgadd -R. Updating only some packages (kernel, zfs, libs) didn't help much. A backup plan is to move the ZFS root back to UFS, update, and move it back. That would probably work, but it's not an elegant job ;)

Suggestions welcome - maybe we'll try some of them and report ;)

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
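A minimal sketch of the pkgrm/pkgadd -R idea, assuming the clone is mounted at /a and the snv_89 media is mounted under /cdrom (the product directory and package name are illustrative only):

# pkgrm -R /a SUNWzfsu
# pkgadd -R /a -d /cdrom/Solaris_11/Product SUNWzfsu

-R points the package-database operations at the alternate root instead of /, which is essentially what Live Upgrade automates across the whole package set.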
Re: [zfs-discuss] Liveupgrade snv_77 with a ZFS root to snv_89
You mean this: https://www.opensolaris.org/jive/thread.jspa?threadID=46626&tstart=120

Elegant script, I like it, thanks :) Trying now... Some patching follows:

-for fs in `zfs list -H | grep ^$ROOTPOOL/$ROOTFS | awk '{ print $1 };'`
+for fs in `zfs list -H | grep ^$ROOTPOOL/$ROOTFS | grep -w $ROOTFS | grep -v '@' | awk '{ print $1 };'`

In essence, skip snapshots (@) and non-rootpool/rootfs/subfs paths. On my system I happen to have both problems (a clone rootpool/rootfs_snv77 and some snapshots of both the clone and rootfs).

Alas, so far the upgrade didn't get going (ttinstall doesn't see the old system, neither the ZFS root nor the older UFS SVM-mirror root), although rootpool/rootfs got mounted to /a. I'm now reboot-and-retrying - perhaps early tests and script rewrites/reruns messed something up.

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
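For illustration, a hypothetical zfs list -H listing and what the patched pipeline does with each entry (the names mirror the ones mentioned above):

rootpool/rootfs               kept
rootpool/rootfs/var           kept (a subfs; grep -w still matches the word rootfs across the /)
rootpool/rootfs@pre-upgrade   dropped by grep -v '@'
rootpool/rootfs_snv77         dropped by grep -w (the underscore makes rootfs_snv77 one word)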
Re: [zfs-discuss] Liveupgrade snv_77 with a ZFS root to snv_89
Alas, didn't work so far. Can the problem be that the zfs-root disk is not the first on the controller (the system boots from the grub on the older ufs-root slice), and/or that the zfs root is mirrored? And that I have snapshots and a data pool too?

These are the boot disks (SVM mirror with ufs and grub):

[EMAIL PROTECTED] /]# metastat -c
d1       m  4.0GB d12 d10
    d12  s  4.0GB c3t2d0s0
    d10  s  4.0GB c3t0d0s0

This is the actual system:

[EMAIL PROTECTED] /]# zpool status
  pool: pool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        pool          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c3t0d0s3  ONLINE       0     0     0
            c3t1d0s3  ONLINE       0     0     0
            c3t2d0s3  ONLINE       0     0     0
            c3t3d0s3  ONLINE       0     0     0

errors: No known data errors

  pool: rootpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rootpool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0
            c3t3d0s0  ONLINE       0     0     0

errors: No known data errors

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss