Re: [zfs-discuss] Petabyte pool?
> hakan...@ohsu.edu said:
>> I get a little nervous at the thought of hooking all that up to a single
>> server, and am a little vague on how much RAM would be advisable, other
>> than "as much as will fit" (:-). Then again, I've been waiting for
>> something like pNFS/NFSv4.1 to be usable for gluing together multiple NFS
>> servers into a single global namespace, without any sign of that happening
>> anytime soon.

> richard.ell...@gmail.com said:
> NFS v4 or DFS (or even clever sysadmin + automount) offers single namespace
> without needing the complexity of NFSv4.1, lustre, glusterfs, etc.

Been using NFSv4 since it showed up in Solaris-10 FCS, and it is true that
I've been clever enough (without automount -- I like my computers to be as
deterministic as possible, thank you very much :-) for our NFS clients to
see a single directory-tree namespace which abstracts away the actual
server/location of a particular piece of data.

However, we find it starts getting hard to manage when a single project
(think "directory node") needs more space than their current NFS server will
hold. Or perhaps what you're getting at above is even more clever than I
have been to date, and is eluding me at the moment. I did see someone
mention "NFSv4 referrals" recently, maybe that would help.

Plus, believe it or not, some of our customers still insist on having the
server name in their path hierarchy for some reason, like /home/mynfs1/,
/home/mynfs2/, and so on. Perhaps I've just not been persuasive enough
yet (:-).

richard.ell...@gmail.com said:
> Don't forget about backups :-)

I was hoping I could get by with telling them to buy two of everything.

Thanks and regards,

Marion
Re: [zfs-discuss] Petabyte pool?
>>> Ray said:
>>> Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
>>> to a couple of LSI SAS switches.
>>
>> Marion said:
>> How many HBA's in the R720?
>
> Ray said:
> We have qty 2 LSI SAS 9201-16e HBA's (Dell resold[1]).

Sounds similar in approach to the Aberdeen product another sender referred
to, with SAS switch layout:
  http://www.aberdeeninc.com/images/1-up-petarack2.jpg

One concern I had is that I compared our SuperMicro JBOD with 40x 4TB drives
in it, connected via a dual-port LSI SAS 9200-8e HBA, to the same pool
layout on a 40-slot server with 40x SATA drives in it. But the server uses
no SAS expanders, instead using SAS-to-SATA octopus cables to connect the
drives directly to three internal SAS HBA's (2x 9201-16i's, 1x 9211-8i).

What I found was that the internal pool was significantly faster for both
sequential and random I/O than the pool on the external JBOD. My conclusion
was that I would not want to exceed ~48 drives on a single 8-port SAS HBA.
So I thought that running the I/O of all your hundreds of drives through
only two HBA's would be a bottleneck.

LSI's specs say 4800 MBytes/sec for an 8-port SAS HBA, but 4000 MBytes/sec
for that card in an x8 PCIe-2.0 slot. Sure, the newer 9207-8e is rated at
8000 MBytes/sec in an x8 PCIe-3.0 slot, but it still has only the same 8 SAS
ports going at 4800 MBytes/sec.

Yes, I know the disks probably can't go that fast. But in my tests above,
the internal 40-disk pool measures 2000 MBytes/sec sequential reads and
writes, while the external 40-disk JBOD measures at 1500 to 1700 MBytes/sec.
Not a lot slower, but significantly slower, so I do think the number of
HBA's makes a difference.

At the moment, I'm leaning toward piling six, eight, or ten HBA's into a
server, preferably one with dual IOH's (thus two PCIe busses), and
connecting dual-path JBOD's in that manner. I hadn't looked into SAS
switches much, but they do look more reliable than daisy-chaining a bunch of
JBOD's together. I just haven't seen how to get more bandwidth through them
to a single host.

Regards,

Marion
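P.S. In case anyone wants to check my arithmetic, the figures above come
from rough back-of-the-envelope reckoning like this (rounded, ignoring
protocol overhead, so treat it as a sketch rather than gospel):

  8 SAS-2 ports x 600 MBytes/sec/port (6Gbit/s after 8b/10b)  ~ 4800 MBytes/sec
  PCIe 2.0 x8:  8 lanes x ~500 MBytes/sec/lane                ~ 4000 MBytes/sec
  PCIe 3.0 x8:  8 lanes x ~985 MBytes/sec/lane                ~ 8000 MBytes/sec

So on a 9200-8e the PCIe-2.0 slot is the limit, and on a 9207-8e the eight
SAS ports become the limit again.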
Re: [zfs-discuss] Petabyte pool?
rvandol...@esri.com said:
> We've come close:
>
> admin@mes-str-imgnx-p1:~$ zpool list
> NAME       SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
> datapool   978T   298T   680T  30%  1.00x  ONLINE  -
> syspool    278G   104G   174G  37%  1.00x  ONLINE  -
>
> Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
> to a couple of LSI SAS switches.

Thanks Ray,

We've been looking at those too (we've had good luck with our MD1200's).
How many HBA's in the R720?

Thanks and regards,

Marion
[zfs-discuss] Petabyte pool?
Greetings,

Has anyone out there built a 1-petabyte pool? I've been asked to look into
this, and was told "low performance" is fine, workload is likely to be
write-once, read-occasionally, archive storage of gene sequencing data.
Probably a single 10Gbit NIC for connectivity is sufficient.

We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".

I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other than
"as much as will fit" (:-). Then again, I've been waiting for something like
pNFS/NFSv4.1 to be usable for gluing together multiple NFS servers into a
single global namespace, without any sign of that happening anytime soon.

So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?

Thanks and regards,

Marion
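P.S. Here's the back-of-the-envelope I was working from, in case anyone
wants to poke holes in it. The vdev layout is just an assumption for the
sake of the arithmetic (4x 11-disk raidz3 plus a spare per 45-slot chassis);
other layouts will move the numbers around a bit:

  45 slots x 4TB                   = 180 TB raw per chassis
  4 vdevs x (11 - 3 parity) x 4TB  = 128 TB usable per chassis (before ZFS overhead)
  1000 TB / 128 TB                 ~  8 chassis for a "marketing" petabyte
  1024 TiB ~ 1126 TB; 1126 / 128   ~  9-10 chassis for a power-of-two petabyte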
Re: [zfs-discuss] mpt_sas multipath problem?
j...@opensolaris.org said:
> Output from 'prtconf -v' would help, as would a cogent description of what
> you are looking at to determine that MPxIO isn't working.

Sorry James, I must've made a cut-and-paste-o and left out my description of
the symptom. That being, 40 new drives show up as 80 new disk devices at
the OS level (in "format", in "cfgadm -alv", in "ls /dev/dsk" and in
"prtconf -Dv" listings).

Adding the drives' string to a white-list in scsi_vhci.conf got us going,
thanks to Richard's reminder. I do have before and after prtconf listings,
if anyone is interested.

Regards,

Marion
Re: [zfs-discuss] mpt_sas multipath problem?
richard.ell...@gmail.com said:
> Sometimes the mpxio detection doesn't work properly. You can try to
> whitelist them, https://www.illumos.org/issues/644

And I said:
> Thanks Richard, I was hoping I hadn't just made up my vague memory of such
> functionality. We'll give it a try.

That did the trick. I added these lines to /kernel/drv/scsi_vhci.conf, at
the end of the file:

  scsi-vhci-failover-override =
      "WD WD4001FYYG-01SL3", "f_sym";   # WD RE 4TB SAS HDD

A reboot was involved, as I wasn't able to coax the system into re-reading
the scsi_vhci.conf file using "update_drv scsi_vhci", nor by unplugging and
replugging the JBOD's SAS cables, "cfgadm -c unconfigure c49", etc.

I'm off to exercise it with filebench tomorrow.

Thanks and regards,

Marion
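P.S. For anyone else trying this, a quick way to confirm mpxio has actually
claimed the drives after the reboot is something like the following (the
long LU device name is just a made-up example; substitute one of yours):

  # mpathadm list lu        # each drive should appear once, with a path count of 2
  # mpathadm show lu /dev/rdsk/c0t5000CCA01AB12345d0s2

If the drives still show up twice in "format" and only once-per-path in
mpathadm, the override string didn't match and needs adjusting.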
Re: [zfs-discuss] mpt_sas multipath problem?
> On Jan 7, 2013, at 1:20 PM, Marion Hakanson wrote:
> Greetings,
> We're trying out a new JBOD here. Multipath (mpxio) is not working, and we
> could use some feedback and/or troubleshooting advice.
> . . .

richard.ell...@gmail.com said:
> Sometimes the mpxio detection doesn't work properly. You can try to
> whitelist them, https://www.illumos.org/issues/644

Thanks Richard, I was hoping I hadn't just made up my vague memory of such
functionality. We'll give it a try.

Regards,

Marion
[zfs-discuss] mpt_sas multipath problem?
Greetings,

We're trying out a new JBOD here. Multipath (mpxio) is not working, and we
could use some feedback and/or troubleshooting advice.

The OS is oi151a7, running on an existing server with a 54TB pool of
internal drives. I believe the server hardware is not relevant to the JBOD
issue, although the internal drives do appear to the OS with multipath
device names (despite the fact that these internal drives are cabled up in a
single-path configuration). If anything, this does confirm that multipath
is enabled in mpt_sas.conf via the mpxio-disable="no" directive (internal
HBA's are LSI SAS, 2x 9201-16i and 1x 9211-8i).

The JBOD is a SuperMicro 847E26-RJBOD1, with the front backplane
daisy-chained to the rear backplane (both expanders). Each of the two
expander chains is connected to one port of an LSI SAS 9200-8e HBA. So far,
all this hardware has appeared as working for others and well-supported, and
this 9200-8e is running the -IT firmware, version 15.0.0.0. The drives are
40x of the WD4001FYYG SAS 4TB variety, firmware VR02.

The spot-checks I've done so far seem to show that both device instances of
a drive show up in "prtconf -Dv" with identical serial numbers and identical
"devid" and "guid" values, so I'm not sure what might be missing to allow
mpxio to recognize them as the same device.

Has anyone out there got this type of hardware working? In a multipath
configuration? Suggestions on mdb or dtrace code I can use to debug? Are
there "secrets" to the internal daisy-chain cabling that our vendor is not
aware of?

Thanks and regards,

Marion
Re: [zfs-discuss] cannot replace X with Y: devices have different sector alignment
tron...@gmail.com said:
> That said, I've already migrated far too many times already. I really,
> really don't want to migrate the pool again, if it can be avoided. I've
> already migrated from raidz1 to raidz2 and then from raidz2 to mirror
> vdevs. Then, even though I already had a mix of 512b and 4k discs in the
> pool, when I bought new 3TB discs, I couldn't add them to the pool, and I
> had to set up a new pool with ashift=12. In retrospect, I should have
> built the new pool without the 2TB drives, and had I known what I do now,
> I would definately have done that.

Are you sure you can't find 3TB/4TB drives with 512b sectors? If you can
believe the "User Sectors Per Drive" specifications, these WD disks do:
  WD4000FYYZ, WD3000FYYZ

Those are the SATA part-numbers; there are SAS equivalents. I also found
the Hitachi UltraStar 7K3000 and 7K4000 drives claim to support 512-byte
sector sizes.

Sure, they're expensive, but what enterprise-grade drives aren't? And, they
might solve your problem.

Regards,

Marion
Re: [zfs-discuss] Seagate Constellation vs. Hitachi Ultrastar
richard.ell...@richardelling.com said:
> We are starting to see a number of SAS HDDs that prefer logical-block to
> round-robin. I see this with late model Seagate and Toshiba HDDs.
>
> There is another, similar issue with recognition of multipathing by the
> scsi_vhci driver. Both of these are being tracked as
> https://www.illumos.org/issues/644 and there is an alternate scsi_vhci.conf
> file posted in that bugid.

Interesting, I just last week had a Toshiba come from Dell as a replacement
for a Seagate 2TB SAS drive; on Solaris-10, the Toshiba insisted on showing
up as 2 drives, so mpxio was not recognizing it. Fortunately I was able to
swap the drive for a Seagate, but I'll stash away a copy of the
scsi_vhci.conf entry for the future.

> We're considering making logical-block the default (as in above bugid) and
> we have not discovered a reason to keep round-robin. If you know of any
> reason why round-robin is useful, please add to the bugid.

Should be fine. When I first ran into this a couple years ago, I did a lot
of tests and found logical-block to be slower than "none" (with those
Seagate 2TB SAS drives in Dell MD1200's), but not a whole lot slower. I
vaguely recall that round-robin was better for highly random, small I/O
(IOPS-intensive) workloads.

I got the best results by manually load-balancing half the drives to one
path and half the drives to the other path. But I decided it was not worth
the effort. Maybe if there was a way to automatically do that (with a
relatively static result)...

Of course, this was all tested on Solaris-10, so your mileage may vary.

Regards,

Marion
Re: [zfs-discuss] Seagate Constellation vs. Hitachi Ultrastar
a...@blackandcode.com said:
> I'm spec'ing out a Thumper-esque solution and having trouble finding my
> favorite Hitachi Ultrastar 2TB drives at a reasonable post-flood price.
> The Seagate Constellations seem pretty reasonable given the market
> circumstances but I don't have any experience with them. Anybody using
> these in their ZFS systems and have you had good luck?

We have a lot of 2TB and 3TB Seagates here, they work fine. Most of ours
are the Nearline-SAS variety, in Dell MD1200 enclosures, used on Windows &
Linux behind PERC H800 RAID cards, and on Solaris-10 and OpenIndiana behind
LSI SAS HBA's. We do have one new server with a pile of 2TB SATA Seagate's
as well, so far working fine.

The only caveat I've found is that the Nearline SAS Seagates go really slow
with the Solaris default multipath load-balancing setting (round-robin).
Set it to "none" or some large block value and they go fast. This issue
doesn't appear when used with the PERC H800's.

Regards,

Marion
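P.S. For the archives, the load-balancing knob I'm talking about lives in
/kernel/drv/scsi_vhci.conf. A minimal sketch of the global change we make
(per-device-type overrides are also possible via the stanzas described in
that file's comments, and a reboot is the sure way to make it take effect):

  # /kernel/drv/scsi_vhci.conf
  # default is load-balance="round-robin"; we use:
  load-balance="none";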
Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
p...@kraus-haus.org said:
> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
> Achieving 500MB/sec. with 8KB files and lots of random accesses is really
> hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of
> 100MB+ files is much easier.
> . . .
> For ZFS, performance is proportional to the number of vdevs NOT the
> number of drives or the number of drives per vdev. See
> https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
> for some testing I did a while back. I did not test sequential read as
> that is not part of our workload.
> . . .
> I understand why the read performance scales with the number of vdevs,
> but I have never really understood _why_ it does not also scale with the
> number of drives in each vdev. When I did my testing with 40 dribves, I
> expected similar READ performance regardless of the layout, but that was
> NOT the case.

In your first paragraph you make the important point that "performance" is
too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above,
you go back to using "performance" in its ambiguous form. I assume that by
"performance" you are mostly focussing on random-read performance.

My experience is that sequential read performance _does_ scale with the
number of drives in each vdev. Both sequential and random write performance
also scales in this manner (note that ZFS tends to save up small, random
writes and flush them out in a sequential batch).

Small, random read performance does not scale with the number of drives in
each raidz[123] vdev because of the dynamic striping. In order to read a
single logical block, ZFS has to read all the segments of that logical
block, which have been spread out across multiple drives, in order to
validate the checksum before returning that logical block to the
application. This is why a single vdev's random-read performance is
equivalent to the random-read performance of a single drive.

p...@kraus-haus.org said:
> The recommendation is to not go over 8 or so drives per vdev, but that is
> a performance issue NOT a reliability one. I have also not been able to
> duplicate others observations that 2^N drives per vdev is a magic number
> (4, 8, 16, etc). As you can see from the above, even a 40 drive vdev works
> and is reliable, just (relatively) slow :-)

Again, the "performance issue" you describe above is for the random-read
case, not sequential. If you rarely experience small-random-read workloads,
then raidz* will perform just fine. We often see 2000 MBytes/sec sequential
read (and write) performance on a raidz3 pool consisting of 3, 12-disk
vdev's (using 2TB drives).

However, when a disk fails and must be resilvered, that's when you will run
into the slow performance of the small, random read workload. This is why I
use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives, especially
of the 1TB+ size. That way if it takes 200 hours to resilver, you've still
got a lot of redundancy in place.

Regards,

Marion
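P.S. A rough rule-of-thumb illustration of the random-read point, assuming
~150 random IOPS per 7200RPM drive (the numbers are made up for
illustration, not measurements):

  1 vdev of 40-disk raidz3:    ~1 x 150  =  ~150 random-read IOPS
  5 vdevs of 8-disk raidz2:    ~5 x 150  =  ~750 random-read IOPS
  20 vdevs of 2-way mirrors:  ~20 x 150  = ~3000 random-read IOPS (or better,
                               since reads can be served from either side)

Sequential throughput, on the other hand, scales more with the total number
of data drives than with the number of vdevs.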
Re: [zfs-discuss] permissions
capcas...@gmail.com said:
> I have a file that I can't delete, change permissions or owner. ls -v does
> not show any acl's on the file not even those for normal unix rw etc.
> permissions from ls -l show -rwx-- chmod gived an error of not owner for
> the owner !! and for root just says can't change or not owner depending on
> the mode i am trying to set.

In addition to Richard's suggestion, a couple things come to mind:

(1) Ability to delete a file (or rename it) depends on the permissions of
    the directory which contains it, not on the file itself.

(2) If you're doing the delete/chown on an NFS client, ownerships could be
    different than expected if UID mapping is broken (NFSv4 with a
    mismatched domain), or if remote root is being mapped to "nobody" on
    the NFS server. Similar issues could happen for a CIFS client.

Regards,

Marion
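P.S. Regarding item (2) above, a quick way to compare the NFSv4 ID-mapping
domain on client and server is something like this (from memory, so check
the man pages; on older Solaris 10 releases the setting lives in
/etc/default/nfs as NFSMAPID_DOMAIN instead):

  # sharectl get -p nfsmapid_domain nfs    # run on both client and server; they should match

The classic symptom of a mismatch is files showing up on the client as owned
by "nobody" even though the server shows sane owners.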
Re: [zfs-discuss] L2ARC, block based or file based?
mattba...@gmail.com said:
> We're looking at buying some additional SSD's for L2ARC (as well as
> additional RAM to support the increased L2ARC size) and I'm wondering if we
> NEED to plan for them to be large enough to hold the entire file or if ZFS
> can cache the most heavily used parts of a single file.
>
> After watching arcstat (Mike Harsch's updated version) and arc_summary, I'm
> still not sure what to make of it. It's rare that the l2arc (14Gb) hits
> double digits in %hit whereas the ARC (3Gb) is frequently >80% hit.

I'm not sure of the answer to your initial question (file-based vs
block-based), but I may have an explanation for the stats you're seeing.
We have a system here with 96GB of RAM and also the Sun F20 flash
accelerator card (96GB), most of which is used for L2ARC.

Note that data is not written into the L2ARC until it is evicted from the
ARC (e.g. when something newer or more frequently used needs ARC space).
So, my interpretation of the high hit rates on the in-RAM ARC, and low hit
rates on the L2ARC, is that the working set of data fits mostly in RAM, and
the system seldom needs to go to the L2ARC for more.

Regards,

Marion
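P.S. If you want to check that interpretation on your own box, the raw
kstats underneath those tools are easy to eyeball directly (a quick sketch;
the statistic names are as I remember them from the arcstats kstat):

  # kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max
  # kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
  # kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses zfs:0:arcstats:l2_size

If "misses" is tiny compared to "hits", there just isn't much traffic left
over for the L2ARC to serve, no matter how big it is.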
Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
lmulc...@marinsoftware.com said:
> . . .
> The MySQL server is:
> Dell R710 / 80G Memory with two daisy chained MD1220 disk arrays - 22 Disks
> each - 600GB 10k RPM SAS Drives Storage Controller: LSI, Inc. 1068E (JBOD)
>
> I have also seen similar symptoms on systems with MD1000 disk arrays
> containing 2TB 7200RPM SATA drives.
>
> The only thing of note that seems to show up in the /var/adm/messages file
> on this MySQL server is:
>
> Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING:
>   /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
> Oct 31 18:24:51 mslvstdp02r mpt request inquiry page 0x89 for SATA
>   target:58 failed! Oc
> . . .

Have you got the latest firmware on your LSI 1068E HBA's? These have been
known to have lockups/timeouts when used with SAS expanders (disk
enclosures) with incompatible firmware revisions, and/or with older mpt
drivers.

The MD1220 is a 6Gbit/sec device. You may be better off with a matching
HBA -- Dell has certainly told us the MD1200-series is not intended for use
with the 3Gbit/sec HBA's. We're doing fine with the LSI SAS 9200-8e, for
example, when connecting to Dell MD1200's with the 2TB "nearline SAS" disk
drives.

Last, are you sure it's memory-related? You might keep an eye on
"arcstat.pl" output and see what the ARC sizes look like just prior to
lockup. Also, maybe you can look up instructions on how to force a crash
dump when the system hangs -- one of the experts around here could tell a
lot from a crash dump file.

Regards,

Marion
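P.S. On the "force a crash dump" suggestion, the pieces I'd look at are
roughly these, but treat it as a from-memory sketch and double-check the
tunables against the docs for your Solaris release before touching a
production box:

  # dumpadm                                 # confirm dump device is big enough and savecore is on
  # echo "set snooping=1" >> /etc/system    # deadman timer: panic (and dump) instead of hanging
  # reboot                                  # the tunable takes effect at boot

With something like that in place, a hard hang has a chance of turning into
a panic plus a crash dump that one of the mdb folks can dig through.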
Re: [zfs-discuss] Cannot format 2.5TB ext disk (EFI)
kitty@oracle.com said:
> It wouldn't let me
> # zpool create test_pool c5t0d0p0
> cannot create 'test_pool': invalid argument for this pool operation

Try without the "p0", i.e. just:

  # zpool create test_pool c5t0d0

Regards,

Marion
Re: [zfs-discuss] X4540 no next-gen product?
jp...@cam.ac.uk said:
> I can't speak for this particular situation or solution, but I think in
> principle you are wrong. Networks are fast. Hard drives are slow. Put a
> 10G connection between your storage and your front ends and you'll have the
> bandwidth[1]. Actually if you really were hitting 1000x8Mbits I'd put 2,
> but that is just a question of scale. In a different situation I have
> boxes which peak at around 7 Gb/s down a 10G link (in reality I don't need
> that much because it is all about the IOPS for me). That is with just
> twelve 15k disks. Your situation appears to be pretty ideal for storage
> hardware, so perfectly achievable from an appliance.

Depending on usage, I disagree with your bandwidth and latency figures
above. An X4540, or an X4170 with J4000 JBOD's, has more bandwidth to its
disks than 10Gbit ethernet. You would need three 10GbE interfaces between
your CPU and the storage appliance to equal the bandwidth of a single 8-port
3Gb/s SAS HBA (five of them for 6Gb/s SAS).

It's also the case that the Unified Storage platform doesn't have enough
bandwidth to drive more than four 10GbE ports at their full speed:
  http://dtrace.org/blogs/brendan/2009/09/22/7410-hardware-update-and-analyzing-the-hypertransport/

We have a customer (internal to the university here) that does high
throughput gene sequencing. They like a server which can hold the large
amounts of data, do a first pass analysis on it, and then serve it up over
the network to a compute cluster for further computation. Oracle has
nothing in their product line (anymore) to meet that need.

They ended up ordering an 8U chassis w/40x 2TB drives in it, and are willing
to pay the $2k/yr retail ransom to Oracle to run Solaris (ZFS) on it, at
least for the first year. Maybe OpenIndiana next year, we'll see.

Bye Oracle...

Regards,

Marion
Re: [zfs-discuss] Sun T3-2 and ZFS on JBODS
sigbj...@nixtra.com said:
> I will do some testing on the loadbalance on/off. We have nearline SAS
> disks, which does have dual path from the disk, however it's still just
> 7200rpm drives.
>
> Are you using SATA, SAS or SAS-nearline in your array? Do you have
> multiple SAS connections to your arrays, or do you use a single connection
> per array only?

We have four Dell MD1200's connected to three Solaris-10 systems. Three of
the MD1200's have nearline-SAS 2TB 7200RPM drives, and one has SAS 300GB
15000RPM drives. All the MD1200's are connected with dual SAS modules to a
dual-port HBA on their respective servers (one setup is with two MD1200's
daisy-chained, but again using dual SAS modules & cables).

Both types of drives suffer super-slow writes (but reasonable reads) when
loadbalance=roundrobin is in effect. E.g. 280 MB/sec sequential reads, and
28 MB/sec sequential writes, for the 15kRPM SAS drives I tested last week.
We don't see this extreme slowness on our dual-path Sun J4000 JBOD's, but
those all have SATA drives (with the dual-port interposers inside the drive
sleds).

Regards,

Marion
Re: [zfs-discuss] Sun T3-2 and ZFS on JBODS
sigbj...@nixtra.com said:
> I've played around with turning on and off mpxio on the mpt_sas driver,
> disabling increased the performance from 30MB / sec, but it's still far
> from the original performance. I've attached some dumps of zpool iostat
> before and after reinstallation.

I find "zpool iostat" is less useful in telling what the drives are doing
than "iostat -xn 1". In particular, the latter will give you an idea of how
many operations are queued per drive, and how long it's taking the drives to
handle those operations, etc.

On our Solaris-10 systems (U8 and U9), if mpxio is enabled, you really want
to set loadbalance=none. The default (round-robin) makes some of our JBOD's
(Dell MD1200) go really slow for writes. I see you have tried with mpxio
disabled, so your issue may be different.

You don't say what you're doing to generate your test workload, but there
are some workloads which will speed up a lot if the ZIL is disabled. Maybe
that or some other /etc/system tweaks were in place on the original system.
Also use "format -e" and its "write_cache" commands to see if the drives'
write caches are enabled or not.

Regards,

Marion
Re: [zfs-discuss] SIL3114 and sparc solaris 10
nat...@tuneunix.com said:
> I can confirm that on *at least* 4 different cards - from different board
> OEMs - I have seen single bit ZFS checksum errors that went away
> immediately after removing the 3114 based card.
>
> I stepped up to the 3124 (pci-x up to 133mhz) and 3132 (pci-e) and have
> never looked back.
>
> I now throw any 3114 card I find into the bin at the first available
> opportunity as they are a pile of doom waiting to insert an exploding
> garden gnome into the unsuspecting chest cavity of your data.

Maybe I've just been lucky. I have a 3114 card configured with two ports
internal and two external (E-SATA). There is a ZFS pool configured as a
mirror of a 1TB drive on the E-SATA port in an external dock, and a 1TB
drive on a motherboard SATA port. It's been running like this for a couple
of years, with weekly scrubs, and has so far had no errors. The system is a
32-bit x86 running Solaris-10U6.

My 3114 card came with RAID firmware, and I re-flashed it to non-RAID, as
others have mentioned.

Regards,

Marion
Re: [zfs-discuss] Drive i/o anomaly
matt.connolly...@gmail.com said:
> After putting the drive online (and letting the resilver complete) I took
> the slow drive (c8t1d0 western digital green) offline and the system ran
> very nicely.
>
> It is a 4k sector drive, but I thought zfs recognised those drives and
> didn't need any special configuration...?

That's a nice confirmation of the cost of not doing anything special (:-).

I hear the problem may be due to 4k drives which report themselves as 512b
drives, for boot/BIOS compatibility reasons. I've also seen various ways to
force 4k alignment, and check what the "ashift" value is in your pool's
drives, etc. Google "solaris zfs 4k sector align" will lead the way.

Regards,

Marion
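P.S. The quick check for what your pool actually ended up with (ashift=9
means 512-byte alignment, ashift=12 means 4k) is something like:

  # zdb -C yourpool | grep ashift

substituting your pool's name, of course.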
Re: [zfs-discuss] Drive i/o anomaly
matt.connolly...@gmail.com said:
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     1.2   36.0  153.6   4608.0  1.2  0.3   31.9    9.3  16  18 c12d0
>     0.0  113.4    0.0   7446.7  0.8  0.1    7.0    0.5  15   5 c8t0d0
>     0.2  106.4    4.1   7427.8  4.0  0.1   37.8    1.4  93  14 c8t1d0
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.4   73.2   25.7   9243.0  2.3  0.7   31.6    9.8  34  37 c12d0
>     0.0  226.6    0.0  24860.5  1.6  0.2    7.0    0.9  25  19 c8t0d0
>     0.2  127.6    3.4  12377.6  3.8  0.3   29.7    2.2  91  27 c8t1d0
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.0   44.2    0.0   5657.6  1.4  0.4   31.7    9.0  19  20 c12d0
>     0.2   76.0    4.8   9420.8  1.1  0.1   14.2    1.7  12  13 c8t0d0
>     0.0   16.6    0.0   2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.0    0.2    0.0     25.6  0.0  0.0    0.3    2.3   0   0 c12d0
>     0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
>     0.0   11.0    0.0   1365.6  9.0  1.0  818.1   90.9 100 100 c8t1d0
> . . .

matt.connolly...@gmail.com said:
> I expect that the c8t0d0 WD Green is the lemon here and for some reason is
> getting stuck in periods where it can write no faster than about 2MB/s.
> Does this sound right?

No, it's the opposite. The drive sitting at 100%-busy, c8t1d0, while the
other drive is idle, is the sick one. It's slower than the other, has 9.0
operations waiting (queued) to finish. The other one is idle because it has
already finished the write activity and is waiting for the slow one in the
mirror to catch up.

If you run "iostat -xn" without the interval argument, i.e. so it prints out
only one set of stats, you'll see the average performance of the drives
since last reboot. If the "asvc_t" figure is significantly larger for one
drive than the other, that's a way to identify the one which has been slower
over the long term.

> Secondly, what I wonder is why it is that the whole file system seems to
> hang up at this time. Surely if the other drive is doing nothing, a web
> page can be served by reading from the available drive (c8t1d0) while the
> slow drive (c8t0d0) is stuck writing slow.

The available drive is c8t0d0 in this case. However, if ZFS is in the
middle of a txg (ZFS transaction) commit, it cannot safely do much with the
pool until that commit finishes. You can see that ZFS only lets 10
operations accumulate per drive (used to be 35), i.e. 9.0 in the "wait"
column, and 1.0 in the "actv" column, so it's kinda stuck until the drive
gets its work done.

Maybe the drive is failing, or maybe it's one of those with large sectors
that are not properly aligned with the on-disk partitions.

Regards,

Marion
Re: [zfs-discuss] reliable, enterprise worthy JBODs?
tmcmah...@yahoo.com said:
> Interesting. Did you switch to the load-balance option?

Yes, I ended up with "load-balance=none". Here's a thread about it in the
storage-discuss mailing list:
  http://opensolaris.org/jive/thread.jspa?threadID=130975&tstart=90

Regards,

Marion
Re: [zfs-discuss] reliable, enterprise worthy JBODs?
p...@bolthole.com said:
> Any other suggestions for (large-)enterprise-grade, supported JBOD hardware
> for ZFS these days? Either fibre or SAS would be okay.

As others have said, it depends on your definition of "enterprise-grade".

We're using Dell's MD1200 SAS JBOD's with Solaris-10 and ZFS. Ours have the
Seagate 2TB "Nearline SAS" drives. The 3Gbit Sun SAS HBA, using mpt driver,
works fine, although with a stream of harmless warning messages about
"unknown event 10 received". The newer 6Gbit Sun/Oracle SAS HBA, using
mpt_sas driver, works well without that issue.

The only special tuning I had to do was turn off round-robin load-balancing
in the mpxio configuration. The Seagate drives were incredibly slow when
running in round-robin mode, very speedy without.

Regards,

Marion
Re: [zfs-discuss] ZFS slows down over a couple of days
Stephan, The "vmstat" shows you are not actually short of memory; The "pi" and "po" columns are zero, so the system is not having to do any paging, and it seems unlike the system is slow directly because of RAM shortage. With the ARC, it's not unusual for vmstat to show little free memory, but the system will give up that RAM when an application asks for it. You can tell if this is happening a lot by: echo "::arc" | mdb -k | grep throttle If the value of "memory_throttle_count" is large, that will indicate that apps are often asking the kernel to give up ARC memory. Also, as you said, the "iostat" figures look idle. You can tell more using "iostat -xn 1", which will give service times & percent-busy figures for the actual devices. It could be that something about the networking involved is what is actually slow. You could find out if it's a local bottleneck by trying some simple I/O tests on the server itself, maybe: dd if=/dev/zero of=/file/in/zpool bs=1024k and watching what iostat shows, etc. Another test is to try a network-only test, maybe using "ttcp" between the server and a client. This could tell you if it's network or storage that's causing the slow-down. If you don't have "ttcp", something silly like, on a client running: dd if=/dev/zero bs=1024k | ssh -c blowfish server "dd of=/dev/null bs=1024k" You can watch network throughput on the server using: dladm show-link -s -i 1 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz recovery
z...@lordcow.org said:
> For example when I 'dd if=/dev/zero of=/dev/ad6', or physically remove the
> drive for awhile, then 'online' the disk, after it resilvers I'm typically
> left with the following after scrubbing:
>
> r...@file:~# zpool status
>   pool: pool
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h0m with 0 errors on Fri Dec 10 23:45:56 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     7
>
> errors: No known data errors
>
> http://www.sun.com/msg/ZFS-8000-9P lists my above actions as a cause for
> this state and rightfully doesn't think them serious. When I 'clear' the
> errors though and offline/fault another drive, and then reboot, the array
> faults. That tells me ad6 was never fully integrated back in. Can I tell
> the array to re-add ad6 from scratch? 'detach' and 'remove' don't work for
> raidz. Otherwise I need to use 'replace' to get out of this situation.

After you "clear" the errors, do another "scrub" before trying anything
else. Once you get a complete scrub with no new errors (and no checksum
errors), you should be confident that the damaged drive has been fully
re-integrated into the pool.

Regards,

Marion
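P.S. In other words, the sequence I'd follow (using your pool and device
names) is roughly:

  # zpool clear pool ad6      # reset the error counters
  # zpool scrub pool          # re-verify every block against its checksum
  # zpool status -v pool      # wait for the scrub to finish; counters should stay at zero

Only after a clean scrub would I trust that vdev again.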
Re: [zfs-discuss] ZFS with STK raid card w battery
replic...@gmail.com said:
> One other question, how can I ensure that the controller's cache is really
> being used? (arcconf doesn't seem to show much). Since ZFS would flush the
> data as soon as it can, I am curious to see if the caching is making a
> difference or not.

Share out a dataset on the pool over NFS to a remote client. On the client,
unpack a tar archive onto the NFS dataset, timing how long it takes. Do
this once with the cache set to "write-through" (which basically disables
the write cache), and again with it set to "write-back".

Regards,

Marion
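P.S. Something along these lines, run from the NFS client (the mount point
and archive name are just placeholders; any archive with lots of small files
will do):

  # mount -o vers=3 server:/pool/testfs /mnt/testfs
  # cd /mnt/testfs
  # time tar xf /var/tmp/some-source-tree.tar    # repeat once per cache setting

The synchronous, metadata-heavy workload of a tar unpack over NFS is exactly
the case where a battery-backed write cache shows up in the timings.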
Re: [zfs-discuss] Performance issues with iSCSI under Linux
rewar...@hotmail.com said:
> ok... we're making progress. After swapping the LSI HBA for a Dell H800
> the issue disappeared. Now, I'd rather not use those controllers because
> they don't have a JBOD mode. We have no choice but to make individual
> RAID0 volumes for each disk, which means we need to reboot the server every
> time we replace a failed drive. That's not good...

Earlier you said you had eliminated the ZIL as an issue, but one difference
between the Dell H800 and the LSI HBA is that the H800 has an NV cache (if
you have the battery backup present).

A very simple test would be, when things are running slow, to try disabling
the ZIL temporarily and see if that makes things go fast.

Regards,

Marion
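P.S. On Solaris 10 of that era, the quick-and-dirty way to flip the ZIL off
and on for a test was the zil_disable tunable. From memory (so verify
before trusting it, and never leave it off in production -- you lose
synchronous-write semantics for NFS and databases):

  # echo zil_disable/W0t1 | mdb -kw    # disable ZIL
  # (unmount and re-mount the test filesystem; the flag is checked at mount time)
  # ... run the slow workload again ...
  # echo zil_disable/W0t0 | mdb -kw    # re-enable, then remount again

If the workload suddenly gets dramatically faster, you're bound on
synchronous writes, and a fast slog (or NV write cache) is the real fix.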
[zfs-discuss] possible ZFS-related panic?
Folks,

Has anyone seen a panic traceback like the following? This is Solaris-10u7
on a Thumper, acting as an NFS server. The machine was up for nearly a
year, I added a dataset to an existing pool, set compression=on for the
first time on this system, loaded some data in there (via "rsync"), then
mounted it to the NFS client. The first data was written by the client
itself in a 10pm cron-job, and the system crashed at 10:02pm as below:

  panic[cpu2]/thread=fe8000f5cc60: page_sub: bad arg(s): pp 872b5610, *ppp 0

  fe8000f5c470 unix:mutex_exit_critical_size+20219 ()
  fe8000f5c4b0 unix:page_list_sub_pages+161 ()
  fe8000f5c510 unix:page_claim_contig_pages+190 ()
  fe8000f5c600 unix:page_geti_contig_pages+44b ()
  fe8000f5c660 unix:page_get_contig_pages+c2 ()
  fe8000f5c6f0 unix:page_get_freelist+1a4 ()
  fe8000f5c760 unix:page_create_get_something+95 ()
  fe8000f5c7f0 unix:page_create_va+2a1 ()
  fe8000f5c850 unix:segkmem_page_create+72 ()
  fe8000f5c8b0 unix:segkmem_xalloc+60 ()
  fe8000f5c8e0 unix:segkmem_alloc_vn+8a ()
  fe8000f5c8f0 unix:segkmem_alloc+10 ()
  fe8000f5c9c0 genunix:vmem_xalloc+315 ()
  fe8000f5ca20 genunix:vmem_alloc+155 ()
  fe8000f5ca90 genunix:kmem_slab_create+77 ()
  fe8000f5cac0 genunix:kmem_slab_alloc+107 ()
  fe8000f5caf0 genunix:kmem_cache_alloc+e9 ()
  fe8000f5cb00 zfs:zio_buf_alloc+1d ()
  fe8000f5cb50 zfs:zio_compress_data+ba ()
  fe8000f5cba0 zfs:zio_write_compress+78 ()
  fe8000f5cbc0 zfs:zio_execute+60 ()
  fe8000f5cc40 genunix:taskq_thread+bc ()
  fe8000f5cc50 unix:thread_start+8 ()

  syncing file systems... done
  . . .

Unencumbered by more than a gut feeling, I disabled compression on the
dataset, and we've gotten through two nightly runs of the same NFS client
job without crashing, but of course we would technically have to wait for
nearly a year before we've exactly replicated the original situation (:-).

Unfortunately the dump-slice was slightly too small, we were just short of
enough space to capture the whole 10GB crash dump. I did get savecore to
write something out, and I uploaded it to the Oracle support site, but it
gives "scat" too much indigestion to be useful to the engineer I'm working
with. They have not found any matching bugs so far, so I thought I'd ask a
slightly wider audience here.

Thanks and regards,

Marion
Re: [zfs-discuss] VM's on ZFS - 7210
markwo...@yahoo.com said:
> So the question is with a proper ZIL SSD from SUN, and a RAID10... would I
> be able to support all the VM's or would it still be pushing the limits of
> a 44 disk pool?

If it weren't a closed 7000-series appliance, I'd suggest running the
"zilstat" script. It should make it clear whether (and by how much) you
would benefit from the Logzilla addition in your current raidz
configuration. Maybe there's some equivalent in the builtin FishWorks
analytics which can give you the same information.

Marion
Re: [zfs-discuss] Performance Testing
p...@kraus-haus.org said:
> Based on these results, and our capacity needs, I am planning to go with 5
> disk raidz2 vdevs.

I did similar tests with a Thumper in 2008, with X4150/J4400 in 2009, and
more recently comparing X4170/J4400 and X4170/MD1200:
  http://acc.ohsu.edu/~hakansom/thumper_bench.html
  http://acc.ohsu.edu/~hakansom/j4400_bench.html
  http://acc.ohsu.edu/~hakansom/md1200_loadbal_bench.html

On the Thumper, we went with 7x(4D+2P) raidz2, and as a general-purpose NFS
server performance has been fantastic except (as expected without any NV
ZIL) for the very rare "lots of small synchronous I/O" workloads (like
extracting a tar archive via an NFS client). In fact, our experience with
the above has led us to go with 6x(5D+2P) on our new X4170/J4400 NFS server.
The difference between this config and 7x(4D+2P) on the same hardware is
pretty small, and both are faster than the Thumper.

> Since we have five J4400, I am considering using one disk
> in each of the five arrays per vdev, so that a complete failure of a J4400
> does not cause any loss of data. What is the general opinion of that
> approach

We did something like this on the Thumper, with one disk on each of the
internal HBA's. Since our new system has only two J4400's, we didn't try to
cover this type of failure.

> and does anyone know how to map the MPxIO device name back to a physical
> drive ?

You can use CAM to view the mapping of physical drives to device names (with
or without MPxIO enabled). That's the most human-friendly way that I've
found.

If you're using Oracle/Sun LSI HBA's (mpt), a "raidctl -l" will list out
device names like 0.0.0, 0.1.0, and so on. That middle digit does seem to
correspond with the physical slot number in the J4400's, at least initially.
Unfortunately (for this purpose), if you move drives around, the "raidctl"
names follow the drives to their new locations, as do the Solaris device
names (verified by "dd if=/dev/dsk/... of=/dev/null" and watching the
blinkenlights). Also, with multiple paths, devices will show up with two
different names in "raidctl -l", so it's a bit of a pain to make sense of it
all. So, just use CAM.

Regards,

Marion
Re: [zfs-discuss] Please trim posts
doug.lin...@merchantlink.com said:
> Apparently, before Outlook there WERE no meetings, because it's clearly
> impossible to schedule one without it.

Don't tell my boss, but I use Outlook for the scheduling, and fetchmail plus
procmail to download email out of Exchange and into my favorite email
client. Thankfully, Exchange listens to incoming SMTP when I need to send
messages.

> And please don't mail me with your favorite OSS solution. I've tried them
> all. None of them integrate with Exchange *smoothly* and *cleanly*.
> They're all workarounds and kludges that are as annoying in the end as
> Outlook.

Hmm, what I'm doing doesn't _integrate_ with Exchange; it just bypasses it
for the email portion of my needs.

Non-OSS: Mac OS X 10.6 claims to integrate with Exchange, although I have
not yet tried it myself.

Regards,

Marion
Re: [zfs-discuss] one more time: pool size changes
frank+lists/z...@linetwo.net said:
> I remember, and this was a few years back but I don't see why it would be
> any different now, we were trying to add drives 1-2 at a time to
> medium-sized arrays (don't buy the disks until we need them, to hold onto
> cash), and the Netapp performance kept going down down down. We eventually
> had to borrow an array from Netapp to copy our data onto to rebalance.
> Netapp told us explicitly, make sure to add an entire shelf at a time (and
> a new raid group, obviously, don't extend any existing group).

The advent of aggregates fixed that problem. Used to be that a raid-group
belonged to only one volume. Now multiple flex-vols (even tiny ones) share
all the spindles (and parity drives) on their aggregate, and you can
rebalance after adding drives without having to manually move/copy existing
data. Pretty slick, if you can afford the price.

Regards,

Marion
Re: [zfs-discuss] one more time: pool size changes
frank+lists/z...@linetwo.net said:
> Well in that case it's invalid to compare against Netapp since they can't
> do it either (seems to be the consensus on this list). Neither zfs nor
> Netapp (nor any product) is really designed to handle adding one drive at a
> time. Normally you have to add an entire shelf, and if you're doing that
> it's better to add a new vdev to your pool.

This is incorrect (and another poster has pointed this out). NetApp can add
a single drive (or more) to a raid-group, and has been able to do so since
before they had dual-parity, aggregates, flex-vols, and rebalancing.

BTW, the rebalance after growing an aggregate is not automatic (as of
OnTAP-7.3 anyway). You invoke a command manually on each volume that you
care about, and the rebalance runs in the background until finished.

Regards,

Marion
Re: [zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)
rvandol...@esri.com said:
> We have a Silicon Mechanics server with a SuperMicro X8DT3-F (Rev 1.02)
> (onboard LSI 1068E (firmware 1.28.02.00) and a SuperMicro SAS-846EL1 (Rev
> 1.1) backplane.
> . . .
> The system is fully patched Solaris 10 U8, and the mpt driver is
> version 1.92:

Since you're running on Solaris-10 (and its mpt driver), have you tried the
firmware that Sun recommends for their own 1068E-based HBA's? There are a
couple of versions depending on your usage, but they're all earlier revs
than the 1.28.02.00 you have:
  http://www.lsi.com/support/sun/sg_xpci8sas_e_sRoHS.html

Regards,

Marion
Re: [zfs-discuss] j4500 cache flush
bene...@yahoo.com said:
> Marion - Do you happen to know which SAS hba it applys to?

Here's the article:
  http://sunsolve.sun.com/search/document.do?assetkey=1-66-248487-1

The title is "Write-Caching on JBOD SATA Drive is Erroneously Enabled by
Default When Connected to Non-RAID SAS HBAs". By the way, you can use
"raidctl" to view/manage firmware on these.

Regards,

Marion
Re: [zfs-discuss] j4500 cache flush
erik.trim...@sun.com said:
> All J4xxx systems are really nothing more than huge SAS expanders hooked to
> a bunch of disks, so cache flush requests will either come from ZFS or any
> attached controller. Note that I /think/ most non-RAID controllers don't
> initiate their own cache flush requests.

Docs for the non-RAID HBA's sold by Sun say that with proper (recent)
firmware, at power-up the HBA will disable write caches on the disks
themselves (this refers to the LSI 1068-based HBA's, anyway). There was a
Sun Alert issued for early revisions which failed to disable disk caches,
resulting in data loss at power loss for cache-unaware software.

Solaris/OpenSolaris ZFS will then enable the write caches once it knows it
has control of whole disks, and issues flushes to the drives as appropriate.

Regards,

Marion
Re: [zfs-discuss] Who is using ZFS ACL's in production?
car...@taltos.org said:
> NetApp does _not_ expose an ACL via NFSv3, just old school POSIX
> mode/owner/group info. I don't know how NetApp deals with chmod, but I'm
> sure it's documented.

The answer is, "It depends."

If the NetApp volume is NTFS-only permissions, then chmod from the Unix/NFS
side doesn't work, and you can only manipulate permissions from Windows
clients. If it's a "mixed" security-style volume, chmod from the Unix/NFS
side will delete the NTFS ACL's, and the SMB clients will see faked-up ACL's
that match the new POSIX permissions. Whichever side made the most recent
change will be in effect.

Newer OnTAP versions have an optional setting which overrides this effect of
chmod if NFSv4 is in effect on mixed-security volumes, and instead tries to
mirror the ACL's as identically as possible to both kinds of clients. Poor
old NFSv3 and older clients still see gibberish POSIX permissions, but the
least privilege available in ACL's is enforced by the filer.

BTW, our experience has been that NFSv4 on NetApp does not work very well,
and NetApp support folks have advised us to not use it in order to avoid
crashing the filer. They of course blame the various incompatible NFSv4
client implementations out there...

Regards,

Marion
Re: [zfs-discuss] Who is using ZFS ACL's in production?
hen...@acm.org said:
> I've been surveying various forums looking for other places using ZFS ACL's
> in production to compare notes and see how if at all they've handled some
> of the issues we've found deploying them.
>
> So far, I haven't found anybody using them in any substantial way, let
> alone trying to leverage them to allow a very large user population to have
> highly flexible control over access to their data.
>
> Anyone here that has a non-negligible ACL deployment that would be
> interested in discussing it?

We've been using them here for a couple of years now. Personally, I'd say
if you set one ACL, you're already in "non-negligible" territory. It's not
easy to get them right, and usually the hardest task is in figuring out what
the users want, so we don't use them unless the users' needs cannot be met
using traditional Unix/POSIX permissions.

The only way we've been able to do this effectively is by scripting it so
it's repeatable (and documented), and using inheritance to propagate them to
any new items which are added to shared areas. The scripting also (sorta)
covers the problem that most backup and file transfer utilities are not
capable of backing up and restoring the NFSv4-style ACL's on ZFS.

So, let the discussion ensue...

Regards,

Marion
Re: [zfs-discuss] Interrupt sharing
d...@dd-b.net said:
> I know from startup log messages that I've got several interrupts being
> shared. I've been wondering how serious this is. I don't have any
> particular performance problems, but then again my cpu and motherboard are
> from 2006 and I'd like to extend their service life, so using them more
> efficiently isn't a bad idea. Plus it's all a learning experience :-).

Mine's from 2004, and I've been going through the same adjustments here.

> While I see the relevance to diagnosing performance problems, for my case,
> is there likely to be anything I can do about interrupt assignments? Or is
> this something that, if it's a problem, is an unfixable problem (short of
> changing hardware)? I think there's BIOS stuff to shuffle interrupt
> assignments some, but do changes at that level survive kernel startup, or
> get overwritten?

Experience with my motherboard is that even when you switch the BIOS
"Plug-n-Play OS" setting between "No" and "Yes", Solaris-10 doesn't seem to
change where it maps any devices. Probably a removal of the
/etc/path_to_inst file and reconfiguration reboot would be required, but
even that won't move devices required for booting. Also, the onboard
devices (like your nv_sata, ehci, etc.) are not likely to move around at
all. Only things that could be moved to different PCI/PCI-X/PCIe slots are
likely to move.

Ran across this note:
  http://blogs.sun.com/sming56/entry/interrupts_output_in_mdb

I found it pretty time-consuming just mapping the OS's device instance
numbers to the physical devices. Taking the device instance numbers from
"intrstat" or "echo '::interrupts -d' | mdb -k" and digging through the
output of "prtconf -Dv" and/or boot-up /var/adm/messages stuff was pretty
tedious.

Check out what mine looks like, in particular the case where four devices
share the same interrupt -- the two onboard SATA ports, onboard ethernet,
and one slow-mode USB port (Intel ICH5 chipset). There doesn't appear to be
a thing you can do about this sharing. The system's never seemed slow,
though I do try to avoid using that particular USB port.

  # echo '::interrupts -d' | mdb -k
  IRQ  Vector IPL Bus   Type  CPU Share APIC/INT# Driver Name(s)
  1    0x41   5   ISA   Fixed 0   1     0x0/0x1   i8042#0
  6    0x43   5   ISA   Fixed 0   1     0x0/0x6   fdc#0
  9    0x81   9   PCI   Fixed 0   1     0x0/0x9   acpi_wrapper_isr
  12   0x42   5   ISA   Fixed 0   1     0x0/0xc   i8042#0
  15   0x44   5   ISA   Fixed 0   1     0x0/0xf   ata#1
  16   0x82   9   PCI   Fixed 0   3     0x0/0x10  uhci#3, uhci#0, nvidia#0
  17   0x86   9   PCI   Fixed 0   1     0x0/0x11  audio810#0
  18   0x85   9   PCI   Fixed 0   4     0x0/0x12  pci-ide#1, e1000g#0, uhci#2, pci-ide#1
  19   0x84   9   PCI   Fixed 0   1     0x0/0x13  uhci#1
  22   0x40   5   PCI   Fixed 0   1     0x0/0x16  pci-ide#2
  23   0x83   9   PCI   Fixed 0   1     0x0/0x17  ehci#0
  160  0xa0   0         IPI   ALL 0     -         poke_cpu
  192  0xc0   13        IPI   ALL 1     -         xc_serv
  208  0xd0   14        IPI   ALL 1     -         kcpc_hw_overflow_intr
  209  0xd1   14        IPI   ALL 1     -         cbe_fire
  210  0xd3   14        IPI   ALL 1     -         cbe_fire
  240  0xe0   15        IPI   ALL 1     -         xc_serv
  241  0xe1   15        IPI   ALL 1     -         apic_error_intr
  #

Regards,

Marion
Re: [zfs-discuss] Poor ZIL SLC SSD performance
felix.buenem...@googlemail.com said:
> I think I'll try one of thise inexpensive battery-backed PCI RAM drives
> from Gigabyte and see how much IOPS they can pull.

Another poster, Tracy Bernath, got decent ZIL IOPS from an OCZ Vertex unit.
Dunno if that's sufficient for your purposes, but it looked pretty good for
the money.

Marion
Re: [zfs-discuss] Identifying firmware version of SATA controller (LSI)
rvandol...@esri.com said:
> I'm trying to figure out where I can find the firmware on the LSI
> controller... are the bootup messages the only place I could expect to see
> this? prtconf and prtdiag both don't appear to give firmware information.
> . . .
> Solaris 10 U8 x86.

The "raidctl" command is your friend; useful for updating firmware if you
choose to do so, as well. You can also find the revisions in the output of
"prtconf -Dv", search for "firm" in the long list.

Regards,

Marion
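P.S. Roughly what I mean (controller numbers will differ on your box, and
I'm going from memory on the exact flags, so check the raidctl man page
first):

  # raidctl -l                    # list controllers, e.g. "Controller: 0"
  # raidctl -l 0                  # details for controller 0, including firmware rev
  # prtconf -Dv | grep -i firm    # the same info, buried in the device tree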
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
fjwc...@gmail.com said:
> Yes, if I was to re-do the hardware config for these servers, using what I
> know now, I would do things a little differently:
> . . .
> - find a case with more than 24 drive bays (any way to get a Thumper
>   without the extra hardware/software?) ;)
> . . .

It's called the Sun Storage J4500 array. Well, until Oracle gets around to
changing the name anyway...

Regards,

Marion
Re: [zfs-discuss] Zpool creation best practices
mijoh...@gmail.com said:
> I've never had a lun go bad but bad things do happen. Does anyone else use
> ZFS in this way? Is this an unrecommended setup?

We used ZFS like this on a Hitachi array for 3 years. Worked fine, not one
bad block/checksum error detected. Still using it on an old Sun 6120 array,
too.

> It's too late to change my
> setup, but in the future when I'm planning new systems, should I consider
> the effort to allow zfs fully control all the disks?

Well, you should certainly consider all the alternatives you can afford.
Our customers happen to like cheap bulk storage, so we have a Thumper, and a
few SAS-connected Sun J4000 SATA JBOD's. But our grant-funded researchers
may not be a "typical" customer mix...

Regards,

Marion
Re: [zfs-discuss] getting decent NFS performance
erik.trim...@sun.com said: > The suggestion was to make the SSD on each machine an iSCSI volume, and add > the two volumes as a mirrored ZIL into the zpool. I've mentioned the following before: For a poor-person's slog which gives decent NFS performance, we have had good results with allocating a slice on (e.g.) an X4150's internal disk, behind the internal Adaptec RAID controller. Said controller has only 256MB of NVRAM, but it made a big difference with NFS performance (look for the "tar unpack" results at the bottom of the page): http://acc.ohsu.edu/~hakansom/j4400_bench.html You can always replace them when funding for your Zeus SSD's comes in (:-). Regards, -- Marion Hakanson OHSU Advanced Computing Center ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
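In case it's useful, that slog setup amounts to something like the following (device/slice names are hypothetical; s7 here would be a small slice on each internal drive behind the NVRAM-backed controller):
  zpool add tank log mirror c1t0d0s7 c1t1d0s7
  zpool status tank     # the mirrored log vdev shows up under "logs"
This needs pool version 7 or later (Solaris-10 U6 and up), and on the releases we're running you can't remove a log vdev once it has been added, so plan accordingly.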
Re: [zfs-discuss] [storage-discuss] ZFS on JBOD storage, mpt driver issue - server not responding
m...@cybershade.us said: > So at this point this looks like an issue with the MPT driver or these SAS > cards (I tested two) when under heavy load. I put the latest firmware for the > SAS card from LSI's web site - v1.29.00 without any changes, server still > locks. > > Any ideas, suggestions how to fix or workaround this issue? The adapter is > suppose to be enterprise-class. We have three of these HBA's, used as follows:
  X4150, J4400, Solaris-10U7-x86, mpt patch 141737-01
  V245, J4200, Solaris-10U7, mpt patch 141736-05
  X4170, J4400, Solaris-10U8-x86, mpt/kernel patch 141445-09
None of these systems are suffering the issues you describe. All of their SAS HBA's are running the latest Sun-supported firmware I could find for these HBA's, which is v1.26.03.00 (BIOS 6.24.00), in LSI firmware update 14.2.2 at: http://www.lsi.com/support/sun/ In that package is also a v1.27.03.00 firmware for use when connecting to the F5100 flash accelerator, but it's clearly labelled as only for use with that device. Anyway, I of course don't know if you've already tried the v1.26.03.00 firmware in your situation, but I wanted to at least report that we are using this combination on Solaris-10 without experiencing the timeout issues you are having. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (home NAS) zfs and spinning down of drives
jimkli...@cos.ru said: > Thanks for the link, but the main concern in spinning down drives of a ZFS > pool is that ZFS by default is not so idle. Every 5 to 30 seconds it closes > a transaction group (TXG) which requires a synchronous write of metadata to > disk. You know, it's just going to depend on your usage. On my home machine (Solaris-10U6 with U8-level patches), the drives are set to spin down after 30 minutes of idle time. I'm not certain if the root pool spins down, but the drives in the 2nd mirrored pool do spin down. This pool contains my Solaris home directory and the Samba-connected datasets for backups of other computers. It is true that I have to make sure Thunderbird and Firefox are not running in order to idle the home directory. Then the drives spin down and seem to stay that way until I wake up the display by moving the mouse or accessing the keyboard. They will also spin up when a nightly backup kicks off on one of the other systems, or if I SSH-in from work to check something. I don't do anything special other than stopping Thunderbird and Firefox when I leave the computer. I just select "Lock Screen" from the Gnome Launch menu, the screen-lock window pops up, and the display goes into power-save mode shortly after. I don't think there's anything magic about ZFS with regard to keeping the drives busy. The fancy power-saving stuff was done by Green-Bytes; There they modified ZFS to do the meta-data updates onto Flash-based SSD's separate from the rest of the usual pool drives. That way things like ZIL activity did not have to spin up a large number of data drives just to make small metadata updates, etc. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
zfs...@jeremykister.com said: > unfortunately, fdisk won't help me at all: > # fdisk -E /dev/rdsk/c12t1d0p0 > # zpool create -f testp c12t1d0 > invalid vdev specification > the following errors must be manually repaired: > /dev/dsk/c3t11d0s0 is part of active ZFS pool dbzpool. Please see zpool(1M). Hmm. Did you do the "devfsadm -Cv" as someone else suggested? I think I would do that both before and after the "fdisk -E". Then give the "dd" treatment again, with the EFI-style partition label in place. I've sometimes had to do the "dd" treatment with both VTOC and EFI labels on the same drive in order to make ZFS forget it had ever been used in a pool. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
>I said: >> You'll need to give the same "dd" treatment to the end of the disk as well; >> ZFS puts copies of its labels at the beginning and at the end. Oh, and zfs...@jeremykister.com said: > im not sure what you mean here - I thought p0 was the entire disk in x86 - > and s2 was the whole disk in the partition. what else should i overwrite? Sorry, yes, you did get the whole slice overwritten. Most people just add a "count=10" or something similar, to overwrite the beginning of the drive, but your invocation would overwrite the whole thing. If the disk is going to be part of a whole-disk zpool, I like to make sure there is not an old VTOC-style partition table on there. That can be done either via some "format -e" commands, or with "fdisk -E", to put an EFI label on there. Anyway, I agree with the desire for "zpool" to be able to do this itself, with less possibility of human error in partitioning, etc. Glad to hear there's already an RFE filed for it. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris disk confusion ?
zfs...@jeremykister.com said: > # format -e c12t1d0 selecting c12t1d0 [disk formatted] /dev/dsk/c3t11d0s0 is > part of active ZFS pool dbzpool. Please see zpool(1M). > > It is true that c3t11d0 is part of dbzpool. But why is solaris upset about > c3t11 when i'm working with c12t1 ?? So i checked the device links, and all > looks fine: > . . . Could it be that c12t1d0 was at some time in the past (either in this machine or another machine) known as c3t11d0, and was part of a pool called "dbzpool"? > i tried: > fdisk -B /dev/rdsk/c12t1d0 > dd if=/dev/zero of=/dev/rdsk/c12t1d0p0 bs=1024k > dd if=/dev/zero of=/dev/rdsk/c12t1d0s2 bs=1024k > > but Solaris still has some association between c3t11 and c12t1. You'll need to give the same "dd" treatment to the end of the disk as well; ZFS puts copies of its labels at the beginning and at the end. Oh, and you can "fdisk -E /dev/rdsk/c12t1d0" to convert to a single, whole-disk EFI partition (non-VTOC style). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
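For the end-of-disk part, something like this is what I have in mind (the block count shown is just a placeholder -- pull the real total from "prtvtoc /dev/rdsk/c12t1d0s2" or format's "verify" first):
  # total disk size in 512-byte blocks, e.g. from prtvtoc
  BLOCKS=976773168
  # zero the first and last 2MB or so, which covers the four ZFS labels
  dd if=/dev/zero of=/dev/rdsk/c12t1d0p0 bs=512 count=4096
  dd if=/dev/zero of=/dev/rdsk/c12t1d0p0 bs=512 oseek=`expr $BLOCKS - 4096`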
Re: [zfs-discuss] Sniping a bad inode in zfs?
da...@elemental.org said: > Normally on UFS I would just take the 'nuke it from orbit' route and use clri > to wipe the directory's inode. However, clri doesn't appear to be zfs aware > (there's not even a zfs analog of clri in /usr/lib/fs/ zfs), and I don't > immediately see an option in zdb which would help cure this. Well, it might make things worse, but have you tried /usr/sbin/unlink ? I'm on Solaris-10, so don't know if that's still part of OpenSolaris. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
opensolaris-zfs-disc...@mlists.thewrittenword.com said: > Is it really pointless? Maybe they want the insurance RAIDZ2 provides. Given > the choice between insurance and performance, I'll take insurance, though it > depends on your use case. We're using 5-disk RAIDZ2 vdevs. > . . . > Would love to hear other opinions on this. Hi again Albert, On our Thumper, we use 7x 6-disk raidz2's (750GB drives). It seems a good compromise between capacity, IOPS, and data protection. Like you, we are afraid of the possibility of a 2nd disk failure during resilvering of these large drives. Our usage is a mix of disk-to-disk-to-tape backups, archival, and multi-user (tens of users) NFS/SFTP service, in roughly that order of load. We have had no performance problems with this layout. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
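For reference, the shape of that layout is roughly the following, truncated to two of the seven vdevs and with made-up device names (on a Thumper the six disks of each vdev are spread across six different controllers):
  zpool create tank \
    raidz2 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
    raidz2 c0t2d0 c1t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
    spare c0t0d0 c1t0d0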
Re: [zfs-discuss] "zfs send..." too slow?
knatte_fnatte_tja...@yahoo.com said: > Is rsync faster? As I have understood it, "zfs send.." gives me an exact > replica, whereas rsync doesnt necessary do that, maybe the ACL are not > replicated, etc. Is this correct about rsync vs "zfs send"? It is true that rsync (as of 3.0.5, anyway) does not preserve NFSv4/ZFS ACL's. It also cannot handle ZFS snapshots. On the other hand, you can run multiple rsync's in parallel; You can only do that with zfs send/recv if you have multiple, independent ZFS datasets that can be done in parallel. So which one goes faster will depend on your situation. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
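By "in parallel" I mean nothing fancier than something like this (dataset names, snapshot name, and destination host are all hypothetical; each stream is independent, so they can overlap):
  zfs send tank/projA@migrate | ssh otherhost zfs recv -d newpool &
  zfs send tank/projB@migrate | ssh otherhost zfs recv -d newpool &
  wait
or the rsync equivalent, one process per top-level directory:
  rsync -aH /tank/projA/ otherhost:/newpool/projA/ &
  rsync -aH /tank/projB/ otherhost:/newpool/projB/ &
  wait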
Re: [zfs-discuss] strange results ...
jel+...@cs.uni-magdeburg.de said: > 2nd) Never had a Sun STK RAID INT before. Actually my intention was to create > a zpool mirror of sd0 and sd1 for boot and logs, and a 2x2-way zpool mirror > with the 4 remaining disks. However, the controller seems not to support > JBODs :( - which is also bad, since we can't simply put those disks into > another machine with a different controller without data loss, because the > controller seems to use its own format under the hood. Yes, those Adaptec/STK internal RAID cards are annoying for use with ZFS. You also cannot replace a failed disk without using the STK RAID software to configure the new disk as a standalone volume (before "zpool replace"). Fortunately you probably don't need to boot into the BIOS-level utility, I think you can use the Adaptec StorMan utilities from within the OS, if you remembered to install them. > Also the 256MB > BBCache seems to be a little bit small for ZIL even if one would know, how to > configure it ... Unless you have an external (non-NV cached) pool on the same server, you wouldn't gain anything from setting up a separate ZIL in this case. All your internal drives have NV cache without doing anything special. > So what would you recommend? Creating 2 appropriate STK INT arrays and using > both as a single zpool device, i.e. without ZFS mirror devs and 2nd copies? Here's what we did: Configure all internal disks as standalone volumes on the RAID card. All those volumes have the battery-backed cache enabled. The first two 146GB drives got sliced in two: the first half of each disk became the boot/root mirror pool. The 2nd half was used for a separate-ZIL mirror, applied to an external SATA pool. Our remaining internal drives were configured into a mirrored ZFS pool for database transaction logs. No need for a separate ZIL there, since the internal drives effectively have NV cache as far as ZFS is concerned. Yes, the 256MB cache is small, but if it fills up, it is backed by the 10kRPM internal SAS drives, which should have decent latency when compared to external SATA JBOD drives. And even this tiny NV cache makes a huge difference when used on an NFS server: http://acc.ohsu.edu/~hakansom/j4400_bench.html Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zpool without any redundancy
>I wrote: >> Is anyone else tired of seeing the word redundancy? (:-) matthias.ap...@lanlabor.com said: > Only in a perfect world (tm) ;-) > IMHO there is no such thing as "too much redundancy". In the real world the > possibilities of redundancy are only limited by money, Sigh. I was just joking about how many times the word showed up in all of our postings. http://www.imdb.com/title/tt1436296/ Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zpool without any redundancy
mmusa...@east.sun.com said: > What benefit are you hoping zfs will provide in this situation? Examine > your situation carefully and determine what filesystem works best for you. > There are many reasons to use ZFS, but if your configuration isn't set up to > take advantage of those reasons, then there's a disconnect somewhere. How about if your config can only take advantage of _some_ of those reasons to use ZFS? There are plenty of benefits to using ZFS on a single bare hard drive, and those benefits apply to using it on an expensive SAN array. It's up to each individual to decide if adding redundancy is worthwhile or not. I'm not saying ZFS is perfect. And, ZFS is indeed better when it can make use of redundancy. But ZFS has lost data even with such redundancy, so having it does not confer magical protection from all disasters. Anyway, here's a note describing our experience with this situation: We've been using ZFS here on two hardware RAID fiberchannel arrays, with no ZFS-level redundancy, starting September-2006 -- roughly 6TB of data, checksums enabled, weekly scrubs, regular tape backups. So far there has been not one checksum error detected on these arrays. We've had dumb SAN connectivity losses, complete power failures on arrays, FC switches, and/or file servers, and so on, but no loss of data. Before ZFS, we used a combination of SAM-QFS and UFS filesystems on the same arrays, and ZFS has proved much easier to manage, reducing data loss due to human errors in volume and space management. The checksum feature makes filesystems without it into second-class offerings, in my opinion. Is anyone else tired of seeing the word redundancy? (:-) Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to convert checksums
webcl...@rochester.rr.com said: > To verify data, I cannot depend on existing tools since diff is not large > file aware. My best idea at this point is to calculate and compare MD5 sums > of every file and spot check other properties as best I can. Ray, I recommend that you use rsync's "-c" to compare copies. It reads all the source files, computes a checksum for them, then does the same for the destination and compares checksums. As far as I know, the only thing that rsync can't do in your situation is the ZFS/NFSv4 ACL's. I've used it to migrate many TB's of data. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
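Concretely, the invocation I have in mind is along these lines (paths are hypothetical; "-n" makes it a dry run so nothing gets modified, "-c" forces full-file checksums on both sides):
  rsync -n -a -c -v /old/pool/data/ /new/pool/data/
Anything it would have copied shows up in the output, which for a clean verification pass should be an empty list.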
Re: [zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
David Stewart wrote: > How do I identify which drive it is? I hear each drive spinning (I listened > to them individually) so I can't simply select the one that is not spinning. You can try reading from each raw device, and looking for a blinky-light to identify which one is active. If you don't have individual lights, you may be able to hear which one is active. The "dd" command should do. Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
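Something as simple as this will do it (device name is hypothetical -- repeat for each member of the raidz and watch or listen for the one that stays quiet):
  dd if=/dev/rdsk/c5t2d0p0 of=/dev/null bs=1024k count=1000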
Re: [zfs-discuss] periodic slow responsiveness
rswwal...@gmail.com said: > Yes, but if it's on NFS you can just figure out the workload in MB/s and use > that as a rough guideline. I wonder if that's the case. We have an NFS server without NVRAM cache (X4500), and it gets huge MB/sec throughput on large-file writes over NFS. But it's painfully slow on the "tar extract lots of small files" test, where many, tiny, synchronous metadata operations are performed. > I did a smiliar test with a 512MB BBU controller and saw no difference with > or without the SSD slog, so I didn't end up using it. > > Does your BBU controller ignore the ZFS flushes? I believe it does (it would be slow otherwise). It's the Sun StorageTek internal SAS RAID HBA. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
j...@jamver.id.au said: > For a predominantly NFS server purpose, it really looks like a case of the > slog has to outperform your main pool for continuous write speed as well as > an instant response time as the primary criterion. Which might as well be a > fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of > them. I wonder if you ran Richard Elling's "zilstat" while running your workload. That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic. I did some filebench and "tar extract over NFS" tests of J4400 (500GB, 7200RPM SATA drives), with and without slog, where slog was using the internal 2.5" 10kRPM SAS drives in an X4150. These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7. We saw slight differences on filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present. Granted, the latter was with only one NFS client, so likely did not fill NVRAM. Pretty good results for a poor-person's slog, though: http://acc.ohsu.edu/~hakansom/j4400_bench.html Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.). None of the filers I've seen have ever had tens of GB of NVRAM. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving volumes to new controller
vidar.nil...@palantir.no said: > I'm trying to move disks in a zpool from one SATA-kontroller to another. Its > 16 disks in 4x4 raidz. Just to see if it could be done, I moved one disk from > one raidz over to the new controller. Server was powered off. > . . . > zpool replace storage c10t7d0 c11t0d0 > /dev/dsk/c11t0d0s0 is part of active ZFS pool storage. Please see zpool(1M). > . . . > I've tried several things now (fumbling around in the dark :-)). I tried to > delete all partitions and relabel the disk, with no other results than above. > . . . To recover from this situation, you'll need to erase enough blocks of the disk to get rid of the ZFS pool info. You could do this a number of ways, but probably the simplest is: dd if=/dev/zero of=/dev/rdsk/c11t0d0 bs=512 count=100 You may also need to give the same treatment to the last several blocks of the disk, where redundant ZFS labels may still be present. > Both controllers are "raid controllers", and I haven't found any way to make > them presents the disks directly to opensolaris. So I have made 1 volume for > each drive (the raid5 implementation is rather slow, and they have no > battery). Maybe this is the source of the problems? I don't think so. If the two RAID controllers were not compatible, I doubt that ZFS would see the pool info on the disk that you already moved. By the way, after you've got the above issue fixed, if you can power the server off, you might be able to move all the drives at once, without any resilvering. Just "zpool export storage" before moving the drives, then afterwards "zpool import" should be able to find the pool in the new location. Note that this export/import approach probably won't work if the two RAID controllers are not compatible with each other. Some RAID controllers can be re-flashed with non-RAID firmware, so that might simplify things. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
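In other words, roughly this sequence (pool name from your example):
  zpool export storage
  # power down, move all the drives to the new controller, boot up
  zpool import              # with no arguments, just lists importable pools and their new device names
  zpool import storage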
Re: [zfs-discuss] RAIDZ versus mirrroed
rswwal...@gmail.com said: > It's not the stripes that make a difference, but the number of controllers > there. > > What's the system config on that puppy? The "zpool status -v" output was from a Thumper (X4500), slightly edited, since in our real-world Thumper, we use c6t0d0 in c5t4d0's place in the "optimal" layout I posted, because c5t4d0 is used in the boot-drive mirror. See the following for our 2006 Thumper benchmarks, which appear to bear out Richard Elling's RaidOptimizer analysis: http://acc.ohsu.edu/~hakansom/thumper_bench.html While I'm at it, filebench numbers from a recent J4400-based database server deployment, with some "slog vs no-slog" comparisons (sorry, no SSD's available here yet): http://acc.ohsu.edu/~hakansom/j4400_bench.html Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ versus mirrroed
rswwal...@gmail.com said: > There is another type of failure that mirrors help with and that is > controller or path failures. If one side of a mirror set is on one > controller or path and the other on another then a failure of one will not > take down the set. > > You can't get that with RAIDZn. You can if you have a stripe of RAIDZn's, and enough controllers (or paths) to go around. The raidz2 below should be able to survive the loss of two controllers, shouldn't it? Regards, Marion
$ zpool status -v
  pool: zp1
 state: ONLINE
 scrub: scrub completed after 7h9m with 0 errors on Mon Sep 14 13:39:03 2009
config:

        NAME        STATE     READ WRITE CKSUM
        bulk_zp01   ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c4t7d0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c6t7d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
        spares
          c0t0d0    AVAIL
          c1t0d0    AVAIL
          c4t0d0    AVAIL
          c7t0d0    AVAIL

errors: No known data errors
$
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding SAS/SATA Backplanes and Connectivity
asher...@versature.com said: > And, on that subject, is there truly a difference between Seagate's line-up > of 7200 RPM drives? They seem to now have a bunch: > . . . > Other manufacturers seem to have similar lineups. Is the difference going to > matter to me when putting a mess of them into a SAS JBOD with an expander? There are differences even within the lineup of Sun-supplied SATA drives. Some support multipathing, and some do not. Even some that are reported (in Sun docs) to support it, do not. http://opensolaris.org/jive/thread.jspa?threadID=107049&tstart=30 http://opensolaris.org/jive/thread.jspa?threadID=107057&tstart=15 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
bfrie...@simple.dallas.tx.us said: > No. I am suggesting that all Solaris 10 (and probably OpenSolaris systems) > currently have a software-imposed read bottleneck which places a limit on > how well systems will perform on this simple sequential read benchmark. > After a certain point (which is unfortunately not very high), throwing more > hardware at the problem does not result in any speed improvement. This is > demonstrated by Scott Lawson's little two disk mirror almost producing the > same performance as our much more exotic setups. Apologies for reawakening this thread -- I was away last week. Bob, have you tried changing your benchmark to be multithreaded? It occurs to me that maybe a single cpio invocation is another bottleneck. I've definitely experienced the case where a single bonnie++ process was not enough to max out the storage system. I'm not suggesting that the bug you're demonstrating is not real. It's clear that subsequent runs on the same system show the degradation, and that points out a problem. Rather, I'm thinking that maybe the timing comparisons between low-end and high-end storage systems on this particular test are not revealing the whole story. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does zpool clear delete corrupted files
jlo...@ssl.berkeley.edu said: > What's odd is we've checked a few hundred files, and most of them don't > seem to have any corruption. I'm thinking what's wrong is the metadata for > these files is corrupted somehow, yet we can read them just fine. I wish I > could tell which ones are really bad, so we wouldn't have to recreate them > unnecessarily. They are mirrored in various places, or can be recreated > via reprocessing, but recreating/ restoring that many files is no easy task. You know, this sounds similar to what happened to me once when I did a "zpool offline" to half of a mirror, changed a lot of stuff in the pool (like adding 20GB of data to an 80GB pool), then "zpool online", thinking ZFS might be smart enough to sync up the changes that had happened since detaching. Instead, a bunch of bad files were reported. Since I knew nothing was wrong with the half of the mirror that had never been offlined, I just did a "zpool detach" of the formerly offlined drive, "zpool clear" to clear the error counts, "zpool scrub" to check for integrity, then "zpool attach" to cause resilver to start from scratch. If this describes your situation, I guess the tricky part for you is to now decide which half of your mirror is the good half. There's always "rsync -n -v -a -c ..." to compare copies of files that happen to reside elsewhere. Slow but safe. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
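For what it's worth, the recovery sequence I described amounts to this (device names hypothetical; c1t2d0 being the half that had been offlined, c1t1d0 the known-good half):
  zpool detach tank c1t2d0            # drop the stale half of the mirror
  zpool clear tank                    # reset the error counters
  zpool scrub tank                    # verify the remaining copy
  zpool attach tank c1t1d0 c1t2d0     # re-attach and let it resilver from scratch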
Re: [zfs-discuss] storage & zilstat assistance
bfrie...@simple.dallas.tx.us said: > Your IOPS don't seem high. You are currently using RAID-5, which is a poor > choice for a database. If you use ZFS mirrors you are going to unleash a > lot more IOPS from the available spindles. RAID-5 may be poor for some database loads, but it's perfectly adequate for this one (small data warehouse, sequential writes, and so far mostly sequential reads as well). So far the RAID-5 LUN has not been a problem, and it doesn't look like the low IOPS are because of the hardware, rather the database/application just isn't demanding more. Please correct me if I've come to the wrong conclusion here > I am not familiar with zilstat. Presumaby the '93' is actually 930 ops/ > second? I think you answered your question in your second post. But for others, the "93" is the total ops over the reporting interval. In this case, the interval was 10 seconds, so 9.3 ops/sec. > I have a 2540 here, but a very fast version with 12 300GB 15K RPM SAS drives > arranged as six mirrors (2540 is configured like a JBOD). While I don't run > a database, I have run an IOPS benchmark with random writers (8K blocks) and > see a peak of 3708 ops/sec. With a SATA model you are not likely to see > half of that. Thanks for the 2540 numbers you posted. There's a SAS 2530 here with the same 300GB 15kRPM drives, and as you said, it's fast. But it looks so far like the SATA model, even with less than half the IOPS, will be more than enough for our workload. I'm pretty convinced that the SATA 2540 will be sufficient. What I'm not sure of is if the cheaper J4200 without SSD would be sufficient. I.e., are we generating enough synchronous traffic that lack of NVRAM cache will cause problems? One thing zilstat doesn't make obvious (to me) is the latency effects of a separate log/ZIL device. I guess I could force our old array's cache into write-through mode and see what happens to the numbers. Judging by our experience with NFS servers using this same array, I'm reluctant to try. Thanks and regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] storage & zilstat assistance
Greetings, We have a small Oracle project on ZFS (Solaris-10), using a SAN-connected array which is in need of replacement. I'm weighing whether to recommend a Sun 2540 array or a Sun J4200 JBOD as the replacement. The old array and the new ones all have 7200RPM SATA drives. I've been watching the workload on the current storage using Richard Elling's handy zilstat tool, and could use some more eyes/brains than just mine for making sense of the results. There are three pools; One is on a mirrored pair of internal 2.5" SAS 10kRPM drives, which holds some database logs; The 2nd is a RAID-5 LUN on the old SAN array (6 drives), which holds database tables & indices; The 3rd is a mirrored pair of SAN drives, holding log replicas, archives, and RMAN backup files. I've included inline below an edited "zpool status" listing to show the ZFS pools, a listing of "zilstat -l 30 10" showing ZIL traffic for each of the three pools, and a listing of "iostat -xn 10" for the relevant devices, all during the same time period. Note that the time these stats were taken was a bit atypical, in that an RMAN backup was taking place, which was the source of the read (over)load on the "san_sp2" pool devices. So, here are my conclusions, and I'd like a sanity check since I don't have a lot of experience with interpreting ZIL activity just yet. (1) ZIL activity is not very heavy. Transaction logs on the internal drives, which have no NVRAM cache, appear to generate low enough levels of traffic that we could get by without an SSD ZIL if a JBOD solution is chosen. We can keep using the internal drive pool after the old SAN array is replaced. (2) During RMAN backups, ZIL activity gets much heavier on the affected SAN pool. We see a low-enough average rate (maybe 200 KBytes/sec), with the occasional peak of as much as 1 to 2 MBytes/sec. The 100%-busy figures here are for "regular" read traffic, not ZIL. (3) Probably to be safe, we should go with the 2540 array, which does have a small NVRAM cache, even though it is a fair bit more expensive than the J4200 JBOD solution. Adding a Logzilla SSD to the J4200 is way more expensive than the 2540 with its NVRAM cache, and an 18GB Logzilla is probably overkill for this workload. I guess one question I'd add is: The "ops" numbers seem pretty small. Is it possible to give enough spindles to a pool to handle that many IOP's without needing an NVRAM cache? I know latency comes into play at some point, but are we at that point?
Thanks and regards, Marion
===
 pool: int_mp1
config:
        NAME          STATE     READ WRITE CKSUM
        int_mp1       ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0

 pool: san_sp1
config:
        NAME                                              STATE     READ WRITE CKSUM
        san_sp1                                           ONLINE       0     0     0
          c3t4849544143484920443630303133323230303430d0  ONLINE       0     0     0

 pool: san_sp2
config:
        NAME                                              STATE     READ WRITE CKSUM
        san_sp2                                           ONLINE       0     0     0
          c3t4849544143484920443630303133323230303033d0  ONLINE       0     0     0
===
# zilstat -p san_sp1 -l 30 10
   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
    108992      10899      108992     143360      14336      143360     5      1       2       2
         0          0           0          0          0           0     0      0       0       0
     33536       3353       16768      40960       4096       20480     2      0       2       0
    134144      13414       50304     163840      16384       61440     8      0       8       0
     16768       1676       16768      20480       2048       20480     1      0       1       0
         0          0           0          0          0           0     0      0       0       0
    134144      13414      134144     221184      22118      221184     2      0       0       2
    134848      13484      117376     233472      23347      143360     9      0       8       1
^C
# zilstat -p san_sp2 -l 30 10
   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
   1126264     112626      318592    1658880     165888      466944    56      0      50       6
     67072       6707       25152     114688      11468       53248     6      0       6       0
     61120       6112       16768      86016       8601       20480     7      3       4       0
    193216      19321       83840     258048      25804      114688    14      0      14       0
   1563584     156358     1043776    1916928     191692     1282048    96      3      93       0
     50304       5030       1
Re: [zfs-discuss] [on-discuss] Reliability at power failure?
udip...@gmail.com said: > dick at nagual.nl wrote: >> Maybe because on the fifth day some hardware failure occurred? ;-) > > That would be which? The system works and is up and running beautifully. > OpenSolaris, as of now. Running beautifully as long as the power stays on? Is it hard to believe hardware might glitch at power-failure (or power-on-after-failure)? > Ah, you're hinting at a rare hardware glitch as underlying problem? AFAIU, > it is a proclaimed feature of ZFS that writes are atomic, out and over Not only does ZFS advertise atomic updates, it also _depends_ on them, and checks for them having happened, likely more so than other filesystems. Is it hard to believe that ZFS is exercising and/or checking up on your hardware in ways that Linux does not do? > Uwe, > who is a big fan of a ZFS that fulfills all of its promises. Snapshots and > luupgrade have yet to fail me on it. And a few other beautiful things. It is > the reliability that makes me wonder if UFS/FFS/ext3 are not better choices > in this respect. Blaming standard, off-the-shelf hardware as 'too cheap' is a > too slippery slope, btw. Sorry to hear you're still having this issue. I can only offer anecdotal experience: Running Solaris-10 here, non-mirrored ZFS root/boot since last December (other ZFS filesystems, mirrored and non-mirrored, for 2 years prior), on a standard off-the-shelf PC, slightly more than 5 years old. This system has been through multiple power-failures, never with any corruption. Same goes for a 2-yr-old Dell desktop PC at work, with mirrored ZFS root/boot; Multiple power failures, never any reported checksum errors or other corruption. We also have Solaris-10 systems at work, non-ZFS-boot, but with ZFS running without redundancy on non-Sun fiberchannel RAID gear. These have had power failures and other SAN outages without causing corruption of ZFS filesystems. We have experienced a number of times where systems failed to boot after power-failure, due to boot-archive being out of date. Not corrupted, just out of date. Annoying and inconvenient for production systems, but nothing at all to do with ZFS. So, I personally have not found ZFS to be any less reliable in the presence of power failures than Solaris-10/UFS or Linux on the same hardware. I wonder what it is that's unique or rare about your situation, that OpenSolaris and/or ZFS is uncovering? I also wonder how hard it might be to make ZFS resilient to whatever unique/rare circumstances you have, as compared to finding/fixing/avoiding those circumstances. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
mi...@cc.umanitoba.ca said: > What would I look for with mpstat? Look for a CPU (thread) that might be 100% utilized; Also look to see if that CPU (or CPU's) has a larger number in the "ithr" column than all other CPU's. The idea here is that you aren't getting much out of the T2000 if only one (or a few) of its 32 CPU's is working hard. On our T2000's running Solaris-10 (Update 4, I believe), the default kernel settings do not enable interrupt-fanout for the network interfaces. So you can end up with all four of your e1000g's being serviced by the same CPU. You can't get even one interface to handle more than 35-45% of a gigabit if that's the case, but proper tuning has allowed us to see 90MByte/sec each, on multiple interfaces simultaneously. Note I'm not suggesting this explains your situation. But even if you've addressed this particular issue, you could still have some other piece of your stack which ends up bottlenecked on a single CPU, and mpstat can show if that's happening. Oh yes, "intrstat" can also show if hardware device interrupts are being spread among multiple CPU's. On the T2000, it's recommended that you set things up so only one thread per core is allowed to handle interrupts, freeing the others for application-only work. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris / ZFS at the low-end
bh...@freaks.com said: > Even with a very weak CPU the system is close to saturating the PCI bus for > reads with most configurations. Nice little machine. I wonder if you'd get some of the bonnie numbers increased if you ran multiple bonnie's in parallel. Even though the sequential throughput is near 100MB/sec, using both CPU cores might push more random IOP's than a single-threaded bonnie can go. This was certainly the case on an UltraSPARC-T1 (1GHz) here -- not known for single-threaded speed, but good multithreaded throughput. I ran three bonnie++'s together, using "-p 3" to initialize a semaphore, and "-y" on the three measurement runs to synchronize their startup. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
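In case it helps, those parallel bonnie++ runs looked something like this (directories and sizes are made up; "-p 3" sets up the semaphore, and each "-y" run waits on it so all three start together):
  bonnie++ -p 3
  bonnie++ -y -d /pool/bench1 -s 8192 -n 0 > b1.out 2>&1 &
  bonnie++ -y -d /pool/bench2 -s 8192 -n 0 > b2.out 2>&1 &
  bonnie++ -y -d /pool/bench3 -s 8192 -n 0 > b3.out 2>&1 &
  wait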
Re: [zfs-discuss] [perf-discuss] ZFS performance issue - READ is slow as hell...
james.ma...@sun.com said: > I'm not yet sure what's broken here, but there's something pathologically > wrong with the IO rates to the device during the ZFS tests. In both cases, > the wait queue is getting backed up, with horrific wait queue latency > numbers. On the read side, I don't understand why we're seeing 4-5 seconds of > zero disk activity on the read test in between bursts of a small number of > reads. We observed such long pauses (with zero disk activity) with a disk array that was being fed more operations than it could handle (FC queue depth). The array was not losing ops, but the OS would fill the device's queue and then the OS would completely freeze on any disk-related activity for the affected LUN's. All zpool or zfs commands related to those pools would be unresponsive during those periods, until the load slowed down enough such that the OS wasn't ahead of the array. This was with Solaris-10 here, not OpenSolaris or SXCE, but I suspect the principle would still apply. Naturally, the original poster may have a very different situation, so take the above as you wish. Maybe Dtrace can help: http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io http://blogs.sun.com/chrisg/entry/latency_bubbles_follow_up http://blogs.sun.com/chrisg/entry/that_we_should_make Note that using the above references, Dtrace showed that we had some FC operations which took 60 or even 120 seconds to complete. Things got much better here when we zeroed in on two settings: (a) set FC queue depth for the device to match its backend capacity (4). (b) turn off sorting of the queue by the OS/driver (latency evened out). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
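If you don't want to wade through those blog entries, the core of the measurement is a one-liner in this spirit (it quantizes per-I/O latency in microseconds; run it during the stalls and look for outliers up in the seconds range):
  dtrace -n 'io:::start { s[arg0] = timestamp; } io:::done /s[arg0]/ { @["I/O latency (us)"] = quantize((timestamp - s[arg0]) / 1000); s[arg0] = 0; }'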
Re: [zfs-discuss] Bad SWAP performance from zvol
casper@sun.com said: > I've upgraded my system from ufs to zfs (root pool). > By default, it creates a zvol for dump and swap. > . . . > So I removed the zvol swap and now I have a standard swap partition. The > performance is much better (night and day). The system is usable and I > don't know the job is running. > > Is this expected? If you're using Solaris-10U6 to migrate, the early revisions of liveupgrade would create swap and dump zvols that have some different properties than what S10U6 Jumpstart creates. On x86 here, the swap zvol ends up with 4k volblocksize when you Jumpstart install, but liveupgrade sets it to 8k (which does not match the system page-size of 4k). The other difference I noticed was the dump zvol from Jumpstart install has 128k volblocksize, but early S10U6 liveupgrade set it to 8k, which makes crash-dumps incredibly slow (should you have one). I know that subsequent LU patches have fixed the dump zvol volblocksize, but am not sure if the swap zvol has been updated in a LU patch. Sorry I can't report on whether zvol swap is slower than UFS swap slice for us here; None of our ZFS-root systems have done any significant swapping/paging as far as I can tell.
$ zfs get volsize,referenced rpool/swap
NAME        PROPERTY    VALUE  SOURCE
rpool/swap  volsize     4.00G  -
rpool/swap  referenced  105M   -
$
Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
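If anyone wants to check or fix an LU-created swap zvol, the dance is roughly the following (volblocksize can't be changed after creation, hence the destroy/recreate; the 4G size matches my system but adjust to taste):
  zfs get volblocksize rpool/swap rpool/dump
  swap -d /dev/zvol/dsk/rpool/swap
  zfs destroy rpool/swap
  zfs create -V 4G -b 4k rpool/swap
  swap -a /dev/zvol/dsk/rpool/swap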
Re: [zfs-discuss] Virutal zfs server vs hardware zfs server
n...@jnickelsen.de said: > As far as I know the situation with ATI is that, while ATI supplies > well-performing binary drivers for MS Windows (of course) and Linux, there is > no such thing for other OSs. So OpenSolaris uses standardized interfaces of > the graphics hardware, which have comparatively low bandwidth. > . . . > But there are things that really are a pain, e. g. web pages that constantly > blend one picture into the other, for instance http://www.strato.de/ . While > you would not notice that, usually, this page makes my laptop really slow, > such that it requires significant effort even to find and press the button to > close the window. Wow, this is getting pretty far afield from a ZFS discussion. Hopefully others will find this a helpful tidbit: I just found some xorg.conf settings which greatly alleviate this issue on my Solaris-10-x86 machine with an ATI Radeon 9200 graphics adapter. In the "Device" section, try one of the following:
Option "AccelMethod" "EXA"   # default is "XAA"
Or:
Option "XaaNoOffscreenPixmaps" "on"
Seriously, it's almost like having a new PC. Either option makes the "100% CPU while fading rotating images" go away; Personally, I prefer the 2nd option, as I found the 1st method led to slightly slower redrawing of windows (e.g. when you switch between GNOME desktops), but that will depend on what else you're doing. But yes, nVidia cards are much, much better supported in Solaris. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on SAN?
bfrie...@simple.dallas.tx.us said: > A 12-disk pool that I built a year ago is still working fine with absolutely > no problems at all. Another two disk pool built using cheap large USB > drives has been running for maybe eight months, with no problems. We have non-redundant ZFS pools on an HDS 9520V array, and also a Sun 6120 array, some of them running for two years now (S10U3, S10U4, S10U5, both SPARC and x86), up to 4TB in size. We have experienced SAN zoning mistakes, complete power loss to arrays, servers, and/or SAN switches, etc., with no pool corruption or data loss. We have not even seen one block checksum error detected by ZFS on these arrays (we have seen one such error on our X4500 in the past 6 months). Note that the only available pool failure mode in the presence of a SAN I/O error for these OS's has been to panic/reboot, but so far when the systems have come back, data has been fine. We also do tape backups of these pools, of course. Regards, -- Marion Hakanson OHSU Advanced Computing Center ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Introducing zilstat
The zilstat tool is very helpful, thanks! I tried it on an X4500 NFS server, while extracting a 14MB tar archive, both via an NFS client, and locally on the X4500 itself. Over NFS, said extract took ~2 minutes, and showed peaks of 4MB/sec buffer-bytes going through the ZIL. When run locally on the X4500, the extract took about 1 second, with zilstat showing all zeroes. I wonder if this is a case where that ZIL bypass kicks in for >32K writes, in the local tar extraction. Does zilstat's underlying dtrace include these bypass-writes in the totals it displays? I think if it's possible to get stats on this bypassed data, I'd like to see it as another column (or set of columns) in the zilstat output. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
d...@yahoo.com said: > Any recommendations for an SSD to work with an X4500 server? Will the SSDs > used in the 7000 series servers work with X4500s or X4540s? The Sun System Handbook (sunsolve.sun.com) for the 7210 appliance (an X4540-based system) lists the "logzilla" device with this fine print: PN#371-4192 Solid State disk drives can only be installed in slots 3 and 11. Makes me wonder if they would work in our X4500 NFS server. Our ZFS pool is already deployed (Solaris-10), but we have four hot spares -- two of which could be given up in favor of a mirrored ZIL. An OS upgrade to S10U6 would give the separate-log functionality, if the drivers, etc. supported the actual SSD device. I doubt we'll go out and buy them before finding out if they'll actually work -- it would be a real shame if they didn't, though. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hybrid Pools - Since when?
richard.ell...@sun.com said: > L2ARC arrived in NV at the same time as ZFS boot, b79, November 2007. It was > not back-ported to Solaris 10u6. You sure? Here's output on a Solaris-10u6 machine:
cyclops 4959# uname -a
SunOS cyclops 5.10 Generic_137138-09 i86pc i386 i86pc
cyclops 4960# zpool upgrade -v
This system is currently running ZFS pool version 10.

The following versions are supported:

VER  DESCRIPTION
---
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
For more information on a particular version, including supported releases, see:
http://www.opensolaris.org/os/community/zfs/version/N
Where 'N' is the version number.
cyclops 4961#
Note, I haven't tried adding a cache device yet. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To separate /var or not separate /var, that is the question....
vincent_b_...@yahoo.com said: > Just wondering if (excepting the existing zones thread) there are any > compelling arguments to keep /var as it's own filesystem for your typical > Solaris server. Web servers and the like. Well, it's been considered a "best practice" for servers for a lot of years to keep /var/ as a separate filesystem: (1) You can use special mount options, such as "nosuid", which improves security. E.g. world-writable areas (/var/tmp) cannot be seeded with a trojan or other privilege-escalating attack. (2) You can limit the size, preventing a non-privileged process from using up all the system's disk space. If you don't believe me, go read Sun's own Blueprints books/articles. Personally, I'd like to place a limit on /var/core/; That's the only consistent "out of disk space" cause I've seen on our Solaris-10 systems, and that happens whether /var/ is separate or not. Maybe /var/crash/ as well. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
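On a ZFS-root box, the /var/core limit is easy enough to get without making all of /var separate -- something like this (the dataset name is whatever fits your pool layout, and the coreadm step is only needed if cores aren't already landing there):
  zfs create -o mountpoint=/var/core -o quota=2g rpool/varcore
  coreadm -g /var/core/core.%f.%p -e global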
Re: [zfs-discuss] zpool replace - choke point
[EMAIL PROTECTED] said: > Thanks for the tips. I'm not sure if they will be relevant, though. We > don't talk directly with the AMS1000. We are using a USP-VM to virtualize > all of our storage and we didn't have to add anything to the drv > configuration files to see the new disk (mpxio was already turned on). We > are using the Sun drivers and mpxio and we didn't require any tinkering to > see the new LUNs. Yes, the fact that the USP-VM was recognized automatically by Solaris drivers is a good sign. I suggest that you check to see what queue-depth and disksort values you ended up with from the automatic settings: echo "*ssd_state::walk softstate |::print -t struct sd_lun un_throttle" \ | mdb -k The "ssd_state" would be "sd_state" on an x86 machine (Solaris-10). The "un_throttle" above will show the current max_throttle (queue depth); Replace it with "un_min_throttle" to see the min, and "un_f_disksort_disabled" to see the current queue-sort setting. The HDS docs for 9500 series suggested 32 as the max_throttle to use, and the default setting (Solaris-10) was 256 (hopefully with the USP-VM you get something more reasonable). And while 32 did work for us, i.e. no operations were ever lost as far as I could tell, the array back-end -- the drives themselves, and the internal SATA shelf connections, have an actual queue depth of four for each array controller. The AMS1000 has the same limitation for SATA shelves, according to our HDS engineer. In short, Solaris, especially with ZFS, functions much better if it does not try to send more FC operations to the array than the actual physical devices can handle. We were actually seeing NFS client operations hang for minutes at a time when the SAN-hosted NFS server was making its ZFS devices busy -- and this was true even if clients were using different devices than the busy ones. We do not see these hangs after making the described changes, and I believe this is because the OS is no longer waiting around for a response from devices that aren't going to respond in a reasonable amount of time. Yes, having the USP between the host and the AMS1000 will affect things; There's probably some huge cache in there somewhere. But unless you've got cache of hundreds of GB in size, at some point a resilver operation is going to end up running at the speed of the actual back-end device. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool replace - choke point
[EMAIL PROTECTED] said: > I think we found the choke point. The silver lining is that it isn't the > T2000 or ZFS. We think it is the new SAN, an Hitachi AMS1000, which has > 7200RPM SATA disks with the cache turned off. This system has a very small > cache, and when we did turn it on for one of the replacement LUNs we saw a > 10x improvement - until the cache filled up about 1 minute later (was using > zpool iostat). Oh well. We have experience with a T2000 connected to the HDS 9520V, predecessor to the AMS arrays, with SATA drives, and it's likely that your AMS1000 SATA has similar characteristics. I didn't see if you're using Sun's drivers to talk to the SAN/array, but we are using Solaris-10 (and Sun drivers + MPXIO), and since the Hitachi storage isn't automatically recognized (sd/ssd, scsi_vhci), it took a fair amount of tinkering to get parameters adjusted to work well with the HDS storage. The combination that has given us best results with ZFS is: (a) Tell the array to ignore SYNCHRONIZE_CACHE requests from the host. (b) Balance drives within each AMS disk shelf across both array controllers. (c) Set the host's max queue depth to 4 for the SATA LUN's (sd/ssd driver). (d) Set the host's disable_disksort flag (sd/ssd driver) for HDS LUN's. Here's the reference we used for setting the parameters in Solaris-10: http://wikis.sun.com/display/StorageDev/Parameter+Configuration Note that the AMS uses read-after-write verification on SATA drives, so you only have half the IOP's for writes that the drives are capable of handling. We've found that small RAID volumes (e.g. a two-drive mirror) are unbelievably slow, so you'd want to go toward having more drives per RAID group, if possible. Honestly, if I recall correctly what I saw in your "iostat" listings earlier, your situation is not nearly as "bad" as with our older array. You don't seem to be driving those HDS LUN's to the extreme busy states that we have seen on our 9520V. It was not unusual for us to see LUN's at 100% busy, 100% wait, with 35 ops total in the "actv" and "wait" columns, and I don't recall seeing any 100%-busy devices in your logs. But getting the FC queue-depth (max-throttle) setting to match what the array's back-end I/O can handle greatly reduced the long "zpool status" and other I/O-related hangs that we were experiencing. And disabling the host-side FC queue-sorting greatly improved the overall latency of the system when busy. Maybe it'll help yours too. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs boot - U6 kernel patch breaks sparc boot
[EMAIL PROTECTED] said: > I thought to look at df output before rebooting, and there are PAGES & PAGES > like this: > >/var/run/.patchSafeModeOrigFiles/usr/platform/FJSV,GPUZC-M/lib/libcpc.so.1 7597264 85240 7512024 2%/usr/platform/FJSV,GPUZC-M/lib/libcpc.so.1 > . . . > Hundreds of mountpoints, what's it doing in there? That's normal, for deferred-activation patches (like this jumbo kernel patch). They are loopback mounts which are supposed to keep any kernel-specific things from being affected by something that would otherwise change the running kernel. Using liveupgrade for patches is quite a bit cleaner, in my opinion, if you have that option. It seems to do a good job of updating grub on all bootable drives as well (as of S10U6, anyway). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
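The liveupgrade flavor of patching goes roughly like this (the BE name is arbitrary, and <patch-dir>/<patch-id> are placeholders for wherever you unpacked the patches and which ones you're applying; check luupgrade(1M) for the exact -t usage on your release):
  lucreate -n patched
  luupgrade -t -n patched -s <patch-dir> <patch-id> [<patch-id> ...]
  luactivate patched
  init 6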
Re: [zfs-discuss] unable to ludelete BE with ufs
[EMAIL PROTECTED] said: > # ludelete beA > ERROR: cannot open 'pool00/zones/global/home': dataset does not exist > ERROR: cannot mount mount point device > > ERROR: failed to mount file system on > > ERROR: unmounting partially mounted boot environment file systems > ERROR: cannot mount boot environment by icf file > ERROR: Cannot mount BE . > Unable to delete boot environment. > . . . > Big Mistake... For ZFS boot I need space for a seperate zfs root pool. So > whilst booted under beB I backup my pool00 data, destroy pool00, re-create > pool00 (a little differently, thus the error it would seem) but hold out one > of the drives and use it to create a rpool00 root pool. Then I > . . . I made this same mistake. If you "grep pool00 /etc/lu/ICF.1" you'll see filesystems beA expects to be mounted in beA; Some of those it may expect to be able to share between the current BE and beA. The way to fix things is to create a temporary pool "pool00"; This need not be on an actual disk, it could be hosted in a file or a slice, etc. Then create those datasets in the temporary pool, and try the "ludelete beA" again. Note that if the problem datasets are supposed to be shared between current BE and beA, you'll need them mounted on the original paths in the current BE, because "ludelete" will use loopback mounts to attach them into beA during the deletion process. I guess the moral of the story is that you should ludelete any old BE's before you alter the filesystems/datasets that it mounts. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
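In my case the fix looked something like this (the file-backed pool is just a throwaway to satisfy the ICF file; the dataset name comes from whatever "grep pool00 /etc/lu/ICF.1" shows is missing):
  mkfile 512m /var/tmp/pool00.img
  zpool create pool00 /var/tmp/pool00.img
  zfs create -p pool00/zones/global/home
  ludelete beA
  zpool destroy pool00
  rm /var/tmp/pool00.img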
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
[EMAIL PROTECTED] said: > but Marion's is not really possible at all, and won't be for a while with > other groups' choice of storage-consumer platform, so it'd have to be > GlusterFS or some other goofy fringe FUSEy thing or not-very-general crude > in-house hack. Well, of course the magnitude of fringe factor is in the eye of the beholder. I didn't intend to make pNFS seem like a done deal. I don't quite yet think of OpenSolaris as a "done deal" either, still using Solaris-10 here in production, but since this is an OpenSolaris mailing list I should be more careful. Anyway, from looking over the wiki/blog info, apparently the sticking point with pNFS may be client-side availability -- there's only Linux and (Open)Solaris NFSv4.1 clients just yet. Still, pNFS claims to be backwards compatible with NFS v3 clients: If you point a traditional NFS client at the pNFS metadata server, the MDS is supposed to relay the data from the backend data servers. [EMAIL PROTECTED] said: > It's a shame that Lustre isn't available on Solaris yet either. Actually, that may not be so terribly fringey, either. Lustre and Sun's Scalable Storage product can make use of Thumpers: http://www.sun.com/software/products/lustre/ http://www.sun.com/servers/cr/scalablestorage/ Apparently it's possible to have a Solaris/ZFS data-server for Lustre backend storage: http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU I see they do not yet have anything other than Linux clients, so that's a limitation. But you can share out a Lustre filesystem over NFS, potentially from multiple Lustre clients. Maybe via CIFS/samba as well. Lastly, I've considered the idea of using Shared-QFS to glue together multiple Thumper-hosted ISCSI LUN's. You could add shared-QFS clients (acting as NFS/CIFS servers) if the client load needed more than one. Then SAM-FS would be a possibility for backup/replication. Anyway, I do feel that none of this stuff is quite "there" yet. But my experience with ZFS on fiberchannel SAN storage, that sinking feeling I've had when a little connectivity glitch resulted in a ZFS panic, makes me wonder if non-redundant ZFS on an ISCSI SAN is "there" yet, either. So far none of our lost-connection incidents resulted in pool corruption, but we have only 4TB or so. Restoring that much from tape is feasible, but even if Gray's 150TB of data can be recreated, it would take weeks to reload it. If it's decided that the clustered-filesystem solutions aren't feasible yet, the suggestion I've seen that I liked the best was Richard's, with a bad-boy server SAS-connected to multiple J4500's. But since Gray's project already has the X4500's, I guess they'd have to find another use for them (:-). Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
[EMAIL PROTECTED] said: > In general, such tasks would be better served by T5220 (or the new T5440 :-) > and J4500s. This would change the data paths from: > client T5220 X4500 disks to > client T5440 disks > > With the J4500 you get the same storage density as the X4500, but with SAS > access (some would call this direct access). You will have much better > bandwidth and lower latency between the T5440 (server) and disks while still > having the ability to multi-head the disks. The There's an odd economic factor here, if you're in the .edu sector: The Sun Education Essentials promotional price list has the X4540 priced lower than a bare J4500 (not on the promotional list, but with a standard EDU discount). We have a project under development right now which might be served well by one of these EDU X4540's with a J4400 attached to it. The spec sheets for J4400 and J4500 say you can chain together enough of them to make a pool of 192 drives. I'm unsure about the bandwidth of these daisy-chained SAS interconnects, though. Any thoughts as to how high one might scale an X4540-plus-J4x00 solution? How does the X4540's internal disk bandwidth compare to that of the (non-RAID) SAS HBA? Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
[EMAIL PROTECTED] said: > It's interesting how the speed and optimisation of these maintenance > activities limit pool size. It's not just full scrubs. If the filesystem is > subject to corruption, you need a backup. If the filesystem takes two months > to back up / restore, then you need really solid incremental backup/restore > features, and the backup needs to be a cold spare, not just a > backup---restoring means switching the roles of the primary and backup > system, not actually moving data. I'll chime in here: I'm uncomfortable with such a huge ZFS pool, and also with the ZFS-over-iSCSI-on-ZFS approach. There just seem to be too many moving parts depending on each other, any one of which can make the entire pool unavailable. For the stated usage of the original poster, I think I would aim toward turning each of the Thumpers into an NFS server, configuring the head-node as a pNFS/NFSv4.1 metadata server, and letting all the clients speak parallel-NFS to the "cluster" of file servers. You'll end up with a huge logical pool, but a Thumper outage should result only in loss of access to the data on that particular system. The work of scrub/resilver/replication can be divided among the servers rather than all living on a single head node. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] x4500 vs AVS ?
[EMAIL PROTECTED] said: > We did ask our vendor, but we were just told that AVS does not support > x4500. You might have to use the open-source version of AVS, but it's not clear if that requires OpenSolaris or if it will run on Solaris-10. Here's a description of how to set it up between two X4500's: http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS noob question
[EMAIL PROTECTED] said: > I took a snapshot of a directory in which I hold PDF files related to math. > I then added a 50MB pdf file from a CD (Oxford Math Reference; I strongly > recommend this to any math enthusiast) and did "zfs list" to see the size of > the snapshot (sheer curiosity). I don't have compression turned on for this > filesystem. However, it seems that the 50MB PDF took up only 64K. How is > that possible? Is ZFS such a good filesystem, that it shrinks files to a > mere fraction of their size? If I understand correctly, you were expecting the snapshot to grow in size because you made a change to the current filesystem, right? Since the new file did not exist when the snapshot was taken, the snapshot knows nothing about the new file's data blocks. A snapshot only needs to hold onto blocks that existed at the time it was taken and have since changed, e.g. blocks in files that get modified or removed. When you add a new file, the only pre-existing blocks that change are the directory (metadata) blocks that end up holding the new file's entry. That could indeed be about 64K worth of changed blocks. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
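A quick way to see this behaviour for yourself (the dataset and file names here are invented for illustration):

   # zfs snapshot tank/math@before
   # cp /cdrom/reference.pdf /tank/math/
   # sync
   # zfs list -o name,used,refer tank/math tank/math@before
   # rm /tank/math/some-old-file.pdf
   # sync
   # zfs list -o name,used,refer tank/math tank/math@before

After the copy, the snapshot's USED stays tiny (just the changed directory blocks). After the delete, the removed file's blocks get charged to the snapshot's USED, because the snapshot is now the only thing keeping them alive.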
Re: [zfs-discuss] ZFS with Traditional SAN
[EMAIL PROTECTED] said: > That's the one that's been an issue for me and my customers - they get billed > back for GB allocated to their servers by the back end arrays. To be more > explicit about the 'self-healing properties' - To deal with any fs > corruption situation that would traditionally require an fsck on UFS (SAN > switch crash, multipathing issues, cables going flaky or getting pulled, > server crash that corrupts fs's) ZFS needs some disk redundancy in place so > it has parity and can recover. (raidz, zfs mirror, etc) Which means to use > ZFS a customer has to pay more to get the back end storage redundancy they > need to recover from anything that would cause an fsck on UFS. I'm not > saying it's a bad implementation or that the gains aren't worth it, just that > cost-wise, ZFS is more expensive in this particular bill-back model. If your back-end array implements RAID-0, you need not suffer the extra expense. Allocate one RAID-0 LUN per physical drive, then use ZFS to make raidz or mirrored pools as appropriate. To add to the other anecdotes on this thread: We have non-redundant ZFS pools on SAN storage, in production use for about a year, replacing some SAM-QFS filesystems which were formerly on the same arrays. We have had the "normal" ZFS panics occur in the presence of I/O errors (SAN zoning mistakes, cable issues, switch bugs), and had no ZFS corruption or data loss as a result. We run S10U4 and S10U5, both SPARC and x86. MPxIO works fine, once you have the OS and arrays configured properly. Note that I'd by far prefer to have ZFS-level redundancy, but our equipment doesn't support a useful RAID-0, and our customers want cheap storage. But we also charge them for tape backups. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
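A rough sketch of that one-LUN-per-drive layout (the c#t#d# names are placeholders for whatever single-drive RAID-0 LUNs the array actually presents):

   # zpool create tank \
       raidz c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 \
       raidz c6t5d0 c6t6d0 c6t7d0 c6t8d0 c6t9d0
   # zpool status tank

With parity held at the ZFS layer, the situations that would force an fsck on UFS can usually be self-healed in place instead of requiring a restore.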
Re: [zfs-discuss] SSD update
[EMAIL PROTECTED] said: > Seriously, I don't even care about the cost. Even with the smallest > capacity, four of those gives me 128GB of write cache supporting 680MB/s and > 40k IOPS. Show me a hardware raid controller that can even come close to > that. Four of those will strain even 10GB/s Infiniband. I had my sights set lower. Our Thumper has four hot-spare drives right now. I'd take one or two of those out and replace them with one or two 80GB SSD's, upgrade to S10U6 when available, and set them up as a separate log device. That would get rid of the horrible NFS latencies that come from the NFS-vs-ZIL interaction. It would only take a tiny SSD for an NFS ZIL, really. We have an old array with 1GB cache, and telling that to ignore cache-flush requests from ZFS made a huge difference in NFS latency. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
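A sketch of that swap, assuming c5t4d0 is the hot-spare slot that receives the SSD (device and pool names are invented; separate log devices need the newer pool version that arrives with S10U6, hence the upgrade first):

   # zpool remove tank c5t4d0
   (physically replace the spare drive with the SSD, then)
   # zpool add tank log c5t4d0
   # zpool status tank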
Re: [zfs-discuss] resilver in progress - which disk is inconsistent?
[EMAIL PROTECTED] said: > AFAIK there is no way to tell resilvering to pause, so I want to detach the > inconsistent disk and attach it again tonight, when it won't affect users. To > do that I need to know which disk is inconsistent, but zpool status does not > show me any info in regard. > > Is there any way to identify which disk is inconsistent? I know this is too late to help you now, but... Doesn't "zpool status -v" do what you want? Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS jammed while busy
[EMAIL PROTECTED] said: > I'm curious about your array configuration above... did you create your > RAIDZ2 as one vdev or multiple vdev's? If multiple, how many? On mine, I > have all 10 disks set up as one RAIDZ2 vdev which is supposed to be near the > performance limit... I'm wondering how much I would gain by splitting it into > two vdev's for the price of losing 1.5TB (2 disks) worth of storage. You've probably already seen/heard this, but I haven't seen it mentioned in this thread. The consensus is, and measurements seem to confirm, that splitting it into two vdev's will double your available IOPS for small, random read loads on raidz/raidz2. Here are some references and examples: http://blogs.sun.com/roch/entry/when_to_and_not_to http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance1 http://acc.ohsu.edu/~hakansom/thumper_bench.html Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
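For concreteness, a sketch of the two layouts being compared (750GB drives, invented device names):

   One 10-disk raidz2 vdev -- space of 8 disks, small random reads at roughly single-disk IOPS:
   # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                              c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0

   Two 5-disk raidz2 vdevs -- space of 6 disks, roughly twice the random-read IOPS:
   # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                       raidz2 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0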
Re: [zfs-discuss] zfs device busy
[EMAIL PROTECTED] said: > I am having trouble destroying a zfs file system (device busy) and fuser > isn't telling me who has the file open: > . . . > This situation appears to occur every night during a system test. The only > peculiar operation on the errant file system is that another system NFS > mounts it with vers=2 in a non-global zone, and then halts that zone. I > haven't been able to reproduce the problem outside the test. If you have a filesystem shared out (exported) on an NFS server, you'll get this kind of behavior. No client need have it mounted. You must first do an "unshare /files/custfs/cust12/2053699a" in your example before trying to unmount or destroy it. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
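The dataset name below is a guess based on that mountpoint; either form of the unshare should clear the busy state before the destroy:

   # unshare /files/custfs/cust12/2053699a
   (or: zfs unshare files/custfs/cust12/2053699a)
   # zfs destroy files/custfs/cust12/2053699a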
Re: [zfs-discuss] scrub performance
[EMAIL PROTECTED] said: > It is also interesting to note that this system is now making negative > progress. I can understand the remaining time estimate going up with time, > but what does it mean for the % complete number to go down after 6 hours of > work? Sorry I don't have any helpful experience in this area. It occurs to me that perhaps you are detecting a gravity wave of some sort -- Thumpers are pretty heavy, and thus may be more affected than the average server. Or the guys at SLAC have, unbeknownst to you, somehow accelerated your Thumper to near the speed of light. (:-) Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] nfs over zfs
[EMAIL PROTECTED] said: > i am a little new to zfs so please excuse my ignorance. i have a poweredge > 2950 running Nevada B82 with an Apple Xraid attached over a fiber hba. they > are formatted to JBOD with the pool configured as follows: > . . . > i have a filesystem (tpool4/seplog) shared over nfs. creating files locally > seems to be fine but writing files over nfs seem to be extremely slow on one > of the clients(os x) it reports over 3hours to copy a 500MB file. also > during the copy when i issue a zpool iostat -v 5 the response time increases > for the command. i have also noticed that none of the led's on the drives > flicker. If you haven't already, tell the Xraid to ignore cache-flush requests from the host: http://www.opensolaris.org/jive/thread.jspa?threadID=11641 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] filebench for Solaris 10?
[EMAIL PROTECTED] said: > This is what I get with the filebench-1.1.0_x86_pkg.tar.gz from SourceForge: > > # pkgadd -d . > pkgadd: ERROR: no packages were found in > > # ls > install/ pkginfo pkgmap reloc/ > . . . Um, "cd .." and "pkgadd -d ." again. The package is the actual directory that you unpacked. Note the instructions for unpacking confused me a bit as well. I had expected to "pkgadd -d . filebench", but pkgadd is smart enough to scan the entire "-d" directory for packages. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
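In other words (the package directory name here is a guess; it's whichever directory holds pkginfo, pkgmap, install/, reloc/):

   # cd ..
   # ls
   filebench/
   # pkgadd -d .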
Re: [zfs-discuss] 'du' is not accurate on zfs
[EMAIL PROTECTED] said: > It may not be relevant, but I've seen ZFS add weird delays to things too. I > deleted a file to free up space, but when I checked no more space was > reported. A second or two later the space appeared. Run the "sync" command before you do the "du". That flushes the ARC and/or ZIL out to disk, after which you'll get accurate results. I do the same when timing how long it takes to create a file -- time the file creation plus the sync to see how long it takes to get the data to nonvolatile storage. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
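For example (dataset and file names invented):

   # rm /tank/scratch/big.iso
   # sync
   # du -sh /tank/scratch

and, for the timing case mentioned above:

   # time sh -c 'cp big.iso /tank/scratch/big.iso; sync'

Without the sync, du may not yet reflect blocks that were just written or just freed.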
Re: [zfs-discuss] five megabytes per second with Microsoft iSCSI initiator (2.06)
[EMAIL PROTECTED] said: > I'm creating a zfs volume, and sharing it with "zfs set shareiscsi=on > poolname/volume". I can access the iSCSI volume without any problems, but IO > is terribly slow, as in five megabytes per second sustained transfers. > > I've tried creating an iSCSI target stored on a UFS filesystem, and get the > same slow IO. I've tried every level of RAID available in ZFS with the same > results. Apologies if you've already done so, but try testing your network (without iSCSI and storage). You can use "ttcp" from blastwave.org on the Solaris side, and PCATTCP on the Windows side. That should tell you if your TCP/IP stacks and network hardware are in good condition. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
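A minimal sanity check along those lines (the host name is a placeholder, and the exact options vary a little between ttcp builds, so check the usage text):

   on the Solaris box (receiver):     # ttcp -r -s
   on the Windows box (transmitter):  > PCATTCP -t -s solaris-host

If raw TCP throughput is also stuck near 5 MBytes/sec, the problem is in the network or TCP stacks rather than in iSCSI or ZFS.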
Re: [zfs-discuss] filebench for Solaris 10?
[EMAIL PROTECTED] said: > Some of us are still using Solaris 10 since it is the version of Solaris > released and supported by Sun. The 'filebench' software from SourceForge > does not seem to install or work on Solaris 10. The 'pkgadd' command > refuses to recognize the package, even when it is set to Solaris 2.4 mode. I've installed and run filebench (version 1.1.0) from the SourceForge packages on Solaris-10 here, both SPARC and x86_64, with no problems. Looks like I downloaded it 23-Jan-2008. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS write throttling
[EMAIL PROTECTED] said: > I also tried using O_DSYNC, which stops the pathological behaviour but makes > things pretty slow - I only get a maximum of about 20MBytes/sec, which is > obviously much less than the hardware can sustain. I may misunderstand this situation, but while you're waiting for the new code from Sun, you might try O_DSYNC and at the same time tell the 6140 to ignore cache-flush requests from the host. That should get you running at spindle-speed: http://blogs.digitar.com/jjww/?itemid=44 Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss