Re: [zfs-discuss] Yager on ZFS
On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote:

>> How so? In my opinion, it seems like a cure for the brain damage of RAID-5.
>
> Nope. A decent RAID-5 hardware implementation has no 'write hole' to worry about, and one can make a software implementation similarly robust with some effort (e.g., by using a transaction log to protect the data-plus-parity double-update, or by using COW mechanisms like ZFS's in a more intelligent manner).

Can you reference a software RAID implementation which implements a solution to the write hole and performs well? My understanding (and this is based on what I've been told by people more knowledgeable in this domain than I) is that software RAID has suffered from being unable to provide both correctness and acceptable performance.

> The part of RAID-Z that's brain-damaged is its concurrent-small-to-medium-sized-access performance (at least up to request sizes equal to the largest block size that ZFS supports, and arguably somewhat beyond that): while conventional RAID-5 can satisfy N+1 small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in parallel (though the latter also take an extra rev to complete), RAID-Z can satisfy only one small-to-medium access request at a time (well, plus a smidge for read accesses if it doesn't verify the parity) - effectively providing RAID-3-style performance.

'Brain damage' seems a bit of an alarmist label. While you're certainly right that for a given block we do need to access all disks in the given stripe, it seems like a rather quaint argument: aren't most environments that matter trying to avoid waiting for the disk at all? Intelligent prefetch and large caches -- I'd argue -- are far more important for performance these days.
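To put rough numbers on the concurrency claim, here is a back-of-envelope sketch; the spindle count and per-disk IOPS figure are illustrative assumptions, not measurements:

```shell
# Assumed: a 4+1 array (5 spindles), each good for ~100 random IOPS.
SPINDLES=5
IOPS_PER_DISK=100

# Conventional RAID-5: a small read touches a single disk, so all
# spindles can serve independent small reads in parallel.
RAID5_READ_IOPS=$((SPINDLES * IOPS_PER_DISK))

# RAID-Z: each block is spread across the whole stripe, so every
# small read occupies all spindles at once (RAID-3-style behaviour).
RAIDZ_READ_IOPS=$IOPS_PER_DISK

echo "raid5=$RAID5_READ_IOPS raidz=$RAIDZ_READ_IOPS"
```

With these assumed numbers the gap is a factor of five for concurrent small random reads; caching and prefetch can hide it, but only when the workload cooperates.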
> The easiest way to fix ZFS's deficiency in this area would probably be to map each group of N blocks in a file as a stripe with its own parity - which would have the added benefit of removing any need to handle parity groups at the disk level (this would, incidentally, not be a bad idea to use for mirroring as well, if my impression is correct that there's a remnant of LVM-style internal management there). While this wouldn't allow use of parity RAID for very small files, in most installations they really don't occupy much space compared to that used by large files, so this should not constitute a significant drawback.

I don't really think this would be feasible given how ZFS is stratified today, but go ahead and prove me wrong: here are the instructions for bringing over a copy of the source code: http://www.opensolaris.org/os/community/tools/scm

- ahl

--
Adam Leventhal, FishWorks  http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS mirror and sun STK 2540 FC array
Hi all, we have just bought a Sun X2200M2 (4 GB / 2 Opteron 2214 / 2 x 250 GB SATA2 disks, Solaris 10 update 4) and a Sun STK 2540 FC array (8 SAS disks of 146 GB, 1 RAID controller). The server is attached to the array with a single 4 Gb Fibre Channel link. I want to make a mirror using ZFS with this array. I have created 2 volumes on the array in RAID0 (stripe of 128 KB), presented to the host as lun0 and lun1. So, on the host:

bash-3.00# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
  0. c1d0 DEFAULT cyl 30397 alt 2 hd 255 sec 63
     /[EMAIL PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
  1. c2d0 DEFAULT cyl 30397 alt 2 hd 255 sec 63
     /[EMAIL PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
  2. c6t600A0B800038AFBC02F7472155C0d0 DEFAULT cyl 35505 alt 2 hd 255 sec 126
     /scsi_vhci/[EMAIL PROTECTED]
  3. c6t600A0B800038AFBC02F347215518d0 DEFAULT cyl 35505 alt 2 hd 255 sec 126
     /scsi_vhci/[EMAIL PROTECTED]
Specify disk (enter its number):

bash-3.00# zpool create tank mirror c6t600A0B800038AFBC02F347215518d0 c6t600A0B800038AFBC02F7472155C0d0
bash-3.00# df -h /tank
Filesystem   size  used  avail  capacity  Mounted on
tank         532G   24K   532G        1%  /tank

I have tested the performance with a simple dd command:

time dd if=/dev/zero of=/tank/testfile bs=1024k count=1
time dd if=/tank/testfile of=/dev/null bs=1024k count=1

and it gives:

# local throughput, stk2540 mirror zfs /tank
read 232 MB/s, write 175 MB/s

Just to test the max perf I did:

zpool destroy -f tank
zpool create -f pool c6t600A0B800038AFBC02F347215518d0

and the same basic dd gives me:

# single zfs /pool
read 320 MB/s, write 263 MB/s

Just to give an idea, the SVM mirror using the two local SATA2 disks gives: read 58 MB/s, write 52 MB/s.

So, in production the ZFS /tank mirror will be used to hold our home directories (10 users using 10 GB each), our project files (200 GB, mostly text files and a CVS database), and some vendor tools (100 GB).
People will access the data (/tank) using NFSv4 from their workstations (Sun Ultra 20 M2 with CentOS 4 update 5). On the Ultra 20 M2, the basic test via NFSv4 gives: read 104 MB/s, write 63 MB/s.

At this point, I have the following questions:

-- Does someone have similar figures for the STK 2540 using ZFS?

-- Instead of creating only 2 volumes in the array, what do you think about creating 8 volumes (one for each disk) and building 4 two-way mirrors:

zpool create tank mirror c6t6001.. c6t6002.. mirror c6t6003.. c6t6004.. {...} mirror c6t6007.. c6t6008..

-- I will add 4 disks to the array next summer. Do you think I should create 2 new luns in the array and do a:

zpool add tank mirror c6t6001..(lun3) c6t6001..(lun4)

or rebuild the 2 luns from scratch (6-disk raid0 each) and the pool tank (i.e.: back up /tank - zpool destroy - add disks - reconfigure array - zpool create tank ... - restore the backed-up data)?

-- I am thinking of doing a disk scrub once a month. Is that sufficient?

-- Have you got any comments on the performance from the NFSv4 client?

If you have any advice / suggestions, feel free to share. Thanks, Benjamin
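For what it's worth, the two layouts being asked about would look roughly like this; the LUN names follow the shorthand used above and are hypothetical:

```shell
# Eight single-disk luns built into 4 two-way mirrors:
zpool create tank \
  mirror c6t6001d0 c6t6002d0 \
  mirror c6t6003d0 c6t6004d0 \
  mirror c6t6005d0 c6t6006d0 \
  mirror c6t6007d0 c6t6008d0

# With that layout, next summer's 4 extra disks can be added as two
# further mirror pairs without rebuilding the pool:
zpool add tank mirror c6t6009d0 c6t6010d0 mirror c6t6011d0 c6t6012d0

# A monthly scrub can be driven from root's crontab, e.g. at 02:00
# on the 1st of each month:
# 0 2 1 * * /usr/sbin/zpool scrub tank
```

The per-disk-lun layout keeps expansion incremental (zpool add), whereas the two-big-luns layout forces the backup/destroy/recreate/restore cycle described above.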
Re: [zfs-discuss] zfs on a raid box
Hi Dan,

Dan Pritts wrote:
> On Tue, Nov 13, 2007 at 12:25:24PM +0100, Paul Boven wrote:
>> We've been building a storage system that should have about 2TB of storage and good sequential write speed. The server side is a Sun X4200 running Solaris 10u4 (plus yesterday's recommended patch cluster); the array we bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and it's connected to the Sun through U320 SCSI.
>
> We are doing basically the same thing with similar Western Scientific (wsm.com) raids, based on Infortrend controllers. ZFS notices when we pull a disk and goes on and does the right thing. I wonder if you've got a scsi card/driver problem. We tried using an Adaptec card with Solaris with poor results; switched to LSI, it just works.

Thanks for your reply. The SCSI card in the X4200 is a Sun Single Channel U320 card that came with the system, but the PCB artwork does sport a nice 'LSI LOGIC' imprint.

So, just to make sure we're talking about the same thing here - your drives are SATA, you're exporting each drive through the Western Scientific raidbox as a separate volume, and ZFS actually brings in a hot spare when you pull a drive? Over here, I've still not been able to accomplish that - even after installing Nevada b76 on the machine, removing a disk will not cause a hot spare to become active, nor does resilvering start. Our Transtec raidbox seems to be based on a chipset by Promise, by the way.

Regards, Paul Boven.
--
Paul Boven [EMAIL PROTECTED] +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
Re: [zfs-discuss] zfs on a raid box
On Fri, Nov 16, 2007 at 11:31:00AM +0100, Paul Boven wrote:
> Thanks for your reply. The SCSI-card in the X4200 is a Sun Single Channel U320 card that came with the system, but the PCB artwork does sport a nice 'LSI LOGIC' imprint.

That is probably the same card I'm using; it's actually a Sun card but, as you say, is OEMed by LSI.

> So, just to make sure we're talking about the same thing here - your drives are SATA,

yes

> you're exporting each drive through the Western Scientific raidbox as a separate volume,

yes

> and zfs actually brings in a hot spare when you pull a drive?

yes. OS is Sol10U4, system is an X4200, original hardware rev.

> Over here, I've still not been able to accomplish that - even after installing Nevada b76 on the machine, removing a disk will not cause a hot-spare to become active, nor does resilvering start. Our Transtec raidbox seems to be based on a chipset by Promise, by the way.

I have heard some bad things about the Promise RAID boxes but I haven't had any direct experience. I do own one Promise box that accepts 4 PATA drives and exports them to a host as SCSI disks. Shockingly, it uses a master/slave IDE configuration rather than 4 separate IDE controllers. It wasn't super expensive but it wasn't dirt cheap, either, and it seems it would have cost another $5 to manufacture the right way. I've had fine luck with Promise $25 ATA PCI cards :)

The Infortrend units, on the other hand, I have had generally quite good luck with. When I worked at UUNet in the late '90s we had hundreds of their SCSI RAIDs deployed. I do have an Infortrend FC-attached raid with SATA disks, which basically works fine. It has an external JBOD, also with SATA disks, connecting to the main raid with FC. Unfortunately, the RAID unit boots faster than the JBOD. So, if you turn them on at the same time, it thinks the JBOD is gone and doesn't notice it's there until you reboot the controller. That caused a little pucker for my colleagues when it happened while I was on vacation.
The support guy at the reseller we were working with (NOT Western Scientific) told them the raid was hosed and they should rebuild from scratch; hope you had a backup.

danno
--
Dan Pritts, System Administrator
Internet2
office: +1-734-352-4953 | mobile: +1-734-834-7224
Re: [zfs-discuss] X4500 device disconnect problem persists
We are having the same problem: first with 125025-05 and then also with 125205-07 (Solaris 10 update 4, now with all patches). We opened a case and got T-patch 127871-02, and we installed the Marvell driver binary 3 days ago:

T127871-02/SUNWckr/reloc/kernel/misc/sata
T127871-02/SUNWmv88sx/reloc/kernel/drv/marvell88sx
T127871-02/SUNWmv88sx/reloc/kernel/drv/amd64/marvell88sx
T127871-02/SUNWsi3124/reloc/kernel/drv/si3124
T127871-02/SUNWsi3124/reloc/kernel/drv/amd64/si3124

It seems that this resolves the device reset problem and the nfsd crash on the x4500 with one raidz2 pool and a lot of ZFS filesystems.
Re: [zfs-discuss] ZFS + DB + fragments
> ... I personally believe that since most people will have hardware LUNs (with underlying RAID) and cache, it will be difficult to notice anything. Given that those hardware LUNs might be busy with their own wizardry ;) You will also have to minimize the effect of the database cache ...

By definition, once you've got the entire database in cache, none of this matters (though filling up the cache itself takes some added time if the table is fragmented). Most real-world databases don't manage to fit entirely or even mostly in cache, because people aren't willing to dedicate that much RAM to running them. Instead, they either use a lot less RAM than the database size or share the system with other activity that shares use of the RAM. In other words, they use a cost-effective rather than a money-is-no-object configuration, but then would still like to get the best performance they can from it.

> It will be a tough assignment ... maybe someone has already done this? Thinking about this (very abstract) ... does it really matter? [8KB-a][8KB-b][8KB-c] So what if 8KB-b gets updated and moved somewhere else? If the DB gets a request to read 8KB-a, it needs to do an I/O (eliminate all caching). If it gets a request to read 8KB-b, it needs to do an I/O. Does it matter that b is somewhere else ...

Yes, with any competently-designed database.

> it still needs to go get it ... only in a very abstract world with read-ahead (both hardware or db) would 8KB-b be in cache after 8KB-a was read.

1. If there's no other activity on the disk, then the disk's track cache will acquire the following data when the first block is read, because it has nothing better to do. And if all the disks are just sitting around waiting for this table scan to get to them, then if ZFS has a sufficiently intelligent read-ahead mechanism it could help out a lot here as well: the differences become greater when the system is busier.

2. Even a moderately smart disk will detect a sequential access pattern if one exists and may read ahead at least modestly after having detected that pattern, even if it *does* have other requests pending.

3. But in any event any competent database will explicitly issue prefetches when it knows (and it *does* know) that it is scanning a table sequentially - and will also have taken pains to try to ensure that the table data is laid out such that it can be scanned efficiently. If it's using disks that support tagged command queuing it may just issue a bunch of single-database-block requests at once, and the disk will organize them such that they can all be satisfied by a single streaming access; with disks that don't support queuing, the database can elect to issue a single large I/O request covering many database blocks, accomplishing the same thing as long as the table is in fact laid out contiguously on the medium (the database knows this if it's handling the layout directly, but when it's using a file system as an intermediary it usually can only hope that the file system has minimized file fragmentation).

> Hmmm... the only way is to get some data :) *hehe*

Data is good, as long as you successfully analyze what it actually means: it either tends to confirm one's understanding or to refine it.

- bill
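To make point 3 concrete, here is a tiny throwaway illustration (plain files under /tmp, nothing database-specific) of how one large I/O request covers exactly the same data as many single-database-block requests:

```shell
# Create a 1 MiB "table" file:
dd if=/dev/zero of=/tmp/table.dat bs=1024k count=1 2>/dev/null

# Read it as 128 single-database-block (8 KiB) requests:
dd if=/tmp/table.dat of=/tmp/copy_small.dat bs=8k count=128 2>/dev/null

# Read it as one large request spanning the same range:
dd if=/tmp/table.dat of=/tmp/copy_large.dat bs=1024k count=1 2>/dev/null

# Both reads recover identical data; the large request just lets the
# device satisfy it as a single streaming access:
cmp /tmp/copy_small.dat /tmp/copy_large.dat && echo "same data"
```

On a contiguously laid-out table, the single large request (or a queued batch of small ones) is what turns 128 potential seeks into one streaming pass.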
Re: [zfs-discuss] zpool question
On Thu, 15 Nov 2007, Brian Lionberger wrote:
> The question is, should I create one zpool or two to hold /export/home and /export/backup? Currently I have one pool for /export/home and one pool for /export/backup. Should it be one pool for both? Would this be better, and why?

One thing to consider is that pools are the granularity of 'export' operations, so if you ever want to, for example, move the /export/backup disks to a new computer but keep /export/home on the current computer, you couldn't do that easily if you create a pair of striped 2-way mirrors.

Regards, markm
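To make the granularity point concrete, a sketch with two separate pools (pool names assumed for illustration):

```shell
# With distinct pools, the backup disks can move on their own:
zpool export backup      # on the old host, before pulling the disks
# ...physically move the two backup disks to the new machine...
zpool import backup      # on the new host

# With a single pool striped across all four disks there is no
# per-filesystem export: the whole pool, and every disk in it,
# has to move together.
```

So the one-pool layout trades this flexibility for the ability to share free space between home and backup.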
[zfs-discuss] Need a 2-port PCI-X SATA-II controller for x86
I'll be setting up a small server and need two SATA-II ports for an x86 box. The cheaper the better. Thanks!!

-brian
--
Perl can be fast and elegant as much as J2EE can be fast and elegant. In the hands of a skilled artisan, it can and does happen; it's just that most of the shit out there is built by people who'd be better suited to making sure that my burger is cooked thoroughly. -- Jonathan Patschke
[zfs-discuss] How to destory a faulted pool
How can I destroy the following pool?

  pool: mstor0
    id: 5853485601755236913
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        mstor0      UNAVAIL  insufficient replicas
          raidz1    UNAVAIL  insufficient replicas
            c5t0d0  FAULTED  corrupted data
            c4t0d0  FAULTED  corrupted data
            c1t0d0  ONLINE
            c0t0d0  ONLINE

  pool: zpool1
    id: 14693037944182338678
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
config:

        zpool1      UNAVAIL  insufficient replicas
          raidz1    UNAVAIL  insufficient replicas
            c0t1d0  UNAVAIL  cannot open
            c1t1d0  UNAVAIL  cannot open
            c4t1d0  UNAVAIL  cannot open
            c6t1d0  UNAVAIL  cannot open
            c7t1d0  UNAVAIL  cannot open
          raidz1    UNAVAIL  insufficient replicas
            c0t2d0  UNAVAIL  cannot open
            c1t2d0  UNAVAIL  cannot open
            c4t2d0  UNAVAIL  cannot open
            c6t2d0  UNAVAIL  cannot open
            c7t2d0  UNAVAIL  cannot open
Re: [zfs-discuss] Yager on ZFS
> Brain damage seems a bit of an alarmist label. While you're certainly right that for a given block we do need to access all disks in the given stripe, it seems like a rather quaint argument: aren't most environments that matter trying to avoid waiting for the disk at all? Intelligent prefetch and large caches -- I'd argue -- are far more important for performance these days.

The concurrent small-I/O problem is fundamental, though. If you have an application where you care only about random concurrent reads, for example, you would not want to use raidz/raidz2 currently. No amount of smartness in the application gets around this. It *is* a relevant shortcoming of raidz/raidz2 compared to raid5/raid6, even if in many cases it is not significant.

If disk space is not an issue, striping across mirrors will be okay for random seeks. But if you also care about disk space, it's a show stopper unless you can throw money at the problem.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
[zfs-discuss] zpool question
I have a zpool issue that I need to discuss. My application is going to run on a 3120 with 4 disks. Two (mirrored) disks will represent /export/home and the other two (mirrored) will be /export/backup.

The question is, should I create one zpool or two to hold /export/home and /export/backup? Currently I have one pool for /export/home and one pool for /export/backup. Should it be one pool for both? Would this be better, and why?

Thanks for any help and advice. Brian.
[zfs-discuss] ZFS for consumers WAS:Yager on ZFS
Splitting this thread and changing the subject to reflect that...

On 11/14/07, can you guess? [EMAIL PROTECTED] wrote:
> Another prominent debate in this thread revolves around the question of just how significant ZFS's unusual strengths are for *consumer* use. WAFL clearly plays no part in that debate, because it's available only on closed, server systems.

I am both a large-systems administrator and a 'home user' (I prefer that term to 'consumer'). I am also very slow to adopt new technologies in either environment. We have started using ZFS at work due to performance improvements (for our workload) over UFS (or any other FS we tested). At home, the biggest reason I went with ZFS for my data is ease of management.

I split my data up based on what it is ... media (photos, movies, etc.), vendor stuff (software, datasheets, etc.), home directories, and other misc. data. This gives me a good way to control backups based on the data type. I know, this is all more sophisticated than the typical home user. The biggest win for me is that I don't have to partition my storage in advance: I build one zpool and multiple datasets. I don't set quotas or reservations (although I could).

So I suppose my argument for ZFS in home use is not data integrity, but much simpler management, both short and long term.

--
Paul Kraus
Albacon 2008 Facilities
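The one-pool/many-datasets layout described above might be built along these lines; pool, device, and dataset names here are illustrative only:

```shell
# One pool on a mirrored pair, then one dataset per data type:
zpool create tank mirror c1t0d0 c1t1d0
zfs create tank/media
zfs create tank/vendor
zfs create tank/home
zfs create tank/misc

# Per-dataset properties and snapshots are what make backup policy
# differ by data type, e.g.:
zfs set compression=on tank/vendor
zfs snapshot tank/home@nightly
```

All datasets draw from the pool's shared free space, which is exactly the "no advance partitioning" win described above.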
Re: [zfs-discuss] Yager on ZFS
can you guess? billtodd at metrocast.net writes:

>> You really ought to read a post before responding to it: the CERN study did encounter bad RAM (and my post mentioned that) - but ZFS usually can't do a damn thing about bad RAM, because errors tend to arise either before ZFS ever gets the data or after it has already returned and checked it (and in both cases, ZFS will think that everything's just fine).
>
> According to the memtest86 author, corruption most often occurs at the moment memory cells are written to, by causing bitflips in adjacent cells. So when a disk DMAs data to RAM, and corruption occurs as the DMA operation writes to the memory cells, and ZFS then verifies the checksum, it will detect the corruption. Therefore ZFS is perfectly capable (and even likely) to detect memory corruption during simple read operations from a ZFS pool. Of course there are other cases where neither ZFS nor any other checksumming filesystem is capable of detecting anything (e.g. the sequence of events: data is corrupted, checksummed, written to disk).

Indeed - the latter was the first of the two scenarios that I sketched out. But at least on the read end of things ZFS should have a good chance of catching errors due to marginal RAM. That must mean that most of the worrisome alpha-particle problems of yore have finally been put to rest (since they'd be similarly likely to trash data on the read side after ZFS had verified it). I think I remember reading that somewhere at some point, but I'd never gotten around to reading that far in the admirably-detailed documentation that accompanies memtest: thanks for enlightening me.

- bill
Re: [zfs-discuss] read/write NFS block size and ZFS
msl wrote:
> Hello all... I'm migrating an NFS server from Linux to Solaris, and all clients (Linux) are using read/write block sizes of 8192. That gave the best performance I could get, and it's working pretty well (NFSv3). I want to use all of ZFS's advantages, and I know I can have a performance loss, so I want to know if there is a recommendation for the block size on NFS/ZFS, or what you think about it. Must I test, or is there no need to make such configurations with ZFS? Thanks very much for your time! Leal.

That is the network block transfer size; the default is normally 32 kBytes. I don't see any reason to change ZFS's block size to match. You should follow the best practices as described at http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

If you notice a performance issue with metadata updates, be sure to check out http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

-- richard
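If the client-side transfer size does need tuning, note that it is a mount option on the NFS client rather than a ZFS setting; a sketch for a Linux client (server and mount-point names hypothetical):

```shell
# Mount with explicit 32 KiB read/write transfer sizes:
mount -o vers=3,rsize=32768,wsize=32768 server:/tank /mnt/tank

# Check what the client actually negotiated with the server:
grep /mnt/tank /proc/mounts
```

The server may clamp rsize/wsize to its own maximum, so checking the negotiated values is worthwhile before benchmarking.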
[zfs-discuss] ZFS snapshot send/receive via intermediate device
Hey folks, I have no knowledge at all about how streams work in Solaris, so this might have a simple answer, or be completely impossible. Unfortunately I'm a Windows admin so haven't a clue which :)

We're looking at rolling out a couple of ZFS servers on our network, and instead of tapes we're considering using off-site NAS boxes for backups. We think there's likely to be too much data each day to send the incremental snapshots to the remote systems over the wire, so we're wondering if we can use removable disks instead to transport just the incremental changes. The idea is that we can do the initial zfs send on-site with the NAS plugged into the network, and from then on we just need a 500GB removable disk to take the changes off site each night.

Let me be clear on that: we're not thinking of storing the whole zfs pool on the removable disk; there's just too much data. Instead, we want to use zfs send -i to store just the incremental changes on a removable disk, so we can then take that disk home, plug it into another device, and use zfs receive to upload the changes. Does anybody know if that's possible?

If it works, it's a nice and simple off-site backup, with the added benefit that we have a very rapid disaster recovery response. No need to waste time restoring from tape: the off-site backup can be brought onto the network and data is accessible immediately.
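Assuming date-named snapshots, the scheme described above would look roughly like this; pool, host, and path names are hypothetical:

```shell
# On-site, initial full copy while the NAS is on the LAN:
zfs snapshot tank/data@monday
zfs send tank/data@monday | ssh nas zfs receive backup/data

# Each night, write only the incremental stream to the removable disk:
zfs snapshot tank/data@tuesday
zfs send -i tank/data@monday tank/data@tuesday > /removable/data-tuesday.zs

# Off-site, apply the increment from the removable disk to the NAS:
zfs receive backup/data < /removable/data-tuesday.zs
```

A send stream is just a byte stream, so redirecting it to a file on a removable disk and replaying it later with zfs receive is exactly the kind of transport being asked about; the receiving side must already hold the snapshot the increment is based on.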
Re: [zfs-discuss] cannot mount 'mypool': Input/output error
On Nov 15, 2007 9:42 AM, Nabeel Saad [EMAIL PROTECTED] wrote:
> I am sure I will not use ZFS to its fullest potential at all.. right now I'm trying to recover the dead disk, so if it works to mount a single disk/boot disk, that's all I need, I don't need it to be very functional. As I suggested, I will only be using this to change permissions and then return the disk into the appropriate server once I am able to log back into that server.

(Sorry, forgot to CC the list.)

OK, so assuming that all you want to do is mount your old Solaris disk and change some permissions, then there is probably an easier solution, which is to put the hard drive back in the original machine and boot from a(n Open)Solaris CD or DVD. This eliminates the whole Linux/FUSE issue you're getting into.

Your easiest option might be to try the new OpenSolaris Developer Preview distribution, since it's actually a Live CD which would give you a full GUI and networking to play with: http://www.opensolaris.org/os/downloads/

Once the Live CD boots, you should be able to mount your drive at an alternate path like /a and then change permissions. If you boot from a regular Solaris CD or DVD it will start the install process, but then you should be able to simply cancel the install, get to a command line, and work from there.

Good luck!

Regards, -Eric
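A sketch of what that might look like once the Live CD is up; the device name and the path being fixed are hypothetical, and this assumes the slice holds a UFS root:

```shell
# Identify the disk, then mount its root slice under /a:
format </dev/null              # lists available disks, then exits
mkdir -p /a
mount -F ufs /dev/dsk/c0t0d0s0 /a

# Fix the permissions that are locking you out, then clean up:
chmod -R u+rwX /a/export/home/username   # hypothetical path
umount /a
```

After that the disk can go back into the original server and you should be able to log in again.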
[zfs-discuss] zpool io to 6140 is really slow
I have the following layout: a 490 with 8 x 1.8 GHz CPUs and 16 GB of memory, plus 6 x 6140s, each with 2 FC controllers, using the A1 and B1 controller ports at 4 Gbps. Each controller has 2 GB of NVRAM.

On the 6140s I set up one RAID0 lun per SAS disk with a 16K segment size. On the 490 I created a zpool with 8 4+1 raidz1s. I am getting zpool I/O of only 125 MB/s, with zfs:zfs_nocacheflush = 1 in /etc/system.

Is there a way I can improve the performance? I would like to get 1 GB/s I/O. Currently each lun is set up with primary A1 and secondary B1, or vice versa. I also have write cache enabled according to CAM.

--
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
Re: [zfs-discuss] How to destory a faulted pool
Manoj,

# zpool destroy -f mstor0

Regards, Marco Lopes.

Manoj Nayak wrote:
> How can I destroy the following pool?
>
> [full 'zpool import' output for mstor0 and zpool1, quoted in the original message above, elided]

--
Marco S. Lopes
Senior Technical Specialist
US Systems Practice Professional Services Delivery
Sun Microsystems
925 984 6611
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
I have historically noticed that in ZFS, whenever there is a heavy writer to a pool via NFS, the reads can be held back (basically paused). An example is a RAID10 pool of 6 disks, whereby a directory of files, including some large ones 100+ MB in size, being written can cause other clients over NFS to pause for seconds (5-30 or so). This is on B70 bits. I've gotten used to this behavior over NFS, but didn't see it perform as such when doing similar actions on the server itself.

To improve upon the situation, I thought perhaps I could dedicate a log device outside the pool, in the hope that while heavy writes went to the log device, reads would merrily be allowed to coexist from the pool itself. My test case isn't ideal per se, but I added a local 9GB SCSI (80) drive for a log, and added two LUNs for the pool itself. You'll see from the below that while the log device is pegged at 15 MB/s (sd5), my directory-list request on devices sd15 and sd16 is never answered. I tried this with both no-cache-flush enabled and off, with negligible difference.

Is there any way to force a better balance of reads/writes during heavy writes?
                         extended device statistics
device    r/s     w/s    kr/s     kw/s  wait  actv  svc_t  %w   %b
fd0       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd0       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd1       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd2       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd3       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd4       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd5       0.0   118.0     0.0  15099.9   0.0  35.0  296.7   0  100
sd6       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd7       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd8       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd9       0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd10      0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd11      0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd12      0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd13      0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd14      0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd15      0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0
sd16      0.0     0.0     0.0      0.0   0.0   0.0    0.0   0    0

The next two samples were identical except for the sd5 row:

sd5       0.0   117.0     0.0  14970.1   0.0  35.0  299.2   0  100
sd5       0.0   118.1     0.0  15111.9   0.0  35.0  296.4   0  100

[a fourth sample was truncated]
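For readers wanting to reproduce the setup described above, the slog and extra LUNs would have been attached with commands along these lines (the pool and cXtYdZ device names here are placeholders, not taken from the original post):

```shell
# Attach a dedicated intent-log (slog) device to an existing pool.
# "tank" and the device names are hypothetical; substitute your own.
zpool add tank log c1t5d0

# Add the two data LUNs as a mirror pair.
zpool add tank mirror c2t0d0 c2t1d0

# Watch per-vdev traffic, including the log device, at 5-second intervals.
zpool iostat -v tank 5
```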
[zfs-discuss] pls discontinue troll bait was: Yager on ZFS and ZFS + DB + fragments
I've been observing two threads on zfs-discuss with the following Subject lines: Yager on ZFS ZFS + DB + fragments and have reached the rather obvious conclusion that the author "can you guess?" is a professional spinmeister, who gave up a promising career in political speech writing to hassle the technical list membership on zfs-discuss. To illustrate my viewpoint, I offer the following excerpts (reformatted from an obvious WinDoze Luser Mail client): Excerpt 1: Is this premium technical BullShit (BS) or what? - BS 301 'grad level technical BS' --- Still, it does drive up snapshot overhead, and if you start trying to use snapshots to simulate 'continuous data protection' rather than more sparingly the problem becomes more significant (because each snapshot will catch any background defragmentation activity at a different point, such that common parent blocks may appear in more than one snapshot even if no child data has actually been updated). Once you introduce CDP into the process (and it's tempting to, since the file system is in a better position to handle it efficiently than some add-on product), rethinking how one approaches snapshots (and COW in general) starts to make more sense. - end of BS 301 'grad level technical BS' --- Comment: Amazing: so many words, so little meaningful technical content! Excerpt 2: Even better than Excerpt 1 - truly exceptional BullShit: - BS 401 'PhD level technical BS' -- No, but I described how to use a transaction log to do so and later on in the post how ZFS could implement a different solution more consistent with its current behavior.
In the case of the transaction log, the key is to use the log not only to protect the RAID update but to protect the associated higher-level file operation as well, such that a single log force satisfies both (otherwise, logging the RAID update separately would indeed slow things down - unless you had NVRAM to use for it, in which case you've effectively just reimplemented a low-end RAID controller - which is probably why no one has implemented that kind of solution in a stand-alone software RAID product). ... - end of BS 401 'PhD level technical BS' -- Go ahead and look up the full context of these exceptional BS excerpts and see if the full context brings any further enlightenment. I think you'll quickly realize that, after reading the full context, this is nothing more than a complete waste of time and that there is nothing of technical value to be learned from this text. In fact, there is very, very little to be learned from any posts on this list where the Subject line is either: Yager on ZFS ZFS + DB + fragments and the author is: can you guess? [EMAIL PROTECTED] I'm not, for a moment, suggesting that one can't learn *something* from the posts of the author can you guess? [EMAIL PROTECTED]... indeed there are significant spinmeistering skills to be learned from these posts, including how to combine portions of cited published technical studies (Google study, CERN study) with a line of total semi-technical bullshit worthy of any political spinmeister working within the DC Beltway Bandit area. In fact, if I'm trying to conn^H^H^H^H talk someone out of several million dollars to fund a totally BS research project, I'll pay any reasonable fees that "can you guess?" would demand. Because I'm convinced that, with his premium spinmeistering/BS skills, nothing is impossible: pigs can fly, NetApp == ZFS, the world is flat, and ZFS is a totally deficient technical design because they didn't solicit his totally invaluable technical input. And..
one note of caution for Jeff Bonwick and Team ZFS - look out ... for this guy - because his new ZFS-competitor filesystem, called, appropriately, GOMFS (Guess-O-Matic-File-System), is about to be released and it'll basically, if I understand can you guess?'s email fully, solve all the current ZFS design deficiencies and totally dominate all *nix-based filesystems for the next 400 years. Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ Graduate from sugar-coating school? Sorry - I never attended! :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices, but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you, 15MB/s seems a bit on the slow side - especially if cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem may be that everything is backed up waiting to start a transaction, because the txg train is slow due to NFS requiring the ZIL to push everything synchronously. Neil. Joe Little wrote: I have historically noticed that in ZFS, whenever there is a heavy writer to a pool via NFS, the reads can be held back (basically paused). ...
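Neil's suggestion to see what all the threads are waiting on can be followed on Solaris with mdb; a sketch (run as root; the exact dcmd output and the ZFS function names mentioned in the comment vary by build):

```shell
# Group kernel threads by stack trace, filtered to the zfs module.
# Many threads parked in txg- or ZIL-wait functions would support the
# theory that writers are backed up behind the slow txg/ZIL train.
echo "::stacks -m zfs" | mdb -k
```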
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices, but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you, 15MB/s seems a bit on the slow side - especially if cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem may be that everything is backed up waiting to start a transaction, because the txg train is slow due to NFS requiring the ZIL to push everything synchronously. I agree completely. The log (even though slow) was an attempt to isolate writes away from the pool. I guess the question is how to provide for async access for NFS. We may have 16, 32 or however many threads, but if a single writer keeps the ZIL pegged and prohibits reads, it's all for naught. Is there any way to tune/configure the ZFS/NFS combination to balance reads and writes so as not to starve one for the other? It's either feast or famine, or so tests have shown. Neil. Joe Little wrote: ...
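As a sanity check on the sd5 numbers in the iostat samples above, Little's Law (average latency = outstanding requests / throughput) reproduces the reported per-request service time; a quick back-of-the-envelope sketch:

```shell
# Little's Law on the sd5 row: actv = 35.0 requests queued, w/s = 118.0.
# 35.0 / 118.0 seconds per request, converted to milliseconds.
awk 'BEGIN { printf "%.1f\n", 35.0 / 118.0 * 1000 }'   # prints 296.6
```

That lines up with the svc_t of 296.7 ms iostat reported, i.e. the ~300 ms latency is simply the consequence of 35 requests queued against a single log device.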
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 9:17 PM, Joe Little [EMAIL PROTECTED] wrote: On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe, I don't think adding a slog helped in this case. ... Roch wrote this before (thus my interest in the log or NVRAM-like solution): There are 2 independent things at play here. a) NFS sync semantics conspire against single-thread performance with any backend filesystem. However, NVRAM normally offers some relief of the issue. b) ZFS sync semantics, along with the storage software and the imprecise protocol in between, conspire against ZFS performance of some workloads on NVRAM-backed storage, NFS being one of the affected workloads. The conjunction of the 2 causes worse than expected NFS performance over a ZFS backend running __on NVRAM backed storage__. If you are not considering NVRAM storage, then I know of no ZFS/NFS-specific problems. Issue b) is being dealt with by both Solaris and storage vendors (we need a refined protocol); issue a) is not related to ZFS and is rather a fundamental NFS issue. Maybe a future NFS protocol will help. Net net: if one finds a way to 'disable cache flushing' on the storage side, then one reaches the state we'll be in, out of the box, when b) is implemented by Solaris _and_ the storage vendor. At that point, ZFS becomes a fine NFS server not only on JBOD, as it is today, but also on NVRAM-backed storage. It's complex enough, I thought it was worth repeating. I agree completely.
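The "disable cache flushing" knob Roch alludes to is, on Nevada-era builds, the zfs_nocacheflush tunable (the exact name and availability depend on the build, so treat this fragment as illustrative, not authoritative):

```
* /etc/system fragment: stop ZFS from issuing SYNCHRONIZE CACHE commands.
* Only safe when the array's write cache is nonvolatile (battery-backed);
* on plain disks this risks data loss on power failure. Reboot to apply.
set zfs:zfs_nocacheflush = 1
```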