Re: [zfs-discuss] ZFS and powerpath
Brian Wilson wrote:

On Jul 16, 2007, at 6:06 PM, Torrey McMahon wrote:

Darren Dunham wrote:
My previous experience with PowerPath was that it rode below the Solaris device layer, so you couldn't cause a trespass by using the wrong device. It would just go to PowerPath, which would choose the link to use on its own. Is this not true, or has it changed over time?

I haven't looked at PowerPath for some time, but it used to be the opposite: the PowerPath node sat on top of the actual device paths. One of the selling points of MPxIO is that it doesn't have that problem (at least for devices it supports). Most multipath software had that same limitation.

I agree, it's not true. I don't know how long it hasn't been true, but for the last year and a half I've been implementing PowerPath on Solaris 8, 9, and 10, and the way to make it work is to point whatever disk tool you're using at the emcpower device. The other paths are there because Leadville finds them and creates them (if you're using Leadville), but PowerPath isn't doing anything to make them redundant; it gives you the emcpower device, and the emcp, etc. drivers to front-end them, to give you a multipathed device (the emcpower device). It DOES choose which path to use, for all I/O going through the emcpower device. In a situation where you lose paths while I/O is moving, you'll see SCSI errors down one path, then the next, then the next, as PowerPath gets fed the SCSI error and tries the next device path. If you use those actual device paths, you're not getting a device that PowerPath is multipathing for you (i.e. it does not dig in beneath the SCSI driver).

I'm afraid I have to disagree with you: I'm using the /dev/dsk/c2t$WWNdXs2 devices quite happily, with PowerPath handling failover for my CLARiiON.

# powermt version
EMC powermt for PowerPath (c) Version 4.4.0 (build 274)
# powermt display dev=58
Pseudo name=emcpower58a
CLARiiON ID=APM00051704678 [uscicsap1]
Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1: /oracle/Q02/saparch]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0
Owner: default=SP A, current=SP A
==============================================================================
---------------- Host ----------------  - Stor -  -- I/O Path -  -- Stats ---
###  HW Path                I/O Paths                 Interf.  Mode    State  Q-IOs Errors
==============================================================================
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016130202E48d58s0  SP A1  active  alive  0  0
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016930202E48d58s0  SP B1  active  alive  0  0

# fsck /dev/dsk/c2t5006016130202E48d58s0
** /dev/dsk/c2t5006016130202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n
144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0% fragmentation)

# fsck /dev/dsk/c2t5006016930202E48d58s0
** /dev/dsk/c2t5006016930202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n
144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0% fragmentation)

So at this point I can look down either path and get to my data. Now I kill one of the two paths via SAN zoning, run cfgadm -c configure c2, and powermt check reports that the path to SP A is now dead. I'm still able to fsck the dead path:

# cfgadm -c configure c2
# powermt check
Warning: CLARiiON device path c2t5006016130202E48d58s0 is currently dead.
Do you want to remove it (y/n/a/q)? n
# powermt display dev=58
Pseudo name=emcpower58a
CLARiiON ID=APM00051704678 [uscicsap1]
Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1: /oracle/Q02/saparch]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0
Owner: default=SP A, current=SP B
==============================================================================
---------------- Host ----------------  - Stor -  -- I/O Path -  -- Stats ---
###  HW Path                I/O Paths                 Interf.  Mode    State  Q-IOs Errors
==============================================================================
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016130202E48d58s0  SP A1  active  dead   0  1
3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]  c2t5006016930202E48d58s0  SP B1  active  alive  0  0

# fsck /dev/dsk/c2t5006016130202E48d58s0
** /dev/dsk/c2t5006016130202E48d58s0
** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch
**
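For ZFS specifically, the two approaches under discussion would look roughly like the sketch below. This is illustrative only: the pool name is made up, the device names reuse the ones quoted above, the two commands are alternatives (don't run both against the same LUN), and which one is safe depends on your PowerPath release.

# Build the pool on the PowerPath pseudo-device, per Brian's approach:
zpool create sappool emcpower58a

# Or address one native path and rely on PowerPath failing it over
# underneath, per the counter-example in this thread:
zpool create sappool c2t5006016130202E48d58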
Re: [zfs-discuss] ZFS send needs optimization
Hello Łukasz,

Monday, July 23, 2007, 1:19:16 PM, you wrote:

Ł ZFS send is very slow.
Ł The dmu_sendbackup function traverses the dataset in one thread, and in
Ł the traverse callback function (backup_cb) we wait for data in
Ł arc_read called with the ARC_WAIT flag.
Ł I want to parallelize zfs send to make it faster.
Ł dmu_sendbackup could allocate a buffer that would be used for buffering output.
Ł A few threads could traverse the dataset, and a few threads would be used for async read operations.
Ł I think it could speed up the zfs send operation 10x.
Ł What do you think about it?

I guess you should check with Matthew Ahrens, as IIRC he's working on 'zfs send -r' and possibly some other improvements to zfs send. The question is what code changes Matthew has made so far (it hasn't been integrated AFAIK), and possibly work from there. Or perhaps Matthew is already working on this too... Now, if the pool resides on lots of disks, then I guess this should speed up zfs send considerably, at least in some cases (lots of small files, written/deleted/created randomly). It would be great if you could implement something and share some results with us, to see if there's actually a performance gain. Also, I guess you'll have to write all transactions to the other end (zfs recv) in the same order they were created on disk, or not?

ps. Lukasz - nice to see you here more and more :)

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]
          http://milek.blogspot.com
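Before and after any such change, a baseline is easy to collect. A minimal sketch, assuming a dataset named tank/fs (the snapshot name is a placeholder):

# measure raw traversal + read throughput, independent of the receiver:
zfs snapshot tank/fs@perftest
time zfs send tank/fs@perftest > /dev/null
zfs destroy tank/fs@perftest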
[zfs-discuss] Sharemgr Test Suite Released on OpenSolaris.org
The Sharemgr test suite is available on OpenSolaris.org. The source tarball, binary package, and baseline can be downloaded from the test consolidation download center at:

http://dlc.sun.com/osol/test/downloads/current

The source code can be viewed in the Solaris Test Collection (STC) 2.0 source tree at:

http://cvs.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share

The SUNWstc-tetlite package must be installed prior to executing a Sharemgr test run. More information on the Sharemgr test suite is available in the Sharemgr README file at:

http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/share/README

Any questions about the Sharemgr test suite can be sent to testing-discuss at:

http://www.opensolaris.org/os/community/testing/discussions

Cheers,
Jim
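For anyone new to the STC suites, the install order matters. A sketch, assuming the downloads unpack into package directories under the current directory; the suite's own package name here is a guess, so check the README above:

pkgadd -d . SUNWstc-tetlite    # prerequisite, per the note above
pkgadd -d . SUNWstc-share      # hypothetical package name for the suite itself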
[zfs-discuss] ZFS and NFS Mounting - Missing Permissions
Hi. I'm trying to set up a new NFS server and wish to use Solaris and ZFS. I have a ZFS filesystem set up to handle the users' home directories, with sharing enabled:

# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
data             896K  9.75G  35.3K  /data
data/home        751K  9.75G  38.0K  /data/home
data/home/bob   32.6K  9.75G  32.6K  /data/home/bob
data/home/joe    647K  9.37M   647K  /data/home/joe
data/home/paul  32.6K  9.75G  32.6K  /data/home/paul
# zfs get sharenfs data/home
NAME       PROPERTY  VALUE  SOURCE
data/home  sharenfs  rw     local

And these directories are owned by the users:

# ls -l /data/home
total 12
drwxrwxr-x   2 bob   sigma   2 Jul 23 08:47 bob
drwxrwxr-x   2 joe   sigma   4 Jul 23 11:31 joe
drwxrwxr-x   2 paul  sigma   2 Jul 23 08:47 paul

I have the top-level directory shared (/data/home). When I mount this on the client PC (Ubuntu), I lose all the permissions and can't see any of the files:

[EMAIL PROTECTED]:/nfs/home# ls -l
total 6
drwxr-xr-x 2 root root 2 2007-07-23 08:47 bob
drwxr-xr-x 2 root root 2 2007-07-23 08:47 joe
drwxr-xr-x 2 root root 2 2007-07-23 08:47 paul
[EMAIL PROTECTED]:/nfs/home# ls -l joe
total 0

However, when I mount each directory manually, it works:

[EMAIL PROTECTED]:~# mount torit01sx:/data/home/joe /scott
[EMAIL PROTECTED]:~# ls -l /scott
total 613
-rwxrwxrwx 1 joe sigma 612352 2007-07-23 11:32 file

Any ideas? When I try the same thing with a UFS-based filesystem, it works as expected:

[EMAIL PROTECTED]:/# mount torit01sx:/export/home /scott
[EMAIL PROTECTED]:/# ls -l scott
total 1
drwxr-xr-x 2 joe sigma 512 2007-07-23 12:25 joe

Any help would be greatly appreciated. Thanks in advance,
Scott
Re: [zfs-discuss] ZFS send needs optimization
Robert Milkowski wrote:
Hello Łukasz,
Monday, July 23, 2007, 1:19:16 PM, you wrote:
Ł ZFS send is very slow.
Ł The dmu_sendbackup function traverses the dataset in one thread, and in
Ł the traverse callback function (backup_cb) we wait for data in
Ł arc_read called with the ARC_WAIT flag.

That's correct.

Ł I want to parallelize zfs send to make it faster.
Ł dmu_sendbackup could allocate a buffer that would be used for buffering output.
Ł A few threads could traverse the dataset, and a few threads would be used for async read operations.
Ł I think it could speed up the zfs send operation 10x.
Ł What do you think about it?

You're right that we need to issue more I/Os in parallel -- see

6333409 traversal code should be able to issue multiple reads in parallel

However, it may be much more straightforward to just issue prefetches appropriately, rather than attempt to coordinate multiple threads. That said, feel free to experiment.

I guess you should check with Matthew Ahrens, as IIRC he's working on 'zfs send -r' and possibly some other improvements to zfs send. The question is what code changes Matthew has made so far (it hasn't been integrated AFAIK), and possibly work from there. Or perhaps Matthew is already working on this too...

Unfortunately I am not working on this bug as part of my zfs send -r changes. But I plan to work on it (unless you get to it first!) later this year as part of the pool space reduction changes.

Also, I guess you'll have to write all transactions to the other end (zfs recv) in the same order they were created on disk, or not?

Nope, that's (one of) the beauties of zfs send.

--matt
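Until 6333409 is addressed in the kernel, a user-level buffer between send and recv can absorb some of the bursty single-threaded reads. A sketch only: host, pool, and snapshot names are placeholders, and dd's output block size is arbitrary.

# decouple the reader from the network/receiver with a pipe buffer:
zfs send tank/fs@snap | dd obs=1048576 2>/dev/null | \
    ssh backuphost "zfs recv -d backuppool"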
Re: [zfs-discuss] more love for databases
On Jul 22, 2007, at 7:39 PM, JS wrote:

Is there a way to take advantage of this in Sol10/u03?

sorry, variable 'zfs_vdev_cache_max' is not defined in the 'zfs' module

That tunable/hack will be available in s10u4:
http://bugs.opensolaris.org/view_bug.do?bug_id=6472021

Wait about a month and it should be officially out...

eric
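Once on s10u4 (or a Nevada build carrying the fix), the tunable can be set at boot or on a live system. A sketch; the value shown (effectively disabling vdev read-ahead inflation for database-style random reads) is the commonly suggested one, not something stated in this thread:

# /etc/system, takes effect at next boot:
set zfs:zfs_vdev_cache_max = 1

# or on a live system, via mdb (0t1 is decimal 1):
echo 'zfs_vdev_cache_max/W 0t1' | mdb -kw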
Re: [zfs-discuss] ZFS and NFS Mounting - Missing Permissions
Scott Adair wrote:
Hi. I'm trying to set up a new NFS server and wish to use Solaris and ZFS. I have a ZFS filesystem set up to handle the users' home directories, with sharing enabled. [...] I have the top-level directory shared (/data/home). When I mount this on the client PC (Ubuntu), I lose all the permissions and can't see any of the files.

/data/home is a different file system than /data/home/joe. NFS shares do not cross file system boundaries. You'll need to share /data/home/joe, too.
-- richard
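A sketch of what that looks like from the Ubuntu client once each child filesystem is shared; the mount points are illustrative, and the user names are taken from Scott's listing:

for u in bob joe paul; do
    mkdir -p /nfs/home/$u
    mount torit01sx:/data/home/$u /nfs/home/$u
done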
Re: [zfs-discuss] General recommendations on raidz groups of different sizes
Richard Elling wrote:
Haudy Kazemi wrote:
How would one calculate system reliability estimates here? One is a RAIDZ set of 6 disks, the other a set of 8. The reliability of each RAIDZ set by itself isn't too hard to calculate, but put together, especially since they're different sizes, I don't know.

We just weigh them accordingly. MTTDL for a 6-disk set will be better than for an 8-disk set, though that seems to be counter-intuitive for some folks. Let me see if I can put some numbers together later this week...

OK, after some math we can get some idea... Using the MTTDL[1] model for a default disk (500 GBytes, 800k hours MTBF, 24 hours logistical response, 60 GBytes/hr resync) we get:

config                         MTTDL[1] (yrs)
6-disk raidz                           75,319
8-disk raidz                           40,349
2x 6-disk raidz                        37,659
6-disk raidz + 8-disk raidz            26,274
2x 8-disk raidz                        20,175

As you would expect, the MTTDL for the 6-disk scenario is better than for the 8-disk scenario. So it follows that the MTTDL for a pair of 6-disk raidz sets is better than for a pair of 8-disk raidz sets, and the 6+8 scenario is in between. The MTTDL[1] model is described at:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
-- richard
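The single-set rows can be reproduced directly from those inputs with MTTDL[1] = MTBF^2 / (N (N-1) MTTR), where MTTR is the logistical response plus the resync time; the combined-pool rows follow by summing the sets' failure rates (reciprocals of MTTDL). A sketch (rounding may differ by a year or so from the table):

awk 'BEGIN {
    mtbf = 800000; mttr = 24 + 500/60   # hours: disk MTBF; logistics + resync
    for (n = 6; n <= 8; n += 2)
        printf("%d-disk raidz: %.0f yrs\n", n, mtbf^2/(n*(n-1)*mttr)/8760)
}'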
Re: [zfs-discuss] ETA of device evacuation?
Hi Louwtjie,

(CC'd to the list as an FYI to others)

The biggest gotcha is that the SE6140s use a 12-byte SCSI command descriptor block (CDB), and thus can only present 2TB LUNs to the host. That's not an issue with ZFS, however, since you can just tack them together and grow your pool that way. See the attached PNG; that's how we're doing it. You'd have one ZFS file system on top of the pool for your customer's setup.

SE6140 limitations:
Maximum Volumes Per Array: 2,048
Maximum Volumes Per RAID group: 256
Maximum Volume size: 2 TB (minus 12 GB)
Maximum Drives in RAID group: 30
Maximum RAID group Size: 20 TB
Storage Partitions: Yes
Maximum Total Partitions: 64
Maximum Hosts per Partition: 256
Maximum Volumes per Partition: 256
Maximum Number of Global Hot Spares: 15

The above limits might matter if you thought you'd just have one fat LUN coming from your SE6140. You can't do it that way, but as shown in the picture you can use ZFS to do all that for you. If you keep all of your LUNs at exactly 2000 GB when you make them, then you can mirror and then detach an array's LUNs one by one until you can remove the array (a sketch follows at the end of this post). It'll be nice when ZFS has the ability to natively remove LUNs from a pool, expected in several months apparently.

Don't try to install the SE6140 software on Solaris 11 unless you're good at porting. It's possible (our NFS server is Sol 11 b64a) but it's not end-user friendly. Solaris 9 or 10 is fine. We needed the Sol 11 b64a version for the ZFS iSCSI abilities, which are fixed in that release.

When setting up the SE6140s, I found the serial ports didn't function with the supplied cables, at least on our equipment. Be proactive: wire all the SE6140s into one management network and put a DHCP server on there that allocates IPs according to the MAC addresses on the controller cards. Then, and not before, go and register the arrays in the Common Array Manager software. (And download the latest (May 2007) from Sun first, too.) Trying to change an array's IP once it's in the CAM setup is nasty.

From what you describe, you can do it all with just one array. All of ours are 750GB-SATA-disk SE6140s, expandable to 88TB per array. Our biggest is one controller and one expansion tray, so we have lots of headroom. You lose up to 20% of your raw capacity to the SE6140 RAID5 volume setup and the ZFS overheads; keep that in mind when scoping your solution.

In your /etc/system, put in the tunings:

set zfs:zil_disable=1
set zfs:zfs_nocacheflush=1

Our NFS database clients also have these tunings:

VCS mount opts (or /etc/vfstab for you):
MountOpt = rw,bg,hard,intr,proto=tcp,vers=3,rsize=32768,wsize=32768,forcedirectio

ce.conf mods:
name=pci108e,abba parent=/[EMAIL PROTECTED],70 unit-address=1
adv_autoneg_cap=1 adv_1000fdx_cap=1 accept_jumbo=1 adv_pause_cap=1;

This gets about 400 MBytes/s all together, running to a T2000 NFS server. That's pretty much the limit of the hardware, so we're happy with that :)

We're yet to look at MPxIO and load balancing across controllers. Plus I'm not sure I've tuned the file systems for Oracle block sizes; depending on your solution, that probably isn't an issue for you. We like the ability to do ZFS snapshots and clones: we can copy an entire DB setup and create a clone in about ten seconds. Before, it took hours using the EMCs.

Cheers,
Mark.

After reading your post ... I was wondering whether you would give some input/advice on a certain configuration I'm working on. A customer (potential) is considering using a server (probably a Sun Galaxy) connected to 2 switches and lots (lots!) of 6140's.

- One large filesystem - 70TB
- No downtime growth/expansion

Since it seems that you have several 6140's under ZFS control ... any problems/comments for me? Thank you.

On 7/19/07, Mark Ashley [EMAIL PROTECTED] wrote:
Hi folks,
One of the things I'm really hanging out for is the ability to evacuate the data from a zpool device onto the other devices and then remove the device, without mirroring it first, etc. The zpool would of course shrink in size according to how much space you just took away. Our situation is we have a number of SE6140 arrays attached to a host, with a total of 35TB. Some arrays are owned by other projects but are on loan for a while. I'd like to make one very large pool from the (maximum 2TB! wtf!) LUNs from the SE6140s and, once our DBAs are done with the workspace, remove the LUNs and free up the SE6140 arrays so their owners can begin to use them. At the moment, once a device is in a zpool, it's stuck there. That's a problem. What sort of time frame are we looking at until it's possible to remove LUNs from zpools?
ta,
Mark.

[attachment: SE6140_to_ZFS.png]
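A sketch of the grow-then-evacuate procedure Mark describes above; the pool name and cXtYdZ device names are hypothetical, each standing in for a 2000 GB SE6140 LUN:

# grow the pool online as LUNs arrive:
zpool create dbpool c4t0d0 c4t1d0
zpool add dbpool c4t2d0

# retire a LUN while native device removal doesn't exist yet: mirror it
# onto a same-size replacement, let the resilver finish, then detach it
zpool attach dbpool c4t0d0 c4t3d0
zpool status dbpool              # wait until the resilver completes
zpool detach dbpool c4t0d0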