Re: [zfs-discuss] ZFS on Ubuntu
All true. I just saw too many people needing Ubuntu and ZFS and thought I'd state the obvious, in case the patch set for Nexenta happens to differ enough to provide a working set. I've had Nexenta succeed where OpenSolaris quarterly releases failed, and vice versa. On Jun 27, 2010, at 9:54 PM, Erik Trimble erik.trim...@oracle.com wrote: On 6/27/2010 9:07 PM, Richard Elling wrote: On Jun 27, 2010, at 8:52 PM, Erik Trimble wrote: But that won't solve the OP's problem, which was that OpenSolaris doesn't support his hardware. Nexenta has the same hardware limitations as OpenSolaris. AFAICT, the OP's problem is with a keyboard. The vagaries of keyboards are well documented, but there is no silver bullet. Indeed, I have one box whose preference for PS/2 vs. USB seems to change with every other OS or hypervisor. My advice: have one of each handy, just in case. -- richard Right. I was just pointing out the fallacy of thinking that Nexenta might work on hardware that OpenSolaris doesn't (or has problems with). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on Ubuntu
Of course, Nexenta OS is a build of the Ubuntu userland on an OpenSolaris kernel. On Jun 26, 2010, at 12:27 AM, Freddie Cash fjwc...@gmail.com wrote: On Sat, Jun 26, 2010 at 12:20 AM, Ben Miles merloc...@hotmail.com wrote: What supporting applications are there on Ubuntu for RAIDZ? None. Ubuntu doesn't officially support ZFS. You can sort of make it work using the ZFS-FUSE project, but it's not stable, nor recommended. -- Freddie Cash fjwc...@gmail.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
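For anyone who wants to experiment anyway, a minimal sketch of the ZFS-FUSE route on an Ubuntu release of that era follows. The package name and device names are assumptions for illustration, and this is not a setup to trust with data you care about:

$ sudo apt-get install zfs-fuse                          (assumed package name; userspace ZFS via FUSE)
$ sudo zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd
$ sudo zfs create tank/media
$ sudo zpool status tank

Native, in-kernel ZFS is what the rest of this thread assumes, and that means an OpenSolaris-derived kernel (or Nexenta) rather than Ubuntu's.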
Re: [zfs-discuss] Extremely bad performance - hw failure?
I've had this happen to me too. I found some DTrace scripts at the time that showed the file system was spending too much time finding available 128k blocks, because each disk was nearly full even though, combined, I still had 140GB left of my 3TB pool. The SPA code, I believe, was spending too much time walking the pool looking for contiguous space for new writes, and this affected both read and write performance dramatically (measured in KB/sec). I was able to alleviate the pressure, so to speak, by adjusting the recordsize for the pool down to 8k (32k is likely a more commonly recommended value), and from there I could start to clear out space. Anything below 10% available space seems to cause ZFS to start behaving poorly, and going lower increases the problems. The root cause was metadata management on pools with less than 5-10% disk space left. In my case, I had lots of symlinks, lots of small files, and also dozens of snapshots. My pool was a RAID10 (i.e., 3 mirror sets striped). On Sun, Dec 27, 2009 at 4:52 PM, Morten-Christian Bernson m...@uib.no wrote: Lately my zfs pool in my home server has degraded to a state where it can be said it doesn't work at all. Read speed is slower than I can read from the internet on my slow DSL line... This is compared to just a short while ago, when I could read from it at over 50MB/sec over the network. My setup: Running latest Solaris 10:
# uname -a
SunOS solssd01 5.10 Generic_142901-02 i86pc i386 i86pc
# zpool status DATA
  pool: DATA
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        DATA        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
        spares
          c0t2d0    AVAIL
errors: No known data errors
# zfs list -r DATA
NAME   USED  AVAIL  REFER  MOUNTPOINT
DATA  3,78T   229G  3,78T  /DATA
All of the drives in this pool are 1.5TB Western Digital Green drives. I am not seeing any error messages in /var/adm/messages, and fmdump -eV shows no errors...
However, I am seeing some soft faults in iostat -eEn:
          ---- errors ---
  s/w h/w trn tot device
    2   0   0   2 c0t0d0
    1   0   0   1 c1t0d0
    2   0   0   2 c2t1d0
  151   0   0 151 c2t2d0
  151   0   0 151 c2t3d0
  153   0   0 153 c2t4d0
  153   0   0 153 c2t5d0
    2   0   0   2 c0t1d0
    3   0   0   3 c0t2d0
    0   0   0   0 solssd01:vold(pid531)
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
  Vendor: Sun Product: STK RAID INT Revision: V1.0 Serial No:
  Size: 31.87GB 31866224128 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
  Vendor: _NEC Product: DVD_RW ND-3500AG Revision: 2.16 Serial No:
  Size: 0.00GB 0 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 1 Predictive Failure Analysis: 0
c2t1d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
  Vendor: ATA Product: SAMSUNG HD753LJ Revision: 1113 Serial No:
  Size: 750.16GB 750156373504 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 2 Predictive Failure Analysis: 0
c2t2d0 Soft Errors: 151 Hard Errors: 0 Transport Errors: 0
  Vendor: ATA Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
  Size: 1500.30GB 1500301909504 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 151 Predictive Failure Analysis: 0
c2t3d0 Soft Errors: 151 Hard Errors: 0 Transport Errors: 0
  Vendor: ATA Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
  Size: 1500.30GB 1500301909504 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 151 Predictive Failure Analysis: 0
c2t4d0 Soft Errors: 153 Hard Errors: 0 Transport Errors: 0
  Vendor: ATA Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
  Size: 1500.30GB 1500301909504 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 153 Predictive Failure Analysis: 0
c2t5d0 Soft Errors: 153 Hard Errors: 0 Transport Errors: 0
  Vendor: ATA Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
  Size: 1500.30GB 1500301909504 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 153 Predictive Failure Analysis: 0
c0t1d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
  Vendor: Sun Product: STK RAID INT Revision: V1.0 Serial No:
  Size: 31.87GB 31866224128 bytes
  Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
  Illegal Request: 2 Predictive Failure Analysis: 0
c0t2d0 Soft Errors: 3 Hard Errors: 0
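For reference, a minimal sketch of the recordsize adjustment described above, using the pool name from the quoted post. Whether 8K or 32K is appropriate is a judgment call, and the property only affects blocks written after the change:

# zfs get recordsize DATA
# zfs set recordsize=32K DATA          (or 8K for more aggressive relief while freeing space)
...delete or migrate data until the pool is back above roughly 10% free...
# zfs inherit recordsize DATA          (restore the 128K default afterwards)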
Re: [zfs-discuss] SATA controller suggestion
On Thu, Jun 5, 2008 at 9:26 PM, Tim [EMAIL PROTECTED] wrote: On Thu, Jun 5, 2008 at 11:12 PM, Joe Little [EMAIL PROTECTED] wrote: On Thu, Jun 5, 2008 at 8:16 PM, Tim [EMAIL PROTECTED] wrote: On Thu, Jun 5, 2008 at 9:17 PM, Peeyush Singh [EMAIL PROTECTED] wrote: Hey guys, please excuse me in advance if I say or ask anything stupid :) Anyway, Solaris newbie here. I've built myself a new file server to use at home, on which I'm planning on configuring SXCE-89 with ZFS. It's a Supermicro C2SBX motherboard with a Core 2 Duo and 4GB DDR3. I have 6x750GB SATA drives in it connected to the onboard ICH9-R controller (with BIOS RAID disabled and AHCI enabled). I also have a 160GB SATA drive connected to a PCI SIIG SC-SA0012-S1 controller, which will be used as the system drive. My plan is to configure a RAID-Z2 pool on the 6x750 drives. The system drive is just there for Solaris. I'm also out of ports on the motherboard, which is why I'm using an add-in PCI SATA controller. My problem is that Solaris is not recognizing the system drive during the DVD install procedure. It sees the 6x750GB onboard drives fine. I originally used a RocketRAID 1720 SATA controller, which uses its own HighPoint chipset I believe, and it was a no-go. I went and exchanged that controller for a SIIG SC-SA0012-S1 controller, which I thought used a Silicon Image (SiI) chipset. The install DVD isn't recognizing it unfortunately, so now I'm not so sure that it uses a SiI chipset. I checked the HCL, and it only lists a few cards that are reported to work under SXCE. If anyone has any suggestions on either... A) using a different driver during the install procedure, or... B) a different, cheap SATA controller... I'd appreciate it very much. Sorry for the rambling post, but I wanted to be detailed from the get-go. Thanks for any input! :) PS. On a side note, I'm interested in playing around with SXCE development. It looks interesting :) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I'm still a fan of the Marvell-based Supermicro card. I run two of them in my fileserver. AOC-SAT2-MV8 http://www.supermicro.com/products/accessories/addon/AOC-SAT2-MV8.cfm I gave this question some treatment a few days ago. Yes, if you want PCI-X, go with the Marvell. If you want PCIe SATA, then it's either a SIIG-produced Si3124 card or a lot of guessing. I think the real winner is going to be the newer SAS/SATA mixed HBAs from LSI based on the 1068 chipset, which Sun has been supporting well in newer hardware. http://jmlittle.blogspot.com/2008/06/recommended-disk-controllers-for-zfs.html **PCI or PCI-X. Yes, you might see *SOME* loss in speed from a PCI interface, but let's be honest, there aren't a whole lot of users on this list asking this sort of question who have the infrastructure to push more than 100MB/sec. A PCI bus should have no issues pushing that. Equally important, don't mix SATA-I and SATA-II on that system motherboard, or on one of those add-on cards. http://jmlittle.blogspot.com/2008/05/mixing-sata-dos-and-donts.html I mix SATA-I and SATA-II and haven't had any issues to date. Unless you have an official bug logged/linked, that's as good as an old wives' tale. No bug to report, but it was one of the issues behind losing my log device a bit ago. ZFS engineers appear to be aware of it.
Among other things, it's why there is a known workaround to disable command queueing (NCQ) on the Marvell card when SATA-I drives are attached to it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
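For reference, the NCQ workaround usually cited for the Marvell-based cards is an /etc/system tunable; a sketch follows. The exact value is an assumption to verify against the marvell88sx notes for your build, and a reboot is required:

* /etc/system: limit the SATA queue depth to 1, effectively disabling NCQ
set sata:sata_max_queue_depth = 0x1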
Re: [zfs-discuss] SATA controller suggestion
On Thu, Jun 5, 2008 at 8:16 PM, Tim [EMAIL PROTECTED] wrote: On Thu, Jun 5, 2008 at 9:17 PM, Peeyush Singh [EMAIL PROTECTED] wrote: Hey guys, please excuse me in advance if I say or ask anything stupid :) Anyway, Solaris newbie here. I've built myself a new file server to use at home, on which I'm planning on configuring SXCE-89 with ZFS. It's a Supermicro C2SBX motherboard with a Core 2 Duo and 4GB DDR3. I have 6x750GB SATA drives in it connected to the onboard ICH9-R controller (with BIOS RAID disabled and AHCI enabled). I also have a 160GB SATA drive connected to a PCI SIIG SC-SA0012-S1 controller, which will be used as the system drive. My plan is to configure a RAID-Z2 pool on the 6x750 drives. The system drive is just there for Solaris. I'm also out of ports on the motherboard, which is why I'm using an add-in PCI SATA controller. My problem is that Solaris is not recognizing the system drive during the DVD install procedure. It sees the 6x750GB onboard drives fine. I originally used a RocketRAID 1720 SATA controller, which uses its own HighPoint chipset I believe, and it was a no-go. I went and exchanged that controller for a SIIG SC-SA0012-S1 controller, which I thought used a Silicon Image (SiI) chipset. The install DVD isn't recognizing it unfortunately, so now I'm not so sure that it uses a SiI chipset. I checked the HCL, and it only lists a few cards that are reported to work under SXCE. If anyone has any suggestions on either... A) using a different driver during the install procedure, or... B) a different, cheap SATA controller... I'd appreciate it very much. Sorry for the rambling post, but I wanted to be detailed from the get-go. Thanks for any input! :) PS. On a side note, I'm interested in playing around with SXCE development. It looks interesting :) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I'm still a fan of the Marvell-based Supermicro card. I run two of them in my fileserver. AOC-SAT2-MV8 http://www.supermicro.com/products/accessories/addon/AOC-SAT2-MV8.cfm I gave this question some treatment a few days ago. Yes, if you want PCI-X, go with the Marvell. If you want PCIe SATA, then it's either a SIIG-produced Si3124 card or a lot of guessing. I think the real winner is going to be the newer SAS/SATA mixed HBAs from LSI based on the 1068 chipset, which Sun has been supporting well in newer hardware. http://jmlittle.blogspot.com/2008/06/recommended-disk-controllers-for-zfs.html Equally important, don't mix SATA-I and SATA-II on that system motherboard, or on one of those add-on cards. http://jmlittle.blogspot.com/2008/05/mixing-sata-dos-and-donts.html It's the same chipset that's in the Thumper, and it's pretty cheap for an 8-port card. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
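As a first diagnostic step for the original problem (the install DVD not seeing the drive on the add-in controller), a sketch of what can be run from a shell on the install media; these are standard commands, and the output shows whether the controller attaches to any driver at all:

# format          (does the 160GB drive appear in the disk list?)
# cfgadm -al      (are the SATA ports on the add-in card listed?)
# prtconf -D      (which driver, if any, bound to the add-in controller?)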
Re: [zfs-discuss] cannot delete file when fs 100% full
On Fri, May 30, 2008 at 7:43 AM, Paul Raines [EMAIL PROTECTED] wrote: It seems that when a zfs filesystem with reserv/quota is 100% full, users can no longer even delete files to fix the situation, getting errors like these:
$ rm rh.pm6895.medial.V2.tif
rm: cannot remove `rh.pm6895.medial.V2.tif': Disk quota exceeded
(this is over NFS from a RHEL4 Linux box) I can log in as root on the Sun server and delete the file as root. After doing that, the user can then delete files okay. Is there any way to work around this that does not involve root intervention? Users are filling up their volumes all the time, which is the reason they must have reserv/quota set. Well, with a copy-on-write filesystem a delete actually requires a write. That said, there have been certain religious arguments on the list about whether the quota support presented by ZFS is sufficient. In a nutshell, per-user quotas are not implemented, and the suggested workaround is a per-user filesystem with quota/reservations. It's inelegant at best, since the auto-mount definitions become their own pain to maintain. The other unimplemented feature is soft and hard quota limits. Most people have gotten around this by actually presenting only UFS volumes held inside ZFS zvols to end users, but that defeats the purpose of providing snapshots directly to end users, etc. However, since snapshots are only available at the filesystem level, you are still restricted to one filesystem per user to use snapshots well, but I would argue hard/soft limits on the quota are the unanswered problem that doesn't have a known workaround. -- --- Paul Raines email: raines at nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th Street Charlestown, MA 02129 USA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
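A minimal sketch of the per-user-filesystem workaround mentioned above; the pool and user names are made up for illustration. Each user gets a dataset with a quota, and optionally a reservation, instead of a true per-user quota:

# zfs create tank/home
# zfs create tank/home/alice
# zfs set quota=10G tank/home/alice
# zfs set reservation=1G tank/home/alice
# zfs set sharenfs=on tank/home/alice

Each such dataset then has to appear in the clients' automount maps individually, which is the maintenance pain referred to above.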
Re: [zfs-discuss] slog failure ... *ANY* way to recover?
On Fri, May 30, 2008 at 6:30 AM, Jeb Campbell [EMAIL PROTECTED] wrote: Ok, here is where I'm at: My install of OS 2008.05 (snv_86?) will not even come up in single user. The OS 2008.05 live CD comes up fine, but I can't import my old pool b/c of the missing log (and I have to import to fix the log ...). So I think I'll boot from the live CD, import my root pool, mount it, and copy /rootpool/etc/zfs.cache to zfs.cache.save. Then I'll stick the zfs.cache from the live CD onto the root pool, update the boot bits, and cross my fingers. The goal of this is to get my installed OS to finish a boot; then I can try using the saved zfs.cache to load the degraded pool w/o an import. As long as I can get to it read-only, I'll copy off what I can. Any tips, comments, or suggestions would be welcome. It seems we are in our own little echo chamber here. Well, these are the bugs/resolutions that need to be addressed: 1) An L2ARC or log device needs to be removable; evacuation must be possible. 2) Any failure of an L2ARC or log device should never prevent importation of a pool. It is an additional device for cache/log purposes, and failures of these devices should be correctly handled, but not at the scope of failing the volume/losing already stored data. Yes, this means that data in the intent log may be lost, but I'd rather lose that 0.01% vs. the whole volume. 3) The failure-to-iterate-filesystems bug is also quite annoying. If any sub-FS in a zfs pool cannot be iterated (data/home in my example), the importation of the pool should note the error but proceed. Mounts of data/proj and data/* except for data/home should continue. Again, always attempt to do as much as possible to provide access to the data that's still available. Giving up on all mounts because of a fault on one is not reasonable behavior. There are more things here, and perhaps one can argue with the above. However, my stance is to be overly conservative about _DATA ACCESS_ -- faulting a pool until you can contact support and get a hack to recover otherwise available data is just not good form :) You'll lose customers left and right because of the cost of the downtime. Jeb This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
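A sketch of the zpool.cache shuffle Jeb describes, run from the live CD. The root pool name and altroot are assumptions, on most builds the cache file lives at /etc/zfs/zpool.cache rather than /rootpool/etc/zfs.cache, and this is an outline rather than a tested recipe:

# zpool import -f -R /a rpool
(if the root filesystem is not mounted automatically under /a, mount it there; root filesystems are often legacy-mounted)
# cp /a/etc/zfs/zpool.cache /a/etc/zfs/zpool.cache.save
# cp /etc/zfs/zpool.cache /a/etc/zfs/zpool.cache
# zpool export rpool

Then reboot into the installed OS, which should come up without trying to open the damaged data pool, after which the saved cache file can be experimented with.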
Re: [zfs-discuss] [osol-help] 1TB ZFS thin provisioned partition prevents Opensolaris from booting.
On Fri, May 30, 2008 at 7:07 AM, Hugh Saunders [EMAIL PROTECTED] wrote: On Fri, May 30, 2008 at 10:37 AM, Akhilesh Mritunjai [EMAIL PROTECTED] wrote: I think it's right. You'd have to move to a 64-bit kernel. Any reasons to stick to a 32-bit kernel? My reason would be lack of 64-bit hardware :( Is this an iSCSI-specific limitation, or will any multi-TB pool have problems on 32-bit hardware? If so, what's the upper bound on pool size on 32-bit? I've noticed it's only a problem of per-LUN sizes on 32-bit Solaris clients trying to use them in ZFS. You can build a ZFS volume of any* size as long as the underlying LUNs are less than 1.4TB each, I believe. Or so I've seen by experimentation. The * is because the total pool size and the per-LUN size were simply something that worked for me, but in the end I did go with 64-bit processors, as the memory crunch that ZFS has makes 32-bit unusable for any heavy use beyond 0.5TB of disk, again by observation. -- Hugh Saunders ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
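To check which kernel is actually running before sizing LUNs, the following standard commands are enough (no assumptions beyond a Solaris-derived system):

$ isainfo -kv        (reports whether the 32-bit or 64-bit kernel is booted)
$ isainfo -b         (prints 32 or 64)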
Re: [zfs-discuss] slog failure ... *ANY* way to recover?
On Thu, May 29, 2008 at 7:25 PM, Jeb Campbell [EMAIL PROTECTED] wrote: Meant to add that zpool import -f pool doesn't work b/c of the missing log vdev. All the other disks are there and show up with zpool import, but it won't import. Is there any way a utility could clear the log device vdev from the remaining raidz2 devices? Then I could import just a standard raidz2 pool. I really love zfs (and had recently upgraded to 6 disks in raidz2), but this is *really* gonna hurt to lose all this stuff (yeah, the work stuff is backed up, but I have/had tons of personal stuff on there). I definitely would prefer to just sit tight and see if there is any way to get this going (read-only would be fine). You can mount all those filesystems, and then zfs send/recv them off to another box. It sucks, but as of now there is no re-importing of the pool UNTIL the log can be removed. Sadly, I think that log removal will at least require importation of the pool in question first. For some reason you already can't import your pool. In my case, I was running B70 and could still import the pool, just degraded. I think that once you are at a higher rev (which I do not know exactly, but inclusive of B82 and B85), you won't be able to import it anymore when it fails. Jeb This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
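A sketch of the send/recv evacuation path suggested above, once the surviving filesystems are mounted; the backup host and target pool names are made up for illustration, and this is repeated per filesystem:

# zfs snapshot data/proj@evac
# zfs send data/proj@evac | ssh backuphost "zfs recv -d backup"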
Re: [zfs-discuss] slog failure ... *ANY* way to recover?
On Thu, May 29, 2008 at 8:59 PM, Joe Little [EMAIL PROTECTED] wrote: On Thu, May 29, 2008 at 7:25 PM, Jeb Campbell [EMAIL PROTECTED] wrote: Meant to add that zpool import -f pool doesn't work b/c of the missing log vdev. All the other disks are there and show up with zpool import, but it won't import. Is there any way a utility could clear the log device vdev from the remaining raidz2 devices? Then I could import just a standard raidz2 pool. I really love zfs (and had recently upgraded to 6 disks in raidz2), but this is *really* gonna hurt to lose all this stuff (yeah, the work stuff is backed up, but I have/had tons of personal stuff on there). I definitely would prefer to just sit tight and see if there is any way to get this going (read-only would be fine). More to the point, does it say there are any permanent errors that you find? Again, I was able to import it after reassigning the log device so it thinks it's there. I got to this point:
[EMAIL PROTECTED]:~# zpool status -v
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0    24
          raidz1    ONLINE       0     0    24
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
        logs        ONLINE       0     0    24
          c3t1d0    ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
        data/home:0x0
Yes, because of the error I can no longer have any mounts created at import, but zfs mount of data/proj or another filesystem (though not data/home) is still possible. Again, I think you will want to use -o ro as an option to that mount command so the system doesn't go bonkers. Check my blog for more info on resetting the log device for a zpool replace action -- which itself puts you in a more troubling position of possibly having corruption from the resilver, but at least for me it allowed me to mount the pool for read-only mounts of the remaining filesystems. You can mount all those filesystems, and then zfs send/recv them off to another box. It sucks, but as of now there is no re-importing of the pool UNTIL the log can be removed. Sadly, I think that log removal will at least require importation of the pool in question first. For some reason you already can't import your pool. In my case, I was running B70 and could still import the pool, just degraded. I think that once you are at a higher rev (which I do not know exactly, but inclusive of B82 and B85), you won't be able to import it anymore when it fails. Jeb This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
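A sketch of the read-only mount approach described above, using dataset names from the thread:

# zfs mount -o ro data/proj        (repeat for each surviving filesystem; skip the damaged data/home)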
[zfs-discuss] slog devices don't resilver correctly
This past weekend, my holiday was ruined due to a log device replacement gone awry. I posted all about it here: http://jmlittle.blogspot.com/2008/05/problem-with-slogs-how-i-lost.html In a nutshell, a resilver of a single log device with itself, necessary because one can't remove a log device from a pool once defined, caused ZFS to fully resilver but then attach the log device as a stripe to the volume, and no longer as a log device. The subsequent pool failure was exceptionally bad, as the volume could no longer be imported, and recovery required read-only mounting of whatever remaining filesystems I could in order to recover data. It would appear that log resilvers are broken, at least up to B85. I haven't seen code changes in this space, so I presume this is likely an unaddressed problem. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] indiana as nfs server: crash due to zfs
On Mon, May 26, 2008 at 6:10 AM, Gerard Henry [EMAIL PROTECTED] wrote: Hello all, I have Indiana freshly installed on a Sun Ultra 20 machine. It only acts as an NFS server. One night the kernel crashed, and I got these messages:
May 22 02:18:57 ultra20 unix: [ID 836849 kern.notice]
May 22 02:18:57 ultra20 ^Mpanic[cpu0]/thread=ff0003d06c80:
May 22 02:18:57 ultra20 genunix: [ID 603766 kern.notice] assertion failed: sm->sm_space == 0 (0x4000 == 0x0), file: ../../common/fs/zfs/space_map.c, line: 315
May 22 02:18:57 ultra20 unix: [ID 10 kern.notice]
May 22 02:18:57 ultra20 genunix: [ID 655072 kern.notice] ff0003d06830 genunix:assfail3+b9 ()
May 22 02:18:57 ultra20 genunix: [ID 655072 kern.notice] ff0003d068e0 zfs:space_map_load+2c2 ()
May 22 02:18:57 ultra20 genunix: [ID 655072 kern.notice] ff0003d06920 zfs:metaslab_activate+66 ()
May 22 02:18:57 ultra20 genunix: [ID 655072 kern.notice] ff0003d069e0 zfs:metaslab_group_alloc+24e ()
May 22 02:18:57 ultra20 genunix: [ID 655072 kern.notice] ff0003d06ab0 zfs:metaslab_alloc_dva+1da ()
May 22 02:18:57 ultra20 genunix: [ID 655072 kern.notice] ff0003d06b50 zfs:metaslab_alloc+82 ()
May 22 02:18:57 ultra20 genunix: [ID 655072 kern.notice] ff0003d06ba0 zfs:zio_dva_allocate+62 ()
Searching on the net, it seems that this kind of error is common. Does it mean that I can't use Indiana as a robust NFS server? What can I do if I want to investigate? I've seen many people trying to use (in most cases successfully) Indiana or some OpenSolaris build for quasi-production NFS or similar services. I think if you want robust, go with something that is targeted at robustness for your use case, such as NexentaStor (paid or free editions). I may come off as a shill for a solution that I use, but it amazes me that people ask for robust, stable-tracking solutions but always track the bleeding edge instead. Nothing wrong with that, and I do the same, but I know that's what it's for :) Thanks in advance, gerard This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slog devices don't resilver correctly
On Tue, May 27, 2008 at 1:50 PM, Eric Schrock [EMAIL PROTECTED] wrote: Yeah, I noticed this the other day while I was working on an unrelated problem. The basic problem is that log devices are kept within the normal vdev tree, and are only distinguished by a bit indicating that they are log devices (and this is the source of a number of other inconsistencies that Pawel has encountered). When doing a replacement, the userland code is responsible for creating the vdev configuration to use for the newly attached vdev. In this case, it doesn't preserve the 'is_log' bit correctly. This should be enforced in the kernel - it doesn't make sense to replace a log device with a non-log device, ever. I have a workspace with some other random ZFS changes, so I'll try to include this as well. FWIW, removing log devices is significantly easier than removing arbitrary devices, since there is no data to migrate (after the current txg is synced). At one point there were plans to do this as a separate piece of work (since the vdev changes are needed for the general case anyway), but I don't know whether this is still the case. Thanks for the reply. As noted, I do recommend against using a log device, since you can't remove it, and replacement, as you can see, is touchy at best. I know the larger, more general vdev evacuation work is ongoing, but if log evacuation is simple, it would make logs useful now instead of waiting. - Eric On Tue, May 27, 2008 at 01:13:47PM -0700, Joe Little wrote: This past weekend, my holiday was ruined due to a log device replacement gone awry. I posted all about it here: http://jmlittle.blogspot.com/2008/05/problem-with-slogs-how-i-lost.html In a nutshell, a resilver of a single log device with itself, necessary because one can't remove a log device from a pool once defined, caused ZFS to fully resilver but then attach the log device as a stripe to the volume, and no longer as a log device. The subsequent pool failure was exceptionally bad, as the volume could no longer be imported, and recovery required read-only mounting of whatever remaining filesystems I could in order to recover data. It would appear that log resilvers are broken, at least up to B85. I haven't seen code changes in this space, so I presume this is likely an unaddressed problem. -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slog devices don't resilver correctly
On Tue, May 27, 2008 at 4:50 PM, Eric Schrock [EMAIL PROTECTED] wrote: Joe - We definitely don't do great accounting of the 'vdev_islog' state here, and it's possible to create a situation where the parent replacing vdev has the state set but the children do not, but I have been unable to reproduce the behavior you saw. I have rebooted the system during resilver, manually detached the replacing vdev, and a variety of other things, but I've never seen the behavior you describe. In all cases, the log state is kept with the replacing vdev and restored when the resilver completes. I have also not observed the resilver failing with a bad log device. Can you provide more information about how to reproduce this problem? Perhaps without rebooting into B70 in the middle? Well, this happened live on a production system, and I'm still in the process of rebuilding said system (trying to save all the snapshots) I don't know what triggered it. It was trying to resilver in B85, rebooted into B70 where it did resilver (but it was now using cmdk device naming vs the full scsi device names). It was marked degraded still even though re-silvering finished. Since the resilver took so long, I suspect the splicing in of the device took place in the B70. Again, it would never work in B85 -- just kept resetting. I'm wondering if the device path changing from cxtxdx to cxdx could be the trigger point. Thanks, - Eric On Tue, May 27, 2008 at 01:50:04PM -0700, Eric Schrock wrote: Yeah, I noticed this the other day while I was working on an unrelated problem. The basic problem is that log devices are kept within the normal vdev tree, and are only distinguished by a bit indicating that they are log devices (and is the source for a number of other inconsistencies that Pwel has encountered). When doing a replacement, the userland code is responsible for creating the vdev configuration to use for the newly attached vdev. In this case, it doesn't preserve the 'is_log' bit correctly. This should be enforced in the kernel - it doesn't make sense to replace a log device with a non-log device, ever. I have a workspace with some other random ZFS changes, so I'll try to include this as well. FWIW, removing log devices is significantly easier than removing arbitrary devices, since there is no data to migrate (after the current txg is synced). At one point there were plans to do this as a separate piece of work (since the vdev changes are needed for the general case anyway), but I don't know whether this is still the case. - Eric On Tue, May 27, 2008 at 01:13:47PM -0700, Joe Little wrote: This past weekend, but holiday was ruined due to a log device replacement gone awry. I posted all about it here: http://jmlittle.blogspot.com/2008/05/problem-with-slogs-how-i-lost.html In a nutshell, an resilver of a single log device with itself, due to the fact one can't remove a log device from a pool once defined, cause ZFS to fully resilver but then attach the log device as as stripe to the volume, and no longer as a log device. The subsequent pool failure was exceptionally bad as the volume could no longer be imported and required read-only mounting of the remaining filesystems that I could to recover data. It would appear that log resilvers are broken, at least up to B85. I haven't seen code changes in this space so I presume this is likely an unaddressed problem. 
-- Eric Schrock, Fishworks http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs mount i/o error and workarounds
Hello list, We discovered a failed disk with checksum errors. We took out the disk and resilvered, which reported many errors. A few of the sub-filesystems in the pool won't mount anymore, with zpool import poolname reporting: cannot mount 'poolname/proj': I/O error. OK, we have a problem. I can successfully clone any snapshot of 'proj' and get it mounted, and it looks like all the snapshots are intact. This is all just backups, so I want to find a way to mount this filesystem while keeping the snapshots with it. Any recipe for how one does this? Do I need to zfs send/recv to myself under another name, delete the old, and rename? Is there any other way? These are large filesystems, and at least one is larger than my available space. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
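One possible approach, sketched here rather than taken from the thread: since a clone of the latest snapshot mounts fine, zfs promote can make that clone the new head of the filesystem, carrying the older snapshots with it and needing no extra space. Dataset names are from the post; verify on a test dataset that the damaged filesystem can be destroyed on your build before relying on this:

# zfs clone pool/proj@latest pool/proj.new
# zfs promote pool/proj.new          (snapshots older than the clone's origin move to pool/proj.new)
# zfs destroy pool/proj              (the damaged filesystem, now a clone of the promoted dataset)
# zfs rename pool/proj.new pool/proj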
Re: [zfs-discuss] How many ZFS pools is it sensible to use on a single server?
On Tue, Apr 8, 2008 at 9:55 AM, [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote on 04/08/2008 11:22:53 AM: In our environment, the politically and administratively simplest approach to managing our storage is to give each separate group at least one ZFS pool of their own (into which they will put their various filesystems). This could lead to a proliferation of ZFS pools on our fileservers (my current guess is at least 50 pools and perhaps up to several hundred), which leaves us wondering how well ZFS handles this many pools. So: is ZFS happy with, say, 200 pools on a single server? Are there any issues (slow startup, say, or peculiar IO performance) that we'll run into? Has anyone done this in production? If there are issues, is there any sense of what the recommended largest number of pools per server is? Chris, Well, I have done testing with filesystems and not as much with pools -- I believe the core design premise for zfs is that administrators would use few pools and many filesystems. I would think that Sun would recommend that you make a large pool (or a few) and divvy out filesystems with reservations to the groups (to which they can add sub-filesystems). As far as ZFS filesystems are concerned, my testing has shown that the mount time and I/O overhead for multiple filesystems seem to be pretty linear -- timing 10 mounts translates pretty well to 100 and 1000. After you hit some level (depending on processor and memory), the mount time, I/O and write/read batching spike up pretty heavily. This is one of the reasons I take a strong stance against the recommendation that people use reservations and filesystems as user/group quotas (ignoring that the functionality is by no means at parity). Not to beat a dead horse too much, but the lack of per-user quotas, and the mount limits (whether on the clients or in the per-filesystem mount time mentioned above), mean we can heavily utilize ZFS only for second tier, where quotas can be at a logical group level, and not for first-tier use, which still demands per-user quotas. It's an unmet requirement. As to your original question, with enough LUN carving you can artificially create many pools. However, ease of management and a focus on both performance and reliability suggest you put as many drives in a redundant config in as few pools as possible, split up your disk use among top-level ZFS filesystems for each group, and then let them divvy up ZFS filesystems with further embedded ZFS filesystems. -Wade ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
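A sketch of the one-big-pool, per-group-filesystem layout suggested above; the pool, device, and group names are made up for illustration:

# zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0
# zfs create tank/groupA
# zfs set reservation=2T tank/groupA
# zfs set quota=4T tank/groupA
# zfs create tank/groupA/projects      (the group can keep creating nested filesystems below this point)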
[zfs-discuss] odd slog behavior on B70
I was playing with a Gigabyte i-RAM card and found out it works great for improving overall performance when there are a lot of writes of small files over NFS to such a ZFS pool. However, I noted a recurring problem during long periods of writing small files over NFS. Here's a snippet of iostat during that period. sd15/sd16 are two iSCSI targets, and sd17 is the i-RAM card (2GB):
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
                    extended device statistics
device     r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
sd15       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0    0.0   0.0   0.0    0.0   0   0
sd17       0.0    0.0    0.0    0.0   3.0   1.0    0.0 100 100
During this time no operations can occur. I've attached the i-RAM disk via a 3124 card. I've never before seen a svc_t of 0 together with a fully waiting and busy disk. Any clue what this might mean? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] odd slog behavior on B70
On Nov 26, 2007 7:00 PM, Richard Elling [EMAIL PROTECTED] wrote: I would expect such iostat output from a device which can handle only a single queued I/O to the device (eg. IDE driver) and an I/O is stuck. There are 3 more I/Os in the wait queue waiting for the active I/O to complete. The %w and %b are measured as the percent of time during which an I/O was in queue. The svc_t is 0 because the I/O is not finished. By default, most of the drivers will retry I/Os which don't seem to finish, but the retry interval is often on the order of 60 seconds. If a retry succeeds, then no message is logged to syslog, so you might not see any messages. But just to be sure, what does fmdump (and fmdump -e) say about the system? Are messages logged in /var/adm/messages? nothing with fmdump or /var/adm/messages. Your answer explains why its 60 seconds or so. What's sad is that this is a ramdisk so to speak, albeit connected via SATA-I to the sil3124. Any way to isolate this further? Anyway to limit i/o timeouts to a drive? this is just two sticks of ram.. ms would be fine :) -- richard Joe Little wrote: I was playing with a Gigabyte i-RAM card and found out it works great to improve overall performance when there are a lot of writes of small files over NFS to such a ZFS pool. However, I noted a frequent situation in periods of long writes over NFS of small files. Here's a snippet of iostat during that period. sd15/sd16 are two iscsi targets, and sd17 is the iRAM card (2GB) extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 sd17 0.00.00.00.0 3.0 1.00.0 100 100 During this time no operations can occur. I've attached the iRAM disk via a 3124 card. I've never seen a svc_t time of 0, and full wait and busy disk. Any clue what this might mean? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] odd slog behavior on B70
On Nov 26, 2007 7:57 PM, Richard Elling [EMAIL PROTECTED] wrote: Joe Little wrote: On Nov 26, 2007 7:00 PM, Richard Elling [EMAIL PROTECTED] wrote: I would expect such iostat output from a device which can handle only a single queued I/O to the device (eg. IDE driver) and an I/O is stuck. There are 3 more I/Os in the wait queue waiting for the active I/O to complete. The %w and %b are measured as the percent of time during which an I/O was in queue. The svc_t is 0 because the I/O is not finished. By default, most of the drivers will retry I/Os which don't seem to finish, but the retry interval is often on the order of 60 seconds. If a retry succeeds, then no message is logged to syslog, so you might not see any messages. But just to be sure, what does fmdump (and fmdump -e) say about the system? Are messages logged in /var/adm/messages? Nothing with fmdump or /var/adm/messages. Your answer explains why it's 60 seconds or so. What's sad is that this is a ramdisk, so to speak, albeit connected via SATA-I to the sil3124. Any way to isolate this further? Any way to limit I/O timeouts to a drive? This is just two sticks of RAM... milliseconds would be fine :) I suspect a bug in the driver or firmware. It might be difficult to identify if it is in the firmware. A pretty good white paper on storage stack timeout tuning is available at BigAdmin: http://www.sun.com/bigadmin/features/hub_articles/tuning_sfs.jsp But it won't directly apply to your case because you aren't using the ssd driver. I'd wager the cmdk driver is being used for your case, and I'm not familiar with its internals. prtconf -D will show which driver(s) are in use. -- richard The previous message listed the sil bug. It's the sil driver, and thus sd* (as seen in the iostat output). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
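If the goal is just to shorten how long the sd driver waits before retrying a stuck command, a sketch of the usual /etc/system tuning follows. The 10-second value is an assumption to illustrate the knob rather than a recommendation; the default is 60 seconds, and a reboot is required:

* /etc/system: lower the sd driver's per-command timeout from the 60-second default
set sd:sd_io_time = 10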
Re: [zfs-discuss] raidz DEGRADED state
On Nov 20, 2007 6:34 AM, MC [EMAIL PROTECTED] wrote: So there is no current way to specify the creation of a 3-disk raid-z array with a known missing disk? Can someone answer that? Or does the zpool command NOT accommodate the creation of a degraded raidz array? You can't start it degraded, but you can make it so. If you can make a sparse file, you'd be set. Just create the file, make a zpool out of the two disks and the file, and then drop the file from the pool _BEFORE_ copying over the data. I believe you can then add the third disk as a replacement. The gotcha (and why the sparse file may be needed) is that it will only use, per disk, the size of the smallest device. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
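A sketch of that sparse-file trick; device names and sizes are made up for illustration, and the file must be at least as large as the real disks so it doesn't become the size-limiting member:

# mkfile -n 750g /var/tmp/fakedisk            (sparse file, takes almost no real space)
# zpool create tank raidz c2t0d0 c2t1d0 /var/tmp/fakedisk
# zpool offline tank /var/tmp/fakedisk        (pool is now DEGRADED; copy your data in)
# zpool replace tank /var/tmp/fakedisk c2t2d0 (later, resilver onto the real third disk)
# rm /var/tmp/fakedisk                        (once the resilver completes)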
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 18, 2007 1:44 PM, Richard Elling [EMAIL PROTECTED] wrote: one more thing... Joe Little wrote: I have historically noticed that in ZFS, whenever there is a heavy writer to a pool via NFS, the reads can be held back (basically paused). An example is a RAID10 pool of 6 disks, whereby a directory of files, including some large ones 100+MB in size, being written can cause other clients over NFS to pause for seconds (5-30 or so). This is on B70 bits. I've gotten used to this behavior over NFS, but didn't see it perform as such when on the server itself doing similar actions. To improve upon the situation, I thought perhaps I could dedicate a log device outside the pool, in the hopes that while heavy writes went to the log device, reads would merrily be allowed to coexist from the pool itself. My test case isn't ideal per se, but I added a local 9GB SCSI (80) drive for a log, and added two LUNs for the pool itself. You'll see from the below that while the log device is pegged at 15MB/sec (sd5), my directory list request on devices sd15 and sd16 is never answered. I tried this with both no-cache-flush enabled and off, with negligible difference. Is there any way to force a better balance of reads/writes during heavy writes?
                    extended device statistics
device     r/s    w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
fd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd1        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd2        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd3        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd4        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd5        0.0  118.0    0.0 15099.9   0.0  35.0  296.7   0 100
When you see actv = 35 and svc_t ~20, then it is possible that you can improve performance by reducing the zfs_vdev_max_pending queue depth. See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29 This will be particularly true for JBODs. Doing a little math, there is ~4.5 MBytes queued in the drive waiting to be written. 4.5 MBytes isn't much for a typical RAID array, but for a disk it is often a sizeable chunk of its available cache. A 9 GByte disk, being rather old, has a pretty wimpy microprocessor, so you are basically beating the poor thing senseless. Reducing the queue depth will allow the disk to perform more efficiently. I'll be trying an 18G 10K drive tomorrow. Again, the test was simply to see if by having a slog, I'd enable NFS to allow for concurrent reads and writes. Especially in the iSCSI case, but even in JBOD, I find _any_ heavy writing completely postpones reads for NFS clients. This makes ZFS and NFS impractical under I/O duress. My goal was simply to see how things would work. It appears from Neil that it won't, and the per-filesystem synchronicity RFE is what is needed, or at least zil_disable, for NFS to be practically used currently. As for the max_pending, I did try to lower that without any success (for values of 10 and 20) in a JBOD. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
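For completeness, the two usual ways to lower zfs_vdev_max_pending on builds of that era, as covered by the Evil Tuning Guide link above; the value 10 is just an example:

* /etc/system (takes effect at the next boot)
set zfs:zfs_vdev_max_pending = 10

# live, on a running kernel
echo zfs_vdev_max_pending/W0t10 | mdb -kw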
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 10:41 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe Little wrote: On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices, but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you, 15MB/s seems a bit on the slow side - especially if cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem may be that everything is backed up waiting to start a transaction because the txg train is slow due to NFS requiring the ZIL to push everything synchronously. I agree completely. The log (even though slow) was an attempt to isolate writes away from the pool. I guess the question is how to provide for async access for NFS. We may have 16, 32 or whatever threads, but if a single writer keeps the ZIL pegged and prohibiting reads, it's all for nought. Is there any way to tune/configure the ZFS/NFS combination to balance reads/writes so as not to starve one for the other? It's either feast or famine, or so tests have shown. No, there's no way currently to give reads preference over writes. All transactions get equal priority to enter a transaction group. Three txgs can be outstanding as we use a 3-phase commit model: open, quiescing, and syncing. Any way to improve the balance? It would appear that zil_disable is still a requirement to get NFS to behave in a practical, real-world way with ZFS. Even with zil_disable, we end up with periods of pausing on the heaviest of writes, and then I think it's mostly just ZFS having too much outstanding I/O to commit. If zil_disable is enabled, is the slog disk ignored? Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
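For reference, on builds of that era zil_disable was a global /etc/system tunable; a sketch follows. It disables the intent log for every dataset on the system, and NFS clients can silently lose acknowledged writes if the server crashes, so treat it as a measurement knob or last resort rather than a fix:

* /etc/system: disable the ZIL system-wide (requires reboot; use with care)
set zfs:zil_disable = 1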
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
I have historically noticed that in ZFS, whenever there is a heavy writer to a pool via NFS, the reads can be held back (basically paused). An example is a RAID10 pool of 6 disks, whereby a directory of files, including some large ones 100+MB in size, being written can cause other clients over NFS to pause for seconds (5-30 or so). This is on B70 bits. I've gotten used to this behavior over NFS, but didn't see it perform as such when on the server itself doing similar actions. To improve upon the situation, I thought perhaps I could dedicate a log device outside the pool, in the hopes that while heavy writes went to the log device, reads would merrily be allowed to coexist from the pool itself. My test case isn't ideal per se, but I added a local 9GB SCSI (80) drive for a log, and added two LUNs for the pool itself. You'll see from the below that while the log device is pegged at 15MB/sec (sd5), my directory list request on devices sd15 and sd16 is never answered. I tried this with both no-cache-flush enabled and off, with negligible difference. Is there any way to force a better balance of reads/writes during heavy writes?
                    extended device statistics
device     r/s    w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
fd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd1        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd2        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd3        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd4        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd5        0.0  118.0    0.0 15099.9   0.0  35.0  296.7   0 100
sd6        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd7        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd8        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd9        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd10       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd11       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd12       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd13       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd14       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd15       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
fd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd1        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd2        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd3        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd4        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd5        0.0  117.0    0.0 14970.1   0.0  35.0  299.2   0 100
sd6        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd7        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd8        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd9        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd10       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd11       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd12       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd13       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd14       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd15       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
fd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd1        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd2        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd3        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd4        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd5        0.0  118.1    0.0 15111.9   0.0  35.0  296.4   0 100
sd6        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd7        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd8        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd9        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd10       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd11       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd12       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd13       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd14       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd15       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd16       0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
fd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd0        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd1        0.0    0.0    0.0     0.0   0.0   0.0    0.0   0   0
sd2        0.0    0.0    0.0
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you 15MB/s seems a bit on the slow side - especially is cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem maybe that everything is backed up waiting to start a transaction because the txg train is slow due to NFS requiring the ZIL to push everything synchronously. I agree completely. The log (even though slow) was an attempt to isolate writes away from the pool. I guess the question is how to provide for async access for NFS. We may have 16, 32 or whatever threads, but if a single writer keeps the ZIL pegged and prohibiting reads, its all for nought. Is there anyway to tune/configure the ZFS/NFS combination to balance reads/writes to not starve one for the other. Its either feast or famine or so tests have shown. Neil. Joe Little wrote: I have historically noticed that in ZFS, when ever there is a heavy writer to a pool via NFS, the reads can held back (basically paused). An example is a RAID10 pool of 6 disks, whereby a directory of files including some large 100+MB in size being written can cause other clients over NFS to pause for seconds (5-30 or so). This on B70 bits. I've gotten used to this behavior over NFS, but didn't see it perform as such when on the server itself doing similar actions. To improve upon the situation, I thought perhaps I could dedicate a log device outside the pool, in the hopes that while heavy writes went to the log device, reads would merrily be allowed to coexist from the pool itself. My test case isn't ideal per se, but I added a local 9GB SCSI (80) drive for a log, and added to LUNs for the pool itself. You'll see from the below that while the log device is pegged at 15MB/sec (sd5), my directory list request on devices sd15 and sd16 never are answered. I tried this with both no-cache-flush enabled and off, with negligible difference. Is there anyway to force a better balance of reads/writes during heavy writes? extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b fd0 0.00.00.00.0 0.0 0.00.0 0 0 sd0 0.00.00.00.0 0.0 0.00.0 0 0 sd1 0.00.00.00.0 0.0 0.00.0 0 0 sd2 0.00.00.00.0 0.0 0.00.0 0 0 sd3 0.00.00.00.0 0.0 0.00.0 0 0 sd4 0.00.00.00.0 0.0 0.00.0 0 0 sd5 0.0 118.00.0 15099.9 0.0 35.0 296.7 0 100 sd6 0.00.00.00.0 0.0 0.00.0 0 0 sd7 0.00.00.00.0 0.0 0.00.0 0 0 sd8 0.00.00.00.0 0.0 0.00.0 0 0 sd9 0.00.00.00.0 0.0 0.00.0 0 0 sd10 0.00.00.00.0 0.0 0.00.0 0 0 sd11 0.00.00.00.0 0.0 0.00.0 0 0 sd12 0.00.00.00.0 0.0 0.00.0 0 0 sd13 0.00.00.00.0 0.0 0.00.0 0 0 sd14 0.00.00.00.0 0.0 0.00.0 0 0 sd15 0.00.00.00.0 0.0 0.00.0 0 0 sd16 0.00.00.00.0 0.0 0.00.0 0 0 ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 9:17 PM, Joe Little [EMAIL PROTECTED] wrote: On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices, but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you, 15MB/s seems a bit on the slow side - especially if cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem may be that everything is backed up waiting to start a transaction because the txg train is slow, due to NFS requiring the ZIL to push everything synchronously.

Roch wrote this before (thus my interest in the log or NVRAM-like solution):

There are 2 independent things at play here.

a) NFS sync semantics conspire against single-thread performance with any backend filesystem. However, NVRAM normally offers some relief of the issue.

b) ZFS sync semantics, along with the storage software + imprecise protocol in between, conspire against ZFS performance of some workloads on NVRAM-backed storage. NFS being one of the affected workloads.

The conjunction of the 2 causes worse than expected NFS performance over a ZFS backend running __on NVRAM-backed storage__. If you are not considering NVRAM storage, then I know of no ZFS/NFS-specific problems. Issue b) is being dealt with, by both Solaris and storage vendors (we need a refined protocol); issue a) is not related to ZFS and is rather a fundamental NFS issue. Maybe a future NFS protocol will help.

Net net: if one finds a way to 'disable cache flushing' on the storage side, then one reaches the state we'll be in, out of the box, when b) is implemented by Solaris _and_ the storage vendor. At that point, ZFS becomes a fine NFS server not only on JBOD as it is today, but also on NVRAM-backed storage.

It's complex enough, I thought it was worth repeating.

I agree completely. The log (even though slow) was an attempt to isolate writes away from the pool. I guess the question is how to provide for async access for NFS. We may have 16, 32 or whatever threads, but if a single writer keeps the ZIL pegged and prohibits reads, it's all for nought. Is there any way to tune/configure the ZFS/NFS combination to balance reads/writes so as not to starve one for the other? It's either feast or famine, or so tests have shown.

Neil.

Joe Little wrote: I have historically noticed that in ZFS, whenever there is a heavy writer to a pool via NFS, the reads can be held back (basically paused). An example is a RAID10 pool of 6 disks, whereby a directory of files, including some large ones 100+MB in size, being written can cause other clients over NFS to pause for seconds (5-30 or so). This is on B70 bits. I've gotten used to this behavior over NFS, but didn't see it perform as such when on the server itself doing similar actions. To improve upon the situation, I thought perhaps I could dedicate a log device outside the pool, in the hope that while heavy writes went to the log device, reads would merrily be allowed to coexist from the pool itself. My test case isn't ideal per se, but I added a local 9GB SCSI (80) drive for a log, and added two LUNs for the pool itself. You'll see from the below that while the log device is pegged at 15MB/sec (sd5), my directory list request on devices sd15 and sd16 is never answered. I tried this with both no-cache-flush enabled and off, with negligible difference.

Is there any way to force a better balance of reads/writes during heavy writes?

                    extended device statistics
device       r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
fd0          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd0          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd1          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd2          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd3          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd4          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd5          0.0  118.0     0.0  15099.9   0.0  35.0  296.7   0 100
sd6          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd7          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd8          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd9          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd10         0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd11         0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd12         0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd13         0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd14         0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd15         0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd16         0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
... ___ zfs-discuss
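A note on the "no-cache-flush enabled and off" testing referred to above: on builds of roughly this vintage that is normally done with the zfs_nocacheflush kernel tunable (assuming that is the knob being toggled here), set in /etc/system and picked up at the next boot:

    * /etc/system: stop ZFS from issuing cache-flush requests to the devices
    set zfs:zfs_nocacheflush = 1

This is only safe when every device under the pool (slog included) has non-volatile or battery-backed write cache; on plain disks it trades away the very guarantee the ZIL exists to provide.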
Re: [zfs-discuss] first public offering of NexentaStor
Not for NexentaStor as yet to my knowledge. I'd like to caution that the target of the initial product release is digital archiving/tiering/etc and is not necessarily primary NAS usage, though it can be used as such for those so inclined. However, interested parties should contact them as they flesh out those details. BETA programs are available too (that's where I'm at now) On 11/6/07, roland [EMAIL PROTECTED] wrote: is there any pricing information available ? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] first public offering of NexentaStor
On 11/2/07, MC [EMAIL PROTECTED] wrote: I consider myself an early adopter of ZFS and pushed it hard on this list and in real life with regards to iSCSI integration, zfs performance issues with latency there of, and how best to use it with NFS. Well, I finally get to talk more about the ZFS-based product I've been beta testing for quite some time. I thought this was the most appropriate place to make it known that NexentaStor is now out, and you can read more of my take at my personal post, http://jmlittle.blogspot.com/2007/11/coming-out-party- for-commodity-storage.html I thought it would be in the normal opensolaris blog listing, but since its not showing up there, this single list seems most appropriate to get interested parties and feedback. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discu ss Hmm so is that where all the Nexenta guys have been all this time!?!? :) I look forward to trying out what has been produced. This type of solution is a pleasing one for the consumer. Is there a list of the contributers and what they do? The landscape of Nexenta has changed and I wonder about the details. PS: the website looks kind of busy to the eyes :) PPS: I think the new Nexenta team is the perfect candidate for submitting to the community how they think the OpenSolaris branding and compatibility should work. Would you like a Built with OpenSolaris logo to use? How far would you (or should you) go to maintain compatibility and be certified as OpenSolaris Compatible? I can only speak up to my particular usage and understanding. Its OpenSolaris-based in the sense it is based on the ON/NWS consolidations (aka, NexentaOS or the NCP releases). Its still very much Debian/Ubuntu like in that it has that packaging, that installer, etc. Time will tell how compatible that is deemed to be. People doing real work on real projects should chime on on those issues because there is far too much yapping from people like me who do nothing :) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backport of vfs_zfsacl.c to samba 3.0.26a, [and NexentaStor]
On 11/2/07, Rob Logan [EMAIL PROTECTED] wrote: I'm confused by this and NexentaStor... wouldn't it be better to use b77? with: Heads Up: File system framework changes (supplement to CIFS' head's up) Heads Up: Flag Day (Addendum) (CIFS Service) Heads Up: Flag Day (CIFS Service) caller_context_t in all VOPs - PSARC/2007/218 VFS Feature Registration and ACL on Create - PSARC/2007/227 ZFS Case-insensitive support - PSARC/2007/244 Extensible Attribute Interfaces - PSARC/2007/315 ls(1) new command line options '-/' and '-%': CIFS system attributes support - PSARC/2007/394 Modified Access Checks for CIFS - PSARC/2007/403 Add system attribute support to chmod(1) - PSARC/2007/410 CIFS system attributes support for cp(1), pack(1), unpack(1), compress(1) and uncompress(1) - PSARC/2007/432 Rescind SETTABLE Attribute - PSARC/2007/444 CIFS system attributes support for cpio(1), pax(1), tar(1) - PSARC/2007/459 Update utilities to match CIFS system attributes changes. - PSARC/2007/546 ZFS sharesmb property - PSARC/2007/560 VFS Feature Registration and ACL on Create - PSARC/2007/227 Extensible Attribute Interfaces - PSARC/2007/315 Extensible Attribute Interfaces - PSARC/2007/315 Extensible Attribute Interfaces - PSARC/2007/315 Extensible Attribute Interfaces - PSARC/2007/315 CIFS Service - PSARC/2006/715 It doesn't yet have anything to do with NexentaStor per se. I know that CIFS service support in the BETA is preliminary, and the timing of the availability makes a CIFS service tied to ZFS and its share commands much more attractive. Depending on its maturity, I hope Nexenta folk will have it included in their final release if not somewhere on their roadmap. http://www.opensolaris.org/os/community/on/flag-days/all/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
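For reference, once the CIFS service and the sharesmb PSARC work listed above are in the build, sharing a dataset looks much like sharenfs does today; a minimal sketch, with an illustrative dataset name:

    # zfs set sharesmb=name=marketing tank/marketing
    # zfs get sharesmb tank/marketing

As with sharenfs, the property is inherited by descendant filesystems unless overridden.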
[zfs-discuss] first public offering of NexentaStor
I consider myself an early adopter of ZFS and pushed it hard on this list and in real life with regards to iSCSI integration, zfs performance issues with latency there of, and how best to use it with NFS. Well, I finally get to talk more about the ZFS-based product I've been beta testing for quite some time. I thought this was the most appropriate place to make it known that NexentaStor is now out, and you can read more of my take at my personal post, http://jmlittle.blogspot.com/2007/11/coming-out-party-for-commodity-storage.html I thought it would be in the normal opensolaris blog listing, but since its not showing up there, this single list seems most appropriate to get interested parties and feedback. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Announcing NexentaCP(b65) with ZFS/Boot integrated installer
On 6/7/07, Al Hopper [EMAIL PROTECTED] wrote: On Wed, 6 Jun 2007, Erast Benson wrote: Announcing new direction of Open Source NexentaOS development: NexentaCP (Nexenta Core Platform). NexentaCP is Dapper/LTS-based core Operating System Platform distributed as a single-CD ISO, integrates Installer/ON/NWS/Debian and provides basis for Network-type installations via main or third-party APTs (NEW). First unstable b65-based ISO with ZFS/Boot-capable installer available as usual at: http://www.gnusolaris.org/unstable-iso/ncp_beta1-test1-b65_i386.iso ... snip Now also available on www.genunix.org And mirrored at: http://mirror.stanford.edu/gnusolaris/isos/ncp_beta1-test1-b65_i386.iso Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: .zfs snapshot directory in all directories
On 2/27/07, Eric Haycraft [EMAIL PROTECTED] wrote: I am no scripting pro, but I would imagine it would be fairly simple to create a script and batch it to make symlinks in all subdirectories. I've done something similar using NFS aggregation products. The real problem is when you export, especially via CIFS (SMB) from a given directory. Let's take a given example of a division based file tree. A given area of the company, say marketing, has multiple sub folders: /pool/marketing, /pool/marketing/docs, /pool/marketing/projects, /pool/marketing/users Well, Marketing wants Windows access, so you allow shares at any point, including at /pool/marketing/users. Well, symlinks don't help, and a snapshot mechanism needs to be there at the users subdirectory level. Some would argue to promote /pool/marketing/users into a ZFS filesystem. Well, the other problem arises, in that at least with NFS, you need to share per filesystem and clients must multiple mount the filesystem (/pool/marketing, /pool/marketing/users, /pool/marketing/docs, etc). Mounting /pool/marketing alone will show you empty directories for users, projects, etc if further mounting doesn't exist. Yeah.. automounts, nfsv4, blah blah :) A lot of setup when all you need is pervasive .snapshot trees similar to NetApp. I just hope that don't have a bloody patent on something as simple as that to solve this. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
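One small mitigation for the above, short of a NetApp-style per-directory .snapshot: ZFS can at least expose the .zfs directory at the root of each filesystem so users can browse snapshots themselves (dataset name illustrative):

    # zfs set snapdir=visible pool/marketing
    # ls /pool/marketing/.zfs/snapshot

That still only appears at the filesystem root, though, so it doesn't help the "share a subdirectory" case described above.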
Re: [zfs-discuss] zfs corruption -- odd inum?
On 2/11/07, Jeff Bonwick [EMAIL PROTECTED] wrote: The object number is in hex. 21e282 hex is 2220674 decimal -- give that a whirl. This is all better now thanks to some recent work by Eric Kustarz: 6410433 'zpool status -v' would be more useful with filenames This was integrated into Nevada build 57. Jeff On Sat, Feb 10, 2007 at 05:18:05PM -0800, Joe Little wrote: So, I attempting to find the inode from the result of a zpool status -v: errors: The following persistent errors have been detected: DATASET OBJECT RANGE cc 21e382 lvl=0 blkid=0 Well, 21e282 appears to not be a valid number for find . -inum blah Any suggestions? Ok.. but using the hex as suggested gave me an odder error result that I can't parse.. zdb -vvv tier2 0x21e382 version=3 name='tier2' state=0 txg=353444 pool_guid=3320175367383032945 vdev_tree type='root' id=0 guid=3320175367383032945 children[0] type='disk' id=0 guid=1858965616559880189 path='/dev/dsk/c3t4d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 metaslab_array=16 metaslab_shift=33 ashift=9 asize=1500336095232 children[1] type='disk' id=1 guid=2406851811694064278 path='/dev/dsk/c3t5d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 metaslab_array=13 metaslab_shift=33 ashift=9 asize=1500336095232 children[2] type='disk' id=2 guid=4840324923103758504 path='/dev/dsk/c3t6d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 metaslab_array=4408 metaslab_shift=33 ashift=9 asize=1500336095232 children[3] type='disk' id=3 guid=18356839793156279878 path='/dev/dsk/c3t7d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 metaslab_array=4407 metaslab_shift=33 ashift=9 asize=1500336095232 Uberblock magic = 00bab10c version = 3 txg = 2834960 guid_sum = 12336413438187464178 timestamp = 1171223485 UTC = Sun Feb 11 11:51:25 2007 rootbp = [L0 DMU objset] 400L/200P DVA[0]=2:3aa12a3600:200 DVA[1]=3:378957f000:200 DVA[2]=0:7d2312f200:200 fletcher4 lzjb LE contiguous birth=2834960 fill=3672 cksum=f65361601:5b3233d8018:117d616a33b47:24feff94a90701 Dataset mos [META], ID 0, cr_txg 4, 294M, 3672 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]=2:3aa12a3600:200 DVA[1]=3:378957f000:200 DVA[2]=0:7d2312f200:200 fletcher4 lzjb LE contiguous birth=2834960 fill=3672 cksum=f65361601:5b3233d8018:117d616a33b47:24feff94a90701 Object lvl iblk dblk lsize asize type zdb: dmu_bonus_hold(2220930) failed, errno 2 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
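For anyone hitting the same thing, the hex-to-decimal step is the whole trick; a worked version using the object number from the zpool status output above (mountpoint assumed to be /tier2):

    # echo 'ibase=16; 21E382' | bc
    2220930
    # find /tier2 -inum 2220930 -print

2220930 is also the number zdb ends up complaining about above (dmu_bonus_hold(2220930)), which is a decent sanity check that the conversion was done on the right value.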
[zfs-discuss] zfs corruption -- odd inum?
So, I am attempting to find the inode from the result of a zpool status -v:

errors: The following persistent errors have been detected:
          DATASET  OBJECT  RANGE
          cc       21e382  lvl=0 blkid=0

Well, 21e282 appears to not be a valid number for find . -inum blah. Any suggestions? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] 118855-36 ZFS
On 2/5/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Casper, Monday, February 5, 2007, 2:32:49 PM, you wrote: Hello zfs-discuss, I've patched a U2 system to 118855-36. Several zfs-related bug ids should be covered between -19 and -36, like hot spare support. However, despite -36 being installed, 'zpool upgrade' still claims only v1 and v2 support. Also there's no zfs promote, etc. /kernel/drv/zfs is dated May 18 with 482448 in size, which looks too old. Also, 118855-36 has many zfs-related bugs listed; however, in its section file I do not see zfs/zpool commands or zfs kernel modules. Looks like they are not delivered.

CDSC Have you also installed the companion patch 124205-04? It contains all
CDSC the ZFS bits.

I've just figured it out. However, why are those ZFS-related bug ids listed in -36 while the fixes are actually delivered in 124205-05 (the same bug ids)?

Ah.. it looks like this patch is non-public (you need a service plan). So the free-as-in-beer ZFS U3 bits likely won't make it into a general release until U4.

Also, why doesn't 'smpatch analyze' show 124205? (I can force it to download the patch if I specify it.) -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: What SATA controllers are people using for ZFS?
On 2/1/07, Al Hopper [EMAIL PROTECTED] wrote: On Thu, 1 Feb 2007, Tom Buskey wrote: [i] I got an Addonics eSata card. Sata 3.0. PCI *or* PCI-X. Works right off the bat w/ 10u3. No firmware update needed. It was $130. But I don't pull out my hair and I can use it if I upgrade my server for pci-x [/i] And I'm finding the throughput isn't there. 2MB/s in ZFS RAIDZ and worse with UFS. *sigh* I think that there are big issues with the 3124 driver. I saw unexplained pauses that lasted from 30 to 80+ Seconds during a tar from a single SATA disk drive that I was migrating data from (using a Syba SD-SATA2-2E2I card). I fully expected the kernel to crash while observing this transfer (it did'nt). It happened periodically - each time a certain amount of data had been transferred (just by observation - not measurement). And this was a UFS filesystem and the drive is a Sun original drive from an Ultra 20 box. I need to do some followup experiments as Mike Riley (Sun) has kindly offered to take my results to the people working on this driver. So, anyone know an inexpensive 4 port SATA card for PCI that'll work with 10u3 and I don't need to reflash the BIOS on? (I bricked a Syba...) Honestly, you're much better off with the $125 8-port SuperMicro board that I have been unable to break to date. Details: SuperMicro AOC-SAT2-MV8 8-port - uses the Rev C0 (Hercules-2) chip: http://www.supermicro.com/products/accessories/addon/AoC-SAT2-MV8.cfm Kudos to the Sun developers working the Marvell driver! :) In the meantime I hope to find time to test a SAS2041E-R (initially the PCI Express version of this card). We switched away from those same Marvell cards because of unexplained disconnects/reconnects that ZFS/Solaris would not survive from. Stability for us came from embracing the Sil3124-2's (Tekram). We had two marvell based systems, and the most stable are the now discontinued SATA-I adaptec 16 port cards, and Sil3124s. I think its redundant, but the state of SATA support here is still the most glaring weakness. Isolating this all to a SCSI-to-SATA external chassis is the surest route to bliss. Keep posting to zfs-discuss! :) Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Thumper Origins Q
On 1/24/07, Jonathan Edwards [EMAIL PROTECTED] wrote: On Jan 24, 2007, at 09:25, Peter Eriksson wrote: too much of our future roadmap, suffice it to say that one should expect much, much more from Sun in this vein: innovative software and innovative hardware working together to deliver world-beating systems with undeniable economics. Yes please. Now give me a fairly cheap (but still quality) FC- attached JBOD utilizing SATA/SAS disks and I'll be really happy! :-) Could you outline why FC attached instead of network attached (iSCSI say) makes more sense to you? It might help to illustrate the demand for an FC target I'm hearing instead of just a network target .. I'm not generally for FC-attached storage, but we've documented here many times how the round trip latency with iSCSI hasn't been the perfect match with ZFS and NFS (think NAS). You need either IB or FC right now to make that workable. Some day though.. either with nvram-backed NFS or cheap 10Gig-E... .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What SATA controllers are people using for ZFS?
and specific models, and the driver used? Looks like there may be stability issues with the marvell, which appear to go unanswered.. On 12/21/06, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Naveen, I believe the newer LSI cards work pretty well with Solaris. Best Regards, Jason On 12/20/06, Naveen Nalam [EMAIL PROTECTED] wrote: Hi, This may not be the right place to post, but hoping someone here is running a reliably working system with 12 drives using ZFS that can tell me what hardware they are using. I have on order with my server vendor a pair of 12-drive servers that I want to use with ZFS for our company file stores. We're trying to use Supermicro PDSME motherboards, and each has two Supermicro MV8 sata cards. Solaris 10U3 he's found doesn't work on these systems. And I just read a post today (and an older post) on this group about how the Marvell based cards lock up. I can't afford lockups since this is very critical and expensive data that is being stored. My goal is a single cpu board that works with Solaris, and somehow get 12-drives plus 2 system boot drives plugged into it. I don't see any suitable sata cards on the Sun HCL. Are there any 4-port PCIe cards that people know reliably work? The Adaptec 1430SA looks nice, but no idea if it works. I could potentially get two 4-port PCIe cards, a 2 port PCI sata card (for boot), and 4-port motherboard - for 14 drives total. And cough up the extra cash for a supported dual-cpu motherboard (though i'm only using one cpu). any advice greatly appreciated.. Thanks! Naveen This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What SATA controllers are people using for ZFS?
On 12/21/06, Al Hopper [EMAIL PROTECTED] wrote: On Thu, 21 Dec 2006, Joe Little wrote: and specific models, and the driver used? Looks like there may be stability issues with the marvell, which appear to go unanswered.. I've tested a box running two Marvell based 8-port controllers (which has been running great on Update 2) on the solaris Update 3 beta without issues. The specific card is the newer version of the SuperMicro board: http://www.supermicro.com/products/accessories/addon/AoC-SAT2 but have yet to test them under the released Update 3 code. I'll post a followup after the box is upgraded or re-installed. [I'm waiting for the next 48-hour day so that I can do the upgrade without affecting the user community!!] AFAIR the reported Marvell issues were with ON B54 - not Update 3. Or do I have this wrong? Yes, this is all OpenSolaris based, so the areca seems to be for Solaris 10 proper and the marvell may have issues at least at B54. In any case, if you discover a bug with the Sun proprietary Marvell driver and Update 3 and you have a support contract, you can log a service request and get it fixed. Since the Marvell chipset is used in Thumper, I think its a pretty safe bet that the Marvell driver will continue to work very nicely (Thanks Lori). And yes, I would feel better if this driver was open sourced but that is Suns' decision to make. Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] B54 and marvell cards
We just put together a new system for ZFS use at a company, and twice in one week we've had the system wedge. You can log on, but the zpools are hosed, and a reboot never occurs if requested since it can't unmount the zfs volumes. So, only a power cycle works. In both cases, we get this: Dec 20 10:59:36 kona marvell88sx: [ID 331397 kern.warning] WARNING: marvell88sx0: device on port 2 still busy Dec 20 10:59:36 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:36 kona port 2: device reset Dec 20 10:59:37 kona marvell88sx: [ID 331397 kern.warning] WARNING: marvell88sx0: device on port 2 still busy Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: device reset Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: link lost Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: link established Dec 20 10:59:37 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 2: Dec 20 10:59:37 kona marvell88sx: [ID 517869 kern.info] device disconnected Dec 20 10:59:37 kona marvell88sx: [ID 517869 kern.info] device connected The first time was on port 1 (Sunday) and now this has occurred on port 2. Is there a known unrecoverable condition with the marvell card. We adopted this card because the adaptec 16 port card was discontinued. Everyday there seems to be less in the way of workable SATA cards for Solaris (sigh). Here's the output on startup, which always occurs: Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 0: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 1: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 2: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error 
on port 3: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 4: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 5: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered
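When a port wedges like this under the sata framework, the attachment-point state can at least be inspected, and sometimes the device bounced, with cfgadm rather than a full power cycle; a hedged sketch, assuming the controller shows up as sata0 and the stuck disk is on port 2:

    # cfgadm -al | grep sata
    # cfgadm -c disconnect sata0/2
    # cfgadm -c connect sata0/2

Whether that actually recovers anything depends on the driver; in a wedged state like the one described above it may simply hang the same way the reboot does.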
[zfs-discuss] Re: B54 and marvell cards
On 12/20/06, Joe Little [EMAIL PROTECTED] wrote: We just put together a new system for ZFS use at a company, and twice in one week we've had the system wedge. You can log on, but the zpools are hosed, and a reboot never occurs if requested since it can't unmount the zfs volumes. So, only a power cycle works. In both cases, we get this: Note to group.. Is the tekram 834A (SATA-II card w/ sil3124-1 and sil3124-2) supported yet? Seems like marvell is not the way to go.. Dec 20 10:59:36 kona marvell88sx: [ID 331397 kern.warning] WARNING: marvell88sx0: device on port 2 still busy Dec 20 10:59:36 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:36 kona port 2: device reset Dec 20 10:59:37 kona marvell88sx: [ID 331397 kern.warning] WARNING: marvell88sx0: device on port 2 still busy Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: device reset Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: link lost Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: link established Dec 20 10:59:37 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 2: Dec 20 10:59:37 kona marvell88sx: [ID 517869 kern.info] device disconnected Dec 20 10:59:37 kona marvell88sx: [ID 517869 kern.info] device connected The first time was on port 1 (Sunday) and now this has occurred on port 2. Is there a known unrecoverable condition with the marvell card. We adopted this card because the adaptec 16 port card was discontinued. Everyday there seems to be less in the way of workable SATA cards for Solaris (sigh). 
Here's the output on startup, which always occurs: Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 0: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 1: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 2: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 4: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 5: Dec 17 11:23
[zfs-discuss] Re: B54 and marvell cards
Some further joy: http://bugs.opensolaris.org/view_bug.do?bug_id=6504404 On 12/20/06, Joe Little [EMAIL PROTECTED] wrote: On 12/20/06, Joe Little [EMAIL PROTECTED] wrote: We just put together a new system for ZFS use at a company, and twice in one week we've had the system wedge. You can log on, but the zpools are hosed, and a reboot never occurs if requested since it can't unmount the zfs volumes. So, only a power cycle works. In both cases, we get this: Note to group.. Is the tekram 834A (SATA-II card w/ sil3124-1 and sil3124-2) supported yet? Seems like marvell is not the way to go.. Dec 20 10:59:36 kona marvell88sx: [ID 331397 kern.warning] WARNING: marvell88sx0: device on port 2 still busy Dec 20 10:59:36 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:36 kona port 2: device reset Dec 20 10:59:37 kona marvell88sx: [ID 331397 kern.warning] WARNING: marvell88sx0: device on port 2 still busy Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: device reset Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: link lost Dec 20 10:59:37 kona sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Dec 20 10:59:37 kona port 2: link established Dec 20 10:59:37 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 2: Dec 20 10:59:37 kona marvell88sx: [ID 517869 kern.info] device disconnected Dec 20 10:59:37 kona marvell88sx: [ID 517869 kern.info] device connected The first time was on port 1 (Sunday) and now this has occurred on port 2. Is there a known unrecoverable condition with the marvell card. We adopted this card because the adaptec 16 port card was discontinued. Everyday there seems to be less in the way of workable SATA cards for Solaris (sigh). 
Here's the output on startup, which always occurs: Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 0: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 1: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 2: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] 10-bit to 8-bit decode error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Disparity error Dec 17 11:23:15 kona marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 4: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] SError interrupt Dec 17 11:23:15 kona marvell88sx: [ID 131198 kern.info] SErrors: Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] Recovered communication error Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info] PHY ready change Dec 17 11:23:15 kona marvell88sx: [ID 517869 kern.info
Re: [zfs-discuss] poor NFS/ZFS performance
On 11/22/06, Chad Leigh -- Shire.Net LLC [EMAIL PROTECTED] wrote: On Nov 22, 2006, at 4:11 PM, Al Hopper wrote: No problem there! ZFS rocks. NFS/ZFS is a bad combination. Has anyone tried sharing a ZFS fs using samba or afs or something else besides nfs? Do we have the same issues? I've done some CIFS tests in the past, and off the top of my head, it was about 3-5x faster than NFS. Chad --- Chad Leigh -- Shire.Net LLC Your Web App and Email hosting provider chad at shire.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best version of Solaris 10 fro ZFS ?
The latest OpenSolaris release? Perhaps Nexenta in the end is the way to best deliver/maintain that. On 10/27/06, David Blacklock [EMAIL PROTECTED] wrote: What is the current recommended version of Solaris 10 for ZFS ? -thanks, -Dave ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re: [zfs-discuss] marvel cards.. as recommended
On 9/12/06, James C. McPherson [EMAIL PROTECTED] wrote: Joe Little wrote: So, people here recommended the Marvell cards, and one even provided a link to acquire them for SATA jbod support. Well, this is what the latest bits (B47) say: Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Any takers on how to get around this one? You could start by providing the output from prtpicl -v and prtconf -v as well as /usr/X11/bin/scanpci -v -V 1 so we know which device you're actually having a problem with. Is the pci vendor+deviceid for that card listed in your /etc/driver_aliases file against the marvell88sx driver? James I don't know if you really want all those large files, but /etc/driver_aliases lists: marvell88sx pci11ab,6081.9 [EMAIL PROTECTED]:~# lspci | grep Marv 03:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 07) 05:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 07) [EMAIL PROTECTED]:~# lspci -n | grep 11ab 03:01.0 0100: 11ab:6081 (rev 07) 05:01.0 0100: 11ab:6081 (rev 07) And it sees the module: 198 f571 9f10 62 1 marvell88sx (marvell88sx HBA Driver v1.8) Is this a support revision of the card? Is there something stupid like enabling the jumpers or some such that's required? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
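For anyone else debugging the same attach failure, the binding itself can be checked quickly without the full prtpicl/prtconf dumps; the alias lookup and the bound-driver listing are enough to show that the card matches the marvell88sx driver and that the failure here is the driver's chip-stepping (revision 07) check rather than a missing alias:

    # grep marvell88sx /etc/driver_aliases
    # prtconf -D | grep -i marvell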
Re: [zfs-discuss] Re: Re: marvel cards.. as recommended
Yeah. I got the message from a few others, and we are hoping to return/buy the newer one. I'm somewhat surprised by the limited set of SATA RAID or JBOD cards that one can actually use. Even the ones linked to on this list sometimes aren't supported :). I need to get up and running like yesterday, so we are just ordering the cards post haste.

On 9/13/06, Anton B. Rang [EMAIL PROTECTED] wrote: A quick peek at the Linux source shows a small workaround in place for the 07 revision... maybe if you file a bug against Solaris to support this revision it might be possible to get it added, at least if that's the only issue. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] marvel cards.. as recommended
So, people here recommended the Marvell cards, and one even provided a link to acquire them for SATA jbod support. Well, this is what the latest bits (B47) say: Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Any takers on how to get around this one? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] unaccounted for daily growth in ZFS disk space usage
We finally flipped the switch on one of our ZFS-based servers, with approximately 1TB used of 2.8TB (3 stripes of 950GB or so, each of which is a RAID5 volume on the Adaptec card). We have snapshots every 4 hours for the first few days. If you add up the snapshot references it appears somewhat high versus daily use (mostly mail boxes, spam, etc. changing), but say an aggregate of no more than 400+MB a day. However, zfs list shows our pool as a whole, and per day we are growing by .01TB, or more specifically 80GB a day. That's a far cry from the 400MB we can account for. Is it possible that metadata/ditto blocks, or the like, is truly growing that rapidly? By our calculations, we will triple our disk space usage (sitting still) in 6 months and use up the remaining 1.7TB. Of course, this is only with 2-3 days of churn, but it's an alarming rate; on the NetApp before, we didn't see anything close to this rate. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re: [zfs-discuss] unaccounted for daily growth in ZFS disk space usage
On 8/24/06, Matthew Ahrens [EMAIL PROTECTED] wrote: On Thu, Aug 24, 2006 at 07:07:45AM -0700, Joe Little wrote: We finally flipped the switch on one of our ZFS-based servers, with approximately 1TB used of 2.8TB (3 stripes of 950GB or so, each of which is a RAID5 volume on the Adaptec card). We have snapshots every 4 hours for the first few days. If you add up the snapshot references it appears somewhat high versus daily use (mostly mail boxes, spam, etc. changing), but say an aggregate of no more than 400+MB a day. However, zfs list shows our pool as a whole, and per day we are growing by .01TB, or more specifically 80GB a day. That's a far cry from the 400MB we can account for. Is it possible that metadata/ditto blocks, or the like, is truly growing that rapidly? By our calculations, we will triple our disk space usage (sitting still) in 6 months and use up the remaining 1.7TB. Of course, this is only with 2-3 days of churn, but it's an alarming rate; on the NetApp before, we didn't see anything close to this rate.

How are you calculating this 400MB/day figure? Keep in mind that space used by each snapshot is the amount of space unique to that snapshot. Adding up the space used by all your snapshots is *not* the amount of space that they are all taking up cumulatively. For leaf filesystems (those with no descendents), you can calculate the space used by all snapshots as (fs's used - fs's referenced). How many filesystems do you have? Can you send me the output of 'zfs list' and 'zfs get -r all pool'? How much space did you expect to be using, and what data is that based on? Are you sure you aren't writing 80GB/day to your pool? --matt

Well, by deleting my 4-hourlies I reclaimed most of the space. To answer some of the questions: it's about 15 filesystems (descendants included). I'm aware of the space used by snapshots overlapping. I was looking at the total space (zpool iostat reports) and seeing the diff per day. The 400MB/day was by inspection and by looking at our nominal growth on a NetApp. It would appear that if one takes many snapshots, there is an initial quick growth in disk usage, but once those snapshots reach their retention limit (say 12), the growth would appear to match our typical 400MB/day. Time will prove this one way or the other. By simply getting rid of hourly snapshots and collapsing to dailies for two days' worth, I reverted to only ~1-2GB total growth, which is much more in line with expectations.

For various reasons, I can't post the zfs list type results as yet. I'll need to get the ok for that first.. Sorry.. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
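A concrete version of Matt's leaf-filesystem arithmetic, with an illustrative pool name: list used and referenced side by side, and for each leaf filesystem the difference between the two is the space held by all of its snapshots combined:

    # zfs list -r -o name,used,referenced tank

For example, a leaf filesystem showing USED 12.0G and REFER 10.5G has roughly 1.5G pinned only by its snapshots, regardless of how that 1.5G is split between them.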
Re: Re: [zfs-discuss] multi-layer ZFS filesystems and exporting: my stupid question for the day
On 8/16/06, Frank Cusack [EMAIL PROTECTED] wrote: On August 16, 2006 10:25:18 AM -0700 Joe Little [EMAIL PROTECTED] wrote: Is there a way to allow simple export commands to traverse multiple ZFS filesystems for exporting? I'd hate to have to have hundreds of mounts required for every point in a given tree (we have users, projects, src, etc.)

Set the sharenfs property on the filesystems and use the automounter on the client.

Damn. We are hoping to move away from automounters and the maintenance of such (we use NeoPath, for example, for virtual aggregation of the paths). In the NAS world, sometimes automounts are not available. So, if this is true, you'll likely need to count me as one of those people who says we can't make a filesystem per user, and give us user quotas now please :)

-frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
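For completeness, the server side of Frank's suggestion is a single inherited property; what it does not remove is the need for the client to mount (or automount) each child filesystem separately, which is exactly the objection raised above. Names are illustrative:

    # zfs set sharenfs=rw tank/home      (inherited by tank/home/alice, tank/home/bob, ...)
    client# mount server:/tank/home/alice /home/alice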
Re: Re: Re: [zfs-discuss] multi-layer ZFS filesystems and exporting: my stupid question for the day
On 8/16/06, Frank Cusack [EMAIL PROTECTED] wrote: On August 16, 2006 10:34:31 AM -0700 Joe Little [EMAIL PROTECTED] wrote: On 8/16/06, Frank Cusack [EMAIL PROTECTED] wrote: On August 16, 2006 10:25:18 AM -0700 Joe Little [EMAIL PROTECTED] wrote: Is there a way to allow simple export commands the traverse multiple ZFS filesystems for exporting? I'd hate to have to have hundreds of mounts required for every point in a given tree (we have users, projects, src, etc) Set the sharenfs property on the filesystems and use the automounter on the client. Damn. We are hoping to move away from automounters and maintenance of such (we use NeoPath, for example, for virtual aggregation of the paths). I don't know what NeoPath is but automounts are trivial or at least easy to maintain even for very large sites. You are using path wildcards and DNS aliases, yes? used to.. running away quickly. For different *nix flavors, pick one of autofs, automountd, etc.. Nohide support, no nohide support, netgroups, LDAP, it all gets ugly pretty quickly. We want to move to statically definied mount trees similar to AFS managed centrally for all using NFS as the common protocol (aka the NeoPath) In the NAS world, sometimes automounts are not available. So, if this is true, you'll likely need to count me as one of those people who says we can't make a filesystem per user, and give us user quotas now please :) I don't understand. If an automount is not available, how is that different than the nfs server itself not being available. Or do you mean some clients do not have an automounter. Some clients don't have automounters available. Also some servers: Example is gateways/proxys (think SMB proxy w/o automounter support). Other clients in heavy use require lots of changes to get what you propose, such as OSX's automounter (storing the results in NetInfo) -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re: [zfs-discuss] ZFS vs. Apple XRaid
I've submitted these to Roch and co before on the NFS list and off list. My favorite case was writing 6250 8k files (randomly generated) over NFS from a solaris or linux client. We originally were getting 20K/sec when I was using RAIDZ, but between switching to RAID-5 backed iscsi luns in a zpool stripe and B40/41, we saw our performance approach a more reasonable 300-400K/sec average. I get closer to 1-3MB/sec with UFS as the backend vs ZFS. Of course, if its locally attached storage (not iSCSI) performance starts to be parallel to that of UFS or better. There is some built in latency and some major penalties for streaming writes of various sizes with the NFS implementation and its fsync happiness (3 fsyncs per write from an NFS client). Its all very true that its stable/safe, but its also very slow in various use cases! On 8/1/06, eric kustarz [EMAIL PROTECTED] wrote: Joe Little wrote: On 7/31/06, Dale Ghent [EMAIL PROTECTED] wrote: On Jul 31, 2006, at 8:07 PM, eric kustarz wrote: The 2.6.x Linux client is much nicer... one thing fixed was the client doing too many commits (which translates to fsyncs on the server). I would still recommend the Solaris client but i'm sure that's no surprise. But if you'r'e stuck on Linux, upgrade to the latest stable 2.6.x and i'd be curious if it was better. I'd love to be on kernel 2.6 but due to the philosophical stance towards OpenAFS of some people on the lkml list[1], moving to 2.6 is a tough call for us to do. But that's another story for another list. The fact is that I'm stuck on 2.4 for the time being and I'm having problems with a Solaris/ZFS NFS server that I'm (and Jan) are not having with Solaris/UFS and (in my case) Linux/XFS NFS server. [1] https://lists.openafs.org/pipermail/openafs-devel/2006-July/ 014041.html /dale First, OpenAFS 1.4 works just fine with 2.6 based kernels. We've already standardized on that over 2.4 kernels (deprecated) at Stanford. Second, I had similar fsync fatality when it came to NFS clients (linux or solaris mind you) and non-local backed clients using ZFS on a Solaris 10U2 (or B40+) server. My case was iscsi and it was chalked up to low latency on iSCSI, but I still to this day find NFS write performance on small or multititudes of files at a time with ZFS as a back end to be rather iffy. Its perfectly fast for NFS reads and and its always speedly local to the box, but the NFS/ZFS integration seems problematic. I can always test w/ UFS and get great performance. Its the roundtrips with many fsyncs to the backend storage that ZFS requires for commits that get ya. Do you have a reproducable test case for this? If so, i would be interested... I wonder if you're hitting: 6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510 which Neil is finishing up as we type. The problem basically is that fsyncs can get slowed down by non-related I/O, so if you had a process/NFS client that was doing lots of I/O and another doing fsyncs, the fsyncs would get slowed down by the other process/client. eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
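Since a reproducible test case was asked for: the 6250 x 8k file run described above boils down to something like the following, run from an NFS client against the ZFS-backed export (mount point and file names are illustrative):

    client# time sh -c 'i=0; while [ $i -lt 6250 ]; do
        dd if=/dev/urandom of=/mnt/test/f$i bs=8k count=1 2>/dev/null
        i=`expr $i + 1`
    done'

Each create and close ends up as a synchronous operation on the server, so the elapsed time is dominated by per-file round trips rather than raw bandwidth; running the same loop against a UFS-backed export gives the contrast described above.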
Re: Re: [zfs-discuss] ZFS vs. Apple XRaid
On 7/31/06, Dale Ghent [EMAIL PROTECTED] wrote: On Jul 31, 2006, at 8:07 PM, eric kustarz wrote: The 2.6.x Linux client is much nicer... one thing fixed was the client doing too many commits (which translates to fsyncs on the server). I would still recommend the Solaris client but i'm sure that's no surprise. But if you'r'e stuck on Linux, upgrade to the latest stable 2.6.x and i'd be curious if it was better. I'd love to be on kernel 2.6 but due to the philosophical stance towards OpenAFS of some people on the lkml list[1], moving to 2.6 is a tough call for us to do. But that's another story for another list. The fact is that I'm stuck on 2.4 for the time being and I'm having problems with a Solaris/ZFS NFS server that I'm (and Jan) are not having with Solaris/UFS and (in my case) Linux/XFS NFS server. [1] https://lists.openafs.org/pipermail/openafs-devel/2006-July/ 014041.html /dale First, OpenAFS 1.4 works just fine with 2.6 based kernels. We've already standardized on that over 2.4 kernels (deprecated) at Stanford. Second, I had similar fsync fatality when it came to NFS clients (linux or solaris mind you) and non-local backed clients using ZFS on a Solaris 10U2 (or B40+) server. My case was iscsi and it was chalked up to low latency on iSCSI, but I still to this day find NFS write performance on small or multititudes of files at a time with ZFS as a back end to be rather iffy. Its perfectly fast for NFS reads and and its always speedly local to the box, but the NFS/ZFS integration seems problematic. I can always test w/ UFS and get great performance. Its the roundtrips with many fsyncs to the backend storage that ZFS requires for commits that get ya. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] The ZFS Read / Write roundabout
I've always seen this curve in my tests (local disk or iscsi) and just think it's ZFS as designed. I haven't seen much parallelism when I have multiple I/O jobs going; the filesystem seems to go mostly into one mode or the other. Perhaps per vdev (with iscsi I'm only exposing one or two), there is only one performance characteristic at a time: write or read.

On 6/30/06, Nathan Kroenert [EMAIL PROTECTED] wrote: Hey all - I was playing a little with zfs today and noticed that when I was untarring a 2.5GB archive both from and onto the same spindle in my laptop, the bytes read and written over time were seesawing between approximately 23MB/s and 0MB/s. It seemed like we read and read and read until we were all full up, then wrote until we were empty, and so the cycle went. Now, as it happens, 31MB/s is about as fast as it gets on this disk at that part of the platter (using dd and a large block size on the rdev). (iirc, it actually started out closer to 30MB/s, so the slower speed might be a red herring...) So it seems to be below what I would hope to get out of the platter, but it's not too bad. Whether I read at 23 and write at 0, then read at 0 and write at 23, or read at 15 and write at 15, it works out the same(ish)... The question is: is this deliberate? (I'm guessing it's the txg flushing that's causing this behaviour.) iostat output is at the end of this email... Is this a deliberate attempt to reduce the number of seeks and I/Os to the disk (and especially competing read/writes on PATA)? I guess what's in the back of my mind is: is this the fastest / best way we can approach this? Also - when dding the raw slice that zfs is using, I noticed that my I/O rate also seesawed up and down between 31MB/s and 28MB/s over a 5 second interval... I was not expecting that... Thoughts? Thanks! :) Nathan.
Here is the iostat example -
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0      0.0  201.5      0.0  23908.7  33.0   2.0  173.5 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0      0.0  200.0      0.0  24822.5  33.0   2.0  174.9 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0      0.0  184.0      0.0  22413.1  33.0   2.0  190.2 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0     42.0  247.9   5246.9   8753.2  20.1   1.6   74.9  66  95
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0    159.0    6.0  20290.8      4.0  13.4   1.9   92.7  90 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0    186.0    0.0  23809.8      0.0  31.2   2.0  178.5 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0    172.0   30.0  22017.2   3016.2  31.5   2.0  166.0 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0      0.0  176.0      0.0  21109.0  33.0   2.0  198.8 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0      0.0  189.0      0.0  23422.8  33.0   2.0  185.1 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0      0.0  182.0      0.0  23288.6  33.0   2.0  192.3 100 100
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0     33.0  364.0   3904.0   7765.6  19.8   1.6   53.9  70  92
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0    146.0    6.0  18563.9      4.0  18.2   1.4  129.1  69  74
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
                    extended device statistics
device     r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
cmdk0    131.0    0.0  16768.9      0.0  18.0   1.8  150.8  67  90
nfs2       0.0    0.0      0.0      0.0   0.0   0.0    0.0   0   0
--
___ zfs-discuss mailing list
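A quick way to line the seesaw up against transaction group activity (not from the original post; a sketch, assuming the fbt probe resolves on the build in question):

# per-device reads vs writes, once a second
iostat -xnz 1

# timestamp the start of each txg sync so it can be correlated with iostat
dtrace -n 'fbt::spa_sync:entry { printf("%Y txg sync\n", walltimestamp); }'

If the write bursts in iostat start right when spa_sync fires, the seesaw is the txg flush described above.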
Re: Re: [zfs-discuss] Re: ZFS and Storage
On 6/27/06, Erik Trimble [EMAIL PROTECTED] wrote: Darren J Moffat wrote: Peter Rival wrote: storage arrays with the same arguments over and over without providing an answer to the customer problem doesn't do anyone any good. So. I'll restate the question. I have a 10TB database that's spread over 20 storage arrays that I'd like to migrate to ZFS. How should I configure the storage array? Let's at least get that conversation moving... I'll answer your question with more questions: What do you do just now, ufs, ufs+svm, vxfs+vxvm, ufs+vxvm, other ? What of that doesn't work for you ? What functionality of ZFS is it that you want to leverage ? It seems that the big thing we all want (relative to the discussion of moving HW RAID to ZFS) from ZFS is the block checksumming (i.e. how to reliabily detect that a given block is bad, and have ZFS compensate). Now, how do we get things when using HW arrays, and not just treat them like JBODs (which is impractical for large SAN and similar arrays that are already configured). Since the best way to get this is to use a Mirror or RAIDZ vdev, I'm assuming that the proper way to get benefits from both ZFS and HW RAID is the following: (1) ZFS mirror of HW stripes, i.e. zpool create tank mirror hwStripe1 hwStripe2 (2) ZFS RAIDZ of HW mirrors, i.e. zpool create tank raidz hwMirror1, hwMirror2 (3) ZFS RAIDZ of HW stripes, i.e. zpool create tank raidz hwStripe1, hwStripe2 mirrors of mirrors and raidz of raid5 is also possible, but I'm pretty sure they're considerably less useful than the 3 above. Personally, I can't think of a good reason to use ZFS with HW RAID5; case (3) above seems to me to provide better performance with roughly the same amount of redundancy (not quite true, but close). I'd vote for (1) if you need high performance, at the cost of disk space, (2) for maximum redundancy, and (3) as maximum space with reasonable performance. I'm making a couple of assumptions here: (a) you have the spare cycles on your hosts to allow for using ZFS RAIDZ, which is a non-trivial cost (though not that big, folks). (b) your HW RAID controller uses NVRAM (or battery-backed cache), which you'd like to be able to use to speed up writes (c) you HW RAID's NVRAM speeds up ALL writes, regardless of the configuration of arrays in the HW (d) having your HW controller present individual disks to the machines is a royal pain (way too many, the HW does other nice things with arrays, etc) The case for HW RAID 5 with ZFS is easy: when you use iscsi. You get major performance degradation over iscsi when trying to coordinate writes and reads serially over iscsi using RAIDZ. The sweet spot in the iscsi world is let your targets do RAID5 or whatnot (RAID10, RAID50, RAID6), and combine those into ZFS pools, mirrored or not. There are other benefits to ZFS, including snapshots, easily managed storage pools, and with iscsi, ease of switching head nodes with a simple export/import. Erik Trimble Java System Support Mailstop: usca14-102 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
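Spelled out with concrete device names, the three layouts above would look roughly like the following; the cXtYdZ names are placeholders for whatever LUNs the array actually presents, so treat this as a sketch rather than a recipe:

# (1) ZFS mirror of two hardware stripes
zpool create tank mirror c4t0d0 c5t0d0

# (2) ZFS raidz across hardware mirrors
zpool create tank raidz c4t0d0 c4t1d0 c4t2d0

# (3) ZFS raidz across hardware stripes
zpool create tank raidz c4t0d0 c5t0d0 c6t0d0

The iSCSI variant recommended at the end works the same way: each RAID5/RAID10 target shows up as a single LUN, and those LUNs are then mirrored or simply striped together with zpool create on the head node.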
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
To clarify what has just been stated. With zil disabled I got 4MB/sec. With zil enabled I get 1.25MB/sec On 6/23/06, Tao Chen [EMAIL PROTECTED] wrote: On 6/23/06, Roch [EMAIL PROTECTED] wrote: On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote: On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote: a test against the same iscsi targets using linux and XFS and the NFS server implementation there gave me 1.25MB/sec writes. I was about to throw in the towel and deem ZFS/NFS has unusable until B41 came along and at least gave me 1.25MB/sec. That's still super slow -- is this over a 10Mb link or something? Jeff I think the performance is in line with expectation for, small file,single threaded, open/write/close NFS workload (nfs must commit on close). Therefore I expect : (avg file size) / (I/O latency). Joe does this formula approach the 1.25 MB/s ? Joe sent me another set of DTrace output (biorpt.sh.rec.gz), running 105 seconds with zil_disable=1. I generate a graph using Grace ( rec.gif ). The interesting part for me: 1) How I/O response time (at bdev level) changes in a pattern. 2) Both iSCSI (sd2) and local (sd1) storage follow the same pattern and have almost identicle latency on average. 3) The latency is very high, either on average or at peaks. Although a low throughput is expected given large amount of small files, I don't expect such high latency, and of course 1.25MB/s is too low, even after turn on zil_disable, I see 4MB/s in this data set. I/O size at bdev level are actually pretty decent: mostly (75%) 128KB. Here's a summary: # biorpt -i biorpt.sh.rec Generating report from biorpt.sh.rec ... === Top 5 I/O types === DEVICET BLKs COUNT - sd1 W 256 3122 sd2 W 256 3118 sd1 W 2 164 sd2 W 2 151 sd2 W 3 123 === Top 5 worst I/O response time === DEVICET BLKs OFFSETTIMESTAMP TIME.ms - -- --- --- sd1 W 256 529562656 104.322170 3316.90 sd1 W 256 529563424 104.322185 3281.97 sd2 W 256 521152480 104.262081 3262.49 sd2 W 256 521152736 104.262102 3258.56 sd1 W 256 529562912 104.262091 3249.85 === Top 5 Devices with largest number of I/Os === DEVICE READ AVG.ms MBWRITE AVG.ms MB IOs SEEK --- --- -- -- --- -- -- --- sd17 2.70 0 4169 440.62409 4176 0% sd26 0.25 0 4131 444.79407 4137 0% cmdk0 5 21.50 0 138 0.82 0 143 11% === Top 5 Devices with largest amount of data transfer === DEVICE READ AVG.ms MBWRITE AVG.ms MB Tol.MB MB/s --- --- -- -- --- -- -- --- sd17 2.70 0 4169 440.62409 4094 sd26 0.25 0 4131 444.79407 4074 cmdk0 5 21.50 0 138 0.82 000 Tao ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/23/06, Roch [EMAIL PROTECTED] wrote: Joe Little writes: On 6/22/06, Bill Moore [EMAIL PROTECTED] wrote: Hey Joe. We're working on some ZFS changes in this area, and if you could run an experiment for us, that would be great. Just do this: echo 'zil_disable/W1' | mdb -kw We're working on some fixes to the ZIL so it won't be a bottleneck when fsyncs come around. The above command will let us know what kind of improvement is on the table. After our fixes you could get from 30-80% of that improvement, but this would be a good data point. This change makes ZFS ignore the iSCSI/NFS fsync requests, but we still push out a txg every 5 seconds. So at most, your disk will be 5 seconds out of date compared to what it should be. It's a pretty small window, but it all depends on your appetite for such windows. :) After running the above command, you'll need to unmount/mount the filesystem in order for the change to take effect. If you don't have time, no big deal. --Bill On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote: On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote: a test against the same iscsi targets using linux and XFS and the NFS server implementation there gave me 1.25MB/sec writes. I was about to throw in the towel and deem ZFS/NFS has unusable until B41 came along and at least gave me 1.25MB/sec. That's still super slow -- is this over a 10Mb link or something? Jeff I think the performance is in line with expectation for, small file,single threaded, open/write/close NFS workload (nfs must commit on close). Therefore I expect : (avg file size) / (I/O latency). Joe does this formula approach the 1.25 MB/s ? To this day, I still don't know how to calculate the i/o latency. Average file size is always expected to be close to kernel page size for NASes -- 4-8k. Always tune for that. Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the iscsi target, and single gig-e link (nge) to the NFS clients, who are gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference. Again, the issue is the multiple fsyncs that NFS requires, and likely the serialization of those iscsi requests. Apparently, there is a basic latency in iscsi that one could improve upon with FC, but we are definitely in the all ethernet/iscsi camp for multi-building storage pool growth and don't have interest in a FC-based SAN. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Well, following Bill's advice and the previous note on disabling zil, I ran my test on a B38 opteron initiator and if you do a time on the copy from the client, 6250 8k files transfer at 6MB/sec now. If you watch the entire commit on the backend using zpool iostat 1 I see that it takes a few more seconds, and the actual rate there is 4MB/sec. Beats my best of 1.25MB/sec, and this is not B41. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Joe, you know this but for the benefit of others, I have to highlight that running any NFS server this way, may cause silent data corruption from client's point of view. Whenever a server keeps data in RAM this way and does not commit it to stable storage upon request from clients, that opens a time window for corruption. So a client writes to a page, then reads the same page, and if the server suffered a crash in between, the data may not match. So this is performance at the expense of data integrity. -r Yes.. ZFS in its normal mode has better data integrity. 
However, this may be a more acceptable tradeoff if you have specific read/write patterns. In my case, I'm going to use ZFS initially for my tier-2 storage, with nightly write periods (these need to be short-duration rsyncs from tier 1) and mostly read traffic throughout the rest of the day. I'd love to use ZFS as a tier-1 service as well, but then it would have to perform the way a NetApp does: the same tricks, with NVRAM or an initial write to local stable storage before writing to the backend storage. 6MB/sec is closer to the behavior expected of a first tier, but at the expense of reliability. I don't know what the answer is for Sun to make ZFS first-tier quality with their NFS implementation and its fsync happiness. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on 32bit x86
What if your 32bit system is just a NAS -- ZFS and NFS, nothing else? I think it would still be worthwhile to allow tweaking of things at runtime to make 32-bit systems behave better. On 6/21/06, Mark Maybee [EMAIL PROTECTED] wrote: Yup, you're probably running up against the limitations of 32-bit kernel addressability. We are currently very conservative in this environment, and so tend to end up with a small cache as a result. It may be possible to tweak things to get larger cache sizes, but you run the risk of starving out other processes trying to get memory. -Mark Robert Milkowski wrote: Hello zfs-discuss, Simple test: 'ptime find /zfs/filesystem > /dev/null' with 2GB RAM. After the second, third, etc. run it still reads a lot from disk while find is running (atime is off). On x64 (Opteron) it doesn't. I guess it's due to the 512MB heap limit in the kernel for its cache. ::memstat shows 469MB for kernel and 1524MB on the freelist. Is there anything that could be done? I guess not, but perhaps. P.S. Of course there are a lot of files, something like ~150K of them. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
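For completeness, the knob that eventually showed up for this is an explicit ARC ceiling. It will not make the 32-bit cache any bigger, but it does make the small cache predictable instead of leaving the ARC and the rest of the kernel heap to fight it out. Whether the form below exists depends on the build (it appeared after the bits discussed here), so treat it as an assumption to verify, not a guarantee:

* /etc/system on later builds: hard cap on the ARC (256MB in this example)
set zfs:zfs_arc_max=0x10000000

A runtime approach used in this era was poking the arc structure's c_max field with mdb -kw (located via something like arc::print -a c_max), with the usual caveat that such symbol names are private and can change between builds.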
Re: Re: [zfs-discuss] ZFS on 32bit x86
On 6/22/06, Darren J Moffat [EMAIL PROTECTED] wrote: Rich Teer wrote: On Thu, 22 Jun 2006, Joe Little wrote: Please don't top post. What if your 32bit system is just a NAS -- ZFS and NFS, nothing else? I think it would still be ideal to allow tweaking of things at runtime to make 32-bit systems more ideal. I respectfully disagree. Even on x86, 64-bits are common, and the price difference between 64-bit and 32-bit capable systems is small. So apart from keeping old stuff working, I can think of little or no justifcation to not go with 64-bit systems these days, even for a small S10 plus ZFS NAS appliance. That way you leave behind all the pain 32-bits gives you. Are VIA processor chips 64bit capable yet ? -- Darren J Moffat Well, current Xeon-LVs are 32 bit only, but besides the point, I'm in education, where our storage boxes are purchased using grant money that must be utilized for x number of years. The answer from Rich Teer indicates that we should dump old infrastructure and buy new, or if you are in our industry (I represent Stanford University Electrical Engineering), take your money/infrastructure elsewhere as only new customers need apply :( A lot of organizations have a lot of 32 bit infrastructure with multiple RAID cards, drives, etc that they'd love to migrate over to ZFS. I'm using it now for creating large pools of 2nd tier storage. And yes, that will mostly be pre-existing hardware. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote: a test against the same iscsi targets using linux and XFS and the NFS server implementation there gave me 1.25MB/sec writes. I was about to throw in the towel and deem ZFS/NFS has unusable until B41 came along and at least gave me 1.25MB/sec. That's still super slow -- is this over a 10Mb link or something? Jeff Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the iscsi target, and single gig-e link (nge) to the NFS clients, who are gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference. Again, the issue is the multiple fsyncs that NFS requires, and likely the serialization of those iscsi requests. Apparently, there is a basic latency in iscsi that one could improve upon with FC, but we are definitely in the all ethernet/iscsi camp for multi-building storage pool growth and don't have interest in a FC-based SAN. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
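Since the pattern in these threads is essentially file size divided by per-commit latency, it is worth measuring the raw round trip to the iSCSI target to see how much of the budget the network alone consumes. A rough sketch (the hostname is a placeholder):

# 20 ICMP round trips with an 8k payload to the iSCSI target
/usr/sbin/ping -s iscsi-target 8192 20

If the average RTT already sits at a couple of milliseconds, then three synchronous commits per 8k file can never add up to much more than a few hundred KB/sec, no matter how fast the disks behind the target are.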
Re: [zfs-discuss] ZFS on 32bit x86
I guess the only hope is to find pin-compatible Xeons that are 64bit to replace what is a large chassis with 24 slots of disks that has specific motherboard form-factor, etc. We have 6 of these things from a government grant that must be used for the stated purpose. So, yes, we can buy product, but we simply can't get rid of the old equipment designed for this purpose. Again, government auditors for the research will say pick the solution for the hardware has donated/purchased with grant funds -- Welcome to the world of research. If Sun wants to _give_ us lots of hardware to make ZFS shine, great. But as is usually the case, I got to make do with what I have. For the vast majority of the storage, I'm running them as iscsi targets with a single sun ultra20 as the frontend, but as from other messages in the list, the iscsi route along with NFS has its own pitfalls, 32bit or 64bit :) On 6/22/06, Erik Trimble [EMAIL PROTECTED] wrote: AMD Geodes are 32-bit only. I haven't heard any mention that they will _ever_ be 64-bit. But, honestly, this and the Via chip aren't really ever going to be targets for Solaris. That is, they simply aren't (any substantial) part of the audience we're trying to reach with Solaris x86. Also, relatively few 32-bit x86 systems can take 4GB. While many of the late-model P4 (and all Xeons since the P3 Xeon) chips have the capability, most of them were married to chipsets which can't take more than 4GB. On the AMD side, I'm pretty sure only the Athlon MP-series was enabled for PAE, and only a tiny amount of them were sold. So, basically, the problem boils down to those with Xeons, a few single-socket P4s, and some of this-year's Pentium Ds. Granted, this makes up most of the x86 server market. So, yes, it _would_ be nice to be able to dump a tuning parameter into /etc/system to fix the cache starvation (and other related 4GB RAM) problems. However, I have to say that working with PAE is messy, and, honestly, 64-bit enabled 1U/3U servers are dirt cheap now. So, while I empathize with the market that has severe purchasing constraints, I think it's entirely reasonable to be up front about needing a 64-bit processor for ZFS, _if_ we've explored expanding the 32-bit environment, and discovered it was too expensive (in resources required) to fix. Dell (arrggh! Not THEM!) sells PowerEdge servers with plenty of PCI slots and RAM, and 64-bit CPUs for around $1000 now. Hell, WE sell dual-core x2100s for under $2k. I'm sure one can pick up a whitebox single-core Opteron for around $1k. That's not unreasonable to ask to get the latest technology. -Erik ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/22/06, Bill Moore [EMAIL PROTECTED] wrote: Hey Joe. We're working on some ZFS changes in this area, and if you could run an experiment for us, that would be great. Just do this: echo 'zil_disable/W1' | mdb -kw We're working on some fixes to the ZIL so it won't be a bottleneck when fsyncs come around. The above command will let us know what kind of improvement is on the table. After our fixes you could get from 30-80% of that improvement, but this would be a good data point. This change makes ZFS ignore the iSCSI/NFS fsync requests, but we still push out a txg every 5 seconds. So at most, your disk will be 5 seconds out of date compared to what it should be. It's a pretty small window, but it all depends on your appetite for such windows. :) After running the above command, you'll need to unmount/mount the filesystem in order for the change to take effect. If you don't have time, no big deal. --Bill On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote: On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote: a test against the same iscsi targets using linux and XFS and the NFS server implementation there gave me 1.25MB/sec writes. I was about to throw in the towel and deem ZFS/NFS has unusable until B41 came along and at least gave me 1.25MB/sec. That's still super slow -- is this over a 10Mb link or something? Jeff Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the iscsi target, and single gig-e link (nge) to the NFS clients, who are gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference. Again, the issue is the multiple fsyncs that NFS requires, and likely the serialization of those iscsi requests. Apparently, there is a basic latency in iscsi that one could improve upon with FC, but we are definitely in the all ethernet/iscsi camp for multi-building storage pool growth and don't have interest in a FC-based SAN. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Well, following Bill's advice and the previous note on disabling zil, I ran my test on a B38 opteron initiator and if you do a time on the copy from the client, 6250 8k files transfer at 6MB/sec now. If you watch the entire commit on the backend using zpool iostat 1 I see that it takes a few more seconds, and the actual rate there is 4MB/sec. Beats my best of 1.25MB/sec, and this is not B41. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
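For anyone who wants the same experiment to survive a reboot instead of re-running mdb each time: the variable being poked above is a global in the zfs module, so it can also be set from /etc/system. The same caveat applies: this discards the synchronous-write guarantees NFS clients rely on, so it is a diagnostic, not a fix.

* /etc/system equivalent of the mdb poke above (diagnostic use only)
set zfs:zil_disable=1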
[zfs-discuss] zfs going out to lunch
I've been writing via tar to a pool some stuff from backup, around 500GB. Its taken quite a while as the tar is being read from NFS. My ZFS partition in this case is a RAIDZ 3-disk job using 3 400GB SATA drives (sil3124 card) Ever once in a while, a df stalls and during that time my io's go flat, as in : capacity operationsbandwidth pool used avail read write read write -- - - - - - - pool 571G 545G 0 34 1017 2.25M pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 0 0 0 0 pool 571G 545G 48176 82.8K 5.31M pool 571G 545G 48313 283K 26.0M pool 571G 545G299130 1.05M 4.05M pool 571G 545G163160 932K 4.70M pool 571G 545G320 0 1.02M 0 Is this an ARC issue or some sort of flush happening. This is B40. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
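One way to see what the stalled df (and the pool) are waiting on during the flat stretches is to grab kernel stacks and txg sync times while it is hung. A sketch, assuming the usual mdb dcmds and fbt probes resolve on this build:

# kernel stack of the hung df
echo "::pgrep df | ::walk thread | ::findstack -v" | mdb -k

# distribution of txg sync times, in milliseconds
dtrace -n 'fbt::spa_sync:entry { self->t = timestamp; }
    fbt::spa_sync:return /self->t/ {
        @["spa_sync ms"] = quantize((timestamp - self->t) / 1000000);
        self->t = 0;
    }'

If the df stack ends up under the ZFS sync/txg wait path and spa_sync times stretch into tens of seconds, the stall is the pool catching up on a large dirty backlog rather than a hardware fault.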
Re: [zfs-discuss] ZFS performance metric/cookbook/whitepaper
Please add to the list the differences between locally and remotely attached vdevs: FC, SCSI/SATA, or iSCSI. This is the part that is troubling me most, as there are wildly different performance characteristics when you use NFS with any of these backends under the various configs of ZFS. Another thing is when and where cache should or should not be used on backend RAID devices (the RAID vs JBOD point made already). The wild difference is between small and large file writes, and how the backend can go from 10's of MB/sec to 10's of KB/sec. Really.

On 6/1/06, Erik Trimble [EMAIL PROTECTED] wrote: Maybe the best thing here is to have us (i.e. the people on this list) come up with a set of standard and expected use cases, and have the ZFS team tell us what the relative performance/tradeoffs are. I mean, rather than us just asking about a bunch of specific cases, a good whitepaper Best Practices / Cookbook for ZFS would be nice. For instance: compare UFS/Solaris Volume Manager against ZFS in:
[random|sequential][small|large][read|write]
on UFS/SVM: Raid-1, Raid-5, Raid 0+1
ZFS: RaidZ, Mirrors
Relative performance of HW RAID vs JBOD, e.g. 3510FC w/ RAID using ZFS vs 3510FC JBOD using ZFS
I know a bunch of this has been discussed before (and I've read most of it :-), but collecting it in one place and filling out the actual analysis would be Really Nice. -- Erik Trimble Java System Support Mailstop: usca14-102 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] cluster features
Well, here's my previous summary, sent off-list to various Solaris folk (regarding NFS serving via ZFS and iSCSI):

I want to use ZFS as a NAS with no bounds on the backing hardware (not restricted to one box's capacity). Thus, there are two options: FC SAN or iSCSI. In my case, I have multi-building considerations and 10Gb Ethernet layer-2 interconnects that make iSCSI ideal. Our standard users use NAS for everything from collections of many small files to many large files (source code repositories, simulations, CAD tools, VM images, rendering meta-forms, and final results). Ideally, to allow for ongoing growth and drive replacement across multiple iSCSI targets, RAIDZ was selected over static hardware RAID solutions. This setup is very similar to a gfiler (iSCSI based) or otherwise a standard NetApp Filer product, and it would appear that Sun is targeting this solution. I need this setup for both Tier-1 primary NAS storage and disk-to-disk Tier-2 backup.

In my extensive testing (not so much benchmarking, and definitely without the time/focus to learn dtrace and the like), we have found that ZFS can be used for a tier-2 system but not for tier 1, due to pathologically poor performance via NFS against a ZFS filesystem based on RAIDZ over non-local storage. We have extremely poor but more acceptable performance using a non-RAIDZ configuration. Only in the case of an expensive FC-SAN implementation would it appear that ZFS is workable. If this is the only workable solution, then ZFS has lost its benefits over NetApp, as we approach the same costs but do not have the same current maturity. Is it a lost cause? Honestly, I need to be convinced that this is workable, and so far the proposed alternatives have been shot down.

Evidence? The final synthetic test used was to generate a directory of 6250 random 8k files. On an NFS client (Solaris, Linux, or even loop-back on the server itself), run cp -r SRCDIR DESTDIR where DESTDIR is on the NFS server. Averages from memory:

FS    iSCSI backend            Rate
XFS   1.5TB single LUN         ~1-1.1MB/sec
ZFS   1.5TB single LUN         ~250-400KB/sec
ZFS   1.5TB RAIDZ (8 disks)    ~25KB/sec

In the case of mixed-size files with predominantly small files above and below 8K, I see the XFS solution jump to an average of 2.5-3MB/sec. The ZFS store over a single LUN stays within 200-420KB/sec, and the RAIDZ ranges from 16-40KB/sec. Likely caching and some dynamic behaviours cause ZFS to get worse with mixed sizing, whereas XFS and the like improve. Finally, by switching to SMB and not using NFS, I can maintain rates of over 3MB/sec. Large files over NFS get more reasonable performance (14MB-28MB/sec) on any given ZFS backend, and I get 30+MB/sec locally with spikes close to 100MB/sec when writing locally. I can only maximize performance on my ZFS backend if I use a blocksize (tests using dd) of 256K or greater. 128K seems to provide lower overall data rates, and I believe this is the default when I use cp, rsync, or other commands locally.

In summary, I can make my ZFS-based initiator an NFS client, or otherwise use rsyncd, to ameliorate the pathological NFS server performance of the ZFS combination. I can then serve files fine. This allows us to move forward with a Tier-2-only solution.

If _anything_ can be done to address NFS and its interactions with ZFS, and bring it close to 1MB/sec performance (these are gig-e interconnects after all, think about it), then it will only be 1/10th the performance of a NetApp in this worst-case scenario, and it will perform similarly to the NetApp, if not better, in other cases. The NetApp can do around 10MB/sec in the scenario I'm depicting. Currently, we have around 1/20th to 1/30th of that performance level when not using RAIDZ, and 1/200th using RAIDZ.

I just can't quite understand it: locally, a cp -p TESTDIR DESTDIR of 50MB of small files returns to the prompt in an instant, with zpool iostat showing the writes being committed over the next 3-6 seconds, and this is OK for on-disk consistency. But then, for some reason, the NFS client can't commit in a similar fashion, with Solaris saying yes, we got it, here's confirmation.. next, just as it does locally. The data definitely gets there at the same speed, as my tests with remote iSCSI pools and as an NFS client show. My naive sense is that this should be addressable at some level without inducing corruption. I have a feeling that it's somehow being overly conservative in this stance.

On 5/30/06, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Joe, Wednesday, May 31, 2006, 12:44:22 AM, you wrote: JL Well, I would caution at this point against the iscsi backend if you JL are planning on using NFS. We took a long winded conversation online JL and have yet to return to this list, but the gist of it is that the JL latency of iscsi along with the tendency for NFS to fsync 3 times per JL write causes performance to drop
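For anyone who wants to reproduce the synthetic test described above, a generator along these lines is roughly equivalent; the exact tool was not posted, so the paths and script are illustrative only:

#!/bin/sh
# create 6250 files of 8k of random data, about 50MB in total
SRC=/var/tmp/smallfiles
mkdir -p $SRC
i=0
while [ $i -lt 6250 ]; do
    dd if=/dev/urandom of=$SRC/f$i bs=8k count=1 2>/dev/null
    i=`expr $i + 1`
done

# then time the copy onto the NFS-mounted ZFS filesystem
time cp -r $SRC /mnt/zfsnas/DESTDIR

At 8KB per file, the throughput figures above translate directly into elapsed time: ~25KB/sec means well over half an hour for the 50MB copy, while ~1MB/sec finishes in under a minute.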
Re: [zfs-discuss] cluster features
Well, I would caution at this point against the iscsi backend if you are planning on using NFS. We took a long winded conversation online and have yet to return to this list, but the gist of it is that the latency of iscsi along with the tendency for NFS to fsync 3 times per write causes performance to drop dramatically, and it gets much worse for a RAIDZ config. If you want to go this route, FC is a current suggested requirement. On 5/30/06, Eric Schrock [EMAIL PROTECTED] wrote: On Tue, May 30, 2006 at 03:55:09AM -0700, Ernst Rohlicek jun. wrote: Hello list, I've read about your fascinating new fs implementation, ZFS. I've seen alot - nbd, lvm, evms, pvfs2, gfs, ocfs - and I have to say: I'm quite impressed! I'd set up a few of my boxes to OpenSolaris for storage (using Linux and lvm right now - offers pooling, but no built-in fault-tolerance) if ZFS had one feature: Use of more than one machine - currently, as I understand it, if disks fail, no problem, but if the server machine fails, ... I read in your FAQ that cluster features are on the way and wanted to ask what's the status here :-) BTW I recently read about a filesystem, which has a pretty good cluster architecture, called Google File System. The article on the English Wikipedia has a good overview, a link to the detailed papers and a ZDNet interview about it. I just wanted to point that out to you, maybe some of its design / architecture is useful in ZFS's cluster mode. For cross-machine tolerance, it should be possible (once the iSCSI target is integrated) to create ZFS-backed iSCSI targets and then use RAID-Z from a single host across machines. This is not a true clustered filesystem, as it has a single point of access, but it does get you beyond the 'single node = dataloss' mode of failure. As for the true clustered filesystem, we're still gathering requirements. We have some ideas in the pipeline, and it's definitely a direction in which we are headed, but there's not much to say at this point. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Well, here's my first pass result:

[EMAIL PROTECTED] loges1]# time tar xf /root/linux-2.2.26.tar
real    114m6.662s
user    0m0.049s
sys     0m1.354s

On 5/11/06, Roch Bourbonnais - Performance Engineering [EMAIL PROTECTED] wrote: Joe Little writes: How did you get the average time for async writes? My client (lacking ptime, it's Linux) comes in at 50 minutes, not 50 seconds. I'm running again right now for a more accurate number. I'm untarring from a local file on the directory to the NFS share. I used dtrace to measure times (I used the sleep time so it gives a ballpark figure). I untarred with the tar file on the NFS share. Just retimed after moving the tar file to /tmp.

# ptime tar xf /tmp/linux-2.2.22.tar
real       49.630
user        1.033
sys        11.405

-r On 5/11/06, Roch Bourbonnais - Performance Engineering [EMAIL PROTECTED] wrote:

# ptime tar xf linux-2.2.22.tar
real       50.292
user        1.019
sys        11.417
# ptime tar xf linux-2.2.22.tar
real       56.833
user        1.056
sys        11.581

# avg time waiting for async writes is around 3ms. How much are you getting for the tar xf ? -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
I was asked to also snoop the iscsi end of things, trying to find something different between the two. iscsi being relatively opaque, it was easiest to find differences in the patterns. In the local copy to RAIDZ example, the iscsi link would show packets of 1514 in length in series of 5-10, with interjecting packets of 60 or 102, generally 2-4 in number. In the NFS client hitting the RAIDZ/iscsi combo, the iscsi length would have 3-5 on average 1514 length packets with 5-7 packets of 60 or 102 in between. Basically, the averages swapped, and its likely because of a lot more meta data and/or write confirmations going on in the NFS case. At this point in time, I have two very important questions: 1) Is there any options available or planned to make NFS/ZFS work more in concert to avoid this overhead, which with many small iscsi packets (in the iscsi case) kills performance? 2) Is iscsi-backed storage, especially StorageTek acquired products, in the planning matrix for supported ZFS (NAS) solutions? Also, why hasn't this combination been tested to date since this appears to be an achilles heal. Again, UFS does not have this problem, nor other file systems on other OSes (namely, XFS, JFS, etc which I've tested before) On 5/8/06, Nicolas Williams [EMAIL PROTECTED] wrote: On Fri, May 05, 2006 at 11:55:17PM -0500, Spencer Shepler wrote: On Fri, Joe Little wrote: Thanks. I'm playing with it now, trying to get the most succinct test. This is one thing that bothers me: Regardless of the backend, it appears that a delete of a large tree (say the linux kernel) over NFS takes forever, but its immediate when doing so locally. Is delete over NFS really take such a different code path? Yes. As mentioned in my other email, the NFS protocol requires that operations like REMOVE, RMDIR, CREATE have the filesystem metadata written to stable storage/disk before sending a response to the client. That is not required of local access and therefore the disparity between the two. So then multi-threading rm/rmdir on the client-side would help, no? Are there/should there be async versions of creat(2)/mkdir(2)/ rmdir(2)/link(2)/unlink(2)/...? Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
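For the record, the packet-pattern comparison above can be captured with snoop on the initiator. A sketch, assuming e1000g0 is the interface facing the target and the target uses the default iSCSI port:

# capture iSCSI traffic (TCP port 3260), headers only, to a file
snoop -d e1000g0 -s 128 -o /tmp/iscsi.cap port 3260

# then summarize frame sizes from the capture
snoop -i /tmp/iscsi.cap -V | grep ETHER

Running one capture during a local copy and one during the NFS copy makes the swap in large-vs-small frame ratios described above easy to count.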
Re: [zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Thanks for the tip. In the local case, I could send to the iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time of 50seconds (17 seconds better than UFS). However, I didn't even both finishing the NFS client test, since it was taking a few seconds between multiple 27K files. So, it didn't help NFS at all. I'm wondering if there is something on the NFS end that needs changing, no? Also, how would one easily script the mdb command below to make permanent? On 5/5/06, Eric Schrock [EMAIL PROTECTED] wrote: My gut feeling is that somehow the DKIOCFLUSHWRITECACHE ioctls (which translate to the SCSI flush write cace requests) are throwing iSCSI for a loop. We've exposed a number of bugs in our drivers because ZFS is the first filesystem to actually care to issue this request. To turn this off, you can try: # mdb -kw ::walk spa | ::print spa_t spa_root_vdev | ::vdev -r ADDR STATE AUX DESCRIPTION 82dc16c0 HEALTHY -root 82dc0640 HEALTHY - /dev/dsk/c0d0s0 82dc0640::print -a vdev_t vdev_nowritecache 82dc0af8 vdev_nowritecache = 0 (B_FALSE) 82dc0af8/W1 0x82dc0af8: 0 = 0x1 See if that makes a difference. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
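On the last question: the inspection half scripts easily, but the write half does not turn into a clean permanent setting, because the vdev_t address printed by ::vdev -r changes on every boot and import. A sketch of how it could be wrapped anyway, with the address deliberately left as a placeholder to paste in by hand from the interactive session above:

#!/bin/sh
# dump the vdev tree so the leaf vdev address can be located
echo "::walk spa | ::print spa_t spa_root_vdev | ::vdev -r" | mdb -k

# ADDR is NOT stable across boots; paste the vdev_nowritecache address
# found via ::print -a vdev_t vdev_nowritecache, as shown above
ADDR=0x82dc0af8
echo "$ADDR/W1" | mdb -kw

Because the address moves, this is really a per-boot diagnostic rather than something to bake into an rc script, and it carries the same data-integrity caveat Eric mentions.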
Re: [zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Thanks. I'm playing with it now, trying to get the most succinct test. This is one thing that bothers me: Regardless of the backend, it appears that a delete of a large tree (say the linux kernel) over NFS takes forever, but its immediate when doing so locally. Is delete over NFS really take such a different code path? On 5/5/06, Lisa Week [EMAIL PROTECTED] wrote: These may help: http://opensolaris.org/os/community/dtrace/scripts/ Check out iosnoop.d http://www.solarisinternals.com/si/dtrace/index.php Check out iotrace.d - Lisa Joe Little wrote On 05/05/06 18:59,: Are there known i/o or iscsi dtrace scripts available? On 5/5/06, Spencer Shepler [EMAIL PROTECTED] wrote: On Fri, Joe Little wrote: On 5/5/06, Eric Schrock [EMAIL PROTECTED] wrote: On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote: Thanks for the tip. In the local case, I could send to the iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time of 50seconds (17 seconds better than UFS). However, I didn't even both finishing the NFS client test, since it was taking a few seconds between multiple 27K files. So, it didn't help NFS at all. I'm wondering if there is something on the NFS end that needs changing, no? Keep in mind that turning off this flag may corrupt on-disk state in the event of power loss, etc. What was the delta in the local case? 17 seconds better than UFS, but percentage wise how much faster than the original? I believe it was only about 5-10% faster. I don't have the time results off hand, just some dtrace latency reports. NFS has the property that it does an enormous amount of synchronous activity, which can tickle interesting pathologies. But it's strange that it didn't help NFS that much. Should I also mount via async.. would this be honored on the Solaris end? The other option mentioned with similar caveats was nocto. I just tried with both, and the observed transfer rate was about 1.4k/s. It was painful deleting the 3G directory via NFS, with about 100k/s deletion rate on these 1000 files. Of course, When I went locally the delete was instantaneous. I wouldn't change any of the options at the client. The issue is at the server side and none of the other combinations that you originally pointed out have this problem, right? Mount options at the client will just muddy the waters. We need to understand if/what the NFS/ZFS/iscsi interaction is and why it is so much worse. As Eric mentioned, there may be some interesting pathologies at play here and we need to understand what they are so they can be addressed. My suggestion is additional dtrace data collection but I don't have a specific suggestion as to how/what to track next. Because of the significant additional latency, I would be looking for big increases in the number of I/Os being generated to the iscsi backend as compared to the local attached case. I would also look for some type of serialization of I/Os that is occurring with iscsi vs. the local attach. Spencer ___ nfs-discuss mailing list [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Poor directory traversal or small file performance?
I just responsed to the NFS list, and it definitely looks like a bad interaction between NFS-ZFS-iSCSI, where as the first two (local disk for ZFS) or the last two (no ZFS) are very fast. Are there posted zfs dtrace scripts for observability of i/o? On 5/4/06, Neil Perrin [EMAIL PROTECTED] wrote: Actually the nfs slowness could be caused by the bug below, but it doesn't explain the find . times on a local zfs. Neil Perrin wrote On 05/04/06 21:01,: Was this a 32 bit intel system by chance? If so this is quite likely caused by: 6413731 pathologically slower fsync on 32 bit systems This was fixed in snv_39. Joe Little wrote On 05/04/06 15:47,: I've been writing to the Solaris NFS list since I was getting some bad performance copying via NFS (noticeably there) a large set of small files. We have various source trees, including a tree with many linux versions that I was copying to my ZFS NAS-to-be. On large files, it flies pretty well, and zpool iostat 1 shows interesting patterns of writes in the low k's up to 102MB/sec and down again as buffered segments apparently are synced. However, in the numerous small file case, we see consistently only transfers in the low k's per second. First, to give some background, we are utilizing iscsi, with the backend made up a directly exposed SATA disks via the target. I've put them in a 8 disk raidz: pool: poola0 state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM poola0 ONLINE 0 0 0 raidz ONLINE 0 0 0 c2t1d0 ONLINE 0 0 0 c2t2d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 c2t4d0 ONLINE 0 0 0 c2t5d0 ONLINE 0 0 0 c2t6d0 ONLINE 0 0 0 c2t7d0 ONLINE 0 0 0 c2t8d0 ONLINE 0 0 0 Again, I can get some great numbers on large files (doing a dd with a large blocksize screams!), but as a test, I took a problematic tree of around 1 million files, and walked it with a find/ls: bash-3.00# time find . \! -name .* | wc -l 987423 real53m52.285s user0m2.624s sys 0m27.980s That was local to the system, and not even NFS. The original files, located on a EXT3 RAID50, accessed via a linux client (NFS v3): [EMAIL PROTECTED] old-servers]# time find . \! -name .* | wc -l 987423 real1m4.255s user0m0.914s sys 0m6.976s Woe.. Something just isn't right here. Are there explicit ways I can find out what's wrong with my setup? This is from a dtrace/zdb/mdb neophyte. All I have been tracking with are zpool iostats. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Neil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
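On the observability question: even without the full iosnoop/iotrace scripts referenced elsewhere in these threads, the stock DTrace io provider gives a quick first look at how many physical I/Os each test turns into. A sketch to run while repeating the find (or the NFS copy/remove):

# count physical I/Os per device and direction
dtrace -n 'io:::start {
    @[args[1]->dev_statname,
      args[0]->b_flags & B_READ ? "read" : "write"] = count();
}'

Comparing the counts for the local find against the same find over NFS (or against the iSCSI-backed pool) shows directly whether the slow case is issuing many more, smaller I/Os rather than simply slower ones.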