Re: [zfs-discuss] ZFS, ESX ,and NFS. oh my!
Scott Meilicke wrote: Obviously iSCSI and NFS are quite different at the storage level, and I actually like NFS for the flexibility over iSCSI (quotas, reservations, etc.) Another key difference between them is that with iSCSI, the VMFS filesystem (built on the zvol presented as a block device) never frees up unused disk space. Once ESX has written to a block on that zvol, it will always be taking up space in your zpool, even if you delete the .vmdk file that contains it. The zvol has no idea that the block is not used any more. With NFS, ZFS is aware that the file is deleted, and can deallocate those blocks. This would be less of an issue if we had deduplication on the zpool (have ESX write blocks of all-0 and those would be deduped down to a single block) or if there was some way (like the SSD TRIM command) for the VMFS filesystem to tell the block device that a block is no longer used. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
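One partial workaround (a sketch, not something VMware or ZFS automates; the fill-file path is arbitrary) is to overwrite the guest's free space with zeros and then delete the fill file. With compression -- or a future dedup -- enabled on the pool backing the zvol, those all-zero blocks shrink to nearly nothing on the ZFS side:

    # Run inside the guest OS, on the filesystem that lives on the VMFS disk
    dd if=/dev/zero of=/var/tmp/zerofill bs=1M   # runs until the filesystem is full
    sync
    rm /var/tmp/zerofill

VMFS still thinks that space is in use, but the pool no longer has to store unique data for it.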
Re: [zfs-discuss] Monitoring ZFS host memory use
Carson Gaspar wrote: Not true. The script is simply not intelligent enough. There are really 3 broad kinds of RAM usage: A) Unused B) Unfreeable by the kernel (normal process memory) C) Freeable by the kernel (buffer cache, ARC, etc.) Monitoring should usually focus on keeping (A+C) above some threshold. On Solaris, this means parsing some rather obscure kstats, sadly (not that Linux's /proc/meminfo is much better). B) can still be freed, but only by moving pages out to spinning rust. There's a subset of B (call it B1) that is the active processes' working sets, which are basically useless to swap out, since they'll be swapped right back in again. Two other important types of RAM usage in many modern situations: D) Unpageable (pinned) memory E) Memory that is presented to the OS but that is thin-provisioned by a hypervisor or other virtualization layer (use of this memory may mean that the hypervisor moves pages to spinning rust). For virtualized systems, you should limit the size of A+B1+C so that it does not get into memory E. There's no point in having data in the ARC if the hypervisor has to go to disk to get it. Considering that the size of E is dependent on the memory demands on the host server (which the guest has no insight into), this is a Very Hard problem. Often this is arranged by having the hypervisor break the virtualization containment via a memory management driver (VMware Tools provides a memory control driver, for example) which steals pages of guest physical memory to avoid the hypervisor swapping. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
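For the Solaris side, a minimal sketch of the kind of kstat parsing involved (these are the usual kstat names, but verify them against your release):

    # Free memory (category A), reported in pages:
    kstat -p unix:0:system_pages:freemem
    # ARC size (usually the bulk of category C), in bytes:
    kstat -p zfs:0:arcstats:size
    # Page size, to convert pages to bytes:
    pagesize

Comparing freemem plus the ARC size against a threshold is roughly the (A+C) check described above.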
Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()
Joerg Schilling wrote: James Andrewartha jam...@daa.com.au wrote: Recently there's been discussion [1] in the Linux community about how filesystems should deal with rename(2), particularly in the case of a crash. ext4 was found to truncate files after a crash that had been written with open(foo.tmp), write(), close() and then rename(foo.tmp, foo). This is because ext4 uses delayed allocation and may not write the contents to disk immediately, but commits metadata changes quite frequently. So when rename(foo.tmp, foo) is committed to disk, it has a length of zero, which is later updated when the data is written to disk. This means that after a crash, foo is zero-length, and both the new and the old data have been lost, which is undesirable. This doesn't happen with ext3's default settings because ext3 writes data to disk before metadata (which has performance problems, see Firefox 3 and fsync [2]). Ted Ts'o's (the main author of ext3 and ext4) response is that applications which perform open(), write(), close(), rename() in the expectation that they will either get the old data or the new data, but not no data at all, are broken, and should instead call open(), write(), fsync(), close(), rename(). The only guaranteed way to have the new file in a stable state on the disk is to call: f = open(new, O_WRONLY|O_CREAT|O_TRUNC, 0666); write(f, dat, size); fsync(f); close(f); AFAIUI, the ZFS transaction group maintains write ordering, at least as far as the write()s to the file would be in the ZIL ahead of the rename() metadata updates. So I think the atomicity is maintained without requiring the application to call fsync() before closing the file. If the TXG is applied and the rename() is included, then the file writes have been too, so foo would have the new contents. If the TXG containing the rename() isn't complete and on the ZIL device at crash time, foo would have the old contents. POSIX doesn't require the OS to sync() the file contents on close for local files like it does for NFS access? How odd. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Nexsan SATABeast and ZFS
Lars-Gunnar Persson wrote: I would like to go back to my question for a second: I checked with my Nexsan supplier and they confirmed that access to every single disk in the SATABeast is not possible. The smallest entities I can create on the SATABeast are RAID 0 or 1 arrays. With RAID 1 I'll lose too much disk space, and I believe that leaves me with RAID 0 as the only reasonable option. But with this insecure RAID format I'll need higher redundancy in the ZFS configuration. I think I'll go with the following configuration: On the Nexsan SATABeast:
* 14 disks configured in 7 RAID arrays with RAID level 0 (each disk is 1 TB, which gives me a total of 14 TB raw disk space).
* Each RAID 0 array configured as one volume. So what the front end will see is 7 disks, 2 TB each.
On the Sun Fire X4100 M2 with Solaris 10:
* Add all 7 volumes to one zpool configured as one raidz2 (gives me approx. 8.8 TB available disk space).
You'll get 5 LUNs' worth of space in this config, or 10 TB of usable space. Any comments or suggestions? Given the hardware constraints (no single-disk volumes allowed) this is a good configuration for most purposes. The advantages/disadvantages are:
* 10 TB of usable disk space, out of 14 TB purchased.
* At least three hard disk failures are required to lose the ZFS pool.
* Random non-cached read performance will be about 300 IO/sec.
* Sequential reads and writes of a whole ZFS blocksize will be fast (up to 2000 IO/sec).
* One hard drive failure will cause the used blocks of the 2 TB LUN (RAID 0 pair) to be resilvered, even though the other half of the pair is not damaged. The other half of the pair is more likely to fail during the ZFS resilvering operation because of the increased load.
You'll want to pay special attention to the cache settings on the Nexsan. You earlier showed that the write cache is enabled, but IIRC the array doesn't have a nonvolatile (battery-backed) cache. If that's the case, MAKE SURE it's hooked up to a UPS that can support it for the 30-second cache flush timeout on the array, and make sure you don't power it down hard. I think you want to uncheck the ignore FUA setting, so that FUA requests are respected. My guess is that this will cause the array to properly handle the cache-flush requests that ZFS uses to ensure data consistency. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
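For reference, a sketch of the pool creation for that layout; the c#t#d# names below are placeholders for however the seven Nexsan LUNs show up on the X4100:

    # One raidz2 vdev across the seven 2 TB RAID 0 LUNs
    zpool create tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0
    zpool status tank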
Re: [zfs-discuss] Nexsan SATABeast and ZFS
Bob Friesenhahn wrote: Your idea to stripe two disks per LUN should work. Make sure to use raidz2 rather than plain raidz for the extra reliability. This solution is optimized for high data throughput from one user. Striping two disks per LUN (RAID0 on 2 disks) and then adding a ZFS form of redundancy (either mirror or raidz[2]) would be an efficient use of space. There would be no additional space overhead caused by running that way. Note, however, that if you do this, ZFS must resilver the larger LUN in the event of a single disk failure on the backend. This means a longer time to rebuild, and a lot of extra work on the other (non-failed) half of the RAID0 stripe. An alternative is to create individual RAID 0 LUNs which actually only contain a single disk. This is certainly preferable, since the unit of failure at the hardware level corresponds to the unit of resilvering at the ZFS level. And at least on my Nexsan SATAboy(2f) this configuration is possible. Then implement the pool as two raidz2s with six LUNs each, and two hot spares. That would be my own preference. Due to ZFS's load sharing this should provide better performance (perhaps 2X) for multi-user loads. Some testing may be required to make sure that your hardware is happy with this. I disagree with this suggestion. With this config, you only get 8 disks worth of storage out of the 14, which is a ~42% overhead. In order to lose data in this scenario, 3 disks would have to fail out of a single 6-disk group before zfs is able to resilver any of them to the hot spares. That seems (to me) a lot more redundancy than is needed. As far as workload, any time you use RAIDZ[2], ZFS must read the entire stripe (across all of the disks) in order to verify the checksum for that data block. This means that a 128k read (the default zfs blocksize) requires a 32kb read from each of 6 disks, which may include a relatively slow seek to the relevant part of the spinning rust. So for random I/O, even though the data is striped across all the disks, you will see only a single disk's worth of throughput. For sequential I/O, you'll see the full RAID set's worth of throughput. If you are expecting a non-sequential workload, you would be better off taking the 50% storage overhead to do ZFS mirroring. Avoid RAID5 if you can because it is not as reliable with today's large disks, and the resulting huge LUN size can take a long time to resilver if the RAID5 should fail (or be considered to have failed). Here's a place where ZFS shines: it doesn't resilver the whole disk, just the data blocks, so it doesn't have to read the full array to rebuild a failed disk and is less likely to cause a subsequent failure during the rebuild. My $.02. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
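If the array can be convinced to export single-disk LUNs, the mirrored alternative I'm describing would look roughly like this (device names are placeholders for the 14 LUNs):

    # Seven 2-way mirrors; resilvering a failed disk only touches its partner
    zpool create tank \
      mirror c3t0d0 c3t7d0 mirror c3t1d0 c3t8d0 mirror c3t2d0 c3t9d0 \
      mirror c3t3d0 c3t10d0 mirror c3t4d0 c3t11d0 mirror c3t5d0 c3t12d0 \
      mirror c3t6d0 c3t13d0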
Re: [zfs-discuss] zfs streams data corruption
Miles Nordin wrote: that SQLite2 should be equally as tolerant of snapshot backups as it is of cord-yanking. The special backup features of databases including ``performing a checkpoint'' or whatever, are for systems incapable of snapshots, which is most of them. Snapshots are not writeable, so this ``in the middle of a write'' stuff just does not happen. This is correct. The general term for these sorts of point-in-time backups is crash-consistent. If the database can be recovered easily (and/or automatically) from pulling the plug (or a kill -9), then a snapshot is an instant backup of that database. In-flight transactions (ones that have not been committed) at the database level are rolled back. Applications using the database will be confused by this in a recovery scenario, since transactions that were reported as committed are gone when the database comes back. But that's the case any time a database moves backward in time. Of course Toby rightly pointed out this claim does not apply if you take a host snapshot of a virtual disk, inside which a database is running on the VM guest---that implicates several pieces of untrustworthy stacked software. But for snapshotting SQLite2 to clone the currently-running machine I think the claim does apply, no? Snapshots of a virtual disk are also crash-consistent. If the VM has not committed its transactionally-committed data and is still holding it in volatile memory, that VM is not maintaining its ACID requirements, and that's a bug in either the database or in the OS running on the VM. The snapshot represents the disk state as if the VM were instantly gone. If the VM or the database can't recover from pulling the virtual plug, the snapshot can't help that. That said, it is a good idea to quiesce the software stack as much as possible to make the recovery from the crash-consistent image as painless as possible. For example, if you take a snapshot of a VM running on an EXT2 filesystem (or unlogged UFS for that matter) the recovery will require an fsck of that filesystem to ensure that the filesystem structure is consistent. Performing a lockfs on the filesystem while the snapshot is taken could mitigate that, but that's still out of the scope of the ZFS snapshot. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
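As a sketch of that quiescing step (assuming a Solaris guest with a UFS filesystem on a virtual disk backed by a hypothetical dataset tank/vms):

    # Inside the guest: flush dirty data and write-lock the filesystem
    lockfs -w /export/data
    # On the ZFS host: take the (atomic) snapshot of the backing dataset
    zfs snapshot tank/vms@quiesced-backup
    # Inside the guest again: release the write lock
    lockfs -u /export/data

The guest mountpoint and dataset names are made up for the example; the point is only that the guest stops writing for the instant the snapshot is taken, so its fsck (or log replay) on restore is trivial.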
Re: [zfs-discuss] ZFS: unreliable for professional usage?
Mario Goebbels wrote: One thing I'd like to see is an _easy_ option to fall back onto older uberblocks when the zpool went belly up for a silly reason. Something that doesn't involve esoteric parameters supplied to zdb. Between uberblock updates, there may be many write operations to a data file, each requiring a copy on write operation. Some of those operations may reuse blocks that were metadata blocks pointed to by the previous uberblock. In which case the old uberblock points to a metadata tree full of garbage. Jeff, you must have some idea on how to overcome this in your bugfix, would you care to share? --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
Ross wrote: The problem is they might publish these numbers, but we really have no way of controlling what number manufacturers will choose to use in the future. If for some reason future 500GB drives all turn out to be slightly smaller than the current ones you're going to be stuck. Reserving 1-2% of space in exchange for greater flexibility in replacing drives sounds like a good idea to me. As others have said, RAID controllers have been doing this for long enough that even the very basic models do it now, and I don't understand why such simple features like this would be left out of ZFS. It would certainly be terrible to go back to the days where 5% of the filesystem space is inaccessible to users, and force the sysadmin to manually change that percentage to 0 to get full use of the disk. Oh wait, UFS still does that, and it's a configurable parameter at mkfs time (and can be tuned on the fly). For a ZFS pool, (until block pointer rewrite capability arrives) this would have to be a pool-create-time parameter. Perhaps a --usable-size=N[%] option which would either cut down the size of the EFI slices or fake the disk geometry so the EFI label ends early. Or it would be a small matter of programming to build a Perl wrapper for zpool create that would accomplish the same thing. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
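Until something like that exists, a manual workaround is to label the disks yourself and leave the last couple of percent out of the slice handed to ZFS. A sketch, assuming the slices were already sized about 1-2% short of full capacity with format(1M) and that the device names are placeholders:

    # s0 on each disk stops short of the end of the disk, so a slightly
    # smaller replacement drive will still fit later
    zpool create tank mirror c1t0d0s0 c1t1d0s0

The usual caveat applies: when ZFS is given slices rather than whole disks, it won't enable the drive write cache for you.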
Re: [zfs-discuss] replace same sized disk fails with too small error
Miles Nordin wrote: mj == Moore, Joe joe.mo...@siemens.com writes: mj For a ZFS pool, (until block pointer rewrite capability) this mj would have to be a pool-create-time parameter. naw. You can just make ZFS do it all the time, like the other storage vendors do. no parameters. Other storage vendors have specific compatibility requirements for the disks you are allowed to install in their chassis. On the other hand, OpenSolaris is intended to work on commodity hardware. And there is no way to change this after the pool has been created, since after that time, the disk size can't be changed. So whatever policy is used by default, it is very important to get it right. (snip) Most people will not even notice the feature exists except by getting errors less often. AIUI this is how it works with other RAID layers, the cheap and expensive alike among ``hardware'' RAID, and this common-practice is very ZFS-ish. except hardware RAID is proprietary so you cannot determine their exact policy, while in ZFS you would be able to RTFS and figure it out. Sysadmins should not be required to RTFS. Behaviors should be documented in other places too. But there is still no need for parameters. There isn't even a need to explain the feature to the user. There isn't a need to explain the feature to the user? That's one of the most irresponsible responses I've heard lately. A user is expecting their 500GB disk to give them the full 500GB, not slightly less, unless that feature is explained. Parameters with reasonable defaults (and a reasonable way to change them) allow users who care about the parameter, and who understand the tradeoffs involved in changing the default, to make their system work better. If I didn't want to be able to tune my system for performance, I would be running Windows. OpenSolaris is about transparency, not just Open Source. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs subdirectories to data set conversion
Nicolas Williams wrote: It'd be awesome to have a native directory-dataset conversion feature in ZFS. And, relatedly, fast moves of files across datasets in the same volume. These two RFEs have been discussed to death in the list; see the archives. This would be a nice feature to have. The most compelling technical problem I've seen in the idea of reparenting a directory to be a top-level dataset is that when a zfs filesystem is used, open files on that filesystem have a particular devid. In order to split off the directory onto a new zfs filesystem, you'd have to atomically change the devid inside all the processes that have open files under that directory. Finding those open files is practically impossible. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
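In the meantime the usual manual approach is a copy rather than a reparent; a sketch (dataset and path names hypothetical, and the directory has to be idle while the copy runs):

    # Split /tank/home/alice out into its own dataset by copying it
    zfs create tank/home_alice
    cd /tank/home/alice && find . -depth -print | cpio -pdmu /tank/home_alice
    # After verifying the copy, retire the old directory and move the
    # new dataset's mountpoint into place
    rm -rf /tank/home/alice
    zfs set mountpoint=/tank/home/alice tank/home_alice

It works, but it is exactly the slow, non-atomic copy that the RFEs are trying to avoid.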
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
Ross Smith wrote: My justification for this is that it seems to me that you can split disk behavior into two states: - returns data ok - doesn't return data ok And for the state where it's not returning data, you can again split that in two: - returns wrong data - doesn't return data The state under discussion in this thread is the I/O requested by ZFS hasn't finished after 60, 120, 180, 3600, etc. seconds. The pool is waiting (for device timeouts) to distinguish between the first two states. More accurate state descriptions are: - The I/O has returned data - The I/O hasn't yet returned data and the user (admin) is justifiably impatient. For the first state, the data is either correct (verified by the ZFS checksums, or ESUCCESS on write) or incorrect and retried. The first of these is already covered by ZFS with its checksums (with FMA doing the extra work to fault drives), so it's just the second that needs immediate attention, and for the life of me I can't think of any situation that a simple timeout wouldn't catch. Personally I'd love to see two parameters, allowing this behavior to be turned on if desired, and allowing timeouts to be configured: zfs-auto-device-timeout zfs-auto-device-timeout-fail-delay I'd prefer these be set at the (default) pool level: zpool-device-timeout zpool-device-timeout-fail-delay with specific per-VDEV overrides possible: vdev-device-timeout and vdev-device-fail-delay This would allow, but not require, slower VDEVs to be tuned specifically for that case without hindering the default pool behavior on the local fast disks. Specifically, consider the case where I'm using mirrored VDEVs with one half over iSCSI, and want the iSCSI retry logic to still apply. Writes that failed while the iSCSI link is down would have to be resilvered, but at least reads would switch to the local devices faster. Set them to the default magic 0 value to have the system use the current behavior of relying on the device drivers to report failures. Set to a number (in ms, probably) and the pool would consider an I/O that takes longer than that as returns invalid data. Once the FMA work discussed below is done, these could be augmented by the pool's best heuristic guess as to what the proper timeouts should be, which could be saved in (kstat?) vdev-device-autotimeout. If you set the timeout to the magic -1 value, the pool would use vdev-device-autotimeout. All that would be required is for the I/O that caused the disk to take a long time to be given a deadline (now + (vdev-device-timeout ?: (zpool-device-timeout ?: forever)))* and to consider the I/O complete with whatever data has returned after that deadline: if that's a bunch of 0's on a read, it would have a bad checksum; a partially-completed write would have to be committed somewhere else. Unfortunately, I'm not enough of a programmer to implement this. --Joe * with the -1 magic, it would be a slightly more complicated calculation. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
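To make the proposal concrete, here's how I'd imagine the knobs being used. None of these properties exist today; the names, units, and even the idea of per-vdev settings are purely hypothetical:

    # Pool-wide defaults (hypothetical): declare an I/O bad after 2 seconds,
    # fault the device if it stays unresponsive for 30 seconds
    zpool set zpool-device-timeout=2000 tank
    zpool set zpool-device-timeout-fail-delay=30000 tank
    # Hypothetical per-vdev override for the iSCSI half of a mirror,
    # which should be allowed much longer to retry
    zpool set vdev-device-timeout=60000 tank c4t0d0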
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
C. Bergström wrote: Will Murnane wrote: On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote: Still don't understand why even the one on http://www.opensolaris.com/, ZFS - A Smashing Hit, doesn't show the app running in the moment the HD is smashed... weird... Sorry this is OT, but is it just me, or does it only seem proper to have Gallagher do this? ;) Absolutely not. Under no circumstances should you attempt to create a striped ZFS pool on a watermelon, nor on any other type of epigynous berry. If you try, you will certainly rind up with a mess, if not a core dump. And let me tell you, that's the pits. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris, thumper and hd
Tommaso Boccali wrote: Ciao, I have a thumper with OpenSolaris (snv_91), and 48 disks. I would like to try a new brand of HD, by replacing a spare disk with a new one and building a zfs pool on it. Unfortunately the official utility to map a disk to its physical position inside the thumper (hd, in /opt/SUNWhd) is not present in OpenSolaris. Any idea on how to get it? It should have shipped with the system. But you can also download it from http://www.sun.com/servers/x64/x4500/downloads.jsp Get it, install it, be happy :-) Or if you haven't tweaked the discovery order, there's a map at http://docs.sun.com/source/819-4359-14/figures/CH2-power-bios-9.gif --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)
Brian Hechinger wrote: On Mon, Oct 06, 2008 at 10:47:04AM -0400, Moore, Joe wrote: I wonder if an AVS-replicated storage device on the backends would be appropriate?

    write - ZFS-mirrored slog - ramdisk -AVS- physical disk
                            \
                             +- iscsi - ramdisk -AVS- physical disk

You'd get the continuous replication of the ramdisk to physical drive (and perhaps automagic recovery on reboot) but not pay the synchronous write-to-remote-physical-disk penalty. It looks like the answer is no. [EMAIL PROTECTED] sudo sndradm -e localhost /dev/rramdisk/avstest1 /dev/zvol/rdsk/SYS0/bitmap1 \wintermute /dev/zvol/dsk/SYS0/avstest2 /dev/zvol/rdsk/SYS0/bitmap2 ip async Enable Remote Mirror? (Y/N) [N]: y sndradm: Error: both localhost and wintermute are local I've not worked with AVS beyond looking at the basic concepts, but to me this looks like a don't-shoot-yourself-in-the-foot warning rather than an actual functionality restriction. Is there a -force option to override this normally quite reasonable sanity check? --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)
Nicolas Williams wrote: There have been threads about adding a feature to support slow mirror devices that don't stay synced synchronously. At least IIRC. That would help. But then, if the pool is busy writing then your slow ZIL mirrors would generally be out of sync, thus being of no help in the event of a power failure given fast slog devices that don't survive power failure. I wonder if an AVS-replicated storage device on the backends would be appropriate?

    write - ZFS-mirrored slog - ramdisk -AVS- physical disk
                            \
                             +- iscsi - ramdisk -AVS- physical disk

You'd get the continuous replication of the ramdisk to physical drive (and perhaps automagic recovery on reboot) but not pay the synchronous write-to-remote-physical-disk penalty. Also, using remote devices for a ZIL may defeat the purpose of fast ZILs, even if the actual devices are fast, because what really matters here is latency, and the farther the device, the higher the latency. A .5-ms RTT on an ethernet link to the iSCSI disk may be faster than a 9-ms latency on physical media. There was a time when it was better to place workstations' swap files on the far side of a 100Mbps ethernet link rather than using the local spinning rust. Ah, the good old days... --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Toby Thain Wrote: ZFS allows the architectural option of separate storage without losing end to end protection, so the distinction is still important. Of course this means ZFS itself runs on the application server, but so what? The OP in question is not running his network clients on Solaris or OpenSolaris or FreeBSD or MacOSX, but rather a collection of Linux workstations. Unless there's been a recent port of ZFS to Linux, that makes a big What. Given the fact that NFS, as implemented in his client systems, provides no end-to-end reliability, the only data protection that ZFS has any control over is after the write() is issued by the NFS server process. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ian Collins wrote: I think you'd be surprised how large an organisation can migrate most, if not all, of their application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances? Assuming that the application servers can coexist in the only 16GB available on a thumper, and the only 8GHz of CPU core speed, and can live with the fact that the system controller is a massive single point of failure for both the applications and the storage. You may have a difference of opinion as to what a large organization is, but the reality is that the thumper series is good for some things in a large enterprise, and not good for others. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Darren J Moffat wrote: Moore, Joe wrote: Given the fact that NFS, as implemented in his client systems, provides no end-to-end reliability, the only data protection that ZFS has any control over is after the write() is issued by the NFS server process. NFS can provide on-the-wire protection if you enable Kerberos support (there are usually 3 options for Kerberos: krb5 (or sometimes called krb5a), which is Auth only; krb5i, which is Auth plus integrity provided by the RPCSEC_GSS layer; and krb5p, which is Auth+Integrity+Encrypted data). I have personally seen krb5i NFS mounts catch problems when there was a router causing failures that the TCP checksum doesn't catch. No doubt, additional layers of data protection are available. I don't know the state of RPCSEC on Linux, so I can't comment on this; certainly your experience brings valuable insight into this discussion. It is also recommended (when iSCSI is an appropriate transport) to run over IPsec in ESP mode to also ensure data-packet-content consistency. Certainly NFS over IPsec/ESP would be more resistant to on-the-wire corruption. Either of these would give better data reliability than pure NFS, just like ZFS on the backend gives better data reliability than, for example, UFS or EXT3. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
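For example, on Solaris (assuming Kerberos is already configured on both ends), the integrity-protected flavor is just a share/mount option:

    # On the NFS server: share with integrity protection
    share -F nfs -o sec=krb5i /export/data
    # On the client: request the same security flavor
    mount -F nfs -o sec=krb5i server:/export/data /mnt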
Re: [zfs-discuss] [storage-discuss] iscsi target problems on snv_97
I believe the problem you're seeing might be related to a deadlock condition (CR 6745310); if you run pstack on the iscsi target daemon you might find a bunch of zombie threads. The fix was put back into snv_99, give snv_99 a try. Yes, a pstack of the core I've generated from iscsitgtd does have a number of zombie threads. I'm afraid I can't make heads nor tails of the bug report at http://bugs.opensolaris.org/view_bug.do?bug_id=6658836 nor its duplicate-of 6745310, nor any of the related bugs (all are unavailable except for 6676298, and the stack trace reported in that bug doesn't look anything like mine). As far as I can tell snv_98 is the latest build, from Sep 10 according to http://dlc.sun.com/osol/on/downloads/. So snv_99 should be out next week, correct? Anything I can do in the mean time? Do I need to BFU to the latest nightly build? Or would just taking the iscsitgtd from that build suffice? --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] iscsi target problems on snv_97
I've recently upgraded my x4500 to Nevada build 97, and am having problems with the iscsi target. Background: this box is used to serve NFS underlying a VMware ESX environment (zfs filesystem-type datasets) and presents iSCSI targets (zfs zvol datasets) for a Windows host and to act as zoneroots for Solaris 10 hosts. For optimal random-read performance, I've configured a single zfs pool of mirrored VDEVs of all 44 disks (+2 boot disks, +2 spares = 48) Before the upgrade, the box was flaky under load: all I/Os to the ZFS pool would stop occasionally. Since the upgrade, that hasn't happened, and the NFS clients are quite happy. The iSCSI initiators are not. The windows initiator is running the Microsoft iSCSI initiator v2.0.6 on Windows 2003 SP2 x64 Enterprise Edition. When the system reboots, it is not able to connect to its iscsi targets. No devices are found until I restart the iscsitgt process on the x4500, at which point the initiator will reconnect and find everything. I notice that on the x4500, it maintains an active TCP connection (according to netstat -an | grep 3260) to the Windows box through the reboot and for a long time afterwards. The initiator starts a second connection, but it seems that the target doesn't let go of the old one. Or something. At this point, every time I reboot the Windows system I have to `pkill iscsitgtd` The Solaris system is running S10 Update 4. Every once in a while (twice today, and not correlated with the pkill's above) the system reports that all of the iscsi disks are unavailable. Nothing I've tried short of a reboot of the whole host brings them back. All of the zones on the system remount their zoneroots read-only (and give I/O errors when read or zlogin'd to) There are a set of TCP connections from the zonehost to the x4500 that remain even through disabling the iscsi_initiator service. There's no process holding them as far as pfiles can tell. Does this sound familiar to anyone? Any suggestions on what I can do to troubleshoot further? I have a kernel dump from the zonehost and a snoop capture of the wire for the Windows host (but it's big). I'll be opening a bug too. Thanks, --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540
Bob Friesenhahn wrote: I expect that Sun is realizing that it is already undercutting much of the rest of its product line. These minor updates would allow the X4540 to compete against much more expensive StorageTek SAN hardware. Assuming, of course, that the requirements for the more expensive SAN hardware don't include, for example, surviving a controller or motherboard failure (or gracefully handling a RAM chip failure) without extended downtime for the replacement, or other extended downtime because there's only one set of chips that can talk to those disks. Real SAN storage is dual-ported to dual controller nodes so that you can replace a motherboard without taking down access to the disks. Or install a new OS version without waiting for the system to POST. How can other products remain profitable when competing against such a star performer? Features. RAS. Simplicity. Corporate inertia (having storage admins who don't know OpenSolaris). Executive outings with StorageTek-logo'd golf balls. The last two aren't something I'd build a business case around, but they're a reality. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] proposal partial/relative paths for zfs(1)
Carson Gaspar wrote: Darren J Moffat wrote: $ pwd /cube/builds/darrenm/bugs $ zfs create -c 6724478 Why -c ? -c for current directory -p partial is already taken to mean create all non-existing parents and -r relative is already used consistently as recurse in other zfs(1) commands (as well as lots of other places). Why not zfs create $PWD/6724478. Works today, traditional UNIX behaviour, no coding required. Unless you're in some bizarroland shell (like csh?)... Because the zfs dataset mountpoint may not be the same as the zfs pool name. This makes things a bit complicated for the initial request. Personally, I haven't played with datasets where the mountpoint is different. If you have a zpool tank mounted on /tank and /tank/homedirs with mountpoint=/export/home, do you create the next dataset as /tank/homedirs/carson, or /export/home/carson? And does the mountpoint get inherited in the obvious (vs. the simple vs. not at all) way? I don't know. Also, $PWD has a leading / in this example, which isn't valid in a dataset name. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Bob Friesenhahn wrote: Something else came to mind which is a negative regarding deduplication. When zfs writes new sequential files, it should try to allocate blocks in a way which minimizes fragmentation (disk seeks). It should, but because of its copy-on-write nature, fragmentation is a significant part of the ZFS data lifecycle. There was a discussion of this on this list at the beginning of the year... http://mail.opensolaris.org/pipermail/zfs-discuss/2007-November/044077.html Disk seeks are the bane of existing storage systems since they come out of the available IOPS budget, which is only a couple hundred ops/second per drive. The deduplication algorithm will surely result in increasing effective fragmentation (decreasing sequential performance) since duplicated blocks will result in a seek to the master copy of the block followed by a seek to the next block. Disk seeks will remain an issue until rotating media goes away, which (in spite of popular opinion) is likely quite a while from now. On ZFS, sequential files are rarely sequential anyway. The SPA tries to keep blocks nearby, but when dealing with snapshotted sequential files being rewritten, there is no way to keep everything in order. But if you read through the thread referenced above, you'll see that there's no clear data about just how that impacts performance (I still owe Mr. Elling a filebench run on one of my spare servers) --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Help! ZFS pool is UNAVAILABLE
I AM NOT A ZFS DEVELOPER. These suggestions should work, but there may be other people who have better ideas. Aaron Berland wrote: Basically, I have a 3 drive raidz array on internal Seagate drives, running build 64nv. I purchased 3 add'l USB drives with the intention of mirroring and then migrating the data to the new USB drives. (snip) Below is my current zpool status. Note the USB drives are showing up as the same device. They are plugged into 3 different ports and they used to show up as different controllers?? This whole thing was supposed to duplicate my data and have more redundancy, but now it looks like I could be losing it all?! I have some data backed up on other devices, but not all.

    NAME        STATE     READ WRITE CKSUM
    zbk         UNAVAIL      0     0     0  insufficient replicas
      raidz1    ONLINE       0     0     0
        c2d0p2  ONLINE       0     0     0
        c1d0    ONLINE       0     0     0
        c1d1    ONLINE       0     0     0
      raidz1    UNAVAIL      0     0     0  insufficient replicas
        c5t0d0  ONLINE       0     0     0
        c5t0d0  FAULTED      0     0     0  corrupted data
        c5t0d0  FAULTED      0     0     0  corrupted data

Ok, from here, we can see that you have a single pool, with two striped components: a raidz set from the c1 and c2 disks, and the (presumably new) raidz set from c5 -- I'm guessing this is where the USB disks show up. Unfortunately, it is not possible to remove a component from a zfs pool. On the bright side, it might be possible to trick it, at least for long enough to get the data back. First, we'll want to get the system booted. You'll connect the USB devices, but DON'T try to do anything with your pool (especially don't put more data on it). You should then be able to get a consistent pool up and running -- the devices will be scanned and detected and automatically reenabled. You might have to do a zpool import to search all of the /dev/dsk/ devices. From there, pull out one of the USB drives and do a zpool scrub to resilver the failed RAID group. So now, wipe off the removed USB disk (format it with ufs or something... it just needs to lose the ZFS identifiers. And while we're at it, ufs is probably a good choice anyway, given the next step(s)). One of the disks will show FAULTED at this point; I'll call it c5t2d0. Now, mount up that extra disk, and run mkfile -n 500g /mnt/theUSBdisk/disk1.img (the -n makes it a sparse file). Then do a zpool replace zbk c5t2d0 /mnt/theUSBdisk/disk1.img. Then you can also replace the other 2 USB disks with other .img files too... as long as the total data written to these stripes doesn't exceed the actual size of the disk, you'll be OK. At this point, back up your data (e.g. zfs send zbk@backup | bzip2 -9 > /mnt/theUSBdisk/backup.dat). --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZIL and snapshots
I'm using an x4500 as a large data store for our VMware environment. I have mirrored the first 2 disks, and created a ZFS pool of the other 46: 22 pairs of mirrors, and 2 spares (optimizing for random I/O performance rather than space). Datasets are shared to the VMware ESX servers via NFS. We noticed that VMware mounts its NFS datastore with the SYNC option, so every NFS write gets flagged with FILE_SYNC. In testing, synchronous writes are significantly slower than async, presumably because of the strict ordering required for correctness (cache flushing and the ZIL). Can anyone tell me if a ZFS snapshot taken when zil_disable=1 will be crash-consistent with respect to the data written by VMware? Are the snapshot metadata updates serialized with pending non-metadata writes? If an asynchronous write is issued before the snapshot is initiated, is it guaranteed to be in the snapshot data, or can it be reordered to after the snapshot? Does a snapshot flush pending writes to disk? To increase performance, the users are willing to lose an hour or two of work (these are development/QA environments): in the event that the x4500 crashes and loses the 16GB of cached (zil_disable=1) writes, we roll back to the last hourly snapshot, and everyone's back to the way they were. However, I want to make sure that we will be able to boot a crash-consistent VM from that rolled-back virtual disk. Thanks for any knowledge you might have, --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
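For reference, the setup I'm describing looks roughly like this (dataset names are examples, and the usual disclaimer applies -- zil_disable is a global, unsupported tunable that affects every pool and every NFS client on the host):

    # /etc/system entry to disable the ZIL (takes effect at the next boot)
    set zfs:zil_disable = 1

    # crontab entry for hourly rollback points on the VMware datastore
    0 * * * * /usr/sbin/zfs snapshot tank/vmware@hourly-`date +\%H`

Backticks rather than $() because cron runs the command line through /bin/sh.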
Re: [zfs-discuss] ZIL and snapshots
Have you thought of solid state cache for the ZIL? There's a 16GB battery backed PCI card out there, I don't know how much it costs, but the blog where I saw it mentioned a 20x improvement in performance for small random writes. Thought about it, looked in the Sun Store, couldn't find one, and cut the PO. Haven't gone back to get a new approval. I did put a couple of the MTron 32GB SSD drives on the christmas wishlist (aka 2008 budget) --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
BillTodd wrote: In order to be reasonably representative of a real-world situation, I'd suggest the following additions: Your suggestions (make the benchmark big enough so seek times are really noticed) are good. I'm hoping that over the holidays, I'll get to play with an extra server... If I'm lucky, I'll have 2x36GB drives (in a 1-2GB memory server) that I can dedicate to their own mirrored zfs pool. I figure a 30GB test file should make the seek times interesting. There's also a needed 5) Run the same microbenchmark against a UFS filesystem to compare the step2/step4 ratio with what a non-COW filesystem offers. In theory, the UFS ratio should be 1:1, that is, sequential read performance should not be affected by the intervening random writes. (In the case of my test server, I'll make it an SVM mirror of the same 2 drives) --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Louwtjie Burger wrote: Richard Elling wrote: - COW probably makes that conflict worse This needs to be proven with a reproducible, real-world workload before it makes sense to try to solve it. After all, if we cannot measure where we are, how can we prove that we've improved? I agree, let's first find a reproducible example where updates negatively impact large table scans ... one that is rather simple (if there is one) to reproduce, and then work from there. I'd say it would be possible to define a reproducible workload that demonstrates this using the Filebench tool... I haven't worked with it much (maybe over the holidays I'll be able to do this), but I think a workload like:
1) create a large file (bigger than main memory) on an empty ZFS pool
2) time a sequential scan of the file
3) random write i/o over, say, 50% of the file (either with or without matching blocksize)
4) time a sequential scan of the file
The difference between times 2 and 4 is the penalty that COW block reordering (which may introduce seemingly-random seeks between sequential blocks) imposes on the system. It would be interesting to watch seeksize.d's output during this run too. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
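A rough sketch of that sequence with plain dd (paths and sizes are arbitrary, compression must be off so the file occupies real blocks, and step 3 is better done with filebench's random-write workload or a small script):

    # 1) create a ~32GB file on the empty pool
    dd if=/dev/zero of=/tank/test/bigfile bs=128k count=262144
    # 2) time the first sequential scan
    time dd if=/tank/test/bigfile of=/dev/null bs=128k
    # 3) random rewrites over about half the file (filebench, or dd with seek=
    #    at random offsets in a loop)
    # 4) time the sequential scan again and compare against step 2
    time dd if=/tank/test/bigfile of=/dev/null bs=128k

Running the DTraceToolkit's seeksize.d alongside step 4 would show how scattered the logically-sequential reads have become.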
Re: [zfs-discuss] HAMMER
Peter Tribble wrote: I'm not worried about the compression effect. Where I see problems is backing up million/tens of millions of files in a single dataset. Backing up each file is essentially a random read (and this isn't helped by raidz which gives you a single disks worth of random read I/O per vdev). I would love to see better ways of backing up huge numbers of files. It's worth correcting this point... the RAIDZ behavior you mention only occurs if the read size is not aligned to the dataset's block size. The checksum verifier must read the entire stripe to validate the data, but it does that in parallel across the stripe's vdevs. The whole block is then available for delivery to the application. Although, backing up millions/tens of millions of files in a single backup dataset is a bad idea anyway. The metadata searches will kill you, no matter what backend filesystem is supporting it. zfs send is the faster way of backing up huge numbers of files. But you pay the price in restore time. (But that's the normal tradeoff) --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
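For completeness, this is the style of backup I mean (snapshot names and the destination host are placeholders):

    # Full backup of a snapshot
    zfs snapshot tank/data@2008-06-01
    zfs send tank/data@2008-06-01 | ssh backuphost 'cat > /backup/data-2008-06-01.zfs'
    # Later incrementals only ship changed blocks -- no per-file random reads
    zfs snapshot tank/data@2008-06-02
    zfs send -i tank/data@2008-06-01 tank/data@2008-06-02 | ssh backuphost 'cat > /backup/data-2008-06-02.zfsinc'

Restoring means a zfs receive of the full stream plus each incremental, which is where the restore-time tradeoff mentioned above comes from.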
Re: [zfs-discuss] future ZFS Boot and ZFS copies
Jesus Cea wrote: Darren J Moffat wrote: Why would you do that when it would reduce your protection and ZFS boot can boot from a mirror anyway. I guess ditto blocks would be protection enough, since the data would be duplicated between both disks. Of course, backups are your friend. I asked almost the exact same question when I first heard about ditto blocks. (See http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040596.html and followups) There are 2 key differences between ditto blocks and mirrors: 1) The ZFS pool is considered unprotected. That means a device failure will result in a kernel panic. 2) Ditto block separation is not enforced. The allocator tries to keep the second copy far from the first one, but it is possible that both copies of your /etc/passwd file are on the same VDEV. This means that a device failure could result in real loss of data. It would be really nice if there was some sort of enforced-ditto-separation (fail w/ device full if unable to satisfy) but that doesn't exist currently. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] space allocation vs. thin provisioning
Mike Gerdts wrote: I'm curious as to how ZFS manages space (free and used) and how its usage interacts with thin provisioning provided by HDS arrays. Is there any effort to minimize the number of provisioned disk blocks that get writes so as to not negate any space benefits that thin provisioning may give? I was trying to compose an email asking almost the exact same question, but in the context of array-based replication. They're similar in the sense that you're asking about using already-written space, rather than going off into virgin sectors of the disks (in my case, in the hope that the previous write is still waiting to be replicated and thus can be replaced by the current data). Background / more detailed questions: In Jeff Bonwick's blog[1], he talks about free space management and metaslabs. Of particular interest is the statement: ZFS divides the space on each virtual device into a few hundred regions called metaslabs. 1. http://blogs.sun.com/bonwick/entry/space_maps I wish I'd have seen this blog while I was composing my question... it answers some of my questions about how things work (plus Jeff's zfs_block_allocation entry actually moots most of my comments since they've already been implemented) (snip) As data is deleted, do the freed blocks get reused before never-used blocks? I didn't see any code where this would happen. I would really love to see a zpool setting where I can specify the reuse algorithm. (For example: zpool set block_reuse_policy=mru or =dense or =broad or =low) MRU (most recently used) in the hopes that the storage replication hasn't yet committed the previous write to the other side of the WAN DENSE (reuse any previously-written space) in the thin-provisioning case BROAD (venture off into new space when possible) for media that has rewrite cycle limitations (flash drives), to spread the writes over as much of the media as possible LOW (prioritize low-block# space) would provide optimal rotational latency for random i/o in the future and might be a special case of the above. The corresponding HIGH would improve sequential i/o. (Implementation is left as an exercise to the reader ;) Is there any collaboration between the storage vendors and ZFS developers to allow the file system to tell the storage array this range of blocks is unused so that the array can reclaim the space? I could see this as useful when doing re-writes of data (e.g. crypto rekey) to concentrate data that had become scattered into contiguous space. Deallocating storage space is something that nobody seems to be good at: ever tried to shrink a filesystem? Or a ZFS pool? Or a SAN RAID group? --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Force ditto block on different vdev?
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frank Cusack Sent: Friday, August 10, 2007 7:26 AM To: Tuomas Leikola Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Force ditto block on different vdev? On August 10, 2007 2:20:30 PM +0300 Tuomas Leikola [EMAIL PROTECTED] wrote: We call that a mirror :-) Mirror and raidz suffer from the classic blockdevice abstraction problem in that they need disks of equal size. Not that I'm aware of. Mirror and raid-z will simply use the smallest size of your available disks. Exactly. The rest is not usable. Well I don't understand how you suggest to use it if you want redundancy. Since copies=N is a per-filesystem setting, you fail writes to /tank/important_documents (copies=2) when you run out of ditto blocks on another VDEV, but still allow /tank/torrentcache (copies=1) to use the other space. With disks of 100 and 50 GB mirrored, /tank/torrentcache would be more redundant than necessary, and you run out of capacity too soon. Wishlist: It would be nice to put the whole redundancy definitions into the zfs filesystem layer (rather than the pool layer): Imagine being able to set copies=5+2 for a filesystem... (requires a 7-VDEV pool, and stripes via RAIDz2, otherwise the zfs create/set fails) --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
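The per-filesystem half of this works today; it's only the placement guarantees from the wishlist that don't. For example:

    # Two copies of everything under important_documents, one copy elsewhere
    zfs set copies=2 tank/important_documents
    zfs set copies=1 tank/torrentcache

The copies=5+2 syntax (ditto blocks striped RAIDZ2-style) doesn't exist; it's purely a wishlist item.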
Re: [zfs-discuss] ZFS and powerpath
Brian Wilson wrote: On Jul 16, 2007, at 6:06 PM, Torrey McMahon wrote: Darren Dunham wrote: My previous experience with powerpath was that it rode below the Solaris device layer. So you couldn't cause trespass by using the wrong device. It would just go to powerpath which would choose the link to use on its own. Is this not true or has it changed over time? I haven't looked at power path for some time but it used to be the opposite. The powerpath node sat on top of the actual device paths. One of the selling points of mpxio is that it doesn't have that problem. (At least for devices it supports.) Most of the multipath software had that same limitation I agree, it's not true. I don't know how long it hasn't been true, but the last year and a half I've been implementing PowerPath on Solaris 8, 9, 10, the way to make it work is to point whatever disk tool you're using to the emcpower device. The other paths are there because leadville finds them and creates them (if you're using leadville), but PowerPath isn't doing anything to make them redundant, it's giving you the emcpower device and the emcp, etc. drivers to front end them and give you a multipathed device (the emcpower device). It DOES choose which one to use, for all I/O going through the emcpower device. In a situation where you lose paths and I/O is moving, you'll see scsi errors down one path, then the next, then the next, as PowerPath gets fed the scsi error and tries the next device path. If you use those actual device paths, you're not actually getting a device that PowerPath is multipathing for you (i.e. it does not dig in beneath the scsi driver) I'm afraid I have to disagree with you: I'm using the /dev/dsk/c2t$WWNdXs2 devices quite happily with powerpath handling failover for my clariion. # powermt version EMC powermt for PowerPath (c) Version 4.4.0 (build 274) # powermt display dev=58 Pseudo name=emcpower58a CLARiiON ID=APM00051704678 [uscicsap1] Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1: /oracle/Q02/saparch] state=alive; policy=BasicFailover; priority=0; queued-IOs=0 Owner: default=SP A, current=SP A == Host --- - Stor - -- I/O Path - -- Stats --- ### HW Path I/O PathsInterf. ModeState Q-IOs Errors == 3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED] c2t5006016130202E48d58s0 SP A1 active alive 0 0 3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED] c2t5006016930202E48d58s0 SP B1 active alive 0 0 # fsck /dev/dsk/c2t5006016130202E48d58s0 ** /dev/dsk/c2t5006016130202E48d58s0 ** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n 144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0% fragmentation) # fsck /dev/dsk/c2t5006016930202E48d58s0 ** /dev/dsk/c2t5006016930202E48d58s0 ** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX? n 144 files, 189504 used, 33832172 free (420 frags, 4228969 blocks, 0.0% fragmentation) ### So at this point, I can look down either path and get to my data. Now I kill 1 of the 2 paths via SAN zoning. cfgadm -c configure c2, and powermt check reports that the path to SP A is now dead. 
I'm still able to fsck the dead path: # cfgadm -c configure c2 # powermt check Warning: CLARiiON device path c2t5006016130202E48d58s0 is currently dead. Do you want to remove it (y/n/a/q)? n # powermt display dev=58 Pseudo name=emcpower58a CLARiiON ID=APM00051704678 [uscicsap1] Logical device ID=6006016067E51400565259A15331DB11 [saperqdb1: /oracle/Q02/saparch] state=alive; policy=BasicFailover; priority=0; queued-IOs=0 Owner: default=SP A, current=SP B == Host --- - Stor - -- I/O Path - -- Stats --- ### HW Path I/O PathsInterf. ModeState Q-IOs Errors == 3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED] c2t5006016130202E48d58s0 SP A1 active dead 0 1 3073 [EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED] c2t5006016930202E48d58s0 SP B1 active alive 0 0 # fsck /dev/dsk/c2t5006016130202E48d58s0 ** /dev/dsk/c2t5006016130202E48d58s0 ** Last Mounted on /zones/saperqdb1/root/oracle/Q02/saparch **
[zfs-discuss] ZFS mirroring vs. ditto blocks
Has anyone done a comparison of the reliability and performance of a mirrored zpool vs. a non-redundant zpool using ditto blocks? What about a gut-instinct about which will give better performance? Or do I have to wait until my Thumper arrives to find out for myself? Also, in selecting where a ditto block is written, (other than far away) does the system take into account the disk's path, so for example, would it write both copies down a single controller? --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
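A minimal comparison harness for whoever gets to it before my Thumper arrives (device names are placeholders, and a real comparison should use something like filebench rather than a single dd):

    # Pool A: conventional two-way mirror
    zpool create dittotest_mirror mirror c1t2d0 c1t3d0
    # Pool B: two-disk stripe relying on ditto blocks (copies=2) for redundancy
    zpool create dittotest_ditto c1t4d0 c1t5d0
    zfs set copies=2 dittotest_ditto
    # Same simple workload against both
    time dd if=/dev/zero of=/dittotest_mirror/testfile bs=128k count=81920
    time dd if=/dev/zero of=/dittotest_ditto/testfile bs=128k count=81920

Note that even with copies=2, losing a whole disk in pool B still takes the pool down, since the striped top-level vdevs themselves have no redundancy -- so this only compares performance, not the reliability question.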