Re: [zfs-discuss] SSD ZIL/L2ARC partitioning
On 11/14/12 03:24, Sašo Kiselkov wrote: On 11/14/2012 11:14 AM, Michel Jansens wrote:

Hi, I've ordered a new server with: - 4x600GB Toshiba 10K SAS2 Disks - 2x100GB OCZ DENEVA 2R SYNC eMLC SATA (no expander so I hope no SAS/SATA problems). Specs: http://www.oczenterprise.com/ssd-products/deneva-2-r-sata-6g-2.5-emlc.html I want to use the 2 OCZ SSDs as mirrored intent log devices, but as the intent log needs quite a small amount of the disks (10GB?), I was wondering if I can use the rest of the disks as L2ARC? I have a few questions about this: - Is 10GB enough for a log device?

A log device, essentially, only needs to hold a single transaction's worth of small sync writes,

Actually it needs to hold 3 transaction groups' worth. There are 3 phases to ZFS's transaction group model: open, quiescing and syncing. Nowadays the sync phase is targeted at 5s, so the log device needs to be able to hold up to 15s of synchronous data.

so unless you write more than that, you'll be fine. In fact, DDRdrive's X1 is only 4GB and works just fine.

Agreed, 10GB should be fine for your system. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
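Neil's sizing rule (3 transaction groups, 5s sync target) can be sketched as a quick back-of-envelope calculation. The write rate below is purely illustrative, not a figure from the thread:

```python
# Rough slog sizing sketch: the slog must absorb up to three
# transaction groups' worth of synchronous writes (open, quiescing,
# syncing); with a 5s txg target that is ~15s of sync traffic.
def slog_size_bytes(sync_write_rate_bps, txg_timeout_s=5, txgs_in_flight=3):
    """Return a lower bound on slog capacity for a given sync write rate."""
    return sync_write_rate_bps * txg_timeout_s * txgs_in_flight

# e.g. a steady 200 MiB/s of synchronous writes (illustrative):
needed = slog_size_bytes(200 * 1024**2)
print(needed / 1024**3)  # ~2.9 GiB -- well under the 10GB proposed
```

Even at a sustained write rate most small servers never reach, the required log capacity stays in the low gigabytes, which is why 10GB (or the 4GB X1) is comfortable.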
Re: [zfs-discuss] Making ZIL faster
On 10/04/12 05:30, Schweiss, Chip wrote: Thanks for all the input. It seems information on the performance of the ZIL is sparse and scattered. I've spent significant time researching this the past day. I'll summarize what I've found. Please correct me if I'm wrong.

The ZIL can have any number of SSDs attached, either mirrored or individually. ZFS will stripe across these in a raid0 or raid10 fashion depending on how you configure.

The ZIL code chains blocks together and these are allocated round robin among slogs or, if they don't exist, then the main pool devices.

To determine the true maximum streaming performance of the ZIL, setting sync=disabled will only use the in-RAM ZIL. This gives up power protection to synchronous writes.

There is no RAM ZIL. If sync=disabled then all writes are asynchronous and are written as part of the periodic ZFS transaction group (txg) commit that occurs every 5 seconds.

Many SSDs do not help protect against power failure because they have their own RAM cache for writes. This effectively makes the SSD useless for this purpose and potentially introduces a false sense of security. (These SSDs are fine for L2ARC.)

The ZIL code issues a write cache flush to all devices it has written before returning from the system call. I've heard that not all devices obey the flush, but we consider them broken hardware. I don't have a list to avoid.

Mirroring SSDs is only helpful if one SSD fails at the time of a power failure.

This leaves several unanswered questions. How good is ZFS at detecting that an SSD is no longer a reliable write target? The chance of silent data corruption is well documented for spinning disks. What chance of data corruption does this introduce with up to 10 seconds of data written on SSD? Does ZFS read the ZIL during a scrub to determine if our SSD is returning what we write to it?

If the ZIL code gets a block write failure it will force the txg to commit before returning.
It will depend on the drivers and IO subsystem as to how hard it tries to write the block.

Zpool versions 19 and higher should be able to survive a ZIL failure, only losing the uncommitted data. However, I haven't seen good enough information that I would necessarily trust this yet.

This has been available for quite a while and I haven't heard of any bugs in this area.

Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs. I'm not sure if that is current, but I can't find any reports of better performance. I would suspect that DDRdrive or ZeusRAM as ZIL would push past this.

1GB/s seems very high, but I don't have any numbers to share.

Anyone care to post their performance numbers on current hardware with E5 processors, and RAM-based ZIL solutions? Thanks to everyone who has responded and contacted me directly on this issue. -Chip

On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel andrew.gabr...@cucumber.demon.co.uk wrote: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Schweiss, Chip How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDRdrive, ZeusRAM, etc.

Temporarily set sync=disabled Or, depending on your application, leave it that way permanently. I know, for the work I do, most systems I support at most locations have sync=disabled. It all depends on the workload.

Noting of course that this means that in the case of an unexpected system outage or loss of connectivity to the disks, synchronous writes since the last txg commit will be lost, even though the applications will believe they are
Re: [zfs-discuss] Making ZIL faster
On 10/04/12 15:59, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin The ZIL code chains blocks together and these are allocated round robin among slogs or if they don't exist then the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more bandwidth by adding multiple slog devices?

In general - yes, but it really depends. Multiple synchronous writes of any size across multiple file systems will fan out across the log devices. That is because there is a separate, independent log chain for each file system. Also, large synchronous writes (e.g. 1MB) within a specific file system will be spread out. The ZIL code will try to allocate a block to hold all the records it needs to commit, up to the largest block size - which currently for you should be 128KB. Anything larger will allocate a new block - on a different device if there are multiple devices. However, lots of small synchronous writes to the same file system might not use more than one 128K block, and so won't benefit from multiple slog devices. Neil.
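The allocation behavior Neil describes can be modeled with a toy sketch (this is an illustration of the described policy, not the actual ZIL allocator): records for one filesystem are packed into log blocks up to the 128K maximum, and each new block lands on the next slog device round-robin.

```python
# Toy model of ZIL block allocation across multiple slog devices.
MAX_LOG_BLK = 128 * 1024  # largest log block in this scenario

def allocate_log_blocks(record_sizes, n_slogs):
    """Pack records into <=128K blocks, round-robining blocks over slogs."""
    blocks = []          # list of [device_index, bytes_used]
    used = MAX_LOG_BLK   # force the first record to open a new block
    dev = -1
    for size in record_sizes:
        while size > 0:
            if used >= MAX_LOG_BLK:          # current block full: new block,
                dev = (dev + 1) % n_slogs    # next device round-robin
                blocks.append([dev, 0])
                used = 0
            chunk = min(size, MAX_LOG_BLK - used)
            blocks[-1][1] += chunk
            used += chunk
            size -= chunk
    return blocks

# 100 x 1K sync writes to one filesystem fit in a single 128K block
# (only one device busy), while 4 x 1MB writes spread over all slogs.
print(len(allocate_log_blocks([1024] * 100, 4)))                   # 1
print(len({d for d, _ in allocate_log_blocks([1024**2] * 4, 4)}))  # 4
```

This matches the point of the reply: small sync writes to one filesystem serialize into one block chain, so extra slogs help only when there are multiple filesystems or large writes.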
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On 08/03/12 19:39, Bob Friesenhahn wrote: On Fri, 3 Aug 2012, Karl Rossing wrote: I'm looking at http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-ssd.html wondering what I should get. Are people getting intel 330's for l2arc and 520's for slog?

For the slog, you should look for a SLC technology SSD which saves unwritten data on power failure. In Intel-speak, this is called Enhanced Power Loss Data Protection. I am not running across any Intel SSDs which claim to match these requirements.

- That shouldn't be necessary. ZFS flushes the write cache for any device written before returning from the synchronous request to ensure data stability.

Extreme write IOPS claims in consumer SSDs are normally based on large write caches which can lose even more data if there is a power failure. Bob
Re: [zfs-discuss] Log disk with all ssd pool?
On 10/28/11 00:04, Mark Wolek wrote: Still kicking around this idea and didn't see it addressed in any of the threads before the forum closed. If one made an all ssd pool, would a log/cache drive just slow you down? Would zil slow you down? Thinking rotate MLC drives with sandforce controllers every few years to avoid losing a drive to "sorry, no more writes allowed" scenarios. Thanks Mark

Interesting question. I don't think there's a straightforward answer. Oracle uses write-optimised log devices and read-optimised cache devices in its appliances. However, assuming all the SSDs are the same, then I suspect neither a log nor a cache device would help:

Log: If there is a log then it is solely used, and can be written to in parallel with periodic TXG commit writes to the other pool devices. If that log were part of the pool then the ZIL code will spread the load among all pool devices, but will compete with TXG commit writes. My gut feeling is that this would be the higher performing option though. I think, a long time ago, I experimented with designating one disk out of the pool as a log and saw degradation on synchronous performance. That seems to be the equivalent to your SSD question.

Cache: Similarly for cache devices, the reads would compete with TXG commit writes, but otherwise performance ought to be higher. Neil.
Re: [zfs-discuss] Log disk with all ssd pool?
On 10/28/11 00:54, Neil Perrin wrote:
[...]

Did some quick tests with disks to check if my memory was correct. 'sb' is a simple program to spawn a number of threads to fill a file of a certain size with specified-sized non-zero writes. Bandwidth is also important.

1. Simple 2 disk system. 32KB synchronous writes filling 1GB with 20 threads
zpool create whirl 2 disks; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 20
Elapsed time 95s 10.8MB/s
zpool create whirl disk log disk; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 20
Elapsed time 151s 6.8MB/s

2. Higher end 6 disk system.
32KB synchronous writes filling 1GB with 100 threads
zpool create whirl 6 disks; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
Elapsed time 33s 31MB/s
zpool create whirl 5 disks log 1 disk; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
Elapsed time 147s 7.0MB/s
and for interest:
zpool create whirl 5 disks log SSD; zfs set recordsize=32k whirl
st1 -n /whirl/f -f 1073741824 -b 32768 -t 100
Elapsed time 8s 129MB/s

3. Higher end, smaller writes. 2K synchronous writes filling 128MB with 100 threads
zpool create whirl 6 disks; zfs set recordsize=1k whirl
st1 -n /whirl/f -f 134217728 -b 2048 -t 100
Elapsed time 16s 8.2MB/s
zpool create whirl 5 disks log 1 disk; zfs set recordsize=1k whirl
ds8 -n /whirl/f -f 134217728 -b 2048 -t 100
Elapsed time 24s 5.5MB/s
Re: [zfs-discuss] Log disk with all ssd pool?
On 10/28/11 11:21, Mark Wolek wrote: Having the log disk slowed it down a lot in your tests (when it wasn't a SSD), 30MB/s vs 7. Is this also a 100% write / 100% sequential workload? Forcing sync?

100% synchronous write. Writes are random but ZFS will write them sequentially on disk.

It's gotten to the point where I can buy a 120G SSD for less or the same price as a 146G SAS disk. Sure the MLC drives have limited lifetime, but at $150 (and dropping) just replace them every few years to be safe, work out a rotation/rebuild cycle, it's tempting. I suppose if we do end up buying all SSDs it becomes really easy to test if we should use a log or not!

Would highly recommend some form of zpool redundancy (mirroring or raidz).

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin Sent: Friday, October 28, 2011 11:38 AM To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Log disk with all ssd pool?
[...]
Re: [zfs-discuss] Advice with SSD, ZIL and L2ARC
On 9/19/11 11:45 AM, Jesus Cea wrote: I have a new question: interaction between dataset encryption and L2ARC and ZIL.

1. I am pretty sure (but not completely sure) that data stored in the ZIL is encrypted, if the destination dataset uses encryption. Can anybody confirm?

If the dataset (file system/zvol) is encrypted then the user data is also encrypted. The ZIL metadata used to parse blocks and records is kept in the clear (in order to claim the blocks), but the user data is encrypted. Neil.
Re: [zfs-discuss] Advice with SSD, ZIL and L2ARC
On 08/30/11 08:31, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jesus Cea 10. What happens if my 1GB of ZIL is too optimistic? Will ZFS use the disks or will it stop writers until flushing the ZIL to the HDs?

Good question. I don't know.

- It will use the pool disks. Thanks Edward for answering the rest. Neil.
Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL
In general the blog's conclusion is correct. When file systems get full there is fragmentation (happens to all file systems) and for ZFS the pool uses gang blocks of smaller blocks when there are insufficient large blocks. However, the ZIL never allocates or uses gang blocks. It directly allocates blocks (outside of the zio pipeline) using zio_alloc_zil() -> metaslab_alloc(). Gang blocks are only used by the main pool when the pool transaction group (txg) commit occurs.

Solutions to the problem include:
- add a separate intent log
- add more top level devices (hopefully replicated)
- delete unused files/snapshots etc. within the pool

Neil.

On 08/01/11 08:29, Josh Simon wrote: Hello, One of my coworkers was sent the following explanation from Oracle as to why one of our backup systems was conducting a scrub so slowly. I figured I would share it with the group. http://wildness.espix.org/index.php?post/2011/06/09/ZFS-Fragmentation-issue-examining-the-ZIL PS: Thought it was kind of odd that Oracle would direct us to a blog, but the post is very thorough. Thanks, Josh Simon
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On 06/16/11 20:26, Daniel Carosone wrote: On Thu, Jun 16, 2011 at 09:15:44PM -0400, Edward Ned Harvey wrote: My personal preference, assuming 4 disks, since the OS is mostly reads and only a little bit of writes, is to create a 4-way mirrored 100G partition for the OS, and the remaining 900G of each disk (or whatever) becomes either a stripe of mirrors or raidz, as appropriate in your case, for the storage pool.

Is it still the case, as it once was, that allocating anything other than whole disks as vdevs forces NCQ / write cache off on the drive (either or both, forget which, guess write cache)?

It was once the case that using a slice as a vdev forced the write cache off, but I just tried it and found it wasn't disabled - at least with the current source. In fact it looks like we no longer change the setting. You may want to experiment yourself on your ZFS version (see below for how to check).

If so, can this be forced back on somehow to regain performance when known to be safe?

Yes: format -e -> select disk -> cache -> write -> display/enable/disable

I think the original assumption was that zfs-in-a-partition likely implied the disk was shared with ufs, rather than another async-safe pool.

- Correct. Neil.
Re: [zfs-discuss] ls reports incorrect file size
On 05/02/11 14:02, Nico Williams wrote: Also, sparseness need not be apparent to applications. Until recent improvements to lseek(2) to expose hole/non-hole offsets, the only way to know about sparseness was to notice that a file's reported size is more than the file's reported filesystem blocks times the block size. Sparse files in Unix go back at least to the early 80s. If a filesystem protocol, such as CIFS (I've no idea if it supports sparse files), were to not support sparse files, all that would mean is that the server must report a number of blocks that matches a file's size (assuming the protocol in question even supports any notion of reporting a file's size in blocks). There are really two ways in which a filesystem protocol could support sparse files: a) by reporting file size in bytes and blocks, b) by reporting lists of file offsets demarcating holes from non-holes. (b) is a very new idea; Lustre may be the only filesystem that I know of that supports this (see the Linux FIEMAP APIs), though work is in progress to add this to NFSv4.

I enhanced the lseek interface a while back now to return information about sparse files, by adding 2 new interfaces: SEEK_HOLE and SEEK_DATA. See man -s2 lseek. Neil.
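The SEEK_HOLE/SEEK_DATA interfaces Neil added have since been picked up elsewhere; Python's os module exposes them on platforms that support them. A small sketch (how much hole detail is reported depends on the underlying filesystem):

```python
# Probe the data/hole map of a sparse file via lseek(SEEK_HOLE).
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"x" * 4096, 0)      # real data at the start
    os.ftruncate(fd, 1024 * 1024)      # extend the file: the rest is a hole
    # First hole at or after offset 0; filesystems that don't report
    # holes return the file size here, so data_end is always >= 4096.
    data_end = os.lseek(fd, 0, os.SEEK_HOLE)
    print(data_end >= 4096)            # True
finally:
    os.close(fd)
    os.remove(path)
```

Applications that copy or back up files can walk SEEK_DATA/SEEK_HOLE pairs to skip holes entirely, instead of reading gigabytes of zeros.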
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 04/30/11 01:41, Sean Sprague wrote: : xvm-4200m2-02 ; I can do the echo | mdb -k. But what is that : xvm-4200 command? My guess is that is a very odd shell prompt ;-)

- Indeed. ':' means what follows is a comment (at least to /bin/ksh), 'xvm-4200m2-02' is the comment - actually the system name (not very inventive) - and ';' ends the comment. I use this because I can cut and paste entire lines back to the shell. Sorry for the confusion. Neil.
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 4/28/11 12:45 PM, Edward Ned Harvey wrote: From: Erik Trimble [mailto:erik.trim...@oracle.com] OK, I just re-looked at a couple of things, and here's what I /think/ are the correct numbers. I just checked, and the current size of this structure is 0x178, or 376 bytes. Each ARC entry, which points to either an L2ARC item (of any kind, cached data, metadata, or a DDT line) or actual data/metadata/etc., is defined in the struct arc_buf_hdr: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#431 Its current size is 0xb0, or 176 bytes. These are fixed-size structures.

heheheh... See what I mean about all the conflicting sources of information? Is it 376 and 176? Or is it 270 and 200? Erik says it's fixed-size. Richard says "The DDT entries vary in size." So far, what Erik says is at least based on reading the source code, with a disclaimer of possibly misunderstanding the source code. What Richard says is just a statement of supposed absolute fact without any backing. In any event, thank you both for your input. Can anyone answer these authoritatively? (Neil?) I'll send you a pizza. ;-)

- I wouldn't consider myself an authority on the dedup code. The size of these structures will vary according to the release you're running. You can always find out the size for a particular system using ::sizeof within mdb. For example, as super user:

: xvm-4200m2-02 ; echo ::sizeof ddt_entry_t | mdb -k
sizeof (ddt_entry_t) = 0x178
: xvm-4200m2-02 ; echo ::sizeof arc_buf_hdr_t | mdb -k
sizeof (arc_buf_hdr_t) = 0x100
: xvm-4200m2-02 ;

This shows yet another size. Also, there are more changes planned within the arc. Sorry, I can't talk about those changes, nor when you'll see them. However, that's not the whole story. It looks like the arc_buf_hdr_t use their own kmem cache so there should be little wastage, but the ddt_entry_t are allocated from the generic kmem caches and so will probably have some roundup and unused space.
Caches for small buffers are aligned to 64 bytes. See kmem_alloc_sizes[] and the comment at: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#920

Pizza: Mushroom and anchovy - er, just kidding. Neil.
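Putting the mdb output and the kmem alignment together gives a back-of-envelope DDT RAM estimate. This is only a sketch using the 0x178 figure from that particular build; as Neil says, the structure sizes vary by release:

```python
# Rough DDT memory footprint per the sizes discussed in this thread.
DDT_ENTRY = 0x178   # 376 bytes per ddt_entry_t on that build
KMEM_ALIGN = 64     # small kmem caches align allocations to 64 bytes

def ddt_ram_bytes(unique_blocks):
    """Estimate RAM for the DDT: round each entry up to kmem alignment."""
    per_entry = (DDT_ENTRY + KMEM_ALIGN - 1) // KMEM_ALIGN * KMEM_ALIGN
    return unique_blocks * per_entry

# e.g. 10 TiB of unique 128K blocks (illustrative pool, not from the thread):
n = 10 * 1024**4 // (128 * 1024)
print(ddt_ram_bytes(n) / 1024**3)  # 30.0 GiB
```

The rounding from 376 to 384 bytes is exactly the generic-kmem-cache wastage the reply mentions; it adds about 2% on top of the raw structure size.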
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 04/25/11 11:55, Erik Trimble wrote: On 4/25/2011 8:20 AM, Edward Ned Harvey wrote: And one more comment: Based on what's below, it seems that the DDT gets stored on the cache device and also in RAM. Is that correct? What if you didn't have a cache device? Shouldn't it *always* be in ram? And doesn't the cache device get wiped every time you reboot? It seems to me like putting the DDT on the cache device would be harmful... Is that really how it is?

Nope. The DDT is stored only in one place: cache device if present, /or/ RAM otherwise (technically, ARC, but that's in RAM). If a cache device is present, the DDT is stored there, BUT RAM also must store a basic lookup table for the DDT (yea, I know, a lookup table for a lookup table).

No, that's not true. The DDT is just like any other ZFS metadata and can be split over the ARC, cache device (L2ARC) and the main pool devices. An infrequently referenced DDT block will get evicted from the ARC to the L2ARC, then evicted from the L2ARC. Neil.
Re: [zfs-discuss] Cannot remove zil device
On 03/31/11 12:28, Roy Sigurd Karlsbakk wrote: http://pastebin.com/nD2r2qmh Here is zpool status and zpool version

The only thing I wonder about here is why you have two striped log devices. I didn't even know that was supported.

Yes, it's supported. ZFS will round robin writes to the log devices. Neil.
Re: [zfs-discuss] BOOT, ZIL, L2ARC one one SSD?
On 12/25/10 19:32, Bill Werner wrote: Understood Edward, and if this was a production data center, I wouldn't be doing it this way. This is for my home lab, so spending hundreds of dollars on SSD devices isn't practical. Can several datasets share a single ZIL and a single L2ARC, or must each dataset have its own?

The ZIL and L2ARC devices are per pool and thus shared amongst all datasets.
Re: [zfs-discuss] ashift and vdevs
On 12/01/10 22:14, Miles Nordin wrote: Also did anyone ever clarify whether the slog has an ashift? or is it forced-512? or derived from whatever vdev will eventually contain the separately-logged data? I would expect generalized immediate caring about that, since no slogs except ACARD and DDRDrive will have 512-byte sectors.

The minimum slog write is:

#define ZIL_MIN_BLKSZ 4096

and all writes are also rounded to multiples of ZIL_MIN_BLKSZ. Neil.
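The rounding Neil describes is simple enough to sketch: every slog write is rounded up to the next multiple of ZIL_MIN_BLKSZ, which is why 512-byte-sector assumptions don't matter much for slog devices with 4K-native flash pages.

```python
# Round a log write up to the next multiple of ZIL_MIN_BLKSZ (4096),
# mirroring the constant quoted from the ZIL source.
ZIL_MIN_BLKSZ = 4096

def zil_write_size(nbytes):
    """Size actually written to the slog for an nbytes log record."""
    return (nbytes + ZIL_MIN_BLKSZ - 1) // ZIL_MIN_BLKSZ * ZIL_MIN_BLKSZ

print(zil_write_size(1))      # 4096
print(zil_write_size(4097))   # 8192
```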
Re: [zfs-discuss] How does dedup work over iSCSI?
On 10/22/10 15:34, Peter Taps wrote: Folks, Let's say I have a volume being shared over iSCSI. The dedup has been turned on. Let's say I copy the same file twice under different names at the initiator end. Let's say each file ends up taking 5 blocks. For dedup to work, each block for a file must match the corresponding block from the other file. Essentially, each pair of blocks being compared must have the same start location into the actual data.

No, ZFS doesn't care about the file offset, just that the checksum of the blocks matches.

For a shared filesystem, ZFS may internally ensure that the block starts match. However, over iSCSI, the initiator does not even know about the whole block mechanism that zfs has. It is just sending raw bytes to the target. This makes me wonder if dedup actually works over iSCSI. Can someone please enlighten me on what I am missing? Thank you in advance for your help. Regards, Peter
Re: [zfs-discuss] How does dedup work over iSCSI?
On 10/22/10 17:28, Peter Taps wrote: Hi Neil, if the file offset does not match, the chance that the checksum would match, especially sha256, is almost 0. Maybe I am missing something. Let's say I have a file that contains 11 letters - ABCDEFGHIJK. Let's say the block size is 5. For the first file, the block contents are ABCDE, FGHIJ, and K. For the second file, let's say the blocks are ABCD, EFGHI, and JK. The chance that any checksum would match is very small. The chance that any checksum+verify would match is even smaller. Regards, Peter

The block size and contents have to match for ZFS dedup. See http://blogs.sun.com/bonwick/entry/zfs_dedup Neil.
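Peter's ABCDEFGHIJK example can be demonstrated directly: identical bytes chunked at different boundaries produce disjoint per-block checksums, so block-level dedup only fires when the content is block-aligned the same way (a sketch of the principle, not ZFS's actual checksum pipeline):

```python
# Per-block checksums of the same payload at two different alignments.
import hashlib

def block_checksums(data, blksz):
    """Set of SHA-256 digests, one per fixed-size block of data."""
    return {hashlib.sha256(data[i:i + blksz]).hexdigest()
            for i in range(0, len(data), blksz)}

payload = b"ABCDEFGHIJK"
aligned = block_checksums(payload, 5)         # ABCDE, FGHIJ, K
shifted = block_checksums(b"Z" + payload, 5)  # ZABCD, EFGHI, JK

print(block_checksums(payload, 5) == aligned)  # True: exact copy dedups
print(aligned & shifted)                       # set(): no shared blocks
```

This is exactly why a byte-identical file copied over iSCSI still dedups: the zvol sees the same block-aligned contents both times, even though the initiator knows nothing about ZFS blocks.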
Re: [zfs-discuss] What is dedupditto property on zpool?
On 09/24/10 11:26, Peter Taps wrote: Folks, One of the zpool properties that is reported is dedupditto. However, there is no documentation available, either in man pages or anywhere else on the Internet. What exactly is this property? Thank you in advance for your help. Regards, Peter

I found it documented in man zpool:

dedupditto=number
Sets a threshold for number of copies. If the reference count for a deduplicated block goes above this threshold, another ditto copy of the block is stored automatically. The default value is 0.

It seems a bit counter-intuitive to start with. The purpose of dedup is to remove copies of blocks. However, if there are say 50 references to the same block and that block gets checksum errors, then all 50 references are bad. So this is another form of redundancy: telling zfs to store an additional copy once a block accumulates a specific number of references. Neil.
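The policy can be sketched in a few lines. This is a simplified model of the man-page description (one extra copy past the threshold), not the actual ZFS implementation:

```python
# Simplified dedupditto policy: past the threshold, keep a second
# physical copy so one bad block doesn't break every referencing file.
def physical_copies(refcount, dedupditto):
    """Copies to keep for a deduped block (0 threshold disables dittos)."""
    if dedupditto and refcount >= dedupditto:
        return 2
    return 1

print(physical_copies(49, 100))   # 1: below threshold, normal dedup
print(physical_copies(150, 100))  # 2: heavily shared block gets a ditto
```

The trade-off is clear from the shape of the function: you give back a little of dedup's space savings on exactly the blocks whose loss would hurt the most.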
Re: [zfs-discuss] ZFS COW and simultaneous read write of files
On 09/22/10 11:22, Moazam Raja wrote: Hi all, I have a ZFS question related to COW and scope. If user A is reading a file while user B is writing to the same file, when do the changes introduced by user B become visible to everyone? Is there a block level scope, or file level, or something else? Thanks!

Assuming the users are using read and write against zfs files: ZFS has reader/writer range locking within files. If thread A is trying to read the same section that thread B is writing, it will block until the data is written. Note, "written" in this case means written into the zfs cache, not to the disks. If thread A requires that changes to the file be stable (on disk) before reading, it can use the little-known O_RSYNC flag. Neil.
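The key property of range locking is that a reader only waits when its byte range overlaps a writer's. A minimal sketch of the overlap test (a hypothetical helper for illustration, not ZFS code):

```python
# Half-open byte ranges [off, off+len) conflict iff they overlap.
def ranges_overlap(off_a, len_a, off_b, len_b):
    """True if [off_a, off_a+len_a) intersects [off_b, off_b+len_b)."""
    return off_a < off_b + len_b and off_b < off_a + len_a

# Reader of bytes 0-4K vs writer of bytes 8K-12K: no conflict, no wait.
print(ranges_overlap(0, 4096, 8192, 4096))   # False
# Reader of bytes 0-16K vs the same writer: overlaps, so the reader blocks.
print(ranges_overlap(0, 16384, 8192, 4096))  # True
```

This is why two processes working on disjoint regions of one large file (a database, say) don't serialize against each other.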
Re: [zfs-discuss] What is l2cache setting?
On 09/22/10 11:23, Peter Taps wrote: Folks, While going through zpool source code, I see a configuration option called l2cache. What is this option for? It doesn't seem to be documented. Thank you in advance for your help. Regards, Peter

See man zpool, under the Cache Devices section.
Re: [zfs-discuss] What is l2cache setting?
On 09/22/10 13:40, Peter Taps wrote: Neil, Thank you for your help. However, I don't see anything about l2cache under the Cache devices man pages. To be clear, there are two different vdev types defined in zfs source code - cache and l2cache. I am familiar with cache devices. I am curious about l2cache devices. Regards, Peter

They are one and the same. It's a bit confusing, but 'cache' was the external name given to l2cache (level 2 cache) vdevs.
Re: [zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?
On 09/17/10 23:31, Ian Collins wrote: On 09/18/10 04:46 PM, Neil Perrin wrote: On 09/17/10 18:32, Edward Ned Harvey wrote: From: Neil Perrin [mailto:neil.per...@oracle.com] you lose information. Not your whole pool. You lose up to 30 sec of writes

The default is now 5 seconds (zfs_txg_timeout).

When did that become default?

It was changed more recently than I remember: in snv_143, as part of a set of bug fixes: 6494473, 6743992, 6936821, 6956464. They were integrated on 6/8/10.

Should I *ever* say 30 sec anymore?

Well, for versions before snv_143, 30 seconds is correct. I was just giving a heads up that it has changed.

In the context of this thread, was the change integrated in update 9?

- No. It looks like it's destined for Update 10.
Re: [zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?
On 09/17/10 18:32, Edward Ned Harvey wrote: From: Neil Perrin [mailto:neil.per...@oracle.com] you lose information. Not your whole pool. You lose up to 30 sec of writes The default is now 5 seconds (zfs_txg_timeout). When did that become default? It was changed more recently than I remember in snv_143 as part of a set of bug fixes: 6494473, 6743992, 6936821, 6956464. They were integrated on 6/8/10. Should I *ever* say 30 sec anymore? Well for versions before snv_143 then 30 seconds is correct. I was just giving a heads up that it has changed. In my world, the oldest machine is 10u6. (Except one machine named dinosaur that is sol8) I believe George responded on that thread that we do handle log mirrors correctly. That is, if one side fails to checksum a block we do indeed check the other side. I should have been more cautious with my concern. I think I said I don't know if we handle it correctly, and George confirmed we do. Sorry for the false alarm. Great. ;-) Thank you. So the recommendation is still to mirror log devices, because the recommendation will naturally be ultra-conservative. ;-) The risk is far smaller now than it was before. So make up your own mind. If you are willing to risk 5sec or 30sec of data in the situation of (a) undetected failed log device *and* (b) ungraceful system crash, then you are willing to run with unmirrored log devices. In no situation does the filesystem become inconsistent or corrupt. In the worst case, you have a filesystem which is consistent with a valid filesystem state from a few seconds before the system crash. (Assuming you have a zpool recent enough to support log device removal.)
Re: [zfs-discuss] NFS performance near zero on a very full pool
Arne, NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If ever it fails to allocate a block (of the size requested) it is forced to close the txg containing the system call. Yes this can be extremely slow but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge. Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free size and I thought it was 90%. Neil. On 09/09/10 09:00, Arne Jansen wrote: Hi, currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal. What I found out so far: It seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write takes up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero. It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, now the writes get served continuously as it should be. What is the explanation for this behaviour? Is it intentional and can the threshold be tuned? I experienced this on Sol10 U8. Thanks, Arne
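The allocate-or-stall behaviour Neil describes can be sketched as follows (Python; purely illustrative names, not the actual ZIL allocation code):

```python
def zil_commit_write(size, slog_free, pool_free, force_txg_commit):
    """Illustrative sketch: try to allocate an intent-log block of the
    requested size (slog first, then main pool); if none is available,
    fall back to forcing the open txg to commit -- the slow path that
    makes every sync write wait on the txg."""
    for free_list in (slog_free, pool_free):
        for i, blk in enumerate(free_list):
            if blk >= size:
                free_list.pop(i)          # allocate the log block
                return "logged"
    force_txg_commit()                    # no room: block until txg syncs
    return "txg-committed"
```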
Re: [zfs-discuss] NFS performance near zero on a very full pool
I should also have mentioned that if the pool has a separate log device then this shouldn't happen. Assuming the slog is big enough then it should have enough blocks to not be forced into using main pool device blocks. Neil. On 09/09/10 10:36, Neil Perrin wrote: Arne, NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If ever it fails to allocate a block (of the size requested) it is forced to close the txg containing the system call. Yes this can be extremely slow but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge. Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free size and I thought it was 90%. Neil. On 09/09/10 09:00, Arne Jansen wrote: Hi, currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal. What I found out so far: It seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write takes up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero. It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, now the writes get served continuously as it should be. What is the explanation for this behaviour? Is it intentional and can the threshold be tuned? I experienced this on Sol10 U8.
Thanks, Arne
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
On 08/25/10 20:33, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design means we use the minimum number of writes. So corruption of an intent log is not going to generate any errors. I didn't know that. Very interesting. This raises another question ... It's commonly stated, that even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and only detect that the device is failed upon read. So ... If an SSD is in this failure mode, you won't detect it? At bootup, the checksum will simply mismatch, and we'll chug along forward, having lost the data ... (nothing can prevent that) ... but we don't know that we've lost data? - Indeed, we wouldn't know we lost data. Worse yet ... In preparation for the above SSD failure mode, it's commonly recommended to still mirror your log device, even if you have log device removal. If you have a mirror, and the data on each half of the mirror doesn't match each other (one device failed, and the other device is good) ... Do you read the data from *both* sides of the mirror, in order to discover the corrupted log device, and correctly move forward without data loss? Hmm, I need to check, but if we get a checksum mismatch then I don't think we try other mirror(s). This is automatic for the 'main pool', but of course the ZIL code is different by necessity. This problem can of course be fixed. 
(It will be a week and a bit before I can report back on this, as I'm on vacation). Neil.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design, the minimum number of writes needed to add an intent log record is just one. So corruption of an intent log is not going to generate any errors. Neil. On 08/23/10 10:41, StorageConcepts wrote: Hello, we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases. The headline above all our tests is: do we still need to mirror the ZIL with all current fixes in ZFS (zfs can recover zil failure, as long as you don't export the pool; with latest upstream you can also import a pool with a missing zil)? This question is especially interesting with RAM based devices, because they don't wear out, have a very low bit error rate and use one PCIx slot - which are rare. Price is another aspect here :) During our tests we found a strange behaviour of ZFS ZIL failures which are not device related and we are looking for help from the ZFS gurus here :) The test in question is called offline ZIL corruption. The question is, what happens if my ZIL data is corrupted while a server is transported or moved and not properly shut down.
For this we do:
- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL ~ 300 MB from start of disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the transaction number against the one recorded
We ran the test and it seems with modern snv_134 the pool comes up after the corruption with all being OK, while ~1 Transactions (this is some seconds of writes with DDRX1) are missing and nobody knows about this. We ran a scrub and scrub does not even detect this. ZFS automatically repairs the labels on the ZIL, however no error is reported about the missing data. While it is clear to us that if we do not have a mirrored zil, the data we have overwritten in the zil is lost, we are really wondering why ZFS does not REPORT this corruption, silently ignoring it. Is this a bug or .. aehm ... a feature :) ? Regards, Robert
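The end-of-chain behaviour Neil describes above (a checksum mismatch is read as "end of log", so overwritten records vanish silently) can be sketched as a toy model (Python; a plain SHA-256 stands in for the real embedded ZIL checksum):

```python
import hashlib

def make_block(payload: bytes) -> bytes:
    """Prefix each log block with a checksum of its payload (toy model)."""
    return hashlib.sha256(payload).digest() + payload

def replay(chain):
    """Replay blocks in order; a checksum mismatch is treated as the
    end of the chain, not as an error -- so corruption is silent."""
    replayed = []
    for block in chain:
        csum, payload = block[:32], block[32:]
        if hashlib.sha256(payload).digest() != csum:
            break                          # assumed end of chain
        replayed.append(payload)
    return replayed

chain = [make_block(b"rec1"), make_block(b"rec2"), make_block(b"rec3")]
chain[1] = b"\x00" * 32 + b"rec2"          # corrupt the middle block
print(replay(chain))                       # only rec1 replays; rec3 is silently lost
```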
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
On 08/23/10 13:12, Markus Keil wrote: Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corruption area is lost, because the checksum of the first corrupted block doesn't match? - Yes, but you wouldn't want to replay the following entries in case the log records in the missing log block were important (eg create file). Mirroring the slogs is recommended to minimise concerns about slog corruption. Regards, Markus Neil Perrin neil.per...@oracle.com wrote on 23 August 2010 at 19:44: This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design, the minimum number of writes needed to add an intent log record is just one. So corruption of an intent log is not going to generate any errors. Neil. On 08/23/10 10:41, StorageConcepts wrote: Hello, we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases. The headline above all our tests is: do we still need to mirror the ZIL with all current fixes in ZFS (zfs can recover zil failure, as long as you don't export the pool; with latest upstream you can also import a pool with a missing zil)? This question is especially interesting with RAM based devices, because they don't wear out, have a very low bit error rate and use one PCIx slot - which are rare. Price is another aspect here :) During our tests we found a strange behaviour of ZFS ZIL failures which are not device related and we are looking for help from the ZFS gurus here :) The test in question is called offline ZIL corruption.
The question is, what happens if my ZIL data is corrupted while a server is transported or moved and not properly shut down. For this we do:
- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL ~ 300 MB from start of disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the transaction number against the one recorded
We ran the test and it seems with modern snv_134 the pool comes up after the corruption with all being OK, while ~1 Transactions (this is some seconds of writes with DDRX1) are missing and nobody knows about this. We ran a scrub and scrub does not even detect this. ZFS automatically repairs the labels on the ZIL, however no error is reported about the missing data. While it is clear to us that if we do not have a mirrored zil, the data we have overwritten in the zil is lost, we are really wondering why ZFS does not REPORT this corruption, silently ignoring it. Is this a bug or .. aehm ... a feature :) ? Regards, Robert
Re: [zfs-discuss] Debunking the dedup memory myth
On 07/09/10 19:40, Erik Trimble wrote: On 7/9/2010 5:18 PM, Brandon High wrote: On Fri, Jul 9, 2010 at 5:00 PM, Edward Ned Harvey solar...@nedharvey.com wrote: The default ZFS block size is 128K. If you have a filesystem with 128G used, that means you are consuming 1,048,576 blocks, each of which must be checksummed. ZFS uses adler32 and sha256, which means 4 bytes and 32 bytes ... 36 bytes * 1M blocks = an extra 36 Mbytes and some fluff consumed by enabling dedup. I suspect my numbers are off, because 36 Mbytes seems impossibly small. But I hope some sort of similar (and more correct) logic will apply. ;-) I think that DDT entries are a little bigger than what you're using. The size seems to range between 150 and 250 bytes depending on how it's calculated, call it 200b each. Your 128G dataset would require closer to 200M (+/- 25%) for the DDT if your data was completely unique. 1TB of unique data would require 600M - 1000M for the DDT. The numbers are fuzzy of course, and assume only 128k blocks. Lots of small files will increase the memory cost of dedupe, and using it on a zvol that has the default block size (8k) would require 16 times the memory. -B Go back and read several threads last month about ZFS/L2ARC memory usage for dedup. In particular, I've been quite specific about how to calculate estimated DDT size. Richard has also been quite good at giving size estimates (as well as explaining how to see current block size usage in a filesystem). The structure in question is this one: ddt_entry http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108 I'd have to fire up an IDE to track down all the sizes of the ddt_entry structure's members, but I feel comfortable using Richard's 270 bytes-per-entry estimate.
It must have grown a bit, because on 64-bit x86 a ddt_entry is currently 0x178 = 376 bytes:

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs sata sd ip hook neti sockfs arp usba fctl random cpc fcip nfs lofs ufs logindmux ptm sppp ipc ]
> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178
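A back-of-envelope estimator using the 376-byte in-core entry size above (a rough sketch: it assumes fully unique data, one DDT entry per block, and ignores on-disk ZAP overhead):

```python
DDT_ENTRY_SIZE = 376          # bytes per in-core ddt_entry (64-bit x86, per mdb above)

def ddt_ram_estimate(data_bytes, blocksize=128 * 1024):
    """Estimate DDT memory for fully unique data: one entry per block."""
    entries = data_bytes // blocksize
    return entries * DDT_ENTRY_SIZE

# 1 TB of unique 128K blocks -> 8M entries -> about 3 GB of DDT;
# an 8K-recordsize zvol needs 16x that, as noted in the thread.
print(ddt_ram_estimate(2**40) // 2**20, "MiB")
```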
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
On 07/02/10 00:57, Erik Trimble wrote: On 7/1/2010 10:17 PM, Neil Perrin wrote: On 07/01/10 22:33, Erik Trimble wrote: On 7/1/2010 9:23 PM, Geoff Nordli wrote: Hi Erik. Are you saying the DDT will automatically look to be stored in an L2ARC device if one exists in the pool, instead of using ARC? Or is there some sort of memory pressure point where the DDT gets moved from ARC to L2ARC? Thanks, Geoff Good question, and I don't know. My educated guess is the latter (initially stored in ARC, then moved to L2ARC as size increases). Anyone? The L2ARC just holds blocks that have been evicted from the ARC due to memory pressure. The DDT is no different than any other object (e.g. file). So when looking for a block ZFS checks first in the ARC then the L2ARC and if neither succeeds reads from the main pool. - Anyone. That's what I assumed. One further thought, though. Is the DDT treated as a single entity - so it's *all* either in the ARC or in the L2ARC? Or does it move one entry at a time into the L2ARC as it fills the ARC? It's not treated as a single entity but a block at a time. Neil.
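The lookup order Neil describes (ARC first, then L2ARC, then the main pool) can be sketched as follows (Python; hypothetical dictionaries standing in for the caches, not real ARC code):

```python
def read_block(blkid, arc, l2arc, pool):
    """Check the ARC first, then the L2ARC, finally the main pool.
    On an L2ARC or pool hit the block is pulled (back) into the ARC."""
    if blkid in arc:
        return arc[blkid], "arc"
    if blkid in l2arc:
        arc[blkid] = l2arc[blkid]
        return arc[blkid], "l2arc"
    arc[blkid] = pool[blkid]
    return arc[blkid], "pool"
```

A DDT block is looked up the same way as any file block, which is why the DDT migrates to the L2ARC block by block rather than as a single entity.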
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
On 07/02/10 11:14, Erik Trimble wrote: On 7/2/2010 6:30 AM, Neil Perrin wrote: On 07/02/10 00:57, Erik Trimble wrote: That's what I assumed. One further thought, though. Is the DDT treated as a single entity - so it's *all* either in the ARC or in the L2ARC? Or does it move one entry at a time into the L2ARC as it fills the ARC? It's not treated as a single entity but a block at a time. Neil. Where 1 block = ? I'm assuming that more than one DDT entry will fit in a block (since DDT entries are ~270 bytes) - but, how big does the block get? Depending on the total size of the DDT? Or does it use fixed-sized blocks (I'd assume the smallest block possible, in this case)? - Yes, a pool block will contain many DDT entries. They are stored as ZAP entries. I assume, but I'm not sure, that zap blocks grow to the maximum SPA block size (currently 128KB). Which reminds me: the current DDT is stored on disk - correct? - so that when I boot up, ZFS loads a complete DDT into the ARC when the pool is mounted? Or is it all constructed on the fly? - It's read as needed on the fly. Neil.
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
On 07/01/10 22:33, Erik Trimble wrote: On 7/1/2010 9:23 PM, Geoff Nordli wrote: Hi Erik. Are you saying the DDT will automatically look to be stored in an L2ARC device if one exists in the pool, instead of using ARC? Or is there some sort of memory pressure point where the DDT gets moved from ARC to L2ARC? Thanks, Geoff Good question, and I don't know. My educated guess is the latter (initially stored in ARC, then moved to L2ARC as size increases). Anyone? The L2ARC just holds blocks that have been evicted from the ARC due to memory pressure. The DDT is no different than any other object (e.g. file). So when looking for a block ZFS checks first in the ARC then the L2ARC and if neither succeeds reads from the main pool. - Anyone.
Re: [zfs-discuss] size of slog device
On 06/14/10 12:29, Bob Friesenhahn wrote: On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote: It is good to keep in mind that only small writes go to the dedicated slog. Large writes go to main store. A succession of that many small writes (to fill RAM/2) is highly unlikely. Also, the zil is not read back unless the system is improperly shut down. I thought all sync writes, meaning everything NFS and iSCSI, went into the slog - IIRC the docs say so. Check a month or two back in the archives for a post by Matt Ahrens. It seems that larger writes (32k?) are written directly to main store. This is probably a change from the original zfs design. Bob If there's a slog then the data, regardless of size, gets written to the slog. If there's no slog and the data size is greater than zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K) then the data is written as a block into the pool and the block pointer written into the log record. This is the WR_INDIRECT write type. So Matt and Roy are both correct. But wait, there's more complexity!: If logbias=throughput is set we always use WR_INDIRECT. If we just wrote more than 1MB for a single zil commit and there's more than 2MB waiting then we start using the main pool. Clear as mud? This is likely to change again... Neil.
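Neil's decision logic can be condensed into a sketch (Python; illustrative constants and return values, not the actual zil.c code, and the real code has more cases):

```python
ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024   # default zfs/zvol immediate write size

def zil_write_policy(size, have_slog, logbias_throughput=False,
                     just_committed=0, waiting=0):
    """Where the ZIL puts a synchronous write, per the post above:
    - logbias=throughput: always indirect (data block in the main pool)
    - slog present: use the slog, unless >1MB was just committed and
      >2MB is waiting, in which case spill to the main pool
    - no slog and size > 32K: indirect (WR_INDIRECT)
    - otherwise the data is copied into the log record itself."""
    MB = 1024 * 1024
    if logbias_throughput:
        return "WR_INDIRECT"
    if have_slog:
        if just_committed > MB and waiting > 2 * MB:
            return "main-pool"
        return "slog"
    if size > ZFS_IMMEDIATE_WRITE_SZ:
        return "WR_INDIRECT"
    return "WR_NEED_COPY"
```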
Re: [zfs-discuss] size of slog device
On 06/14/10 19:35, Erik Trimble wrote: On 6/14/2010 12:10 PM, Neil Perrin wrote: On 06/14/10 12:29, Bob Friesenhahn wrote: On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote: It is good to keep in mind that only small writes go to the dedicated slog. Large writes go to main store. A succession of that many small writes (to fill RAM/2) is highly unlikely. Also, the zil is not read back unless the system is improperly shut down. I thought all sync writes, meaning everything NFS and iSCSI, went into the slog - IIRC the docs say so. Check a month or two back in the archives for a post by Matt Ahrens. It seems that larger writes (32k?) are written directly to main store. This is probably a change from the original zfs design. Bob If there's a slog then the data, regardless of size, gets written to the slog. If there's no slog and the data size is greater than zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K) then the data is written as a block into the pool and the block pointer written into the log record. This is the WR_INDIRECT write type. So Matt and Roy are both correct. But wait, there's more complexity!: If logbias=throughput is set we always use WR_INDIRECT. If we just wrote more than 1MB for a single zil commit and there's more than 2MB waiting then we start using the main pool. Clear as mud? This is likely to change again... Neil. How do I monitor the amount of live (i.e. non-committed) data in the slog? I'd like to spend some time with my setup, seeing exactly how much I tend to use. I think monitoring the capacity when running zpool iostat -v pool 1 should be fairly accurate. A simple D script can be written to determine how often the ZIL (code) fails to get a slog block and has to resort to allocation in the main pool. One recent change reduced the amount of data written and possibly the slog block fragmentation. This is zpool version 23: Slim ZIL. So be sure to experiment with that.
I'd suspect that very few use cases call for more than a couple (2-4) GB of slog... I agree this is typically true. Of course it depends on your workload. The amount of slog data will reflect the uncommitted synchronous txg data, and the size of each txg will depend on memory size. This area is also undergoing tuning. I'm trying to get hard numbers as I'm working on building a DRAM/battery/flash slog device in one of my friend's electronics prototyping shops. It would be really nice if I could solve 99% of the need with 1 or 2 2GB SODIMMs and the chips from a cheap 4GB USB thumb drive... Sounds like fun. Good luck. Neil.
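A rough sizing rule consistent with the above (a sketch only: the slog needs to cover the synchronous data of the txgs in flight - open, quiescing and syncing - at the default 5-second txg target):

```python
def slog_size_estimate(sync_bytes_per_sec, txg_timeout=5, txgs_in_flight=3):
    """Slog must hold the sync data of up to three txgs (open, quiescing,
    syncing) -- roughly txgs_in_flight * txg_timeout seconds of writes."""
    return sync_bytes_per_sec * txg_timeout * txgs_in_flight

# e.g. a sustained 100 MiB/s of synchronous writes -> about 1.5 GiB of slog,
# in line with the "couple of GB is plenty" intuition above.
print(slog_size_estimate(100 * 1024**2) / 1024**3)
```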
Re: [zfs-discuss] Zpool import not working
On 06/12/10 17:13, zfsnoob4 wrote: Thanks. As I discovered from that post, VB does not have cache flush enabled by default. IgnoreFlush must be explicitly turned off. VBoxManage setextradata VMNAME VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush 0 where VMNAME is the name of your virtual machine. Although I tried that and it returned with no output (indicating it worked), it still won't detect a pool that has been destroyed. Is there any way to detect if flushes are working from inside the OS? Maybe a command that tells you if cacheflush is enabled? Thanks. You also need the -D flag. I could successfully import. This was running the latest bits:

: trasimene ; mkdir /pf
: trasimene ; mkfile 100m /pf/a /pf/b /pf/c
: trasimene ; zpool create whirl /pf/a /pf/b log /pf/c
: trasimene ; zpool destroy whirl
: trasimene ; zpool import -D -d /pf
  pool: whirl
    id: 1406684148029707587
 state: ONLINE (DESTROYED)
action: The pool can be imported using its name or numeric identifier.
config:
        whirl    ONLINE
          /pf/a  ONLINE
          /pf/b  ONLINE
        logs
          /pf/c  ONLINE
: trasimene ; zpool import -D -d /pf whirl
: trasimene ; zpool status whirl
  pool: whirl
 state: ONLINE
  scan: none requested
config:
        NAME     STATE   READ WRITE CKSUM
        whirl    ONLINE     0     0     0
          /pf/a  ONLINE     0     0     0
          /pf/b  ONLINE     0     0     0
        logs
          /pf/c  ONLINE     0     0     0
errors: No known data errors
: trasimene ;

It would, of course, have been easier if you'd been using real devices but I understand you want to experiment first...
Re: [zfs-discuss] Zpool import not working
On 06/11/10 22:07, zfsnoob4 wrote: Hey, I'm running some tests right now before setting up my server. I'm running Nexenta Core 3.02 (RC2, based on opensolaris build 134 I believe) in VirtualBox. To do the test, I'm creating three empty files and then making a raidz mirror: mkfile -n 1g /foo mkfile -n 1g /foo1 mkfile -n 1g /foo2 Then I make a zpool: zpool create testpool raidz /foo /foo1 /foo2 Now I destroy the pool and attempt to restore it: zpool destroy testpool But when I try to list available imports, the list is empty: zpool import -D returns nothing. zpool import testpool also returns nothing. Even if I try to export the pool (so before destroying it): zpool export testpool I see it disappear from the zpool list, but I can't import it (the commands return nothing). Is this due to the fact that I'm using test files instead of real drives? - Yes. zpool import will by default look in /dev/dsk. You need to specify the directory (using -d dir) if your pool devices are located elsewhere. See man zpool. Neil.
Re: [zfs-discuss] creating a fast ZIL device for $200
On 05/26/10 07:10, sensille wrote: Recently, I've been reading through the ZIL/slog discussion and have the impression that a lot of folks here are (like me) interested in getting a viable solution for a cheap, fast and reliable ZIL device. I think I can provide such a solution for about $200, but it involves a lot of development work. The basic idea: the main problem when using a HDD as a ZIL device is the cache flushes in combination with the linear write pattern of the ZIL. This leads to a whole rotation of the platter after each write, because after the first write returns, the head is already past the sector that will be written next. My idea goes as follows: don't write linearly. Track the rotation and write to the position the head will hit next. This might be done by a re-mapping layer or integrated into ZFS. This works only because ZIL devices are basically write-only. Reads from this device will be horribly slow. I have done some testing and am quite enthusiastic. If I take a decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise the synchronous write performance from 166 writes/s to about 2000 writes/s (!). 2000 IOPS is more than sufficient for our production environment. Currently I'm implementing a re-mapping driver for this. The reason I'm writing to this list is that I'd like to find support from the zfs team, find sparring partners to discuss implementation details and algorithms and, most important, find testers! If there is interest it would be great to build an official project around it. I'd be willing to contribute most of the code, but any help will be more than welcome. So, anyone interested? :) -- Arne Jansen Yes, I agree this seems very appealing. I have investigated and observed similar results. Just allocating larger intent log blocks but only writing to say the first half of them has seen the same effect. Despite the impressive results, we have not pursued this further mainly because of its maintainability.
There is quite a variance between drives so, as mentioned, feedback profiling of the device is needed in the working system. The layering of the Solaris IO subsystem doesn't provide the feedback necessary and the ZIL code is layered on the SPA/DMU. Still it should be possible. Good luck! Neil.
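The quoted numbers are consistent with simple rotational arithmetic (a sketch; 10K RPM assumed as for the Hitachi C10K300, and the sector counts below are purely illustrative):

```python
def max_sync_iops_linear(rpm):
    """Linear ZIL writes wait a full platter rotation after each cache
    flush: at best one synchronous write per revolution."""
    return rpm / 60.0

def max_sync_iops_next_sector(rpm, sectors_per_track, sectors_ahead):
    """Writing wherever the head will land next costs only a fraction of
    a rotation per write (sectors_ahead is the angular gap consumed)."""
    rotation_s = 60.0 / rpm
    return 1.0 / (rotation_s * sectors_ahead / sectors_per_track)

print(max_sync_iops_linear(10000))                # ~166 writes/s, as measured
print(max_sync_iops_next_sector(10000, 500, 42))  # ~2000 writes/s ballpark
```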
Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?
On 04/10/10 09:28, Edward Ned Harvey wrote: Neil or somebody? Actual ZFS developers? Taking feedback here? ;-) While I was putting my poor little server through cruel and unusual punishment as described in my post a moment ago, I noticed something unexpected: I expected that while I'm stressing my log device by infinite sync writes, my primary storage devices would also be busy(ish). Not really busy, but not totally idle either. Since the primary storage is a stripe of spindle mirrors, obviously it can handle much more sustainable throughput than the individual log device, but the log device can respond with smaller latency. What I noticed was this: For several seconds, **only** the log device is busy. Then it stops, and for maybe 0.5 secs **only** the primary storage disks are busy. Repeat, recycle. These are the txgs getting pushed out. I expected to see the log device busy nonstop. And the spindle disks blinking lightly. As long as the spindle disks are idle, why wait for a larger TXG to be built? Why not flush out smaller TXG's as long as the disks are idle? Sometimes it's more efficient to batch up requests. Fewer blocks are written. As you mentioned you weren't stressing the system heavily. ZFS will perform differently when under pressure. It will shorten the time between txgs if the data arrives quicker. But worse yet ... During the 1-second (or 0.5 second) that the spindle disks are busy, why stop the log device? (Presumably also stopping my application that's doing all the writing.) Yes, this has been observed by many people. There are two sides to this problem related to the CPU and IO used while pushing a txg: 6806882 need a less brutal I/O scheduler 6881015 ZFS write activity prevents other threads from running in a timely manner The CPU side (6881015) was fixed relatively recently in snv_129. This means, if I'm doing zillions of **tiny** sync writes, I will get the best performance with the dedicated log device present.
But if I'm doing large sync writes, I would actually get better performance without the log device at all. Or else ... add just as many log devices as I have primary storage devices. Which seems kind of crazy. Yes you're right, there are times when it's better to bypass the slog and use the pool disks, which can deliver better bandwidth. The algorithm for where and what the ZIL writes has got quite complex: - There was another change recently to bypass the slog if 1MB had been sent to it and 2MB were waiting to be sent. - There's a new property logbias which, when set to throughput, directs the ZIL to send all of its writes to the main pool devices, thus freeing the slog for more latency sensitive work (ideal for database data files). - If synchronous writes are large (over 32K) and block aligned then the blocks are written directly to the pool and a small record written to the log. Later when the txg commits then the blocks are just linked into the txg. However, this processing is not done if there are any slogs because I found it didn't perform as well. Probably ought to be re-evaluated. - There are further tweaks being suggested and which might make it to a ZIL near you soon. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
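The logbias behaviour Neil describes can be tried directly from the command line; a minimal sketch, assuming a pool named tank with separate database datasets (the dataset names here are illustrative, not from the thread):

```shell
# Database data files: send ZIL writes straight to the main pool,
# keeping the slog free for latency-sensitive work.
zfs set logbias=throughput tank/db-data

# Redo/transaction logs: keep the default, which favours the slog.
zfs set logbias=latency tank/db-log

# Verify the settings:
zfs get logbias tank/db-data tank/db-log
```

These commands require a live ZFS system; they are shown only to make the property concrete.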
Re: [zfs-discuss] Sync Write - ZIL log performance - Feedback for ZFS developers?
On 04/10/10 14:55, Daniel Carosone wrote: On Sat, Apr 10, 2010 at 11:50:05AM -0500, Bob Friesenhahn wrote: Huge synchronous bulk writes are pretty rare since usually the bottleneck is elsewhere, such as the ethernet. Also, large writes can go straight to the pool, and the zil only logs the intent to commit those blocks (ie, link them into the zfs data structure). I don't recall what the threshold for this is, but I think it's one of those Evil Tunables. This is zfs_immediate_write_sz, which is 32K. However this only happens currently if you don't have any slogs. If logbias is set to throughput then all writes go straight to the pool regardless of zfs_immediate_write_sz. Neil.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 04/07/10 09:19, Bob Friesenhahn wrote: On Wed, 7 Apr 2010, Robert Milkowski wrote: it is only read at boot if there is uncommitted data on it - during normal reboots zfs won't read data from the slog. How does zfs know if there is uncommitted data on the slog device without reading it? The minimal read would be quite small, but it seems that a read is still required. Bob If there's ever been synchronous activity then there is an empty tail block (stubby) that will be read even after a clean shutdown. Neil.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 04/07/10 10:18, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Bob Friesenhahn It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims success. If the log device fails to read (oops!), then a mirror would be quite useful. An excellent point. BTW, does the system *ever* read from the log device during normal operation? Such as perhaps during a scrub? It really would be nice to detect failure of log devices in advance, that are claiming to write correctly, but which are really unreadable. A scrub will read the log blocks but only for unplayed logs. Because of the transient nature of the log and because it operates outside of the transaction group model it's hard to read the in-flight log blocks to validate them. There have previously been suggestions to read slogs periodically. I don't know if there's a CR raised for this though. Neil.
Re: [zfs-discuss] Removing SSDs from pool
On 04/05/10 11:43, Andreas Höschler wrote: Hi Khyron, No, he did *not* say that a mirrored SLOG has no benefit, redundancy-wise. He said that YOU do *not* have a mirrored SLOG. You have 2 SLOG devices which are striped. And if this machine is running Solaris 10, then you cannot remove a log device because those updates have not made their way into Solaris 10 yet. You need pool version >= 19 to remove log devices, and S10 does not currently have patches to ZFS to get to pool version >= 19. If your SLOG above were mirrored, you'd have mirror under logs. And you probably would have log not logs - notice the s at the end meaning plural, meaning multiple independent log devices, not a mirrored pair of logs which would effectively look like 1 device. Thanks for the clarification! This is very annoying. My intent was to create a log mirror. I used zpool add tank log c1t6d0 c1t7d0 and this was obviously wrong. Would zpool add tank mirror log c1t6d0 c1t7d0 have done what I intended to do? If so it seems I have to tear down the tank pool and recreate it from scratch!? Can I simply use zpool destroy -f tank to do so? The correct syntax is zpool add tank log mirror c1t6d0 c1t7d0. You can also do it on the create: zpool create tank pool devs log mirror c1t6d0 c1t7d0. Shouldn't need the -f. Thanks, Andreas
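For reference, the syntax discussed above, cleaned up; the log device names are Andreas's, while the data vdev names in the create example are placeholders I've added for illustration:

```shell
# Add a mirrored log to an existing pool:
zpool add tank log mirror c1t6d0 c1t7d0

# Or specify the mirrored log at pool creation time
# (c0t0d0/c0t1d0 are placeholder data devices):
zpool create tank mirror c0t0d0 c0t1d0 log mirror c1t6d0 c1t7d0
```

Note the ordering: the `mirror` keyword follows `log`, so it applies to the log vdev rather than the data vdevs.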
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 04/02/10 08:24, Edward Ned Harvey wrote: The purpose of the ZIL is to act like a fast log for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can anyone claim I can answer this question, I wrote that code, or at least I have read it? I'm one of the ZFS developers. I wrote most of the zil code. Still I don't have all the answers. There are a lot of knowledgeable people on this alias. I usually monitor this alias and sometimes chime in when there's some misinformation being spread, but sometimes the volume is so high. Since I started this reply there have been 20 new posts on this thread alone! Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? - The intent log (separate device(s) or not) is only used by fsync, O_DSYNC, O_SYNC, O_RSYNC. NFS commits are seen by ZFS as fsyncs. Note sync(1m) and sync(2s) do not use the intent log. They force transaction group (txg) commits on all pools. So zfs goes beyond the requirement for sync(), which only requires that the writing be scheduled, not necessarily completed, before returning. The zfs interpretation is rather expensive, but the alternative seemed broken, so we fixed it. Is it ever used to accelerate async writes? The zil is not used to accelerate async writes. Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? 
Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync. Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2 then yes the order is guaranteed regardless of whether W1 or W2 are synchronous or asynchronous. Of course if the system crashes then the async operations might not be there. I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct? - Kind of. The uberblock contains the root of the txg. At boot time, or zpool import time, what is taken to be the current filesystem? The latest uberblock? Something else? A txg is for the whole pool which can contain many filesystems. The latest txg defines the current state of the pool and each individual fs. My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. Correct (except replace sync() with O_DSYNC, etc). This also assumes hardware that, for example, handles correctly the flushing of its caches. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relevant to other sync writes. (2) In the event of OS halting or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. 
That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data. The ZIL doesn't make such guarantees. It's the DMU that handles transactions and their grouping into txgs. It ensures that writes are committed in order by its transactional nature. The function of the zil is to merely ensure that synchronous operations are stable and replayed after a crash/power fail onto the latest txg. Based on this understanding, if you disable ZIL, then there is no guarantee about order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes. No, disabling the ZIL does not disable the DMU. Somebody, (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you
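The synchronous-write semantics discussed above (fsync/O_DSYNC, not sync(2)) can be exercised from an application. A minimal Python sketch, with an illustrative temp-file path; on ZFS, a write through a descriptor opened with O_DSYNC is exactly the kind of operation the intent log makes stable before the call returns:

```python
import os
import tempfile

# O_DSYNC makes each write(2) synchronous: the call returns only after
# the data has reached stable storage (on ZFS, via the intent log).
path = os.path.join(tempfile.mkdtemp(), "data")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, b"stable on return")
os.close(fd)

with open(path, "rb") as f:
    assert f.read() == b"stable on return"
```

A write without O_DSYNC (or without a following fsync) is asynchronous and only becomes stable when the txg commits.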
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 03/30/10 20:00, Bob Friesenhahn wrote: On Tue, 30 Mar 2010, Edward Ned Harvey wrote: But the speedup of disabling the ZIL altogether is appealing (and would probably be acceptable in this environment). Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed cpu, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-) I believe that the above is wrong information as long as the devices involved do flush their caches when requested to. Zfs still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their cache. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost. If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption. Bob Yes Bob is correct - that is exactly how it works. Neil.
Re: [zfs-discuss] ZFS performance benchmarks in various configurations
If I understand correctly, ZFS nowadays will only flush data to non-volatile storage (such as a RAID controller NVRAM), and not all the way out to disks. (To solve performance problems with some storage systems, and I believe that it also is the right thing to do under normal circumstances.) Doesn't this mean that if you enable write back, and you have a single, non-mirrored raid-controller, and your raid controller dies on you so that you lose the contents of the nvram, you have a potentially corrupt file system? ZFS requires that all writes be flushed to non-volatile storage. This is needed both for transaction group (txg) commits, to ensure pool integrity, and for the ZIL, to satisfy the synchronous requirement of fsync/O_DSYNC etc. If the caches weren't flushed then it would indeed be quicker, but the pool would be susceptible to corruption. Sadly some hardware doesn't honour cache flushes and this can cause corruption. Neil.
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
On 02/09/10 08:18, Kjetil Torgrim Homme wrote: Richard Elling richard.ell...@gmail.com writes: On Feb 8, 2010, at 9:10 PM, Damon Atkins wrote: I would have thought that if I write 1k then the ZFS txg times out in 30 secs, then the 1k will be written to disk in a 1k record block, and then if I write 4k then 30 secs later when the txg happens another 4k record size block will be written, and then if I write 130k a 128k and 2k record block will be written. Making the file have record sizes of 1k+4k+128k+2k Close. Once the max record size is achieved, it is not reduced. So the allocation is: 1KB + 4KB + 128KB + 128KB I think the above is easily misunderstood. I assume the OP means append, not rewrites, and in that case (with recordsize=128k): * after the first write, the file will consist of a single 1 KiB record. * after the first append, the file will consist of a single 5 KiB record. Good so far. * after the second append, one 128 KiB record and one 7 KiB record. A long time ago we used to write short tail blocks, but not any more. So after the 2nd append we actually have 2 128KB blocks. In each of these operations, the *whole* file will be rewritten to a new location, but after a third append, only the tail record will be rewritten. So after the third append we'd actually have 3 128KB blocks. The first doesn't need to be re-written. Neil.
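Neil's correction (no short tail blocks once the file exceeds the recordsize) can be captured in a toy model. This is a sketch, assuming recordsize=128K; the function name and the ceiling-division rounding are my own, not ZFS code:

```python
RECORDSIZE = 128 * 1024  # assumed recordsize=128k, as in the thread

def records_after(total_bytes):
    """Toy model: on-disk record sizes for a file of total_bytes.
    A file at or below the recordsize is a single (possibly short)
    record; beyond that, every record is a full 128 KiB block."""
    if total_bytes <= RECORDSIZE:
        return [total_bytes]
    nblocks = -(-total_bytes // RECORDSIZE)  # ceiling division
    return [RECORDSIZE] * nblocks

# The sequence from the thread: write 1k, append 4k, append 130k.
assert records_after(1024) == [1024]                  # one 1 KiB record
assert records_after(5 * 1024) == [5 * 1024]          # one 5 KiB record
assert records_after(135 * 1024) == [RECORDSIZE] * 2  # two full 128 KiB blocks
```

The last assertion is the point of Neil's reply: the 135 KiB file occupies two full 128 KiB records, not a 128 KiB record plus a 7 KiB tail.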
Re: [zfs-discuss] ZIL to disk
On 01/15/10 12:59, Jeffry Molanus wrote: Sometimes people get confused about the ZIL and separate logs. For sizing purposes, the ZIL is a write-only workload. Data which is written to the ZIL is later asynchronously written to the pool when the txg is committed. Right; the txg needs time to transfer the ZIL. I think you misunderstand the function of the ZIL. It's not a journal, and doesn't get transferred to the pool as of a txg. It's only ever written, except after a crash when it's read to do replay. See: http://blogs.sun.com/perrin/entry/the_lumberjack The ZFS write performance for this configuration should consistently be greater than 80 IOPS. We've seen measurements in the 600 write IOPS range. Why? Because ZFS writes tend to be contiguous. Also, with the SATA disk write cache enabled, bursts of writes are handled quite nicely. -- richard Is there a method to determine this value before pool configuration? Some sort of rule of thumb? It would be sad when you configure the pool and have to reconfigure later on because you discover the pool can't handle the txg commits from SSD to disk fast enough. In other words; with Y as expected load you would require a minimum of X mirror vdevs or X raid-z vdevs in order to have a pool with enough bandwidth/IO to flush the ZIL without stalling the system. Jeffry
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Hi Adam, So was FW aware of this or in contact with these guys? Also are you requesting/ordering any of these cards to evaluate? The device seems kind of small at 4GB, and uses a double wide PCI Express slot. Neil. On 01/13/10 12:27, Adam Leventhal wrote: Hey Chris, The DDRdrive X1 OpenSolaris device driver is now complete, please join us in our first-ever ZFS Intent Log (ZIL) beta test program. A select number of X1s are available for loan, preferred candidates would have a validation background and/or a true passion for torturing new hardware/driver :-) We are singularly focused on the ZIL device market, so a test environment bound by synchronous writes is required. The beta program will provide extensive technical support and a unique opportunity to have direct interaction with the product designers. Congratulations! This is great news for ZFS. I'll be very interested to see the results members of the community can get with your device as part of their pool. COMSTAR iSCSI performance should be dramatically improved in particular. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] zfs on ssd
On 12/11/09 14:56, Bill Sommerfeld wrote: On Fri, 2009-12-11 at 13:49 -0500, Miles Nordin wrote: sh == Seth Heeren s...@zfs-fuse.net writes: sh If you don't want/need log or cache, disable these? You might sh want to run your ZIL (slog) on ramdisk. seems quite silly. why would you do that instead of just disabling the ZIL? I guess it would give you a way to disable it pool-wide instead of system-wide. A per-filesystem ZIL knob would be awesome. for what it's worth, there's already a per-filesystem ZIL knob: the logbias property. It can be set either to latency or throughput. That's a bit different. logbias controls whether the intent log blocks go to the main pool or the log devices (if they exist). I think Miles was requesting a per-fs knob to disable writing any log blocks. A proposal for this exists that suggests a new sync property: everything synchronous; everything not synchronous (ie zil disabled on the fs); and the current behaviour (the default). The RFE is: 6280630 zil synchronicity My problem with implementing this is that people might actually use it! Well actually my concern is more that it will be misused. Neil.
Re: [zfs-discuss] Planed ZFS-Features - Is there a List or something else
On 12/09/09 13:52, Glenn Lagasse wrote: * R.G. Keen (k...@geofex.com) wrote: I didn't see remove a simple device anywhere in there. Is it: too hard to even contemplate doing, or too silly a thing to do to even consider letting that happen or too stupid a question to even consider or too easy and straightforward to do the procedure I see recommended (export the whole pool, destroy the pool, remove the device, remake the pool, then reimport the pool) to even bother with? You missed: Too hard to do correctly with current resource levels and other higher priority work. As always, volunteers I'm sure are welcome. :-) This gives the impression that development is not actively working on it. This is not true. As has been said often it is a difficult problem and has been actively worked on for a few months now. I don't think we are prepared to give a date as to when it will be delivered though. Neil.
Re: [zfs-discuss] [zfs-code] Transaction consistency of ZFS
On 12/06/09 10:11, Anurag Agarwal wrote: Hi, My reading of the write code of ZFS (zfs_write in zfs_vnops.c) is that all writes in zfs are logged in the ZIL. Each write gets recorded in memory in case it needs to be forced out later (eg fsync()), but is not written to the on-disk log until then or until the transaction group which contains the write commits, in which case the in-memory transaction is discarded. And if that indeed is the case, then yes, ZFS does guarantee sequential consistency, even when there is a power outage or server crash. You might lose some writes if the ZIL has not committed to disk. But that would not change the sequential consistency guarantee. There is no need to do an fsync or open the file with O_SYNC. It should work as it is. I have not done any experiments to verify this, so please take my observation with a pinch of salt. Any ZFS developers care to verify or refute this? Regards, Anurag.
Re: [zfs-discuss] Transaction consistency of ZFS
I'll try to find out whether ZFS binds the same file always to the same opening transaction group. Not sure what you mean by this. Transactions (eg writes) will go into the current open transaction group (txg). Subsequent writes may enter the same or a future txg. Txgs are obviously committed in order. So writes are not committed out of order. The txg commit is all or nothing, so on a crash you get to see all the transactions in that txg or none. I think this answers your original question/concern. If so, I guess my assumption here would be true. Seems like there is only one opening transaction group at any time. Can anybody give me a definitive answer here? ZFS uses a 3 stage transaction model: Open, Quiescing and Syncing. Transactions enter in Open. Quiescing is where a new Open stage has started and waits for transactions that have yet to commit to finish. Syncing is where all the completed transactions are pushed to the pool in an atomic manner, with the last write being the root of the new tree of blocks (uberblock). All the guarantees assume good hardware. As part of the new uberblock update we flush the write caches of the pool devices. If this is broken all bets are off. Neil.
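The three-stage model Neil describes can be sketched as a toy pipeline. This is purely illustrative (the class and method names are invented, and real txgs advance on timers and memory pressure, not an explicit call); it only demonstrates the ordering and atomicity properties:

```python
class TxgPipeline:
    """Toy model of the Open -> Quiescing -> Syncing txg stages.
    Transactions enter the open txg; each advance() moves every stage
    forward one step; a txg commits whole (all or nothing), in order."""

    def __init__(self):
        self.open_txg = []    # transactions in the currently open txg
        self.stages = []      # txgs that are quiescing / syncing
        self.committed = []   # everything on stable storage, in txg order

    def write(self, tx):
        self.open_txg.append(tx)

    def advance(self):
        self.stages.append(self.open_txg)  # open txg starts quiescing
        self.open_txg = []
        if len(self.stages) > 2:           # oldest txg finished syncing
            self.committed.extend(self.stages.pop(0))

p = TxgPipeline()
p.write("w1")
p.write("w2")
p.advance()
p.advance()
p.advance()
# Both writes appear together, in order: the txg committed atomically.
assert p.committed == ["w1", "w2"]
```

A crash before the final stage would leave committed unchanged, which mirrors the "all the transactions in that txg or none" guarantee.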
Re: [zfs-discuss] zfs on ssd
On 12/05/09 01:36, anu...@kqinfotech.com wrote: Hi, What you say is probably right with respect to L2ARC, but logging (ZIL or database log) is required for consistency purposes. No, the ZIL is not required for consistency. The pool is fully consistent without the ZIL. See http://blogs.sun.com/perrin/entry/the_lumberjack for more details. Neil.
Re: [zfs-discuss] Separate Zil on HDD ?
On 12/03/09 09:21, mbr wrote: Hello, Bob Friesenhahn wrote: On Thu, 3 Dec 2009, mbr wrote: What about the data that were on the ZILlog SSD at the time of failure, is a copy of the data still in the machine's memory from where it can be used to put the transaction to the stable storage pool? The intent log SSD is used as 'write only' unless the system reboots, in which case it is used to support recovery. The system memory is used as the write path in the normal case. Once the data is written to the intent log, then the data is declared to be written as far as higher level applications are concerned. thank you Bob for the clarification. So I don't need a mirrored ZILlog for security reasons, all the information is still in memory and will be used from there by default if only the ZILlog SSD fails. Mirrored log devices are advised to improve reliability. As previously mentioned, if during writing a log device fails or is temporarily full then we use the main pool devices to chain the log blocks. If we get read errors when trying to replay the intent log (after a crash/power fail) then the admin is given the option to ignore the log and continue, or somehow fix the device (eg re-attach) and then retry. Multiple log devices would provide extra reliability here. We do not look in memory for the log records if we can't get the records from the log blocks. If the intent log SSD fails and the system spontaneously reboots, then data may be lost. I can live with the data loss as long as the machine comes up with the faulty ZILlog SSD but otherwise without probs and with a clean zpool. The log records are not required for consistency of the pool (it's not a journal). Has the following error no consequences? Bug ID 6538021 Synopsis Need a way to force pool startup when zil cannot be replayed State 3-Accepted (Yes, that is a problem) Link http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6538021 Er, that bug should probably be closed as a duplicate. 
We now have this functionality. Michael.
Re: [zfs-discuss] Is write(2) made durable atomically?
Under the hood in ZFS, writes are committed using either shadow paging or logging, as I understand it. So I believe that I mean to ask whether a write(2), pushed to the ZPL, and pushed on down the stack, can be split into multiple transactions? Or, instead, is it guaranteed to be committed in a single transaction, and so committed atomically? A write made through the ZPL (zfs_write()) will be broken into transactions that contain at most 128KB of user data. So a large write could well be split across transaction groups, and thus committed separately. Neil.
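Neil's answer can be illustrated with a small sketch of the chunking: a single large write is broken into pieces of at most 128 KiB, and any of those pieces may land in different txgs. The function below is illustrative, not the actual zfs_write() logic:

```python
MAX_TX = 128 * 1024  # per Neil: at most 128KB of user data per transaction

def split_write(offset, data):
    """Toy model: split one write(2) into (offset, length) transactions
    of at most MAX_TX bytes each. Adjacent txgs may commit them separately."""
    txs = []
    while data:
        chunk, data = data[:MAX_TX], data[MAX_TX:]
        txs.append((offset, len(chunk)))
        offset += len(chunk)
    return txs

# A 300 KiB write becomes three transactions; the last one is short.
assert split_write(0, b"x" * (300 * 1024)) == [
    (0, 131072), (131072, 131072), (262144, 45056)]
```

So a 300 KiB write is not atomic at the transaction level: the first two chunks could commit in one txg and the tail in the next.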
Re: [zfs-discuss] ZFS and NFS
On 11/18/09 12:21, Joe Cicardo wrote: Hi, My customer says: Application has NFS directories with millions of files in a directory, and this can't be changed. We are having issues with the EMC appliance and RPC timeouts on the NFS lookup. What I am looking at doing is moving one of the major NFS exports to a Sun 25k using VCS to cluster a ZFS RAIDZ that is then NFS exported. For performance I am looking at disabling the ZIL, since these files have almost identical names. I think there's some confusion about the function of the ZIL, because having files with identical names is irrelevant to the ZIL. Perhaps the customer is thinking of the DNLC, which is a cache of name lookups. The ZIL does handle changes to these NFS files though, as the NFS protocol requires they be on stable storage after most NFS operations. We don't recommend disabling the ZIL as this can lead to user data integrity issues. This is not the same as zpool corruption. One way to speed up the ZIL is to use an SSD as a separate log device. You can check how much activity is going through the ZIL by running zilstat: http://www.richardelling.com/Home/scripts-and-programs-1/zilstat Neil.
Re: [zfs-discuss] Does ZFS work with SAN-attached devices?
Also, ZFS does things like putting the ZIL data (when not on a dedicated device) at the outer edge of disks, that being faster. No, ZFS does not do that. It will chain the intent log from blocks allocated from the same metaslabs that the pool is allocating from. This actually works out well because there isn't a large seek back to the beginning of the device. When the pool gets near full then there will be a noticeable slowness - but then all file system performance suffers when searching for space. When the log is on a separate device it uses the same allocation scheme, but those blocks will tend to be allocated at the outer edge of the disk. They only exist for a short time before getting freed, so the same blocks get re-used. Neil.
Re: [zfs-discuss] periodic slow responsiveness
On 09/25/09 16:19, Bob Friesenhahn wrote: On Fri, 25 Sep 2009, Ross Walker wrote: Problem is most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput. Who said that the slog SSD is written to in 128K chunks? That seems wrong to me. Previously we were advised that the slog is basically a log of uncommitted system calls, so the size of the data chunks written to the slog should be similar to the data sizes in the system calls. Log blocks are variable in size, dependent on what needs to be committed. The minimum size is 4KB and the max 128KB. Log records are aggregated and written together as much as possible. Neil.
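A toy model of the sizing rule Neil gives: log blocks are variable-sized, clamped to the range [4 KiB, 128 KiB]. The clamp values come from his reply; the exact rounding of in-between sizes is an assumption of this sketch:

```python
MIN_LOG_BLK = 4 * 1024    # minimum log block size, per the thread
MAX_LOG_BLK = 128 * 1024  # maximum log block size, per the thread

def log_block_size(bytes_needed):
    """Illustrative clamp: a log block is sized to the aggregated records
    it carries, but never smaller than 4 KiB or larger than 128 KiB."""
    return max(MIN_LOG_BLK, min(bytes_needed, MAX_LOG_BLK))

assert log_block_size(512) == 4096        # a tiny commit still uses 4 KiB
assert log_block_size(64 * 1024) == 65536 # mid-range sizes pass through
assert log_block_size(1 << 20) == 131072  # large commits cap at 128 KiB
```

This is why benchmarking an slog SSD with 4 MB sustained-throughput numbers is misleading: the device never sees a single write larger than 128 KiB from the ZIL.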
Re: [zfs-discuss] How to verify if the ZIL is disabled
On 09/23/09 10:59, Scott Meilicke wrote: How can I verify if the ZIL has been disabled or not? I am trying to see how much benefit I might get by using an SSD as a ZIL. I disabled the ZIL via the ZFS Evil Tuning Guide: echo zil_disable/W0t1 | mdb -kw - this only temporarily disables the zil until reboot. In fact it has no effect unless file systems are remounted, as the variable is only looked at on mount. and then rebooted. However, I do not see any benefits for my NFS workload. To set zil_disable at boot put the following in /etc/system and reboot: set zfs:zil_disable=1 Neil
Re: [zfs-discuss] lots of zil_clean threads
Nils, A zil_clean() is started for each dataset after every txg. This includes snapshots (which is perhaps a bit inefficient). Still, zil_clean() is fairly lightweight if there's nothing to do (grab a non-contended lock; find nothing on a list; drop the lock; exit). Neil. On 09/21/09 08:08, Nils Goroll wrote: Hi All, out of curiosity: Can anyone come up with a good idea about why my snv_111 laptop computer should run more than 1000 zil_clean threads? ff0009a9dc60 fbc2c0300 tq:zil_clean ff0009aa3c60 fbc2c0300 tq:zil_clean ff0009aa9c60 fbc2c0300 tq:zil_clean ff0009aafc60 fbc2c0300 tq:zil_clean ff0009ab5c60 fbc2c0300 tq:zil_clean ff0009abbc60 fbc2c0300 tq:zil_clean ff0009ac1c60 fbc2c0300 tq:zil_clean ::threadlist!grep zil_clean| wc -l 1037 Thanks, Nils P.S.: Please don't spend too much time on this, for me, this question is really academic - but I'd be grateful for any good answers.
Re: [zfs-discuss] lots of zil_clean threads
Thinking more about this I'm confused about what you are seeing. The function dsl_pool_zil_clean() will serialise separate calls to zil_clean() within a pool. I don't expect you have 1037 pools on your laptop! So I don't know what's going on. What is the typical call stack for those zil_clean() threads? Neil. On 09/21/09 08:53, Neil Perrin wrote: Nils, A zil_clean() is started for each dataset after every txg. this includes snapshots (which is perhaps a bit inefficient). Still, zil_clean() is fairly lightweight if there's nothing to do (grab a non contended lock; find nothing on a list; drop the lock exit). Neil. On 09/21/09 08:08, Nils Goroll wrote: Hi All, out of curiosity: Can anyone come up with a good idea about why my snv_111 laptop computer should run more than 1000 zil_clean threads? ff0009a9dc60 fbc2c0300 tq:zil_clean ff0009aa3c60 fbc2c0300 tq:zil_clean ff0009aa9c60 fbc2c0300 tq:zil_clean ff0009aafc60 fbc2c0300 tq:zil_clean ff0009ab5c60 fbc2c0300 tq:zil_clean ff0009abbc60 fbc2c0300 tq:zil_clean ff0009ac1c60 fbc2c0300 tq:zil_clean ::threadlist!grep zil_clean| wc -l 1037 Thanks, Nils P.S.: Please don't spend too much time on this, for me, this question is really academic - but I'd be grateful for any good answers.
Re: [zfs-discuss] Pulsing write performance
On 09/04/09 09:54, Scott Meilicke wrote: Roch Bourbonnais Wrote: 100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds. This indicates that the bandwidth you're able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since this is not sustainable, you see ZFS trying to balance the two numbers. When I have tested using 50% reads, 60% random using iometer over NFS, I can see the data going straight to disk due to the sync nature of NFS. But I also see writes coming to a standstill every 10 seconds or so, which I have attributed to the ZIL dumping to disk. Therefore I conclude that it is the process of dumping the ZIL to disk that (mostly?) blocks writes during the dumping. The ZIL does not work like that. It is not a journal. Under a typical write load, write transactions are batched and written out in a group transaction (txg). This txg sync occurs every 30s under light load but more frequently or continuously under heavy load. When writing synchronous data (eg NFS) the transactions get written immediately to the intent log and are made stable. When the txg later commits, the intent log blocks containing those committed transactions can be freed. So as you can see there is no periodic dumping of the ZIL to disk. What you are probably observing is the periodic txg commit. Hope that helps: Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
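The lifecycle Neil describes can be sketched in a few lines (an illustrative Python model of the behavior described above, not ZFS source; the class and field names are invented): synchronous writes are logged to the intent log immediately, everything rides the next txg commit, and the commit frees the intent-log records it has made redundant.

```python
# Simplified sketch of the txg/intent-log lifecycle described above.
# NOT ZFS code -- names (Pool, dirty, intent_log) are invented.

class Pool:
    def __init__(self):
        self.intent_log = []   # stable records for synchronous writes
        self.dirty = []        # data cached in memory awaiting the txg
        self.on_disk = []      # pool state after txg commits

    def write(self, data, sync=False):
        self.dirty.append(data)
        if sync:                          # e.g. an NFS write or fsync()
            self.intent_log.append(data)  # made stable before returning

    def txg_commit(self):
        # The periodic group commit (every 30s under light load,
        # more often under heavy load).
        self.on_disk.extend(self.dirty)
        self.dirty.clear()
        self.intent_log.clear()  # committed records can now be freed

pool = Pool()
pool.write("async block")
pool.write("sync block", sync=True)
assert pool.intent_log == ["sync block"]  # only sync data hits the log
pool.txg_commit()
assert pool.on_disk == ["async block", "sync block"]
assert pool.intent_log == []  # freed at commit -- never "dumped" to the pool
```

The point of the sketch: the pause Scott sees lines up with `txg_commit`, not with any replay of the intent log, which is only ever read after a crash.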
Re: [zfs-discuss] Ssd for zil on a dell 2950
On 08/20/09 06:41, Greg Mason wrote: Something our users do quite a bit of is untarring archives with a lot of small files. Also, many small, quick writes are one of the many workloads our users have. Real-world test: our old Linux-based NFS server allowed us to unpack a particular tar file (the source for boost 1.37) in around 2-4 minutes, depending on load. This machine wasn't special at all, but it had fancy SGI disk on the back end, and was using the Linux-specific async NFS option. I'm glad you mentioned this option. It turns all synchronous requests from the client into async, allowing the server to return immediately without making the data stable. This is the equivalent of setting zil_disable. Async used to be the default behaviour. It must have been a shock to Linux users when suddenly NFS slowed down when synchronous became the default! I wonder what the perf numbers were without the async option. We turned up our X4540s, and this same tar unpack took over 17 minutes! We disabled the ZIL for testing, and we dropped this to under 1 minute. With the X25-E as a slog, we were able to run this test in 2-4 minutes, same as the old storage. That's pretty impressive. So with an X25-E slog ZFS is as fast synchronously as your previous hardware was asynchronously - but with no risk of data corruption. Of course the hardware is different so it's not really apples to apples. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs fragmentation
On 08/07/09 10:54, Scott Meilicke wrote: ZFS absolutely observes synchronous write requests (e.g. by NFS or a database). The synchronous write requests do not benefit from the long write aggregation delay so the result may not be written as ideally as ordinary write requests. Recently zfs has added support for using a SSD as a synchronous write log, and this allows zfs to turn synchronous writes into more ordinary writes which can be written more intelligently while returning to the user with minimal latency. Bob, since the ZIL is used always, whether a separate device or not, won't writes to a system without a separate ZIL also be written as intelligently as with a separate ZIL? - Yes. ZFS uses the same code path (intelligence?) to write out the data from NFS - regardless of whether there's a separate log (slog) or not. Thanks, Scott ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How Virtual Box handles the IO
I understand that the ZILs are allocated out of the general pool. There is one intent log chain per dataset (file system or zvol). The head of each log chain is kept in the main pool. Without slog(s) we allocate (and chain) blocks from the main pool. If separate intent log(s) exist then blocks are allocated and chained there. If we fail to allocate from the slog(s) then we revert to allocating from the main pool. Is there a ZIL for the ZILs, or does this make no sense? There is no ZIL for the ZILs. Note the ZIL is not a journal (like ext3 or ufs logging). It simply contains records of system calls (including data) that need to be replayed if the system crashes and those records have not been committed in a transaction group. Hope that helps: Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
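The allocation policy described above (prefer the slog, fall back to the main pool) can be sketched as follows. This is a hedged illustration, not the ZFS allocator; the function and dictionary fields are invented for the example.

```python
# Sketch of the log-block allocation policy described in the message:
# intent-log blocks come from the separate log device(s) when present,
# and allocation reverts to the main pool if the slog(s) are full.
# Hypothetical code -- not the ZFS metaslab allocator.

def allocate_log_block(slogs, main_pool, size):
    for dev in slogs:                 # prefer the separate intent log(s)
        if dev["free"] >= size:
            dev["free"] -= size
            return dev["name"]
    main_pool["free"] -= size         # revert to the main pool
    return main_pool["name"]

slogs = [{"name": "ssd0", "free": 100}]
main = {"name": "pool", "free": 10_000}

assert allocate_log_block(slogs, main, 64) == "ssd0"  # slog has room
assert allocate_log_block(slogs, main, 64) == "pool"  # slog full: fall back
```

The fallback means a full or failed slog degrades log-write latency back to main-pool speeds rather than failing the synchronous write.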
Re: [zfs-discuss] zpool iostat and iostat discrepancy
On 06/20/09 11:14, tester wrote: Hi, Does anyone know the difference between zpool iostat and iostat? dd if=/dev/zero of=/test/test1/trash count=1 bs=1024k;sync pool only shows 236K IO and 13 write ops, whereas iostat correctly shows a meg of activity. The zfs numbers are per second as well. So 236K * 5 = 1180K. zpool iostat -v test 1 would make this clearer. The iostat output below also shows 237K (88+37+112) being written per second. I'm not sure why any reads occurred though. When I did a quick experiment there were no reads. Enabling compression gives much better numbers when writing zeros! Neil.

zpool iostat -v test 5

                                   capacity     operations    bandwidth
pool                             used  avail   read  write   read  write
-------------------------------  ----  -----  -----  -----  -----  -----
test                            1.14M   100G      0     13      0   236K
  c8t60060E800475F50075F50525d0   182K  25.0G      0      4      0  36.8K
  c8t60060E800475F50075F50526d0   428K  25.0G      0      4      0  87.7K
  c8t60060E800475F50075F50540d0   558K  50.0G      0      4      0   111K
-------------------------------  ----  -----  -----  -----  -----  -----

iostat -xnz [devices] 5

                    extended device statistics
  r/s   w/s  kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  2.4   6.0   6.8   88.2   0.0   0.0     0.0     1.0   0   0  c8t60060E800475F50075F50540d0
  2.4   5.4   6.8   37.0   0.0   0.0     0.0     0.9   0   0  c8t60060E800475F50075F50526d0
  2.4   5.0   6.8  112.0   0.0   0.0     0.0     0.9   0   0  c8t60060E800475F50075F50525d0

dtrace also concurs with iostat

device                                                 bytes   IOPS
/devices/scsi_vhci/s...@g60060e800475f50075f50525:a   224416     35
/devices/scsi_vhci/s...@g60060e800475f50075f50526:a   486560     37
/devices/scsi_vhci/s...@g60060e800475f50075f50540:a   608416     33

Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
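The arithmetic behind Neil's answer, spelled out (a trivial illustrative snippet, values taken from the message above): zpool iostat reports per-second averages over the sampling interval, so the totals only match once you multiply by the interval length.

```python
# zpool iostat figures are per-second averages over the 5s interval,
# so multiply by the interval to compare with the total bytes written.
# Values are the ones quoted in the thread above.

interval_s = 5
zpool_write_bw_kps = 236            # KB/s reported by zpool iostat
total_kb = zpool_write_bw_kps * interval_s
print(total_kb)                     # 1180 KB, i.e. roughly the 1 MB dd wrote

iostat_per_device_kps = [88, 37, 112]      # kw/s columns from iostat -xnz
assert sum(iostat_per_device_kps) == 237   # agrees with zpool's 236K/s
```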
Re: [zfs-discuss] Degraded log device in zpool status output
Will, This is bug: 6710376 "log device can show incorrect status when other parts of pool are degraded". This is just an error in the reporting. There was nothing actually wrong with the log device. It is picking up the degraded status from the rest of the pool. The bug was fixed only yesterday and checked into snv_114. Neil. On 04/18/09 23:52, Will Murnane wrote: I have a pool, huge, composed of one six-disk raidz2 vdev and a log device. I failed to plug in one disk when I took the machine down to plug in the log device, and booted all the way before I realized this, so the raidz2 vdev was rightly listed as degraded. Then I brought the machine down, plugged the disk in, and brought it back up. I ran zpool scrub huge to make sure that the missing disk was completely synced. After a few minutes, zpool status huge showed this:

$ zpool status huge
  pool: huge
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 0h8m, 1.19% done, 11h15m to go
config:

        NAME        STATE     READ WRITE CKSUM
        huge        DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c4t4d0  DEGRADED     0     0    15  too many errors
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
        logs        DEGRADED     0     0     0
          c7d1      ONLINE       0     0     0

errors: No known data errors

I understand that not all of the blocks may have been synced onto c4t4d0 (the missing disk), so some checksum errors are normal there. But the log disk reports no errors, and its sole component reports none either, yet the log device is marked as degraded. To see what would happen, I executed this:

$ pfexec zpool clear huge c4t4d0
$ zpool status huge
  pool: huge
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 0h12m, 1.87% done, 10h32m to go
config:

        NAME        STATE     READ WRITE CKSUM
        huge        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     2
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
        logs        ONLINE       0     0     0
          c7d1      ONLINE       0     0     0

errors: No known data errors

So clearing the errors from one device has an effect on the status of another device? Is this expected behavior, or is something wrong with my log device? I'm running snv_111. Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
On 04/10/09 20:15, Toby Thain wrote: On 10-Apr-09, at 5:05 PM, Mark J Musante wrote: On Fri, 10 Apr 2009, Patrick Skerrett wrote: degradation) when these write bursts come in, and if I could buffer them even for 60 seconds, it would make everything much smoother. ZFS already batches up writes into a transaction group, which currently happens every 30 seconds. Isn't that 5 seconds? It used to be, and it may still be for what you are running. However, Mark is right, it is now 30 seconds. In fact 30s is the maximum. The actual time will depend on load. If the pool is heavily used then the txg's fire more frequently. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
Patrick, The ZIL is only used for synchronous requests like O_DSYNC/O_SYNC and fsync(). Your iozone command must be doing some synchronous writes. All the other tests (dd, cat, cp, ...) do everything asynchronously. That is, they do not require the data to be on stable storage on return from the write. So asynchronous writes get cached in memory (the ARC) and written out periodically (every 30 seconds or less) when the transaction group commits. The ZIL would be heavily used if your system were an NFS server. Databases also do synchronous writes. Neil. On 04/09/09 15:13, Patrick Skerrett wrote: Hi folks, I would appreciate it if someone can help me understand some weird results I'm seeing with trying to do performance testing with an SSD offloaded ZIL. I'm attempting to improve my infrastructure's burstable write capacity (ZFS based WebDav servers), and naturally I'm looking at implementing SSD based ZIL devices. I have a test machine with the crummiest hard drive I can find installed in it, Quantum Fireball ATA-100 4500RPM 128K cache, and an Intel X25-E 32gig SSD drive. I'm trying to do A-B comparisons and am coming up with some very odd results: The first test involves doing IOZone write testing on the fireball standalone, the SSD standalone, and the fireball with the SSD as a log device. My test command is: time iozone -i 0 -a -y 64 -q 1024 -g 32M Then I check the time it takes to complete this operation in each scenario: Fireball alone - 2m15s (told you it was crappy) SSD alone - 0m3s Fireball + SSD zil - 0m28s This looks great! Watching 'zpool iostat -v' during this test further proves that the ZIL device is doing the brunt of the heavy lifting during this test. If I can get these kind of write results in my prod environment, I would be one happy camper. However, ANY other test I can think of to run on this test machine shows absolutely no performance improvement of the Fireball+SSD Zil over the Fireball by itself. 
Watching zpool iostat -v shows no activity on the ZIL at all whatsoever. Other tests I've tried to run: A scripted batch job of 10,000 - dd if=/dev/urandom of=/fireball/file_$i.dat bs=1k count=1000 A scripted batch job of 10,000 - cat /sourcedrive/$file > /fireball/$file A scripted batch job of 10,000 - cp /sourcedrive/$file /fireball/$file And a scripted batch job moving 10,000 files onto the fireball using Apache Webdav mounted on the fireball (similar to my prod environment): curl -T /sourcedrive/$file http://127.0.0.1/fireball/ So what is IOZone doing differently than any other write operation I can think of??? Thanks, Pat S. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
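Neil's answer can be made concrete with a minimal sketch: of all the operations Patrick lists, only an explicit fsync() (or opening with O_DSYNC/O_SYNC) demands the data be stable before the call returns, and that is the only path that exercises the ZIL. The snippet below (illustrative, using standard POSIX calls via Python) shows the distinction; dd, cat, cp and curl uploads issue only the plain write()s.

```python
# Plain write() is asynchronous: the data may sit in memory (the ARC)
# until the next txg commit, and the ZIL stays idle.  fsync() forces
# the data to stable storage before returning -- this is the kind of
# call iozone (and an NFS server on behalf of its clients) issues.

import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"burst of small writes")  # async: cached, ZIL untouched
os.fsync(fd)                            # sync: durable before returning
with open(path, "rb") as f:
    on_disk = f.read()                  # readable back from the file
os.close(fd)
os.unlink(path)
```

On ZFS, each such fsync() would be logged to the intent log (slog, if present) so the call can return without waiting for the full txg commit.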
Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.
I'd like to correct a few misconceptions about the ZIL here. On 03/06/09 06:01, Jim Dunham wrote: ZFS the filesystem is always on-disk consistent, and ZFS does maintain filesystem consistency through coordination between the ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Pool and file system consistency is more a function of the DMU and SPA. Unfortunately for SNDR, ZFS caches a lot of an application's filesystem data in the ZIL, therefore the data is in memory, not written to disk, ZFS data is actually cached in the ARC. The ZIL code keeps in-memory records of system call transactions in case a fsync() occurs. so SNDR does not know this data exists. ZIL flushes to disk can be seconds behind the actual application writes completing, It's the DMU/SPA that handles the transaction group commits (not the ZIL). Currently these occur every 30 seconds, or more frequently on a loaded system. and if SNDR is running asynchronously, these replicated writes to the SNDR secondary can be additional seconds behind the actual application writes. Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no 'supported' way to get ZFS to empty the ZIL to disk on demand. The sync(2) system call is implemented differently in ZFS. For UFS it initiates a flush of cached data to disk, but does not wait for completion. This satisfies the POSIX requirement but never seemed right. For ZFS we wait for all transactions to complete and commit to stable storage (including flushing any disk write caches) before returning. So any asynchronous data in the ARC is written. Alternatively, a lockfs will flush just a file system to stable storage, but in this case just the intent log is written. (Then later when the txg commits those intent log records are discarded.) For some basic info on the ZIL see: http://blogs.sun.com/perrin/entry/the_lumberjack Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.
On 03/06/09 08:10, Jim Dunham wrote: Andrew, Jim Dunham wrote: ZFS the filesystem is always on disk consistent, and ZFS does maintain filesystem consistency through coordination between the ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for SNDR, ZFS caches a lot of an applications filesystem data in the ZIL, therefore the data is in memory, not written to disk, so SNDR does not know this data exists. ZIL flushes to disk can be seconds behind the actual application writes completing, and if SNDR is running asynchronously, these replicated writes to the SNDR secondary can be additional seconds behind the actual application writes. Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no 'supported' way to get ZFS to empty the ZIL to disk on demand. I'm wondering if you really meant ZIL here, or ARC? It is my understanding that the ZFS intent log (ZIL) satisfies POSIX requirements for synchronous transactions, True. thus filesystem consistency. No. The filesystems in the pool are always consistent with or without the ZIL. The ZIL is not the same as a journal (or the log in UFS). The ZFS adaptive replacement cache (ARC) is where uncommitted filesystem data is being cached. So although unwritten filesystem data is allocated from the DMU and retained in the ARC, it is the ZIL which influences filesystem metadata and data consistency on disk. No. It just ensures the synchronous requests (O_DSYNC, fsync() etc) are on stable storage in case a crash/power fail occurs before the dirty ARC is written when the txg commits. In either case, creating a snapshot should get both flushed to disk, I think? No. A ZFS snapshot is a control-path, versus data-path, operation and (to the best of my understanding, and testing) has no influence over POSIX filesystem consistency. 
See the discussion here: http://www.opensolaris.org/jive/click.jspa?searchID=1695691&messageID=124809 Invoking a ZFS snapshot will assure the ZFS snapshot is consistent on the replicated disk, but not all actively opened files. A simple test I performed to verify this was to append to a ZFS file (no synchronous filesystem options being set) a series of blocks with a block-order pattern contained within. At some random point in this process, I took a ZFS snapshot and immediately dropped SNDR into logging mode. When importing the ZFS storage pool on the SNDR remote host, I could see the ZFS snapshot just taken, but neither the snapshot version of the file nor the file itself contained all of the data previously written to it. That seems like a bug in ZFS to me. A snapshot ought to contain all data that has been written (whether synchronous or asynchronous) prior to the snapshot. I then retested, but opened the file with O_DSYNC, and when following the same test steps above, both the snapshot version of the file and the file itself contained all of the data previously written to it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.
On 03/06/09 14:51, Miles Nordin wrote: np == Neil Perrin neil.per...@sun.com writes: np Alternatively, a lockfs will flush just a file system to np stable storage but in this case just the intent log is np written. (Then later when the txg commits those intent log np records are discarded). In your blog it sounded like there's an in-RAM ZIL through which _everything_ passes, and parts of this in-RAM ZIL are written to the on-disk ZIL as needed. That's correct. so maybe I was using the word ZIL wrongly in my last post. I understood what you meant. are you saying, lockfs will divert writes that would normally go straight to the pool, to pass through the on-disk ZIL instead? - Not instead, but as well. The ZIL (code) will write immediately to the stable intent logs, then later the data cached in the ARC will be written as part of the pool transaction group (txg). As soon as that happens the intent log blocks can be re-used. assuming any separate slog isn't destroyed while the power's off, lockfs and sync should get you the same end result after an unclean shutdown, right? Right. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS: unreliable for professional usage?
Having a separate intent log on good hardware will not prevent corruption on a pool with bad hardware. By good I mean hardware that correctly flushes its write cache when requested. Note, a pool is always consistent (again, when using good hardware). The function of the intent log is not to provide consistency (like a journal), but to speed up synchronous requests like fsync and O_DSYNC. Neil. On 02/13/09 06:29, Jiawei Zhao wrote: While mobility could be lost, usb storage still has the advantage of being cheap and easy to install compared to installing internal disks on a pc, so if I just want to use it to provide zfs storage space for a home file server, can a small intent log located on an internal sata disk prevent the pool corruption caused by a power cut? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
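Why the pool stays consistent without the intent log comes down to copy-on-write commits. The following is an assumption-labeled sketch (hypothetical Python, not ZFS internals): the new block tree is written to fresh space first, and only then is the root pointer (the uberblock) flipped, so a crash at any point leaves either the complete old state or the complete new state.

```python
# Illustrative model of copy-on-write consistency -- NOT ZFS source.
# Assumption: the uberblock flip is atomic; everything before it writes
# only to unused space.  The intent log adds speed for fsync/O_DSYNC,
# not consistency.

class CowPool:
    def __init__(self):
        self.trees = {0: "state-A"}   # block trees by version
        self.uberblock = 0            # points at the live tree

    def commit(self, new_state, crash_before_flip=False):
        new_ver = self.uberblock + 1
        self.trees[new_ver] = new_state   # write new tree in fresh space
        if crash_before_flip:
            return                        # crash: old uberblock still valid
        self.uberblock = new_ver          # atomic flip publishes new tree

    def read(self):
        return self.trees[self.uberblock]

p = CowPool()
p.commit("state-B", crash_before_flip=True)
assert p.read() == "state-A"   # crash mid-commit: old state, still consistent
p.commit("state-B")
assert p.read() == "state-B"   # completed commit: new state
```

"Bad" hardware breaks the atomicity assumption: if a disk acknowledges a cache flush without honoring it, the uberblock can land before the tree it points at, and no slog can repair that.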
Re: [zfs-discuss] Solaris and zfs versions
Mark, I believe creating an older-version pool is supported: zpool create -o version=vers whirl c0t0d0 I'm not sure what version of ZFS in Solaris 10 you are running. Try running zpool upgrade and replacing vers above with that version number. Neil.

: trasimene ; zpool create -o version=11 whirl c0t0d0
: trasimene ; zpool get version whirl
NAME   PROPERTY  VALUE  SOURCE
whirl  version   11     local
: trasimene ; zpool upgrade
This system is currently running ZFS pool version 14.
The following pools are out of date, and can be upgraded.  After being
upgraded, these pools will no longer be accessible by older software versions.
VER  POOL
---  ----
 11  whirl
Use 'zpool upgrade -v' for a list of available versions and their
associated features.
: trasimene ;

On 02/12/09 11:42, Mark Winder wrote: We've been experimenting with zfs on OpenSolaris 2008.11. We created a pool in OpenSolaris and filled it with data. Then we wanted to move it to a production Solaris 10 machine (generic_137138_09) so I "zpool exported" in OpenSolaris, moved the storage, and "zpool imported" in Solaris 10. We got: Cannot import 'deadpool': pool is formatted using a newer ZFS version We would like to be able to move pools back and forth between the OS's. Is there a way we can upgrade Solaris 10 to the same supported zfs version (or create downgraded pools in OpenSolaris)? Thanks! Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Max size of log device?
On 02/08/09 11:50, Vincent Fox wrote: So I have read in the ZFS Wiki: # The minimum size of a log device is the same as the minimum size of device in pool, which is 64 Mbytes. The amount of in-play data that might be stored on a log device is relatively small. Log blocks are freed when the log transaction (system call) is committed. # The maximum size of a log device should be approximately 1/2 the size of physical memory because that is the maximum amount of potential in-play data that can be stored. For example, if a system has 16 Gbytes of physical memory, consider a maximum log device size of 8 Gbytes. What is the downside of over-large log device? - Wasted disk space. Let's say I have a 3310 with 10 older 72-gig 10K RPM drives and RAIDZ2 them. Then I throw an entire 72-gig 15K RPM drive in as slog. What is behind this maximum size recommendation? - Just guidance on what might be used in the most stressed environment. Personally I've never seen anything like the maximum used but it's theoretically possible. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
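Applying the sizing guidance above as arithmetic (a trivial illustrative snippet; the 16 GB example is from the wiki text quoted in the message):

```python
# Maximum useful slog size is roughly half of physical memory -- the
# upper bound on "in-play" synchronous data.  Anything beyond that is
# simply unused space, which is the only downside of an over-large slog.

def max_useful_slog_gb(ram_gb):
    return ram_gb / 2

assert max_useful_slog_gb(16) == 8.0   # the wiki's 16 GB RAM -> 8 GB example
# E.g. with 8 GB of RAM, a whole 72 GB 15K drive as slog would leave
# ~68 GB of it forever unused -- harmless, just wasted.
```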
Re: [zfs-discuss] ZFS core contributor nominations
Looks reasonable +1 Neil. On 02/02/09 08:55, Mark Shellenbaum wrote: The time has come to review the current Contributor and Core contributor grants for ZFS. Since all of the ZFS core contributors grants are set to expire on 02-24-2009 we need to renew the members that are still contributing at core contributor levels. We should also add some new members to both Contributor and Core contributor levels. First the current list of Core contributors: Bill Moore (billm) Cindy Swearingen (cindys) Lori M. Alt (lalt) Mark Shellenbaum (marks) Mark Maybee (maybee) Matthew A. Ahrens (ahrens) Neil V. Perrin (perrin) Jeff Bonwick (bonwick) Eric Schrock (eschrock) Noel Dellofano (ndellofa) Eric Kustarz (goo)* Georgina A. Chua (chua)* Tabriz Holtz (tabriz)* Krister Johansen (johansen)* All of these should be renewed at Core contributor level, except for those with a *. Those with a * are no longer involved with ZFS and we should let their grants expire. I am nominating the following to be new Core Contributors of ZFS: Jonathan W. Adams (jwadams) Chris Kirby Lin Ling Eric C. Taylor (taylor) Mark Musante Rich Morris George Wilson Tim Haley Brendan Gregg Adam Leventhal Pawel Jakub Dawidek Ricardo Correia For Contributor I am nominating the following: Darren Moffat Richard Elling I am voting +1 for all of these (including myself) Feel free to nominate others for Contributor or Core Contributor. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] write cache and cache flush
On 01/29/09 21:32, Greg Mason wrote: This problem only manifests itself when dealing with many small files over NFS. There is no throughput problem with the network. I've run tests with the write cache disabled on all disks, and the cache flush disabled. I'm using two Intel SSDs for ZIL devices. This setup is faster than using the two Intel SSDs with write caches enabled on all disks, and with the cache flush enabled. My test would run around 3.5 to 4 minutes; now it is completing in about 2.5 minutes. I still think this is a bit slow, but I still have quite a bit of testing to perform. I'll keep the list updated with my findings. I've already established both via this list and through other research that ZFS has performance issues over NFS when dealing with many small files. This may be an issue with NFS itself, where NVRAM-backed storage is needed for decent performance with small files. Typically such an NVRAM cache is supplied by a hardware raid controller in a disk shelf. I find it very hard to explain to a user why an upgrade is a step down in performance. For the users these Thors are going to serve, such a drastic performance hit is a deal breaker... Perhaps I missed something, but what was your previous setup? I.e. what did you upgrade from? Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Lackluster ZFS performance trials using various ZIL and L2ARC configurations...
I don't believe that iozone does any synchronous calls (fsync/O_DSYNC/O_SYNC), so the ZIL and separate logs (slogs) would be unused. I'd recommend performance testing by configuring filebench to do synchronous writes: http://opensolaris.org/os/community/performance/filebench/ Neil. On 01/15/09 00:36, Gray Carper wrote: Hey, all! Using iozone (with the sequential read, sequential write, random read, and random write categories), on a Sun X4240 system running OpenSolaris b104 (NexentaStor 1.1.2, actually), we recently ran a number of relative performance tests using a few ZIL and L2ARC configurations (meant to try and uncover which configuration would be the best choice). I'd like to share the highlights with you all (without bogging you down with raw data) to see if anything strikes you. Our first (baseline) test used a ZFS pool which had a self-contained ZIL and L2ARC (i.e. not moved to other devices, the default configuration). Note that this system had both SSDs and SAS drive attached to the controller, but only the SAS drives were in use. In the second test, we rebuilt the ZFS pool with the ZIL on a 32GB SSD and the L2ARC on four 146GB SAS drives. Random reads were significantly worse than the baseline, but all other categories were slightly better. In the third test, we rebuilt the ZFS pool with the ZIL on a 32GB SSD and the L2ARC on four 80GB SSDs. Sequential reads were better than the baseline, but all other categories were worse. In the fourth test, we rebuilt the ZFS pool with no separate ZIL, but with the L2ARC on four 146GB SAS drives. Random reads were significantly worse than the baseline and all other categories were about the same as the baseline. As you can imagine, we were disappointed. None of those configurations resulted in any significant improvements, and all of the configurations resulted in at least one category being worse. This was very much not what we expected. 
For the sake of sanity checking, we decided to run the baseline case again (ZFS pool which had a self-contained ZIL and L2ARC), but this time remove the SSDs completely from the box. Amazingly, the simple presence of the SSDs seemed to be a negative influence - the new SSD-free test showed improvement in every single category when compared to the original baseline test. So, this has lead us to the conclusion that we shouldn't be mixing SSDs with SAS drives on the same controller (at least, not the controller we have in this box). Has anyone else seen problems like this before that might validate that conclusion? If so, we think we should probably build an SSD JBOD, hook it up to the box, and re-run the tests. This leads us to another question: Does anyone have any recommendations for SSD-performant controllers that have great OpenSolaris driver support? Thanks! -Gray -- Gray Carper MSIS Technical Services University of Michigan Medical School gcar...@umich.edu mailto:gcar...@umich.edu | skype: graycarper | 734.418.8506 http://www.umms.med.umich.edu/msis/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs panic
I'm sorry about the problems. We try to be responsive to fixing bugs and implementing new features that people are requesting for ZFS. It's not always possible to get it right. In this instance I don't think the bug was reproducible, and perhaps that's why it hasn't received the attention it deserves. As far as I know yours is the second reported instance. It may be that the problem has been fixed and that's why we haven't seen it in-house. However, that's just speculation, and some serious investigation is needed. Neil. On 01/13/09 06:39, Krzys wrote: To be honest I am quite surprised as this bug you referring to was submited early in 2008 and last updated over the summer. Quite surprised that Sun did not come up with a fix for it so far. ZFS is certainly gaining some popularity at my workplace, and we were thinking of using it instead of veritas, but I am not sure what to do with it now.. what if we have systems that we quite depend on and we have similar issue? How could we solve it? is calling sun support going to help me in such case? This particular system is my playground and I do not care about it to that extend but if I had other system that has much greater importance and I get such situation its quite scary... :( On Mon, 12 Jan 2009, Neil Perrin wrote: This is a known bug: 6678070 Panic from vdev_mirror_map_alloc() http://bugs.opensolaris.org/view_bug.do?bug_id=6678070 Neil. On 01/12/09 21:12, Krzys wrote: any idea what could cause my system to panic? I get my system rebooted daily at various times. very strange, but its pointing to zfs. I have U6 with all latest patches. Jan 12 05:47:12 chrysek unix: [ID 836849 kern.notice] Jan 12 05:47:12 chrysek ^Mpanic[cpu1]/thread=30002c8d4e0: Jan 12 05:47:12 chrysek unix: [ID 799565 kern.notice] BAD TRAP: type=28 rp=2a10285c790 addr=7b76a0a8 mmu_fsr=0 Jan 12 05:47:12 chrysek unix: [ID 10 kern.notice] Jan 12 05:47:12 chrysek unix: [ID 839527 kern.notice] zfs: ... ... ... 
374706 pages dumped, compression ratio 3.50, Jan 12 05:48:51 chrysek genunix: [ID 851671 kern.notice] dump succeeded Jan 12 05:49:40 chrysek genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_13-02 64-bit ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Several questions concerning ZFS
On 01/12/09 20:45, Simon wrote: Hi Experts, IHAC who is using Solaris 10 + ZFS; two questions they're concerned with: - The ZIL (ZFS intent log) is enabled by default for ZFS. There is varied storage purchased by the customer (such as EMC CX/DMX series, HDS AMS/USP series, etc.). The customer wonders whether there is any impact to this storage from enabling the ZIL, and if so, what the negative factors are? As far as I know there haven't been any performance reports comparing various devices specifically for ZFS or the ZIL. Richard Elling may know? - Under what circumstances should I disable the ZIL? - It's not recommended to ever disable it! It was originally added as a switch to allow the new ZIL code to be disabled if it proved unstable. It should have been removed shortly afterwards. If the system loses power or crashes then some recent synchronous changes that were claimed to be safely on disk might not be. If you know this then I suppose you could take advantage of the speed and redo the recent changes. For instance, it has been suggested that Solaris binaries be built with the ZIL disabled. If the system crashes then the build would be started again from scratch. Panics are sufficiently rare and the build time can be cut significantly (e.g. 30%). However, these are somewhat contrived circumstances and I wouldn't recommend ever configuring a customer's system with the ZIL disabled. A safer option is to turn off disk write cache flushing (set zfs:zfs_nocacheflush=1). This should only be done if it's known that *all* zpool devices are non-volatile. This has almost the same performance effect as disabling the ZIL, as it's the actual writing of the bits to the rotating rust that takes most of the time. This also helps speed up other writes - i.e. committing transaction groups. - If the devices used by the zpool come from the external storage listed above (EMC or HDS), what LUN size is suggested?
As current practice, the customer uses 100G for EMC CX, 55G/110G for EMC DMX, 52G for HDS USP V, and 100G for HDS AMS as the LUN size; the filesystem over the LUNs is UFS. Sorry - I don't know. Any replies are much appreciated, thanks in advance. Best Rgds, Simon
Re: [zfs-discuss] zfs panic
This is a known bug: 6678070 Panic from vdev_mirror_map_alloc() http://bugs.opensolaris.org/view_bug.do?bug_id=6678070 Neil. On 01/12/09 21:12, Krzys wrote: any idea what could cause my system to panic? I get my system rebooted daily at various times. Very strange, but it's pointing to ZFS. I have U6 with all the latest patches.
Jan 12 05:47:12 chrysek unix: [ID 836849 kern.notice]
Jan 12 05:47:12 chrysek ^Mpanic[cpu1]/thread=30002c8d4e0:
Jan 12 05:47:12 chrysek unix: [ID 799565 kern.notice] BAD TRAP: type=28 rp=2a10285c790 addr=7b76a0a8 mmu_fsr=0
Jan 12 05:47:12 chrysek unix: [ID 10 kern.notice]
Jan 12 05:47:12 chrysek unix: [ID 839527 kern.notice] zfs:
Jan 12 05:47:12 chrysek unix: [ID 983713 kern.notice] integer divide zero trap:
Jan 12 05:47:12 chrysek unix: [ID 381800 kern.notice] addr=0x7b76a0a8
Jan 12 05:47:12 chrysek unix: [ID 101969 kern.notice] pid=18941, pc=0x7b76a0a8, sp=0x2a10285c031, tstate=0x4480001606, context=0x1
Jan 12 05:47:12 chrysek unix: [ID 743441 kern.notice] g1-g7: 7b76a07c, 1, 0, 0, 241b2a, 16, 30002c8d4e0
Jan 12 05:47:12 chrysek unix: [ID 10 kern.notice]
Jan 12 05:47:12 chrysek genunix: [ID 723222 kern.notice] 02a10285c4b0 unix:die+9c (28, 2a10285c790, 7b76a0a8, 0, 2a10285c570, 1)
Jan 12 05:47:12 chrysek genunix: [ID 179002 kern.notice] %l0-3: 000a 0028 000a 0801
Jan 12 05:47:12 chrysek %l4-7: 02a10285cd18 02a10285cd3c 0006 0109a000
Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c590 unix:trap+644 (2a10285c790, 1, 0, 0, 180c000, 30002c8d4e0)
Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice] %l0-3: 06002c5b9130 0028 0600118fa088
Jan 12 05:47:13 chrysek %l4-7: 00db 004480001606 00010200
Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c6e0 unix:ktl0+48 (0, 70021d50, 349981, 180c000, 10394e8, 2a10285c8e8)
Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice] %l0-3: 0007 1400 004480001606 0101bedc
Jan 12 05:47:13 chrysek %l4-7: 0600110bd630 0600110be400 02a10285c790
Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c830 zfs:spa_get_random+c (0, 0, d15c4746ef9ddd65, 0, , 8)
Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice] %l0-3: 01ff 7b772a00 000e
Jan 12 05:47:13 chrysek %l4-7: 00020801 ee00 060031b23680
Jan 12 05:47:13 chrysek genunix: [ID 723222 kern.notice] 02a10285c8f0 zfs:vdev_mirror_map_alloc+b8 (60012ec20e0, 30006a9a3c8, 1, 30006a9a370, 0, ff)
Jan 12 05:47:13 chrysek genunix: [ID 179002 kern.notice] %l0-3:
Jan 12 05:47:13 chrysek %l4-7: 0600112cc080
Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285c9a0 zfs:vdev_mirror_io_start+4 (30006a9a370, 0, 0, 30006a9a3c8, 0, 7b772bc4)
Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice] %l0-3: 0001 7b7a4688
Jan 12 05:47:14 chrysek %l4-7: 7b7a4400
Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285ca80 zfs:zio_execute+74 (30006a9a370, 7b783f70, 78, f, 1, 70496c00)
Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice] %l0-3: 030083edb728 00c44002 00038000 70496d88
Jan 12 05:47:14 chrysek %l4-7: 00efc006 0801 8000
Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285cb30 zfs:arc_read+724 (1, 600112cc080, 30075baba00, 200, 0, 300680b9288)
Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice] %l0-3: 0001 70496060 0006 0801
Jan 12 05:47:14 chrysek %l4-7: 02a10285cd18 030083edb728 02a10285cd3c
Jan 12 05:47:14 chrysek genunix: [ID 723222 kern.notice] 02a10285cc40 zfs:dbuf_prefetch+13c (60035ce1050, 70496c00, 30075baba00, 0, 0, 3007578b0a0)
Jan 12 05:47:14 chrysek genunix: [ID 179002 kern.notice] %l0-3: 000a 0028 000a 0801
Jan 12 05:47:14 chrysek %l4-7: 02a10285cd18 02a10285cd3c 0006
Jan 12 05:47:15 chrysek genunix: [ID 723222 kern.notice] 02a10285cd50 zfs:dmu_zfetch_fetch+2c (60035ce1050, 8b67, 100, 100, cd, 8c34)
Jan 12 05:47:15 chrysek genunix: [ID 179002 kern.notice] %l0-3: 7049d098 4000 7049d000 7049d188
Jan 12 05:47:15 chrysek %l4-7: 06d8 00db 7049d178
Re: [zfs-discuss] Problems at 90% zpool capacity 2008.05
On 01/06/09 21:25, Nicholas Lee wrote: Since zfs is so smart in other areas, is there a particular reason why a high-water mark is not calculated and the available space not reset to this? I'd far rather have a zpool of 1000GB that said it only had 900GB but did not have corruption as it ran out of space. Nicholas Is there any evidence of corruption at high capacity, or just a lack of performance? All file systems will slow down when near capacity, as they struggle to find space and then have to spread writes over the disk. Our priorities are integrity first, followed somewhere by performance. I vaguely remember a time when UFS had limits to prevent ordinary users from consuming space past a certain limit, allowing only the super-user to use it. Not that I'm advocating that approach for ZFS. Neil.
Re: [zfs-discuss] zfs_nocacheflush, nvram, and root pools
On 12/02/08 03:47, River Tarnell wrote: hi, i have a system connected to an external DAS (SCSI) array, using ZFS. the array has an nvram write cache, but it honours SCSI cache flush commands by flushing the nvram to disk. the array has no way to disable this behaviour. a well-known behaviour of ZFS is that it often issues cache flush commands to storage in order to ensure data integrity; while this is important with normal disks, it's useless for nvram write caches, and it effectively disables the cache. so far, i've worked around this by setting zfs_nocacheflush, as described at [1], which works fine. but now i want to upgrade this system to Solaris 10 Update 6, and use a ZFS root pool on its internal SCSI disks (previously, the root was UFS). the problem is that zfs_nocacheflush applies to all pools, which will include the root pool. my understanding of ZFS is that when run on a root pool, which uses slices (instead of whole disks), ZFS won't enable the write cache itself. i also didn't enable the write cache manually. so, it _should_ be safe to use zfs_nocacheflush, because there is no caching on the root pool. am i right, or could i encounter problems here? Yes, you are right and this should work. You may want to check that the write cache is disabled on the root pool disks using 'format -e' + cache + write_cache + display. (the system is an NFS server, which means lots of synchronous writes (and therefore ZFS cache flushes), so i *really* want the performance benefit from using the nvram write cache.) Indeed, performance would be bad without it. - river. Neil.
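The write-cache check Neil suggests is an interactive format(1M) session in expert mode. A sketch of what the session looks like (the exact menu labels here are from memory and may differ slightly between Solaris releases):

```
# format -e
(select a root pool disk from the menu)
format> cache
cache> write_cache
write_cache> display
Write Cache is disabled
write_cache> quit
```

If it reports enabled instead, the same write_cache menu offers a disable option.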
Re: [zfs-discuss] zfs znode changes getting lost
I suspect ZFS is unaware that anything has changed in the z_phys so it never gets written out. You probably need to create a DMU transaction and call dmu_buf_will_dirty(zp->z_dbuf, tx); Neil. On 11/26/08 03:36, shelly wrote: In place of padding in the zfs znode i added a new field, stored an integer value, and am able to see the saved information. but after reboot it is not there. if i was able to access it before reboot it must be in memory. i think i need to save it to disk. how does one force a zfs znode to disk? right now i don't do anything special for it. i just made an ioctl, accessed the znode and made changes. example in zfs_ioctl: case add_new: zp = VTOZ(vp); zp->z_phys->new_field = 2; return (0);
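The DMU transaction pattern Neil alludes to looks roughly like the following kernel-code sketch. It is not compilable on its own: error handling is abbreviated, new_field is the poster's hypothetical addition, and the function names are as I recall them from the OpenSolaris ZFS source of that era:

```c
/* Sketch: making a znode change reach disk via a DMU transaction. */
dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);       /* declare intent to modify this znode */
error = dmu_tx_assign(tx, TXG_WAIT);   /* join an open transaction group */
if (error != 0) {
        dmu_tx_abort(tx);
        return (error);
}
dmu_buf_will_dirty(zp->z_dbuf, tx);    /* mark the znode's backing buffer dirty */
zp->z_phys->new_field = 2;             /* hypothetical field in the padding */
dmu_tx_commit(tx);                     /* change now rides out with the txg */
```

The key point is that the buffer is dirtied inside an assigned transaction, so the modification is picked up when the transaction group syncs.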
Re: [zfs-discuss] ZIL performance on traditional HDDs
On 11/20/08 12:52, Danilo Poccia wrote: Hi, I was wondering if there is a performance gain for an OLTP-like workload in putting the ZFS Intent Log (ZIL) on traditional HDDs. It's probably always best to benchmark it yourself, but my experience has shown that it's better to only have a separate log when the log devices are faster. Without a separate log (slog) the log is allocated dynamically from the pool, at a location where current allocations are happening for transaction groups. So there is little head movement needed to write the log. There may be a problem when the pool is very full and fragmented, as log block allocation will be all over the place and seek latency will be high. However, this is a problem for the whole pool. Also, the log is spread across the pool devices, so the more devices in the pool the faster the intent log can be written when the load is heavy. Hope that helps, Neil.
Re: [zfs-discuss] s10u6--will using disk slices for zfs logs improve nfs performance?
I wouldn't expect any improvement using a separate disk slice for the intent log unless that disk was much faster and otherwise largely idle. If it was heavily used then I'd expect quite a performance degradation as the disk head bounces around between slices. Separate intent logs are really recommended for fast devices (SSDs or NVRAM). When you're comparing against UFS, is the write cache disabled (use format -e)? Otherwise UFS is unsafe. To get an apples-to-apples perf comparison, you can compare either:
Safe mode: ZFS with default settings (zil_disable=0, zfs_nocacheflush=0) against UFS with the write cache disabled.
Unsafe mode (unless the devices are non-volatile): ZFS with zil_disable=0, zfs_nocacheflush=1 against UFS with the write cache enabled.
From my reading of one of your comparisons, ZFS takes 10s vs 15s for UFS (unsafe mode). Neil. On 11/13/08 16:23, Doug wrote: I've got an X4500/thumper that is mainly used as an NFS server. It has been discussed in the past that NFS performance with ZFS can be slow (when running tar to expand an archive with lots of files, for example.) My understanding is the reason that zfs/nfs is slow in this case is because it is doing the correct/safe thing of waiting for the files to be written to disk. I can (and have) improved nfs/zfs performance by about 15x by adding set zfs:zil_disable=1 or set zfs:zfs_nocacheflush=1 to /etc/system, but this is unsafe (though a common workaround?) But, I have never understood why zfs/nfs is so much slower than ufs/nfs in the case of expanding a tar archive. Is ufs/nfs not properly committing the data to disk? Anyway, with the just-released Solaris 10 10/08, zpool has been upgraded to version 10, which includes the option of using a separate storage device for the ZIL.
It had been my impression that you would need to use a flash disk/SSD to store the ZIL to improve performance, but Richard Elling mentioned in an earlier post that you could use a regular disk slice for this also (see http://www.opensolaris.org/jive/thread.jspa?threadID=80213&tstart=15) On an X4500 server, I had a zpool of 8 disks arranged in RAID 10. I installed a flash archive of s10u6 on the server, then ran zpool upgrade. Next, I used zpool add log to add a 50GB slice on the boot disk for the zfs intent log. But, I didn't see any improvement in NFS performance in running gtar zxf Python-2.5.2.tgz (Python language source code). It took 0.6sec to run on the local system (no NFS) and 2min20sec over NFS. If I disable the ZIL, the command runs in about 10sec on the NFS client. (It runs in about 15 seconds over NFS to a UFS slice on the NFS server.) The separate intent log didn't seem to do anything in this case.
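For reference, the two workarounds Doug mentions are /etc/system settings (a reboot is required for them to take effect). A sketch of the fragment, using the Solaris 10-era tunable names from this thread, with the usual caveat that both trade synchronous-write safety for speed unless the devices are non-volatile:

```
* /etc/system fragment -- use with care
* Disable the intent log for ALL pools (debug switch only)
set zfs:zil_disable = 1
* Stop issuing cache-flush commands -- only safe when every
* pool device has a non-volatile write cache
set zfs:zfs_nocacheflush = 1
```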
Re: [zfs-discuss] DNLC and ARC
Leal, ZFS uses the DNLC. It still provides the fastest lookup of a directory-name-to-vnode mapping. The DNLC is kind of an LRU. An async process will use a rotor to move through the hash chains and select the LRU entry, but will first select negative cache entries and vnodes only referenced by the DNLC. Underlying this, ZFS uses the ZAP and Fat ZAP to store the mappings. ZFS does not use the 2nd-level DNLC which allows caching of entire directories. That is only used by UFS, to avoid a linear search of large directories. Neil. On 10/30/08 04:50, Marcelo Leal wrote: Hello, In ZFS is the DNLC concept gone, or is it in the ARC too? I mean, all the cache in ZFS is ARC, right? I was thinking about whether we can tune the DNLC in ZFS like in UFS.. if we have too *many* files and directories, i guess we can have better performance having all the metadata cached, and that is even more important for NFS operations. DNLC is LRU, right? And ARC should be totally dynamic, but as in another thread here, i think reading a *big* file can mess with the whole thing. Can we hold an area in memory for the DNLC cache, or is that not the ARC way? thanks, Leal.
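Neil's description of the eviction policy (a rotor that prefers negative entries before falling back to plain LRU) can be modelled with a toy cache. This is only an illustration of the policy as described above, not the actual kernel data structures; ToyDnlc and its methods are invented for the example:

```python
from collections import OrderedDict

class ToyDnlc:
    """Toy name cache: evicts negative entries before positive ones,
    least-recently-used first within each class."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # name -> vnode (None = negative entry)

    def lookup(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as recently used
            return self.entries[name]
        return None

    def enter(self, name, vnode):
        if name not in self.entries and len(self.entries) >= self.capacity:
            self._evict()
        self.entries[name] = vnode
        self.entries.move_to_end(name)

    def _evict(self):
        # Prefer the oldest *negative* entry (a cached "no such name")...
        victim = next((n for n, v in self.entries.items() if v is None), None)
        if victim is not None:
            del self.entries[victim]
            return
        # ...otherwise fall back to the plain LRU entry.
        self.entries.popitem(last=False)

cache = ToyDnlc(capacity=3)
cache.enter("a", "vnode-a")
cache.enter("missing", None)   # negative entry
cache.enter("b", "vnode-b")
cache.enter("c", "vnode-c")    # cache full: the negative entry goes first
```

After the last enter() the negative entry for "missing" has been evicted while all three positive entries survive, matching the preference Neil describes.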
Re: [zfs-discuss] DNLC and ARC
On 10/30/08 11:00, Marcelo Leal wrote: Hello Neil, Leal, ZFS uses the DNLC. It still provides the fastest lookup of a directory-name-to-vnode mapping. Ok, so the whole concept remains true? We can tune the DNLC and expect the same behaviour on ZFS? Yes. The DNLC is kind of an LRU. An async process will use a rotor to move through the hash chains and select the LRU entry, but will first select negative cache entries and vnodes only referenced by the DNLC. Underlying this, ZFS uses the ZAP and Fat ZAP to store the mappings. Here i did not understand very well. You are saying that ZFS uses the DNLC just for one level? Yes, the DNLC also supports entire-directory caches; however, ZFS doesn't use this, as its on-disk organisation is deliberately not linear. Normally name lookups check the normal/original (1st-level) DNLC, then if that fails the entire-directory name cache (2nd level) is checked. ZFS does not use the 2nd-level DNLC which allows caching of directories. This is only used by UFS to avoid a linear search of large directories. What is the ZFS way here? One of the points of my question is exactly that... in an environment with many directories with *many* files, i think ZFS would have the *same* problems too. So, having directory caches in the DNLC could be a good solution. Can you explain how ZFS handles the performance in directories with hundreds of files? There is a lot of documentation around UFS/DNLC, but for now i think the only doc about ZFS/ARC and the DNLC is the source code. ;-) Neil. Thanks a lot! I was thinking of tuning the DNLC to hold as much metadata (directories and files) as i can, to minimize lookups/stats etc. (in NFS there are a lot of getattr ops). So we could have *all* the metadata cached, and use what remains in memory to cache data. Maybe that kind of tuning would be useful for just a few workloads, but could be a *huge* enhancement for those workloads.
Leal -- posix rules -- [http://www.posix.brte.com.br/blog]
Re: [zfs-discuss] Cannot remove slog device from zpool
Ethan, It is still not possible to remove a slog from a pool. This is bug: 6574286 removing a slog doesn't work The error message: cannot remove c4t15d0p0: only inactive hot spares or cache devices can be removed is correct, and this is the same as documented in the zpool man page: zpool remove pool device ... Removes the specified device from the pool. This command currently only supports removing hot spares and cache devices. It's actually relatively easy to implement removal of slogs. We simply flush the outstanding transactions and start using the main pool for the intent logs. Thus the vacated device can be removed. However, we wanted to make sure it fit into the framework for the removal of any device. This is a much harder problem, on which we have made progress, but it's not there yet... Neil. On 10/26/08 11:41, Ethan Erchinger wrote: Sorry for the first incomplete send, stupid Ctrl-Enter. :-) Hello, I've looked quickly through the archives and haven't found mention of this issue. I'm running SXCE (snv_99), which uses zfs version 13. I had an existing zpool:
--
[EMAIL PROTECTED] ~]$ zpool status -v data
pool: data
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
  mirror ONLINE 0 0 0
    c4t1d0p0 ONLINE 0 0 0
    c4t9d0p0 ONLINE 0 0 0
...
cache
  c4t15d0p0 ONLINE 0 0 0
errors: No known data errors
--
The cache device (c4t15d0p0) is an Intel SSD. To test the zil, I removed the cache device, and added it as a log device:
--
[EMAIL PROTECTED] ~]$ pfexec zpool remove data c4t15d0p0
[EMAIL PROTECTED] ~]$ pfexec zpool add data log c4t15d0p0
[EMAIL PROTECTED] ~]$ zpool status -v data
pool: data
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
  mirror ONLINE 0 0 0
    c4t1d0p0 ONLINE 0 0 0
    c4t9d0p0 ONLINE 0 0 0
...
logs ONLINE 0 0 0
  c4t15d0p0 ONLINE 0 0 0
errors: No known data errors
--
The device is working fine. I then said, that was fun, time to remove it and add it back as a cache device.
But that doesn't seem possible:
--
[EMAIL PROTECTED] ~]$ pfexec zpool remove data c4t15d0p0
cannot remove c4t15d0p0: only inactive hot spares or cache devices can be removed
--
I've also tried using detach and offline, each failing in other more obvious ways. The manpage does say that those devices should be removable/replaceable. At this point the only way to reclaim my SSD device is to destroy the zpool. Just in case you are wondering about versions:
--
[EMAIL PROTECTED] ~]$ zpool upgrade data
This system is currently running ZFS pool version 13.
Pool 'data' is already formatted using the current version.
[EMAIL PROTECTED] ~]$ uname -a
SunOS opensolaris 5.11 snv_99 i86pc i386 i86pc
--
Any ideas? Thanks, Ethan
Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis
On 10/22/08 10:26, Constantin Gonzalez wrote: Hi, On a busy NFS server, performance tends to be very modest for large numbers of small files, due to the well-known effects of ZFS and the ZIL honoring the NFS COMMIT operation[1]. For the mature sysadmin who knows what (s)he does, there are three possibilities: 1. Live with it. Hard, if you see 10x less performance than could be and your users complain a lot. 2. Use a flash disk for a ZIL, a slog. Can add considerable extra cost, especially if you're using an X4500/X4540 and can't swap out fast SAS drives for cheap SATA drives to free the budget for flash ZIL drives.[2] 3. Disable the ZIL[1]. This is of course evil, but one customer pointed out to me that if a tar xvf were writing locally to a ZFS file system, the writes wouldn't be synchronous either, so there's no point in forcing NFS users into having a better availability experience at the expense of performance. So, if the sysadmin draws the informed and conscious conclusion that (s)he doesn't want to honor NFS COMMIT operations, what are the options less disruptive than disabling the ZIL completely? - I checked the NFS tunables from: http://dlc.sun.com/osol/docs/content/SOLTUNEPARAMREF/chapter3-1.html But could not find a tunable that would disable COMMIT honoring. Is there already an RFE asking for a share option that disables the translation of COMMIT into synchronous writes? - None that I know of... - The ZIL exists on a per-filesystem basis in ZFS. Is there an RFE already that asks for the ability to disable the ZIL on a per-filesystem basis? Yes: 6280630 zil synchronicity. Though personally I've been unhappy with the exposure that zil_disable has got. It was originally meant for debug purposes only. So providing an official way to make synchronous behaviour asynchronous is, to me, dangerous. Once admins start to disable the ZIL for whole pools because the extra performance is too tempting, wouldn't it be the lesser evil to let them disable it on a per-filesystem basis? Comments?
Cheers, Constantin [1]: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine [2]: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis
But the slog is the ZIL. formally a *separate* intent log. No, the slog is not the ZIL! Here are the definitions of the terms as we've been trying to use them:
ZIL: The body of code that supports synchronous requests, and which writes out to the intent logs.
Intent log: A stable storage log. There is one per file system and zvol.
slog: An intent log on a separate stable device - preferably high speed.
We don't really have a name for an intent log when it's embedded in the main pool. I have in the past used the term clog, for chained log. Originally, before slogs existed, it was just the intent log. Neil.
Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis
On 10/22/08 13:56, Marcelo Leal wrote: But the slog is the ZIL. formally a *separate* intent log. No, the slog is not the ZIL! Ok, when you wrote this: I've been slogging for a while on support for separate intent logs (slogs) for ZFS. Without slogs, the ZIL is allocated dynamically from the main pool. were you talking about The body of code in the statement the ZIL is allocated? So i have misunderstood you... Leal. I guess I need to fix that! Anyway, the slog is not the ZIL; it's one of the two currently possible intent log types. Sorry for the confusion, Neil.
Re: [zfs-discuss] zfs cp hangs when the mirrors are removed ..
Karthik, The pool failmode property as implemented governs the behaviour when all the devices needed are unavailable. The default behaviour is to wait (block) until the IO can continue - perhaps by re-enabling the device(s). The behaviour you expected can be achieved with zpool set failmode=continue pool, as shown in the link you indicated below. Neil. On 10/15/08 22:38, Karthik Krishnamoorthy wrote: Hello All, Summary: the cp command for a mirrored zfs pool hung when all the disks in the mirrored pool were unavailable. Detailed description: ~ The cp command (copying a 1GB file from nfs to zfs) hung when all the disks in the mirrored pool (both c1t0d9 and c2t0d9) were removed physically.
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
  mirror ONLINE 0 0 0
    c1t0d9 ONLINE 0 0 0
    c2t0d9 ONLINE 0 0 0
We think that if all the disks in the pool are unavailable, the cp command should fail with an error (not hang). Our request: Please investigate the root cause of this issue. How to reproduce: ~ 1. create a zfs mirrored pool 2. execute a cp command from somewhere to the zfs mirrored pool. 3. remove both of the disks physically while the cp command is working => a hang happens (the cp command never returns and we can't kill it) One engineer pointed me to this page http://opensolaris.org/os/community/arc/caselog/2007/567/onepager/ and indicated that if all the mirrors are removed zfs enters a hang-like state to prevent the kernel from going into a panic, and that this type of behaviour change would be an RFE. My questions are: Are there any documents on the mirror configuration of zfs that explain what happens when the underlying drivers detect problems in one of the mirror devices? It seems that the traditional views of mirror or raid-2 would expect that the mirror would be able to proceed without interruption, and that does not seem to be the case in ZFS. What is the purpose of the mirror in zfs? Is it more like an instant backup?
If so, what can the user do to recover when there is an IO error on one of the devices? Appreciate any pointers and help. Thanks and regards, Karthik
Re: [zfs-discuss] ZIL NVRAM partitioning?
On 09/05/08 14:42, Narayan Venkat wrote: I understand that if you want to use the ZIL, then the requirement is one or more ZILs per pool. A little clarification of ZFS terms may help here. The term ZIL is somewhat overloaded. I think what you mean here is a separate log device (slog), because intent logs are always present in ZFS. Without a slog, the logs are present in the main pool. There is one log per file system, and it allocates blocks in the main pool to form a chain. When a slog is defined, it can be made up of multiple devices (in which case the writes are striped across the devices), or it can be in the form of an N-way mirror - to provide redundancy. With an SSD you can partition the disk to allow usage of a single disk for multiple ZILs. Can we do the same thing with a PCIe-based NVRAM card (like http://www.vmetro.com/category4304.html)? I don't think there's a Solaris-supported driver for that device. However, any Solaris device, whether a partition or not, will work with ZFS provided it's at least 64MB. Its performance is another matter. Thanks Narayan
Re: [zfs-discuss] Zpool import not working - I broke my pool...
Ross, Thanks, I have updated the bug with this info. Neil. Ross Smith wrote: Hmm... got a bit more information for you to add to that bug I think. Zpool import also doesn't work if you have mirrored log devices and either one of them is offline. I created two ramdisks with:
# ramdiskadm -a rc-pool-zil-1 256m
# ramdiskadm -a rc-pool-zil-2 256m
And added them to the pool with:
# zpool add rc-pool log mirror /dev/ramdisk/rc-pool-zil-1 /dev/ramdisk/rc-pool-zil-2
I can reboot fine, the pool imports ok without the ZIL, and I have a script that recreates the ramdisks and adds them back to the pool:
#!/sbin/sh
state=$1
case $state in
'start')
    echo 'Starting Ramdisks'
    /usr/sbin/ramdiskadm -a rc-pool-zil-1 256m
    /usr/sbin/ramdiskadm -a rc-pool-zil-2 256m
    echo 'Attaching to ZFS ZIL'
    /usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-1
    /usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-2
    ;;
'stop')
    ;;
esac
However, if I export the pool, and delete one ramdisk to check that the mirroring works fine, the import fails:
# zpool export rc-pool
# ramdiskadm -d rc-pool-zil-1
# zpool import rc-pool
cannot import 'rc-pool': one or more devices is currently unavailable
Ross
Date: Mon, 4 Aug 2008 10:42:43 -0600 From: [EMAIL PROTECTED] Subject: Re: [zfs-discuss] Zpool import not working - I broke my pool... To: [EMAIL PROTECTED]; [EMAIL PROTECTED] CC: zfs-discuss@opensolaris.org Richard Elling wrote: Ross wrote: I'm trying to import a pool I just exported but I can't; even -f doesn't help. Every time I try I'm getting an error: cannot import 'rc-pool': one or more devices is currently unavailable Now I suspect the reason it's not happy is that the pool used to have a ZIL :) Correct. What you want is CR 6707530, log device failure needs some work http://bugs.opensolaris.org/view_bug.do?bug_id=6707530 which Neil has been working on, scheduled for b96. Actually no. That CR mentioned the problem and talks about splitting out the bug, as it's really a separate problem.
I've just done that and here's the new CR which probably won't be visible immediately to you: 6733267 Allow a pool to be imported with a missing slog Here's the Description: --- This CR is being broken out from 6707530 log device failure needs some work When Separate Intent logs (slogs) were designed they were given equal status in the pool device tree. This was because they can contain committed changes to the pool. So if one is missing it is assumed to be important to the integrity of the application(s) that wanted the data committed synchronously, and thus a pool cannot be imported with a missing slog. However, we do allow a pool to be missing a slog on boot up if it's in the /etc/zfs/zpool.cache file. So this sends a mixed message. We should allow a pool to be imported without a slog if -f is used and to not import without -f but perhaps with a better error message. It's the guidsum check that actually rejects imports with missing devices. We could have a separate guidsum for the main pool devices (non slog/cache). ---
Re: [zfs-discuss] zfs crash CR6727355 marked incomplete
Michael Hale wrote: A bug report I've submitted for a zfs-related kernel crash has been marked incomplete and I've been asked to provide more information. This CR has been marked as incomplete by User 1-5Q-2508 for the reason Need More Info. Please update the CR providing the information requested in the Evaluation and/or Comments field. However, when I pull up 6727355 in bugs.opensolaris.org, it doesn't allow me to make any edits, nor do I see an evaluation or comments field - am I doing something wrong? 1. The Comments field asks that the core dump be made readable by our zfs group, and the CR was made incomplete until the person who saved the core does this. 2. You do not see this because the Comments field is not readable outside of Sun, as it is used to contain customer information. 3. Finally, there is no Evaluation yet. Bottom line is that you can ignore the Need more info - it wasn't directed at you. Sorry about the confusion. I guess the kinks in the system aren't ironed out yet. Usually if we need more info we will email you directly. Neil.
Re: [zfs-discuss] Zpool import not working - I broke my pool...
Richard Elling wrote: Ross wrote: I'm trying to import a pool I just exported but I can't, even -f doesn't help. Every time I try I'm getting an error: cannot import 'rc-pool': one or more devices is currently unavailable Now I suspect the reason it's not happy is that the pool used to have a ZIL :) Correct. What you want is CR 6707530, log device failure needs some work http://bugs.opensolaris.org/view_bug.do?bug_id=6707530 which Neil has been working on, scheduled for b96. Actually no. That CR mentioned the problem and talks about splitting out the bug, as it's really a separate problem. I've just done that and here's the new CR which probably won't be visible immediately to you: 6733267 Allow a pool to be imported with a missing slog Here's the Description: --- This CR is being broken out from 6707530 log device failure needs some work When Separate Intent logs (slogs) were designed they were given equal status in the pool device tree. This was because they can contain committed changes to the pool. So if one is missing it is assumed to be important to the integrity of the application(s) that wanted the data committed synchronously, and thus a pool cannot be imported with a missing slog. However, we do allow a pool to be missing a slog on boot up if it's in the /etc/zfs/zpool.cache file. So this sends a mixed message. We should allow a pool to be imported without a slog if -f is used and to not import without -f but perhaps with a better error message. It's the guidsum check that actually rejects imports with missing devices. We could have a separate guidsum for the main pool devices (non slog/cache). ---