Re: [zfs-discuss] overhead of snapshot operations
You can find the ZFS on-disk spec at: http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf

I don't know of any way to produce snapshots at periodic intervals other than shell scripts (or a cron job), but the creation and deletion of snapshots at the command level is nearly instantaneous.

If you have a need for continuous snapshots (checkpoints) you may want to check out the NILFS system (Linux, open source) available from NTT Japan at: http://www.nilfs.org/en/ NILFS takes continuous checkpoints (on all write events), is log based, and allows configuration of the window size (time based) within which to keep active checkpoints; after this amount of time, old checkpoints are discarded and their space is reclaimed.

regards, Bill

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
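For the cron-job approach, a minimal Python sketch of a periodic snapshot helper follows. The filesystem name (tank/home), the "auto-" naming convention, and the script path in the comment are assumptions for illustration, not anything prescribed by ZFS.

```python
import subprocess
import time


def snapshot_name(filesystem, when=None):
    """Build a timestamped snapshot name, e.g. tank/home@auto-20080115-0400."""
    when = when if when is not None else time.localtime()
    return "%s@auto-%s" % (filesystem, time.strftime("%Y%m%d-%H%M", when))


def take_snapshot(filesystem):
    """Invoke 'zfs snapshot'; the creation itself is nearly instantaneous."""
    subprocess.check_call(["zfs", "snapshot", snapshot_name(filesystem)])

# Run from cron, e.g.:  */15 * * * * /usr/bin/python /root/zfs_snap.py
# (hypothetical path; the script would just call take_snapshot("tank/home"))
```

Pruning old snapshots would be a similar loop over `zfs list -t snapshot` output, destroying anything older than the retention window.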
Re: [zfs-discuss] ZFS I/O algorithms
Hi Bob,

As Richard has mentioned, allocation to vdevs is done in fixed-size chunks (Richard specs 1MB, but I remember a 512KB number from the original spec; this is not very important), and the allocation algorithm is basically doing load balancing. For your non-RAID pool, this chunk size stays fixed regardless of the block size you choose when creating the file system or the IO unit size your application(s) use. (The stripe size can change dynamically in a raidz pool, but not in your non-RAID pool.)

Measuring bandwidth for your application load is tricky with ZFS, since there are many hidden IO operations (besides the ones your application is requesting) that ZFS must perform. If you collect iostats on bytes transferred to the hard drives and compare those numbers to the amount of data your application(s) transferred, you can find potentially large differences. The differences in these scenarios are largely driven by the IO size your application(s) use. For example, when I run the following tests here are my observations:

- dual Xeon server with a QLogic 2Gb FC interface
- a pool of 5 10Krpm FC 146GB drives
- sequentially writing 4 15GB previously written files in one file system in the pool (this file system uses a 128KB block size), with a separate thread writing each file concurrently, for a total of 60GB written

  block size   written   actual disk IO   observed BW (MB/s)   %CPU
  4KB          60GB      227.3GB          34.2                 20.4
  32KB         60GB      216.5GB          36.1                 13.9
  128KB        60GB       63.6GB          69.6                 31.0

You can see that a small application IO size causes much metadata-based IO (more than 3 times the actual application IO requirements), while the 128KB application writes induce only marginally more disk IO than the application actually uses. The BW numbers here are for just the application data; when you consider all the IO from the disks over the test times, the physical BW is obviously greater in all cases.
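To make the relationship between these columns concrete, here is a small sketch deriving the write-amplification ratio and the implied disk-level bandwidth for each row (the run times are inferred from the written size and the observed application BW):

```python
# Derive write amplification and implied physical bandwidth from the
# measurements quoted above (sizes in GB, bandwidth in MB/s).
ROWS = [
    # (app write size, app data GB, actual disk IO GB, observed app BW MB/s)
    ("4KB",   60.0, 227.3, 34.2),
    ("32KB",  60.0, 216.5, 36.1),
    ("128KB", 60.0,  63.6, 69.6),
]

def amplification(app_gb, disk_gb):
    """Ratio of total disk IO to the data the application actually wrote."""
    return disk_gb / app_gb

def physical_bw(app_gb, disk_gb, app_bw):
    """Disk-level MB/s implied by the run time of the application's data."""
    elapsed = app_gb * 1024.0 / app_bw      # seconds to move the app's data
    return disk_gb * 1024.0 / elapsed

for size, app_gb, disk_gb, app_bw in ROWS:
    print("%6s  amplification %.2fx  physical BW %.1f MB/s"
          % (size, amplification(app_gb, disk_gb),
             physical_bw(app_gb, disk_gb, app_bw)))
```

The 4KB row comes out to roughly 3.8x amplification, versus about 1.06x for the 128KB row, which is the point of the table.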
All my drives were uniformly busy in these tests, but the small application IO sizes forced much more total IO against the drives. In your case the application IO rate would be even further degraded due to the mirror configuration. The extra load of reading and writing metadata (including ditto blocks) and mirror devices conspires to reduce the application IO rate, even though the disk device IO rates may be quite good. Reducing the file system block size only exacerbates the problem by requiring more metadata to support the same quantity of application data, and for sequential IO this is a loser.

In any case, for a non-RAID pool, the allocation chunk size per drive (the stripe size) is not influenced by the file system block size. When application IO sizes get small, the overhead in ZFS goes up dramatically.

regards, Bill

The application is spending almost all the time blocked on I/O. I see that the number of device writes per second seems pretty high. The application is doing I/O in 128K blocks. How many IOPS does a modern 300GB 15K RPM SAS drive typically deliver? Of course the IOPS capacity depends on whether the access is random or sequential. At the application level the access is completely sequential, but ZFS is likely doing some extra seeks.
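On the IOPS question, a back-of-envelope estimate for a 15K RPM drive can be sketched as below. The 3.5ms average seek time is an assumption typical of 15K drives of this era; check the actual drive's datasheet. Average rotational latency is half a revolution by definition.

```python
# Back-of-envelope random IOPS estimate for a 15K RPM drive.
RPM = 15000
AVG_SEEK_MS = 3.5                        # assumed; varies by drive model

def random_iops(rpm, avg_seek_ms):
    """Random IOPS ~ 1 / (average seek + half-rotation latency)."""
    half_rotation_ms = 0.5 * 60000.0 / rpm   # 2.0 ms at 15K RPM
    return 1000.0 / (avg_seek_ms + half_rotation_ms)

print("~%d random IOPS" % random_iops(RPM, AVG_SEEK_MS))
# Sequential access avoids most seeks, so it is limited by media transfer
# rate rather than by this figure.
```

That works out to something under 200 random IOPS per spindle, which is why extra seeks introduced by ZFS hurt so much.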
Re: [zfs-discuss] ZFS I/O algorithms
On my own system, when a new file is written, the write block size does not make a significant difference to the write speed.

Yes, I've observed the same result. When a new file is being written sequentially, the file data and newly constructed metadata can be built in cache and written in large sequential chunks periodically, without the need to read in existing metadata and/or data. It seems that data and metadata that are newly constructed in cache for sequential operations persist in cache effectively, and the application IO size is a much less sensitive parameter. Monitoring the disks with iostat in these cases shows the disk IO to be only marginally greater than the application IO. This is why I specified that the write tests described in my previous post were to existing files.

The overhead of doing small sequential writes to an existing object is so much greater than writing to a new object that it begs for some reasonable explanation. The only one that I've been able to assemble through various experimentation is that data/metadata for existing objects is not retained effectively in cache if ZFS detects that such an object is being sequentially written. This forces the constant re-reading of the data/metadata associated with such an object, causing a huge increase in device IO traffic that does not seem to accompany the writing of a brand-new object. The size of RAM seems to make little difference in this case.

As small sequential writes accumulate in the 5-second cache, the chain of metadata leading to the newly constructed data block may see only one pointer (of the 128 in the final set) changing to point to this newly constructed data block, but all the metadata from the uberblock to the target must be rewritten on the 5-second flush. Of course this is not much different from what happens in the newly created object scenario, so it must be the behavior that follows this flush that's different.
It seems to me that after this flush, some or all of the data/metadata that will be affected next is re-read, even though much of what's needed for subsequent operations should already be in cache.

My experience with large-RAM systems and with the use of SSDs as ZFS cache devices has convinced me that data/metadata associated with sequential write operations to existing objects (and ZFS seems very good at detecting this association) does not get retained in cache very effectively. You can see this very clearly if you look at the IO to a cache device (ZFS allows you to easily attach a device to a pool as a cache device, which acts as a sort of L2 cache for RAM). When I do random IO operations to existing objects I see a large amount of IO to my cache device as RAM fills and ZFS pushes cached information (that would otherwise be evicted) to the SSD cache device. If I repeat the random IO test over the same total file space I see improved performance as I get occasional hits from the RAM cache and the SSD cache. As this extended cache hierarchy warms up with each test run, my results continue to improve.

If I run sequential write operations to existing objects, however, I see very little activity to my SSD cache, and virtually no change in performance when I immediately run the same test again. It seems that ZFS is still in need of some fine-tuning for small sequential write operations to existing objects.

regards, Bill
[zfs-discuss] SSD cache device hangs ZFS
I'm using an FC flash drive as a cache device on one of my pools:

zpool add pool-name cache device-name

and I'm running random IO tests to assess performance on an snv-78 x86 system. I have a set of threads, each doing random reads to about 25% of its own, previously written, large file; a test run will read in about 20GB on a server with 2GB of RAM.

Using zpool iostat, I can see that the SSD device is being used aggressively, and each time I run my random read test I find better performance than the previous execution. I also see my SSD drive filling up more and more between runs. This behavior is what I expect, and the performance improvements I see are quite good (4X improvement over 5 runs), but I'm getting hung from time to time.

After several successful runs of my test application, some run of my test will be going fine, but at some point before it finishes I see that all IO to the pool has stopped, and, while I can still use the system for other things, most operations that involve the pool will also hang (e.g. a wc on a pool-based file will hang). All of these hung processes seem to sleep in the kernel at an uninterruptible level, and will not die on a kill -9 attempt. Any attempt to shut down will hang, and the only way I can recover is to use the reboot -qnd command (I think the -d option is the key, since it keeps the system from trying to sync before rebooting). When I reboot, everything is fine again and I can continue testing until I run into this problem again.

Does anyone have any thoughts on this issue? ... thanks, Bill
Re: [zfs-discuss] SSD cache device hangs ZFS
Thanks Marion and Richard, but I've run these tests with much larger data sets and have never had this kind of problem when no cache device was involved. In fact, if I remove the SSD cache device from my pool and run the tests, they run with no issues (except for some reduced performance, as I would expect).

The same SSD disk works perfectly as a separate ZIL device, providing improved IO with synchronous writes on large test runs of 100GBs.

... Bill
Re: [zfs-discuss] removing a separate zil device
Thanks to Kyle, Richard and Eric.

In dealing with this problem, I realize now that I could have saved myself a lot of grief if I had simply used the replace command to substitute some other drive for my flash drive before I removed it.

I think this point is critical for anyone who finds themselves experimenting with separate ZILs. Since a pool will continue to function with no obvious problems after a separate ZIL is removed, it's easy to think that, while the benefit of a separate ZIL is gone, the pool drives have picked up the ZIL function and all is well with the world. The sad reality comes when a reboot or export of the pool occurs: there is then no way to re-import the pool without re-inserting the missing ZIL device, and if the missing ZIL device is no longer available, the pool is inaccessible. It's too late at that point to do a replace, because the pool must be imported to do anything. All the data in your pool is perfect, but it's perfectly out of reach.

... Bill
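A sketch of the workaround described above, as a command transcript: swap a spare disk in for the separate log device before pulling it. The pool and device names here (tank, c2t0d0, c3t0d0) are hypothetical.

```
zpool status tank                 # note the slog device under "logs"
zpool replace tank c2t0d0 c3t0d0  # resilver the log onto the spare disk
zpool status tank                 # wait until the replace completes
# only now is it safe to physically remove c2t0d0
```

The spare can be an ordinary (slower) disk; the point is only that the pool never sees its log vdev disappear.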
Re: [zfs-discuss] Intent logs vs Journaling
I have a question that is related to this topic: why is there only a (tunable) 5-second threshold and not also an additional threshold for the buffer size (e.g. 50MB)? Sometimes I see my system writing huge amounts of data to a zfs, but the disks staying idle for 5 seconds, although the memory consumption is already quite big and it really would make sense (from my uneducated point of view as an observer) to start writing all the data to disk. I think this leads to the pumping effect that has been previously mentioned in one of the forums here. Can anybody comment on this? TIA, Thomas

Because ZFS always writes to a new location on the disk, premature writing can often result in redundant work. A single host write to a ZFS object results in the need to rewrite all of the changed data and metadata leading to that object. If a subsequent follow-up write to the same object occurs quickly, this entire path, once again, has to be recreated, even though only a small portion of it is actually different from the previous version. If both versions were written to disk, the result would be to physically write potentially large amounts of nearly duplicate information over and over again, resulting in logically vacant bandwidth. Consolidating these writes in host cache eliminates some redundant disk writing, resulting in more productive bandwidth.

Providing some ability to tune the consolidation time window and/or the accumulated cache size may seem like a reasonable thing to do, but I think that it's typically a moving target, and depending on an adaptive, built-in algorithm to dynamically set these marks (as ZFS claims it does) seems like a better choice.

... Bill
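A toy model of why batching copy-on-write updates helps: a write to a leaf forces a rewrite of the whole metadata path up to the uberblock, so flushing every write rewrites that shared path per write, while batching rewrites it once per transaction group. The path depth of 4 is an arbitrary assumption for illustration, not the actual ZFS tree depth.

```python
# Toy model: metadata blocks rewritten for n writes to one object, when
# writes are grouped into flushes of a given size (the metadata path is
# shared by all writes within a flush).
PATH_DEPTH = 4          # uberblock -> ... -> leaf pointer (assumed depth)

def blocks_rewritten(n_writes, writes_per_flush):
    flushes = -(-n_writes // writes_per_flush)     # ceiling division
    return flushes * PATH_DEPTH

print(blocks_rewritten(1000, 1))     # flush after every write: 4000 blocks
print(blocks_rewritten(1000, 1000))  # one 5-second consolidation: 4 blocks
```

The real tradeoff, as noted above, is that a longer consolidation window also widens the amount of asynchronous data at risk in a failure.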
Re: [zfs-discuss] Intent logs vs Journaling
But it seems that when we're talking about full block writes (such as sequential file writes) ZFS could do a bit better. And as long as there is bandwidth left to the disks and the controllers, it is difficult to argue that the work is redundant. If it's free in that sense, it doesn't matter whether it is redundant. But if it turns out NOT to have been redundant, you save a lot.

I think this is why an adaptive algorithm makes sense. In situations where frequent, progressive small writes are issued by an application, the amount of redundant disk access can be significant, and longer consolidation times may make sense; larger writes (at least the FS block size) would benefit less from longer consolidation times, and shorter thresholds could provide more usable bandwidth.

To get a sense of the issue here, I've done some write testing to previously written files in a ZFS file system, and the choice of write element size shows some big swings in actual vs data-driven bandwidth.

When I launch a set of threads, each of which writes 4KB buffers sequentially to its own file, I observe that for 60GB of application writes the disks see 230+GB of IO (reads and writes):

  data-driven BW =~ 41 MB/s   (my 60GB in ~1500 sec)
  actual BW      =~ 157 MB/s  (the 230+GB in ~1500 sec)

If I do the same writes with 128KB buffers (the block size of my pool), the same 60GB of writes only generates 95GB of disk IO (reads and writes):

  data-driven BW =~ 85 MB/s    (my 60GB in ~700 sec)
  actual BW      =~ 134.6 MB/s (the 95+GB in ~700 sec)

In the first case, longer consolidation times would have led to less total IO and better data-driven BW, while in the second case shorter consolidation times would have worked better.

As far as redundant writes possibly occupying free bandwidth (and thus costing nothing), I think you also have to consider the related costs of additional block scavenging, and less available free space at any specific instant, possibly limiting the sequentiality of the next write.
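The quoted figures can be reproduced with simple division (the run times above are approximate, so the computed values differ slightly from the rounded ones in the text):

```python
# Average bandwidth in MB/s for `gigabytes` moved in `seconds`.
def bw_mb_per_s(gigabytes, seconds):
    return gigabytes * 1024.0 / seconds

print("4KB run:   data %.0f MB/s, actual %.0f MB/s"
      % (bw_mb_per_s(60, 1500), bw_mb_per_s(230, 1500)))
print("128KB run: data %.0f MB/s, actual %.0f MB/s"
      % (bw_mb_per_s(60, 700), bw_mb_per_s(95, 700)))
```

The gap between "data-driven" and "actual" bandwidth in each pair is exactly the hidden metadata traffic being discussed.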
Of course there's also the additional device stress.

In any case, I agree with you that ZFS could do a better job in this area, but it's not as simple as just looking for large or small IOs; sequential vs random access patterns also play a big role (as you point out). I expect (hope) the adaptive algorithms will mature over time, eventually providing better behavior over a broader set of operating conditions.

... Bill
[zfs-discuss] removing a separate zil device
This is a re-post of this issue. I didn't get any replies to the previous post of 12/27, and I'm hoping someone is back from holiday who may have some insight into this problem. ... Bill

When I remove a separate ZIL disk from a pool, the pool continues to function, logging synchronous writes to the disks in the pool. Status shows that the log disk has been removed, and everything seems to work fine until I export the pool. After the pool has been exported (long after the log disk was removed and gigabytes of synchronous writes were performed successfully), I am no longer able to import the pool. I get an error stating that a pool device cannot be found, and importing the pool cannot succeed until the missing device (the separate ZIL log disk) is replaced in the system.

There is a bug filed by Neil Perrin (6574286: removing a slog doesn't work) regarding the problem of not being able to remove a separate ZIL device from a pool, but no detail on the ramifications of just taking the device out of the JBOD. Taking it out does not impact the immediate function of the pool, but the inability to re-import the pool after this event is a significant issue.

Has anyone found a workaround for this problem? I have data in a pool that I cannot import because the separate ZIL is no longer available to me.
Re: [zfs-discuss] Intent logs vs Journaling
File system journals may support a variety of availability models, ranging from simple support for fast recovery (return to consistency) with possible data loss, to those that attempt to support synchronous write semantics with no data loss on failure, along with fast recovery.

The simpler models use a persistent caching scheme for file system metadata that can be used to limit the possible sources of file system corruption, avoiding a complete fsck run after a failure. The journal specifies the only possible sources of corruption, allowing a quick check-and-recover mechanism. Here the journal is always written with metadata changes (at least) before the actual updated metadata in question is overwritten at its old location on disk; after a failure, the journal indicates what metadata must be checked for consistency. More elaborate models may journal both data and metadata, to support limited data loss, synchronous writes and fast recovery. Newer file systems often let you choose among these features.

Since ZFS never updates any data or metadata in place (anything written into a pool is always written to a new, unused location), it does not have the same consistency issues that traditional file systems have to deal with. A ZFS pool is always in a consistent state, moving from an old state to a new state only after the new state has been completely committed to persistent store. The final update to a new state depends on a single atomic write that either succeeds (moving the system to a consistent new state) or fails, leaving the system in its current consistent state; there can be no interim inconsistent state.

A ZFS pool builds its new state information in host memory for some period of time (about 5 seconds), as host IOs are generated by various applications.
At the end of this period these buffers are written to fresh locations on persistent store as described above, meaning that application writes are treated asynchronously by default, and in the face of a failure, some amount of information that has been accumulating in host memory can be lost.

If an application requires synchronous writes and a guarantee of no data loss, then ZFS must somehow get the written information to persistent store before it returns from the application's write call. This is where the intent log comes in. The system call information (including the data) involved in a synchronous write operation is written to the intent log (ZIL) on persistent store before the application's write call returns, but the information is also written into the host memory buffer scheduled for its 5-second updates (just as if it were an asynchronous write). At the end of the 5-second update period the new host buffers are written to disk and, once committed, the information written to the ZIL is no longer needed and can be jettisoned (so the ZIL never needs to be very large).

If the system fails, the accumulated but not yet flushed host buffer information will be lost, but the ZIL records will already be on disk for any synchronous writes and can be replayed when the host comes back up, or when the pool is imported by some other living host. The pool, of course, always comes up in a consistent state, but any ZIL records can be incorporated into a new consistent state before the pool is fully imported for use.

The ZIL is always there in host memory, even when no synchronous writes are being done, since the POSIX fsync() call could be made on an open write channel at any time, requiring all to-date writes on that channel to be committed to persistent store before it returns to the application.
It's cheaper to write the ZIL at that point than to force the entire 5-second buffer out prematurely.

Synchronous writes can clearly have a significant negative performance impact in ZFS (or any other system) by forcing writes to disk before having a chance to do more efficient, aggregated writes (the 5-second type), but the ZIL solution in ZFS provides a good trade-off with a lot of room to choose among various levels of performance and potential data loss. This is especially true with the recent addition of separate ZIL devices: a small, fast (NVRAM-type) device can be designated for ZIL use, leaving slower spindle disks for the rest of the pool.

hope this helps ... Bill
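The mechanism described above can be captured in a toy model (a sketch of the semantics only, not of the actual ZFS code): synchronous writes are logged before the write call "returns", asynchronous writes live only in the 5-second buffer, and a crash loses whatever was neither logged nor committed.

```python
class ToyPool:
    def __init__(self):
        self.disk = {}      # committed state (the consistent pool)
        self.buffer = {}    # in-memory txg buffer (flushed every ~5s)
        self.zil = []       # intent log records on persistent store

    def write(self, key, data, sync=False):
        self.buffer[key] = data
        if sync:                       # logged before the call returns
            self.zil.append((key, data))

    def txg_commit(self):
        """The ~5 second flush: buffer reaches disk, ZIL can be discarded."""
        self.disk.update(self.buffer)
        self.buffer = {}
        self.zil = []

    def crash_and_import(self):
        """Lose the buffer; replay ZIL records into a new consistent state."""
        self.buffer = {}
        for key, data in self.zil:
            self.disk[key] = data
        self.zil = []
        return self.disk

pool = ToyPool()
pool.write("a", "async data")            # lost on crash
pool.write("b", "sync data", sync=True)  # survives via ZIL replay
print(pool.crash_and_import())           # {'b': 'sync data'}
```

Note that in the model, as in ZFS, the ZIL is write-only in normal operation; it is read only at replay time.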
[zfs-discuss] separate zil removal
When I remove a separate ZIL disk from a pool, the pool continues to function, logging synchronous writes to the disks in the pool. Status shows that the log disk has been removed, and everything seems to work fine until I export the pool. After the pool has been exported (long after the log disk was removed and gigabytes of synchronous writes were performed successfully), I am no longer able to import the pool. I get an error stating that a pool device cannot be found, and importing the pool cannot succeed until the missing device (the separate ZIL log disk) is replaced in the system.

There is a bug filed by Neil Perrin (6574286: removing a slog doesn't work) regarding the problem of not being able to remove a separate ZIL device from a pool, but no detail on the ramifications of just taking the device out of the JBOD. Taking it out does not impact the immediate function of the pool, but the inability to re-import the pool after this event is a significant issue.

Has anyone found a workaround for this problem? I have data in a pool that I cannot import because the separate ZIL is no longer available to me.
[zfs-discuss] snv-76 panics on installation
I have an Intel-based server running dual P3 Xeons (Intel A46044-609, 1.26GHz) with a BIOS from American Megatrends Inc (AMIBIOS, SCB2 production BIOS rev 2.0, BIOS build 0039) and 2GB of RAM. When I attempt to install snv-76, the system panics during the initial boot from CD.

I've been using this system for extensive testing with ZFS and have had no problems installing snv-68, 69 or 70, but I'm having this problem with snv-76. Any information regarding this problem or a potential workaround would be appreciated.

Thx ... bill moloney
[zfs-discuss] nv-69 install panics dell precision 670
I have nv-63 installed on a Dell Precision 670 (dual Intel P4s) using zfs with no problems. When I attempt to install nv-69 from CD #1, just after the Copyright notice and "Use is subject to license terms" print to the screen (when device discovery usually begins), my system panics and begins to reboot. The Solaris panic message splashes across the screen too fast to read before the machine immediately resets and reboots.

I've been able to install nv-69 successfully on other Intel (server) platforms. Any ideas, suggestions or help would be appreciated.
Re: [zfs-discuss] nv-69 install panics dell precision 670
Using hyperterm, I captured the panic message as:

  SunOS Release 5.11 Version snv_69 32-bit
  Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
  Use is subject to license terms.

  panic[cpu0]/thread=fec1ede0: Can't handle mwait size 0

  fec37e70 unix:mach_alloc_mwait+72 (fec2006c)
  fec37e8c unix:mach_init+b0 (c0ce80, fe800010, f)
  fec37eb8 unix:psm_install+95 (fe84166e, 3, fec37e)
  fec37ec8 unix:startup_end+93 (fec37ee4, fe91731e,)
  fec37ed0 unix:startup+3a (fe800010, fec33c98,)
  fec37ee4 genunix:main+1e ()

  skipping system dump - no dump device configured
  rebooting...

This behavior loops endlessly.
Re: [zfs-discuss] nv-69 install panics dell precision 670
Thanks, all, for the details on this bug; it looks like nv-70 should work for me when the drop is available.

I've been using an older P3-based server to test the new separate ZIL device feature that became available in nv-68, using an FC flash drive as a log device outside the zpool itself. I wanted to do some additional testing using the faster Dell 670 system but could not get nv-68 or 69 to install ... now I know why ... Bill
[zfs-discuss] ZVOLs and O_DSYNC, fsync() behavior
I've spent some time searching, and I apologize if I've missed this somewhere, but in testing ZVOL write performance I cannot see any noticeable difference between opening a ZVOL with or without O_DSYNC. Does the O_DSYNC flag have any actual influence on ZVOL writes?

For ZVOLs that I have opened without the O_DSYNC flag, I find that if I follow each write (these are 4KB writes done to a previously written area of a ZVOL) with an fsync() call on the channel, I see a significant performance drop (as expected). But I do not see this behavior when I open the ZVOL with the O_DSYNC flag (and do not do the fsync() operations), as I thought I should. While the O_DSYNC flag is accepted without error when opening a ZVOL, it apparently does not enforce synchronous writes to that ZVOL. Is this correct, or am I missing something?
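For reference, a minimal harness for the kind of comparison described above might look like the sketch below. The device path in the trailing comment is a placeholder; the same function works against any file or block device, and timing the O_DSYNC open against the per-write fsync() variant on the same previously written region reproduces the test.

```python
import os
import time

def timed_writes(path, n=256, blocksize=4096, dsync=False, do_fsync=False):
    """Write n blocks sequentially, optionally with O_DSYNC at open() or an
    fsync() after each write, and return the elapsed time in seconds."""
    flags = os.O_WRONLY | os.O_CREAT
    if dsync:
        flags |= os.O_DSYNC          # request synchronized writes at open()
    fd = os.open(path, flags, 0o600)
    buf = b"\0" * blocksize
    start = time.time()
    try:
        for _ in range(n):
            os.write(fd, buf)
            if do_fsync:
                os.fsync(fd)          # explicit flush after each write
    finally:
        os.close(fd)
    return time.time() - start

# e.g. compare timed_writes("/dev/zvol/rdsk/tank/vol", dsync=True) against
# timed_writes("/dev/zvol/rdsk/tank/vol", do_fsync=True)  (path hypothetical)
```

If O_DSYNC were honored, the two variants should show comparable (degraded) times; the report above is that only the fsync() variant slows down.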
[zfs-discuss] Re: Re[2]: Re: How does ZFS write data to disks?
This is not a problem we're trying to solve, but part of a characterization study of the zfs implementation. We're currently using the default 8KB blocksize for our zvol deployment, and we're performing tests using write block sizes as small as 4KB and as large as 1MB, as previously described (including an 8KB write aligned to logical zvol block zero, for a perfect match to the zvol blocksize). In all cases we see at least twice the IO to the disks that we generate from our test program (and it's much worse for smaller write block sizes).

We're not exactly caught in read-modify-write hell (except when we write the 4KB blocks that are smaller than the zvol blocksize); it's more like modify-write hell, since the original metadata that maps the 2GB region we're writing is probably read just once and kept in cache for the duration of the test. The large amount of back-end IO is almost entirely write operations, but these write operations include the re-writing of metadata that has to change to reflect the relocation of newly written data (remember, no in-place writes ever occur for data or metadata).

Using the default zvol block size of 8KB, zfs requires, in just block-pointer metadata, about 1.5% of the total 2GB write region (this is a large percentage vs other file systems like ufs, for example, because zfs uses a 128-byte block pointer vs an 8-byte ufs block pointer). As new data is written over the old data, the leaves of the metadata tree are necessarily changed to point to the new on-disk locations of the new data, but any new leaf block-pointer requires that a new block of leaf pointers be allocated and written, which requires that the next indirect level up from these leaves point to this new set of leaf pointers, so it must be rewritten itself, and so on up the tree (and remember, metadata is subject to being written in up to 3 copies - the default is 2 - anytime any of it is written to disk).
The indirect pointer blocks closer to the root of the tree may see only a single pointer change over the course of a 5-second consolidation (based on the size of the zvol, the size of the block allocation unit in the zvol and the amount of data actually written to the zvol in 5 seconds), but a complete new indirect block must be created and written to disk (all the way back to the uberblock) on each transaction group write. This means that some of these metadata blocks are written to disk over and over again with only small changes from their previous composition. Consolidating for more than 5 seconds would help to mitigate this situation, but longer consolidation periods put more data at risk of being lost in case of a power failure.

This is not particularly a problem, just a manifestation of the need to never write in place, a rather large block pointer size, and the possible writing of multiple copies of metadata (of course this block pointer carries checksums and the addresses of up to 3 duplicate blocks, providing the excellent data and metadata protection zfs is so well known for).

The original thread that this reply addressed was about the characteristic 5-second delay in writes, which I tried to explain in the context of copy-on-write consolidation, but it's clear that even this delay cannot prevent the modification and re-writing of the same basic metadata many times with small modifications.
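The "about 1.5%" block-pointer figure quoted above checks out from the sizes involved: with an 8KB zvol block size, every data block needs a 128-byte block pointer at the leaf level of the metadata tree (higher indirect levels add a little more on top of this).

```python
BLKPTR_SIZE = 128           # bytes per ZFS block pointer (per the spec)
UFS_PTR_SIZE = 8            # bytes per ufs block pointer, for comparison

def leaf_pointer_overhead(block_size, ptr_size=BLKPTR_SIZE):
    """Fraction of the data size consumed by leaf block pointers alone."""
    return float(ptr_size) / block_size

print("zfs @8KB blocks: %.4f%%" % (100 * leaf_pointer_overhead(8192)))
print("ufs @8KB blocks: %.4f%%" % (100 * leaf_pointer_overhead(8192,
                                                               UFS_PTR_SIZE)))
```

That is 1.5625% for zfs versus under 0.1% for ufs, which is the 16x difference the post is pointing at.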
[zfs-discuss] Re: How does ZFS write data to disks?
Writes to ZFS objects have significant data and metadata implications, based on the zfs copy-on-write implementation. As data is written into a file object, for example, this update must eventually be written to a new location on the physical disk, and all of the metadata (from the uberblock down to this object) must be updated and re-written to a new location as well. While in cache, the changes to these objects can be consolidated, but once written out to disk, any further changes would make this recent write obsolete and require it all to be written once again to yet another new location on the disk.

Batching transactions for 5 seconds (the trigger discussed in zfs documentation) is essential to limiting the amount of redundant re-writing that takes place to the physical disk. Keeping a disk busy 100% of the time by writing mostly the same data over and over makes far less sense than collecting a group of changes in cache and writing them efficiently every trigger period.

Even with this optimization, our experience with small sequential writes (4KB or less) to zvols that have been previously written (to ensure the mapping of real space on the physical disk), for example, shows bandwidth values that are less than 10% of comparable larger (128KB or larger) writes. You can see this behavior dramatically if you compare the amount of host-initiated write data (front-end data) to the actual amount of IO performed to the physical disks (both reads and writes) to handle the host's front-end request. For example, doing sequential 1MB writes to a (previously written) zvol (a simple catenation of 5 FC drives in a JBOD) and writing 2GB of data induced more than 4GB of IO to the drives (and with smaller write sizes this ratio gets progressively worse).
[zfs-discuss] Re: The ZFS MOS and how DNODES are stored
Thanks for the input Darren, but I'm still confused about DNODE atomicity ... it's difficult to imagine that a change made anyplace in the zpool would require copy operations all the way back up to the uberblock (e.g. if some single file in one of many file systems in a zpool was suddenly changed, making a new copy of all of the intervening objects in the tree back to the uberblock would seem to be an untenable amount of work, even though it may all be carried out in memory and not involve any IO; although if the zpool itself was under snapshot control this would have to happen) ... the DNODE implementation appears to include its own checksum field (self-checksumming), and controlling DNODEs (those that lead to descendant collections of DNODEs) are always of the known type DMU_OT_DNODE, so their block pointers do not have to checksum the DNODEs they point to (unlike all other block pointers, which do checksum the data they point to) ... this would allow for in-place updates of a DNODE, without the need to continue further up the tree ... since all objects are controlled by a DNODE, updates to an object's data can stop at its DNODE if that DNODE is not under some snapshot or clone control ... if this is not the case, then 'any' modification in the zpool would require copying up to the uberblock
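to make the self-checksumming / in-place-update argument concrete, here is a toy sketch (my own illustration, not ZFS source): a record that fits in a single sector and carries its own checksum can be updated in place, because a single-sector write either lands entirely or not at all, and the embedded checksum exposes any torn result on read.

```python
# Toy self-checksummed single-sector record: an in-place update either
# yields the old record, the new record, or a checksum failure -- never a
# silently accepted mix of the two.

import struct
import zlib

SECTOR = 512

def pack_record(payload: bytes) -> bytes:
    """Pad the payload to one sector and embed a CRC over it (self-checksumming)."""
    assert len(payload) <= SECTOR - 4
    body = payload.ljust(SECTOR - 4, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))

def unpack_record(record: bytes) -> bytes:
    body, (crc,) = record[:-4], struct.unpack("<I", record[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("torn or corrupt record")
    return body.rstrip(b"\0")

new = pack_record(b"dnode v2 contents")
assert unpack_record(new) == b"dnode v2 contents"

# A torn write (first half old, second half new) fails verification:
old = pack_record(b"dnode v1 contents")
torn = old[:256] + new[256:]
try:
    unpack_record(torn)
except ValueError:
    print("torn write detected")  # → torn write detected
```

whether real DNODE updates actually stop at the DNODE in this way is exactly the open question in the post; the sketch only shows that self-checksumming makes such an in-place scheme safe against partial writes.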
[zfs-discuss] The ZFS MOS and how DNODES are stored
ZFS documentation lists snapshot limits on any single file system in a pool at 2**48 snaps, which seems to logically imply that a snap on a file system does not require an update to the pool’s currently active uberblock. That is to say, if we take a snapshot of a file system in a pool and then make any changes to that file system, the copy-on-write behavior induced by the changes will stop at some synchronization point below the uberblock (presumably at or below the DNODE that is the DSL directory for that file system). In-place updates to a DNODE that has been allocated in a single sector-sized ZFS block can be considered atomic, since the sector write will either succeed or fail totally, leaving either the old version or the new version, but not a combination of the two.

This seems sensible to me, but the description of object sets beginning on page 26 of the ZFS On-Disk Specification states that the DNODE type DMU_OT_DNODE (the type of the DNODE that’s included in the 1KB objset_phys_t structure) will have a data load of an array of DNODES allocated in 128KB blocks, and the picture (Illustration 12 in the spec) shows these blocks as containing 1024 DNODES. Since DNODES are 512 bytes, it would not be possible to fit the 1024 DNODES depicted in the illustration (a 128KB block holds only 256 of them), and if DNODES did live in such an array then they could not be atomically updated in-place. If the blocks in question were actually filled with an array of block pointers pointing to single sector-sized blocks that each held a DNODE, that would account for the 1024 entries per 128KB block shown, since block pointers are 128 bytes (not the 512 bytes of a DNODE); but in this case wouldn’t such 128KB blocks be considered indirect blocks, forcing the dn_nlevels field shown in the object set DNODE at the top left of Illustration 12 to be 2, instead of the 1 that’s there?
I’m further confused by the illustration’s use of dotted lines to project the contents of a structure field (as seen in the projection of the metadnode field of the objset_phys_t structure found at the top of the picture) and arrows to represent pointers (as seen in the projection of the block pointer array of the DMU_OT_DNODE type dnode, also at the top of the picture), but the blocks pointed to by these block pointers seem to actually contain instances of DNODES (as seen from the projection of one of these instances in the lower part of the picture). Should this projection be replaced by a pointer to the lower DNODE?
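the arithmetic behind the 1024-vs-256 discrepancy in the question above is easy to check, using the sizes quoted in the post:

```python
# Sizes as quoted from the ZFS On-Disk Specification discussion above.
BLOCK = 128 * 1024   # metadnode data block size (128KB)
DNODE = 512          # bytes per dnode
BLKPTR = 128         # bytes per block pointer

print(BLOCK // DNODE)   # dnodes that actually fit in one 128KB block → 256
print(BLOCK // BLKPTR)  # block pointers per 128KB block → 1024
```

so the 1024 entries in Illustration 12 match the block-pointer size, not the dnode size, which is what prompts the question about dn_nlevels.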
[zfs-discuss] ZFS limits on zpool snapshots
The ZFS On-Disk Specification and other ZFS documentation describe the labeling scheme used for the vdevs that comprise a ZFS pool. A label contains, among other things, an array of uberblocks, one of which will point to the active object set of the pool at a given instant (according to the documentation, the active uberblock for a given pool could be located in the uberblock array of any vdev participating in the pool, and is subject to relocation from vdev to vdev as the uberblock for the pool is recreated on update). Recreation of the active uberblock would occur, for example, if we took a snapshot of the pool and changes were then made anywhere in the pool. Since a new uberblock is required in this snapshot scenario, and since it appears that the uberblocks are treated as a kind of circular list across vdevs, it seems to me that the number of snapshots we could have of a pool at any given instant would be strictly limited to the number of available uberblocks in the vdevs of the pool (128 uberblocks per vdev, if I have that straight). Is this truly the case, or am I missing something here?
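as a sketch of the selection rule the post describes (a toy model based on my reading of the documentation, not ZFS source): each label holds an array of uberblock slots written in rotation, and the active uberblock is the slot that verifies and carries the highest transaction group number across all labels. If the slots really are reused round-robin like this, they would record the most recent commits rather than consume one slot per snapshot, which bears on the question above.

```python
# Toy model of active-uberblock selection: scan every slot of every vdev
# label and pick the valid slot with the highest transaction group number.

from dataclasses import dataclass

@dataclass
class Uberblock:
    txg: int        # transaction group that wrote this slot
    valid: bool     # stands in for magic-number/checksum verification

def active_uberblock(vdev_labels):
    """Return the valid uberblock with the highest txg across all labels,
    or None if no slot verifies."""
    best = None
    for label in vdev_labels:
        for ub in label:
            if ub.valid and (best is None or ub.txg > best.txg):
                best = ub
    return best

labels = [
    [Uberblock(100, True), Uberblock(101, True)],
    [Uberblock(102, False), Uberblock(99, True)],  # txg 102 failed verification
]
print(active_uberblock(labels).txg)  # → 101 (falls back past the bad slot)
```

under this model a corrupted latest slot simply falls back to the previous valid commit, which is why the slots form a ring rather than a one-slot-per-snapshot table.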