Re: [zfs-discuss] ZFS snapshot GUI
How does the ability to set a snapshot schedule for a particular *file* or *folder* interact with the fact that ZFS snapshots are taken per filesystem? This seems a poor fit. If I choose to snapshot my "Important Documents" folder every 5 minutes, that implicitly creates snapshots of my "Giant Video Downloads" folder every 5 minutes too, if they're both in the same file system. It seems unwise not to expose this to the user. One possibility would be for the "Enable Automatic Snapshots" menu item to apply implicitly to the root of the file system containing the selected item. So in the example shown, right-clicking on "Documents" would bring up a dialog labeled something like "Automatic snapshots for /home/cb114949".

==

I don't think it's a good idea to replace "Enable Automatic Snapshots" with "Restore from Snapshot", because there's no obvious way to "Disable Automatic Snapshots" (or change their properties). It appears one could probably do that from the properties dialog, but that's certainly not obvious to a user who has turned this on from the menu and now wants to make a change -- if you can turn it on in the menu, you should be able to turn it off in the menu too.

==

If "Roll back" affects the whole file system, it definitely should NOT be an option when right-clicking on a file or folder within the file system! This is a recipe for disaster. I would not present it as an option at all -- it's already in the "Restore Files" dialog. Also, "All files will be restored" is not a good description of rollback. It really means "All changes since the selected snapshot will be lost." I can readily imagine a user thinking, "I deleted three files, so if I choose to restore all files, I'll get those three back [without losing the other work I've done]."

==

Just a few random comments.
Re: [zfs-discuss] ZFS + DB + "fragments"
> ... just rearrange your blocks sensibly -
> and to at least some degree you could do that while
> they're still cache-resident

Lots of discussion has passed under the bridge since that observation above, but it may have contained the core of a virtually free solution: let your table become fragmented, but each time a sequential scan is performed on it, determine whether the region you're currently scanning is *sufficiently* fragmented. If it is, retain the sequential blocks you've just had to access anyway in cache until you've built up around 1 MB of them, and then (in a background thread) flush the result contiguously back to a new location in a single bulk 'update' that changes only their location rather than their contents.

1. You don't incur any extra reads, since you were reading sequentially anyway and already have the relevant blocks in cache. Yes, if you had reorganized earlier in the background, the current scan would have gone faster; but if scans occur frequently enough for their performance to be a significant issue, the *previous* scan will probably not have left things *all* that fragmented. This is why you choose a fragmentation threshold to trigger the reorg rather than doing it whenever there's any fragmentation at all: the latter would probably not be cost-effective in some circumstances. Conversely, if you only perform sequential scans once in a blue moon, every one may find its region completely fragmented, but it probably wouldn't have been worth defragmenting constantly in the background to avoid this, and the occasional reorg triggered by the rare scan won't constitute enough additional overhead to justify heroic efforts to avoid it. Such a 'threshold' is a crude but possibly adequate metric. A better but more complex one would nudge up the threshold value every time a sequential scan took place without an intervening update, so that rarely-updated but frequently-scanned files would eventually approach full contiguity; an even finer-grained metric would maintain such information about each individual *region* in a file. But absent evidence that a single, crude, unchanging threshold (probably set to defragment moderately aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 1 MB region) is inadequate, these sound a bit like overkill.

2. You don't defragment data that's never sequentially scanned, avoiding unnecessary system activity and snapshot space consumption.

3. You still incur additional snapshot overhead, for each defragmented block that hadn't already been modified since the most recent snapshot; but performing the local reorg as a batch operation means that only a single copy of all affected ancestor blocks winds up in the snapshot due to the reorg (rather than potentially multiple copies in multiple snapshots, if snapshots were frequent and movement were performed one block at a time).

- bill
Re: [zfs-discuss] raidz DEGRADED state
On Nov 20, 2007 6:34 AM, MC <[EMAIL PROTECTED]> wrote:
> > So there is no current way to specify the creation of a 3 disk raid-z
> > array with a known missing disk?
>
> Can someone answer that? Or does the zpool command NOT accommodate the
> creation of a degraded raidz array?

You can't start degraded, but you can make it so. If one can make a sparse file, then you'd be set. Just create the file, make a zpool out of the two disks and the file, and then drop the file from the pool _BEFORE_ copying over the data. I believe you can then add the third disk as a replacement. The gotcha (and why the sparse file may be needed) is that the pool will only use, per disk, the size of the smallest device.
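A rough, untested sketch of that approach (device names and sizes are placeholders; adjust to your disks):

    mkfile -n 500g /var/tmp/fakedisk          # sparse file the same size as the real disks
    zpool create -f tank raidz c0t1d0 c0t2d0 /var/tmp/fakedisk   # -f: vdev mixes a file and whole disks
    zpool offline tank /var/tmp/fakedisk      # degrade the pool before copying data in
    rm /var/tmp/fakedisk                      # the sparse file never held pool data
    # ... copy the data over ...
    zpool replace tank /var/tmp/fakedisk c0t3d0   # later, swap in the real third disk and resilver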
Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover
I am "rolling my own" replication using zfs send|recv through the cluster agent framework and a custom HA shared local storage set of scripts(similar to http://www.posix.brte.com.br/blog/?p=75 but without avs). I am not using zfs off of shared storage in the supported way. So this is a bit of a lonely area. =) As these are two different zfs volumes on different zpools of differing underlying vdev topology, it appears they are not sharing the same fsid and so are assumedly presenting different file handles from each other. I have the cluster parts out of the way(mostly =)), I now need to solve the nfs side of things so that at the point of failing over. I have isolated zfs out of the equation, I receive the same stale file handle errors if I try and share an arbitrary UFS folder to the client through the cluster interface. Yeah I am a hack. Asa On Nov 20, 2007, at 7:27 PM, Richard Elling wrote: > asa wrote: >> Well then this is probably the wrong list to be hounding >> >> I am looking for something like >> http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/ >> Where when fileserver A dies, fileserver B can come up, grab the >> same IP address via some mechanism(in this case I am using sun >> cluster) and keep on trucking without the lovely stale file handle >> errors I am encountering. >> > > If you are getting stale file handles, then the Solaris cluster is > misconfigured. > Please double check the NFS installation guide for Solaris Cluster and > verify that the paths are correct. > -- richard > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] which would be faster
> On the other hand, the pool of 3 disks is obviously
> going to be much slower than the pool of 5

While today that's true, "someday" IO will be balanced by the latency of vdevs rather than their number... plus two vdevs are always going to be faster than one vdev, even if one is slower than the other. So do 4+1 and 2+1 in the same pool rather than separate pools. This will let zfs balance the load (always) between the two vdevs rather than you trying to balance the load between pools.

Rob
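As an untested sketch, a single pool built from both raidz vdevs would look something like this (device names are placeholders; zpool may ask for -f because the two vdevs have different widths):

    zpool create tank \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
        raidz c2t0d0 c2t1d0 c2t2d0
    # zfs then stripes writes across both raidz vdevs automatically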
Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover
asa wrote:
> Well then this is probably the wrong list to be hounding.
>
> I am looking for something like
> http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/
> where when fileserver A dies, fileserver B can come up, grab the same
> IP address via some mechanism (in this case I am using sun cluster) and
> keep on trucking without the lovely stale file handle errors I am
> encountering.

If you are getting stale file handles, then the Solaris cluster is misconfigured. Please double check the NFS installation guide for Solaris Cluster and verify that the paths are correct.

-- richard
Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover
Well then this is probably the wrong list to be hounding.

I am looking for something like http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/ where when fileserver A dies, fileserver B can come up, grab the same IP address via some mechanism (in this case I am using Sun Cluster) and keep on trucking without the lovely stale file handle errors I am encountering. My clients are Linux; the servers are Sol 10u4.

It seems that it is impossible to change the fsid on Solaris. Can you point me towards the appropriate NFS client behavior option lingo if you have a minute? (Just the terminology would be great; there are a ton of confusing options in the land of NFS: client recovery, failover, replicas, etc.)

I am unable to use block-based replication (AVS) underneath the ZFS layer because I would like to run with different zpool schemes on each server (fast primary server; slower, larger failover server only to be used during downtime on the main server).

Worst case scenario here seems to be that I would have to forcibly unmount and remount all my client mounts. I'll start bugging the nfs-discuss people. Thank you.

Asa

On Nov 12, 2007, at 1:21 PM, Darren J Moffat wrote:
> asa wrote:
>> I would like for all my NFS clients to hang during the failover,
>> then pick up trucking on this new filesystem, perhaps obviously
>> failing their writes back to the apps which are doing the
>> writing. Naive?
>
> The OpenSolaris NFS client does this already - has done since IIRC
> around Solaris 2.6. The knowledge is in the NFS client code.
>
> For NFSv4 this functionality is part of the standard.
>
> --
> Darren J Moffat
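For what it's worth, the "forcibly unmount and remount" worst case on the Linux clients would be something along these lines (hypothetical mount point and server name; -l lazily detaches a mount that is hung on the dead server):

    umount -f /mnt/data 2>/dev/null || umount -l /mnt/data
    mount -t nfs fileserver:/export/data /mnt/data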
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
>>> the 3124 looks perfect. The only problem is the only thing I found on ebay
>>> was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
>>> anything for 3124 other than the data on silicon image's site. Do you know
>>> of any cards I should be looking for that uses this chip?
>>
>> http://www.cooldrives.com/sata-cards.html
>>
>> There are a couple on there for about $80. Not quite where you want to
>> get I am sure but it is an option.
>
> Yep - I see: http://www.cooldrives.com/saiiraco2esa.html for $60.

I got a Sil3114 (4 internal ports) off ebay for $AU30 including postage. Didn't look at any PCIe stuff since I'm building up from old parts.
Re: [zfs-discuss] which would be faster
On Tue, 20 Nov 2007, Tim Cook wrote:
> So I have 8 drives total.
>
> 5x500GB seagate 7200.10
> 3x300GB seagate 7200.10
>
> I'm trying to decide, would I be better off just creating two separate pools?
>
> pool1 = 5x500gb raidz
> pool2 = 3x300gb raidz
>
> or would I be better off creating one large pool, with two raid
> sets? I'm trying to figure out if it would be faster this way since
> it should be striping across the two pools (from what I understand).
> On the other hand, the pool of 3 disks is obviously going to be much
> slower than the pool of 5.
>
> In a perfect world I'd just benchmark both ways, but due to some
> constraints, that may not be possible. Any insight?

Hi Tim,

Let me give you a 3rd option for your consideration. In general, there is no "one-pool-fits-all-workloads" solution. On a 10 disk system here, we ended up with a:

 5 disk raidz1 pool
 2 disk mirror pool
 3 disk mirror pool

Each has its strengths/weaknesses. The raidz set is ideal for large-file, sequential-access workloads - but the IOPS are limited to the IOPS of a single drive. The 3-way mirror is ideal for a workload with a high read-to-write ratio - which describes many real-world workloads (e.g. software development) - since ZFS will load balance read ops among all members of the mirror set. So read IOPS is 3x the IOPS rating of a single disk.

I would suggest/recommend you configure a 5 disk raidz1 pool (with the 500GB disks) and a 2nd pool using a 3-way mirror. You can then match pool/filesystems to the best fit for your different workloads.

Remember the incredibly useful blogs at: http://blogs.sun.com/relling/ (thank you Richard) to determine the relative reliability/failure rates of different ZFS configs.

PS: If we had to do it over, I'd probably go with a 6-disk raidz2 in place of the 5-disk raidz1, due to the much higher reliability of that config.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
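That recommended layout, expressed as zpool commands (illustrative only; controller/target names are placeholders for Tim's actual devices):

    zpool create tank  raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0   # 5 x 500GB raidz1
    zpool create fast  mirror c2t0d0 c2t1d0 c2t2d0                 # 3 x 300GB, 3-way mirror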
Re: [zfs-discuss] raidz2 testing
Brian Lionberger wrote:
> Is there a preferred method to test a raidz2?
> I would like to see the disks recover on their own after simulating
> a disk failure.
> I have a 4 disk configuration.

It really depends on what failure mode you're interested in. The most common failure we see from disks in the field is an uncorrectable read. Pulling a disk will not simulate an uncorrectable read.

For such tests, there are really two different parts of the system you are exercising: the fault detection and the recovery/reconfiguration. When we do RAS benchmarking, we often find that the recovery/reconfiguration code path is the interesting part and the fault detection less so. In other words, there will be little difference in the recovery/reconfiguration between initiating a zpool replace from the command line and fault injection. Unless you are really interested in the maze of fault detection code, you might want to stick with the command line interfaces to stimulate a reconfiguration.

If you really do want to stimulate the fault detection code, then a simple online test which requires no hands-on changes is to change the partition table to zero out the size of the partition or slice. This will have the effect of causing an I/O to receive an ENXIO error, which should then kick off the recovery. prtvtoc will show you a partition map which can be sent to fmthard -s to populate the VTOC. Be careful here; this is a place where mistakes can be painful to overcome. Dtrace can be used to perform all sorts of nasty fault injection, but that may be more than you want to bite off at first.

b77 adds a zpool failmode property which will allow you to set the mode to something other than panic -- options are: wait (default), continue, and panic. See zpool(1m) for more info. You will want to know the failmode if you are experimenting with fault injection.

Finally, you will want to be aware of the FMA commands for viewing reports and diagnosis status. See fmadm(1m), fmdump(1m), and fmstat(1m). If you want to experiment with fault injection, you'll want to pay particular attention to the SERD engines and reset them between runs.

-- richard
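A minimal sketch of the command-line route described above (untested; pool and device names are placeholders):

    # exercise recovery/reconfiguration without touching the hardware
    zpool offline raidpool c1t2d0          # or: zpool replace raidpool c1t2d0 c1t5d0
    zpool status -x raidpool               # watch the resilver progress
    zpool online raidpool c1t2d0

    # failmode pool property (build 77 and later)
    zpool set failmode=continue raidpool

    # FMA views of what was detected and diagnosed
    fmdump -eV | tail                      # recent error telemetry
    fmadm faulty                           # current diagnoses
    fmstat                                 # fault manager module statistics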
[zfs-discuss] which would be faster
So I have 8 drives total.

5x500GB seagate 7200.10
3x300GB seagate 7200.10

I'm trying to decide, would I be better off just creating two separate pools?

pool1 = 5x500gb raidz
pool2 = 3x300gb raidz

or would I be better off creating one large pool, with two raid sets? I'm trying to figure out if it would be faster this way since it should be striping across the two pools (from what I understand). On the other hand, the pool of 3 disks is obviously going to be much slower than the pool of 5.

In a perfect world I'd just benchmark both ways, but due to some constraints, that may not be possible. Any insight?
Re: [zfs-discuss] zpool io to 6140 is really slow
Asif Iqbal wrote:
> On Nov 19, 2007 11:47 PM, Richard Elling <[EMAIL PROTECTED]> wrote:
>> Asif Iqbal wrote:
>>> I have the following layout
>>>
>>> A 490 with 8 1.8GHz CPUs and 16G mem. 6 6140s with 2 FC controllers using
>>> A1 and B1 controller ports at 4Gbps speed.
>>> Each controller has 2G NVRAM.
>>>
>>> On the 6140s I set up a raid0 lun per SAS disk with 16K segment size.
>>>
>>> On the 490 I created a zpool with 8 4+1 raidz1s.
>>>
>>> I am getting zpool IO of only 125MB/s with zfs:zfs_nocacheflush = 1 in
>>> /etc/system.
>>>
>>> Is there a way I can improve the performance? I'd like to get 1GB/sec IO.
>>
>> I don't believe a V490 is capable of driving 1 GByte/s of I/O.
>
> Well, I am getting ~190MB/s right now. I am surely not hitting anywhere
> close to that ceiling.
>
>> The V490 has two schizos and the schizo is not a full speed
>> bridge. For more information see Section 1.2 of:
>> http://www.sun.com/processors/manuals/External_Schizo_PRM.pdf

[err - see Section 1.3] You will notice from Table 1-1 that the read bandwidth limit for a schizo PCI leaf is 204 MBytes/s. With two schizos, you can expect to max out at 816 MBytes/s or less, depending on resource contention. It makes no difference that a 4 Gbps FC card could read 400 MBytes/s; the best you can do for the card is 204 MBytes/s. 1 GByte/s of read throughput will not be attainable with a V490.

-- richard
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
On Tue, 20 Nov 2007, Jason P. Warr wrote:
>> the 3124 looks perfect. The only problem is the only thing I found on ebay
>> was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
>> anything for 3124 other than the data on silicon image's site. Do you know
>> of any cards I should be looking for that uses this chip?
>
> http://www.cooldrives.com/sata-cards.html
>
> There are a couple on there for about $80. Not quite where you want to get I
> am sure but it is an option.

Yep - I see: http://www.cooldrives.com/saiiraco2esa.html for $60.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
> the 3124 looks perfect. The only problem is the only thing I found on ebay
> was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
> anything for 3124 other than the data on silicon image's site. Do you know
> of any cards I should be looking for that uses this chip?

http://www.cooldrives.com/sata-cards.html

There are a couple on there for about $80. Not quite where you want to get I am sure, but it is an option.
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
On Tue, Nov 20, 2007 at 02:01:34PM -0600, Al Hopper wrote:
>
> a) the SuperMicro AOC-SAT2-MV8 is an 8-port SATA card available for
> around $110 IIRC.

Yeah, I'd like to spend a lot less than that, especially as I only need 2 ports. :)

> b) There is also a PCI-X version of the older LSI 4-port (internal)
> PCI Express SAS3041E card which is still available for around $165 and
> works well with ZFS (SATA or SAS drives).

I actually just picked up a SAS3080X for my Ultra80 on ebay for $30. I guess I can always scour ebay for something similar.

> c) Any card based on the SiliconImage 3124/3132 chips will work. But,
> ensure you're running an OS with the latest version of the si3124
> drivers - or - you can swap out the older drivers using the files
> from:

The 3124 looks perfect. The only problem is the only thing I found on ebay was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding anything for 3124 other than the data on silicon image's site. Do you know of any cards I should be looking for that uses this chip?

These will be OS disks, and I'm willing to run whichever version is best for this hardware and ZFS (I'm going to try the most recent SXCE once I have all the hardware together). Any recommendations as related to this card?

-brian

--
"Perl can be fast and elegant as much as J2EE can be fast and elegant. In the hands of a skilled artisan, it can and does happen; it's just that most of the shit out there is built by people who'd be better suited to making sure that my burger is cooked thoroughly."  -- Jonathan Patschke
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On Nov 20, 2007 10:40 AM, Andrew Wilson <[EMAIL PROTECTED]> wrote:
>
> What kind of workload are you running? If you are doing these
> measurements with some sort of "write as fast as possible" microbenchmark,

Oracle database with blocksize 16K .. populating the database as fast as I can.

> once the 4 GB of nvram is full, you will be limited by backend performance
> (FC disks and their interconnect) rather than the host / controller bus.
>
> Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10
> seconds, you will fill it up, even with the backend writing out data as fast
> as it can, in about 20 seconds. Once the nvram is full, you will only see
> the backend (e.g. 2 Gbit) rate.
>
> The reason these controller buffers are useful with real applications is
> that they smooth the bursts of writes that real applications tend to
> generate, thus reducing the latency of those writes and improving
> performance. They will then "catch up" during periods when few writes are
> being issued. But a typical microbenchmark that pumps out a steady stream of
> writes won't see this benefit.
>
> Drew Wilson
>
> Asif Iqbal wrote:
>> On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote:
>>> On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote:
>>>> On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote:
>>>>> On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote:
>>>>>> (Including storage-discuss)
>>>>>> I have 6 6140s with 96 disks. Out of which 64 of them are Seagate
>>>>>> ST337FC (300GB - 10K RPM FC-AL)
>>>>>
>>>>> Those disks are 2Gb disks, so the tray will operate at 2Gb.
>>>>
>>>> That is still 256MB/s. I am getting about 194MB/s.
>>>
>>> 2Gb fibre channel is going to max out at a data transmission rate
>>> around 200MB/s rather than the 256MB/s that you'd expect. Fibre
>>> channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data
>>> in 10 bits on the wire. So while 256MB/s is being transmitted on the
>>> connection itself, only 200MB/s of that is the data that you're
>>> transmitting.
>>>
>>> Chad Mynhier
>>
>> But I am running 4Gb fiber channels with 4GB NVRAM on a 6 tray of
>> 300GB FC 10K rpm (2Gb/s) disks.
>>
>> So I should get "a lot" more than ~200MB/s. Shouldn't I?

--
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
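The rough arithmetic behind Drew's 10 and 20 second estimates (assuming the backend drains at roughly the 2Gb disk-side rate):

    4 Gbit/s FC x 8/10 (8b/10b encoding)  ~=  400 MByte/s of payload from the host
    4 GB NVRAM / 400 MB/s                 ~=  10 s to fill if nothing drained
    with ~200 MB/s draining to the disks, net fill is ~200 MB/s  ->  ~20 s to fill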
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
On Mon, 19 Nov 2007, Brian Hechinger wrote:
> On Sun, Nov 18, 2007 at 02:18:21PM +0100, Peter Schuller wrote:
>>> Right now I have noticed that LSI has recently began offering some
>>> lower-budget stuff; specifically I am looking at the MegaRAID SAS
>>> 8208ELP/XLP, which are very reasonably priced.
>
> I looked up the 8204XLP, which is really quite expensive compared to
> the Supermicro MV based card.
>
> That being said, for a small 1U box that is only going to have two SATA
> disks, the Supermicro card is way overkill/overpriced for my needs.
>
> Does anyone know if there are any PCI-X cards based on the MV88SX6041?
>
> I'm not having much luck finding any.

A few options:

a) the SuperMicro AOC-SAT2-MV8 is an 8-port SATA card available for around $110 IIRC.

b) There is also a PCI-X version of the older LSI 4-port (internal) SAS3041E PCI Express card, which is still available for around $165 and works well with ZFS (SATA or SAS drives).

c) Any card based on the SiliconImage 3124/3132 chips will work. But ensure you're running an OS with the latest version of the si3124 drivers - or - you can swap out the older drivers using the files from:
http://www.opensolaris.org/jive/servlet/JiveServlet/download/80-32437-138083-3390/si3124.tar.gz

Note: if these drives are your boot drives, you'll need to do this after booting from a CDROM/DVD disk; otherwise you can unload the driver and swap out the files.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
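A hedged sketch of the driver swap (the layout of the linked tarball and the exact copy destinations are assumptions that depend on release and architecture -- verify before touching a live system, and do it from a CD/DVD boot if the controller hosts the boot disks):

    cd /var/tmp && gzcat si3124.tar.gz | tar xf -
    modinfo | grep si3124            # note the module id, if currently loaded
    modunload -i <id>                # only works if no devices on the controller are in use
    cp si3124 /kernel/drv/si3124     # 64-bit builds also have /kernel/drv/amd64 (or sparcv9)
    update_drv si3124                # a reconfigure reboot may still be needed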
Re: [zfs-discuss] snv-76 panics on installation
Bill Moloney wrote:
> I have an Intel based server running dual P3 Xeons (Intel A46044-609,
> 1.26GHz) with a BIOS from American Megatrends Inc (AMIBIOS, SCB2
> production BIOS rev 2.0, BIOS build 0039) with 2GB of RAM
>
> when I attempt to install snv-76 the system panics during the initial
> boot from CD

Please post the panic stack (to the list, not to me alone), if possible, and as much other information as you have (i.e. at what step does the panic happen, etc.).

Where did you get the media from (is it really a CD, or a DVD)? Can you read/mount the CD when running an older build? If not, are there errors in the messages file? ...

HTH
Michael

--
Michael Schuster
Recursion, n.: see 'Recursion'
Re: [zfs-discuss] ZFS + DB + "fragments"
On Tue, 20 Nov 2007, Ross wrote:
>>> doing these writes now sounds like a
>>> lot of work. I'm guessing that needing two full-path
>>> updates to achieve this means you're talking about a
>>> much greater write penalty.
>>
>> Not all that much. Each full-path update is still
>> only a single write request to the disk, since all
>> the path blocks (again, possibly excepting the
>> superblock) are batch-written together, thus mostly
>> increasing only streaming bandwidth consumption.
>
> Ok, that took some thinking about. I'm pretty new to ZFS, so I've
> only just gotten my head around how CoW works, and I'm not used to
> thinking about files at this kind of level. I'd not considered that
> path blocks would be batch-written close together, but of course
> that makes sense.
>
> What I'd been thinking was that ordinarily files would get
> fragmented as they age, which would make these updates slower as
> blocks would be scattered over the disk, so a full-path update would
> take some time. I'd forgotten that the whole point of doing this is
> to prevent fragmentation...
>
> So a nice side effect of this approach is that if you use it, it
> makes itself more efficient :D

Here are a couple of resources that'll help you get up to speed with ZFS internals:

a) From the London OpenSolaris User Group (LOSUG) session, presented by Jarod Nash, TSC Systems Engineer, entitled "ZFS: Under The Hood":
ZFS-UTH_3_v1.1_LOSUG.pdf
zfs_data_structures_for_single_file.pdf (also referred to as "ZFS Internals Lite")

and

b) the ZFS on-disk Specification: ondiskformat0822.pdf

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"? Sorry - I never attended! :)
[zfs-discuss] snv-76 panics on installation
I have an Intel based server running dual P3 Xeons (Intel A46044-609, 1.26GHz) with a BIOS from American Megatrends Inc (AMIBIOS, SCB2 production BIOS rev 2.0, BIOS build 0039) with 2GB of RAM.

When I attempt to install snv-76 the system panics during the initial boot from CD.

I've been using this system for extensive testing with ZFS and have had no problems installing snv-68, 69 or 70, but I'm having this problem with snv-76.

Any information regarding this problem or a potential workaround would be appreciated.

Thx ... bill moloney
Re: [zfs-discuss] raidz2
Comment on retries below...

Paul Boven wrote:
> Hi Eric, everyone,
>
> Eric Schrock wrote:
>> There have been many improvements in proactively detecting failure,
>> culminating in build 77 of Nevada. Earlier builds:
>>
>> - Were unable to distinguish device removal from devices misbehaving,
>>   depending on the driver and hardware.
>>
>> - Did not diagnose a series of I/O failures as disk failure.
>>
>> - Allowed several (painful) SCSI retries and continued to queue up I/O,
>>   even if the disk was fatally damaged.
>>
>> Most classes of hardware would behave reasonably well on device removal,
>> but certain classes caused cascading failures in ZFS, all of which should
>> be resolved in build 77 or later.
>
> I seem to be having exactly the problems you are describing (see my
> postings with the subject 'zfs on a raid box'). So I would very much
> like to give b77 a try. I'm currently running b76, as that's the latest
> sxce that's available. Are the sources to anything beyond b76 already
> available? Would I need to build it, or bfu?
>
> I'm seeing zfs not making use of available hot-spares when I pull a
> disk, long and indeed painful SCSI retries, and very poor write
> performance on a degraded zpool - I hope to be able to test whether b77
> fares any better with this.

The SCSI retries are implemented at the driver level (usually sd), below ZFS. By default, the timeout (60s) and retry (3 or 5) counters are somewhat conservative and intended to apply to a wide variety of hardware, including slow CD-ROMs and ancient processors. Depending on your situation and business requirements, these may be tuned. There is a pretty good article on BigAdmin which describes tuning the FC side of the equation (ssd driver):
http://www.sun.com/bigadmin/features/hub_articles/tuning_sfs.jsp

Beware: making these tunables too small can lead to an unstable system. The article does a good job of explaining how interdependent the tunables are, so hopefully you can make wise choices.

-- richard
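For illustration, the sort of /etc/system entries the BigAdmin article discusses look roughly like this (tunable names and safe values vary between the sd and ssd drivers and between releases -- treat these as placeholders and confirm against the article and your driver before applying):

    * shorten the per-command timeout from the 60s default (fibre-channel ssd driver)
    set ssd:ssd_io_time = 20
    * same idea for the parallel-SCSI/SATA sd driver
    set sd:sd_io_time = 20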
Re: [zfs-discuss] raidz2
On Tue, Nov 20, 2007 at 11:02:55AM +0100, Paul Boven wrote:
>
> I seem to be having exactly the problems you are describing (see my
> postings with the subject 'zfs on a raid box'). So I would very much
> like to give b77 a try. I'm currently running b76, as that's the latest
> sxce that's available. Are the sources to anything beyond b76 already
> available? Would I need to build it, or bfu?

The sources, yes (you can pull them from the ON mercurial mirror). It looks like the latest SX:CE is still on build 76, so it doesn't seem like you can get a binary distro yet.

> I'm seeing zfs not making use of available hot-spares when I pull a
> disk, long and indeed painful SCSI retries, and very poor write
> performance on a degraded zpool - I hope to be able to test whether b77
> fares any better with this.

What hardware/driver are you using? Build 76 should have the ability to recognize removed devices via DKIOCGETSTATE and immediately transition to the REMOVED state instead of going through the SCSI retry logic (3x 60 seconds). Build 77 added a 'probe' operation on I/O failure that will try to read/write some basic data to the disk; if that fails, it will immediately mark the disk as FAULTED without having to wait for retries to fail and FMA diagnosis to offline the device.

- Eric

--
Eric Schrock, FishWorks    http://blogs.sun.com/eschrock
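Pulling those sources would be something along these lines (the repository path is from memory and may have moved -- check opensolaris.org for the current location of the ON gate mirror):

    hg clone http://hg.opensolaris.org/hg/onnv/onnv-gate onnv-gate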
[zfs-discuss] raidz2 testing
Is there a preferred method to test a raidz2? I would like to see the disks recover on their own after simulating a disk failure. I have a 4 disk configuration.

Brian.
Re: [zfs-discuss] ZFS + DB + "fragments"
On Nov 20, 2007 5:33 PM, can you guess? <[EMAIL PROTECTED]> wrote:
>> But the whole point of snapshots is that they don't
>> take up extra space on the disk. If a file (and
>> hence a block) is in every snapshot it doesn't mean
>> you've got multiple copies of it. You only have one
>> copy of that block, it's just referenced by many
>> snapshots.
>
> I used the wording "copies of a parent" loosely to mean "previous
> states of the parent that also contain pointers to the current state of
> the child about to be updated in place".

But children are never updated in place. When a new block is written to a leaf, new blocks are used for all the ancestors back to the superblock, and then the old ones are either freed or held on to by the snapshot.

> And in every earlier version of the parent that was updated for some
> *other* reason and still contains a pointer to the current child that
> someone using that snapshot must be able to follow correctly.

The snapshot doesn't get the 'current' child - it gets the one that was there when the snapshot was taken.

> No: every version of the parent that points to the current version of
> the child must be updated.

Even with clones, the 'parent' and the 'clone' are allowed to diverge - they contain different data. Perhaps I'm missing something. Excluding ditto blocks, when in ZFS would two parents point to the same child, and need to both be updated when the child is updated?

Will
Re: [zfs-discuss] Why did resilvering restart?
On Tue, Nov 20, 2007 at 11:10:20AM -0600, [EMAIL PROTECTED] wrote:
>
> [EMAIL PROTECTED] wrote on 11/20/2007 10:11:50 AM:
>
>> On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote:
>>> Resilver and scrub are broken and restart when a snapshot is created
>>> -- the current workaround is to disable snaps while resilvering;
>>> the ZFS team is working on the issue for a long term fix.
>>
>> But, no snapshot was taken. If so, zpool history would have shown
>> this. So, in short, _no_ ZFS operations are going on during the
>> resilvering. Yet, it is restarting.
>
> Does 2007-11-20.02:37:13 actually match the expected timestamp of
> the original zpool replace command before the first zpool status
> output listed below?

No. We ran some 'zpool status' commands after the last 'zpool replace'. The 'zpool status' output in the initial email is from this morning. The only ZFS commands we've been running are 'zfs list', 'zpool list tww', 'zpool status', or 'zpool status -v' after the last 'zpool replace'. The server is on GMT time.

> Is it possible that another zpool replace is further up on your
> pool history (i.e. it was rerun by an admin or automatically from some
> service)?

Yes, but a zpool replace for the same bad disk:

2007-11-20.00:57:40 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B800029996606584741C7C3d0
2007-11-20.02:35:22 zpool detach tww c0t600A0B800029996606584741C7C3d0
2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B8000299CCC06734741CD4Ed0

We accidentally removed c0t600A0B800029996606584741C7C3d0 from the array, hence the 'zpool detach'. The last 'zpool replace' has been running for 15h now.

> -Wade

--
albert chin ([EMAIL PROTECTED])
Re: [zfs-discuss] ZFS + DB + "fragments"
> But the whole point of snapshots is that they don't
> take up extra space on the disk. If a file (and
> hence a block) is in every snapshot it doesn't mean
> you've got multiple copies of it. You only have one
> copy of that block, it's just referenced by many
> snapshots.

I used the wording "copies of a parent" loosely to mean "previous states of the parent that also contain pointers to the current state of the child about to be updated in place".

> The thing is, the location of that block isn't saved
> separately in every snapshot either - the location is
> just stored in its parent.

And in every earlier version of the parent that was updated for some *other* reason and still contains a pointer to the current child that someone using that snapshot must be able to follow correctly.

> So moving a block is just a case of updating one parent.

No: every version of the parent that points to the current version of the child must be updated.

...

> If you think about it, that has to work for the old
> data since as I said before, ZFS already has this
> functionality. If ZFS detects a bad block, it moves
> it to a new location on disk. It can already do
> that without affecting any of the existing snapshots,
> so there's no reason to think we couldn't use the
> same code for a different purpose.

Only if it works the way you think it works, rather than, say, by using a look-aside list of moved blocks (there shouldn't be that many of them), or by just leaving the bad block in the snapshot (if it's mirrored or parity-protected, it'll still be usable there unless a second failure occurs; if not, then it was lost anyway).

- bill
Re: [zfs-discuss] Why did resilvering restart?
[EMAIL PROTECTED] wrote on 11/20/2007 10:11:50 AM:
> On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote:
>> Resilver and scrub are broken and restart when a snapshot is created
>> -- the current workaround is to disable snaps while resilvering;
>> the ZFS team is working on the issue for a long term fix.
>
> But, no snapshot was taken. If so, zpool history would have shown
> this. So, in short, _no_ ZFS operations are going on during the
> resilvering. Yet, it is restarting.

Does 2007-11-20.02:37:13 actually match the expected timestamp of the original zpool replace command before the first zpool status output listed below? Is it possible that another zpool replace is further up on your pool history (i.e. it was rerun by an admin or automatically from some service)?

-Wade

> [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM:
>> # zpool history tww | tail -1
>> 2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B8000299CCC06734741CD4Ed0
Re: [zfs-discuss] ZFS + DB + "fragments"
But the whole point of snapshots is that they don't take up extra space on the disk. If a file (and hence a block) is in every snapshot it doesn't mean you've got multiple copies of it. You only have one copy of that block; it's just referenced by many snapshots.

The thing is, the location of that block isn't saved separately in every snapshot either - the location is just stored in its parent. So moving a block is just a case of updating one parent. So regardless of how many snapshots the parent is in, you only have to update one parent to point it at the new location for the *old* data. Then you save the new data to the old location and ensure the current tree points to that.

If you think about it, that has to work for the old data since, as I said before, ZFS already has this functionality. If ZFS detects a bad block, it moves it to a new location on disk. It can already do that without affecting any of the existing snapshots, so there's no reason to think we couldn't use the same code for a different purpose.

Ultimately, your old snapshots get fragmented, but the live data stays contiguous.
Re: [zfs-discuss] ZFS + DB + "fragments"
On Nov 19, 2007 10:08 PM, Richard Elling <[EMAIL PROTECTED]> wrote:
> James Cone wrote:
>> Hello All,
>>
>> Here's a possibly-silly proposal from a non-expert.
>>
>> Summarising the problem:
>> - there's a conflict between small ZFS record size, for good random
>>   update performance, and large ZFS record size for good sequential read
>>   performance
>
> Poor sequential read performance has not been quantified.

I think this is a good point. A lot of solutions are being thrown around, and the problems are only theoretical at the moment. Conventional solutions may not even be appropriate for something like ZFS.

The point that makes me skeptical is this: blocks do not need to be logically contiguous to be (nearly) physically contiguous. As long as you reallocate the blocks close to the originals, chances are that a scan of the file will end up being mostly physically contiguous reads anyway. ZFS's intelligent prefetching, along with the disk's track cache, should allow for good performance even in this case. ZFS may or may not already do this; I haven't checked.

Obviously, you won't want to keep a year's worth of snapshots, or run the pool near capacity. With a few minor tweaks, though, it should work quite well. Talking about fundamental ZFS design flaws at this point seems unnecessary, to say the least.

Chris
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Calum Benson wrote:
> You're right that they "can", and while that probably does write it off,
> I wonder how many really do. (And we could possibly do something clever
> like a semi-opaque overlay anyway, we may not have to replace the
> background entirely.)

Almost everyone I've seen using the file manager other than myself has done this :-) If you do a semi-opaque overlay, that's going to require lots of colour selection stuff - plus what if the background is a complex image (why people do this I don't know, but I've seen it done)?

>> An emblem is good for the case where you are looking from "above" a
>> dataset that is tagged for backup.
>>
>> An indicator in the status bar is good for when you are "in" a dataset
>> that is tagged for backup.
>
> Yep, all true. Also need to bear in mind that nowadays, with the
> (fairly) new nautilus treeview, you can potentially see both "in" and
> "above" at the same time, so any solution would have to work elegantly
> with that view too.

I would expect an emblem in the tree and a status bar indicator for the non-tree part.

--
Darren J Moffat
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 14:31, Christian Kelly wrote:
>
> Ah, I see. So, for phase 0, the 'Enable Automatic Snapshots' option
> would only be available for/work for existing ZFSes. Then at some later
> stage, create them on the fly.

Yes, that's the scenario for the mockups I posted, anyway... if the requirements are bogus, then of course we'll have to change them :)

My original mockup did allow you to create a pool/filesystem on the fly if required, but it felt like the wrong place to be doing that -- if you could understand the dialog to do that, you would probably know how to do it better on the command line anyway. Longer term, I guess we might want to ship some sort of ZFS management GUI that might be better suited to that sort of thing (maybe like the Nexenta app that Roman mentioned earlier, but I haven't really looked at that yet...).

Cheeri,
Calum.

--
CALUM BENSON, Usability Engineer          Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]                  GNOME Desktop Team
http://blogs.sun.com/calum                +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 15:04, Darren J Moffat wrote:
> Calum Benson wrote:
>> On 20 Nov 2007, at 13:35, Christian Kelly wrote:
>>> Take the example I gave before, where you have a pool called, say,
>>> pool1. In the pool you have two ZFSes: pool1/export and
>>> pool1/export/home. So, suppose the user chooses /export in nautilus
>>> and adds this to the backup list. Will the user be aware, from
>>> browsing through nautilus, that /export/home may or may not be
>>> backed up - depending on whether the -r (?) option is used.
>>
>> I'd consider that to be a fairly strong requirement, but it's not
>> something I particularly thought through for the mockups.
>> One solution might be to change the nautilus background for folders
>> that are being backed up, another might be an indicator in the status
>> bar, another might be emblems on the folder icons themselves.
>
> I think changing the background is a non starter since users can
> change the background already anyway.

You're right that they "can", and while that probably does write it off, I wonder how many really do. (And we could possibly do something clever like a semi-opaque overlay anyway; we may not have to replace the background entirely.) All just brainstorming at this stage though; other ideas welcome :)

> An emblem is good for the case where you are looking from "above" a
> dataset that is tagged for backup.
>
> An indicator in the status bar is good for when you are "in" a
> dataset that is tagged for backup.

Yep, all true. Also need to bear in mind that nowadays, with the (fairly) new nautilus treeview, you can potentially see both "in" and "above" at the same time, so any solution would have to work elegantly with that view too.

Cheeri,
Calum.

--
CALUM BENSON, Usability Engineer          Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]                  GNOME Desktop Team
http://blogs.sun.com/calum                +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems
Re: [zfs-discuss] Why did resilvering restart?
On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote: > Resilver and scrub are broken and restart when a snapshot is created > -- the current workaround is to disable snaps while resilvering, > the ZFS team is working on the issue for a long term fix. But, no snapshot was taken. If so, zpool history would have shown this. So, in short, _no_ ZFS operations are going on during the resilvering. Yet, it is restarting. > -Wade > > [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM: > > > On b66: > > # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \ > > c0t600A0B8000299CCC06734741CD4Ed0 > > < some hours later> > > # zpool status tww > > pool: tww > >state: DEGRADED > > status: One or more devices is currently being resilvered. The pool > will > > continue to function, possibly in a degraded state. > > action: Wait for the resilver to complete. > >scrub: resilver in progress, 62.90% done, 4h26m to go > > < some hours later> > > # zpool status tww > > pool: tww > >state: DEGRADED > > status: One or more devices is currently being resilvered. The pool > will > > continue to function, possibly in a degraded state. > > action: Wait for the resilver to complete. > >scrub: resilver in progress, 3.85% done, 18h49m to go > > > > # zpool history tww | tail -1 > > 2007-11-20.02:37:13 zpool replace tww > c0t600A0B8000299966059E4668CBD3d0 > > c0t600A0B8000299CCC06734741CD4Ed0 > > > > So, why did resilvering restart when no zfs operations occurred? I > > just ran zpool status again and now I get: > > # zpool status tww > > pool: tww > >state: DEGRADED > > status: One or more devices is currently being resilvered. The pool > will > > continue to function, possibly in a degraded state. > > action: Wait for the resilver to complete. > >scrub: resilver in progress, 0.00% done, 134h45m to go > > > > What's going on? > > > > -- > > albert chin ([EMAIL PROTECTED]) > > ___ > > zfs-discuss mailing list > > zfs-discuss@opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
And, just to add one more point: since pretty much everything the host writes to the controller eventually has to make it out to the disk drives, the long-term average write rate cannot exceed the rate at which the backend disk subsystem can absorb the writes, regardless of the workload. (An exception is if the controller can combine some overlapping writes.) Basically, just like putting water into a reservoir at twice the rate it is being withdrawn, the reservoir will eventually overflow! At least in this case the controller can limit the input from the host and avoid an actual data overflow situation. Drew Andrew Wilson wrote: What kind of workload are you running? If you are doing these measurements with some sort of "write as fast as possible" microbenchmark, once the 4 GB of nvram is full, you will be limited by backend performance (FC disks and their interconnect) rather than the host / controller bus. Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10 seconds, you will fill it up, even with the backend writing out data as fast as it can, in about 20 seconds. Once the nvram is full, you will only see the backend (e.g. 2 Gbit) rate. The reason these controller buffers are useful with real applications is that they smooth the bursts of writes that real applications tend to generate, thus reducing the latency of those writes and improving performance. They will then "catch up" during periods when few writes are being issued. But a typical microbenchmark that pumps out a steady stream of writes won't see this benefit. Drew Wilson Asif Iqbal wrote: >On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: > > >>On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: >> >> >>>On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: >>> >>> On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: >(Including storage-discuss) > >I have 6 6140s with 96 disks. Out of which 64 of them are Seagate >ST337FC (300GB - 10K RPM FC-AL) > > Those disks are 2Gb disks, so the tray will operate at 2Gb. >>>That is still 256MB/s. I am getting about 194MB/s >>> >>> >>2Gb fibre channel is going to max out at a data transmission rate >> >> > >But I am running 4GB fiber channels with 4GB NVRAM on a 6 tray of >300GB FC 10K rpm (2Gb/s) disks > >So I should get "a lot" more than ~ 200MB/s. Shouldn't I? > > > > >>around 200MB/s rather than the 256MB/s that you'd expect. Fibre >>channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data >>in 10 bits on the wire. So while 256MB/s is being transmitted on the >>connection itself, only 200MB/s of that is the data that you're >>transmitting. >> >>Chad Mynhier >> >> >> > > > > > ___ perf-discuss mailing list [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NFS performance considerations (Linux vs Solaris)
Hello all... I think we can all agree that "performance" is a big topic in NFS. So, when we talk about NFS and ZFS, we imagine a great combination/solution. But one is not dependent on the other; they are actually two quite distinct technologies. ZFS has a lot of features that we all know about, and "maybe" all of us want in an NFS share (maybe not). The point is: two technologies with different priorities. So, what I think is important is a "document" (here on the NFS/ZFS discuss lists) that lists and explains the ZFS features that have a "real" performance impact. I know that there is the solarisinternals wiki about ZFS/NFS integration, but what I think is really important is a comparison between Linux and Solaris/ZFS on the server side. That would be very useful for seeing, for example, what "consistency" I get with Linux and XFS, ext3, etc. at "that" performance, and "how" I can configure a similar NFS service on Solaris/ZFS. Here we have some information about it: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine but there is no comparison with Linux, which I think is important. What I mean is that the people who know a lot about the NFS protocol and about the filesystem features should make such a comparison (to facilitate adoption and users' comparisons). I think there are many users comparing oranges with apples. Another example (correct me if I am wrong): until kernel 2.4.20 (at least), the default export option for sync/async on Linux was "async" (on Solaris I think it was always "sync"). Another point was the "commit" operation in vers2, which was not implemented; the server just replied with an "OK", but the data was not on stable storage yet (here the ZIL and Roch's blog entry are excellent). That's it: I'm proposing the creation of a "matrix/table" with features and performance impact, as well as a comparison with other implementations and their implications. Thanks very much for your time, and sorry for the long post. Leal. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
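To make the sync/async point concrete, here is roughly how the two servers would be configured for *equivalent* semantics (a sketch; the path and client subnet are invented for illustration):

  # Linux /etc/exports -- spell "sync" out rather than relying on the
  # old pre-2.4.20 "async" default mentioned above:
  /export/data  192.168.1.0/24(rw,sync)

  # Solaris share(1M) equivalent -- there is no "async" export option
  # to relax the stable-storage guarantee:
  share -F nfs -o rw /export/data

Benchmarking a Linux "async" export against a Solaris/ZFS one is exactly the oranges-to-apples comparison described above: the async server may acknowledge writes that would be lost in a crash.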
Re: [zfs-discuss] Why did resilvering restart?
Resilver and scrub are broken and restart when a snapshot is created -- the current workaround is to disable snaps while resilvering, the ZFS team is working on the issue for a long term fix. -Wade [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM: > On b66: > # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \ > c0t600A0B8000299CCC06734741CD4Ed0 > < some hours later> > # zpool status tww > pool: tww >state: DEGRADED > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. >scrub: resilver in progress, 62.90% done, 4h26m to go > < some hours later> > # zpool status tww > pool: tww >state: DEGRADED > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. >scrub: resilver in progress, 3.85% done, 18h49m to go > > # zpool history tww | tail -1 > 2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 > c0t600A0B8000299CCC06734741CD4Ed0 > > So, why did resilvering restart when no zfs operations occurred? I > just ran zpool status again and now I get: > # zpool status tww > pool: tww >state: DEGRADED > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. >scrub: resilver in progress, 0.00% done, 134h45m to go > > What's going on? > > -- > albert chin ([EMAIL PROTECTED]) > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
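A rough sketch of that workaround, for anyone who wants to script it (this assumes your periodic snapshots come from a cron job or from an auto-snapshot service that you can pause; substitute the real mechanism on your system):

  # pause whatever takes periodic snapshots first, e.g.:
  # svcadm disable <your-auto-snapshot-instance>
  zpool replace tww <old-device> <new-device>
  # poll until the resilver finishes, then turn snapshots back on
  while zpool status tww | grep "resilver in progress" >/dev/null; do
      sleep 300
  done
  # svcadm enable <your-auto-snapshot-instance>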
[zfs-discuss] Why did resilvering restart?
On b66: # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \ c0t600A0B8000299CCC06734741CD4Ed0 < some hours later> # zpool status tww pool: tww state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress, 62.90% done, 4h26m to go < some hours later> # zpool status tww pool: tww state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress, 3.85% done, 18h49m to go # zpool history tww | tail -1 2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0 c0t600A0B8000299CCC06734741CD4Ed0 So, why did resilvering restart when no zfs operations occurred? I just ran zpool status again and now I get: # zpool status tww pool: tww state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress, 0.00% done, 134h45m to go What's going on? -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
Rats - I was right the first time: there's a messy problem with snapshots. The problem is that the parent of the child that you're about to update in place may *already* be in one or more snapshots because one or more of its *other* children was updated since each snapshot was created. If so, then each snapshot copy of the parent is pointing to the location of the existing copy of the child you now want to update in place, and unless you change the snapshot copy of the parent (as well as the current copy of the parent) the snapshot will point to the *new* copy of the child you are now about to update (with an incorrect checksum to boot). With enough snapshots, enough children, and bad enough luck, you might have to change the parent (and of course all its ancestors...) in every snapshot. In other words, Nathan's approach is pretty much infeasible in the presence of snapshots. Background defragmentation works as long as you move the entire region (which often has a single common parent) to a new location, which if the source region isn't excessively fragmented may not be all that expensive; it's probably not something you'd want to try at normal priority *during* an update to make Nathan's approach work, though, especially since you'd then wind up moving the entire region on every such update rather than in one batch in the background. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
What kind of workload are you running? If you are doing these measurements with some sort of "write as fast as possible" microbenchmark, once the 4 GB of nvram is full, you will be limited by backend performance (FC disks and their interconnect) rather than the host / controller bus. Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10 seconds, you will fill it up, even with the backend writing out data as fast as it can, in about 20 seconds. Once the nvram is full, you will only see the backend (e.g. 2 Gbit) rate. The reason these controller buffers are useful with real applications is that they smooth the bursts of writes that real applications tend to generate, thus reducing the latency of those writes and improving performance. They will then "catch up" during periods when few writes are being issued. But a typical microbenchmark that pumps out a steady stream of writes won't see this benefit. Drew Wilson Asif Iqbal wrote: On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: (Including storage-discuss) I have 6 6140s with 96 disks. Out of which 64 of them are Seagate ST337FC (300GB - 10K RPM FC-AL) Those disks are 2Gb disks, so the tray will operate at 2Gb. That is still 256MB/s. I am getting about 194MB/s 2Gb fibre channel is going to max out at a data transmission rate But I am running 4GB fiber channels with 4GB NVRAM on a 6 tray of 300GB FC 10K rpm (2Gb/s) disks So I should get "a lot" more than ~ 200MB/s. Shouldn't I? around 200MB/s rather than the 256MB/s that you'd expect. Fibre channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data in 10 bits on the wire. So while 256MB/s is being transmitted on the connection itself, only 200MB/s of that is the data that you're transmitting. Chad Mynhier ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
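Drew's 20-second figure is easy to sanity-check (assuming ~400MB/s of payload coming in from a 4Gb host link and ~200MB/s draining out to the 2Gb backend; real rates will be a little lower once protocol overhead is counted):

  # MB of NVRAM divided by the net fill rate (inflow minus drain), in seconds
  echo $(( 4096 / (400 - 200) ))    # => 20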
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: > > On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > > > On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > > > (Including storage-discuss) > > > > > > > > > > I have 6 6140s with 96 disks. Out of which 64 of them are Seagate > > > > > ST337FC (300GB - 10K RPM FC-AL) > > > > > > > > Those disks are 2Gb disks, so the tray will operate at 2Gb. > > > > > > > > > > That is still 256MB/s. I am getting about 194MB/s > > > > 2Gb fibre channel is going to max out at a data transmission rate > > around 200MB/s rather than the 256MB/s that you'd expect. Fibre > > channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data > > in 10 bits on the wire. So while 256MB/s is being transmitted on the > > connection itself, only 200MB/s of that is the data that you're > > transmitting. > > But I am running 4GB fiber channels with 4GB NVRAM on a 6 tray of > 300GB FC 10K rpm (2Gb/s) disks > > So I should get "a lot" more than ~ 200MB/s. Shouldn't I? Here, I'm relying on what Louwtjie said above, that the tray itself is going to be limited to 2Gb/s because of the 2Gb/s FC disks. Chad Mynhier ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Calum Benson wrote: > On 20 Nov 2007, at 13:35, Christian Kelly wrote: >> Take the example I gave before, where you have a pool called, say, >> pool1. In the pool you have two ZFSes: pool1/export and pool1/ >> export/home. So, suppose the user chooses /export in nautilus and >> adds this to the backup list. Will the user be aware, from browsing >> through nautilus, that /export/home may or may not be backed up - >> depending on whether the -r (?) option is used. > > I'd consider that to be a fairly strong requirement, but it's not > something I particularly thought through for the mockups. > > One solution might be to change the nautilus background for folders > that are being backed up, another might be an indicator in the status > bar, another might be emblems on the folder icons themselves. I think changing the background is a non starter since users can change the background already anyway. An emblem is good for the case where you are looking from "above" a dataset that is tagged for backup. An indicator in the status bar is good for when you are "in" a dataset that is tagged for backup. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On Nov 20, 2007 7:01 AM, Chad Mynhier <[EMAIL PROTECTED]> wrote: > On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > > On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > > (Including storage-discuss) > > > > > > > > I have 6 6140s with 96 disks. Out of which 64 of them are Seagate > > > > ST337FC (300GB - 10K RPM FC-AL) > > > > > > Those disks are 2Gb disks, so the tray will operate at 2Gb. > > > > > > That is still 256MB/s. I am getting about 194MB/s > > 2Gb fibre channel is going to max out at a data transmission rate But I am running 4Gb fibre channel links, with 4GB of NVRAM, on 6 trays of 300GB FC 10K RPM (2Gb/s) disks. So I should get "a lot" more than ~200MB/s. Shouldn't I? > around 200MB/s rather than the 256MB/s that you'd expect. Fibre > channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data > in 10 bits on the wire. So while 256MB/s is being transmitted on the > connection itself, only 200MB/s of that is the data that you're > transmitting. > > Chad Mynhier > -- Asif Iqbal PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] zpool io to 6140 is really slow
On Nov 20, 2007 1:48 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > > > That is still 256MB/s. I am getting about 194MB/s > > No, I don't think you can take 2Gbit / 8 bits per byte and say 256MB is > what you should get... > Someone with far more FC knowledge can comment here. There must be > some overhead in transporting data (as with regular SCSI) ... in the > same way ULTRA 320MB SCSI never yields close to 320 MB/s ... even > though it might seem so. > > > Adding a second loop by adding another non-active port I may have to > > rebuild the > > FS, no? > > No. Use MPxIO to help you out here ... Solaris will see the same LUNs > on each of the 2, 3 or 4 ports on the primary controller ... but with > multi-pathing switched on it will only give you 1 vhci LUN to work with. > > What I would do is export the zpool(s). Hook up more links to the > primary and enable scsi_vhci. Reboot and look for the new cX vhci > devices. > > zpool import should rebuild the pools from the multipath devices just fine. > > Interesting test though. > > > I am getting 194MB/s. Hmm, my 490 has 16G memory. I really wish I could benefit > > some > > from OS and controller RAM, at least for Oracle I/O > > Close to 200MB seems good from 1 x 2Gb. Should I not see a significant performance gain (I am not getting any) from the 2 x 2GB of NVRAM on my RAID controllers? > > Something else to try ... when creating hardware LUNs, one can assign > the LUN to either controller A or B (as preferred or owner). By doing > assignments one can use the secondary controller ... you are going to > then "stripe" over controllers .. as one way of looking at it. > > PS: Is this a direct connection? Switched fabric? > -- Asif Iqbal PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
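For the archives, the MPxIO dance Louwtjie describes looks roughly like this (a sketch for Sun-branded FC HBAs using the fp driver; the pool name is invented, and I'd read stmsboot(1M) and rehearse on scratch storage first):

  zpool export tank        # quiesce the pool before re-cabling
  stmsboot -e              # enable MPxIO; it will prompt for a reboot
  # after the reboot each LUN appears once, as a single scsi_vhci
  # device with a long WWN-based c#t...d0 name
  zpool import tank        # the pool reassembles on the multipathed devices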
Re: [zfs-discuss] ZFS + DB + "fragments"
> > doing these writes now sounds like a > > lot of work. I'm guessing that needing two full-path > > updates to achieve this means you're talking about a > > much greater write penalty. > > Not all that much. Each full-path update is still > only a single write request to the disk, since all > the path blocks (again, possibly excepting the > superblock) are batch-written together, thus mostly > increasing only streaming bandwidth consumption. Ok, that took some thinking about. I'm pretty new to ZFS, so I've only just gotten my head around how CoW works, and I'm not used to thinking about files at this kind of level. I'd not considered that path blocks would be batch-written close together, but of course that makes sense. What I'd been thinking was that ordinarily files would get fragmented as they age, which would make these updates slower as blocks would be scattered over the disk, so a full-path update would take some time. I'd forgotten that the whole point of doing this is to prevent fragmentation... So a nice side effect of this approach is that if you use it, it makes itself more efficient :D This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz DEGRADED state
> So there is no current way to specify the creation of > a 3 disk raid-z > array with a known missing disk? Can someone answer that? Or does the zpool command NOT accommodate the creation of a degraded raidz array? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
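Not directly, as far as I know -- but the trick that usually gets posted here is to stand in a sparse file for the missing disk and then offline it (a sketch only; sizes and device names are invented, and I'd rehearse it on scratch storage before trusting real data to it):

  mkfile -n 300g /var/tmp/fakedisk      # sparse stand-in, sized like the real disks
  zpool create tank raidz c1t0d0 c1t1d0 /var/tmp/fakedisk
  zpool offline tank /var/tmp/fakedisk  # pool now runs DEGRADED but usable
  rm /var/tmp/fakedisk                  # reclaim the little space it held
  # when the third disk arrives:
  # zpool replace tank /var/tmp/fakedisk c1t2d0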
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Calum Benson wrote: > Right, for Phase 0 the thinking was that you'd really have to manually > set up whatever pools and filesystems you required first. So in your > example, you (or, perhaps, the Indiana installer) would have had to > set up /export/home/chris/Documents as a ZFS filesystem in its own > right before you could start taking snapshots of it. > > Were we to stick with this general design, in later phases, creating a > new ZFS filesystem on the fly, and migrating the contents of the > existing folder into it, would hopefully happen behind the scenes > when you selected that folder to be backed up. (That could presumably > be quite a long operation, though, for folders with large contents.) > Ah, I see. So, for phase 0, the 'Enable Automatic Snapshots' option would only be available for/work for existing ZFSes. Then at some later stage, create them on the fly. > I have no problem looking at it from that angle if it turns out that's > what people want -- much of the UI would be fairly similar. But at the > same time, I don't necessarily always expect OSX users' requirements > to be the same as Solaris users' requirements -- I'd especially like to > hear from people who are already using Tim's snapshot and backup > services, to find out how they use it and what their needs are. Yes, absolutely, OSX users' requirements probably vary wildly from those of Solaris users. I guess I fall into what we might call the 'lazy' category of user ;) I'm aware of Tim's tool, don't use it though. -Christian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
Louwtjie Burger wrote: > Richard Elling wrote: > > > > >- COW probably makes that conflict worse > > > > > > > > > > This needs to be proven with a reproducible, real-world > workload before it > > makes sense to try to solve it. After all, if we cannot > measure where > > we are, > > how can we prove that we've improved? > > I agree, let's first find a reproducible example where "updates" > negatively impacts large table scans ... one that is rather simple (if > there is one) to reproduce and then work from there. I'd say it would be possible to define a reproducible workload that demonstrates this using the Filebench tool... I haven't worked with it much (maybe over the holidays I'll be able to do this), but I think a workload like: 1) create a large file (bigger than main memory) on an empty ZFS pool. 2) time a sequential scan of the file 3) random write i/o over say, 50% of the file (either with or without matching blocksize) 4) time a sequential scan of the file The difference between times 2 and 4 are the "penalty" that COW block reordering (which may introduce seemingly-random seeks between "sequential" blocks) imposes on the system. It would be interesting to watch seeksize.d's output during this run too. --Joe ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
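Until someone writes the filebench profile, the same experiment can be approximated with dd (a sketch; the sizes, pool and file names are invented, the random-rewrite loop needs ksh or bash for $RANDOM, and firing off one dd per block is slow but adequate for a one-off test):

  zfs create tank/fragtest
  # 1) create a file larger than main memory (~16GB in 128K records here)
  dd if=/dev/zero of=/tank/fragtest/big bs=128k count=131072
  # 2) time a sequential scan
  time dd if=/tank/fragtest/big of=/dev/null bs=128k
  # 3) COW-rewrite ~50% of the records at random offsets
  i=0
  while [ $i -lt 65536 ]; do
      dd if=/dev/zero of=/tank/fragtest/big bs=128k count=1 \
          seek=$(( (RANDOM * 32768 + RANDOM) % 131072 )) conv=notrunc 2>/dev/null
      i=$(( i + 1 ))
  done
  # 4) time the sequential scan again; the slowdown is the COW penalty
  time dd if=/tank/fragtest/big of=/dev/null bs=128k

Watching seeksize.d during step 4, as suggested above, should show the extra seeks directly.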
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 13:35, Christian Kelly wrote: > > Take the example I gave before, where you have a pool called, say, > pool1. In the pool you have two ZFSes: pool1/export and > pool1/export/home. So, suppose the user chooses /export in nautilus and > adds this to the backup list. Will the user be aware, from browsing > through nautilus, that /export/home may or may not be backed up - > depending on whether the -r (?) option is used. I'd consider that to be a fairly strong requirement, but it's not something I particularly thought through for the mockups. One solution might be to change the nautilus background for folders that are being backed up, another might be an indicator in the status bar, another might be emblems on the folder icons themselves. Which approach works best would probably depend on whether we expect most of the folders people are browsing regularly to be backed up, or not backed up -- in general, you'd want any sort of indicator to show the less common state. Cheeri, Calum. -- CALUM BENSON, Usability Engineer Sun Microsystems Ireland mailto:[EMAIL PROTECTED] GNOME Desktop Team http://blogs.sun.com/calum +353 1 819 9771 Any opinions are personal and not necessarily those of Sun Microsystems ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
... > With regards sharing the disk resources with other > programs, obviously it's down to the individual > admins how they would configure this, Only if they have an unconstrained budget. but I would > suggest that if you have a database with heavy enough > requirements to be suffering noticable read > performance issues due to fragmentation, then that > database really should have it's own dedicated drives > and shouldn't be competing with other programs. You're not looking at it from a whole-system viewpoint (which if you're accustomed to having your own dedicated storage devices is understandable). Even if your database performance is acceptable, if it's performing 50x as many disk seeks as it would otherwise need to when scanning a table that's affecting the performance of *other* applications. > > Also, I'm not saying defrag is bad (it may be the > better solution here), just that if you're looking at > performance in this kind of depth, you're probably > experienced enough to have created the database in a > contiguous chunk in the first place :-) As I noted, ZFS may not allow you to ensure that and in any event if the database grows that contiguity may need to be reestablished. You could grow the db in separate files, each of which was preallocated in full (though again ZFS may not allow you to ensure that each is created contiguously on disk), but while databases may include such facilities as a matter of course it would still (all other things being equal) be easier to manage everything if it could just extend a single existing file (or one file per table, if they needed to be kept separate) as it needed additional space. > > I do agree that doing these writes now sounds like a > lot of work. I'm guessing that needing two full-path > updates to achieve this means you're talking about a > much greater write penalty. Not all that much. Each full-path update is still only a single write request to the disk, since all the path blocks (again, possibly excepting the superblock) are batch-written together, thus mostly increasing only streaming bandwidth consumption. ... > It may be that ZFS is not a good fit for this kind of > use, and that if you're really concerned about this > kind of performance you should be looking at other > file systems. I suspect that while it may not be a great fit now with relatively minor changes it could be at least an acceptable one. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On 20 Nov 2007, at 12:56, Christian Kelly wrote: > Hi Calum, > > heh, as it happens, I was tinkering with pygtk to see how difficult > this would be :) > > Supposing I have a ZFS on my machine called root/export/home which > is mounted on /export/home. Then I have my home dir as > /export/home/chris. Say I only want to snapshot and backup > /export/home/chris/Documents. I can't create a snapshot of > /export/home/chris/Documents as it is a directory, I have to create > a snapshot of the parent ZFS, in this case /export/home/. So there > isn't really the granularity that the attached spec implies. Someone > correct me if I'm wrong, but I just tried it and it didn't work. Right, for Phase 0 the thinking was that you'd really have to manually set up whatever pools and filesystems you required first. So in your example, you (or, perhaps, the Indiana installer) would have had to set up /export/home/chris/Documents as a ZFS filesystem in its own right before you could start taking snapshots of it. Were we to stick with this general design, in later phases, creating a new ZFS filesystem on the fly, and migrating the contents of the existing folder into it, would hopefully happen behind the scenes when you selected that folder to be backed up. (That could presumably be quite a long operation, though, for folders with large contents.) > I've had a bit of a look at 'Time Machine' and I'd be more in > favour of that style of backup. Just back up everything so I don't > have to think about it. My feeling is that picking individual > directories out just causes confusion. Think of it this way: how > much change is there on a daily basis on your desktop/laptop? Those > snapshots aren't going to grow very quickly. I have no problem looking at it from that angle if it turns out that's what people want -- much of the UI would be fairly similar. But at the same time, I don't necessarily always expect OSX users' requirements to be the same as Solaris users' requirements -- I'd especially like to hear from people who are already using Tim's snapshot and backup services, to find out how they use it and what their needs are. Cheeri, Calum. -- CALUM BENSON, Usability Engineer Sun Microsystems Ireland mailto:[EMAIL PROTECTED] GNOME Desktop Team http://blogs.sun.com/calum +353 1 819 9771 Any opinions are personal and not necessarily those of Sun Microsystems ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
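Behind the scenes, that migration would presumably boil down to something like the following (a sketch using Christian's dataset names; the dataset name for Documents is invented, and a real tool would have to preserve ACLs/attributes and cope with failures part-way through):

  mv /export/home/chris/Documents /export/home/chris/Documents.migrating
  zfs create root/export/home/documents     # hypothetical dataset name
  zfs set mountpoint=/export/home/chris/Documents root/export/home/documents
  cp -rp /export/home/chris/Documents.migrating/. /export/home/chris/Documents/
  rm -rf /export/home/chris/Documents.migrating

The cp is what makes it the potentially long operation mentioned above.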
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
On Tue, 2007-11-20 at 13:35 +0000, Christian Kelly wrote: > What I'm suggesting is that the configuration presents a list of pools > and their ZFSes and that you have a checkbox, backup/don't backup sort > of an option. That's basically the (hacked-up) zenity GUI I have at the moment on my blog: download & install the packages and you'll see. I think getting that into a proper tree structure would help. Right now, there's a bug in my gui, such that with: [X] tank [ ] tank/timf [ ] tank/timf/Documents [ ] tank/timf/Music Selecting "tank" implicitly marks the other filesystems for backup because of the way zfs properties inherit. (load the above gui again having just selected tank, and you'll see the other filesystems being selected for you) Having said that, I like Calum's ideas - and am happy to defer the decision about the gui to someone a lot more qualified than I in this area :-) I think that when browsing directories in nautilus, it would be good to have some sort of "backup" or "snapshot" icon (à la the little padlock in secure web-browsing sessions) to let you know that this directory is being backed up and/or included in snapshots. cheers, tim -- Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops http://blogs.sun.com/timf ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
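For what it's worth, the inheritance problem is easy to demonstrate at the command line. Assuming the backup flag were stored as a ZFS user property (the property name here is invented for illustration):

  zfs set org.example:backup=true tank
  zfs get -r org.example:backup tank
  # NAME                 PROPERTY            VALUE  SOURCE
  # tank                 org.example:backup  true   local
  # tank/timf            org.example:backup  true   inherited from tank
  # tank/timf/Documents  org.example:backup  true   inherited from tank

So ticking "tank" really does tick everything beneath it, unless the tool explicitly sets the property to false on each child (or uses zfs inherit to undo it).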
Re: [zfs-discuss] ZFS + DB + "fragments"
Hmm... that's a pain if updating the parent also means updating the parent's checksum too. I guess the functionality is there for moving bad blocks, but since that's likely to be a rare occurrence, it wasn't something that would need to be particularly efficient. As regards sharing the disk resources with other programs, obviously it's down to the individual admins how they would configure this, but I would suggest that if you have a database with heavy enough requirements to be suffering noticeable read performance issues due to fragmentation, then that database really should have its own dedicated drives and shouldn't be competing with other programs. I'm not saying defrag is bad (it may be the better solution here), just that if you're looking at performance in this kind of depth, you're probably experienced enough to have created the database in a contiguous chunk in the first place :-) I do agree that doing these writes now sounds like a lot of work. I'm guessing that needing two full-path updates to achieve this means you're talking about a much greater write penalty. And that means you can probably expect a significant read penalty if you have any significant volume of writes at all, which would rather defeat the point. After all, if you have a low enough amount of writes to not suffer from this penalty, your database isn't going to be particularly fragmented. However, I'm now in over my depth. This needs somebody who knows the internal architecture of ZFS to decide whether it's feasible or desirable, and whether defrag is a good enough workaround. It may be that ZFS is not a good fit for this kind of use, and that if you're really concerned about this kind of performance you should be looking at other file systems. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
> Time Machine stores everything in the system by default, but you can > still exclude items that you don't want stored. And Time Machine > doesn't use ZFS. > Here we will use ZFS snapshots, and what they work with is file > systems. In Nevada the default file system is not ZFS, which means > some directories are not ZFS; so it seems you have to select individual > directories that are ZFS, and it's impossible to store everything > (some of it is not ZFS)... > What I'm suggesting is that the configuration presents a list of pools and their ZFSes and that you have a checkbox, backup/don't backup sort of an option. When you start having nested ZFSes it could get confusing as to what you are actually backing up if you start browsing down through the filesystem with the likes of nautilus. Take the example I gave before, where you have a pool called, say, pool1. In the pool you have two ZFSes: pool1/export and pool1/export/home. So, suppose the user chooses /export in nautilus and adds this to the backup list. Will the user be aware, from browsing through nautilus, that /export/home may or may not be backed up, depending on whether the -r (?) option is used? I guess what I'm saying is, how aware of the behavior of ZFS must the user be in order to use this backup system? -Christian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
... > My understanding of ZFS (in short: an upside down > tree) is that each block is referenced by it's > parent. So regardless of how many snapshots you take, > each block is only ever referenced by one other, and > I'm guessing that the pointer and checksum are both > stored there. > > If that's the case, to move a block it's just a case > of: > - read the data > - write to the new location > - update the pointer in the parent block Which changes the contents of the parent block (the change in the data checksum changed it as well), and thus requires that this parent also be rewritten (using COW), which changes the pointer to it (and of course its checksum as well) in *its* parent block, which thus also must be re-written... and finally a new copy of the superblock is written to reflect the new underlying tree structure - all this in a single batch-written 'transaction'. The old version of each of these blocks need only be *saved* if a snapshot exists and it hasn't previously been updated since that snapshot was created. But all the blocks need to be COWed even if no snapshot exists (in which case the old versions are simply discarded). ... > PS. > > >1. You'd still need an initial defragmentation pass > to ensure that the file was reasonably piece-wise > contiguous to begin with. > > No, not necessarily. If you were using a zpool > configured like this I'd hope you were planning on > creating the file as a contiguous block in the first > place :) I'm not certain that you could ensure this if other updates in the system were occurring concurrently. Furthermore, the file may be extended dynamically as new data is inserted, and you'd like to have some mechanism that could restore reasonable contiguity to the result (which can be difficult to accomplish in the foreground if, for example, free space doesn't happen to exist on the disk right after the existing portion of the file). ... > Any zpool with this option would probably be > dedicated to the database file and nothing else. In > fact, even with multiple databases I think I'd have a > single pool per database. It's nice if you can afford such dedicated resources, but it seems a bit cavalier to ignore users who just want decent performance from a database that has to share its resources with other activity. Your prompt response is probably what prevented me from editing my previous post after I re-read it and realized I had overlooked the fact that over-writing the old data complicates things. So I'll just post the revised portion here: 3. Now you must make the above transaction persistent, and then randomly over-write the old data block with the new data (since that data must be in place before you update the path to it below, and unfortunately since its location is not arbitrary you can't combine this update with either the transaction above or the transaction below). 4. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). 
However, this is just the normal situation whenever you update a data block (save for the fact that the block itself was already written above): all the *additional* overhead occurred in the previous steps. So instead of a single full-path update that fragments the file, you have two full-path updates, a random write, and possibly a random read initially to fetch the old data. And you still need an initial defrag pass to establish initial contiguity. Furthermore, these additional resources are consumed at normal priority rather than the reduced priority at which a background reorg can operate. On the plus side, though, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so. ... > Taking it a stage further, I wonder if this would > work well with the prioritized write feature request > (caching writes to a solid state disk)? > http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List > > That could potentially mean there's very little > slowdown: > - Read the original block > - Save that to solid state disk > - Write the new block in the original location > - Periodically stream writes from the solid state > disk to the main storage I'm not sure this would confer much benefit if things in fact need to be handled as I described above. In particular, if a snapshot exists you almost certainly must establish the old version in its new location in the snapshot rather than just capture it in the log; if no snapshot exists you could presumably just discard the old copy (as a normal COW update would) rather than stage it anywhere at all. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Christian Kelly wrote: > Hi Calum, > > heh, as it happens, I was tinkering with pygtk to see how difficult this > would be :) > > Supposing I have a ZFS on my machine called root/export/home which is > mounted on /export/home. Then I have my home dir as /export/home/chris. > Say I only want to snapshot and backup /export/home/chris/Documents. I > can't create a snapshot of /export/home/chris/Documents as it is a > directory, I have to create a snapshot of the parent ZFS, in this case > /export/home/. So there isn't really the granularity that the attached > spec implies. Someone correct me if I'm wrong, but I just tried it and > it didn't work. > > I've had a bit of a look at 'Time Machine' and I'd be more in favour of > that style of backup. Just back up everything so I don't have to think > about it. My feeling is that picking individual directories out just > causes confusion. Think of it this way: how much change is there on a > daily basis on your desktop/laptop? Those snapshots aren't going to grow > very quickly. Time Machine stores everything in the system by default, but you can still exclude items that you don't want stored. And Time Machine doesn't use ZFS. Here we will use ZFS snapshots, and what they work with is file systems. In Nevada the default file system is not ZFS, which means some directories are not ZFS; so it seems you have to select individual directories that are ZFS, and it's impossible to store everything (some of it is not ZFS)... > > -Christian > > > > Calum Benson wrote: >> Hi all, >> >> We've been thinking a little about a more integrated desktop presence >> for Tim Foster's ZFS backup and snapshot services[1]. Here are some >> initial ideas about what a Phase 0 (snapshot only, not backup) user >> experience might look like... comments welcome. >> >> http://www.genunix.org/wiki/index.php/ZFS_Snapshot >> >> (I'm not subscribed to zfs-discuss, so please make sure either >> desktop-discuss or I remain cc'ed on any replies if you want me to >> see them...) >> >> Cheeri, >> Calum. >> >> [1] http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people >> >> > > ___ > desktop-discuss mailing list > [EMAIL PROTECTED] -- Henry Zhang JDS Software Development, OPG Sun China Engineering & Research Institute Sun Microsystems, Inc. 10/F Chuang Xin Plaza, Tsinghua Science Park Beijing 100084, P.R. China Tel: +86 10 62673866 Fax: +86 10 62780969 eMail: [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI
Hi Calum, heh, as it happens, I was tinkering with pygtk to see how difficult this would be :) Supposing I have a ZFS on my machine called root/export/home which is mounted on /export/home. Then I have my home dir as /export/home/chris. Say I only want to snapshot and backup /export/home/chris/Documents. I can't create a snapshot of /export/home/chris/Documents as it is a directory, I have to create a snapshot of the parent ZFS, in this case /export/home/. So there isn't really the granularity that the attached spec implies. Someone correct me if I'm wrong, but I just tried it and it didn't work. I've had a bit of a look at 'Time Machine' and I'd be more in favour of that style of backup. Just back up everything so I don't have to think about it. My feeling is that picking individual directories out just causes confusion. Think of it this way: how much change is there on a daily basis on your desktop/laptop? Those snapshots aren't going to grow very quickly. -Christian Calum Benson wrote: > Hi all, > > We've been thinking a little about a more integrated desktop presence > for Tim Foster's ZFS backup and snapshot services[1]. Here are some > initial ideas about what a Phase 0 (snapshot only, not backup) user > experience might look like... comments welcome. > > http://www.genunix.org/wiki/index.php/ZFS_Snapshot > > (I'm not subscribed to zfs-discuss, so please make sure either > desktop-discuss or I remain cc'ed on any replies if you want me to > see them...) > > Cheeri, > Calum. > > [1] http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people > > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
In that case, this may be a much tougher nut to crack than I thought. I'll be the first to admit that other than having seen a few presentations I don't have a clue about the details of how ZFS works under the hood, however... You mention that moving the old block means updating all its ancestors. I had naively assumed moving a block would be relatively simple, and would also update all the ancestors. My understanding of ZFS (in short: an upside-down tree) is that each block is referenced by its parent. So regardless of how many snapshots you take, each block is only ever referenced by one other, and I'm guessing that the pointer and checksum are both stored there. If that's the case, to move a block it's just a case of: - read the data - write to the new location - update the pointer in the parent block Please let me know if I'm misunderstanding ZFS here. The major problem with this is that I don't know if there's any easy way to identify the parent block from the child, or an efficient way to do this move. However, thinking about it, there must be. ZFS intelligently moves data if it detects corruption, so there must already be tools in place to do exactly what we need here. In which case, this is still relatively simple and much of the code already exists. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + "fragments"
... > - Nathan appears to have suggested a good workaround. > Could ZFS be updated to have a 'contiguous' setting > where blocks are kept together? This sacrifices > write performance for read. I had originally thought that this would be incompatible with ZFS's snapshot mechanism, but with a minor tweak it may not be. ... > - Bill seems to understand the issue, and added some > useful background (although in an entertaining but > rather condescending way). There is a bit of nearby history that led to that. ... > One point that I haven't seen raised yet: I believe > most databases will have had years of tuning based > around the assumption that their data is saved > contiguously on disk. They will be optimising their > disk access based on that and this is not something > we should ignore. Ah - nothing like real, experienced user input. I tend to agree with ZFS's general philosophy of attempting to minimize the number of knobs that need tuning, but this can lead to forgetting that higher-level software may have knobs of its own. My original assumption was that databases automatically attempted to leverage on-disk contiguity (which the more evolved ones certainly do when they're controlling the on-disk layout themselves and one might suspect try to do even when running on top of files by assuming that the file system is trying to preserve on-disk contiguity), but of course admins play a major role as well (e.g., in determining which indexes need not be created because sequential table scans can get the job done efficiently). ... > I definitely don't think defragmentation is the > solution (although that is needed in ZFS for other > scenarios). If your database is under enough read > strain to need the fix suggested here, your disks > definitely do not have the time needed to scan and > defrag the entire system. Well, it's only this kind of randomly-updated/sequentially-scanned data that needs much defragmentation in the first place. Data that's written once and then only read at worst needs a single defragmentation pass (if the original writes got interrupted by a lot of other update activity), data that's not read sequentially (e.g., indirect blocks) needn't be defragmented at all, nor need data that's seldom read and/or not very fragmented in the first place. > > It would seem to me that Nathan's suggestion right at > the start of the thread is the way to go. It > guarantees read performance for the database, and > would seem to be relatively easy to implement at the > zpool level. Yes it adds considerable overhead to > writes, but that is a decision database > administrators can make given the expected load. > > If I'm understanding Nathan right, saving a block of > data would mean: > - Reading the original block (may be cached if we're > lucky) > - Saving that block to a new location > - Saving the new data to the original location 1. You'd still need an initial defragmentation pass to ensure that the file was reasonably piece-wise contiguous to begin with. 2. You can't move the old version of the block without updating all its ancestors (since the pointer to it changes). When you update this path to the old version, you need to suppress the normal COW behavior if a snapshot exists because it would otherwise maintain the old path pointing to the old data location that you're just about to over-write below. 
This presumably requires establishing the entire new path and deallocating the entire old path in a single transaction, but this may just be equivalent to a normal data block 'update' (that just doesn't happen to change any data in the block) when no snapshot exists. I don't *think* that there should be any new issues raised with other updates that may be combined in the same 'transaction', even if they may affect some of the same ancestral blocks. 3. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block: all the *additional* overhead occurred in the previous steps. Given that doing the update twice, as described above, only adds to the bandwidth consumed (steps 2 and 3 should be able to be combined in a single transaction), the only additional disk seek would be that required to re-read the original data if it wasn't cached. So you may well be correct that this approach would likely consume fewer resources than background defragmentation would (though, as noted above, you'd still need an initial defrag pass to get the file piece-wise contiguous to begin with). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS snapshot GUI
Hi all, We've been thinking a little about a more integrated desktop presence for Tim Foster's ZFS backup and snapshot services[1]. Here are some initial ideas about what a Phase 0 (snapshot only, not backup) user experience might look like... comments welcome. http://www.genunix.org/wiki/index.php/ZFS_Snapshot (I'm not subscribed to zfs-discuss, so please make sure either desktop-discuss or I remain cc'ed on any replies if you want me to see them...) Cheeri, Calum. [1] http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people -- CALUM BENSON, Usability Engineer Sun Microsystems Ireland mailto:[EMAIL PROTECTED] GNOME Desktop Team http://blogs.sun.com/calum +353 1 819 9771 Any opinions are personal and not necessarily those of Sun Microsystems ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow
On 11/20/07, Asif Iqbal <[EMAIL PROTECTED]> wrote: > On Nov 19, 2007 1:43 AM, Louwtjie Burger <[EMAIL PROTECTED]> wrote: > > On Nov 17, 2007 9:40 PM, Asif Iqbal <[EMAIL PROTECTED]> wrote: > > > (Including storage-discuss) > > > > > > I have 6 6140s with 96 disks. Out of which 64 of them are Seagate > > > ST337FC (300GB - 10K RPM FC-AL) > > > > Those disks are 2Gb disks, so the tray will operate at 2Gb. > > > > That is still 256MB/s. I am getting about 194MB/s 2Gb fibre channel is going to max out at a data transmission rate around 200MB/s rather than the 256MB/s that you'd expect. Fibre channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data in 10 bits on the wire. So while 256MB/s is being transmitted on the connection itself, only 200MB/s of that is the data that you're transmitting. Chad Mynhier ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
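The arithmetic, for anyone who wants to plug in their own link speed (this ignores FC frame and protocol overhead, so real throughput lands a little lower still):

  # payload rate of a 2Gb/s link with 8b/10b encoding: each 10-bit code
  # group on the wire carries 8 data bits, i.e. one payload byte
  echo $(( 2000000000 / 10 / 1000000 ))    # => 200 MB/s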
Re: [zfs-discuss] ZFS + DB + "fragments"
My initial thought was that this whole thread may be irrelevant - anybody wanting to run such a database is likely to use a specialised filesystem optimised for it. But then I realised that for a database admin the integrity checking and other benefits of ZFS would be very tempting, but only if ZFS can guarantee equivalent performance to other filesystems. So, let me see if I understand this right: - Louwtjie is concerned that ZFS will fragment databases, potentially leading to read performance issues for some databases. - Nathan appears to have suggested a good workaround. Could ZFS be updated to have a 'contiguous' setting where blocks are kept together? This sacrifices write performance for read. - Richard isn't convinced there's a problem as he's not seen any data supporting this. I can see his point, but I don't agree that this is a non starter. For certain situations it could be very useful, and balancing read and write performance is an integral part of the choice of storage configuration. - Bill seems to understand the issue, and added some useful background (although in an entertaining but rather condescending way). Richard then went into a little more detail. I think he's pointing out here that while contiguous data is fastest if you consider a single disk, it is not necessarily the fastest approach when your data is spread across multiple disks. Instead he feels a 'diverse stochastic spread' is needed. I guess that means you want the data spread so all the disks can be used in parallel. I think I'm now seeing why Richard is asking for real data. I think he believes that ZFS may already be faster than or equal to a standard contiguous filesystem in this scenario. Richard seems to be using a random or statistical approach to this: if data is saved randomly, you're likely to be using all disks when reading data. I do see the point, and yes, data would be useful, but I think I agree with Bill on this. For reading data, while random locations are likely to be fast in terms of using multiple disks, that data is also likely to be spread and so is almost certain to result in more disk seeks. Whereas if you have contiguous data you can guarantee that it will be striped across the maximum possible number of disks, with the minimum number of seeks. As a database admin I would take guaranteed performance over probable performance any day of the week. Especially if I can be sure that performance will be consistent and will not degrade as the database ages. One point that I haven't seen raised yet: I believe most databases will have had years of tuning based around the assumption that their data is saved contiguously on disk. They will be optimising their disk access based on that, and this is not something we should ignore. Yes, until we have data to demonstrate the problem it's just theoretical. However that may be hard to obtain, and in the meantime I think the theory is sound, and the solution easy enough that it is worth tackling. I definitely don't think defragmentation is the solution (although that is needed in ZFS for other scenarios). If your database is under enough read strain to need the fix suggested here, your disks definitely do not have the time needed to scan and defrag the entire system. It would seem to me that Nathan's suggestion right at the start of the thread is the way to go. It guarantees read performance for the database, and would seem to be relatively easy to implement at the zpool level. 
Yes it adds considerable overhead to writes, but that is a decision database administrators can make given the expected load. If I'm understanding Nathan right, saving a block of data would mean: - Reading the original block (may be cached if we're lucky) - Saving that block to a new location - Saving the new data to the original location So you've got a 2-3x slowdown in write performance, but you guarantee read performance will at least match existing filesystems (with ZFS caching, it may exceed it). ZFS then works much better with all the existing optimisations done within the database software, and you still keep all the benefits of ZFS - full data integrity, snapshots, clones, etc... For many database admins, I think that would be an option they would like to have. Taking it a stage further, I wonder if this would work well with the prioritized write feature request (caching writes to a solid state disk)? http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List That could potentially mean there's very little slowdown: - Read the original block - Save that to solid state disk - Write the new block in the original location - Periodically stream writes from the solid state disk to the main storage In theory there's no need for the drive head to move at all between the read and the write, so this should only be fractionally slower than traditional ZFS writes. Yes, the data needs to be written out twice, but the second copy is a streaming write that can be deferred until the disks are quieter. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs on a raid box
Hi MP, MP wrote: >> but my issue is that >> not only the 'time left', but also the progress >> indicator itself varies >> wildly, and keeps resetting itself to 0%, not giving >> any indication that > > Are you sure you are not being hit by this bug: > > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667 > > i.e. scrub or resilver get's reset to 0% on a snapshot creation or deletion. >Cheers. I'm very sure of that: I've never done a snapshot on these, and I am the only user on the machine (it's not in production yet). Regards, Paul Boven. -- Paul Boven <[EMAIL PROTECTED]> +31 (0)521-596547 Unix/Linux/Networking specialist Joint Institute for VLBI in Europe - www.jive.nl VLBI - It's a fringe science ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs on a raid box
> but my issue is that > not only the 'time left', but also the progress > indicator itself varies > wildly, and keeps resetting itself to 0%, not giving > any indication that Are you sure you are not being hit by this bug: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667 i.e. scrub or resilver gets reset to 0% on a snapshot creation or deletion. Cheers. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz2
Hi Eric, everyone, Eric Schrock wrote: > There have been many improvements in proactively detecting failure, > culminating in build 77 of Nevada. Earlier builds: > > - Were unable to distinguish device removal from devices misbehaving, > depending on the driver and hardware. > > - Did not diagnose a series of I/O failures as disk failure. > > - Allowed several (painful) SCSI retries and continued to queue up I/O, > even if the disk was fatally damaged. > Most classes of hardware would behave reasonably well on device removal, > but certain classes caused cascading failures in ZFS, all which should > be resolved in build 77 or later. I seem to be having exactly the problems you are describing (see my postings with the subject 'zfs on a raid box'). So I would very much like to give b77 a try. I'm currently running b76, as that's the latest sxce that's available. Are the sources to anything beyond b76 already available? Would I need to build it, or bfu? I'm seeing zfs not making use of available hot-spares when I pull a disk, long and indeed painful SCSI retries and very poor write performance on a degraded zpool - I hope to be able to test if b77 fares any better with this. Regards, Paul Boven. -- Paul Boven <[EMAIL PROTECTED]> +31 (0)521-596547 Unix/Linux/Networking specialist Joint Institute for VLBI in Europe - www.jive.nl VLBI - It's a fringe science ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss