Re: [zfs-discuss] How Do I know if a ZFS snapshot is complete?
I took a snapshot of one of my oracle filesystems this week and when someone tried to add data to it, it filled up. I tried to remove some data, but the snapshot seemed to keep claiming the space as I deleted it. I had taken the snapshot days earlier. Does this make sense? Snapshots are complete when your 'zfs snapshot' command returns. The reason deletes do not free up space is that snapshots are just that - snapshots of the file system at a point in time - which means the pool has to retain all data referenced by the file system as of the snapshot. To reclaim the space held by snapshots, the snapshots themselves have to be destroyed (a short example follows below). -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
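A minimal sketch of what that looks like (pool, filesystem and snapshot names here are made up):

    zfs list -t snapshot -r tank/oracle     # USED shows, roughly, the space each snapshot alone is holding
    zfs destroy tank/oracle@before-load     # deleted data becomes reclaimable once no snapshot references it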
Re: [zfs-discuss] Tuning the ARC towards LRU
I realized I forgot to follow up on this thread. Just to be clear, I have confirmed that I am seeing what to me is undesirable behavior even with the ARC being 1500 MB in size on an almost idle system (0.5 MB/sec read load, almost zero write load). Observe these recursive searches through /usr/src/sys:

% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.74s user 1.19s system 20% cpu 19.143 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.45s user 0.51s system 99% cpu 2.986 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.41s user 0.62s system 53% cpu 5.667 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.37s user 0.68s system 50% cpu 6.025 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.45s user 0.61s system 45% cpu 6.694 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.45s user 0.59s system 53% cpu 5.651 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.32s user 0.72s system 46% cpu 6.503 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.41s user 0.66s system 44% cpu 6.843 total
% time ack arcstats 2>/dev/null 1>/dev/null
ack arcstats 2> /dev/null 1> /dev/null  2.37s user 0.67s system 49% cpu 6.119 total

The first run was entirely cold. For some reason the second was close to CPU-bound, while the remainder were significantly disk-bound, even if not to the extent of the initial run. I correlated with 'iostat -x 1' to confirm that I am in fact generating I/O (but no, I do not have dtrace output). Anyway, presumably the answer to my original question is no, and the above isn't really very interesting other than to show that under some circumstances you can see behavior that is decidedly non-optimal for interactive desktop use of certain kinds. Whether this is the ARC in general or something FreeBSD specific, I don't know. But at this point it does not have to do with ARC sizing, since the ARC is sensibly large. (I realize I should investigate properly and report back, but I'm not likely to have time to dig into this now.) -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Tuning the ARC towards LRU
Hello, For desktop use, and presumably rapidly changing non-desktop uses, I find the ARC cache pretty annoying in its behavior. For example, this morning I had to hit my launch-terminal key perhaps 50 times (roughly) before it would start completing without disk I/O. There are plenty of other examples as well, such as /var/db/pkg not being pulled aggressively into cache, so that pkg_* operations (this is on FreeBSD) are slower than they should be (I have to run pkg_info some number of times before *it* will complete without disk I/O too). I would be perfectly happy with pure LRU caching behavior or an approximation thereof, and would therefore like to essentially turn off all MFU-like weighting. I have not investigated in great depth, so it's possible this represents an implementation problem rather than the actual intended policy of the ARC. If the former, can someone confirm/deny? If the latter, is there some way to tweak it? I have not found one (other than changing the code). Is there any particular reason why such knobs are not exposed? Am I missing something? -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
It sounds like you are complaining about how FreeBSD has implemented zfs in the system rather than about zfs in general. These problems don't occur under Solaris. Zfs and the kernel need to agree on how to allocate/free memory, and it seems that Solaris is more advanced than FreeBSD in this area. It is my understanding that FreeBSD offers special zfs tunables to adjust zfs memory usage. It may be FreeBSD specific, but note that I am not talking about the amount of memory dedicated to the ARC and how it balances with free memory on the system. I am talking about eviction policy. I could be wrong, but I didn't think the ZFS port made significant changes there. And note that part of the *point* of the ARC (at least according to the original paper, though it was a while since I read it), as opposed to a pure LRU, is to do some weighting on frequency of access, which is exactly consistent with what I'm observing (very quick eviction and/or lack of insertion of data, particularly in the face of unrelated long-term I/O having happened in the background). It would likely also be the desired behavior for longer-running homogeneous disk access patterns, where optimal use of the cache over a long period may be more important than immediately reacting to a changing access pattern. So it's not like there is no reason to believe this can be about ARC policy. Why would this *not* occur on Solaris? It seems to me that it would imply the ARC was broken on Solaris, since it is not *supposed* to be a pure LRU by design. Again, there may very well be a FreeBSD specific issue here that is altering the behavior, and maybe the extremity of what I am reporting is not supposed to be happening, but I believe the issue is more involved than what you're implying in your response. -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
The ARC is designed to use as much memory as is available up to a limit. If the kernel allocator needs memory and there is none available, then the allocator requests memory back from the zfs ARC. Note that some systems have multiple memory allocators. For example, there may be a memory allocator for the network stack, and/or for a filesystem. Yes, but again I am concerned with what the ARC chooses to cache and for how long, not how the ARC balances memory with other parts of the kernel. At least, none of my observations lead me to believe the latter is the problem here. I assume that you have already read the FreeBSD ZFS tuning guide (http://wiki.freebsd.org/ZFSTuningGuide) and the ZFS filesystem section in the handbook (http://www.freebsd.org/doc/handbook/filesystems-zfs.html) and made sure that your system is tuned appropriately. Yes, I have been tweaking and fiddling and reading off and on since ZFS was originally added to CURRENT. This is not about tuning in that sense. The fact that the little data necessary to start an 'urxvt' instance does not get cached for at least 1-2 seconds on an otherwise mostly idle system is either the result of cache policy, an implementation bug (FreeBSD or otherwise), or a matter of an *extremely* small cache size. I have observed this behavior for a very long time across versions of both ZFS and FreeBSD, and with different forms of ARC sizing tweaks. It's entirely possible there are FreeBSD issues preventing the ARC from sizing itself appropriately. What I am saying, though, is that all indications are that data is not being selected for caching at all, or else is evicted extremely quickly, unless sufficient frequency has been accumulated to, presumably, make the ARC decide to cache the data. This is entirely what I would expect from a caching policy that tries to adapt to long-term access patterns and avoid premature cache eviction by looking at frequency of access. I don't see what it is that is so outlandish about my query. These are fundamental ways in which caches of different types behave, and there is a legitimate reason to not use the same cache eviction policy under all possible workloads. The behavior I am seeing is consistent with a caching policy that tries too hard (for my particular use case) to avoid eviction in the face of short-term changes in access pattern. There have been a lot of eyeballs looking at how zfs does its caching, and a ton of benchmarks (mostly focusing on server throughput) to verify the design. While there can certainly be zfs shortcomings (I have found several) these are few and far between. That's a very general statement. I am talking about specifics here. For example, you can have mountains of evidence that show that a plain LRU is optimal (under some conditions). That doesn't change the fact that if I want to avoid a sequential scan of a huge data set completely evicting everything in the cache, I cannot use a plain LRU. In this case I'm looking for the reverse; i.e., increasing the importance of 'recency', because my workload is such that it would be more optimal than the behavior I am observing. Benchmarks are irrelevant except insofar as they show that my problem is not with the caching policy, since I am trying to address an empirically observed behavior. I *will* try to look at how the ARC sizes itself, as I'm unclear on several things in the way memory is being reported by FreeBSD, but as far as I can tell these are different issues.
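For anyone wanting to check the same thing, something along these lines should show the relevant numbers on FreeBSD (sysctl names from memory, so adjust as needed; the arcstats kstats are mirrored into the sysctl tree):

    sysctl kstat.zfs.misc.arcstats.size      # current ARC size in bytes
    sysctl kstat.zfs.misc.arcstats.c         # current target size
    sysctl kstat.zfs.misc.arcstats.p         # target size of the MRU portion
    sysctl vfs.zfs.arc_min vfs.zfs.arc_max   # configured floor/ceiling (loader tunables)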
Sure, a bigger ARC might hide the behavior I happen to see; but I want the cache to behave in a way where I do not need gigabytes of extra ARC size to lure it into caching the data necessary for 'urxvt' without having to start it 50 times in a row to accumulate statistics. -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
In simple terms, the ARC is divided into a MRU and MFU side. target size (c) = target MRU size (p) + target MFU size (c-p) On Solaris, to get from the MRU to the MFU side, the block must be read at least once in 62.5 milliseconds. For pure read-once workloads, the data won't move to the MFU side and the ARC will behave exactly like an (adaptable) MRU cache. Ok. That differs significantly from my understanding, though in retrospect I should have realized it given that the arc stats contain only references to mru and mfu... I previously was under the impression that the ZFS ARC had an LRU-ish side to complement the MFU side. MRU+MFU changes things. I will have to look into it in better detail to understand the consequences. Is there a paper that describes the ARC as it is implemented in ZFS (since it clearly diverges from the IBM ARC)? I *will* try to look at how the ARC sizes itself, as I'm unclear on several things in the way memory is being reported by FreeBSD, but as far as I can tell these are different issues. For what it's worth, I confirmed that the ARC was too small and that there are clearly remaining issues with the interaction between the ARC and the rest of the FreeBSD kernel. (I wasn't sure before, but I confirmed I was looking at the right number.) I'll try to monitor more carefully and see if I can figure out when the ARC shrinks and why it doesn't grow back. Informally my observations have always been that things behave great for a while after boot, but degenerate over time. In this case it was sitting at its minimum size, which was 214M. I realize this is far below what is recommended or even designed for, but it is clearly caching *something* and I clearly *could* make it cache urxvt+deps by re-running it several tens of times in rapid succession. I'm not convinced you have attributed the observation to the ARC behaviour. Do you have dtrace (or other) data to explain what process is causing the physical I/Os? In the urxvt case, I am basing my claim on informal observations. I.e., hit the terminal launch key, wait for the disks to rattle, get my terminal. Repeat. Only by repeating it very many times in very rapid succession am I able to coerce it to be cached such that I can immediately get my terminal. And what I mean by that is that it keeps necessitating disk I/O for a long time, even on rapid successive invocations. But once I have repeated it enough times it seems to finally enter the cache. (No dtrace unfortunately. I confess to not having learned dtrace yet, in spite of thinking it's massively cool.) However, I will of course accept that given the minimal ARC size at the time I am moving completely away from the designed-for use case. And if that is responsible, it is of course my own fault. Given MRU+MFU I'll have to back off from my claims. Under the (incorrect) assumption of LRU+MFU I felt the behavior was unexpected, even with a small cache size. Given MRU+MFU and without knowing further details right now, I accept that the ARC may fundamentally need a bigger cache size in relation to the working set in order to be effective in the way I am using it here. I was basing my expectations on LRU-style behavior. Thanks! -- / Peter Schuller ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
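For those who want to watch the MRU/MFU split directly, the quantities in the formula above can be read from the arcstats kstat on Solaris/OpenSolaris; a sketch (exact invocation from memory):

    kstat -p zfs:0:arcstats:size    # current ARC size
    kstat -p zfs:0:arcstats:c       # target size (c)
    kstat -p zfs:0:arcstats:p       # target MRU size (p); the MFU target is c - p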
Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()
fsync() is, indeed, expensive. Lots of calls to fsync() that are not necessary for correct application operation EXCEPT as a workaround for lame filesystem re-ordering are a sure way to kill performance. IMO the fundamental problem is that the only way to achieve a write barrier is fsync() (disregarding direct I/O etc). Again, I would just like an fbarrier(), as I've mentioned on the list previously. It seems to me that if this were just adopted by some operating systems and applications could start using it, things would sort themselves out once file systems/block device layers start actually implementing the possible optimizations (instead of the naive fbarrier() = fsync()). As was noted in the previous thread on this topic, ZFS effectively has an implicit fbarrier() in between each write. Imagine now if all the applications out there were automatically massively faster on ZFS... but this won't happen until operating systems start exposing the necessary interface. What does one need to do to get something happening here? Other than whine on mailing lists... -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpZpZZfcWwR3.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()
Uh, I should probably clarify some things (I was too quick to hit send): IMO the fundamental problem is that the only way to achieve a write barrier is fsync() (disregarding direct I/O etc). Again I would just like an fbarrier() as I've mentioned on the list previously. Of course if fbarrier() is analogous to fsync(), this does not actually address the particular problem which is the main topic of this thread, since there the fbarrier() would presumably apply only to I/O within that file. This particular case would only be helped if the fbarrier() were global, or at least extended further than the particular file. Fundamentally, I think a useful observation is that the only time you ever care about persistence is when you make a contract with an external party outside of your black box of I/O. Typical examples are database commits and mail server queues. Anything within the black box is only concerned with consistency. In this particular case, the fsync()/fbarrier() operate on the black box of the file, with the directory being an external party. The rename() operation on the directory entry constitutes an operation which depends on the state of the individual file black box, thus constituting an external dependency and thus requiring persistence. The question is whether it is necessarily a good idea to make the black box be the entire file system. If it is, a lot of things would be much, much easier. On the other hand, it also makes optimization more difficult in many cases. For example, the latency of persisting 8 kB of data could be very, very significant if there are large amounts of bulk I/O happening in the same file system. So I definitely see the motivation behind having persistence guarantees be non-global. Perhaps it boils down to the files+directory model not necessarily being the best one in all cases. Perhaps one would like to define subtrees which have global fsync()/fbarrier() type semantics within each respective subtree. On the other hand, that sounds a lot like a ZFS file system, other than the fact that ZFS file system creation is not something which is exposed to the application programmer. How about having file-system-global barrier/persistence semantics, but having a well-defined API for creating child file systems rooted at any point in a hierarchy? It would allow global semantics and what that entails, while allowing the bulk I/O happening in your 1 TB PostgreSQL database to be segregated, in terms of performance impact, from your KDE settings file system. What does one need to do to get something happening here? Other than whine on mailing lists... And that came off much more rude than intended. Clearly it's not an implementation-effort issue, since the naive fbarrier() is basically just calling fsync(). However, I get the feeling there is little motivation in the operating system community for addressing these concerns, for whatever reason (IIRC it was only recently that some write barrier/write caching issues started being seriously discussed in the Linux kernel community, for example). -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpMWdfQtjIuW.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
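To make the rename() example concrete, this is the update-via-rename pattern being discussed, sketched in shell (file names are made up; a real application would fsync(2) just the temporary file, while from the shell the only available hammer is a global sync(8), which is exactly the kind of overkill an fbarrier() could avoid):

    tmp="settings.new.$$"
    printf '%s\n' "new contents" > "$tmp"   # write the complete new version
    sync                                    # make sure the data precedes the rename on disk
    mv "$tmp" settings                      # rename(2): atomically replace the old version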
Re: [zfs-discuss] ZFS: unreliable for professional usage?
However, I just want to state a warning: ZFS is far from being what it promises, and so far, from my sum of experience, I can't recommend using zfs on a professional system at all. Or, perhaps, you've given ZFS disks which are so broken that they are really unusable; it is USB, after all. I had a cheap-o USB enclosure that definitely did ignore such commands. On every txg commit I'd get a warning in dmesg (this was on FreeBSD) about the device not implementing the relevant SCSI command. This would of course affect filesystems other than ZFS as well. What is worse, I was unable to completely disable write caching either, because that, too, did not actually propagate to the underlying device when attempted. (I could not say for certain whether this was fundamental to the device or in combination with a FreeBSD issue.) -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpLGAnqq8jAy.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS: unreliable for professional usage?
YES! I recently discovered that VirtualBox apparently defaults to ignoring flushes, which would, if true, introduce a failure mode generally absent from real hardware (and eventually result in consistency problems quite unexpected to the user who carefully configured her journaled filesystem or transactional RDBMS!) I recommend everyone to be extremely hesitant to assume that any particular storage setup actually honors write barriers and cache flushes. This is a recommendation I would give even when you purchase non-cheap battery-backed hardware RAID controllers (I won't mention any names or details to avoid bashing, as I'm sure it's not specific to the particular vendor I had problems with most recently). You need the underlying device to do the right thing, the driver to do the right thing, and the operating system in general to do the right thing (which includes the file system, the block device layer if any, etc. - for example, if you use md on Linux with RAID5/6 you're toast). So again I cannot stress enough - do not assume things behave in a non-broken fashion with respect to write barriers and flushes. I can't speak to expensive integrated hardware solutions; I HOPE, though at this point my level of paranoia does not allow me to assume, that if you buy boxed systems from companies like Sun/HP/etc you get decent stuff. But I can definitely say that paying non-trivial amounts of money for hardware is not a guarantee that you won't get completely broken behavior. <speculation> I think it boils down to the fact that 99% of customers that aren't doing integration of the individual components into overall packages probably don't care/understand/bother with it, so as long as the benchmarks say it's fast, they sell. </speculation> -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpHedvKWqrb4.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS: unreliable for professional usage?
And again: why should a two-week-old Seagate HDD suddenly be damaged, if there was no shock, hit or any other event like that? I have no information about your particular situation, but you have to remember that ZFS uncovers problems that otherwise go unnoticed. Just personally, on my private hardware (meaning a very limited set), I have seen silent corruption issues several times. The most recent one I discovered almost immediately because of ZFS. If it weren't for ZFS, I would have been highly likely to have transferred my entire system without noticing and to suffer weird problems a couple of weeks later. While I don't know what is going on in your case, blaming a problem on the introduction of a piece of software/hardware/procedure without identifying a causal relationship is a common mistake to make. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgp5cozV6UwEf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS: unreliable for professional usage?
on a UFS or reiserfs such errors could be corrected. In general, UFS has zero capability to actually fix real corruption in any reliable way. What you normally do with fsck is repair *expected* inconsistencies that the file system was *designed* to produce in the event of e.g. a sudden reboot or a crash. This is entirely different from repairing arbitrary corruption. If ZFS says that a file has a checksum error, that can very well be because there is a bug in ZFS. But it can also be the case that there *is* actual on-disk (or in-transit) corruption that ZFS has detected, giving I/O errors back to an application instead of producing bad data. Now, it is probably entirely true that once you *do* have broken hardware, or there is some other reason for corruption beyond what you can design for, ZFS is probably less mature than traditional file systems in terms of the availability of tools and procedures to salvage whatever might actually be salvageable. That is a valid criticism. But you *have* to distinguish between repairing fully expected inconsistencies, specifically anticipated as part of regular operation in the event of a crash/power outage, and problems arising from misbehaving hardware or bugs in software. ZFS cannot magically overcome such problems, nor can UFS/reiserfs/xfs/whatever else. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpOIkWS4ZEdB.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS: unreliable for professional usage?
ps This is a recommendation I would give even when you purchase ps non-cheap battery backed hardware RAID controllers (I won't ps mention any names or details to avoid bashing as I'm sure it's ps not specific to the particular vendor I had problems with most ps recently). This again? If you're sure the device is broken, then I think others would like to know it, even if all devices are broken. The problem is that I even had help from the vendor in question, and it was not for me personally but for a company, and I don't want to use information obtained that way to do any public bashing. But I have no particular indication that there is any problem with the vendor in general; it was a combination of choices made by Linux kernel developers and the behavior of the RAID controller. My interpretation was that no one was there looking at the big picture, and the end result was that if you followed the instructions specifically given by the vendor, you would have a setup whereby you would lose correctness whenever the BBU was overheated/broken/disabled. The alternative was to get completely piss-poor performance by not being able to take advantage of the battery-backed nature of the cache at all (which defeats most of the purpose of having the controller, if you use it in any kind of transactional database environment or similar). but, fine. Anyway, how did you determine the device was broken? By performing timing tests as mentioned in the other post that you answered separately, and after detecting the problem, confirming the status with respect to caching at the different levels as claimed by the administrative tool for the controller. While timing tests cannot conclusively prove correct behavior, they can definitely prove incorrect behavior in cases where your timings are simply theoretically impossible given the physical nature of the underlying drives. At least you can tell us that much without fear of retaliation (whether baseless or founded), and maybe others can use the same test to independently discover what you did, which would be both fair and safe for you. The test was trivial; in my case a ~10 line Python script or something along those lines. Perhaps I should just go ahead and release something which non-programmers can easily run and draw conclusions from (a rough sketch of the idea is at the end of this message). This is the real problem as I see it---a bunch of FUD, without any actual resolution beyond ``it's working, I _think_, and in any case the random beatings have stopped so D'OH-NT TOUCH *ANY*THING! THAR BE DEMONZ IN THE BOWELS O DIS DISK SHELF!'' I'd love to go on a public rant, because I think the whole situation was a perfect example of a case where a single competent person who actually cares about correctness could have pinpointed this problem trivially. But instead you have different camps doing their own stuff and not considering the big picture. If anyone asks questions, they get no actual information, but a huge amount of blame heaped on the sysadmin. Your post is a great example of the typical way this problem is handled because it does both: deny information and blame the sysadmin. Though I'm really picking on you way too much here. Hopefully everyone's starting to agree, though, we do need a real way out of this mess! I'm not quite sure what you're referring to here. I'm not blaming any sysadmin. I was trying to point out *TO* sysadmins, to help them, that I recommend being paranoid about correctness. If you mean the original poster in the thread having issues, I am not blaming him *at all* in the post you responded to.
It was strictly meant as a comment in response to the poster who noted that he discovered, to his surprise, the problems with VirtualBox. I wanted to make the point that while I completely understand his surprise, I have come to expect that these things are broken by default (regardless of whether you're using virtualbox or not, or vendor X or Y etc), and that care should be taken if you do want to have correctness when it comes to write barriers and/or honoring fsync(). However, that said, as I stated in another post I wouldn't be surprised if it turns out the USB device was ignoring sync commands. But I have no idea what the case was for the original poster, nor have I even followed the thread in detail enough to know if that would even be a possible explanation for his problems. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpjMmPNbbRnF.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
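For anyone who wants to try the kind of timing test referred to above without writing code: a rough shell equivalent (the original was a small Python script) might look like the following, assuming GNU coreutils dd for oflag=sync and a throwaway file on the pool in question. Each O_SYNC write should have to reach stable storage before the next one starts, so on a single 7200 RPM disk you should see something on the order of 100-200 writes per second; thousands per second means something in the stack is acknowledging writes it has not actually persisted.

    time dd if=/dev/zero of=/tank/flushtest bs=512 count=1000 oflag=sync   # /tank/flushtest is a made-up path
    rm /tank/flushtest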
Re: [zfs-discuss] Does your device honor write barriers?
mortals ;) -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpGm6lQE1f12.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does your device honor write barriers?
is such that you cannot trust the hardware/software involved, I suppose there is no other way out other than testing and adjusting until it works. Here is another bit of FUD to worry about: the common advice for the lost SAN pools is, use multi-vdev pools. Well, that creepily matches just the scenario I described: if you need to make a write barrier that's valid across devices, the only way to do it is with the SYNCHRONIZE CACHE persistence command, because you need a reply from Device 1 before you can release writes behind the barrier to Device 2. You cannot perform that optimisation I described in the last paragraph of pushing the barrier past the high-latency link down into the device, because your initiator is the only thing these two devices have in common. Keeping the two disks in sync would in effect force the initiator to interpret the SYNC command as in my second example. However if you have just one device, you could write the filesystem to use this hypothetical barrier command instead of the persistence command for higher performance, maybe significantly higher on a high-latency SAN. I don't guess that's actually what's going on though, just an interesting creepy speculation. This would be another case where battery-backed (local to the machine) NVRAM fundamentally helps even in a situation where you are only concerned with the barrier, since there is no problem having a battery-backed controller sort out the disk-local problems itself by whatever combination of syncs/barriers, while giving instant barrier support (by effectively implementing sync-and-wait) to the operating system. (Referring now to individual drives being battery-backed, not to using a hardware raid volume.) -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org pgpkpvy9lmPS6.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Verify files' checksums
1) When I notice an error in a file that I've copied from a ZFS disk I want to know whether that error is also in the original file on my ZFS disk or if it's only in the copy. This was already addressed, but let me do so slightly differently: one of the major points of ZFS checksumming is that, in the absence of software bugs or hardware memory corruption issues when the file is read on the host, successfully reading a file is supposed to mean that you got the correct version of the file (either from physical disk or from cache, having previously been read from physical disk). A scrub is still required if you want to make sure the file is okay *ON DISK*, unless you can satisfy yourself that no relevant data is cached somewhere (or unless someone can inform me of a way to nuke a particular file and related resources from the cache). Up to now I've been storing md5sums for all files, but keeping the files and their md5sums synchronized is a burden I could do without. FWIW, I wanted to mention here that if you care a lot about this, I'd recommend something like par2[1] instead. It uses forward error correction[2], allowing you to not only detect corruption, but also correct it. You can choose your desired level of redundancy expressed as a percentage of the file size (example below). [1] http://en.wikipedia.org/wiki/Parchive [2] http://en.wikipedia.org/wiki/Forward_error_correction -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org pgpTBfuHa8mpF.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
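Typical usage with par2cmdline looks something like this (file names made up; -r sets the redundancy percentage):

    par2 create -r10 important.dat.par2 important.dat   # generate ~10% recovery data
    par2 verify important.dat.par2                      # later: detect corruption
    par2 repair important.dat.par2                      # and repair it from the recovery blocks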
Re: [zfs-discuss] ZFS over multiple iSCSI targets
Exporting them as one huge iSCSI volume is good if you're paranoid about data loss. You can use raid5 or 6 on the Linux servers, and then mirror those large volumes with ZFS. The downside is that it's much harder to add storage. I don't know if iSCSI volumes can be expanded, so you might have to break the mirror, create a larger iSCSI volume and resync all your data with that approach. Just be careful with respect to write barriers. The software raid in Linux does not support them with raid5/raid6, so you lose the correctness aspect of ZFS that you otherwise get even without hw raid controllers. (Speaking of this, can someone speak to the general state of affairs with iSCSI with respect to write barriers? I assume Solaris does it correctly; what about the bsd/linux stuff? Can one trust that the iSCSI targets correctly implement cache flushing/write barriers?) -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org pgpi9RuPOuyGm.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box
If you are willing to go cheap you can get something that holds 8 drives for $70: buy a standard tower case with five internal 3.5" bays ($50), and one of these enclosures that fit in two 5.25" bays but give you three 3.5" bays ($20). I have one of these: http://www.gtek.se/index.php?mode=itemid=2454 That 2798 SEK price is about $450, and you can fit up to 30 3.5" drives in the extreme case. This is true even when using the Supermicro SATA hotswap bays that fit 5 drives in 3x5.25". You can fit 6 of these in total, meaning 30 drives. Of course the cost of these bays is added (~$60-$70 I believe for the 5-bay Supermicro; the Lian Li stuff is cheaper, but not hotswap and such). -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Q : change disks to get bigger pool
I don't believe this is true. Online replacement is smart enough to pick this up. Where you have to re-import is when you change the size of a LUN without doing a zpool replace. I see. I have never had the hardware necessary to actually try hotswaps, but I remember some people complaining here and on FreeBSD lists about not seeing the added space. In my case I always rebooted anyway, so I could never tell the difference. I stand corrected. Thanks! -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Q : change disks to get bigger pool
So will the pool get bigger just by replacing all 4 disks one-by-one? Yes, but a re-import (either by export/import or by reboot) is necessary before the new space will be usable. And if it will get larger, how should this be done - fail the disks one-by-one, or...? Use 'zpool replace' (example below). -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
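Roughly like this, with made-up pool and device names (replace one disk at a time and let each resilver finish before starting the next):

    zpool replace tank c0t1d0 c0t5d0   # or replace in place after physically swapping the disk
    zpool status tank                  # wait until the resilver has completed
    # ...repeat for the remaining disks, then re-import so the extra space shows up:
    zpool export tank
    zpool import tank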
Re: [zfs-discuss] ZFS via Virtualized Solaris?
From what I read, one of the main things about ZFS is Don't trust the underlying hardware. If this is the case, could I run Solaris under VirtualBox or under some other emulated environment and still get the benefits of ZFS such as end to end data integrity? You could probably answer that question by changing the phrase to Don't trust the underlying virtual hardware! ZFS doesn't care if the storage is virtualised or not. But worth noting is that, as with for example hardware RAID, if you intend to take advantage of the self-healing properties of ZFS with multiple disks, you must expose the individual disks to your mirror/raidz/raidz2 individually through the virtualization environment and use them in your pool. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] JBOD performance
I was just wondering whether maybe there's a problem with just one disk... No, this is something I have observed on at least four different systems, with vastly varying hardware. Probably just the effects of the known problem. Thanks, -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] JBOD performance
Sequential writing problem with process throttling - there's an open bug for it for quite a while. Try to lower txg_time to 1s - should help a little bit. Yeah, my post was mostly to emphasize that on commodity hardware raidz2 does not even come close to being a CPU bottleneck. It wasn't a poke at the streaming performance. Very interesting to hear there's a bug open for it though. Can you also post iostat -xnz 1 while you're doing dd? and zpool status This was FreeBSD, but I can provide iostat -x if you still want it for some reason. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] JBOD performance
Use a faster processor or change to a mirrored configuration. raidz2 can become processor bound in the Reed-Solomon calculations for the 2nd parity set. You should be able to see this in mpstat, and to a coarser grain in vmstat. Hmm. Is the OP's hardware *that* slow? (I don't know enough about the Sun hardware models.) I have a 5-disk raidz2 (cheap SATA) here on my workstation, which is an X2 3800+ (i.e., one of the earlier AMD dual-core offerings). Here's me dd:ing to a file on ZFS on FreeBSD running on that hardware:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
promraid     741G   387G      0    380      0  47.2M
promraid     741G   387G      0    336      0  41.8M
promraid     741G   387G      0    424    510  51.0M
promraid     741G   387G      0    441      0  54.5M
promraid     741G   387G      0    514      0  19.2M
promraid     741G   387G     34    192  4.12M  24.1M
promraid     741G   387G      0    341      0  42.7M
promraid     741G   387G      0    361      0  45.2M
promraid     741G   387G      0    350      0  43.9M
promraid     741G   387G      0    370      0  46.3M
promraid     741G   387G      1    423   134K  51.7M
promraid     742G   386G     22    329  2.39M  10.3M
promraid     742G   386G     28    214  3.49M  26.8M
promraid     742G   386G      0    347      0  43.5M
promraid     742G   386G      0    349      0  43.7M
promraid     742G   386G      0    354      0  44.3M
promraid     742G   386G      0    365      0  45.7M
promraid     742G   386G      2    460  7.49K  55.5M

At this point the bottleneck looks architectural rather than CPU. None of the cores is saturated, and the CPU usage of the ZFS kernel threads is pretty low. I say architectural because writes to the underlying devices are not sustained; they drop to almost zero for certain periods (this is more visible in iostat -x than it is in the zpool statistics). What I think is happening is that ZFS is too late to evict data from the cache, thus blocking the writing process. Once a transaction group with a bunch of data gets committed, the application unblocks, but presumably ZFS waits for a little while before resuming writes. Note that this is also being run on plain hardware; it's not even PCI Express. During throughput peaks, but not constantly, the bottleneck is probably the PCI bus. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on Freebsd 7.0
  NAME        STATE     READ WRITE CKSUM
  fatty       DEGRADED     0     0 3.71K
    raidz2    DEGRADED     0     0 3.71K
      da0     ONLINE       0     0     0
      da1     ONLINE       0     0     0
      da2     ONLINE       0     0     0
      da3     ONLINE       0     0   300
      da4     ONLINE       0     0     0
      da5     ONLINE       0     0     0
      da6     ONLINE       0     0   253
      da7     ONLINE       0     0     0
      da8     ONLINE       0     0     0
      spare   DEGRADED     0     0     0
        da9   OFFLINE      0     0     0
        da11  ONLINE       0     0     0
      da10    ONLINE       0     0     0
  spares
    da11      INUSE     currently in use

errors: 801 data errors, use '-v' for a list

After I detach the spare da11 and bring da9 back online all the errors go away. Theory: Suppose da3 and da6 are either bad drives, have cabling issues, or are on a controller suffering corruption (different from the other drives). If you now were to replace da9 by da11, the resilver operation would be reading from these drives, thus triggering checksum issues. Once you bring da9 back in, it is either entirely up to date or very close to up to date, so the amount of I/O required to resilver it is very small and may not trigger problems. If this theory is correct, a scrub (zpool scrub fatty) should encounter checksum errors on da3 and da6. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Brain damage seems a bit of an alarmist label. While you're certainly right that for a given block we do need to access all disks in the given stripe, it seems like a rather quaint argument: aren't most environments that matter trying to avoid waiting for the disk at all? Intelligent prefetch and large caches -- I'd argue -- are far more important for performance these days. The concurrent small-I/O problem is fundamental though. If you have an application where you care only about random concurrent reads, for example, you would not want to use raidz/raidz2 currently. No amount of smartness in the application gets around this. It *is* a relevant shortcoming of raidz/raidz2 compared to raid5/raid6, even if in many cases it is not significant. If disk space is not an issue, striping across mirrors will be okay for random seeks. But if you also care about disk space, it's a show stopper unless you can throw money at the problem. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Response to phantom dd-b post
proper UPS and a power outage should never happen unless you suck. (This has actually been advocated to me. Seriously.) [2] Because of [1] and because of course you only run stable software that is well tested and will never be buggy. (This has been advocated. Seriously.) [3] Because of [1]. [4] Because of [1], [2] and [3]. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS
Your best bet is to call Tech Support and not Sales. I've found LSI tech support to be very responsive to individual customers. Thanks. I'll try them. I eventually noticed you could actually get the number for them under the LSI offices category of their find-your-contact web form system, which otherwise looked like a re-seller inventory. I recommend the SuperMicro card - but that is PCI-X and I think you're looking for PCI-Express? PCI is okay and nice, PCI-Express is nicer. PCI-X I don't want, since it is only semi-compatible with PCI. E.g. the Marvell I have now works in one machine, not in another. works well with ZFS (SATA or SAS drives). The newer cards are less expensive - but it's not clear from the LSI website if they support JBOD operation or if you can form a mirror or stripe using only one drive and present it to ZFS as a single drive. I am okay with a one-disk mirror/stripe in the worst case, as long as cache flushes and such get passed through. I would definitely prefer JBOD though, since single-disk virtual volumes tend to cause some additional headaches (like having two levels of volume management). Please let us know what you find out... If I get anything confirmed from LSI I'll post an update. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: This is a digitally signed message part. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Is there a specific reason why you need to do the caching at the DB level instead of the file system? I'm really curious as I've got conflicting data on why people do this. If I get more data on real reasons why we shouldn't cache at the file system, then this could get bumped up in my priority queue. FWIW, a MySQL database was recently moved to a FreeBSD system with ZFS. Performance ended up sucking because, for some reason, data did not make it into the cache in a predictable fashion (a simple case: repeated queries were not cached, so for example a very common query, even when executed repeatedly on an idle system, would take more than 1 minute instead of the 0.10 seconds or so it takes when cached). I ended up convincing the person running the DB to switch from MyISAM (which does not seem to support DB-level caching of data, other than of indexes) to InnoDB, thus allowing use of the InnoDB buffer cache. I don't know why it wasn't cached by ZFS/ARC to begin with (the size of the ARC cache was definitely large enough - ~800 MB, and I know the working set for this query was below 300 MB). Perhaps it has to do with the ARC trying to be smart and avoiding flushing the cache with useless data? I am not read up on the details of the ARC. But in this particular case it was clear that a simple LRU would have been much more useful - unless there was some other problem related to my setup or FreeBSD integration that somehow broke proper caching. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org pgpmGHbRPWivC.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
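For reference, the kind of change involved is roughly this (database/table names and the buffer pool size are made up and would need to be fitted to the actual working set):

    mysql mydb -e 'ALTER TABLE mytable ENGINE=InnoDB;'   # convert each hot table
    # and in my.cnf, under [mysqld]:
    #   innodb_buffer_pool_size = 512M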
Re: [zfs-discuss] Survivability of zfs root
Now, what if that system had been using ZFS root? I have a hardware failure, I replace the raid card, the devid of the boot device changes. I am not sure about Solaris, but on FreeBSD I always use glabel(8)-labeled devices in my ZFS pools, making them entirely location-independent. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org pgp4eTccU5E8q.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
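Roughly like this (device and label names made up); the label lives on the disk itself, so the pool no longer cares where the controller happens to enumerate it:

    glabel label -v disk0 /dev/ad4
    glabel label -v disk1 /dev/ad6
    zpool create tank mirror label/disk0 label/disk1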
Re: [zfs-discuss] reccomended disk configuration
I also wanted to test a recovery of my pool, so I took my two-disk raidz pool onto a friend's FreeBSD box. It seems both systems use zfs version 6, but the import failed. I noticed on the boot logs: GEOM: ad6: corrupt or invalid GPT detected. GEOM: ad6: GPT rejected -- may not be recoverable. Is that a Solaris or FreeBSD problem, do you think? This has to do with the GPT (http://en.wikipedia.org/wiki/GUID_Partition_Table) support rather than ZFS. IIRC the GPTs written by Solaris are valid, just not recognized properly by FreeBSD (but I am out of date and don't remember the source of this information). AFAIK the ZFS pools themselves are fully portable. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org pgpFiIJzXKiig.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] It is that time again... ZFS + Firewire/USB - and a specific enclosure
However, I have been subscribed to the ZFS lists for some time now and done a Google check-up for topics relating to this for some time, and the jury still seems to be out on whether or not this would be a good idea. My only input is that when I used cheap-ass USB enclosures (for single disks), flush-cache commands never survived. I would get errors about unimplemented SCSI commands upon every ZFS transaction group commit. So it's a no-go for reliability (unless you can turn write caching off through the USB conversion - dunno, but even then you will instead suffer the performance penalty). I don't know whether this was because of USB, or because it was cheap-ass crap, or something else. Perhaps Firewire works better, especially considering it was the de facto standard in Macs for a long time. The other issue is that I never seem to get decent performance out of USB, and in general have experienced USB enclosures and even USB hubs to be very buggy (randomly breaking, particularly with I/O happening on multiple devices at the same time, and then requiring power cycling to start working again). I started buying USB drives at one point to get around the problems with fitting many disks in consumer products (=stuff that doesn't cost $100), but eventually just gave up because of broken drive controllers (hello WD) and broken USB hardware. And don't think that a retailer will consider drives broken for the purpose of warranty when they exhibit identical problems on both Solaris and FreeBSD (with AFAIK independent USB implementations), but work in Windows... If you have luck with that, then please post some public info, as I am sure I would not be the only one to be interested in it. Wonder what those 8-ways cost new... -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS Scalability/performance
FreeBSD plays it safe too. It's just that UFS, and other file systems on FreeBSD, understand write caches and flush at appropriate times. Do you have something to cite w.r.t. UFS here? Because as far as I know, that is not correct. FreeBSD shipped with write caching turned off by default for a while for this reason, but then changed it IIRC due to the hordes of people complaining about performance. I also have personal experience of corruption-after-powerfail that indicates otherwise. I also don't see any complaints about cache flushing on USB drives with UFS, which I did see with ZFS every five seconds when it wanted to flush the cache (which fails since the SCSI-USB bridge, or probably the USB mass storage stuff itself, does not support it). Also, given the design of UFS and the need for synchronous writes on updates, I would be surprised, strictly based on performance observations, if it actually did flush caches. The ability to get decent performance *AND* reliability on cheap disks is one of the major reasons why I love ZFS :) -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] data gone?
I tried to add a third disk to the raidz array The third disk didn't get added to the raidz array, it was added to the pool, but 'parallel' to the raidz This is because it is not currently possible to add disks to an existing raidz/raidz2 vdev. Adding storage is typically done by adding an additional raidz/raidz2 vdev that the pool then stripes across (example below). I believe zpool should have warned you about trying to add a non-redundant component alongside the redundant raidz, requiring you to force (-f) the addition. -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org signature.asc Description: OpenPGP digital signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
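A sketch with made-up device names; growing a raidz pool means adding a whole new raidz vdev for the pool to stripe across, not attaching a single disk to the existing raidz:

    zpool add tank raidz da4 da5 da6    # adds a second raidz vdev alongside the first
    # whereas this is the mistake described above, which zpool should warn about and require -f for:
    # zpool add -f tank da4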
Re: [zfs-discuss] Re: Re: Re: ZFS consistency guarantee
1. ZFS atomic operation that commits data.
2. Writes come into the app.
3. The db is put in hot-backup mode.
4. Snapshot taken on storage.
5. ZFS atomic operation that commits data.

So if I do a snap restore, ZFS might revert to point 1, but from the db perspective it is inconsistent and we would need to do a recovery... correct? Right. So you'll want to synchronize your snapshots with database consistency. Just like doing backups. I have gotten the feeling that everyone is misunderstanding everyone else in this thread ;) My understanding is that a zfs snapshot that can be proven to have happened subsequent to a particular write() (or link(), etc.) is guaranteed to contain the data that was written. Anything else would massively decrease the usefulness of snapshots. Is this incorrect? If so, feel free to ignore the remainder of this E-Mail. If not, then I don't see why the filesystem would be reverted to (1). It should in fact be guaranteed to revert to (4) (unless the creation of the snapshot is itself not guaranteed to be persistent without an explicit global sync by the administrator - but I doubt this is the case?). Regardless of the details of snapshots, I think the point that needs making to the OP is that, regardless of filesystem issues, the data as written to that filesystem by the application must always be consistent from the perspective of the application, and that a snapshot just gives you a snapshot of a filesystem for which any read will return whatever it would have done exactly at the point of the snapshot. If the application has not written the data, it will not be part of the snapshot. Thus if the application has writes pending that are needed for consistency, those writes must complete prior to snapshotting (a sketch of that coordination is below). The syncing, which I assume refers to fsync() and/or the sync command, is about ensuring that the view of the filesystem (or usually a subset of it) as seen by applications is actually committed to persistent storage. This is done either to guarantee that some application-level data is committed and will remain in the face of a crash (e.g. a banking application does an SQL COMMIT), or as an overkill way of ensuring that some I/O operation B physically happens after some I/O operation A (such that in the event of a crash, B will never appear on disk if A does not also appear), such as a database maintaining internal transactional consistency. Now, assuming that snapshots work in the way I assume and ask about above, the use of a zfs snapshot at a point in time when the application has written consistent data to the filesystem is sufficient to guarantee consistency in the event of a crash. Essentially the zfs snapshot can be used to achieve the effect of fsync(), with the added benefit of being able to administratively roll back to the previous version rather than just guaranteeing that there is some consistent state to return back to. (Incidentally, since, according to a post here on the list in response to a related question I had, ZFS already guarantees ordering of writes, there are presumably some pretty significant performance improvements to be had if a database were made aware of this and allowed a weaker form of COMMIT where you drop the persistence requirement but keep the consistency requirement.)
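As a sketch of that coordination (dataset name made up, and db_quiesce/db_resume standing in for whatever the particular database provides - hot-backup mode, a checkpoint, pg_start_backup(), etc.):

    db_quiesce                          # get the database to a consistent on-disk state and hold it there
    zfs snapshot tank/db@hotbackup-1    # atomic, essentially instantaneous point-in-time snapshot
    db_resume                           # let normal writes continue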
-- / Peter Schuller
Re: [zfs-discuss] Re: Making 'zfs destroy' safer
On the other hand, personally I just don't see the need for this since the @ char isn't special to the shell, so I don't see where the original problem came from.

I never actually *had* a problem, I am just nervous about it. And yes, @ is not special for classical shells, but it's still more special than alphanumerics or '/', and probably more likely to *be* special in some languages/alternate shells.

Then there is the seemingly trivial issue of the physical keyboard layout. The most common layout will tend to make you use the right shift key in order to type the @, in a particular sequence such that with a slight slip of the finger there - *kaboom* - you have lost your (and/or everybody else's) data by accidentally hitting enter instead of shift. One can of course protect against this by writing commands backwards and such, which is what I do for cases like this along with SQL DELETE statements, but to me it just feels unnecessarily dangerous.

One thing that might help here, when not running as root or as a user with the ZFS File System Management RBAC profile, is user delegation. This will allow you to run the script as a user that can only do certain operations on filesystems that they own or have been delegated specific access to operate on.

On the other hand, a very minor modification to the command line tool gets you a pretty significant payoff without complicating things. It would affect the safety of the out-of-the-box tool, regardless of the local policy for privilege delegation.

-- / Peter Schuller
Re: [zfs-discuss] Re: Re: Making 'zfs destroy' safer
I would much prefer to do

  for snap in $(zfs list -t snapshot -r foo/bar); do
      zfs destroy -t snapshot $snap
  done

than not have the -t, especially the further away the destroy is from the generation of the list. The extra -t would be belt and braces, but that is how I like my data protected.

Especially if we imagine that someone further down the line decides to slightly modify the format of the 'zfs list -t snapshot' output - such as by having it give a hierarchical view with the roots being the filesystem path... This is of course a normal problem with shell scripting (unless the zfs command is documented to guarantee backward-compatible output?), but in cases like this it really, really becomes critical. A somewhat more robust variant using what is available today follows below.

-- / Peter Schuller
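For the archives, a sketch of what I would do with the tools as they exist today (there is no -t flag on destroy; that is the proposal being discussed). -H and -o name make the list output script-friendly, and the grep is a belt-and-braces check that only names containing '@' ever reach destroy:

  zfs list -H -o name -t snapshot -r foo/bar | grep '@' |
  while read snap; do
      zfs destroy "$snap"
  done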
Re: [zfs-discuss] Making 'zfs destroy' safer
Rather than rehash this, again, from scratch, refer to a previous rehashing: http://www.opensolaris.org/jive/thread.jspa?messageID=15363

I agree that adding a -f requirement and/or an interactive prompt is not a good solution. As has already been pointed out, my suggestion is different. zfs destroy is very general. Often, generality is good (e.g. in programming languages). But when dealing with something this dangerous and by its very nature likely to be used on live production data, either manually or in scripts (that are not subject to a release engineering process), I think it is useful to make it possible to be more specific, such that the possible repercussions of a mistake are limited.

As an analogy - would you want rm to automatically do rm -rf if invoked on a directory? Most probably would not. The general solution would be for rm to just do what you tell it to - remove whatever you are pointing it at. But I think most would agree that things are safer the way they work now. That said, I am not suggesting removing existing functionality from destroy, but providing a way to be more specific about your intended actions in cases where you want to destroy snapshots or clones.

-- / Peter Schuller
Re: [zfs-discuss] Making 'zfs destroy' safer
Apparently (and I'm not sure where this is documented), you can 'rmdir' a snapshot to remove it (in some cases).

OK. That would be useful, though I also don't like that it breaks standard rmdir semantics. In any case it does not work in my case - but that was on FreeBSD.

-- / Peter Schuller
[zfs-discuss] Making 'zfs destroy' safer
Hello, with the advent of clones and snapshots, one will of course start creating them. Which also means destroying them. Am I the only one who is *extremely* nervous about doing zfs destroy some/fs@snapshot? This goes both for doing it manually and automatically in a script. I am very paranoid about this, especially because the @ sign might conceivably be incorrectly interpreted by some layer of scripting, being a non-alphanumeric character and highly atypical for filenames/paths.

What about having dedicated commands destroysnapshot, destroyclone, or remove (a less dangerous variant of destroy) that will never do anything but remove snapshots or clones? Alternatively, something along the lines of zfs destroy --nofs or zfs destroy --safe.

I realize this is borderline in the same territory as special-casing rm -rf / and similar, which is generally not considered a good idea. But somehow the snapshot situation feels a lot more risky.

-- / Peter Schuller
Re: [zfs-discuss] Making 'zfs destroy' safer
What about having dedicated commands destroysnapshot, destroyclone, or remove (a less dangerous variant of destroy) that will never do anything but remove snapshots or clones? Alternatively, something along the lines of zfs destroy --nofs or zfs destroy --safe.

Another option is to allow something along the lines of:

  zfs destroy snapshot:/path/to/fs@snap

where the use of the snapshot: prefix would guarantee that non-snapshots are not affected. In the meantime, the same guarantee can be approximated with a trivial wrapper; see the sketch below.

-- / Peter Schuller
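A minimal sketch of such a wrapper as it could be done today (a hypothetical script, not an existing zfs option): it simply refuses to pass anything without an '@' in it on to destroy.

  #!/bin/sh
  # destroysnapshot: only ever destroy snapshots, never filesystems or clones
  case "$1" in
      *@*) exec zfs destroy "$1" ;;
      *)   echo "refusing: '$1' does not look like a snapshot" >&2
           exit 1 ;;
  esac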
Re: [zfs-discuss] Implementing fbarrier() on ZFS
That is interesting. Could this account for disproportionate kernel CPU usage for applications that perform I/O one byte at a time, as compared to other filesystems? (Never mind that the application shouldn't do that to begin with.)

I just quickly measured this (overwriting files in chunks of various sizes); this is a software benchmark (I/O is a non-factor):

  CHUNK   ZFS vs UFS
  1B      4X slower
  1K      2X slower
  8K      25% slower
  32K     equal
  64K     30% faster

Quick and dirty, but I think it paints a picture. I can't really answer your question though.

I should probably have said "other filesystems on other platforms"; I did not really compare properly on the Solaris box. In this case it was actually BitTorrent (the official Python client) that was completely CPU-bound in kernel space, and tracing showed single-byte I/O. Regardless, the above stats are interesting and, I suppose, consistent with what one might expect from previous discussion on this list.

-- / Peter Schuller
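For anyone wanting to reproduce the chunk-size effect quickly, something along these lines should do (the path and sizes are arbitrary; the per-call overhead is the point, not throughput):

  # 8 MB written one byte at a time vs. in 8K chunks, on the filesystem under test
  time dd if=/dev/zero of=/tank/testfile bs=1 count=8388608
  time dd if=/dev/zero of=/tank/testfile bs=8k count=1024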
Re: [zfs-discuss] Re: Shrinking a zpool?
No 'home user' needs shrink.

I strongly disagree with this. The ability to shrink can be useful in many specific situations, but in the more general sense - and this in particular for home use - it allows you to plan much less rigidly. You can add/remove drives left and right at your leisure and won't work yourself into a corner with regard to drive sizes, lack of drive connectivity, or drive interfaces, as you can always perform an incremental migration/upgrade/downgrade.

Every professional datacenter needs shrink.

Perhaps at some level. At the level of having 1-20 semi-structured servers with 5-20 or so terabytes each, you probably don't need it all that much - even if it would be nice (speaking from experience).

-- / Peter Schuller
Re: [zfs-discuss] Meta data corruptions on ZFS.
This is expected because of the copy-on-write nature of ZFS. During truncate it is trying to allocate new disk blocks, probably to write the new metadata, and fails to find them.

I realize there is a fundamental issue with copy-on-write, but does this mean ZFS does not maintain some kind of reservation to guarantee that you can always remove data? If so, I would consider this a major issue for general-purpose use, and if nothing else it should most definitely be clearly documented. Accidentally filling up space is not at *all* uncommon in many situations, be it home use or medium-sized business use. Yes, you should avoid it, but shit (always) happens.

-- / Peter Schuller
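One workaround I would consider in the meantime, sketched with a made-up dataset name, is to park a reservation on an otherwise unused dataset so there is always some slack to give back when the pool fills up:

  # Set aside space that no other dataset can consume
  zfs create tank/slack
  zfs set reservation=1G tank/slack

  # If the pool fills up, release the slack so deletes/truncates have room
  # to allocate their copy-on-write metadata; re-set it afterwards
  zfs set reservation=none tank/slack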
Re: [zfs-discuss] Implementing fbarrier() on ZFS
That said, actually implementing the underlying mechanisms may not be worth the trouble. It is only a matter of time before disks have fast non-volatile memory like PRAM or MRAM, and then the need to do explicit cache management basically disappears.

I meant fbarrier() as a syscall exposed to userland, like fsync(), so that userland applications can achieve ordered semantics without synchronous writes. Whether or not ZFS in turn manages to eliminate synchronous writes by some feature of the underlying storage mechanism is a separate issue. But even if not, an fbarrier() exposes an asynchronous method of ensuring relative order of I/O operations to userland, which is often useful.

-- / Peter Schuller
Re: [zfs-discuss] Re: How much do we really want zpool remove?
The ability to shrink a pool by removing devices is the only reason my enterprise is not yet using ZFS, simply because it prevents us from easily migrating storage.

Being able to do this would be very high on my wishlist from the perspective of a home user. But also from the perspective of a more serious user (though I am not involved in using ZFS in such a case - not yet anyway...) it is most definitely a very nice thing to be able to do, in various situations. Examples of things I would love to be able to do with such a feature (the closest existing operations are sketched after this list):

* Easily convert between mirror/striping/raidz/raidz2 (no need to purchase twice the capacity for temporary storage during a conversion).
* Easily move storage between physical machines as needs change (assuming a situation where you want drives locally attached to the machines in question, and iSCSI and similar is not an option).
* Revert a stupid mistake: accidentally adding something to a pool that should not be there :)
* Easily - even live, under the right circumstances - temporarily evacuate a disk in order to e.g. perform drive testing if suspicious behavior is present without a known cause.
* If a drive starts going bad and I do not have a spare readily available (typical home use situation), I may want to evacuate the semi-broken drive so that I do not lose redundancy until I can get another disk. May or may not be practical depending on current disk space usage, of course.
* Some machine A needs a spare drive but there is none, and I have free disk space on machine B, which has matching drives. Evacuate a disk on B and use it as the replacement in A (again, a typical home use situation). Once I obtain a new drive, revert B's disk into B again, or alternatively keep it in A and use the new drive in B.

-- / Peter Schuller
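For completeness, the closest existing operations I know of work at the level of individual disks rather than whole vdevs, and none of them shrink the pool (device names made up):

  # Swap a suspect disk for a different one (requires a disk to replace it with)
  zpool replace tank c1t2d0 c3t0d0

  # Take a disk out of service temporarily; redundancy is reduced until it
  # is brought back online and resilvered
  zpool offline tank c1t2d0
  zpool online tank c1t2d0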
Re: [zfs-discuss] Synchronous Mount?
Specifically, I was trying to compare ZFS snapshots with LVM snapshots on Linux. One of the tests does writes to an ext3 filesystem (on top of an LVM snapshot) mounted synchronously, in order to measure the real copy-on-write overhead. So I was wondering if I could do the same with ZFS. Seems not.

Given that ZFS does COW for *all* writes, what does this test actually intend to show when run on ZFS? Am I missing something, or shouldn't writes to a clone be as fast as, or even faster than, writes to a non-clone, given that COW is always performed but in the case of the clone the old data is not freed?

-- / Peter Schuller, InfiDyne Technologies HB
[zfs-discuss] On-failure policies for pools
Hello,

There have been comparisons posted here (and in general out there on the net) for various RAID levels and the chances of e.g. double failures. One problem that is rarely addressed, though, is the various edge cases that significantly impact the probability of loss of data. In particular, I am concerned about the relative likelihood of bad sectors on a drive vs. entire-drive failure. On a raidz where uptime is not important, I would not want a dead drive plus a single bad sector on another drive to cause loss of data, yet dead drive + bad sector is going to be a lot more likely than two dead drives within the same time window. In many situations it may not feel worth it to move to a raidz2 just to avoid this particular case.

I would like a pool policy that allowed one to specify that at the moment a disk fails (where fails = considered faulty), all mutable I/O is immediately stopped (returning I/O errors to userspace, I presume), and any transaction in the process of being committed is rolled back. The result is that the drive that just failed completely does not immediately become out of date. If one then triggers a bad block on another drive while resilvering onto a replacement drive, you know that you have the failed drive as a last resort (given that a full-drive failure is unlikely to mean the drive was physically obliterated; perhaps the controller circuitry or certain physical components can be replaced). In the case of raidz2, you effectively have another half level of redundancy.

Also, with either raidz/raidz2 one can imagine cases where a machine is booted with one or two drives missing (due to cabling issues, for example); guaranteeing that no pool is ever online for writable operations (thus making absent drives out of date) until the administrator explicitly asks for it would greatly reduce the probability of data loss due to a bad block in this case as well.

In short: if true irrevocable data loss is limited (assuming no software issues) to the complete obliteration of all data on n drives (for n levels of redundancy), or alternatively to the unlikely event of bad blocks coinciding on multiple drives, wouldn't reliability be significantly increased in cases where this is an acceptable practice? Opinions?

-- / Peter Schuller, InfiDyne Technologies HB
Re: [zfs-discuss] Re: Adding disk to a RAID-Z?
PS The ZFS administration guide mentions this recommendation, but does not give
PS any hint as to why. A reader may assume/believe it's just general advice,
PS based on someone's opinion that with more than 9 drives, the statistical
PS probability of failure is too high for raidz (or raid5). It's a shame the
PS statement in the guide is not further qualified to actually explain that
PS there is a concrete issue at play.

I don't know if the ZFS man pages should teach people about RAID. If somebody doesn't understand RAID basics, then some kind of tool where you just specify a pool of disks and choose from: space efficient, performance, non-redundant - and that's it; all the rest will be hidden.

But the guide *does* make a recommendation, yet does not qualify it. And if there is a problem specific to ZFS that is NOT just an obvious result of some general principle, that is very relevant for the ZFS administration guide IMO (and the man pages, for that matter).

-- / Peter Schuller, InfiDyne Technologies HB
Re: [zfs-discuss] Re: Adding disk to a RAID-Z?
It's just common sense advice - for many users, keeping raidz groups below 9 disks should give good enough performance. However, if someone creates a raidz group of 48 disks he/she probably also expects performance, and in general raid-z wouldn't offer it.

There is at least one reason for wanting more drives in the same raidz/raid5/etc: redundancy. Suppose you have 18 drives. Having two raidz groups consisting of 9 drives each means you are more likely to lose data than with a single raidz2 consisting of 18 drives, since in the former case, yes - two drives can go down, but only if they are the *right* two drives. In the latter case any two drives can go down. (The two layouts are sketched below.)

The ZFS administration guide mentions this recommendation, but does not give any hint as to why. A reader may assume/believe it's just general advice, based on someone's opinion that with more than 9 drives the statistical probability of failure is too high for raidz (or raid5). It's a shame the statement in the guide is not further qualified to actually explain that there is a concrete issue at play. (I haven't looked into the archives to find the previously mentioned discussion.)

-- / Peter Schuller, InfiDyne Technologies HB
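To make the comparison concrete, the two 18-drive layouts would be created along these lines (device names made up):

  # Two 9-disk raidz groups: one failure per group is survivable, but two
  # failures within the same group lose the pool
  zpool create tank \
      raidz  c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 \
      raidz  c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0

  # One 18-disk raidz2: any two failures are survivable
  zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 \
             c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0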
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Is this expected behavior? Assuming concurrent reads (not synchronous and sequential) I would naively expect an n-disk raidz2 pool to have a normalized performance of n for small reads.

q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0 where such behavior in a hardware RAID array led to corruption which was detected by ZFS. No free lunch today, either. -- richard

I appreciate the advantage of checksumming, believe me. Though I don't see why this is directly related to the small-read problem, other than that the implementation happens to work that way. Is there some fundamental reason why one could not (though I understand one *would* not) keep a checksum on a per-disk basis, so that in the normal case one really could read from just one disk for a small read? I realize it is not enough for a block to be self-consistent, but theoretically, couldn't the block which points to the block in question contain multiple checksums for the various subsets on different disks, rather than just the one checksum for the entire block? Not that I consider this a major issue; but since you pointed me to that article in response to my statement above...

-- / Peter Schuller, InfiDyne Technologies HB
Re: [zfs-discuss] NFS and ZFS, a fine combination
http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync() and synchronous writes from the application perspective; it does *NOTHING* to lessen the correctness guarantee of ZFS itself, including in the case of a power outage? This makes it more reasonable to actually disable the ZIL. But still, personally I would like to be able to tell the NFS server to simply not be standards-compliant, so that I can keep the correct semantics on the lower layer (ZFS) and disable the behavior at the level where I actually want it disabled (the NFS server).

-- / Peter Schuller, InfiDyne Technologies HB
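For reference - and with the caveat that the exact spelling of the knob is from memory - the ZIL disabling being discussed is, as far as I understand, the zil_disable tunable, set along these lines:

  # Solaris / OpenSolaris: in /etc/system, takes effect at next boot
  set zfs:zil_disable = 1

  # FreeBSD: as a tunable in /boot/loader.conf
  vfs.zfs.zil_disable="1"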
Re: [zfs-discuss] Adding disk to a RAID-Z?
I want to set up a ZFS server with RAID-Z. Right now I have 3 disks. In 6 months, I want to add a 4th drive and still have everything under RAID-Z without a backup/wipe/restore scenario. Is this possible?

You can add additional storage to the same pool effortlessly, such that the pool will be striped across two raidz groups. You cannot (AFAIK) expand the raidz itself. End result is 9 disks, with 7 disks worth of effective storage capacity. A sketch of the non-destructive route follows below.

The ZFS administration guide contains examples of doing exactly this, except I believe the examples use mirrors. ZFS administration guide: http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf

-- / Peter Schuller, InfiDyne Technologies HB
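What that looks like in practice, with made-up device names - note that the new disks form a raidz group of their own rather than joining the existing one:

  # The original 3-disk raidz
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0

  # Later: add a second raidz group; the pool then stripes across both.
  # A single new disk cannot be folded into the existing raidz.
  zpool add tank raidz c1t0d0 c1t1d0 c1t2d0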
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
I've been using a simple model for small, random reads. In that model, the performance of a raidz[12] set will be approximately equal to a single disk. For example, if you have 6 disks, then the performance of the 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way dynamic stripe of 2-way mirrors will have a normalized performance of 6. I'd be very interested to see if your results concur.

Is this expected behavior? Assuming concurrent reads (not synchronous and sequential) I would naively expect an n-disk raidz2 pool to have a normalized performance of n for small reads. Is there some reason why a small read on a raidz2 is not statistically very likely to require I/O on only one device? Assuming a non-degraded pool, of course.

-- / Peter Schuller, InfiDyne Technologies HB
Re: [zfs-discuss] Very strange performance patterns
Short version: Pool A is fast, pool B is slow. Writing to pool A is fast. Writing to pool B is slow. Writing to pool B WHILE writing to pool A is fast on both pools. Explanation?

[snip]

For the archives, it is interesting to note that when I do not perform a local dd to the device, but instead use rsync over ssh (thus being limited by network/cpu), performance is fine. It bursts out every now and then:

  pool         used  avail   read  write   read  write
  ----------  -----  -----  -----  -----  -----  -----
  fast        51.7G   246G      0      0      0      0
  raid0        124M  1.16T      0      0      0      0
  ----------  -----  -----  -----  -----  -----  -----
  fast        51.7G   246G      0      0      0      0
  raid0        124M  1.16T      0      0      0      0
  ----------  -----  -----  -----  -----  -----  -----
  fast        51.7G   246G      0      0      0      0
  raid0        124M  1.16T      0     56      0  7.08M
  ----------  -----  -----  -----  -----  -----  -----
  fast        51.7G   246G      0      0      0      0
  raid0        124M  1.16T      0    115      0  14.4M
  ----------  -----  -----  -----  -----  -----  -----
  fast        51.7G   246G      0      0      0      0
  raid0        154M  1.16T      0     90      0  1.04M
  ----------  -----  -----  -----  -----  -----  -----
  fast        51.7G   246G      0      0      0      0
  raid0        154M  1.16T      0      0      0      0

Performance is still not visibly stellar, but then the actual transfer is only 4 megs or so per second and I don't know how flushes are handled by zfs. But it's most definitely faster than the 4 MB/second seen when dd:ing to the pool.

-- / Peter Schuller, InfiDyne Technologies HB
[zfs-discuss] Very strange performance patterns
Hello,

Short version: Pool A is fast, pool B is slow. Writing to pool A is fast. Writing to pool B is slow. Writing to pool B WHILE writing to pool A is fast on both pools. Explanation?

Long version: I have an existing two-disk pool consisting of two SATA drives. Call this pool pool1. This has always been as fast as I would have expected; 30-50 MB/second write, 105 MB/second read.

I have now added four additional drives (A, B, C and D) to the machine, which I wanted to use for a raidz. For initial testing I chose a striped pool just to see what kind of performance I would get. The initial two drives (pool1) are on their own controller. A and B are on a second controller and C and D on a third controller. All of the controllers are SiL3512.

Here comes the very interesting bit. For the purpose of the table below, good performance means 20-50 MB/sec write and ~70-80 MB/sec read. Bad performance means 3-5 MB/sec write and ~70-80 MB/sec read.

  disk layout in pool | other I/O      | performance
  A + B               | none           | good
  C + D               | none           | good
  A + B + C + D       | none           | bad
  A + B + C           | none           | bad
  A + B + C + D       | write to pool1 | goodish (!!!)

(some tested combinations omitted)

In other words: initially it looked like write performance went down the drain as soon as I combined drives from multiple controllers into one pool, while performance was fine as long as I stayed within one controller. However, writing to the slow A+B+C+D pool *WHILE ALSO WRITING TO POOL1* actually *INCREASES* performance. The writes to pool1 and to the otherwise slow pool are not quite up to the normal good level, but that is probably to be expected even under normal circumstances.

CPU usage during the writing (when slow) is almost non-existent. There is no spike similar to what you seem to get every five seconds or so normally (during transaction commits?). Also, at least once I saw the write performance on the slow pool spike at 19 MB/second for a single one-second period (zpool iostat) when I initiated the write; then it went down again and remained very constant, not really varying outside 3.4-4.5. Often EXACTLY at 3.96.

Writing and reading means dd:ing (to /dev/null, from /dev/zero) with bs=$((1024*1024)); a sketch of the command form follows below. Pools were created with zpool create speedtest c4d0 c5d0 c6d0 c7d0 and variations of that for the different combinations. The pool with all four drives is 1.16T in size.

-- / Peter Schuller, InfiDyne Technologies HB
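For completeness, the read/write tests were of this form (file name and count are arbitrary placeholders; the point is large sequential I/O through the filesystem):

  # write test
  dd if=/dev/zero of=/speedtest/testfile bs=$((1024*1024)) count=4096

  # read test
  dd if=/speedtest/testfile of=/dev/null bs=$((1024*1024))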
[zfs-discuss] ZFS and write caching (SATA)
Hello, my understanding is that ZFS is specifically designed to work with write caching, by instructing drives to flush their caches when a write barrier is needed. And in fact, it even turns write caching on explicitly on devices it manages. My question is of a practical nature: will this *actually* be safe on the average consumer-grade SATA drive? I have seen offhand references to PATA drives generally not being trustworthy when it comes to this (SCSI therefore being recommended), but I have not been able to find information on the status of typical SATA drives. While I do intend to perform actual powerloss tests (see the note below), it would be interesting to hear from anybody whether it is generally expected to be safe.

-- / Peter Schuller, InfiDyne Technologies HB
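As a side note for anyone wanting to check what their drives are currently doing: on Solaris the per-disk write cache can, if I recall the menus correctly, be inspected and toggled interactively via format in expert mode (select the disk, then the cache / write_cache submenu):

  format -e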
Re: [zfs-discuss] ZFS and write caching (SATA)
PS While I do intend to perform actual powerloss tests, it would be interesting
PS to hear from anybody whether it is generally expected to be safe.

Well, if disks honor cache flush commands then it should be reliable whether it's a SATA or a SCSI disk.

Yes. Sorry, I could have stated my question more clearly. What I am specifically concerned about is exactly that - whether your typical SATA drive *will* honor cache flush commands, as I understand a lot of PATA drives did/do not. Googling tends to give very little concrete information on this, since very few people actually seem to care about it. Since I wanted to confirm my understanding of ZFS semantics w.r.t. write caching anyway, I thought I might as well also ask about the general tendency among drives since, if anywhere, people here might know.

-- / Peter Schuller, InfiDyne Technologies HB