[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On 9/13/06, Matthew Ahrens [EMAIL PROTECTED] wrote:

>> Sure, if you want *everything* in your pool to be mirrored, there is no
>> real need for this feature (you could argue that setting up the pool
>> would be easier if you didn't have to slice up the disk though).
>
> Not necessarily. Implementing this on the FS level will still allow the
> administrator to turn on copies for the entire pool, since the pool is
> technically also a FS and the property is inherited by child FS's. Of
> course, this will also allow the admin to turn off copies on the FS
> containing junk.

Implementing it at the directory and file levels would be even more flexible: redundancy strategy would no longer be tightly tied to path location, but directories and files could themselves still inherit defaults from the filesystem and pool when appropriate (but could be individually handled when desirable).

I've never understood why redundancy was a pool characteristic in ZFS - and the addition of 'ditto blocks' and now this new proposal (both of which introduce completely new forms of redundancy to compensate for the fact that pool-level redundancy doesn't satisfy some needs) just makes me more skeptical about it. (Not that I intend in any way to minimize the effort it might take to change that decision now.)

>> It could be recommended in some situations. If you want to protect
>> against disk firmware errors, bit flips, part of the disk getting
>> scrogged, then mirroring on a single disk (whether via a mirror vdev or
>> copies=2) solves your problem. Admittedly, these problems are probably
>> less common than whole-disk failure, which mirroring on a single disk
>> does not address.
>
> I beg to differ from experience that the above errors are more common
> than whole disk failures. It's just that we do not notice the disks are
> developing problems, but panic when they finally fail completely.

It would be interesting to know whether that would still be your experience in environments that regularly scrub active data as ZFS does (assuming that said experience was accumulated in environments that don't).

The theory behind scrubbing is that all data areas will be hit often enough that they won't have time to deteriorate (gradually) to the point where they can't be read at all, and early deterioration encountered during the scrub pass (or other access), while sectors have only begun to become difficult to read, will result in immediate revectoring (by the disk or, failing that, by the file system) to healthier locations. Since ZFS-style scrubbing detects even otherwise-undetectable 'silent corruption' missed by the disk's own ECC mechanisms, that lower-probability event is also covered (though my impression is that the probability of even a single such sector may be significantly lower than that of whole-disk failure, especially in laptop environments).

All that being said, keeping multiple copies on a single disk of most metadata (the loss of which could lead to widespread data loss) definitely makes sense (especially given its typically negligible size), and it probably makes sense for some files as well.

- bill
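[A minimal console sketch of the inheritance argument above, assuming the proposal ships as a per-dataset 'copies' property; the pool and dataset names are hypothetical:

# zfs set copies=2 tank           # set on the pool's root FS; children inherit it
# zfs create tank/scratch
# zfs set copies=1 tank/scratch   # junk data: opt back out locally
# zfs get -r copies tank          # shows which datasets inherit vs. override

This is exactly the "turn it on for the whole pool, off for the junk FS" pattern, with no disk slicing involved.]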
Re: [zfs-discuss] Access to ZFS checksums would be a nice and very useful feature
On Thu, Sep 14, 2006 at 05:08:18PM -0500, Nicolas Williams wrote:
> On Thu, Sep 14, 2006 at 10:32:59PM +0200, Henk Langeveld wrote:
>> Bady, Brant RBCM:EX wrote:
>>> Part of the archiving process is to generate checksums (I happen to
>>> use MD5), and store them with other metadata about the digital object
>>> in order to verify data integrity and demonstrate the authenticity of
>>> the digital object over time. Wouldn't it be helpful if there was a
>>> utility to access/read the checksum data created by ZFS, and use it
>>> for those same purposes.
>>
>> Doesn't ZFS use block-level checksums?
>
> Yes, but the checksum is stored with the pointer.

So then, for each file/directory there's a dnode, and that dnode has several block pointers to data blocks or indirect blocks, and indirect blocks have pointers to... and so on.

Does ZFS have block fragments? If so, then updating an unrelated file would change the checksum.

Ceri
--
That must be wonderful! I don't understand it at all. -- Moliere
Re: [zfs-discuss] Re: zfs panic installing a brandz zone
Yup, it's almost certain that this is the bug you are hitting.

-Mark

Alan Hargreaves wrote:
> I know, bad form replying to myself, but I am wondering if it might be
> related to
>
>   6438702 error handling in zfs_getpage() can trigger page not locked
>
> which is marked fix in progress with a target of the current build.
>
> alan.
>
> Alan Hargreaves wrote:
>> Folks, before I start delving too deeply into this crashdump, has
>> anyone seen anything like it? The background is that I'm running a
>> non-debug open build of b49 and was in the process of running
>>
>>   zoneadm -z redlx install
>>
>> After a bit, the machine panics. Looking initially at the crashdump,
>> I'm down to 88mb free (out of a gig) and see the following stack:
>>
>> fe8000de7800 page_unlock+0x3b(180218720)
>> fe8000de78d0 zfs_getpage+0x236(89b84d80, 12000, 2000, fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, fe808180a000, 1, 80826dc8)
>> fe8000de7950 fop_getpage+0x52(89b84d80, 12000, 2000, fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, fe8081818000, 1, 80826dc8)
>> fe8000de7a50 segmap_fault+0x1d6(801a6f38, fbc29b20, fe8081818000, 2000, 0, 1)
>> fe8000de7b30 segmap_getmapflt+0x67a(fbc29b20, 89b84d80, 12000, 2000, 1, 1)
>> fe8000de7bd0 lofi_strategy_task+0x14b(959d2400)
>> fe8000de7c60 taskq_thread+0x1a7(84453da8)
>> fe8000de7c70 thread_start+8()
>>
>> %rax = 0x                  %r9  = 0x0300430e
>> %rbx = 0x000e              %r10 = 0x1000
>> %rcx = 0xfe8081819000      %r11 = 0x113709b0
>> %rdx = 0xfe8000de7c80      %r12 = 0x000180218720
>> %rsi = 0x00013000          %r13 = 0xfbc52160 pse_mutex+0x200
>> %rdi = 0xfbc52160 pse_mutex+0x200
>> %r8  = 0x0200              %r14 = 0x4000
>> %rip = 0xfb8474fb page_unlock+0x3b
>>                            %r15 = 0xfe8000de79d8
>> %rbp = 0xfe8000de7800
>> %rsp = 0xfe8000de77e0
>> %rflags = 0x00010246 id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
>>          status=of,df,IF,tf,sf,ZF,af,PF,cf
>> %cs = 0x0028  %ds = 0x0043  %es = 0x0043
>> %trapno = 0xe  %fs = 0x      fsbase = 0x8000
>> %err = 0x0     %gs = 0x01c3  gsbase = 0xfbc27b70
>>
>> While the panic string says NULL pointer dereference, it appears that
>> 0x180218720 is not mapped. The dereference looks like the first
>> dereference in page_unlock(), which looks at pp->p_selock. I can spend
>> a little time looking at it, but was wondering if anyone had seen this
>> kind of panic previously? I have two identical crashdumps created in
>> exactly the same way.
>>
>> alan.
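[For anyone triaging a similar dump, a minimal mdb(1) session sketch; the dump file names are the usual savecore defaults and are assumed, not taken from the post:

# mdb -k unix.0 vmcore.0
> ::status     # panic string and dump summary
> ::msgbuf     # console messages leading up to the panic
> $C           # crash-time stack with frame pointers]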
[zfs-discuss] [Blade 150] ZFS: extremely low performance
Hi forum,

I'm currently playing around a little with ZFS on my workstation. I created a standard mirrored pool over 2 disk slices:

# zpool status
  pool: mypool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        mypool        ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s4  ONLINE       0     0     0
            c0t2d0s4  ONLINE       0     0     0

Then i created a ZFS with no extra options:

# zfs create mypool/zfs01
# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
mypool         106K  27,8G  25,5K  /mypool
mypool/zfs01  24,5K  27,8G  24,5K  /mypool/zfs01

When I now run a mkfile on the new FS, the performance of the whole system breaks down to near zero:

# mkfile 5g test

last pid: 25286;  load avg: 3.54, 2.28, 1.29;  up 0+01:44:26    16:16:24
66 processes: 61 sleeping, 3 running, 1 zombie, 1 on cpu
CPU states: 0.0% idle, 2.1% user, 97.9% kernel, 0.0% iowait, 0.0% swap
Memory: 512M phys mem, 65M free mem, 2050M swap, 2050M free swap

  PID USERNAME LWP PRI NICE  SIZE  RES STATE    TIME    CPU COMMAND
25285 root       1  84      1184K 752K run      0:09  66.28% mkfile

It seems that some kind of kernel activity while writing to ZFS blocks the system. Is this a known problem? Do you need additional information?

regards
Mathias
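[A hedged first-pass triage for this kind of "97.9% kernel" stall, using only stock tools; the device names come from the zpool status above, and the DTrace one-liner assumes a build with DTrace available:

# iostat -xnz 5    # are c0t0d0s4 / c0t2d0s4 saturated (%b, asvc_t) while mkfile runs?
# mpstat 5         # confirm the time really is in sys, and on which CPU
# dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }'
                   # sample kernel stacks to see what the kernel is busy doing

If both slices of the mirror sit on one spindle or one IDE channel, every write is doubled onto the same hardware, which compounds whatever the kernel-side problem is.]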
[zfs-discuss] Re: [Blade 150] ZFS: extreme low performance
The disks in that Blade 150, are these IDE disks? The performance problem is probably bug 6421427:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6421427

A fix for the issue was integrated into the OpenSolaris 20060904 source drop (actually the closed binary drop):

http://dlc.sun.com/osol/on/downloads/20060904/on-changelog-20060904.html

... but has been removed again in the next update:

http://dlc.sun.com/osol/on/downloads/20060911/on-changelog-20060911.html
Re: [zfs-discuss] Access to ZFS checksums would be a nice and very useful feature
Luke Scharf wrote:
> It sounded to me like he wanted to implement tripwire, but save some
> time and CPU power by querying the checksumming work that was already
> done by ZFS.

Never mind. The e-mail client that I chose to use broke up the thread, and I didn't see that the issue had already been thoroughly discussed.

-Luke
[zfs-discuss] Re: Sol 10 x86_64 intermittent SATA device locks up server
What's the brand and model of the cards?
Re: [zfs-discuss] Access to ZFS checksums would be a nice and very useful feature
On Fri, Sep 15, 2006 at 09:31:04AM +0100, Ceri Davies wrote:
> On Thu, Sep 14, 2006 at 05:08:18PM -0500, Nicolas Williams wrote:
>> Yes, but the checksum is stored with the pointer.
>
> So then, for each file/directory there's a dnode, and that dnode has
> several block pointers to data blocks or indirect blocks, and indirect
> blocks have pointers to... and so on.
>
> Does ZFS have block fragments? If so, then updating an unrelated file
> would change the checksum.

No. It has variable-sized blocks. A block pointer in ZFS is much more than just a block number. Among other things, a block pointer has the checksum of the block it points to. See the on-disk layout document for more info. There is no way that updating one file could change another's checksum.

What does matter is that any O(1) "ZFS checksum of a file" (i.e., the checksum in the file's top-level block pointer) depends on the on-disk layout of the file, and anything that changed that layout (today nothing would) would change the ZFS checksum even though the contents had not. So I think that ZFS checksums, if exposed, are best left as a file-change-test optimization, not as an actual checksum of the file's contents.

Nico
--
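[Those per-block checksums are already visible today via zdb if you want to poke at them; a sketch, where the dataset name and object number are hypothetical and the output format varies by build:

# zdb -ddddd mypool/zfs01 42   # dump object 42's dnode and block pointer tree;
                               # each block pointer line carries a cksum=... field

The checksum in the topmost pointer is the O(1) "file checksum" Nico describes: it changes whenever anything beneath it changes, including layout, not only file contents.]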
Re: [zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:
> Implementing it at the directory and file levels would be even more
> flexible: redundancy strategy would no longer be tightly tied to path
> location, but directories and files could themselves still inherit
> defaults from the filesystem and pool when appropriate (but could be
> individually handled when desirable).

The problem boils down to not having a way to express your intent that works over NFS (where you're basically limited by POSIX) and that you can use from any platform (esp. ones where ZFS isn't installed). If you have some ideas, this is something we'd love to hear about.

> I've never understood why redundancy was a pool characteristic in ZFS -
> and the addition of 'ditto blocks' and now this new proposal (both of
> which introduce completely new forms of redundancy to compensate for
> the fact that pool-level redundancy doesn't satisfy some needs) just
> makes me more skeptical about it.

We have thought long and hard about this problem and even know how to implement it (the name we've been using is "Metaslab Grids", which isn't terribly descriptive, or as Matt put it, a "bag o' disks"). There are two main problems with it, though.

One is failures. The problem is that you want the set of disks implementing redundancy (mirror, RAID-Z, etc.) to be spread across fault domains (controller, cable, fans, power supplies, geographic sites) as much as possible. There is no generic mechanism to obtain this information and act upon it. We could ask the administrator to supply it somehow, but such a description takes effort, is not easy to get right, and is prone to error. That's why we have the model right now where the administrator specifies how they want the disks spread out across fault groups (vdevs).

The second problem comes back to accounting. If you can specify, on a per-file or per-directory basis, what kind of replication you want, how do you answer the statvfs() question? I think the recent discussions on this list illustrate the complexity and passion on both sides of the argument.

> (Not that I intend in any way to minimize the effort it might take to
> change that decision now.)

The effort is not actually that great. All the hard problems we needed to solve in order to implement this were basically solved when we did the RAID-Z code. As a matter of fact, you can see it in the on-disk specification as well. In the DVA, you'll notice an 8-bit field labeled GRID. These are the bits that would describe, on a per-block basis, what kind of redundancy we used.

--Bill
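[To make the statvfs() problem concrete, a hypothetical transcript under the proposed per-dataset 'copies' property; names and numbers are illustrative only:

# zfs set copies=2 tank/important
# mkfile 100m /tank/important/f
# du -h /tank/important/f   # charged roughly 200M: both copies count as "used"
# df -h /tank/important     # what should "avail" say, when the next write might
                            # cost 1x in one place and 2x in another?

Once redundancy can vary per file or per directory, a single free-space number per filesystem stops being well defined.]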
Re: [zfs-discuss] Re: Bizarre problem with ZFS filesystem
It is highly likely you are seeing a duplicate of:

  6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS

which was fixed recently in build 48 of Nevada. The symptoms are very similar: prior to the fix, an fsync() from vi would have to force out all other data through the intent log.

Neil.

Anantha N. Srirama wrote on 09/13/06 15:58:
> One more piece of information. I was able to ascertain the slowdown
> happens only when ZFS is used heavily, meaning lots of in-flight I/O.
> This morning when the system was quiet my writes to the /u099
> filesystem were excellent, and it has gone south like I reported
> earlier. I am currently awaiting the completion of a write to /u099,
> well over 60 seconds. At the same time I was able to create/save files
> in /u001 without any problems. The only difference between /u001 and
> /u099 is the size of the filesystem (256GB vs 768GB). Per your
> suggestion I ran a 'zfs set' command and it completed after a wait of
> around 20 seconds while my file save from vi against /u099 is still
> pending!!!
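[Until a build with the fix is running, one hedged workaround implied by the bug's scope (fsync() only has to flush the intent log of its own filesystem, which matches the /u001 vs /u099 behavior above) is to keep bulk writers and fsync()-sensitive work in separate datasets; the names below are illustrative:

# zfs create fserv/bulk          # heavy streaming writes go here
# zfs create fserv/interactive   # vi, databases, anything calling fsync()

Each ZFS filesystem has its own intent log chain, so an fsync() in one no longer forces out the other's dirty data.]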
[zfs-discuss] no automatic clearing of zoned eh?
s10u2: once zoned, always zoned?

I see that the zoned property is not cleared after removing the dataset from a zone cfg, or even after uninstalling the entire zone... [right, I know how to clear it by hand, but maybe I am missing a bit of magic in the otherwise anodyne zonecfg et al.]

oz
--
ozan s. yigit | [EMAIL PROTECTED]
don't be afraid to find the rhinoceros to pick fleas from. -- richard gabriel [patterns of software]
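[For the archive, the by-hand clearing oz alludes to, run from the global zone once the dataset is out of the zone config; the dataset name is hypothetical:

# zfs set zoned=off mypool/zonedata
# zfs mount mypool/zonedata       # normal mountpoint behavior returns]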
[zfs-discuss] Re: resilvering, how long will it take?
The status showed 19.46% the first time I ran it, then 9.46% the second. The question I have is, I added the new disk, but it's showing the following:

Device: c5d0
Storage Pool: fserv
Type: Disk
Device State: Faulted (cannot open)

The disk is currently unpartitioned and unformatted. I was under the impression ZFS was going to take care of all of that. Do I need to set up partitioning and formatting before trying to add it to a pool?
Re: [zfs-discuss] Re: resilvering, how long will it take?
On Fri, Sep 15, 2006 at 01:10:25PM -0700, Tim Cook wrote:
> the status showed 19.46% the first time I ran it, then 9.46% the
> second. The question I have is I added the new disk, but it's showing
> the following:
>
> Device: c5d0
> Storage Pool: fserv
> Type: Disk
> Device State: Faulted (cannot open)

Did you run "zpool replace fserv c5d0"? We're working on the auto-replace when we detect a hot-plug, but it's not in yet.

> The disk is currently unpartitioned and unformatted. I was under the
> impression ZFS was going to take care of all of that. Do I need to
> setup partitioning and formatting before trying to add it to a pool?

ZFS should take care of all that.

--Bill
[zfs-discuss] Re: resilvering, how long will it take?
hrmm...

cannot replace c5d0 with c5d0: cannot replace a replacing device
Re: [zfs-discuss] ZFS on production servers with SLA
Quoth Darren J Moffat on Fri, Sep 08, 2006 at 01:59:16PM +0100:
> Nicolas Dorfsman wrote:
>> Regarding system partitions (/var, /opt, all mirrored + alternate
>> disk), what would be YOUR recommendations? ZFS or not?
>
> /var for now must be UFS since Solaris 10 doesn't have ZFS root
> support, and that means /, /etc, /var, /usr.

Once 6354489 was fixed, I believe Stephen Hahn got zfs-on-/usr working. That might be painful to upgrade, though.

> I've run systems with /opt as a ZFS filesystem and it works just fine.
> However, note that the Solaris installer puts stuff in /opt (for
> backwards-compat reasons; ideally it wouldn't), and that may cause
> issues with live upgrade or require you to move that stuff onto your
> ZFS /opt datasets.

I also use ZFS for /opt. I have to unmount it before using Live Upgrade, though, because it refuses to leave /opt on a separate filesystem. I suppose it's right, since the package database may refer to files in /opt, but I haven't had any problems.

David
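[A sketch of the /opt-on-ZFS arrangement plus the Live Upgrade dance David describes; the pool and dataset names are hypothetical and the lucreate invocation is abbreviated:

# zfs create tank/opt
# zfs set mountpoint=/opt tank/opt   # /opt as its own dataset
# zfs umount tank/opt                # hide it before running Live Upgrade
# lucreate -n newBE ...              # build the new boot environment
# zfs mount tank/opt                 # put /opt back afterwards]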
RE: [zfs-discuss] Re: resilvering, how long will it take?
Yes sir:

[EMAIL PROTECTED]:/ # zpool status -v fserv
  pool: fserv
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 5.90% done, 27h13m to go
config:

        NAME           STATE     READ WRITE CKSUM
        fserv          DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            replacing  DEGRADED     0     0     0
              c5d0s0/o UNAVAIL      0     0     0  cannot open
              c5d0     ONLINE       0     0     0
            c3d0       ONLINE       0     0     0
            c3d1       ONLINE       0     0     0
            c4d0       ONLINE       0     0     0

errors: No known data errors

-----Original Message-----
From: Bill Moore [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 15, 2006 4:45 PM
To: Tim Cook
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Re: resilvering, how long will it take?

On Fri, Sep 15, 2006 at 01:26:21PM -0700, Tim Cook wrote:
> says it's online now so I can only assume it's working. Doesn't seem to
> be reading from any of the other disks in the array though. Can it
> resilver without traffic to any other disks? /noob

Can you send the output of "zpool status -v pool"?

--Bill
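[For the archive: the earlier "cannot replace a replacing device" error simply meant the replace was already underway, which is what the replacing vdev above shows. When the resilver finishes, the old c5d0s0/o half should be detached automatically; if it ever gets stuck, the usual escape hatch (only after the resilver is done) is to detach the stale half by hand:

# zpool detach fserv c5d0s0/o   # drop the old half of the replacing vdev]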
[zfs-discuss] Re: Proposal: multiple copies of user data
(I looked at my email before checking here, so I'll just cut and paste the email response in here rather than send it. By the way, is there a way to view just the responses that have accumulated in this forum since I last visited - or just those I've never looked at before?)

Bill Moore wrote:
> On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:
>> Implementing it at the directory and file levels would be even more
>> flexible: redundancy strategy would no longer be tightly tied to path
>> location, but directories and files could themselves still inherit
>> defaults from the filesystem and pool when appropriate (but could be
>> individually handled when desirable).
>
> The problem boils down to not having a way to express your intent that
> works over NFS (where you're basically limited by POSIX) that you can
> use from any platform (esp. ones where ZFS isn't installed). If you
> have some ideas, this is something we'd love to hear about.

Well, one idea is that it seems downright silly to gate ZFS facilities on the basis of two-decade-old network file access technology: sure, it's important to be able to *access* ZFS files using NFS, but does anyone really care if NFS can't express the full range of ZFS features - at least to the degree that they think such features should be suppressed as a result (rather than made available to local users plus any remote users employing a possibly future mechanism that *can* support them)?

That being said, you could always adopt the ReiserFS approach of allowing access to file/directory metadata via extended path specifications in environments like NFS where richer forms of interaction aren't available: yes, it may feel a bit kludgey, but it gets the job done.

And, of course, even if you did nothing to help NFS, its users would still benefit from inheriting whatever arbitrarily fine-grained redundancy levels had been established via more comprehensive means: they just wouldn't be able to tweak redundancy levels themselves (any more, or any less, than they can do so today).

>> I've never understood why redundancy was a pool characteristic in ZFS -
>> and the addition of 'ditto blocks' and now this new proposal (both of
>> which introduce completely new forms of redundancy to compensate for
>> the fact that pool-level redundancy doesn't satisfy some needs) just
>> makes me more skeptical about it.
>
> We have thought long and hard about this problem and even know how to
> implement it (the name we've been using is Metaslab Grids, which isn't
> terribly descriptive, or as Matt put it a "bag o' disks").

Yes, 'a bag o' disks' - used intelligently at a higher level - is pretty much what I had in mind.

> There are two main problems with it, though. One is failures. The
> problem is that you want the set of disks implementing redundancy
> (mirror, RAID-Z, etc.) to be spread across fault domains (controller,
> cable, fans, power supplies, geographic sites) as much as possible.
> There is no generic mechanism to obtain this information and act upon
> it. We could ask the administrator to supply it somehow, but such a
> description takes effort, is not easy, and prone to error. That's why
> we have the model right now where the administrator specifies how they
> want the disks spread out across fault groups (vdevs).

Without having looked at the code I may be missing something here.
Even with your current implementation, if there's indeed no automated way to obtain such information, the administrator has to exercise manual control over disk groupings if they're going to attain higher availability by avoiding other single points of failure, instead of just guarding against unrecoverable data loss from disk failure. Once that information has been made available to the system, letting it make use of that information at a higher level, rather than just aggregating entire physical disks, should not entail additional administrator effort.

I admit that I haven't considered the problem in great detail, since my bias is toward solutions that employ redundant arrays of inexpensive nodes to scale up rather than a small number of very large nodes (in part because a single large node itself can often be a single point of failure even if many of its subsystems carefully avoid being so in the manner that you suggest). Each such small node has a relatively low disk count and little or no internal redundancy, and thus comprises its own little fault-containment environment, avoiding most such issues; as a plus, such node sizes mesh well with the bandwidth available from very inexpensive Gigabit Ethernet interconnects and switches (even when streaming data sequentially, such as video on demand) and allow fine-grained incremental system scaling (by the time faster interconnects become inexpensive, disk bandwidth should have increased enough that such a balance will still be fairly good).

Still, if you can group whole disks intelligently in a large system with respect to supplementing