Re: [zfs-discuss] 'cannot import 'andaman': I/O error', and failure to follow my own advice
c == Miles Nordin car...@ivy.net writes:

     c terabithia:/# zpool import andaman
     c cannot import 'andaman': I/O error
     c         Destroy and re-create the pool from
     c         a backup source.

snv_151, the proprietary release, was able to fix this.  I didn't try oi_148 first, so there's a chance it would've worked too if I'd given it a chance.

  root@solaris:~# zpool import -n -F -f 7400719929021713582
  Would be able to return andaman to its state as of April 3, 2011 03:53:23 PM PDT.
  Would discard approximately 31 seconds of transactions.

  root@solaris:~# zpool import -F -f 7400719929021713582
  Pool andaman returned to its state as of April 3, 2011 03:53:23 PM PDT.
  Discarded approximately 31 seconds of transactions.

so, ftr, seems not all 'import -F' are created equal. :)
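For anyone who lands here with the same wall, the general rewind-import sequence looks roughly like the below.  Pool name/GUID are placeholders, -n is the dry run, and some newer builds also grew a more aggressive -X rewind, so check your own zpool(1M) before leaning on any of it:

  # dry run: report what a rewind would discard, without actually importing
  zpool import -n -F -f <pool-name-or-guid>
  # if the report looks acceptable, do it for real
  zpool import -F -f <pool-name-or-guid>
  # then verify and scrub
  zpool status -v <pool>
  zpool scrub <pool>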
Re: [zfs-discuss] ZFS Going forward after Oracle - Let's get organized, let's get started.
js == Joerg Schilling joerg.schill...@fokus.fraunhofer.de writes: js This is interesting. Where is this group hosted? +1 I glance at the list after years of neglect (selfishly...after almost losing my pool), and see stuff like this: shady backroom irc-kiddie bullshit. please: names, mailing lists, urls, hg servers. Many of us have worked on legitimate open source projects before, you know. We know what one looks like, and it's not enshrouded in a tangle of passive-voice sentences and exclusive mafia language. Of course you're welcome to associate with one another however you like, and maybe the hostile mailing-list-flame tone of people like me is part of what makes you want to make all your infrastructure private. but if the goal of The ZFS Organization is to reassure people they should make new ZFS pools after the Oracle implosion and therefore fund Nexenta support (a worthy goal IMHO!), this path won't work on me nor my friends. I'm confident of that. And I would have thought by now it'd be clear brilliant developers can survive on the open internet, and the momentum's usually a lot better there (not to mention transparency/legitimacy/resiliency). good luck, I guess. pgpB7UGMl2IhD.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] 'cannot import 'andaman': I/O error', and failure to follow my own advice
I have a Solaris Express snv_130 box that imports a zpool from two iSCSI targets, and after some power problems I cannot import the pool.  When I found the machine, the pool was FAULTED with half of most mirrors showing CORRUPTED DATA and half showing UNAVAIL.  One of the two iSCSI enclosures was on, while the other was off.

When I brought the other iSCSI enclosure up, bringing all the devices in each of the seven mirror vdevs online, the box panicked.  It went into a panic loop every time it tried to import the problem pool at boot.

I disabled all the iSCSI targets that make up the problem pool and brought the box up, then saved a copy of /etc/zfs/zpool.cache and exported the UNAVAIL pool.  Then I turned the host off, brought back all the iSCSI targets, and booted without a crash, hoping I could 'zpool import' the problem pool.  (Another mirrored pool on the same pair of iSCSI enclosures came back fine and scrubbed with no errors.  shrug)

Here is what I get typing some basic commands:

-8<-
terabithia:/# zpool import
  pool: andaman
    id: 7400719929021713582
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        andaman      ONLINE
          mirror-0   ONLINE
            c3t43d0  ONLINE
            c3t48d0  ONLINE
          mirror-1   ONLINE
            c3t45d0  ONLINE
            c3t47d0  ONLINE
          mirror-2   ONLINE
            c3t52d0  ONLINE
            c3t59d0  ONLINE
          mirror-3   ONLINE
            c3t46d0  ONLINE
            c3t49d0  ONLINE
          mirror-4   ONLINE
            c3t50d0  ONLINE
            c3t44d0  ONLINE
          mirror-5   ONLINE
            c3t57d0  ONLINE
            c3t53d0  ONLINE
          mirror-6   ONLINE
            c3t54d0  ONLINE
            c3t51d0  ONLINE

terabithia:/# zpool import andaman
cannot import 'andaman': I/O error
        Destroy and re-create the pool from
        a backup source.

terabithia:/# zpool status
  pool: aboveground
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        aboveground  ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t16d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

terabithia:/# zpool import -F andaman
cannot import 'andaman': I/O error
        Destroy and re-create the pool from
        a backup source.

terabithia:/# zdb -ve andaman

Configuration for import:
        version: 22
        pool_guid: 7400719929021713582
        name: 'andaman'
        state: 0
        hostid: 2200768359
        hostname: 'terabithia.th3h.inner.chaos'
        vdev_children: 7
        vdev_tree:
            type: 'root'
            id: 0
            guid: 7400719929021713582
            children[0]:
                type: 'mirror'
                id: 0
                guid: 337393226491877361
                whole_disk: 0
                metaslab_array: 14
                metaslab_shift: 33
                ashift: 9
                asize: 1000191557632
                is_log: 0
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 1781150413433362160
                    phys_path: '/iscsi/disk@iqn.2006-11.chaos.inner.th3h.fishstick%3Asd-andaman0001,0:a'
                    whole_disk: 1
                    DTL: 91
                    path: '/dev/dsk/c3t43d0s0'
                    devid: 'id1,sd@t49455400020059100f00/a'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 7841235598547702997
                    phys_path: '/iscsi/disk@iqn.2006-11.chaos.inner.th3h%3Aoldfishstick%3Asd-andaman0001,1:a'
                    whole_disk: 1
                    DTL: 215
                    path: '/dev/dsk/c3t48d0s0'
                    devid: 'id1,sd@t494554000200880e0f00/a'
            children[1]:
                type: 'mirror'
                id: 1
                guid: 1953060080997571723
                whole_disk: 0
                metaslab_array: 210
                metaslab_shift: 33
                ashift: 9
                asize: 1000191557632
                is_log: 0
                children[0]:
                    type: 'disk'
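Tangent for anyone who finds this thread stuck in the same panic-at-boot loop: the generic version of the zpool.cache trick I used is roughly the below.  Paths are the stock Solaris ones, the /a mountpoint is what failsafe boot gives you, and the read-only import option only exists on newer builds, so check your zpool(1M) before counting on it:

  # from a failsafe shell / livecd with the root BE mounted at /a
  cp /a/etc/zfs/zpool.cache /a/etc/zfs/zpool.cache.saved   # keep a copy for zdb
  rm /a/etc/zfs/zpool.cache                                # nothing auto-imports at next boot
  # reboot normally, then try the import by hand, read-only first if your build supports it
  zpool import -o readonly=on -f andaman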
Re: [zfs-discuss] ZFS ... open source moving forward?
js == Joerg Schilling joerg.schill...@fokus.fraunhofer.de writes:

     GPLv3 might help with NetApp - Oracle pact while CDDL does not.

    js GPLv3 does not help at all with NetApp as the CDDL already
    js includes a patent grant with the maximum possible
    js coverage.

AIUI CDDL makes a user safe from Sun's patents only.  If NetApp contributed code under CDDL, then it would make users safe from NetApp patents applying to code netapp contributed, but NetApp didn't contribute any code so it does nothing.  no surprises here: Sun tries to prevent competitors from making poison contributions, which is something we should all do but is ``making the implicit grant explicit''.

GPLv3 was a response to the patent pact made between Novell and Microsoft, which if it had worked would have made Linux unfree and given control of it to Microsoft and Novell, because one would need to buy a license from Novell to use Linux, and Microsoft could have participated in setting terms for that license, which could be quite elaborate---like when RSA forced people to use the RSAREF library implementation of RSA to benefit from the limited patent grant---so these patent licenses have been used in the past not only to charge people who have source but also to take away software freedom from people who have source.  Their elaborateness can become really nefarious.

The GPLv3 attempted-protection mechanism is: if Novell negotiates any patent indemnity, it must apply to all users, not just Novell's users.  This is exactly what we should want to stay free in the shadow of the NetApp - Oracle deal, but I don't understand the legal mechanism that accomplishes it.  However I don't see anything remotely like this in CDDL, and am pretty sure although not 100% sure that I don't see it because it isn't there.  Unfortunately I do not understand it further, and I'm trying to limit the number of times I repeat myself, so welcome back to my killfile and please feel free to take the last word, but I'll only point out that I feel my understanding is more thorough than yours, Joerg, yet you are more certain your understanding is complete than I am of mine being complete, which is a big warning-sign to anyone who wants to take your blanket assertions as the end of the matter.

    js The interesting thing however is that the FSF
    js (before the GPLv3 exists) claimed that the CDDL is a bad
    js license _because_ of it's patent defense claims.  Now the FSF
    js does the same as the CDDL ;-)

If we are debating the merits of the backing organizations rather than the licenses themselves, then I think the more interesting thing is that Sun enticed a bunch of developers to trust their stewardship of the project by assigning copyright to Sun, then got bought by Oracle and became incapable of upholding their moral commitment, and changed the license to ``no source'', plus ``no commercial use of binaries, no publishing benchmarks,'' and a bunch of other completely crazy unfree boilerplate software oppression.  Your point, if it even survives an unmuddled understanding of the true patent clauses, vanishes next to that reversal.

but merits of backing organization is relevant for deciding about assigning your copyright to another or about including/striking the ``or any later version'' GPL clause.  The interaction between licenses and patents can be discussed apart from reputation, and probably should be, otherwise I would say ``nobody use CDDL because it is backed by Oracle,'' but I don't say that.
js You are obviously wrong here: The GPLv3 is definitevely js incompatible with the GPLv2 and most software does _not_ js include the or any later clause by intention. And you are writing in bad faith, uninformed, and in sentences that aren't internally consistent: GPLv2 with the clause is compatible with GPLv3 by upgrade, so it's not ``definitively'' incompatible. The official FSF-published version of GPLv2 does include the clause, so it would be ``by design'' compatible even if almost everyone struck the clause as you wrongly claim. And while it's overwhelmingly important that Linux kernel does strike the clause, still it is flatly untrue that ``most'' software does not include the clause: I gave examples that do include the clause (gcc and gnu libc and grub and all other FSF projects) while you have no examples at all, but there is no need to debate that since anyone can STFW instead of relying on a consistently unreliable party such as yourself. js OK, you just verified that you are just a troll. We need to js stop the discussion here. Did you miss the part where I said SFLC (authors of GPLv3) and Sun both advise that projects obtain copyright assignment from all developers? that this is normal, and probably a good idea? If so, you probably also missed the examples of good and bad consequences of assignment in the past? and the middle-ground offered by the ``or any later version'' clause? I am not really
Re: [zfs-discuss] ZFS ... open source moving forward?
js == Joerg Schilling joerg.schill...@fokus.fraunhofer.de delivered the following alternate reality of idealogical partisan hackery: js GPLv3 does not give you anything you don't have from CDDL js also. I think this is wrong. The patent indemnification is totally different: AIUI the CDDL makes the implicit patent license explicit and that's it, but GPLv3 does that and goes further by driving in a wedge against patent pacts, somehow. GPLv3 might help with NetApp - Oracle pact while CDDL does not. This is a big difference illustrated through a familiar and very relevant example---not sure how to do better than that, Joerg! js The GPLv3 is intentionally incompatible with the GPLv2 This is definitely wrong, if you dig into the detail more. Most GPLv2 programs include a clause ``or any later version'', so adding one GPLv3 file to them just makes the whole project GPLv3, and there's no real problem. Obviously this clause only makes sense if you trust the FSF, which I do so I include it, but Linus apparently didn't trust them so he struck the clause long ago. so GPLv3 and Apache are compatible while GPLv2 and GPLv3 are not, that is true and is designed. However GPLv2 was also designed to be upgradeable, which was absolutely the FSF's intent, to achieve compatibility, and they have done so with all their old projects like gcc and gnu libc. The usual way to accomplish license upgradeability is to delegate your copyright to the organization you trust to know the difference between ``upgrade'' and ``screw you over.'' That's the method Sun forced upon people who had to sign contributor agreements, and is also the method SFLC advises most new free software projects to adopt: don't let individual developers keep licenses, because they'll become obstinate ossified illogical partisan farts like Joerg, or will not answer email, so you can never ever change the license. FSF gives you this extra ``or any later version'' option to use, which is handy if you trust them to make your software more free in the future yet also want to keep your copyright so YOU can make it less free in the future, if you decide you want to. seems only fair to me, so long as you really did write all of it. GPLv3 is about as incompatible with GPLv2 as ``not giving any source at all'' is incompatible with CDDL. ie, if you delegated your copyright to Sun and contributed under CDDL, Sun has now ``upgraded'' your license to no-source-at-all, which is obviously CDDL-incompatible and by-design. The CDDL of course could never include an ``or any later version'' clause because it would be completely stupid: there's no reason to trust Sun/Oracle. IMHO this is a huge advantage of GPL---it's very easy to future-proof your work, provided you trust the FSF, which I'm sure Joerg does not, but many people do which is lucky for us who do. Joerg doesn't have anyone left to trust: if you donated your copyright to Sun to try to future-proof it against unexpected needed license changes, you're now screwed out of your original intent because they've altered the terms of the deal you thought you were getting. And if your clan of developers won't collectively trust anyone, you also lose because if your understanding of patents evolves in the future, your large old projects who refused-to-trust (like Linux!) are stuck with patent robustness much worse than it needs to be. pgpLNxOOb1hx9.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
ld == Linder, Doug doug.lin...@merchantlink.com writes: ld Very nice. So why isn't it in Fedora (for example)? I think it's slow and unstable? To me it's not clear yet whether it will be the first thing in the Linux world that's stable and has zfs-like capability. If ZFS were GPL it probably would have been, though. and I think I needed many other things from Solaris like zones, COMSTAR, IB, so I'll be trying to get those on Linux too before I can finally ditch these Solaris machines. so, at the time all those things are working, what will the best Linux filesystem be? maybe ZFS. ld I'll believe it when I see it in a big Linux distribution, ld supported like any other FS, and I can use it in production. ld Until then, it doesn't exist. yes. but it is not the license exactly that's keeping it out. I think the license is just annoying some of the Linux developers enough that they prefer to spend their effort elsewhere. ex., OpenBSD is also refusing to accept ZFS because of license, but in their case it is probably ``because we are forced to give source and don't want to''. I agree some of the haggling is stupid, but with all these jackmoves everywhere, saying ``I don't understand all this crap and want to code, so give me a license with a track record I can see, not the Dynacorp Public Goofylicense or something like that,'' is not a totally stupid position. I do wish people would do more than just code and try harder to learn the actual license details, though. pgpfD3JFx7B9z.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
ld == Linder, Doug doug.lin...@merchantlink.com writes: ld This list is for ZFS discussion. There are plenty of other ld places for License Wars and IP discussion. Did you miss the part where ZFS was forked by a license change? Did you miss Solaris Express 11 coming out with no source? Do you not understand everyone is looking for a place to get maintenance on their zpools without getting screwed over? and that whatever few people not too disgusted to walk away, like pjd and NetBSD and kqinfotech and so on, must worry about where to commit their patches and under what license they may use, or at least ``continue delegating to Sun, or stop?'' How can yuo call this OT at this point? ld I really don't care at all about licenses. I think you should start caring, because they affect you. Obviously your care is up to you, but you're also the one who offered to discuss it! ld Folks, I very much did not intend to start, nor do I want to ld participate in or perpetuate, any religious flame wars. yeah, but you're creating more drama by trying to cut off drama than you would by just letting people discuss. Sometimes these threads of ``excuse me but you are a flamer / no U / folks folks attention please everyone calm down / woah woah woah didn't mean to get your panties in a bunch'' is the real content-free post, not the actual disagreement which has some content in it. ld Is the issue important? Sure. Do I have time or interest to ld worry about niggly little details? No. Then you're lazy. Don't demand that others be lazy, too, because you're not only too lazy to care, but you're too lazy to skip their messages that you don't care about! ld personally very geeky about seems *hugely* important and you ld can't understand why others don't see that. Maybe it bugs you ld when people use GPL to mean open source, but the fact is ld that lots and lots of people do. It bugs me when Stallman ld tries to get everyone to use the ridiculous GNU/Linux, as if ld anyone would ever say that. It bugs me when people say I ld *could* care less. But I live with these things. If you live with them, why not live with them quietly? Listing what you don't care about is a lot less useful than talking about things that only some people care about. I think virtually no one cares to keep track of what unique things you don't care about, yet confusingly you seem to present your post as a way to avoid useless discussion. You already know others DO care about it, so? ld I regret and apologize for my callous disregard in casually ld tossing around a clearly incendiary term like GPL. no problem! But if you really regret it then you won't mind when you do it again and get corrected again. pgpIyZ1kwS2kR.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
bf == Bob Friesenhahn bfrie...@simple.dallas.tx.us writes: bf Perhaps it is better for Linux if it is GPLv2, but probably bf not if it is GPLv3. That's my understanding: GPLv3 is the one you would need to preserve software freedom under deals like NetApp-Oracle patent pact, http://www.gnu.org/licenses/rms-why-gplv3.html#patent-protection but GPLv3 is not compatible with Linux because the kernel is GPLv2 but stupidly/stubbornly deleted the ``or any later version'' language, meaning GPLv3 is not any more Linux-compatible than CDDL. however given how widely-used binary modules are to supposedly get around the license incompatibility, many might consider the GPLv3 patent protections worth more than license compatibility, if your goal is software freedom, or a predictable future for your business. pgphyRH6AbXxf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
rs == Robert Soubie robert.sou...@free.fr writes:

    rs Don't you forget that these companies also do much of their
    rs business in foreign countries (Europe, Asia) where software
    rs patenting is not allowed,

dated myth.  software patents do exist in europe, and the EPO has issued them.  Fewer are issued, and then there's more of an enforceability question because unlike the US, Europe has true federalism, but they still exist.  If you google for 'software patents europe' there is stuff explaining this on the first page.

The EU patent debate seems to me about fighting attempts to globally homogenize patents so that mountains of new patents would suddenly become valid in Europe, and companies could jurisdiction-shop so you would lose democratic control of the system's future.  It's definitely not as simple or as good as ``preserve the status quo of no software patents.''  The European status quo is already not good enough to be safe.  It's just vastly better than the future WIPO ASSO wants to bring you.

    rs where American law is not applicable,

Unfortunately I think American law is always applicable because it seems patent law lets you sue almost anyone you like---the guy who wrote it, the company that distributed it, the customer who bought it.  Only one has to be American, so American patents can be monetized with few Americans involved.  When companies are conducting business negotiations based on the threat of lawsuit rather than the result, these suits don't have to get very far for the blackmail to translate into ``value.''  If there are really European companies opting out of the American market entirely because of patents, I think that's fantastic, but it doesn't seem very plausible with software, where you want a big market more than anything.

    rs And do you really believe that this mailing list is only
    rs devoted to (US) Americans just because the products originated
    rs in the US, and the vernacular is English?

your rage against hegemony or imperialism or empire or whatever you want to whine about this week is misplaced here: if you have a problem with American attitude or with the political landscape of the world, fine, that's smart, me too, whatever, but it's got zero to do with the complication patents add to an Oracle-free ZFS.  Yeah it's really American companies doing almost all this work (sorry, proud Europe!), but anyway being European doesn't mean you can ignore American patents, because even the (unlikely?) best case of suddenly losing the entire American market while suffering no loss from a judgement is still bad enough to kill a company.

What's on-topic is:

 * when do the CDDL patent protections apply?  to deals between Oracle and Netapp?  or is it only protection against Oracle patents?  I think the latter, but then, which Oracle patents?  Suppose:

   + Oracle patents something needed for ZFS crypto

   + Oracle publishes the promised yet-to-be-delivered zfs-crypto paper that's thorough enough to write a compatible implementation

   + Oracle makes no further ZFS source releases, ever

   + Nexenta reimplements zfs-crypto and releases it CDDL with the rest of ZFS

   + Oracle sues Nexenta.  Oracle uses ``discovery'' to get an exhaustive Nexenta customer list.  Oracle sues users of Nexenta.  Oracle monetizes ``Nexenta indemnification pack'' patent licenses and blackmails Nexenta's customers.

   CDDL was meant to create a space that appeared to be safe from the last point.  But CDDL patent stuff is no help here, I think?  so, in effect, patents reduce the software freedoms given by CDDL because, once you fork whatever partial source Oracle deems fit to distribute, you suffer increasing risk of stepping onto an (Oracle-placed!) patent landmine.

 * AIUI Oracle has distributed grub with zfs patches, and grub is GPLv3.  Is this true?  If so, GPLv3 includes stuff to extend patent deals, which was added because GPLv3 was written under the ominous spectre of the Microsoft-Novell Linux indemnification deal.  Does GPLv3 grub extend any of the Netapp deal to those patented algorithms which are used within grub?  The GPLv3 is supposed to do some of this, but I don't know how much.  Is it extended only to grub users for use in grub, or can the patented stuff in grub be used anywhere by anyone who can get a copy of grub: download GPLv3 grub, then use CDDL ZFS in a Linux kmod with Oracle-provided immunity from any Netapp suit related to a ZFS patent used also in grub?  This sounds totally unrealistic to me, so I would guess the GPLv3 protection would be much less, but then what is it?  And anyway, though GPLv3 is meant to mandatorily extend private patent deals, how can any patent protection from the Netapp deal be extended when the deal is secret?  Don't you need some basis to force disclosure of the deal, and some way to define ``all relevant deals''?  If Oracle is defending
Re: [zfs-discuss] ZFS ... open source moving forward?
et == Erik Trimble erik.trim...@oracle.com writes:

    et In that case, can I be the first to say PANIC!  RUN FOR THE
    et HILLS!

Erik I thought most people already understood pushing to the public hg gate had stopped at b147, hence Illumos and OpenIndiana.  it's not that you're wrong, just that you should be in the hills by now if you started out running.

the S11 Express release without source and with its new, more-onerous license than SXCE is new dismal news, and the problems on other projects and the waves of smart people leaving might be even more dismal for opensolaris since in the past there was a lot of integration and a lot of forward progress, but what you were specifically asking about dates in hg was already included in the old bad news AFAIK.  And anyway there was never complete source code, nor source for all new work (drivers), nor source for the stable branch, which has always been a serious problem.

The good news to my view is that Linux may actually be only about one year behind (and sometimes ahead) on the non-ZFS features in Solaris.  FreeBSD is missing basically all of this, ex. jails are really not as thorough as VServer or LXC, but Linux is basically there already:

 * Xen support is better.  Oracle is sinking Solaris Xen support in favour of some old Oracle Xen kit based on Linux, I think?  which is disruptive and annoying for me, because I originally used OpenSolaris Xen to get some isolation from the churn of Linux Xen.  but it means there's a fully-free-software path that's not even less annoying a transition than what Oracle's offering through partially-free uncertain-future tools.

 * Infiniband support in Linux was always good.  They don't have a single COMSTAR system which is too bad, but they have SCST for SRP (non-IP RDMA SCSI, the COMSTAR one that people say works with VMWare), and stgt for iSER (the one that works with the Solaris initiator).

 * instead of Crossbow they have RPS and RFS, which give some performance boost with ordinary network cards, not just with 10gig ones with flow caches.  My understanding's hazy but I think, with an ordinary card, you still have to take an IPI, but it will touch hardly any of the packet on the wrong CPU, so you can still take advantage of per-core caches hot with TCP-flow-specific structures.  I'm not a serious enough developer to know whether RPS+RFS is more or less thorough than the Crossbow-branded stuff, but it was committed to mainline at about the same time as Crossbow.  (a sketch of turning it on is below, after this list.)

 * Dreamhost is already selling Linux zones based on VServer and has been for many years, so there *is* a zones alternative on Linux, and better yet, unlike the incompletely-delivered and eventually removed lx brand, on Linux you get Linux zones with Linux packages and nginx working with epoll and sendfile (on solaris, for me eventport works but sendfile does not).  There's supposedly a total rewrite of VServer in the works called LXC, so maybe that will be the truly good one.  It may take them longer to get sysadmin tools that match zonecfg/zoneadm, but the path is set.

 * LTTng is an attempt at something dtrace-like.  It's still experimental, but has the same idea of large libraries of probes, programs cannot tell if they're being traced or not, and relatively sophisticated bundled analysis tools.  http://multivax.blogspot.com/2010/11/introduction-to-linux-tracing-toolkit.html -- LTTng linux dtrace competitor

The only thing missing is ZFS.  To me it looks like a good replacement for that is years away.
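(the promised RPS/RFS sketch: on a mainline kernel new enough to have it, roughly 2.6.35+, it's just sysfs/procfs knobs.  interface name and CPU mask here are made-up examples, so adjust for your own box:

  # spread receive processing for eth0 rx queue 0 across CPUs 0-3 (mask 0xf)
  echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
  # global RFS flow table, plus this queue's share of it
  echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

I haven't benchmarked these particular numbers; they're the shape of the thing, not tuning advice.)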
I'm not excited about ocfs, or about kernel module ZFS ports taking advantage of the Linus kmod ``interpretation'' and the grub GPLv3 patent protection. Instead I'm hoping they skip this stage and style of storage and go straight to something Lustre-like that supports snapshots. I've got my eye on ceph, and on Lustre itself of course because of the IB support. ex perhaps in the end you will have 64 - 256MB of atftpd-provided initramfs which never goes away where init and sshd and libc and all the complicated filesystem-related userspace lives, so there is no more problems of running /usr/sbin/zpool off of a ZFS---you will always be able to administrate your system even if every ``disk'' is hung (or if cluster access is disrupted). and there will not be a complexity difference between a laptop with local disks and cluster storage---everything will be the full-on complicated version. I feel ZFS doesn't scale small enough for phones, nor big enough for what people are already doing in data centers, so why not give up on small completely and waste even more RAM and complexity in the laptop case? and one of the most interesting appnotes to me about ZFS is this one relling posted long ago: http://docs.sun.com/app/docs/doc/820-7821/girgb?a=view which is an extremely limited analog of what ceph and Lustre do, where compute and storage nodes do not necessarily need
Re: [zfs-discuss] ashift and vdevs
dm == David Magda dma...@ee.ryerson.ca writes:

    dm The other thing is that with the growth of SSDs, if more OS
    dm vendors support dynamic sectors, SSD makers can have
    dm different values for the sector size

okay, but if the size of whatever you're talking about is a multiple of 512, we don't actually need (or, probably, want!) any SCSI sector size monkeying around.  Just establish a minimum write size in the filesystem, and always write multiple aligned 512-sectors at once instead.  the 520-byte sectors you mentioned can't be accommodated this way, but for 4kByte it seems fine.

    dm to allow for performance changes as the technology evolves.
    dm Currently everything is hard-coded,

XFS is hardcoded.  NTFS has settable block size.  ZFS has ashift (almost).  ZFS slog is apparently hardcoded though.  so, two of those four are not hardcoded, and the two hardcoded ones are hardcoded to 4kByte.

    dm Until you're in a virtualized environment.  I believe that in
    dm the combination of NetApp and VMware, a 64K alignment is best
    dm practice last I heard.  Similarly with the various stripe widths
    dm available on traditional RAID arrays, it could be advantageous
    dm for the OS/FS to know it.

There is another setting in XFS for RAID stripe size, but I don't know what it does.  It's separate from the (unsettable) XFS block size setting.  so...this 64kByte thing might not be the same thing as what we're talking about so far...though in terms of aligning partitions it's the same, I guess.
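fwiw, the XFS stripe setting I mean is the su/sw (stripe unit / stripe width) geometry you can hand to mkfs.  A plausible example for an array with a 64 kB chunk across 8 data disks would be something like the below---device name made up, and I haven't measured how much it actually buys you:

  # tell XFS the underlying RAID geometry: 64 kB stripe unit, 8 data disks wide
  mkfs.xfs -d su=64k,sw=8 /dev/sdb1
  # an existing filesystem reports its idea of the geometry here
  xfs_info /mnt/data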
Re: [zfs-discuss] ashift and vdevs
kd == Krunal Desai mov...@gmail.com writes:

    kd http://support.microsoft.com/kb/whatever

dude.  seriously?  This is worse than a waste of time.  Don't read a URL that starts this way.

    kd Windows 7 (even with SP1) has no support for 4K-sector
    kd drives.

NTFS has 4KByte allocation units, so all you have to do is make sure the NTFS partition starts at an LBA that's a multiple of 8, and you have full performance.  Probably NTFS is the reason WD has chosen 4kByte.  Linux XFS is also locked at 4kByte sector size, because that's the VM page size and XFS cannot use any other block size than the page size.  so, 4kByte is good (except for ZFS).

    kd can you explicate further about these drives and their
    kd emulation (or lack thereof), I'd appreciate it!

further explication: all drives will have the emulation, or else you wouldn't be able to boot from them.  The world of peecees isn't as clean as you imagine.

    kd which 4K sector drives offer a jumper or other method to
    kd completely disable any form of emulation and appear to the
    kd host OS as a 4K-sector drive?

None that I know of.  It's probably simpler and less silly to leave the emulation in place forever than to start adding jumpers and modes and more secret commands.

It doesn't matter what sector size the drive presents to the host OS, because you can get the same performance character by always writing an aligned set of 8 sectors at once, which is what people are trying to force ZFS to do by adding 3 to ashift.  Whether the number is reported by some messy new invented SCSI command, input by the operator, or derived by a mini-benchmark added to format/fmthard/zpool/whatever-applies-the-label, this is done once for the life of the disk, and after that happens, whenever the OS needs this number it's gotten by issuing READ on the label.  Day-to-day, the drive doesn't need to report it.

Therefore, it is ``ability to accommodate a minimum-aligned-write-size'' which people badly want added to their operating systems, and no one sane really cares about automatic electronic reporting of true sector size.

Unfortunately (but predictably) it sounds like if you 'zpool replace' a 512-byte drive with a 4096-byte drive you are screwed.  therefore even people with 512-byte drives might want to set their ashift for 4096-byte drives right now.  This is another reason it's a waste of time to worry about reporting/querying a drive's ``true'' sector size: for a pool of redundant disks, the needed planning's more complicated than query-report-obey.

Also did anyone ever clarify whether the slog has an ashift?  or is it forced-512?  or derived from whatever vdev will eventually contain the separately-logged data?  I would expect generalized immediate Caring about that since no slogs except ACARD and DDRDrive will have 512-byte sectors.
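If anyone wants to experiment with forcing the larger ashift today, a sketch of the two routes people usually mention follows.  Neither is stock snv_134-era behaviour: the -o ashift property only exists on ZFS ports/builds that have grown it, the gnop trick is FreeBSD-specific, and the device names are placeholders, so treat this as illustration rather than a recipe:

  # route 1: where zpool has an ashift property (e.g. zfsonlinux, some newer illumos builds)
  zpool create -o ashift=12 tank mirror c3t43d0 c3t48d0

  # route 2: FreeBSD -- wrap one disk in a gnop device that advertises 4k sectors,
  # create the pool through it, then drop the gnop layer
  gnop create -S 4096 /dev/ada0
  zpool create tank mirror /dev/ada0.nop /dev/ada1
  zpool export tank
  gnop destroy /dev/ada0.nop
  zpool import tank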
Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf
t == taemun tae...@gmail.com writes: t I would note that the Seagate 2TB LP has a 0.32% Annualised t Failure Rate. bullshit. pgpsMvTxl5Ghd.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
zu == zfs user zf...@itsbeen.sent.com writes: djm == Darren J Moffat darr...@opensolaris.org writes: zu Ugh, we all know that the first rule of crytpo is that any zu proprietary, closed source, black-box crypto is crap, blah, zu blah, blah (I am not sure what the point of repeating that zu tired line is) and I am not one to give Oracle an inch but zu wtf? They just released this crap, give them a minute My educated guess would be that the other encrypted systems released papers about the algorithm either concurrently with the implementation, or sometimes BEFORE the implementation, but not after. It's just silly to think geli or dmcrypt would expect anyone to use them without explaining the algorithm and exposing it to review. Also, Darren has been working on this for THREE YEARS, and he committed it just weeks after the ``opensolaris now closed'' announcement and hg pushing stopped. so, any time in the last three years would have been a better and more reasonable time to release a paper than tomorrow, after the binary proprietary release of the implementation has happened. This would eliminate the need for my objection as well as give the crypto community time to advise Darren's design, which is something I'm surprised he didn't want as much of as possible, but so be it: he's the one doing the work, and good for him, and since based on hints he's dropped I suspect the work is quite good, I'm more interested in reviewing the work that's there than whinging about preciesly how it was done or how long it took or when I can get it. For all that, I'll gladly wait. I just think firstly that the design needs review before trust, and secondly that it's starkly enough against best practice to be borderline irresponsible to release the work at all without subjecting the design to peer review. zu anything we have seen so far from Oracle shows us is that they zu are slow to move with external communication about Solaris. yeah, well. what happened after you ``waited'' last time? When people like me were saying ``not all of opensolaris is free software. In fact the free component is shockingly small, albeit an important component,'' and ``the full development cycle from hg to livecd needs to be freed, like it is on *BSD (build.sh) and RHEL (CentOS), so that the project can be forked if, god forbid, it needs to be---forking is bad, but forkability is a key component of freedom,'' and ``it is a problem that the toolchain is proprietary'', people like you said ``just give them time.'' I think we actually did quietly get a few big chunks liberated just by waiting, but still, in the end, you gave them too much time: openindiana and illumos are now struggling to solve parts of these problems without certainty of success, are rushed because Nexenta's business depends on them, and people who have invested in the platform thinking its freedom gave it a stable future are now sitting on many terabytes of locked-in data and many man-hours of doomed scriptage. While the disaster is certainly not complete and some gradual-transition outcomes remain possible, your ``give them time'' advice is basically dead wrong, according to history. How can you say that now? I don't get it. Finally, there's a problem with the style of argument. Not everything on a mailing list is ``$ENTITY sucks/rules.'' I'm allowed to say something critical without implicitly saying ``everything Oracle does and everything they touch is wrong and evil and should be burned with torches.'' I don't really care about Oracle at all. 
What I said was much more specific, and there's no cause to wait before saying ``I will not take zfs crypto seriously so long as it's a black box.'' The right time to say that is NOW. so, no, I disagree: do not give them time. Wait for the paper, or more likely for the actual source, before using ZFS crypto. That is what you should do with your Time. djm It is a work in progress. Fine, and good. I thought it might be. In the unlikely event there was any impediment to your writing, and releasing, the paper, hopefully my complaining will be one among many things that helps remove it. Really, it is just mandatory. pgpogmN8mbJjZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Ideas for ghetto file server data reliability?
sl == Sigbjorn Lie sigbj...@nixtra.com writes: sl Do you need registered ECC, or will non-reg ECC do registered means the same thing as buffered. It has nothing to do with registering to some kind of authority---it's a register like the accumulators inside CPU's. The register allows more sticks per channel at the questionably-relevant cost of ``latency.'' Lately, more than two sticks per channel seems to require registers. Your choice of motherboard (and the memory controller implied by that choice) decides whether the memory must be registered or must be unregistered, and I don't know of any motherboards that will take both kinds (though I bet there are some out there, somewhere in history). There are other weird kinds of memory connection besides just registered and unregistered, but everything has higher latency than ``unregistered''. None of this has anything to do with ECC, though it may sometimes seem to since both registers and ECC cost money so tightly cost-constrained systems might tend to have neither, and quantities go down and profit margins get immediately jacked up once you ask for either of the two. hth. :/ pgpwc9fQAUyLZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
djm == Darren J Moffat darr...@opensolaris.org writes: djm http://blogs.sun.com/darren/entry/introducing_zfs_crypto_in_oracle djm http://blogs.sun.com/darren/entry/assued_delete_with_zfs_dataset djm http://blogs.sun.com/darren/entry/compress_encrypt_checksum_deduplicate_with Is there a URL describing the on-disk format and implementation details? djm Encryption at the application layer solves a different set of djm problems to encryption at the storage layer. black-box crypto is snake oil at any level, IMNSHO. Congrats again on finishing your project, but every other disk encryption framework I've seen taken remotely seriously has a detailed paper describing the algorithm, not just a list of features and a configuration guide. It should be a requirement for anything treated as more than a toy. I might have missed yours, or maybe it's coming soon. pgphDwX1ujOx9.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
tc == Tim Cook t...@cook.ms writes:

    tc Channeling Ethernet will not make it any faster.  Each
    tc individual connection will be limited to 1gbit.  iSCSI with
    tc mpxio may work, nfs will not.

well...probably you will run into this problem, but it's not necessarily totally unsolved.  I am just regurgitating this list again, but:

need to include the L4 port number in the hash:

  http://www.cisco.com/en/US/products/ps9336/products_tech_note09186a0080a963a9.shtml#eclb
  port-channel load-balance mixed     -- for L2 etherchannels
  mls ip cef load-sharing full        -- for L3 routing (OSPF ECMP)

nexus makes all this more complicated.  there are a few ways that seem like they'd be able to accomplish ECMP:

  FTag flow markers in ``FabricPath'' L2 forwarding
  LISP
  MPLS

the basic scheme is that the L4 hash is performed only by the edge router and used to calculate a label.  The routing protocol will either do per-hop ECMP (FabricPath / IS-IS) or possibly some kind of per-entire-path ECMP for LISP and MPLS.  unfortunately I don't understand these tools well enough to lead you further, but if you're not using infiniband and want to do 10-way ECMP this is probably where you need to look.

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942
  feature added in snv_117: NFS client connections can be spread over
  multiple TCP connections when rpcmod:clnt_max_conns is set to a
  value > 1.  However, ``Even though the server is free to return data
  on different connections, [it does not seem to choose to actually do
  so]'' -- 6696163, fixed snv_117

  nfs:nfs3_max_threads=32 in /etc/system, which changes the default 8
  async threads per mount to 32.  This is especially helpful for NFS
  over 10Gb and sun4v

this stuff gets your NFS traffic onto multiple TCP circuits, which is the same thing iSCSI multipath would accomplish.  From there, you still need to do the cisco/??? stuff above to get TCP circuits spread across physical paths.

  http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html
  -- suspect.  it advises ``just buy 10gig'' but many other places say
  10G NIC's don't perform well in real multi-core machines unless you
  have at least as many TCP streams as cores, which is honestly kind
  of obvious.  lego-netadmin bias.
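to make the Solaris-side half of that concrete, the two tunables end up in /etc/system roughly like this.  the clnt_max_conns value is just an example, not something I've benchmarked, and you need a reboot (or at least a remount) for them to take:

  * /etc/system -- spread NFS client RPC over several TCP connections
  set rpcmod:clnt_max_conns = 8
  * more async threads per NFS mount (default is 8)
  set nfs:nfs3_max_threads = 32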
Re: [zfs-discuss] Does a zvol use the zil?
re == Richard Elling richard.ell...@gmail.com writes: re it seems the hypervisors try to do crazy things like make the re disks readonly, haha! re which is perhaps the worst thing you can do to a guest OS re because now it needs to be rebooted I might've set it up to ``pause'' the VM for most failures, and for punts like this read-only case, maybe leave it paused until someone comes along to turn it off or unpause it. But for loss of connection to an iSCSI-backed disk, I think that's wrong. I guess the truly correct failure handling would be to immediately poweroff the guest VM: pausing it tempts the sysadmin to fix the iscsi connection and unpause it, which in this case is the only real disaster-begging thing to do. One would get a lot of complaints from sysadmins who don't understand the iscsi write hole, but I think it's right. so...in that context, maybe read-only-until-reboot is actually not so dumb! For guests unknowingly getting their disks via NFS, it would make sense to pause the VM to stop (some of) its interval timer(s), (and hope you get the timer running the ATA/SCSI/... driver among the stopped ones) because the guest's disk driver won't understand NFS hard mount timeout rules---won't understand that, for certain errors, you can pass ``stale file handle'' up the stack, but for other errors you must wait forever. Instead they'll enforce a 30-second timeout like for an ATA disk. I think you could probably still avoid losing the 'write B' if the guest fired its ATA timeout with an NFS-backed disk because the writes have already been handed off to the host. It might be weird user experience in the VM manager because whatever process is doing the NFS writes will be unkillable 'D' state even if you poweroff the VM, but this weirdness is an expression of arcane reality, not a bug. It'd be better sysadmin experience to avoid the guest ATA timeout, though: pause the VM and resume so that NFS server reboots would freeze guests for a while, not require rebooting them, just like they do for nonvirtual NFSv3 clients. You would have to figure out the maximum number of seconds the guests can go without disk access, and deviously pause them before their burried / proprietary disk timeouts can fire. pgpYvvjgSY5Gl.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does a zvol use the zil?
re == Richard Elling richard.ell...@gmail.com writes:

    re The risk here is not really different than that faced by
    re normal disk drives which have nonvolatile buffers (eg
    re virtually all HDDs and some SSDs).  This is why applications
    re can send cache flush commands when they need to ensure the
    re data is on the media.

It's probably different because of the iSCSI target reboot problem I've written about before:

   iSCSI initiator        iSCSI target         nonvolatile medium
   write A        ->
                  <-      ack A
   write B        ->
                  <-      ack B
                                        ->     [A]
                          [REBOOT]
   write C        ->      [timeout!]
   reconnect      ->
                  <-      ack Connected
   write C        ->
                  <-      ack C
   flush          ->
                                        ->     [C]
                  <-      ack Flush

in the above time chart, the initiator thinks A, B, and C are written, but in fact only A and C are written.  I regard this as a failing of imagination in the SCSI protocol, but probably with better understanding of the details than I have, the initiator could be made to provably work around the problem.  My guess has always been that no current initiators actually do, though.

I think it could happen also with a directly-attached SATA disk if you remove power from the disk without rebooting the host, so as Richard said it is not really different, except that in the real world it's much more common for an iSCSI target to lose power without the initiator's also losing power than it is for a disk to lose power without its host adapter losing power.

The ancient practice of unix filesystem design always considers cord-yanking as something happening to the entire machine, and failing disks are not the filesystem's responsibility to work around, because how could it?  This assumption should have been changed and wasn't, when we entered the era of RAID and removable disks, where the connections to disks and disks themselves are both allowed to fail.

However, when NFS was designed, the assumption *WAS* changed, and indeed NFSv2 and earlier operated always with the write cache OFF to be safe from this, just as COMSTAR does in its (default?) abysmal-performance mode (so campuses bought prestoserve cards (equivalent to a DDRDrive except much less silly because they have onboard batteries), or auspex servers with included NVRAM, which are analogous outside the NFS world to netapp/hitachi/emc FC/iSCSI targets which always have big NVRAM's so they can leave the write cache off), and NFSv3 has a commit protocol that is smart enough to replay the 'write B', which makes the nonvolatile caches less necessary (so long as you're not closing files frequently, I guess?).

I think it would be smart to design more storage systems so NFS can replace the role of iSCSI, for disk access.  In Isilon or Lustre clusters this trick is common when a node can settle with unshared access to a subtree: create an image file on the NFS/Lustre back-end and fill it with an ext3 or XFS, and writes to that inner filesystem become much faster because this rube goldberg arrangement discards the close-to-open consistency guarantee.

We might use it in the ZFS world for actual physical disk access instead of iSCSI, ex., it should be possible to NFS-export a zvol and see a share with a single file in it named 'theTarget' or something, but this file would be without read-ahead.  Better yet, to accommodate VMWare limitations, would be to export a single fake /zvol share containing all NFS-shared zvol's, and as you export zvol's their files appear within this share.  Also it should be possible to mount vdev elements over NFS without deadlocks---I know that is difficult, but VMWare does it.
Perhaps it cannot be done through the existing NFS client, but obviously it can be done somehow, and it would both solve the iSCSI target reboot problem and also allow using more kinds of proprietary storage backend---the same reasons VMWare wants to give admins a choice apply to ZFS.  When NFS is used in this way the disk image file is never closed, so the NFS server will not need a slog to give good performance: the same job is accomplished by double-caching the uncommitted data on the client so it can be replayed if the time diagram above happens.
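A minimal sketch of the NFS-instead-of-iSCSI idea, for anyone who wants to play with the file-backed flavour rather than the fake-/zvol share I'm wishing for above.  dataset name, export options, hostnames, and mount flags are all placeholders:

  # on the ZFS box: a dataset full of disk images, exported over NFS
  zfs create -o sharenfs=rw=vmhost,root=vmhost tank/images
  mkfile -n 100g /tank/images/guest0.img

  # on the (Linux-ish) hypervisor: mount it and point the guest's disk at the file
  mount -o vers=3,proto=tcp,hard zfsbox:/tank/images /images
  # then configure the VM's disk to be /images/guest0.img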
Re: [zfs-discuss] Bursty writes - why?
en == Eff Norwood sm...@jsvp.com writes:

    en We also tried SSDs as the ZIL which worked ok until they got
    en full, then performance tanked.  As I have posted before, SSDs
    en as your ZIL - don't do it!

yeah, iirc the thread went back and forth between you and I for a few days, something like this:

  you: SSD's work fine at first, then slow down, see this anandtech article.  We got bit by this.

  me: That article is two years old.  Read this other article which is one year old and explains the problem is fixed if you buy a current gen2 intel or sandforce-based SSD.

  you: Well, absent test results from you, I think we will just have to continue believing that all SSD's gradually slow down like I said, though I would love to be proved wrong.

  me: You haven't provided any test results yourself, nor even said what drive you're using.  We've both just cited anandtech, and my citation's newer than yours.

  you: I welcome further tests that prove the DDRDrive is not the only suitable ZIL, but absent these tests we have to assume I'm right that it is.

silly!

slowdowns with age:

  http://www.pcper.com/article.php?aid=669
  http://www.anandtech.com/show/2738/15

slowdowns fixed:

  http://www.anandtech.com/show/2899/8
  ``With the X25-M G2 Intel managed to virtually eliminate the random-write performance penalty on a sequentially filled drive.  In other words, if you used an X25-M G2 as a normal desktop drive, 4KB random write performance wouldn't really degrade over time.  Even without TRIM.''
  http://www.anandtech.com/show/2738/25

note this is not advice to buy sandforce for slog, because I don't know if anyone's tested that it respects flush-cache commands, and I suspect it may drop them.

summary: There's probably been major, documented shifts in the industry between when you tested and now, but no one knows because you don't even tell what you tested or how---you just spread FUD and flog the DDRDrive and then say ``do research to prove me wrong or else my hazy statement stands.''  bad science.
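for anyone who'd rather run the experiment than argue about it: attaching and detaching a candidate SSD as a slog is cheap to try.  device name below is a placeholder, and log-device removal needs a pool version that supports it (roughly zpool version 19 and up), so check yours first:

  # add the SSD as a separate intent log, hammer it with a synchronous workload, watch latency
  zpool add tank log c5t0d0
  zpool iostat -v tank 5
  # take it back out if it disappoints
  zpool remove tank c5t0d0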
Re: [zfs-discuss] nfs issues
tb == Thomas Burgess wonsl...@gmail.com writes:

    tb I'm running b134 and have been for months now, without issue.
    tb Recently i enabled 2 services to get bonjour notifications
    tb working in osx
    tb     /network/dns/multicast:default
    tb     /system/avahi-bridge-dsd:default
    tb and i added a few .service files to /etc/avahi/services/
    tb ever since doing this, nfs keeps crashing

try changing the 'hosts' key in /etc/nsswitch.conf to:

-8<-
hosts: files mdns dns
-8<-
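if editing nsswitch.conf alone doesn't take, the belt-and-braces version I'd try is roughly the below---the first FMRI is the one you quoted, the other two are just to make nscd and the NFS server re-read things, and I haven't verified they're actually needed:

  # after putting "hosts: files mdns dns" in /etc/nsswitch.conf
  svcadm restart svc:/network/dns/multicast:default
  svcadm restart svc:/system/name-service-cache:default   # nscd, if it's running
  svcadm restart svc:/network/nfs/server:default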
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
nw == Nicolas Williams nicolas.willi...@oracle.com writes:

    nw The current system fails closed

wrong.

  $ touch t0
  $ chmod 444 t0
  $ chmod A0+user:$(id -nu):write_data:allow t0
  $ ls -l t0
  -r--r--r--+  1 carton   carton         0 Oct  6 20:22 t0

now go to an NFSv3 client:

  $ ls -l t0
  -r--r--r-- 1 carton 405 0 2010-10-06 16:26 t0
  $ echo lala > t0
  $

wide open.  NFSv3 and SMB sharing the same dataset is a use-case you claim to accommodate.  This case fails open once Windows users start adding 'allow' ACL's.  It's not a corner case; it's a design that fails open.

     ever had 777 it would send a SIGWTF to any AFS-unaware graybeards

    nw A signal?!  How would that work when the entity doing a chmod
    nw is on a remote NFS client?

please find SIGWTF under 'kill -l' and you might understand what I meant.

    nw You seem to be in denial.  You continue to ignore the
    nw constraint that Windows clients must be able to fully control
    nw permissions in spite of their inability to perceive and modify
    nw file modes.

You remain unshakably certain that this is true of my proposal in spite of the fact that you've said clearly that you don't understand my proposal.  That's bad science.  It may be my fault that you don't understand it: maybe I need to write something shorter but just as expressive to fit within mailing list attention spans, or maybe my examples are unclear.  However that doesn't mean that I'm in denial, nor make you right---that just makes me annoying.

-- 
READ CAREFULLY.  By reading this fortune, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (BOGUS AGREEMENTS) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges.  You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
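to make the failure easier to reproduce and inspect on your own pool, the Solaris-side commands for poking at this are roughly the below.  the dataset name is a placeholder, and whether any aclmode value rescues the NFSv3 view is exactly the argument above, not something these commands settle:

  # show the full ACL, not just the mode-bit summary
  ls -V t0
  # strip the trailing ACE added in the transcript above (index 0)
  chmod A0- t0
  # dataset-wide knobs that decide how chmod and ACLs interact
  zfs get aclmode,aclinherit tank/home
  zfs set aclmode=groupmask tank/home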
Re: [zfs-discuss] TLER and ZFS
ag == Andrew Gabriel andrew.gabr...@oracle.com writes: ag Having now read a number of forums about these, there's a ag strong feeling WD screwed up by not providing a switch to ag disable pseudo 512b access so you can use the 4k native. this reporting lie is no different from SSD's which have 2 - 8 kB sectors on the inside and benefit from alignment. I think probably everything will report 512 byte sectors forever. If a device had a 4224-byte sector, it would make sense to report that, but I don't see a big downside to reporting 512 when it's really 4096. NAND flash often does have sectors with odd sizes like 4224, and (some of) Linux's NAND-friendly filesystems (ubifs, yaffs, nilfs) use this OOB area for filesystem structures, which are intermixed with the ECC. but in that case it's not a SCSI interface to the odd-sized sector---it's an ``mtd'' interface that supports operations like ``erase page'', ``suspend erasing'', ``erase some more''. that said I am in the ``ignore WD for now'' camp. but this isn't why. Ignore them (among other, better reasons) because they have 4k sectors at all which don't yet work well until we can teach ZFS to never write smaller than 4kB. but failure to report 4k as SCSI 4kB sector is not a problem, to my view. You can just align your partitions. pgp6jwIDoUJ9i.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
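checking the alignment is quick, for what it's worth---something like the below on a VTOC-labeled disk (device name is an example, and this uses nawk; a start sector divisible by 8 means the slice begins on a 4 kByte boundary):

  # print each slice's starting sector and whether it is 4 kByte aligned
  prtvtoc /dev/rdsk/c3t43d0s2 | nawk '$1 ~ /^[0-9]+$/ {
      a = ($4 % 8 == 0) ? "aligned" : "NOT aligned";
      print "slice", $1, "start", $4, a }'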
Re: [zfs-discuss] TLER and ZFS
dd == David Dyer-Bennet d...@dd-b.net writes: dd Richard Elling said ZFS handles the 4k real 512byte fake dd drives okay now in default setups There are two steps to handling it well. one is to align the start of partitions to 4kB, and apparently on Solaris (thanks to all the cumbersome partitioning tools) that is done. On Linux you often have to really pay attention to make this happen, depending on the partitioning tool that happens to be built into your ``distro'' or whatever. The second step is to never write anything smaller than 4kB. ex., if you want to write 0.5kB, pad it with 3.5kB of zeroes to avoid the read-modify-write penalty. AIUI that is not done yet, and zfs does sometimes want to write 0.5kB. When it's writing 128kB of course there is no penalty. For this, I think XFS and NTFS are actually better and tend not to write the small blocks, but I could be wrong. pgpn3kSSlfThy.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
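For what the second step means in practice, here is an illustration only (not what ZFS does internally): dd's conv=sync zero-fills a short 512-byte payload out to one full 4 KiB block, so the drive never has to read-modify-write a partial physical sector. payload.bin is a hypothetical 512-byte file.

    dd if=payload.bin of=padded.bin bs=4096 count=1 conv=sync
    ls -l padded.bin    # 4096 bytes: 512 bytes of data plus 3584 bytes of zero padding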
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
nw == Nicolas Williams nicolas.willi...@oracle.com writes: nw *You* stated that your proposal wouldn't allow Windows users nw full control over file permissions. me: I have a proposal you: op! OP op, wait! DOES YOUR PROPOSAL blah blah WINDOWS blah blah COMPLETELY AND EXACTLY LIKE THE CURRENT ONE. me: no, but what it does is... you: well then I don't even have to read it. It's unacceptable because $BLEH. me: untrue. My proposal handles $BLEH just fine. you: you just said it didn't! me: well, it does. Please read it. you: I read it and I don't understand it. Anyway it doesn't handle $BLEH so it's no good. This is not really working, and concision is the problem. so, I now, today, state: My proposal allows Windows users full control over file permissions. nw Yes, that may be. I encourage you to find a clearer way to nw express your proposal. So far, it's just us talking. I think I'll wait and see if anyone besides you reads it. If so, maybe they can ask questions that help me clarify it. If no one does, it's probably not interesting here anyway. pgp4wuhrA1SzN.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS crypto bug status change
dm == David Magda dma...@ee.ryerson.ca writes: dm Thank you Mr. Moffat et al. Hopefully the rest of us will be dm able to bang on this at some point. :) Thanks for the heads-up on the gossip. This etiquette seems weird, though: I don't thank Microsoft for releasing a new version of Word. I'll postpone my thanks for 2 years until the source is released, though by then who knows if I'll still be using ZFS at all. Maybe more appropriate would be: congrats on finally finishing your seven-year project, Darren! must be a huge relief. I'm glad it wasn't my project, though. If I were in Darren's place I'd have signed on to work for an open-source company, spent seven years of my life working on something, delaying it and pushing hard to make it a generation beyond other filesystem crypto, and then when I'm finally done, yoink!. That's me, though. I shouldn't speculate on someone else's situation. Maybe he signed on under different circumstances, or delayed for different reasons than feature-ambition, or cares about different things than I do. I only mean to make an example of how politics, featuresets, and IT planning interact to make an ecosystem that's got more complicated implications than just a bulleted list of features and a license with an OSI logo. -- READ CAREFULLY. By reading this fortune, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (BOGUS AGREEMENTS) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. pgpxfnP4VSj9Z.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
nw == Nicolas Williams nicolas.willi...@oracle.com writes: nw Keep in mind that Windows lacks a mode_t. We need to interop nw with Windows. If a Windows user cannot completely change file nw perms because there's a mode_t completely out of their nw reach... they'll be frustrated. well...AIUI this already works very badly, so keep that in mind, too. In AFS this is handled by most files having 777, and we could do the same if we had an AND-based system. This is both less frustrating and more self-documenting than the current system. In an AND-based system, some unix users will be able to edit the windows permissions with 'chmod A...'. In shops using older unixes where users can only set mode bits, the rule becomes ``enforced permissions are the lesser of what Unix people and Windows people apply.'' This rule is easy to understand, not frustrating, and readily encourages ad-hoc cooperation (``can you please set everything-everyone on your subtree? we'll handle it in unix.'' / ``can you please set 777 on your subtree? or 770 group windows? we want to add windows silly-sid-permissions.''). This is a big step better than existing systems with subtrees where Unix and Windows users are forced to cooperate. It would certainly work much better than the current system, where you look at your permissions and don't have any idea whether you've got more, less, or exactly the same permission as what your software is telling you: the crappy autotranslation teaches users that all bets are off. It would be nice if, under my proposal, we could delete the unix tagspace entirely: chpacl '(unix)' chmod -R A- . but unfortunately, deletion of ACL's is special-cased by Solaris's chmod to ``rewrite ACL's that match the UNIX permissions bits,'' so it would probably have to stay special-cased in a tagspace system. pgpzWtQEMyslr.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
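To make the AND rule concrete, here is a hypothetical transcript (user and filename made up); the commands are today's chmod/ls syntax, but the evaluation in the comments is the *proposed* behaviour, not what any current build does:

    $ touch report.txt
    $ chmod 444 report.txt                              # unix tag-group: read-only
    $ chmod A0+user:alice:write_data:allow report.txt   # ACL tag-group: alice may write
    # today's first-match evaluation: the allow ACE wins, alice can write
    # proposed AND evaluation: (no write in unix) AND (write in ACL) = no write,
    # so alice gets EACCES until the unix side is opened up, e.g. chmod 777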
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
Can the user in (3) fix the permissions from Windows? no, not under my proposal. but it sounds like currently people cannot ``fix'' permissions through the quirky autotranslation anyway, certainly not to the point where neither unix nor windows users are confused: windows users are always confused, and unix users don't get to see all the permissions. Now what? set the unix perms to 777 as a sign to the unix people to either (a) leave it alone, or (b) learn to use 'chmod A...'. This will actually work: it's not a hand-waving hypothetical that just doesn't play out. What I provide, which we don't have now, is a way to make:
/tub/dataset/a subtree -rwxrwxrwx in old unix [working, changeable permissions] in windows
/tub/dataset/b subtree -rw-r--r-- in old unix [everything: everyone] in windows, but unix permissions still enforced
this means: * unix writers and windows writers can cooperate even within a single dataset * an intuitive warning sign when non-native permissions are in effect, * fewer leaked-data surprises If you accept that the autotranslation between the two permissions regimes is total shit, which it is, then what I offer is the best you can hope for. My proposal also generalizes to other permissions autoconversion problems: * Future ACL translation stupidity that will happen as more bizarre ACL mechanisms are invented, or underspecified parts of current spec make different choices in different OS's. - POSIX -> NFSv4, Darwin -> NFSv4 If Apple provides a Darwin -> NFSv4 translation that's silly, a match for Darwin NFS client IP's in the share string could put these clients into a tagged ACL group. - AFP -> NFSv4 ACL's can be tagged by protocol for new weird protocols. If [new protocol]'s ACLs are a subset of NFSv4 ACL's, then they can be implemented by the bridge and apply to users who don't go through the bridge. The [new protocol] bridge will have an ACLspace all to itself, within which it can be certain nothing but itself will change ACL's, so it can rely on never having to read NFSv4 ACL's that do not match the subset it would feel inclined to write. Unix users will get an everything:everyone or 777 warning that someone else is managing the ACLspace. Yet, Unix users can descend into its private subtrees and muck around with ACL's, and the Unix changes will still get enforced. It's easy to search for all the changes made by Unix, vs all the changes made by [new protocol] bridge, and see if some are important. It's easy to delete all of them at once if someone shouldn't have been mucking around from unix, or if the [new protocol] bridge was unleashed on a dataset that wasn't dedicated to it and made a mess. This is a case where the [new protocol] bridge is using the ACL's for two related but slightly-orthogonal purposes: to enforce security, and to store metadata. My proposal separates the two. - SMB -> NFSv4, NFSv4 -> NFSv4 I get that the NFSv4 ACL's are supposed to match Windows perfectly, but if that turns out to be untrue, Linux and Windows clients could be put in separate ACL groups even though they're both, in theory, using NFSv4 ACL's. * zones running large software packages that have bizarre or misguided ACL behavior ACL's are complicated enough that a lot of programmers will get them wrong. 
If you have a large, assertion-riddled app that will shit itself if it doesn't see the ACL's it expects, or autoset or autoremove ACL's, or does other stupid things with ACL's, you can put it into a zone and configure an ACL tag on the zone, segregating its ACL-writing from the rest of the system. Yet, its restrictions are still respected. If the app were setting ACL's that don't give enough permission, it wouldn't work. but it may have hardcoded crap that stupidly opens up ACL's, or refuses to work if ACL's aren't as open as it thinks they should be. Now you can fake it out whenever it calls getacl, but set other ACL's kept secret from it and still return permission denied when you like. * (optional) a backup mechanism. If you make the choice ``global zone ignores ACLgroups with 'zoned' bit set'', then you can run backups in the global zone that won't be stopped by ACL's set by the inner zones, however you can still limit your backup process's access by adding zoned=0 ACL's. chpacl '(unix)' chmod -R A- . nw Huh? I think you are confused because you didn't read my proposal because it was too long, or the examples I wrote weren't easy to understand. however if I try to repeat it in small pieces, I think it'll just be even longer and harder to understand than the original. What's more, if you don't agree that the
[zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)
rb == Ralph Böhme ra...@rsrc.de writes: rb The Darwin kernel evaluates permissions in a first rb match paradigm, evaluating the ACL before the mode well...I think it would be better to AND them together like AFS did. In that case it doesn't make any difference in which order you do it because AND is commutative. The Darwin method you describe means one might remove permissions with chmod but still have access granted under first-match by the ACL. I just tested, and Darwin does indeed work this way. :( One way to get from NFSv4 to what I want is that you might add EVEN MORE complexity and have ``tagged ACL groups'': * all the existing ACL tools and NFS/SMB clients targeting the #(null) tag, * traditional 'chmod' unix permissions targeting the #(unix) tag. * The evaluation within a tag-group is first-match like now, * The result of each tag-group is ANDed together for the final evaluation When accomodating Darwin ACL's or Windows ACL's or Linux NFSv4 ACL's or translated POSIX ACL's, the result of the imperfect translation can be shoved into a tag-group if it's unclean. The way I would implement the userspace, tools would display all tag groups if given some new argument, but they would always be incapable of editing any tag group except #(null). Another chroot-like tool would swap a given tag-group for #(null) for all child processes: car...@awabagal:~/bar$ ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~/bar$ chmod A+owner@:write_data:deny foo car...@awabagal:~/bar$ ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(null):owner@:write_data:deny # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~/bar$ echo lala foo -bash: foo: Permission denied car...@awabagal:~/bar$ chpacl baz ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo # 0#root:owner@:write_data:deny -- #root is what's mapped to #(null) at boot # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~/bar$ chpacl '(null)' true chpacl: '(null)' is reserved. 
car...@awabagal:~/bar$ chpacl baz chmod A+owner@:read_data:deny foo car...@awabagal:~/bar$ chpacl baz ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(null):owner@:read_data:deny # 0#root:owner@:write_data:deny # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~bar$ cat foo -bash: foo: Permission denied car...@awabagal:~bar$ chpacl baz cat foo -- current tagspace is irrelevant to ACL evaluation -bash: foo: Permission denied car...@awabagal:~/bar$ ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(null):owner@:write_data:deny # 0#baz:owner@:read_data:deny # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow
Re: [zfs-discuss] drive speeds etc
sb == Simon Breden sbre...@gmail.com writes: sb WD itself does not recommend them for 'business critical' RAID sb use The described problems with WD aren't okay for non-critical development/backup/home use either. The statement from WD is nothing but an attempt to upsell you, to differentiate the market so they can tap into the demand curve at multiple points, and to overload you with information so the question becomes ``which WD drive should I buy'' instead of ``which manufacturer's drive should I buy.'' Don't let this stuff get a foothold inside your brain. ``mixing'' drives within a stripe is a good idea because it protects you from bad batches and bad models/firmwares, which are not rare in recent experience! I always mix drives and included WD in that mix up until this latest rash of problems. ``mixing'' is only bad (for WD) because it makes it easier for you, the customer, to characterize the green performance deficit and notice the firmware bugs that are unique to the WD drives. pgpg2mRMPLVGG.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver = defrag?
dd == David Dyer-Bennet d...@dd-b.net writes: dd Sure, if only a single thread is ever writing to the disk dd store at a time. video warehousing is a reasonable use case that will have small numbers of sequential readers and writers to large files. virtual tape library is another obviously similar one. basically, things which used to be stored on tape. which are not uncommon. AIUI ZFS does not have a fragmentation problem for these cases unless you fill past 96%, though I've been trying to keep my pool below 80% because general FUD. dd This situation doesn't exist with any kind of enterprise disk dd appliance, though; there are always multiple users doing dd stuff. the point's relevant, but I'm starting to tune out every time I hear the word ``enterprise.'' seems it often decodes to: (1) ``fat sacks and no clue,'' or (2) ``i can't hear you i can't hear you i have one big hammer in my toolchest and one quick answer to all questions, and everything's perfect! perfect, I say. unless you're offering an even bigger hammer I can swap for this one, I don't want to hear it,'' or (3) ``However of course I agree that hammers come in different colors, and a wise and experienced craftsman will always choose the color of his hammer based on the color of the nail he's hitting, because the interface between hammers and nails doesn't work well otherwise. We all know here how to match hammer and nail colors, but I don't want to discuss that at all because it's a private decision to make between you and your salesdroid. ``However, in this forum here we talk about GREEN NAILS ONLY. If you are hitting green nails with red hammers and finding they go into the wood anyway then you are being very unprofessional because that nail might have been a bank transaction. --posted from opensolaris.org'' pgpqzPhCxoUuU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] performance leakage when copy huge data
ml == Mark Little marklit...@koallo.com writes: ml Just to clarify - do you mean TLER should be off or on? It should be set to ``do not have asvc_t 11 seconds and 1 io/s''. ...which is not one of the settings of the TLER knob. This isn't a problem with the TLER *setting*. TLER does not even apply unless the drive has a latent sector error. TLER does not even apply unless the drive has a latent sector error. TLER does not even apply unless the drive has a latent sector error. GOT IT? so if the drive is not defective, but is erratically having huge latency when not busy, this isn't a TLER problem. It's a drive-is-unpredictable-piece-of-junk problem. Will the problem go away if you change the TLER setting to the opposite of whatever it is? Who knows?! It shouldn't based on the claimed purpose of TLER, but in reality, maybe, maybe not, because the drive shouldn't (``shouldn't'', haha) act like that to begin with. It will be more likely to go away if you replace the drive with a different model, though. ml Storage forum on hardforum.com, the experts there seem to ml recommend NOT having TLER enabled when using ZFS as ZFS can be ml configured for its timeouts, etc, I don't believe there are any configurable timeouts in ZFS. The ZFS developers take the position that timeouts are not our problem and push all that work down the stack to the controller driver and the disk driver, which cooperate (this is two drivers, now. plus a third ``SCSI mid-layer'' perhaps, for some controllers but not others.) to implement a variety of inconsistent, silly, undocumented cargo-cult flailing timeout regimes that we all have to put up with. However they are always quite long. The ATA max timeout is 30sec, and AIUI they are all much longer than that. My new favorite thing, though, is the reference counting. OS: ``This disk/iSCSIdisk is `busy' so you can't detach it''. me: ``bullshit. YOINK, detached, now deal with it.'' IMO this area is in need of some serious bar-raising. ml and the main reason to use TLER is when using those drives ml with hardware RAID cards which will kick a drive out of the ml array if it takes longer than 10 seconds. yup. which is something the drive will not do unless it encounters an ERROR. that is the E in TLER. In other words, the feature as described prevents you from noticing and invoking warranty replacement on your about-to-fail drive. For this you pay double. Have I got that right? In any case the obvious proper place to fix this is in the RAID-on-a-card firmware, not the disk firmware, if it does even need fixing which is unclear to me. unless the disk manufacturers are going to offer a feature ``do not spend more than 1 second out of every 2 seconds `trying harder' to read marginal data, just return errors'' which would actually have real value, the only reason TLER is proper is that it can convince all you gamers to pay twice as much for a drive because they've flipped a single bit in the firmware and then shovelled a big pile of bullshit into your heads. ml Can anyone else here comment if they have had experience with ml the WD drives and ZFS and if they have TLER enabled or ml disabled? I do not have any problems with drives dropping out of ZFS using the normal TLER setting. I do have problems with slowly-failing drives fucking up the whole system. ZFS doesn't deal with them gracefully, and I have to find the bad drive and remove it by hand. All this stuff about cold spares automatically replacing and users never noticing, is largely a fantasy. 
Neither observation leads me to want TLER. however observations like this ``why did my disks suddenly slow down?'' lead me to avoid WD drives period, for ZFS or not ZFS or anything at all. Whipping up all this marketing silliness around TLER also leads me to avoid them because I know they will shovel bullshit and FUD to justify jacked prices. pgpMng48rq0w8.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
dm == David Magda dma...@ee.ryerson.ca writes: dm http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/ http://www.groklaw.net/articlebasic.php?story=20050121014650517 says when the MPL was modified to become the CDDL, clauses were removed which would have required Oracle to disclose any patent licenses it might have negotiated with NetApp covering CDDL code. The disclosure would have to be added to hg, freeze or no: ``If Contributor obtains such knowledge after the Modification is made available as described in Section 3.2, Contributor shall promptly modify the LEGAL file in all copies Contributor makes available thereafter and shall take other steps (such as notifying appropriate mailing lists or newsgroups) reasonably calculated to inform those who received the Covered Code that new knowledge has been obtained.'' This is in MPL but removed from CDDL. The groklaw poster's concern is that this is a mechanism through which Oracle could manoever to make the CDDL worthless as a guarantee of zfs users' software freedom. CDDL does implicitly grant rights to Oracle's patents, but not to negotiations for shield from NetApp's. AIUI GPLv3 is different and does not have this problem, though I don't understand it well so I could be wrong. With MPL at least we would know about the negotiations: the settlement was ``secret'' which is exactly the disaster scenario the groklaw poster warned of. I'm sorry you cannot be uninterested in licenses and ``just want to get work done.'' To me it looks like the patent situation is mostly an obstacle to getting ZFS development funded. If you used ZFS secretly in some kind of cloud service, and never told anyone about it, you could be pretty certain of getting away with it without any patent claims throughout the entire decade or so that ZFS remains relevant, but if you want to participate in a horizontally-divided market like Coraid, or otherwise share source changes, you might get sued. This regime has to be a huge drag on the industry, and it makes things really unpredictable which has to discourage investment, and it strongly favours large companies. pgpLRI59okaob.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] VM's on ZFS - 7210
en == Eff Norwood sm...@jsvp.com writes: en http://www.anandtech.com/show/2738/8 but a few pages later: http://www.anandtech.com/show/2738/25 so, as you say, ``with all major SSDs in the role of a ZIL you will eventually not be happy.'' is true, but you seem to have accidentally left out the ``EXCEPT INTEL!'' Oops! Funnier still, the EXCEPT INTEL is right there in exactly the article YOU cited. however, that's not the end of it. Searching this very mailing list for 'anandtech' I found this cited about ten times: http://www.anandtech.com/show/2899/8 anandtech does not think TRIM / dirty drives are a problem any longer. You might want to redo whatever tests you did (or else read newer anandtech articles). I've made the same mistake of passing around anandtech links without keeping up with their latest posts, but the thing is, that link debunking your ideas was posted on this list *so* *many* *times* and over such a long interval! You can also use the anandtech articles as a point of reference for how you might write up your ``extensive testing'' of ``all major'' SSD's in a way that will ``assure'' people your conclusions are correct. (HINT: list the SSD's you tested. describe the testing method. Results would be nice, too, but the first two were missing from your post. They help a lot, and do not take much time to include, though leaving them out does help FUD spread further if you are trying to promote this ``DDRDrive'' with the silly external power brick.) en I can't think of an easy way to measure pages that have not en been consumed since it's really an SSD controller function en which is obfuscated from the OS, yeah, SSD's are largely just a different way of selling proprietary software, but I guess a lot of ``hardware'' is. pgpi59M7WwDpr.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] native ZFS on Linux
aa == Anurag Agarwal anu...@kqinfotech.com writes: aa Every one being part of beta program will have access to aa source code ...and the right to redistribute it if they like, which I think is also guaranteed by the license. Yes, I agree a somewhat formal beta program could be smart for this type of software, which can lose large amounts of data, and where reproducing problems isn't easy because debugging the way analagous to other software requires shipping around multi-terabyte possibly-confidential images, so you'd like competent testers so you can skip this without becoming too frustrated. But I don't see how anything fitting the definition of ``closed'' is possible with free software. Even just asking participants, ``please don't leak our software outside the beta, even though you've the legal right to do so. If you do leak it, we'll be unhappy,'' is an implicit threat to retaliate (ex. by excluding people from further beta releases, which you'll likely be making in a continuous stream). so the word ``closed'' alone, even without any further discussion, is likely to have a chilling effect on the software freedom of the beta participants, and I think this effect is absolutely intended by you, and that it's wrong. on one hand it's sort of a fine point, but on the other for the facts on the ground it can matter quite a lot. Thanks for the effort! and for clarifying that you will always release matching source along with every binary release you make! pgpN2VocVYwL0.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] native ZFS on Linux
aa == Anurag Agarwal anu...@kqinfotech.com writes: aa * Currently we are planning to do a closed beta aa * Source code will be made available with release. CDDL violation. aa * We will be providing paid support for our binary aa releases. great, so long as your ``binary releases'' always include source that matches the release exactly. pgpOBx1yJdmLD.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS issue with ZFS
pb( == Phillip Bruce (Mindsource) v-phb...@microsoft.com writes: pb( Problem solved.. Try using FQDN on the server end and that pb( work. The client did not have to use FQDN. 1. your syntax is wrong. You must use netgroup syntax to specify an IP, otherwise it will think you mean the hostname made up of those numbers and dots as characters.
NAME              PROPERTY  VALUE
andaman/arrchive  sharenfs  rw=@10.100.100.0/23:@192.168.2.3/32
2. there's a bug in mountd. well, there are many bugs in mountd, but this is the one I ran into, which makes the netgroup syntax mostly useless: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6901832 one workaround is to give every IP reverse lookup, ex. using BIND $GENERATE or something. I just use a big /etc/hosts covering every IP to which I've exported. I suppose actually fixing mountd would be what a good sysadmin would have done: it can't be that hard. pgp6GX6Mwe4Z0.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
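For reference, point 1 set correctly looks something like the sketch below (networks taken from the output above, the access mode assumed to be rw); the '@' prefix is what tells the NFS share code you mean an address or prefix rather than a hostname spelled with digits and dots:

    zfs set sharenfs='rw=@10.100.100.0/23:@192.168.2.3/32' andaman/arrchive
    zfs get -o name,property,value sharenfs andaman/arrchive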
Re: [zfs-discuss] Opensolaris is apparently dead
gd == Garrett D'Amore garr...@nexenta.com writes: Joerg is correct that CDDL code can legally live right alongside the GPLv2 kernel code and run in the same program. gd My understanding is that no, this is not possible. GPLv2 and CDDL are incompatible: http://www.fsf.org/licensing/education/licenses/index_html/#GPLIncompatibleLicenses however Linus's ``interpretation'' of the GPL considers that 'insmod' is ``mere aggregation'' and not ``linking'', but subject to rules of ``bad taste''. Although this may sound ridiculous, there are blob drivers for wireless chips, video cards, and storage controllers relying on this ``interpretation'' for over a decade. I think a ZFS porting project could do the same and end up emitting the same warning about a ``tained'' kernel that proprietary modules do: http://lwn.net/Articles/147070/ the quickest link I found of Linus actually speaking about his ``interpretation'', his thoughts are IMHO completely muddled (which might be intentional): http://lkml.org/lkml/2003/12/3/228 thus ultimately I think the question of whether it's legal or not isn't very interesting compared to ``is it moral?'' (what some of us might care about), and ``is it likely to survive long enough and not blow back in your face fiercely enough that it's a good enough business case to get funded somehow?'' (the question all the hardware manufacturers shipping blob drivers presumably asked themselves) My own view on blob modules is: * that it's immoral, and that Linus is both taking the wrong position and doing it without authority. Even if his position is ``everyone, please let's not fight,'' in practice that is a strong position favouring GPL violation, and his squirrelyness may look like taking a soft view but in practice it throws so much sand into the debate it ends up being actually a much stronger position than saying outright, ``I think insmod is mere aggregation.'' My copyright shouldn't have to bow to your celebrity. * and secondly that it does make business sense and is unlikely to cause any problems, because no one is able to challenge his authority. Whatever is the view on binary blob modules, I think it's the same view on ZFS w.r.t. the law, but not necessarily the same view w.r.t. morality or business, because the copyright law itself is immoral according to the views of many and the business risk depends on how much you piss people off. pgpor5KF8fYq9.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
pj == Peter Jeremy peter.jer...@alcatel-lucent.com writes: gd == Garrett D'Amore garr...@nexenta.com writes: cb == C Bergström codest...@osunix.org writes: fc == Frank Cusack frank+lists/z...@linetwo.net writes: tc == Tim Cook t...@cook.ms writes: pj Given that both provide similar features, it's difficult to pj see why Oracle would continue to invest in both. So far I think the tricky parts of filesystems have been the work of 1 - 3 people. It's difficult to see why the kind of developer who's capable of advancing those filesystems would continue to work in a negative environment like this one, but maybe they will. Such a developer can get money from several places, and I've never heard of something else this crew brings to the table than money. That's a bleak outlook on their ability to actually facilitate relevant ``investment,'' but who knows! gd Oracle *will* spend more on Solaris than Sun did. I believe gd that. hahaha, yup. At least I believe their saying they will try to do it. fc all public companies are very, very greedy. yeah, it's not helpful to anthropomorphize them, nor tell human interest 1930's newsreel-hero stories about their supposedly genius and/or evil leaders, nor imagine yourself into their point of view like they are your favorite soccer team. What's needed is clear focus on the rules of collaboration, and how these rules determine the future of your own greedy schemes. cb It was a community of system administrators and nearly no cb developers. sysadmins need to care about licenses because their investment cycle in a platform is, apparently, long compared to the stability of a publicly-traded company. tc *ONE* developer from Redhat does not change the fact that tc Oracle owns the rights to the majority of the code, one developer making the tinyest change to line breaks and then asserting his copyright does change everything, if it gets committed to trunk and used as the basis for further work that can't be rolled back. gd we are in the process of some enhancements to this gd code which will make it into Illumos, but probably not into gd Oracle Solaris unless they pull from Illumos. :-) yeah, well, add your copyright to it, and thus see that it doesn't make it into Solaris 11. Without hg, there's no longer any incentive to sign over your copyright to them in exchange for getting your changes committed, so not to keep it for yourself would be negligent and silly. Good or bad, it's just reality. FWIW, the SFLC usually suggests you get copyright assignments from every member to a single trusted organization so the license can be changed someday when a change might seem obviously wise. For example, Sun was careful to get assignments from all contributors, which at one time had good hypotheticals as well as the current bad reality: they could have released their tree under Linux-compatible GPL some day if convinced. ISTR some cheap talk about this right after most of Java was released as GPL. If Sun had included some Joerg Schilling-owned pieces in there, his one or two files would become a poison pill making license change impossible. However when there is no such trusted organization around, I think copyrights held by multiple orgs like Linux has are more sustainable. Nexenta clearly isn't a ``trusted organization,'' but having a source tree copyrighted by both Nexenta and Oracle could make the terms more stable than they'd be for a tree copyrighted by either alone. 
I don't think the Announcement means much for ZFS, though: it means releases will come only every year or two, which is about the maximum pace FreeBSD can keep up with so it will actually bring Solaris and FreeBSD closer in ZFS feature-parity not further apart. However, if you were using ZFS along with things like infiniband iSER/SRP/NFS-RDMA, zones, 10gig nics with cpu-affinity-optimized TCP, xen dom0, virtualbox, dtrace, or waiting/hoping for pNFS, or if you foolishly became addicted to proprietary SunPro and Sun's debugger, then you might be annoyed or even set back a few years by the Announcement since FreeBSD has none of these things. Post-Announcement, ZFS will no longer entice people to experiment with these features, but those who listened to the last half-decade of apologist's, ``let's wait patiently and quietly. More code will be liberated, even the C compiler. Just give them time,'' those suckers have now got problems. I've got a heap of IB cards trying to convince me to bury my head in the sand or keep ``hoping'' instead of reacting. I wish I'd invested my time into an OS I could continue using under consistent terms. pgps28C1MIhcQ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
dd 2 * Copyright (C) 2007 Oracle. All rights reserved. dd 3 * dd 4 * This program is free software; you can redistribute it and/or dd 5 * modify it under the terms of the GNU General Public dd 6 * License v2 as published by the Free Software Foundation. dd http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=blob;f=fs/btrfs/root-tree.c;h=2d958be761c84556b39c60afa3b0f3fd75d6;hb=HEAD http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=blob;f=fs/btrfs/free-space-cache.c;h=f488fac04d99ea45eea93607bbf17c021b5b2207;hb=HEAD 1 /* 2 * Copyright (C) 2008 Red Hat. All rights reserved. 3 * 4 * This program is free software; you can redistribute it and/or 5 * modify it under the terms of the GNU General Public 6 * License v2 as published by the Free Software Foundation. see, that's good, and is a realistic future scenario for ZFS, AFAICT: there can be a branch that's safe to collaborate on, which cannot go into Solaris 11 and cannot be taken proprietary by Nexenta, either. pgprH3DS8ogDw.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
mg == Mike Gerdts mger...@gmail.com writes: sw == Saxon, Will will.sa...@sage.com writes: sw I think there may be very good reason to use iSCSI, if you're sw limited to gigabit but need to be able to handle higher sw throughput for a single client. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942 look at it now before it gets pulled back inside the wall. :( I think this bug was posted on zfs-discuss earlier. Please see the comments because he is not using lagg's: even with a single 10Gbit/s NIC, you cannot use the link well unless you take advantage of the multiple MSI's and L4 preclass built into the NIC. You need multiple TCP circuits between client and server so that each will fire a different MSI. He got about 3x performance using 8 connections. It sounds like NFS is already fixed for this, but requires manual tuning of clnt_max_conns and the number of reader and writer threads. mg it is rather common to have multiple 1 Gb links to mg servers going to disparate switches so as to provide mg resilience in the face of switch failures. This is not unlike mg (at a block diagram level) the architecture that you see in mg pretty much every SAN. In such a configuation, it is mg reasonable for people to expect that load balancing will mg occur. nope. spanning tree removes all loops, which means between any two points there will be only one enabled path. An L2-switched network will look into L4 headers for splitting traffic across an aggregated link (as long as it's been deliberately configured to do that---by default probably only looks to L2), but it won't do any multipath within the mesh. Even with an L3 routing protocol it usually won't do multipath unless the costs of the paths match exactly, so you'd want to build the topology to achieve this and then do all switching at layer 3 by making sure no VLAN is larger than a switch. There's actually a cisco feature to make no VLAN larger than a *port*, which I use a little bit. It's meant for CATV networks I think, or DSL networks aggregated by IP instead of ATM like maybe some European ones? but the idea is not to put edge ports into vlans any more but instead say 'ip unnumbered loopbackN', and then some black magic they have built into their DHCP forwarder adds /32 routes by watching the DHCP replies. If you don't use DHCP you can add static /32 routes yourself, and it will work. It does not help with IPv6, and also you can only use it on vlan-tagged edge ports (what? arbitrary!) but neat that it's there at all. http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html The best thing IMHO would be to use this feature on the edge ports, just as I said, but you will have to teach the servers to VLAN-tag their packets. not such a bad idea, but weird. You could also use it one hop up from the edge switches, but I think it might have problems in general removing the routes when you unplug a server, and using it one hop up could make them worse. I only use it with static routes so far, so no mobility for me: I have to keep each server plugged into its assigned port, and reconfigure switches if I move it. Once you have ``no vlan larger than 1 switch,'' if you actually need a vlan-like thing that spans multiple switches, the new word for it is 'vrf'. so, yeah, it means the server people will have to take over the job of the networking people. 
The good news is that networking people don't like spanning tree very much because it's always going wrong, so AFAICT most of them who are paying attention are already moving in this direction. pgpEDdDjwl9Ck.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
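On the clnt_max_conns tuning mentioned earlier in this message: it is an rpcmod tunable on Solaris, so the usual way to try it is an /etc/system entry, roughly as sketched below, plus whatever NFS server thread tuning your release supports; treat the value as an example, not a recommendation:

    # /etc/system -- allow more TCP connections per NFS client/server pair
    # (the default has historically been 1); takes effect after a reboot
    set rpcmod:clnt_max_conns = 8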
Re: [zfs-discuss] 1tb SATA drives
bh == Brandon High bh...@freaks.com writes: bh Recent versions no longer support enabling TLER or ERC. To bh the best of my knowledge, Samsung and Hitachi drives all bh support CCTL, which is yet another name for the same thing. once again, I have to ask, has anyone actually found these features to make a verified positive difference with ZFS? Some of those things you cannot even set on Solaris because the channel to the drive with a LSI controller isn't sufficiently transparent to support smartctl, and the settings don't survive reboots. Brandon have you actually set it yourself, or are you just aggregating forum discussion? The experience so far that I've read here has been: * if a drive goes bad completely + zfs will mark the drive unavailable after a delay that depends on the controller you're using, but with lengths like 60 seconds, 180 seconds, 2 hours, or forever. The delay is not sane or reasonable with all controllers, and even if redundancy is available ZFS will patiently wait for the controller. The delay depends on the controller driver. It's part of the Solaris code. best case zpool will freeze until the delay is up, but there are application timeouts and iSCSI initiator-target timeouts, too---getting the equivalent of an NFS hard mount is hard these days (even with NFS, in some people's experiences). + the delay is different if the system's running when the drive fails, or if it's trying to boot up. For example iSCSI will ``patiently wait'' forever for a drive to appear while booting up, but will notice after 180 seconds while running. + because the disk is compeltely bad, TLER, ERC, CCTL, whatever you call it, doesn't apply. The drive might not answer commands ever, at all. The timer is not in the drive: the drive is bad starting now, continuing forever. * if a drive goes partially bad (large and increasing numbers of latent sector errors, which for me happens more often than bad-completely): + the zpool becomes unusably slow + it stays unusably slow until you use 'iostat' or 'fmdump' to find the marginal drive and offline it + TLER, ERC, CCTL makes the slowness factor 7ms : 7000ms vs 7ms : 3ms. In other words, it's unusably slow with or without the feature. AFAICT the feature is useful as a workaround for buggy RAID card firmware and nothing else. It's a cost differentiator, and you're swallowing it hook, line and sinker. If you know otherwise please reinform me, but the discussion here so far doesn't match what I've learned about ZFS and Solaris exception handling. That said, to reword Don Marti, ``uninformed Western Digital bashing is better than no Western Digital bashing at all.'' pgpFMSCuYt2qE.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
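For anyone who wants to poke at the knob anyway, it is exposed (on drives that expose it at all) as SCT ERC through smartmontools; a sketch with a made-up device name, and with the caveats above, i.e. the command may not pass through some LSI controllers on Solaris and the setting is usually lost at power-off:

    smartctl -l scterc /dev/rdsk/c3t43d0          # show current read/write ERC timers
    smartctl -l scterc,70,70 /dev/rdsk/c3t43d0    # set both timers to 7.0 s (units of 100 ms)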
Re: [zfs-discuss] File cloning
sw == Saxon, Will will.sa...@sage.com writes: sw 'clone' vs. a 'copy' would be very easy since we have sw deduplication now dedup doesn't replace the snapshot/clone feature for the NFS-share-full-of-vmdk use case because there's no equivalent of 'zfs rollback' I'm tempted to say, ``vmware needs to remove their silly limit'' but there are takes-three-hours-to-boot problems with thousands of Solaris NFS exports so maybe their limit is not so silly after all. What is the scenario, you have? Is it something like 40 hosts with live migration among them, and 40 guests on each host? so you need 1600 filesystems mounted even though only 40 are actually in use? 'zfs set sharenfs=absorb dataset' would be my favorite answer, but lots of people have asked for such a feature, and answer is always ``wait for mirror mounts'' (which BTW are actually just-works for me on very-recent linux, even with plain 'mount host:/fs /fs', without saying 'mount -t nfs4', in spite of my earlier rant complaining they are not real). Of course NFSv4 features are no help to vmware, but hypothetically I guess mirror-mounting would work if vmware supported it, so long as they were careful not to provoke the mounting of guests not in use. The ``implicit automounter'' on which the mirror mount feature's based would avoid the boot delay of mounting 1600 filesystems. and BTW I've not been able to get the Real Automounter in Linux to do what this implicit one already can with subtrees. Why is it so hard to write a working automounter? The other thing I've never understood is, if you 'zfs rollback' an NFS-exported filesystem, what happens to all the NFS clients? It seems like this would cause much worse corruption than the worry when people give fire-and-brimstone speeches about never disabling zil-writing while using the NFS server. but it seems to mostly work anyway when I do this, so I'm probably confused about something. pgpTw9yE68txJ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] carrying on
re == Richard Elling rich...@nexenta.com writes: re we would very much like to see Oracle continue to produce re developer distributions which more closely track the source re changes. I'd rather someone else than Oracle did it. Until someone else is doing the ``building'', whatever that entails all the way from Mercurial to DVD, we will never know if the source we have is complete enough to do a fork if we need to. I realize everyone has in their heads, FORK == BAD. Yes, forks are usually bad, but the *ability to make forks* is good, because it ``decouples the investments our businesses make in OpenSolaris/ZFS from the volatility of Sun and Oracle's business cycle,'' to paraphrase some blog comment. Particularly when you are dealing with datasets so large it might cost tens of thousands to copy them into another format than ZFS, it's important to have a 2 year plan for this instead of being subject to ``I am altering the deal. Pray I don't alter it any further.'' Nexenta being stuck at b134, and secret CVE fixes, does not look good. Though yeah, it looks better than it would if Nexenta didn't exist. IMHO it's important we don't get stuck running Nexenta in the same spot we're now stuck with OpenSolaris: with a bunch of CDDL-protected source that few people know how to use in practice because the build procedure is magical and secret. This is why GPL demands you release ``all build scripts''! One good way to help make sure you've the ability to make a fork, is to get the source from one organization and the binary distribution from another. As long as they're not too collusive, you can relax and rely on one of them to complain to the other. Another way is to use a source-based distribution like Gentoo or BSD, where the distributor includes a deliverable tool that produces bootable DVD's from the revision control system, and ordinary contributors can introspect these tools and find any binary blobs that may exist. pgpf3OSDelKXh.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL SSD failed
ds == Dmitry Sorokin dmitry.soro...@bmcorp.ca writes: ds The SSD drive has failed and zpool is unavailable anymore. AIUI, 6733267 Allow a pool to be imported with a missing slog is only fixed for the case where the pool is still imported. If you export it without removing the slog first, the pool is lost. Instructions here: http://opensolaris.org/jive/thread.jspa?messageID=377018 http://github.com/pjjw/logfix/tree/master show how to ``fake out'' the lazy assertions, but you have to prepare to use the workaround before your slog fails by noting its GUID. If you don't know the GUID, then it is as Richard Elling says, ``a rather long trial-and-error process.'' Decoded from Fanboi-ese into English, the ``rather long'' process is ``finding a sha1 hash collision.'' so either UTFS or ``restore from backup.'' :( pgpyK7PHBQp9Y.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
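The GUID in question lives in the slog's vdev label, so noting it while the device is still healthy is cheap; a sketch, with a made-up device name:

    zdb -l /dev/rdsk/c3t60d0s0 | grep guid    # record the vdev guid before the slog dies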
Re: [zfs-discuss] Legality and the future of zfs...
ab == Alex Blewitt alex.blew...@gmail.com writes: 3. The quality of software inside the firewire cases varies wildly and is a big source of stability problems. (even on mac) ab It would be good if you could refrain from spreading FUD if ab you don't have experience with it. yup, my experience was with the Prolific PL-3705 chip, which was very popular for a while. it has two problems: * it doesn't auto-pick its ``ID number'' or ``address'' or something, so if you have two cases with this chip on the same bus, they won't work. go google it! * it crashes. as in, I reboot the computer but not the case, and the drive won't mount. I reboot the case but not the computer, and the drive starts working again. http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html I even upgraded the firmware to give the chinese another shot. still broken. You can easily google for other problems with firewire cases in general. The performance of the overall system is all over the place depending on the bridge chip you use. Some of them have problems with ``large'' transactions as well. Some of them lose their shit when the drive reports bad sectors, instead of passing the error along so you can usefully diagnose it---not that they're the only devices with awful exception handling in this area, but why add one more mystery? I think it was already clear I had experience from the level of detail in the other items I mentioned, though, wasn't it? Add also to all of it the cache flush suspicions from Garrett: these bridge chips have full-on ARM cores inside them and lots of buffers, which is something SAS multipliers don't have AIUI. Yeah, in a way that's slightly FUDdy but not really since IIRC the write cache problem has been verified at least on some USB cases, hasn't it? Also since the testing procedure for cache flush problems is a littlead-hoc, and a lot of people are therefore putting hardware to work without testing cache flush at all, I think it makes perfect sense to replace suspicious components with lengths of dumb wire where possible even if the suspicions aren't proved. ab I have used FW400 and FW800 on Mac systems for the last 8 ab years; the only problem was with the Oxford 911 chipset in OSX ab 10.1 days. yeah, well, if you don't want to listen, then fine, don't listen. ab It may not suit everyone's needs, and it may not be supported ab well on OpenSolaris, but it works fine on a Mac. aside from being slow unstable and expensive, yeah it works fine on Mac. But you don't really have the eSATA option on the mac unless you pay double for the ``pro'' desktop, so i can see why you'd defend your only choice of disk if you've already committed to apple. Does the Mac OS even have an interesting zfs port? Remind me why we are discussing this, again? pgpbltDPUUaLy.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Legality and the future of zfs...
ab == Alex Blewitt alex.blew...@gmail.com writes: ab All Mac Minis have FireWire - the new ones have FW800. I tried attaching just two disks to a ZFS host using firewire, and it worked very badly for me. I found: 1. The solaris firewire stack isn't as good as the Mac OS one. 2. Solaris is very obnoxious about drives it regards as ``removeable''. There are ``hot-swappable'' drives that are not considered removeable but can be removed about as easily, that are maybe handled less obnoxiously. Firewire's removeable while SAS/SATA are hot-swappable. 3. The quality of software inside the firewire cases varies wildly and is a big source of stability problems. (even on mac) The companies behind the software are sketchy and weak, while only a few large cartels make SAS expanders for example. Also, the price of these cases is ridiculously high compared to SATA world. If you go there you may as well take your wad next door and get SAS. 4. The translation between firewire and SATA is not a simple one, and is not transparent to 'smartctl' commands, or other weird things like hard disk firmware upgraders. though I guess the same is true of the lsi controllers under solaris. This problem's rampant unfortunately. 5. Firewire is slow. too slow to make 2x speed interesting. and the host chips are not that advanced so they use a lot of CPU. 6. The DTL partial-mirror-resilver doesn't work. With b130 it still doesn't work. After half a mirror goes away and comes back, scrubs always reveal CKSUM errors on the half that went away. With b71 I found if I meticulously 'zpool offline'd the disks before taking them away, the CKSUM errors didn't happen. With b130 that no longer helps. so, scratchy unreliable connections are just unworkable. Even iSCSI is not great, but firewire cases sprawled all over a desk with trippable scratchy cables is just not on. It's better to have larger cases that can be mounted in a rack, or if not that, at least cases that are heavier and fewer in number and fewer in cordage. suggest that you do not waste time with firewire. SATA, SAS, or fuckoff. None of this is an insult to your blingy designer apple iShit. It applies equally well to any hardware involving lots of tiny firewire cases. pgp6yEjqWzyNZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
np == Neil Perrin neil.per...@oracle.com writes: np The L2ARC just holds blocks that have been evicted from the np ARC due to memory pressure. The DDT is no different than any np other object (e.g. file). The other cacheable objects require pointers to stay in the ARC pointing to blocks in the L2ARC. If the DDT required this, L2ARC-ification would be pointless since DDT entries aren't much smaller than ARC-L2ARC pointers, so from what I hear it is actually special in some way though I don't know precisely what way. pgpWlNwOCvSTx.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Trouble detecting Seagate LP 2TB drives
bh == Brandon High bh...@freaks.com writes: Atom bh 32-bit kernels can't support drives over 1GB. iirc, atom desktop chips are 64-bit and recognized as 64-bit by kernel, but not recognized by grub. but I thought this got fixed. If you use 'e' in grub to alter the boot line to replace $ISADIR with 'amd64' does it come up 64-bit and work? That's the fix I recall. pgpDQNWEKjhMU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
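For anyone wanting to try that edit, roughly (the exact menu entry text varies by build, so treat this as a sketch, not gospel): at the GRUB menu press 'e' on the boot entry, select the kernel$ line, press 'e' again, and change

    kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS

to

    kernel$ /platform/i86pc/kernel/amd64/unix -B $ZFS-BOOTFS

do the same $ISADIR -> amd64 substitution on the module$ boot_archive line, then press 'b' to boot. Afterward 'isainfo -kv' should report a 64-bit kernel if the fix applies to your Atom.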
Re: [zfs-discuss] Native ZFS for Linux
gd == Garrett D'Amore garr...@nexenta.com writes: gd There are numerous people in the community that have indicated gd that they believe that such linking creates a *derivative* gd work. Donald Becker has made this claim rather forcefully. yes, I think he has a point. The reality is, as long as Linus continues insisting that his ``interpretation'' of the GPL allows loading proprietary modules like ati/nVidia/wireless/... into the Linux kernel, it looks like no one will be sued over a module. This has been holding for a few decades anyway. If everyone with standing to sue is sufficiently under Linus's thumb then you may become safe enough for it to be worth the risk. Also, if they do not distribute their ZFS port to anyone else then they're fine: quite intentionally, they can link anything they like with Linux so long as they never distribute any binaries outside their organization, just like Akamai is fine basing their entire business off GPL'd Squid source code that they've improved vastly and not shared with anyone. We may find ourselves in a position where the guys distributing this Linux ZFS module could be sued and then told ``you have lost the right to distribute the GPL-derived work,'' to which their answer is, ``fine, we do not need to distribute it anyway. We only need to use it internally,'' so confronting them is a net loss for most of the parties with standing to do the confronting. An exception is, it could be a net win for Oracle because if they could shut down zfs.ko then peopo would be forced to run Solaris to get performant ZFS, which might play out in a funny way: Q. We are the owners of foobrulator.c in Linux, a GPLv2 source file. You may not link this CDDL stuff against our foobrulator.c. You have lost the right to distribute foobrulator.c. A. Wait, don't you own the copyright to the more restrictive CDDL stuff in question? Q. Yes, we own the copyrights to both sources, but you cannot link them together. A. HAHAHA you can't be serious. Q. Mwauh hah hah. A. ... who knows. maybe it could happen. In short, * yes zfs.ko could be a little sketchy * other people are doing much sketchier things already and making a lot of money doing it * looking at the big picture is a lot more convoluted than just ``allowed'' or ``OMGillegall''. If you want your share of this money/fame of the second bullet you might push the envelope as the others have, and consider who has standing to sue whom given a specific way of building and distributing the module, and among those who have standing who has motivation to do it, and finally if they actually do then how much have you got to lose. In other words: business, instead of FUD pedantry and CYA. * in particular, if your business does not involve distributing software... :) * GPL has so much momentum that contributing to a GPL-incompatible project is a significantly less valuable use of your time than contributing to a GPL-compatible one, even and maybe especially if you do not like the GPL. Perl, Apache, BSD, and FSF are all wising up to this and making their licenses more compatible from both directions. CDDL is thus, granted obviously well-liked by some, but very disappointing and regressive to quite a few potential contributors, and this disappointment is widely-understood partly becuse of ZFS+Linux. I almost hope they do not share their port with anyone and use it only internally, and that they make some huge improvements to ZFS that they then claim cannot be given back to Solaris because of license incompatibility. 
That will send a strong message to the forces of arrogance that crafted a GPL incompatible license at such a late date. In this age of web-scale megacompanies the distinction between GPL-style freedom and BSD-style freedom is much less because operations do not require binary redistributing, but license compatibility does still matter. pgpJGNtgXx2f3.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Homegrown Hybrid Storage
pk == Pasi Kärkkäinen pa...@iki.fi writes: You're really confused, though I'm sure you're going to deny it. I don't think so. I think that it is time to reset and reboot yourself on the technology curve. FC semantics have been ported onto ethernet. This is not your grandmother's ethernet but it is capable of supporting both FCoE and normal IP traffic. The FCoE gets per-stream QOS similar to what you are used to from Fibre Channel. FCoE != iSCSI. FCoE was not being discussed in the part you're trying to contradict. If you read my entire post, I talk about FCoE at the end and say more or less ``I am talking about FCoE here only so you don't try to throw out my entire post by latching onto some corner case not applying to the OP by dragging FCoE into the mix'' which is exactly what you did. I'm guessing you fired off a reply without reading the whole thing? pk Yeah, today enterprise iSCSI vendors like Equallogic (bought pk by Dell) _recommend_ using flow control. Their iSCSI storage pk arrays are designed to work properly with flow control and pk perform well. pk Of course you need a proper (certified) switches aswell. pk Equallogic says the delays from flow control pause frames are pk shorter than tcp retransmits, so that's why they're using and pk recommending it. please have a look at the three links I posted about flow control not being used the way you think it is by any serious switch vendor, and the explanation of why this limitation is fundamental, not something that can be overcome by ``technology curve.'' It will not hurt anything to allow autonegotiation of flow control on non-broken switches so I'm not surprised they recommend it with ``certified'' known-non-broken switches, but it also will not help unless your switches have input/backplane congestion which they usually don't, or your end host is able to generate PAUSE frames for PCIe congestion which is maybe more plausible. In particular it won't help with the typical case of the ``incast'' problem in the experiment in the FAST incast paper URL I gave, because they narrowed down what was happening in their experiment to OUTPUT queue congestion, which (***MODULO FCoE*** mr ``reboot yourself on the technology curve'') never invokes ethernet flow control. HTH. ok let me try again: yes, I agree it would not be stupid to run iSCSI+TCP over a CoS with blocking storage-friendly buffer semantics if your FCoE/CEE switches can manage that, but I would like to hear of someone actually DOING it before we drag it into the discussion. I don't think that's happening in the wild so far, and it's definitely not the application for which these products have been flogged. I know people run iSCSI over IB (possibly with RDMA for moving the bulk data rather than TCP), and I know people run SCSI over FC, and of course SCSI (not iSCSI) over FCoE. Remember the original assertion was: please try FC as well as iSCSI if you can afford it. Are you guys really saying you believe people are running ***iSCSI*** over the separate HOL-blocking hop-by-hop pause frame CoS's of FCoE meshes? or are you just spewing a bunch of noxious white paper vapours at me? because AIUI people using the lossless/small-output-buffer channel of FCoE are running the FC protocol over that ``virtual channel'' of the mesh, not iSCSI, are they not? pgp7HCeOuOq4h.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
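If you want to see what your NICs have actually negotiated rather than what the white papers say, a sketch on a crossbow-era build -- link name is just an example, and the property/value names are as I remember them, so verify against your dladm(1M):

    # dladm show-linkprop -p flowctrl e1000g0
    # dladm set-linkprop -p flowctrl=bi e1000g0
    # netstat -s -P tcp | grep -i retrans

The retransmit counters are the number that actually matters for the output-queue-drop story above; the flowctrl setting only tells you whether PAUSE frames can be exchanged at all.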
Re: [zfs-discuss] Homegrown Hybrid Storage
re == Richard Elling richard.ell...@gmail.com writes: re Please don't confuse Ethernet with IP. okay, but I'm not. seriously, if you'll look into it. Did you misread where I said FC can exert back-pressure? I was contrasting with Ethernet. Ethernet output queues are either FIFO or RED, and are large compared to FC and IB. FC is buffer-credit, which HOL-blocks to prevent the small buffers from overflowing, and IB is...blocking (almost no buffer at all---about 2KB per port and bandwidth*delay product of about 1KB for the whole mesh, compared to ARISTA which has about 48MB per port, so except to pedantic IB is bufferless, ie it does not even buffer one full frame). Unlike Ethernet, both are lossless fabrics (sounds good) and have an HOL-blocking character (sounds bad). They're fundamentally different at L2, so this is not about IP. If you run IP over IB, it is still blocking and lossless. It does not magically start buffering when you use IP because the fabric is simply unable to buffer---there is no RAM in the mesh anywhere. Both L2 and L3 switches have output queues, and both L3 and L2 output queues can be FIFO or RED because the output buffer exists in the same piece of silicon of an L3 switch no matter whether it's set to forward in L2 or L3 mode, so L2 and L3 switches are like each other and unlike FC IB. This is not about IP. It's about Ethernet. a relevant congestion difference between L3 and L2 switches (confusing ethernet with IP) might be ECN, because only an L3 switch can do ECN. But I don't think anyone actually uses ECN. It's disabled by default in Solaris and, I think, all other Unixes. AFAICT my Extreme switches, a very old L3 flow-forwarding platform, are not able to flip the bit. I think 6500 can, but I'm not certain. re no back-off other than that required for the link. Since re GbE and higher speeds are all implemented as switched fabrics, re the ability of the switch to manage contention is paramount. re You can observe this on a Solaris system by looking at the NIC re flow control kstats. You're really confused, though I'm sure you're going to deny it. Ethernet flow control mostly isn't used at all, and it is never used to manage output queue congestion except in hardware that everyone agrees is defective. I almost feel like I've written all this stuff already, even the part about ECN. Ethernet flow control is never correctly used to signal output queue congestion. The ethernet signal for congestion is a dropped packet. flow control / PAUSE frames are *not* part of some magic mesh-wide mechanism by which switches ``manage'' congestion. PAUSE are used, when they're used at all, for oversubscribed backplanes: for congestion on *input*, which in Ethernet is something you want to avoid. You want to switch ethernet frames to the output port where it may or may not encounter congestion so that you don't hold up input frames headed toward other output ports. If you did hold them up, you'd have something like HOL blocking. IB takes a different approach: you simply accept the HOL blocking, but tend to design a mesh with little or no oversubscription unlike ethernet LAN's which are heavily oversubscribed on their trunk ports. so...the HOL blocking happens, but not as much as it would with a typical Ethernet topology, and it happens in a way that in practice probably increases the performance of storage networks. 
This is interesting for storage because when you try to shove a 128kByte write into an Ethernet fabric, part of it may get dropped in an output queue somewhere along the way. In IB, never will part of the write get dropped, but sometimes you can't shove it into the network---it just won't go, at L2. With Ethernet you rely on TCP to emulate this can't-shove-in condition, and it does not work perfectly in that it can introduce huge jitter and link underuse (``incast'' problem: http://www.pdl.cmu.edu/PDL-FTP/Storage/FASTIncast.pdf ), and secondly leave many kilobytes in transit within the mesh or TCP buffers, like tens of megabytes and milliseconds per hop, requiring large TCP buffers on both ends to match the bandwidth*jitter and frustrating storage QoS by queueing commands on the link instead of in the storage device, but in exchange you get from Ethernet no HOL blocking and the possibility of end-to-end network QoS. It is a fair tradeoff but arguably the wrong one for storage based on experience with iSCSI sucking so far. But the point is, looking at those ``flow control'' kstats will only warn you if your switches are shit, and shit in one particular way that even cheap switches rarely are. The metric that's relevant is how many packets are being dropped, and in what pattern (a big bucket of them at once like FIFO, or a scattering like RED), and how TCP is adapting to these drops. For this you might look at TCP stats in solaris, or at output queue drop and output queue size stats on managed switches, or simply at the overall
Re: [zfs-discuss] Homegrown Hybrid Storage
et == Erik Trimble erik.trim...@oracle.com writes: et With NFS-hosted VM disks, do the same thing: create a single et filesystem on the X4540 for each VM. previous posters pointed out there are unreasonable hard limits in vmware to the number of NFS mounts or iSCSI connections or something, so you will probably run into that snag when attempting to use the much faster snapshotting/cloning in ZFS. * Are the FSYNC speed issues with NFS resolved? et The ZIL SSDs will compensate for synchronous write issues in et NFS. okay, but sometimes for VM's I think this often doesn't matter because NFSv3 and v4 only add fsync()'s on file closings, and a virtual disk is one giant file that the client never closes. There may still be synchronous writes coming through if they don't get blocked in LVM2 inside the guest or blocked in the VM software, but whatever comes through ought to be exactly the same number of them for NFS or iSCSI, unless the vm software has different bugs in the nfs vs iscsi back-ends. the other difference is in the latest comstar which runs in sync-everything mode by default, AIUI. Or it does use that mode only when zvol-backed? Or something. I've the impression it went through many rounds of quiet changes, both in comstar and in zvol's, on its way to its present form. I've heard said here you can change the mode both from the comstar host and on the remote initiator, but I don't know how to do it or how sticky the change is, but if you didn't change and stuck with the default sync-everything I think NFS would be a lot faster. This is if we are comparing one giant .vmdk or similar on NFS, against one zvol. If we are comparing an exploded filesystem on NFS mounted through the virtual network adapter, then of course you're right again Erik. The tradeoff integrity tests are, (1) reboot the solaris storage host without rebooting the vmware hosts guests and see what happens, (2) cord-yank the vmware host. Both of these are probably more dangerous than (3) command the vm software to virtual-cord-yank the guest. * Should I go with fiber channel, or will the 4 built-in 1Gbe NIC's give me enough speed? FC has different QoS properties than Ethernet because of the buffer credit mechanism---it can exert back-pressure all the way through the fabric. same with IB, which is HOL-blocking. This is a big deal with storage, with its large blocks of bursty writes that aren't really the case for which TCP shines. I would try both and compare, if you can afford it! je IMHO Solaris Zones with LOFS mounted ZFSs gives you the je highest flexibility in all directions, probably the best je performance and least resource consumption, fine grained je resource management (CPU, memory, storage space) and less je maintainance stress etc... yeah zones are really awesome, especially combined with clones and snapshots. For once the clunky post-Unix XML crappo solaris interfaces are actually something I appreciate a little, because lots of their value comes from being able to do consistent repeatable operations on them. The problem is that the zones run Solaris instead of Linux. BrandZ never got far enough to, for example, run Apache under a 2.6-kernel-based distribution, so I don't find it useful for any real work. I do keep a CentOS 3.8 (I think?) brandz zone around, but not for anything production---just so I can try it if I think the new/weird version of a tool might be broken. 
as for native zones, the ipkg repository, and even the jucr repository, has two years old versions of everything---django/python, gcc, movabletype. Many things are missing outright, like nginx. I'm very disappointed that Solaris did not adopt an upstream package system like Dragonfly did. Gentoo or pkgsrc would have been very smart, IMHO. Even opencsw is based on Nick Moffitt's GAR system, which was an old mostly-abandoned tool for building bleeding edge Gnome on Linux. The ancient perpetually-abandoned set of packages on jucr and the crufty poorly-factored RPM-like spec files leave me with little interest in contributing to jucr myself, while if Solaris had poured the effort instead into one of these already-portable package systems like they poured it into Mercurial after adopting that, then I'd instead look into (a) contributing packages that I need most, and (b) using whatever system Solaris picked on my non-Solaris systems. This crap/marginalized build system means I need to look at a way to host Linux under Solaris, using Solaris basically just for ZFS and nothing else. The alternative is to spend heaps of time re-inventing the wheel only to end up with an environment less rich than competitors and charge twice as much for it like joyent. But, yeah, while working on Solaris I would never install anything in the global zone after discovering how easy it is to work with ipkg zones. They are really brilliant, and unlike everyone else's attempt at these
Re: [zfs-discuss] ZFS recovery tools
sl == Sigbjørn Lie sigbj...@nixtra.com writes: sl Excellent! I wish I would have known about these features when sl I was attempting to recover my pool using 2009.06/snv111. the OP tried the -F feature. It doesn't work after you've lost zpool.cache: op I was setting up a new system (osol 2009.06 and updating to op the latest version of osol/dev - snv_134 - with op deduplication) and then I tried to import my backup zpool, but op it does not work. op # zpool import -f tank1 op cannot import 'tank1': one or more devices is currently unavailable op Destroy and re-create the pool from a backup source op Any other option (-F, -X, -V, -D) and any combination of them op doesn't help either. I have been in here repeatedly warning about this incompleteness of the feature while fanbois keep saying ``we have slog recovery so don't worry.'' R., please let us know if the 'zdb -e -bcsvL zpool-name' incantation Sigbjorn suggested ends up working for you or not. pgpFHj14VBEC7.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] nfs share of nested zfs directories?
cs == Cindy Swearingen cindy.swearin...@oracle.com writes: okay wtf. Why is this thread still alive? cs The mirror mount feature It's unclear to me from this what state the feature's in: http://hub.opensolaris.org/bin/view/Project+nfs-namespace/ It sounds like mirror mounts are done but referrals are not, but I don't know. Are the client and server *both* done? I assume so, because I don't know how else it could be tested. Is the bug with 'find' fixed? It looks like it was fixed, but very recently: http://opensolaris.org/jive/message.jspa?messageID=409895#409895 and it sounds like there could be problems with other programs that have a --one-file-system option like gnutar and rsync because the fix is sort of ad-hoc---it's done by making changes to the solaris userland. Are all the features described at: http://hub.opensolaris.org/bin/download/Project+nfs-namespace/files/mm-PRS-open.html actually implemented, including automounter overrides, automatic unmounting, recursive unmounting? not sure. Are you even using NFSv4 in Linux? It's very unlikely. probably you are using NFSv3. People are reporting unresolved problems with NFSv4 with connections bouncing and not properly simulating the ``statelessness'' that allows servers to reboot when clients don't: http://mail.opensolaris.org/pipermail/nfs-discuss/2010-April/002087.html granted, ISTR some of the problems are reported by people doing goofy bullshit through firewalls, like bank admins that don't seem to understand TCP/IP and are flailing around with the blamestick because they are in a CYA environment and don't have reasonable control of their own systems. but I am not sure it's worth the trouble! AFAICT you cannot even net-boot opensolaris over NFSv4: '/' comes up mounted with NFSv3. It seems to me every time this ``I can't see subdirectories'' comes up it's from someone who doesn't understand how NFS and Unix works, doesn't know how to mount ANY filesystem much less NFS, has no idea what version of NFS he is using much less how to determine his NFSv4 idmap domain (answer is: 'cat /var/run/nfs4_domain'). The right answer is ``you need to mount the underlying filesystem. You need one mount command or mount line in /etc/{v,}fstab per one exported filesystem on the server.'' very simple, very reasonable. But the answer pitched at them is all this convoluted bleeding edge mess about mirror mounts, coming from people who don't have any experience actually USING mirror mounts, always with the caveat ``I'm not sure if your client supports BUT ...''!!! But what? Are you even sure if the feature works ANYwhere, if you've never used it yourself? It sounds like a simple feature, but it just isn't. If it actually worked the question would not even exist, so how can it be the answer? It is like ``Q. Can you please help me? / A. You might not even be here. Maybe we are not having this conversation because everything works perfectly. Let me explain to you what `working perfectly' means and then you can tell me if you are real or not.'' I would suggest you forget about this noise for the moment and write heirarchical automount maps. This works on both Linux and Solaris, except that you don't have the full value of the automounter here because you cannot refresh parts of the subtree while the parent is still mounted, which is part of what the automounter is good for. 
It's normal that an automounter won't consider new map data for things that are already mounted, but for hierarchical automounts, AFAICT you have to unmount the whole tree before any changes deep inside the tree will be refreshed from the map, which is less than ideal but reflects the ad-hoc way the automounter's corner cases were slowly semifixed, especially on Linux. There are examples of hierarchical automounts in the man page, and if you don't understand the examples then simply do not use the automounter at all. You do not even need to use the automounter. You can just put your filesystems into /etc/fstab and walk away from it. Honestly I think it is crazy that it takes you over a month simply to get one NFS subdirectory mounted inside another. This should take one hour. Please just forget about all this newfangled bullshit and mount the filesystem. see 'man mount' and just DO it! Like this in /etc/fstab on Linux:

    terabithia:/arrchive        /arrchive        nfs  rw,noacl,nodev  0 0
    terabithia:/arrchive/music  /arrchive/music  nfs  rw,noacl,nodev  0 0

*DONE*. There is no NFSv4. It is NFSv3. There is no automounter. There are no ``mirror mounts'' and no referrals. If you add more ZFS filesystems, you add more lines to /etc/fstab on every Linux client. okay? If you are afraid you are using NFSv4, stop that from happening by saying '-o vers=3' on Solaris or '-t nfs' in Linux. But if you're using Linux, you're not using NFSv4. Solaris uses v4 by default.
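To make the forcing-v3 part concrete, roughly (paths reused from the example above; exact option spelling can vary by distro and release):

    Linux:    mount -t nfs -o vers=3,noacl,nodev terabithia:/arrchive /arrchive
    Solaris:  mount -F nfs -o vers=3 terabithia:/arrchive /arrchive
    check:    nfsstat -m /arrchive

nfsstat -m shows the mount options and NFS version that were actually negotiated, which settles the ``am I even using v4?'' question in one command.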
Re: [zfs-discuss] zfs recordsize change improves performance
ai == Asif Iqbal vad...@gmail.com writes: If you disable the ZIL for locally run Oracle and you have an unscheduled outage, then it is highly probable that you will lose data. ai yep. that is why I am not doing it until we replace the ai battery no, wait please, you still need the ZIL to be on, even with the battery. disabling the cache flush command is what the guide says is allowed and sometimes helpful for people who have NVRAM's, but disabling the cache flush command and disabling the ZIL are different. Disabling the ZIL means the write can be cached in DRAM until the next txg flush and not issued to the disks at all, so even if you have a disk array with an NVRAM that effectively writes everything as if it were sync, the disk array will not even see the write until txg commit time with ZIL disabled. If you have working NVRAM, I think disabling the ZIL is likely not to give much speed-up, so if you are going to try disabling it, now when your battery is dead is the time to do it. Once the battery's fixed, theory says your testing will probably show things are just as fast with ZIL enabled. AIUI if you disable the ZIL, the database should still come back in a crash-consistent state after a cord-yank, but it will be an older state than it should be, so if you have several RDBMS behind some kind of tiered middleware the different databases won't be in sync with each other so you can lose integrity. If you have only one RDBMS I think you will lose only durability through this monkeybusiness, and integrity will survive. I'm not an expert of anything, but that's my understanding for now. pgpFapbkFrlFR.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
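For the record, these are two different knobs. A sketch of both, as I understand the tunables on builds of that vintage -- the names have a habit of changing, so verify against your own build before trusting this:

    /etc/system:
        set zfs:zfs_nocacheflush = 1    (stop issuing cache-flush to the array -- the NVRAM case the guide allows)
        set zfs:zil_disable = 1         (no ZIL at all -- the thing being warned about above)

    or live, lasting until the next boot:
        echo zil_disable/W0t1 | mdb -kw

The first one still pushes every synchronous write out to the array immediately; the second one holds it in host DRAM until the next txg, which is the whole difference.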
Re: [zfs-discuss] New SSD options
d == Don d...@blacksun.org writes: hk == Haudy Kazemi kaze0...@umn.edu writes: d You could literally split a sata cable and add in some d capacitors for just the cost of the caps themselves. no, this is no good. The energy only flows in and out of the capacitor when the voltage across it changes. In this respect they are different from batteries. It's normal to use (non-super) capacitors as you describe for filters next to things drawing power in a high-frequency noisy way, but to use them for energy storage across several seconds you need a switching supply to drain the energy from it. the step-down and voltage-pump kinds of switchers are non-isolated and might do fine, and are cheaper than full-fledged DC-DC that are isolated (meaning the input and output can float wrt each other). you can charge from 12V and supply 5V if that's cheaper. :) hope it works. hk okay, we've waited 5 seconds for additional data to arrive to hk be written. None has arrived in the last 5 seconds, so we're hk going to write what we already have to better ensure data hk integrity, yeah, I am worried about corner cases like this. ex: input power to the SSD becomes scratchy or sags, but power to the host and controller remain fine. Writes arrive continuously. The SSD sees nothing wrong with its power and continues to accept and acknowledge writes. Meanwhile you burn through your stored power hiding the sagging supply until you can't, then the SSD loses power suddenly and drops a bunch of writes on the floor. That is why I drew that complicated state diagram in which the pod disables and holds-down the SATA connection once it's running on reserve power. Probably y'all don't give a fuck about such corners though, nor do many of the manufacturers selling this stuff, so, whatever. pgpYM02z6LZ58.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
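Rough numbers, to show why a plain filter cap across the SATA power leads doesn't get you there (assuming a ~5 W SSD and the 60 seconds of hold-up discussed elsewhere in the thread; converter losses ignored):

    energy needed:           5 W * 60 s = 300 J
    energy in a capacitor:   E = 1/2 * C * V^2
    usable energy draining 12 V down to 6 V through a buck converter:
                             1/2 * C * (12^2 - 6^2) = 54 * C joules
    so C ~= 300 / 54 ~= 5.6 F  --- supercap territory.

A 470 uF filter cap charged to 12 V stores about 1/2 * 470e-6 * 144 ~= 0.034 J, roughly four orders of magnitude short, which is why you need real energy storage plus a switching supply to drain it.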
Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?
dd == David Dyer-Bennet d...@dd-b.net writes: dd Just how DOES one know something for a certainty, anyway? science. Do a test like Lutz did on X25M G2. see list archives 2010-01-10. pgpeiR4DYODbj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS memory recommendations
et == Erik Trimble erik.trim...@oracle.com writes: et No, you're reading that blog right - dedup is on a per-pool et basis. The way I'm reading that blog is that deduped data is expanded in the ARC. pgpozjcLXZlNV.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
d == Don d...@blacksun.org writes: d Since it ignores Cache Flush command and it doesn't have any d persistant buffer storage, disabling the write cache is the d best you can do. This actually brings up another question I d had: What is the risk, beyond a few seconds of lost writes, if d I lose power, there is no capacitor and the cache is not d disabled? why use a slog at all if it's not durable? You should disable the ZIL instead. Compared to a slog that ignores cache flush, disabling the ZIL will provide the same guarantees to the application w.r.t. write ordering preserved, and the same problems with NFS server reboots, replicated databases, mail servers. It'll be faster than the fake-slog. It'll be less risk of losing the pool because the slog went bad and then you accidentally exported the pool while trying to fix things. The only case where you are ahead with the fake-slog, is the host's going down because of kernel panics rather than power loss. I don't know, though, what to do about these reports of devices that almost respect cache flushes but seem to lose exactly one transaction. AFAICT this should be a works/doesntwork situation, not a continuum. pgp4xXGJ3xew4.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
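And if you do conclude a non-durable slog isn't worth having, builds from around snv_125 onward (IIRC) can remove a log device outright rather than leaving it as a liability -- a sketch, pool and device names made up:

    # zpool remove tank c4t2d0
    # zpool status tank

After the remove, the 'logs' section should be gone from the status output and synchronous writes fall back to the main pool devices (or to nothing, if you've disabled the ZIL as above).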
Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?
rsk == Roy Sigurd Karlsbakk r...@karlsbakk.net writes: dm == David Magda dma...@ee.ryerson.ca writes: tt == Travis Tabbal tra...@tabbal.net writes: rsk Disabling ZIL is, according to ZFS best practice, NOT rsk recommended. dm As mentioned, you do NOT want to run with this in production, dm but it is a quick way to check. REPEAT: I disagree. Once you associate the disasterizing and dire warnings from the developer's advice-wiki with the specific problems that ZIL-disabling causes for real sysadmins rather than abstract notions of ``POSIX'' or ``the application'', a lot more people end up wanting to disable their ZIL's. In fact, most of the SSD's sold seem to be relying on exactly the trick disabled-ZIL ZFS does for much of their high performance, if not their feasibility within their price bracket period: provide a guarantee of write ordering without durability, and many applications are just, poof, happy. If the SSD's arrange that no writes are reordered across a SYNC CACHE, but don't bother actually providing durability, end uzarZ will ``OMG windows fast and no corruption.'' -- ssd sales. The ``do-not-disable-buy-SSD!!!1!'' advice thus translates to ``buy one of these broken SSD's, and you will be basically happy. Almost everyone is. When you aren't, we can blame the SSD instead of ZFS.'' all that bottlenecked SATA traffic host-SSD is just CYA and of no real value (except for kernel panics). Now, if someone would make a Battery FOB, that gives broken SSD 60 seconds of power, then we could use the consumer crap SSD's in servers again with real value instead of CYA value. FOB should work like this:

    == RUNNING ==                   SATA port: pass; power to SSD: on
        -- input power lost -->         POWER-LOST HOLD-DOWN
    == POWER-LOST HOLD-DOWN ==      SATA port: block; SSD keeps running on the battery
        -- input power restored -->     POWER-RESTORED HOLD-DOWN
        -- 60 seconds elapsed -->       POWER OFF
    == POWER OFF ==                 power to SSD: off
        -- input power restored -->     POWER-RESTORED HOLD-DOWN
    == POWER-RESTORED HOLD-DOWN ==  SATA port: block
        -- battery recharged -->        RUNNING

The device must know when its battery has gone bad and stick itself in ``power restored hold down'' state. Knowing when the battery is bad may require more states to test the battery, but this is the general idea. I think it would be much cheaper to build an SSD with supercap, and simpler because you can assume the supercap is good forever instead of testing it. However because of ``market forces'' the FOB approach might sell for cheaper because the FOB cannot be tied to the SSD and used as a way to segment the market. If there are 2 companies making only FOB's and not making SSD's, only then competition will work like people want it to. Otherwise FOBs will be $1000 or something because only ``enterprise'' users are smart/dumb enough to demand them. Normally I would have a problem that the FOB and SSD are separable, but see, the FOB and SSD can be put together with double-sided tape: the tape only has to hold for 60 seconds after $event, and there's no way to separate the two by tripping over a cord. You can safely move SSD+FOB from one chassis to another without fearing all is lost if you jiggle the connection. I think it's okay overall. tt This risk is mostly mitigated by UPS backup and auto-shutdown tt when the UPS detects power loss, correct? no no it's about cutting off a class of failure cases and constraining ourselves to relatively sane forms of failure. We are not haggling about NO FAILURES EVAR yet.
First, for STEP 1 we isolate the insane kinds of failure that cost us days or months of data rather than just a few seconds, the kinds that call for crazy unplannable ad-hoc recovery methods like `Viktor plz help me' and ``is anyone here a Postgres data recovery expert?'' and ``is there a way I can invalidate the batch of billing auth requests I uploaded yesterday so I can rerun it without double-billing anyone?'' For STEP 1 we make the insane fail almost impossible through clever software and planning. A UPS never never ever qualifies as ``almost impossible''. Then, once that's done, we come back for STEP 2 where we try to minimize the sane failures also, and for step 2 things like UPS might be useful. For STEP 2 it makes sense to talk about percent availability, probability of failure, length of time to recover from Scenario X. but in STEP 1 all the failures are insane
Re: [zfs-discuss] ZFS memory recommendations
et == Erik Trimble erik.trim...@oracle.com writes: et frequently-accessed files from multiple VMs are in fact et identical, and thus with dedup, you'd only need to store one et copy in the cache. although it sounds counterintuitive, I thought this wasn't part of the initial release. Maybe I'm wrong altogether or maybe it got added later? http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comment-1257191094000 pgp4W7jhfu4MV.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
bh == Brandon High bh...@freaks.com writes: bh The devid for a USB device must change as it moves from port bh to port. I guess it was tl;dr the first time I said this, but: the old theory was that a USB device does not get a devid because it is marked ``removeable'' in some arcane SCSI page, for the same reason it doesn't make sense to give a CD-ROM a devid because its medium can be removed. I don't know if this has changed, or if it's even what's really going on. but like I said without the ramdisk boot option it's more important to fix this type of problem, so if someone has a workaround please share! pgpkdrT55NtZq.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
jcm == James C McPherson james.mcpher...@oracle.com writes: storage controllers are more difficult for driver support. jcm Be specific - put up, or shut up. marvell controller hangs machine when a drive is unplugged marvell controller does not support NCQ marvell driver is closed-source blob sil3124 driver has lost interrupt problems ATI SB600/SB700 AHCI driver has performance problems mpt driver has disconnect under heavy load problems that may or may not be MSI-related mpt driver is closed source blob mpt driver is not SATA framework and thus does not work with DVD-ROMS or with smartctl XXX -- smartctl does work now, with '-d sat,12'? or only AHCI works with that? USUAL SUGGESTION: use 1068e non-raid and mpt driver, live with problems USUAL OPTIMISM: lsi2008 / mega_sas, which i THINK are open source but opengrok seems to be down so I did not verify. My perception is if you are using external cards which you know work for networking and storage, then you should be alright. Am I out in left-field on this? jcm I believe you are talking through your hat. network performance problems with realtek network performance problems with nvidia nforce network working-at-all problems with broadcom bge and bnx because of the ludicrous number of chip steppings and errata closed-source blob drivers with broadcom bnx performance and working-at-all problems for atheros L1 USUAL SUGGESTION: use intel 82540 derivative. which, for an AMD board, will almost always be an external card because AMD boards are usually realtek, broadcom, or marvell for AMD chipsets, and realtek or nforce for nVidia chipsets (if anyone still uses nvidia chipsets). FAIR STATEMENT: Linux shares most of these problems except over there bnx is open source. USUAL OPTIMISM: crossbow-supported cards with L4 classifiers in the MAC other than bnx, such as 10gig ones, may be the future, much more performant, ready for CoS pause frames, and good multicore performance, and having source. god willing their quality might turn out to be more uniform but probably nobody knows yet, and they're not cheap and ubiquitous onboard yet. I'm hoping infiniband comes back and 10gig goes away, but that's probably not realistic. WELL POISONING: saying ``if you want open-source drivers go whine at the hardware vendor because they make us sign an NDA, so there's nothing we can do,'' is hogwash. (a) Sun's the one able to realistically bargain with the vendor, not users, because they bring to the table developer hours, OS support, a class of customers, trusting contacts within the vendor, and a hardware manufacturing arm that can make purchasing decisions long-term and at a motherboard component level; no user has anywhere near this insane level of bargaining power; see OpenBSD presentation and ``the OEM problem'', (b) usually only one chip works anyway, so there is no competition, (c) Linux has open source drivers for all these chips and is an existence proof that yes, you can do something about it, and (d) the competition for users is between Solaris and Linux, not between Marvell and LSI. If we want complete source for the OS we will get it faster and more reliably by going to the OS that offers it, not by whining to chip vendors. This is not flamebait but just obvious reality---so obvious that almost everyone who really cares enough to say it is already gone. HTH, HAND. pgpGJkSjxmX5x.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
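On the smartctl question above, the invocation people report working through a SAT layer is along these lines -- the device path is an example, and whether it makes it through your particular driver or an SAS expander is exactly the open question marked XXX:

    # smartctl -a -d sat,12 /dev/rdsk/c3t0d0s0

The ',12' asks for the 12-byte SAT pass-through variant; some controllers only accept that form, some only the 16-byte default, and some (closed-blob mpt setups) reportedly accept neither.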
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
bh == Brandon High bh...@freaks.com writes: bh If you boot from usb and move your rpool from one port to bh another, you can't boot. If you plug your boot sata drive into bh a different port on the motherboard, you can't bh boot. Apparently if you are missing a device from your rpool bh mirror, you can't boot. yeah, this is retarded and should get fixed. bh zpool.cache saves the device path to make importing pools bh faster. It would be nice if there was a boot flag you could bh give it to ignore the file... I've no doubt this is true but ISTR it's not related to the booting problem above because I do not think zpool.cache is used to find the root pool. It's only used for finding other pools. ISTR the root pool is found through devid's that grub reads from the label on the BIOS device it picks, and then passes to the kernel. note that zpool.cache is ON THE POOL, so it can't be used to find the pool (ok, it can---on x86 it can be sync'ed into the boot archive, and on SPARC it can be read through the PROM---but although I could be wrong ISTR this is not what's actually done). I think you'll find you CAN move drives among sata ports, just not among controller types, because the devid is a blob generated by the disk driver, and pci-ide and AHCI will yield up different devid's for the same disk. Grub never calculates a devid, just reads one from the label (reads a devid that some earlier kernel got from pci-ide or ahci and wrote into the label). so when ports and device names change, rewriting labels is helpful but not urgent. When disk drivers change, rewriting labels is urgent. yeah, the fact that ramdisk booting isn't possible with opensolaris makes this whole situation a lot more serious than it was back when SXCE was still available for download. Is there any way to make a devid-proof rescue boot option? Is there a way to make grub boot an iso image off the hard disk for example? pgp84LsPjArBH.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hard drives for ZFS NAS
eg == Emily Grettel emilygrettelis...@hotmail.com writes: eg What do people already use on their enterprise level NAS's? For a SOHO NAS similar to the one you are running, I mix manufacturer types within a redundancy set so that a model-wide manufacturing or firmware glitch like the ones of which we've had several in the last few years does not take out an entire array, and to make it easier to figure out whether weird problems in iostat are controller/driver's fault, or drive's fault. If there are not enough manufacturers with good drives on offer, I'll try to buy two different models of the same manufacturer, ex get one of them an older model number of the same drive size/featureset. Often you find two mini-generations are on offer at once. At the moment, I would not buy any WD drive because they have been changing drives' behavior without changing model numbers which makes pointless discussions like this one because the model numbers become meaningless and you cannot bind your experience to a repeatable purchasing decision other than ``do/don't buy WD''. When the dust settles from this silent-firmware-version-bumps and 4k-sector disaster, I would buy WD again because the more mfg diversity, the more bad-batch-proofing you have for wide stripes. I used to buy near-line drives but no longer do this because it's cheaper to buy two regular drives than one near-line drive, but this may be a mistake because of the whole vibration disaster. pgpRFDcJerIaG.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hard drives for ZFS NAS
bh == Brandon High bh...@freaks.com writes: bh From what I've read, the Hitachi and Samsung drives both bh support CCTL, which is in the ATA-8 spec. There's no way to bh toggle it on from OpenSolaris (yet) and it doesn't persist bh through reboot so it's not really ideal. bh Here's a patch to smartmontools that is supposed to enable bh it. It's in the SVN version 5.40 but not the current 5.39 bh release: http://www.csc.liv.ac.uk/~greg/projects/erc/ That's good to know. It would be interesting to know if the smartctl command in question can actually make it through a solaris system, and on what disk driver. AHCI and mpt are different because one is SATA framework and one isn't. I wonder also if SAS expanders cause any problems for smartctl? also, has anyone actually found this feature to have any value at all? To be clear, I do understand what the feature does. I do not need it explained to me again. but AIUI with ZFS you must remove a partially failing drive, or else the entire pool becomes slow. It does not matter if the partially-failing drive is returning commands in 30sec (the ATA maximum) or 7sec by CCTL/TLER/---you must still find and remove it, or the zpool will become pathologically slow. If there is actual experience with the feature helping ZFS, I'd be interested, but so far I think people are just echoing wikipedia shoulds and speculations, right? pgpOauVvynd3C.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
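For reference, the patched/5.40 smartctl mentioned above sets the error-recovery limit like this -- times are in tenths of a second, the device path is an example, and as noted the setting does not survive a power cycle, which is part of why it's not really ideal:

    # smartctl -l scterc,70,70 /dev/rdsk/c5t0d0s0
    # smartctl -l scterc /dev/rdsk/c5t0d0s0

The first form sets a 7.0 second read and write recovery limit; the second just queries what's currently set. Whether the command survives the Solaris driver stack or an expander is the same open question as with the other smartctl pass-through commands.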
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
bh == Brandon High bh...@freaks.com writes: bh The drive should be on the same USB port because the device bh path is saved in the zpool.cache. If you removed the bh zpool.cache, it wouldn't matter where the drive was plugged bh in. I thought it was supposed to go by devid. There was a bug a while ago that Solaris won't calculate devid for devices that say over SCSI they are ``removeable'' because, in the sense that a DynaMO or DVD-R is ``removeable'', the serial number returned by various identity commands or mode pages isn't bound to any set of stored bits, and the way devid's are used throughout Solaris means they are like a namespace or an array-of for a set of bit-stores so it's not appropriate for a DVD-R drive to have a devid. A DVD disc could have one, though---in fact a release of a pressed disc could appropriately have a non-serialized devid. However USB stick designers used to working with Microsoft don't bother to think through how the SCSI architecture should work in a sane world because they are used to reading chatty-idiot Microsoft manuals, so they fill out the page like a beaurocratic form with whatever feels appropriate and mark USB sticks ``removeable'', which according to the standard and to a sane implementer is a warning that the virtual SCSI disk attached to the virtual SCSI host adapter inside the USB pod might be soldered to removeable FLASH chips. It's quite stupid because before the OS has even determined what kind of USB device is plugged in, it already knows the device is removeable in that sense, just like it knows hot-swap SATA is removeable. USB is no more removeable, even in practical use, than SATA. (eSATA! *slap*) Even in the case of CF readers, it's probably wrong most of the time to set the removeable SCSI flag because the connection that's severable is between the virtual SCSI adapter in the ``reader'' and the virtual SCSI disk in the CF/SD/... card, while the removeable flag indicates severability between SCSI disk and storage medium. In the CF/SD/... reader case the serial number in the IDENTIFY command or mode pages will come from CF/SD/... and remain bound to the bits. The only case that might call for setting the bit is where the adapter is synthesizing a fake mode page where the removeable bit appears, but even then the bit should be clear so long as any serialized fields in other commands and mode pages are still serialized somehow (whether synthesized or not). Actual removeable in-the-scsi-standard's-sense HARD DISK drives mostly don't exist, and real removeable things in the real world attach as optical where an understanding of their removeability is embedded in the driver: ANYTHING the cd driver attaches will be treated removeable. consequently the bit is useless to the way solaris is using it, and does little more than break USB support in ways like this, but the developers refuse to let go of their dreams about what the bit was supposed to mean even though a flood of reality has guaranteed at this point their dream will never come true. I think there was some magical simon-sez flag they added to /kernel/drv/whatever.conf so the bug could be closed, so you might go hunting for that flag in which they will surely want you to encode in a baroque case-sensitive undocumented notation that ``The Microtraveler model 477217045 serial 80502813 attached to driver/hub/hub/port/function has a LYING REMOVEABLE FLAG'', but maybe you can somehow set it to '*' and rejoin reality. Still this won't help you on livecd's. 
It's probably wiser to walk away from USB unless/until there's a serious will to adopt the practical mindset needed to support it reasonably. pgpAoBbGUMwdU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is it safe to disable the swap partition?
mg == Mike Gerdts mger...@gmail.com writes: mg If Solaris is under memory pressure, [...] mg The best thing to do with processes that can be swapped out mg forever is to not run them. Many programs allocate memory they never use. Linux allows overcommitting by default (but disableable), but Solaris doesn't and can't, so on a Solaris system without swap those allocations turn into physical RAM that can never be used. At the time the never-to-be-used pages are allocated, ARC must be dumped to make room for them. With swap, pages that are allocated but never written can be backed by swap, and the ARC doesn't need to be dumped until the pages are actually written. Note that, in this hypothetical story, swap is never written at all, but it still has to be there. If you run a java vm on your ``storage server'', then you might care about this. I think the no-swap dogma is very soothing and yet very obviously wrong. If you want to get into the overcommit game, fine. If you want to play a game where you will overcommit up to the size of the ARC, well, ``meh'', but fine. Until then, though, swap makes sense. pgpA7wEb34DwB.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
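Adding swap on a ZFS-root box is cheap, for what it's worth -- a sketch, with the size and dataset name made up:

    # zfs create -V 4G rpool/swap2
    # swap -a /dev/zvol/dsk/rpool/swap2
    # swap -l
    # swap -s

swap -l and swap -s will show whether the space is ever actually written to; in the never-used-allocation story above, it won't be, but its existence is what lets the ARC stay resident.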
Re: [zfs-discuss] ZFS, NFS, and ACLs ssues
mef == Mary Ellen Fitzpatrick mfitz...@bu.edu writes: mef Is there a way to set permissions so that the /etc/auto.home mef file on the clients does not list every exported dir/mount mef point? If I understand the question right, then, no. These maps are very traditional from the earliest days of NIS and need to be managed centrally, and they should match the structure of filesystems exported from servers exactly. The new scheme of ``mirror mounts'' and ``referrals'' which does away with the global automount map and sprinkles bits of the map onto individual servers, is oft-discussed but seldom implemented, and it's not at all traditional so it's unclear to me that it's ever going to become The Way even though everyone talks about it like it's _fait accompli_. I only mention it to poison the well: if someone tries to discuss this with you, you should immediately close your ears because they are only dreaming for now, and none of it works yet. Unfortunately, autofs implementations' quality varies widely. I think Linux is on their...fourth? rewrite of the whole automount framework, and Mac OS X on at least their second if not third. I found Mac OS X's one is poor at handling nested mounts like what you're doing compared to the solaris one. The apple people sneakily altered the automounter documentation to remove all examples showing nested mounts, without actually documenting frankly the limitation which surely prompted them to alter the examples. slimey fucks. You can work around their fail using the 'net' option but this prevents assembling subtrees from several different servers. Each of your nested subtrees must be from the same server when using the 'net' option workaround because you lose the right to choose where they're mounted: http://web.ivy.net/~carton/rant/macos-automounter.html#9050149 The linux one will do nested subtrees, but I think you need to express the entire subtree as a single automount record, with a single trigger. This is different from Mac OS X with-workaround which will (provided you use 'net') miraculously assemble a view of the entire subtree from several dscl records which in theory could even be on LDAP. so, Linux will automount and unmount everything together, while Mac OS X will not. You might reasonably wish to have the mountpoints within the automounted filesystem turn into triggers themselves so that parts of the subtree are only mounted just as deeply as and only along the branch needed to satisfy the trigger---that way a subtree could be assembled from many servers, and if the map for a deep corner of the subtree were changed and pushed, clients could start obeying the changes sooner. but I think on Linux this won't work. not sure it works anywhere though. I guess it sort of works on Mac OS X with heavy caveat, but not sure about Solaris. carton -hard,intr,noacl/ cash:/export/home/ \ /VDI cash:/export/home//VDI but although Linux beats the Mac here, the linux one is shit at handling direct mounts: if you give it a subdirectory in which it owns all of the fake files, like /home, it works ok. but if you want it to for example automount /arrchive (just the one filesystem onto /arrchive from one share on one server) I found it hardly works at all. so I have /arrchive automounted on my Solaris boxes, and /remedial-automount/arrchive with symlink /archive - remedial-automount/arrchive on Linux boxes. so much for one traditional centrally-managed map. 
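For the record, a hierarchical (multi-mount) entry looks roughly like this -- the server and key names here are made up, and the master-map file is /etc/auto_master on Solaris versus /etc/auto.master on Linux:

    /etc/auto_master:
        /home    auto_home

    auto_home:
        someuser  -hard,intr,noacl  /     cash:/export/home/someuser \
                                    /VDI  cash:/export/home/someuser/VDI

After editing maps, running 'automount -v' makes the daemon re-read them, subject to the already-mounted caveat described above.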
pgppJ7OZtGw8p.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
dm == David Magda dma...@ee.ryerson.ca writes: dm Given that ZFS is always consistent on-disk, why would you dm lose a pool if you lose the ZIL and/or cache file? because of lazy assertions inside 'zpool import'. you are right there is no fundamental reason for it---it's just code that doesn't exist. If you are a developer you can probably still recover your pool, but there aren't any commands with a supported interface to do it. 'zpool.cache' doesn't contain magical information, but it allows you to pass through a different code path that doesn't include the ``BrrkBrrk, omg panic device missing, BAIL OUT HERE'' checks. I don't think squirreling away copies of zpool.cache is a great way to make your pool safe from slog failures because there may be other things about the different manual 'zpool import' codepath that you need during a disaster, like -F, which will remain inaccessible to you if you rely on some saving-your-zpool.cache hack, even if your hack ends up actually working when the time comes, which it might not. The case I think is really interesting is an HA cluster using a single-device slog made from a ramdisk on the passive node. This case would also become safer if slogs were fully disposable. pgpmcPw2Mcugv.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
re == Richard Elling richard.ell...@gmail.com writes: A failed unmirrored log device would be the permanent death of the pool. re It has also been shown that such pools are recoverable, albeit re with tedious, manual procedures required. for the 100th time, No, they're not, not if you lose zpool.cache also. pgpuUVBmI8w1p.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
re == Richard Elling richard.ell...@gmail.com writes: re a well managed system will not lose zpool.cache or any other re file. I would complain this was circular reasoning if it weren't such obvious chest-puffing bullshit. It's normal even to the extent of being a best practice to have no redundancy for rpool on systems that can tolerate gaps in availability because you can reinstall from the livecd relatively quickly. re It is disingenuous to complain about multiple failures strongly disagree. I'm quite genuine. A really common and really terrible suggestion is, ``get an SSD, and put your rpool in one slice and your slog in another.'' If you do that and lose the SSD, you've lost the whole pool. You cannot recover with 'zpool clear' or any number of -f -F -FFF flags. This common scenario doesn't require any multiple failure. Now, even among those who don't do this, people following your suggestions will not design their systems realizing the rpool and the SSD make up a redundant pair. They will not see: you can lose the rpool and import the pool IFF you have the SSD, and you can lose the SSD and force-online the pool IFF you have the rpool with the missing-slog pool already imported to it. They will instead design following the raidz/mirroring failure rules treating slog as disposable, like you've told them, and this is flat wrong. Hiding behind fuzzy glossary terms like ``multiple failures'' is useless, IMHO to the point of being deliberately obtuse. Besides that, you don't need any multiple failures---all you need to do is make the mistake of typing the perfectly reasonable command 'zpool export' in the course of trying to fix your problem, and poof, your whole pool is gone. A pool that runs fine until you try to export and re-import it, after which it is permanently lost, is a ticking time bomb. I don't think it's a good idea to run that way at all because of the flexible tools one needs to have available for maintenance in a disaster (ex., livecd of newer version with special import -F rescue-magic in it, WONT WORK. moving drives to a different controller causing them to have a different devid, WONT WORK. accumulate enough of these and not only does your toolkit get smaller and weaker, but you must move slowly and with great fear because the slightest move can make everything explode in totally unobvious ways.). If you do want to run this way, as an absolute MINIMUM, you need to discuss this cannot-import case at moments like this one so that it can influence people's designs. It seems if I say it the long way, I get ignored. If I say it the short way, you dive into every corner case. I don't know how to be any more clear, so...good luck out there, y'all. pgplz3pxj4vHy.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Secure delete?
edm == Eric D Mudama edmud...@bounceswoosh.org writes: edm How would you stripe or manage a dataset across a mix of edm devices with different geometries? the ``geometry'' discussed is 1-dimensional: sector size. The way that you do it is to align all writes, and never write anything smaller than the sector size. The rule is very simple, and you can also start or stop following it at any moment without rewriting any of the dataset and still get the full benefit. pgpj2CsEgHKlY.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
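the same rule stated as dd commands, if that helps (a sketch---the disk name is made up, and obviously don't point this at a disk you care about):
-8-
# always write whole 4kB blocks at offsets that are multiples of 4kB
dd if=/dev/zero of=/dev/rdsk/c9t9d9s0 bs=4k count=1 oseek=100    # aligned: one write, no read needed
dd if=/dev/zero of=/dev/rdsk/c9t9d9s0 bs=512 count=1 oseek=801   # 512B write: forces a read-modify-write inside the device
-8-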
Re: [zfs-discuss] Secure delete?
edm == Eric D Mudama edmud...@bounceswoosh.org writes: edm What you're suggesting is exactly what SSD vendors already do. no, it's not. You have to do it for them. edm They present a 512B standard host interface sector size, and edm perform their own translations and management inside the edm device. It is not nearly so magical! The pages are 2 - 4kB. Their size has nothing to do with the erase block size or the secret blackbox filesystem running on the SSD. It's because of the ECC, because the reed-solomon for the entire block must be recalculated if any of the block is changed. Therefore, changing 0.5kB means: for a 4kB page device: * read 4kB * write 4kB for a 2kB page device: * read 2kB * write 2kB and changing 4kB at offset integer * 4kB means: for a 4kB device: * write 4kB for a 2kB device: * write 4kB It does not matter if all devices have the same page size or not. Just write at the biggest size, or write at the appropriate size if you can. The important thing is that you write a whole page, even if you just pad with zeroes, so the controller does not have to do any reading. simple. the problem with big-sector spinning hard drives and alignment/blocksize is exactly the same problem. non-ZFS people discuss it a lot because ZFS filesystems start at integer * rather large block offset, thanks to all the disk label hocus pocus, but NTFS filesystems often start at 16065 * 0.5kB. pgpEdIwHb5RuZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
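and if you want to see what write granularity ZFS itself promised a given pool's vdevs, the ashift in the label is the thing to look at (a sketch---pool name made up, and zdb output format shifts around between builds):
-8-
# ashift=9 means ZFS considers 512-byte writes fine;
# ashift=12 would mean every write is a whole 4kB
zdb -C tank | grep ashift
-8-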
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
jcm == James C McPherson james.mcpher...@oracle.com writes: ga == Günther Alka a...@hfg-gmuend.de writes: jcm I am amazed that you believe OpenSolaris binary distro has too jcm much desktop stuff. Most people I have come across are firmly jcm of the belief that it does not have enough. minification is stupid, anyway. It causes way more harm than good. I can understand not wanting to have weird flavour-of-the-month daemons running until you've been bothered to learn what they do, but not wanting to have their binaries on the disk is just silly. It's also annoying when some sysadmin minifies away xauth so that 'ssh -Y' doesn't work, or minifies away vi because he uses nano---his OCD becomes unreasonable bigotry taking the place of building a workable consensus platform, which is the proper task at hand when deciding what to include and how to present it. But it gets much worse when the minifiers start reaching into the packages themselves and turning off options. Ex., they will turn off the Perl/Python scripting support for some common package because they want to yank out Perl and Python to make the distribution smaller. Or they do not want to ship libX11.so, so they'll rebuild packages with X support switched off. Once they've done that, if you actually need those things, it will waste heaps more time to track down what went wrong. The existence of the knobs themselves is harmful enough, but the popular demand of idiots for this kind of knob wastes the time of the non-idiot packagers expected to provide it: they have to split the result of a single build into twenty tiny interdependent subpackages, shim dlopen() in there where it wasn't before (if it's a binary package system), and then go back and test the whole monster: wherever they drop the ball, you suffer, and while they're tossing the ball around they're spending time pandering to the damned minifiers instead of making and updating other packages which are actually useful to sane people. The insanity gets pushed further when whole packages start factoring core pieces of functionality into ``modules'', so now in Eye of Gnome, I have the ``double click on a picture to make it bigger Module.so''. I guess, if I want to make my system smaller, I can use the packaging system to remove the ability to double click on pictures and make them bigger? What the fuck? The minification fetish has spread out both directions from the packaging system and infected everything from the architecture of the source code to the user-visible menu structure of the app! Minification zealotry should stick to systems running from NOR flash like openwrt, or 1GB NAND systems like android. It's got no place on a system with disks. As a corollary, any minification based on busting a binary into .so's and then scattering the .so's into packages is stupid, because the package systems where minification makes sense are source-based and don't need that, in fact suffer from it because the split binaries contain more symbols and are larger in core and larger on disk. Just say no to minification if you're doing it because it ``feels'' right. Just knock it off. Go work on your car stereo, or develop perverted rituals with your espresso machine, instead. ga i installed opensolaris and my first impression was very ga disappointing. yeah. me, too: my first impression was ``the installer does not work at all without X11. oh, and BTW X11 does not work at all without nVidia haha, ENJOY.'' That was at least two years ago though. 
ga the gui was slow and not very intuitive and the only thing ga thats's running fine was the browser. wtf, mate? You complain the install is not minimal, but then you judge the overall system by the superficial impression its GUI makes? ga if someone will try it -its free. Is it? I don't really understand the nexenta license, which is why I don't bother with it. The opensolaris licensing is already confusing because parts of it are binary, and 'pkg' makes it very easy to install things with non-redistributable licenses, or extremely weird things like SunPro compilers that claim to have different licenses depending on what you use them for or how you define yourself as a person and include automatic agreements not to publish unfavorable benchmarks and other similar bullshit. It's admirable, important, and surprising to me that Solaris has actually managed to become a redistributable livecd with a modern package system (yeah, and where's your darwin livecd, fanboy?), but still because of the ecosystem opensolaris comes from you're constantly one enter key away from encumbering your system. If you want it to be free maybe use freebsd---then you still get ZFS but you get away from some of the lazy assertions, most of the binary disk drivers and mid-layers, and from the stupid legacy disk-labeling. FreeBSD also has a scripted build process all the way from source tree to .iso that you can run yourself.
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
dd == David Dyer-Bennet d...@dd-b.net writes: dd Is it possible to switch to b132 now, for example? yeah, this is not so bad. I know of two approaches: * genunix.org assembles livecd's of each bnnn tag. You can burn one, unplug from the internet, install it. It is nice to have a livecd capable of mounting whatever zpool and zfs version you are using. I'm not sure how they do this, but they do it. * see these untested but relatively safe-looking instructions (apologies to whoever posted them---i didn't write down the credit): formal IPS docs: http://dlc.sun.com/osol/docs/content/2009.06/IMGPACKAGESYS/index.html how to get a specific snv build with ips
-8-
Starting from OpenSolaris 2009.06 (snv_111b) active BE.
1) beadm create snv_111b-dev
2) beadm activate snv_111b-dev
3) reboot
4) pkg set-authority -O http://pkg.opensolaris.org/dev opensolaris.org
5) pkg install SUNWipkg
6) pkg list 'entire*'
7) beadm create snv_118
8) beadm mount snv_118 /mnt
9) pkg -R /mnt refresh
10) pkg -R /mnt install ent...@0.5.11-0.118
11) bootadm update-archive -R /mnt
12) beadm umount snv_118
13) beadm activate snv_118
14) reboot
Now you have a snv_118 development environment.
also see: http://defect.opensolaris.org/bz/show_bug.cgi?id=3436 which currently says about the same thing.
-8-
you see the bnnn is specified in line 10, ent...@0.5.11-0.nnn. There is no ``failsafe'' boot archive with opensolaris like the ramdisk-based one that was in the now-terminated SXCE, so you should make a failsafe boot option yourself by cloning a working BE and leaving that clone alone. and...make the failsafe clone new enough to understand your pool version or else it's not very useful. :) pgpxowC3Fu66n.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
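making your own failsafe is just cloning the BE you know boots, something like this (names made up):
-8-
beadm create -e snv_118 snv_118-failsafe   # clone the known-good BE and never touch it
beadm list                                 # the clone should show up as another boot entry to fall back on
-8-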
Re: [zfs-discuss] ZFS RaidZ recommendation
dm == David Magda dma...@ee.ryerson.ca writes: bf == Bob Friesenhahn bfrie...@simple.dallas.tx.us writes: dm OP may also want to look into the multi-platform pkgsrc for dm third-party open source software: +1. jucr.opensolaris.org seems to be based on RPM which is totally fail. RPM is the oldest, crappiest, most frustrating thing! packages are always frustrating but pkgsrc is designed to isolate itself from the idiosyncrasies of each host platform, through factoring. Its major weakness is upgrades, but with Solaris you can use zones and snapshots to make this a lot less painful: * run their ``bulk build'' inside a zone. The ``bulk build'' feature is like the jucr: it downloads stuff from all over the internet and builds it, generates a tree of static web pages to report its results, plus a repository of binary packages. Like jucr it does not build packages on an ordinary machine, but in a well-specified minimal environment which has installed only the packages named as build dependencies---between each package build the bulk scripts remove all not-needed packages. Thus you really need a separate machine, like a zone, for bulk building. There is a non-bulk way to build pkgsrc, but it's not as good. Except that unlike the jucr, the implementation of the bulk build is included in the pkgsrc distribution and supported and ordinary people who run pkgsrc are expected to use it themselves. * clone a zone, upgrade the packages inside it using the binary packages produced by the bulk build, and cut services over to the clone only after everything's working right. Both of these things are a bit painful with pkgsrc on normal systems and much easier with zones and ZFS. The type of upgrade that's guaranteed to work on pkgsrc is: * to take a snapshot of /usr/pkgsrc which *is* pkgsrc, all packages' build instructions, and no binaries under this tree * ``bulk build'' * replace all your current running packages with the new binary packages in the repository the bulk build made. In practice people usually rebuild less than that to upgrade a package, and it often works anyway, but if it doesn't work then you're left wondering ``is pkgsrc just broken again, or will a more thorough upgrade actually work?'' The coolest immediate trick is that you can run more than one bulk build with different starting options, ex SunPro vs gcc, 32 vs 64-bit. The first step of using pkgsrc is to ``bootstrap'' it, and during bootstrap you choose the C compiler and also whether to use host's or pkgsrc's versions of things like perl and pax and awk. You also choose prefixes for /usr /var and /etc and /var/db/pkg that will isolate all pkgsrc files from the rest of the system. In general this level of pathname flexibility is only achievable at build time, so only a source-based package system can pull off this trick. The corollary is that you can install more than one pkgsrc on a single system and choose between them with PATH. pkgsrc is generally designed to embed full pathnames of its shared libs, so this has got a good shot of working. You could have /usr/pkg64 and /usr/pkg32, or /usr/pkg-gcc and /usr/pkg-spro. pkgsrc will also build pkg_add, pkg_info, u.s.w. under /usr/pkg-gcc/bin which will point to /var/db/pkg-gcc or whatever to track what's installed, so you can have more than one pkg_add on a single system pointing to different sets of directories. 
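the multiple-prefix trick looks roughly like this (an untested sketch---the prefixes are whatever you like, and I don't remember the exact bootstrap flags, so check bootstrap --help and the mk.conf docs for the compiler setting):
-8-
cd /usr/pkgsrc/bootstrap
# one independent tree per compiler/ABI; each gets its own prefix and pkg database
./bootstrap --prefix /usr/pkg-gcc  --pkgdbdir /var/db/pkg-gcc
./bootstrap --prefix /usr/pkg-spro --pkgdbdir /var/db/pkg-spro
# pick one with PATH; each prefix carries its own pkg_add/pkg_info
PATH=/usr/pkg-gcc/bin:/usr/pkg-gcc/sbin:$PATH; export PATH
-8-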
You could also do weirder things like use different paths every time you do a bulk build, like /usr/pkg-20100130 and /usr/pkg-20100408, although it's very strange to do that so far. It would also be possible to use ugly post-Unix directory layouts, ex /pkg/marker/usr/bin and /pkg/marker/etc and /pkg/marker/var/db/pkg, and then make /pkg/marker into a ZFS that could be snapshotted and rolled back. It is odd in pkgsrc world to put /var/db/pkg tracking-database of what's installed into the same subtree as the installed stuff itself, but in the context of ZFS it makes sense to do that. However the pathnames will be fixed for a given set of binary packages, so whatever you do with the ZFS the results of bulk builds sharing a common ``bootstrap'' phase would have to stay mounted on the same directory. You cannot clone something to a new directory then add/remove packages. There was an attempt called ``pkgviews'' to do something like this, but I think it's ultimately doomed because the idea's not compartmentalized enough to work with every package. In general pkgsrc gives you a toolkit for dealing with suboptimal package trees where a lot of shit is broken. It's well-adapted to the ugly modern way we run Unixes, sealed, with only web facing the users, because you can dedicate an entire bulk build to one user-facing app. If you have an app that needs a one-line change to openldap, pkgsrc makes it easy to perform this 1-line change and rebuild 100 interdependent packages linked to your mutant library,
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
rs == Ragnar Sundblad ra...@csc.kth.se writes: rs use IPSEC to make IP address spoofing harder. IPsec with channel binding is win, but not until SA's are offloaded to the NIC and all NIC's can do IPsec AES at line rate. Until this happens you need to accept there will be some protocols used on SAN that are not on ``the Internet'' and for which your axiomatic security declarations don't apply, where the relevant features are things like doing the DNS lookup in the proper .rhosts manner and doing uRPF, minimum, and more optimistically stop adding new protocols without IPv6 support, and start adding support for multiple IP stacks / VRF's. If saying ``the only way to do any given thing is twicecrypted kerberized ipsec within dnssec namespaces'' is blocking doing these immediate plaintext things that allow a host to participate in both the internet and a SAN at once, well that's no good either. pgptkJNIK5h42.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
jr == Jeroen Roodhart j.r.roodh...@uva.nl writes: jr Running OSOL nv130. Power off the machine, removed the F20 and jr power back on. Machines boots OK and comes up normally with jr the following message in 'zpool status': yeah, but try it again and this time put rpool on the F20 as well and try to import the pool from a LiveCD: if you lose zpool.cache at this stage, your pool is toast. /end repeat mode pgpt1GZtrVxS6.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
enh == Edward Ned Harvey solar...@nedharvey.com writes: enh If you have zpool less than version 19 (when ability to remove enh log device was introduced) and you have a non-mirrored log enh device that failed, you had better treat the situation as an enh emergency. Ed the log device removal support is only good for adding a slog to try it out, then changing your mind and removing the slog (which was not possible before). It doesn't change the reliability situation one bit: pools with dead slogs are not importable. There've been threads on this for a while. It's well-discussed because it's an example of IMHO broken process of ``obviously a critical requirement but not technically part of the original RFE which is already late,'' as well as a dangerous pitfall for ZFS admins. I imagine the process works well in other cases to keep stuff granular enough that it can be prioritized effectively, but in this case it's made the slog feature significantly incomplete for a couple years and put many production systems in a precarious spot, and the whole mess was predicted before the slog feature was integrated. The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state enh You're speaking the opposite of common sense. Yeah, I'm doing it on purpose to suggest that just guessing how you feel things ought to work based on vague notions of economy isn't a good idea. enh If disabling the ZIL makes the system faster *and* less prone enh to data corruption, please explain why we don't all disable enh the ZIL? I said complying with fsync can make the system recover to a state not equal to one you might have hypothetically snapshotted in a moment leading up to the crash. Elsewhere I might've said disabling the ZIL does not make the system more prone to data corruption, *iff* you are not an NFS server. If you are, disabling the ZIL can lead to lost writes if an NFS server reboots and an NFS client does not, which can definitely cause app-level data corruption. Disabling the ZIL breaks the D requirement of ACID databases which might screw up apps that replicate, or keep databases on several separate servers in sync, and it might lead to lost mail on an MTA, but because unlike non-COW filesystems it costs nothing extra for ZFS to preserve write ordering even without fsync(), AIUI you will not get corrupted application-level data by disabling the ZIL. you just get missing data that the app has a right to expect should be there. The dire warnings written by kernel developers in the wikis of ``don't EVER disable the ZIL'' are totally ridiculous and inappropriate IMO. I think they probably just worked really hard to write the ZIL piece of ZFS, and don't want people telling their brilliant code to fuckoff just because it makes things a little slower. so we get all this ``enterprise'' snobbery and so on. ``crash consistent'' is a technical term not a common-sense term, and I may have used it incorrectly: http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html With a system that loses power on which fsync() had been in use, the files getting fsync()'ed will probably recover to more recent versions than the rest of the files, which means the recovered state achieved by yanking the cord couldn't have been emulated by cloning a snapshot and not actually having lost power. However, the app calling fsync() will expect this, so it's not supposed to lead to application-level inconsistency. 
If you test your app's recovery ability in just that way, by cloning snapshots of filesystems on which the app is actively writing and then seeing if the app can recover the clone, then you're unfortunately not testing the app quite hard enough if fsync() is involved, so yeah I guess disabling the ZIL might in theory make incorrectly-written apps less prone to data corruption. Likewise, no testing of the app on a ZFS will be aggressive enough to make the app powerfail-proof on a non-COW POSIX system because ZFS keeps more ordering than the API actually guarantees to the app. I'm repeating myself though. I wish you'd just read my posts with at least paragraph granularity instead of just picking out individual sentences and discarding everything that seems too complicated or too awkwardly stated. I'm basing this all on the ``common sense'' that to do otherwise, fsync() would have to completely ignore its filedescriptor argument. It'd have to copy the entire in-memory ZIL to the slog and behave the same as 'lockfs -fa', which I think would perform too badly compared to non-ZFS filesystems' fsync()s, and would lead to emphatic performance advice like ``segregate files that get lots of fsync()s into separate ZFS datasets from files that get high write bandwidth,'' and we don't have advice like that in the blogs/lists/wikis which makes me think it's not beneficial (the benefit would be
Re: [zfs-discuss] dedup and memory/l2arc requirements
re == Richard Elling richard.ell...@gmail.com writes: re # ptime zdb -S zwimming Simulated DDT histogram: re refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE re Total 2.63M 277G 218G 225G 3.22M 337G 263G 270G re in-core size = 2.63M * 250 = 657.5 MB Thanks, that is really useful! It'll probably make the difference between trying dedup and not, for me. It is not working for me yet. It got to this point in prstat: 6754 root 2554M 1439M sleep 600 0:03:31 1.9% zdb/106 and then ran out of memory: $ pfexec ptime zdb -S tub out of memory -- generating core dump I might add some swap I guess. I will have to try it on another machine with more RAM and less pool, and see how the size of the zdb image compares to the calculated size of DDT needed. So long as zdb is the same or a little smaller than the DDT it predicts, the tool's still useful, just sometimes it will report ``DDT too big but not sure by how much'', by coredumping/thrashing instead of finishing. pgprpk9HSdr61.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
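fwiw, if anyone else hits the same wall, swap on a zvol is the quick fix (a sketch---size and names made up):
-8-
zfs create -V 8G rpool/swap2
swap -a /dev/zvol/dsk/rpool/swap2
swap -l    # confirm the new device is listed before rerunning zdb -S
-8-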
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
enh == Edward Ned Harvey solar...@nedharvey.com writes: enh Dude, don't be so arrogant. Acting like you know what I'm enh talking about better than I do. Face it that you have enh something to learn here. funny! AIUI you are wrong and Casper is right. ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do. The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (a state not equal to a snapshot you might have hypothetically taken in the seconds leading up to the crash), because files that were recently fsync()'d may be of newer versions than files that weren't---that is, fsync() durably commits only the file it references, by copying that *part* of the in-RAM ZIL to the durable slog. fsync() is not equivalent to 'lockfs -fa' committing every file on the system (is it?). I guess I could be wrong about that. If I'm right, this isn't a bad thing because apps that call fsync() are supposed to expect the inconsistency, but it's still important for understanding what's going on. pgpUNxWo30EYO.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool split problem?
la == Lori Alt lori@oracle.com writes: la I'm only pointing out that eliminating the zpool.cache file la would not enable root pools to be split. More work is la required for that. makes sense. All the same, please do not retaliate against the bug-opener by adding a lazy-assertion to prevent rpools from being split: this type of brittleness, ex. around all the many disk-labeling programs, is a large part of what makes Solaris systems feel flakey and unwelcoming to those who've used Linux, BSD, or Mac OS X. and AFAICT there is not much of it in the ZFS boot support so far---it's an uncluttered architecture that's quite friendly to creative abuse and impatient hacking. pgpy5Ksjv18Ne.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
rm == Robert Milkowski mi...@task.gda.pl writes: rm This is not true. If ZIL device would die *while pool is rm imported* then ZFS would start using z ZIL withing a pool and rm continue to operate. what you do not say is that a pool with dead zil cannot be 'import -f'd. So, for example, if your rpool and slog are on the same SSD, and it dies, you have just lost your whole pool. pgp9E0wFxqcc4.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
rm == Robert Milkowski mi...@task.gda.pl writes: rm the reason you get better performance out of the box on Linux rm as NFS server is that it actually behaves like with disabled rm ZIL careful. Solaris people have been slinging mud at linux for things unfsd did in spite of the fact knfsd has been around for a decade. and ``has options to behave like the ZIL is disabled (sync/async in /etc/exports)'' != ``always behaves like the ZIL is disabled''. If you are certain about Linux NFS servers not preserving data for hard mounts when the server reboots even with the 'sync' option which is the default, please confirm, but otherwise I do not believe you. rm Which is an expected behavior when you break NFS requirements rm as Linux does out of the box. wrong. The default is 'sync' in /etc/exports. The default has changed, but the default is 'sync', and the whole thing is well-documented. rm What would be useful though is to be able to easily disable rm ZIL per dataset instead of OS wide switch. yeah, Linux NFS servers have that granularity for their equivalent option. pgpg1qLhwVTDs.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
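for reference, the Linux per-export granularity looks like this in /etc/exports (paths and subnet made up); the nearest Solaris equivalent right now is the system-wide zil_disable tunable, which is exactly the blunt OS-wide switch being complained about (the mdb line is from memory, so double-check it before trusting it):
-8-
# Linux /etc/exports: 'sync' is the default and honours NFS COMMIT;
# 'async' on a single export is the per-dataset "ZIL disabled" equivalent
/export/home     192.168.1.0/24(rw,sync,no_subtree_check)
/export/scratch  192.168.1.0/24(rw,async,no_subtree_check)

# Solaris, whole-OS switch, takes effect for datasets mounted afterwards:
echo zil_disable/W0t1 | mdb -kw
-8-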
Re: [zfs-discuss] zpool cannot replace a replacing device
cm == Courtney Malone court...@courtneymalone.com writes: j == Jim biainmcna...@hotmail.com writes: j Thanks for the suggestion, but have tried detaching but it j refuses reporting no valid replicas. yeah this happened to someone else also, see list archives around 2008-12-03: cm I have a 10 drive raidz, recently one of the disks appeared to cm be generating errors (this later turned out to be a cable), cm # zpool replace data 17096229131581286394 c0t2d0 cm cannot replace 17096229131581286394 with c0t2d0: cannot cm replace a replacing device cm if i try to detach it i get: cm # zpool detach data 17096229131581286394 cm cannot detach 17096229131581286394: no valid replicas pgpKVbb2twZdu.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Dedup Performance
srbi == Steve Radich, BitShop, Inc ste...@bitshop.com writes: srbi http://www.bitshop.com/Blogs/tabid/95/EntryId/78/Bug-in-OpenSolaris-SMB-Server-causes-slow-disk-i-o-always.aspx I'm having trouble understanding many things in here like ``our file move'' (moving what from where to where with what protocol?) and ``with SMB running'' (with the server enabled on Solaris, with filesystems mounted, with activity on the mountpoints? what does running mean?) and ``RAID-0/stripe reads is the slow point'' (what does this mean? How did you determine which part of the stack is limiting the observed speed? This is normally quite difficult and requires comparing several experiments, not doing just one experiment like ``a file move between zfs pools''.). What is ``bytes the negotiated protocol allows''? mtu, mss, window size? Can you show us in what tool you see one number and where you see the other number that's too big? pgpAMuI2YHJGk.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is this pool recoverable?
sn == Sriram Narayanan sri...@belenix.org writes: sn http://docs.sun.com/app/docs/doc/817-2271/ghbxs?a=view yeah, but he has no slog, and he says 'zpool clear' makes the system panic and reboot, so even from way over here that link looks useless. Patrick, maybe try a newer livecd from genunix.org like b130 or later and see if the panic is fixed so that you can import/clear/export the pool. The new livecd's also have 'zpool import -F' for Fix Harder (see manpage first). Let us know what happens. pgpT7dIOFPNUD.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Error in zfs list output?
bh == Brandon High bh...@freaks.com writes: bh I think I'm seeing an error in the output from zfs list with bh regards to snapshot space utilization. no bug. You just need to think harder about it: the space used cannot be neatly put into buckets next to each snapshot that add up to the total, just because of...math. To help understand, suppose you decide, just to fuck things up, that from now on every time you take a snapshot you take two snapshots, with exactly zero filesystem writing happening between the two. What do you want 'zfs list' to say now? What does happen if you do that is it says all snapshots use zero space. the space shown in zfs list is the amount you'd get back if you deleted this one snapshot. Yes, every time you delete a snapshot, all the numbers reshuffle. Yes, there is a whole cat's cradle of space accounting information hidden in there that does not come out through 'zfs list'. pgpzRUSk68FzY.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
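you can watch the math do this to you (a sketch---dataset name made up):
-8-
zfs snapshot tank/fs@a
zfs snapshot tank/fs@b        # nothing written in between
zfs list -t snapshot -o name,used,refer -r tank/fs
# @a and @b both show USED = 0: every block either one references is
# also referenced by the other, so destroying just one frees nothing
-8-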
Re: [zfs-discuss] ZFS/OSOL/Firewire...
k == Khyron khyron4...@gmail.com writes: k FireWire is an Apple technology, so they have a vested k interest in making sure it works well [...] They could even k have a specific chipset that they exclusively use in their k systems, yes, you keep repeating yourselves, but there are only a few firewire host chips, like ohci and lynx, and apple uses the same ones as everyone else, no magic. Why would you speak such a complicated fantasy out loud without any reason to believe it other than your imaginations? I also tried to use firewire on Solaris long ago and had a lot of problems with it, both with the driver stack in Solaris and with the embedded software inside a cheaper non-Oxford case (Prolific). I think y'all forum users should stick to SAS/SATA for external disks and avoid firewire and USB both. Realize, though, that it is not just the chip driver but the entire software stack that influences speed and reliability. Even above what you normally consider the firewire stack, above all the mid-layer and scsi emulation stuff, Mac OS X for example is rigorous about handling force-unmounting, both with umount -f and disks that go away without warning. FreeBSD OTOH has major problems with force-unmounting, panicking and waiting forever. Solaris has problems too with freezing zpool maintenance commands, access to pools unrelated to the one with the device that went away, and NFS serving anything while any zpool is frozen. This is a problem even if you don't make a habit of yanking disks because it can make diagnosing problems really difficult: what if your case, like my non-Oxford one, has a firmware bug that makes it freeze up sometimes? or a flakey power supply or loose cable? If the OS does not stay up long enough to report the case detached, and stay sane enough for you to figure out what makes it retach (waiting a while, rebooting the case, jiggling the power connector, jiggling the data connector) then you will probably never figure out what's wrong with it, as I didn't for months, while if I'd had the same broken case on a Mac I'd have realized almost immediately that it sometimes detaches itself for no reason and retaches when I cycle its power switch but not when I plug/unplug its data cable and not when I reboot the Mac, so I'd know the case had buggy firmware, while with Solaris I just get these craazy panic messages. Once your exception handling reaches a certain level of crappiness, you cannot touch anything without everything collapsing. And on Solaris all this freezing/panicking behavior depends a lot on which disk driver you're using while Mac OS X it's, meh, basically working the same for SATA, USB, Firewire, or NFS client, and also you can mount images with hdiutil over NFS without getting weird checksum errors or deadlocks like you do with file or lofiadm-backed ZFS. 
(globalsan iscsi is still a mess though, worse than all other mac disk drivers and worse than the solaris initiator) I do not like the Mac OS much because it's slow, because the hardware's overpriced and fragile, because the only people running it inside VM's are using piratebay copies, and because I distrust Apple and strongly disapprove of their master plan both in intent and practice like the way they crippled dtrace, the displayport bullshit, and their terrible developer relations like nontransparent last-minute API yanking and ``agreements'' where you even have to agree not to discuss the agreement, and in general of their honing a talent for manipulating people into exploitable corners by slowly convincing them it's okay to feel lazy and entitled. But yes they've got some things relevant to server-side storage working better than Solaris does like handling flakey disks sanely, and providing source for the stable supported version of their OS not just the development version. pgpzf9yUTzCYk.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
djm == Darren J Moffat darren.mof...@oracle.com writes: djm I've logged CR# 6936195 ZFS send stream while checksumed djm isn't fault tollerant to keep track of that. Other tar/cpio-like tools are also able to: * verify the checksums without extracting (like scrub) * verify or even extract the stream using a small userland tool that writes files using POSIX functions, so that you can build the tool on not-Solaris or extract the data onto not-ZFS. The 'zfs send' stream can't be extracted without the solaris kernel, although yes the promise that newer kernels can extract older streams is a very helpful one. For example, ufsdump | ufsrestore could move UFS data into ZFS. but zfs send | zfs recv leaves us trapped on ZFS, even though migrating/restoring ZFS data onto a pNFS or Lustre backend is a realistic desire in the near term. * partial extract Personally, I could give up the third bullet point. Admittedly the second bullet is hard to manage while still backing up zvol's, pNFS / Lustre data-node datasets, windows ACL's, properties, snapshots/clones, u.s.w., so it's kind of...if you want both vanilla and chocolate cake at once, you're both going to be unhappy. But there should at least be *a* tool that can copy from zfs to NFSv4 while preserving windows ACL's, and the tool should build on other OS's that support NFSv4 and be capable of faithfully copying one NFSv4 tree to another preserving all the magical metadata. I know it sounds like ACL-aware rsync is unrelated to your (Darren) goal of tweaking 'zfs send' to be appropriate for backups, but for example before ZFS I could make a backup on the machine with disks attached to it or on an NFS client, and get exactly the same stream out. Likewise, I could restore into an NFS client. Sticking to a clean API instead of dumping the guts of the filesystem made the old stream formats more archival. The ``I need to extract a ZFS dataset so large that my only available container is a distributed Lustre filesystem'' use-case is pretty squarely within the archival realm, is going to be urgent in a year or so if it isn't already, and is accommodated by GNUtar, cpio, Amanda (even old ufsrestore Amanda), and all the big commercial backup tools. I admit it would be pretty damn cool if someone could write a purely userland version of 'zfs send' and 'zfs recv' that interact with the outside world using only POSIX file i/o and unix pipes but produce the standard deduped-ZFS-stream format, even if the hypothetical userland tool accomplishes this by including a FUSE-like amount of ZFS code and thus being quite hard to build. However, so far I don't think the goals of a replication tool: ``make a faithful and complete copy, efficiently, or else give an error,'' are compatible with the goals of an archival tool: ``extract robustly far into the future even in non-ideal and hard to predict circumstances such as different host kernel, different destination filesystem, corrupted stream, limited restore space.'' pgpyWHuwbuWZf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
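to make the first bullet concrete: with tar you can force a full read of an archive with nowhere to extract it to; the nearest zfs-side thing I know of is a dry-run receive, and AFAIK that only parses the stream rather than checksumming every record, and it still wants a pool to point at (a sketch---filenames and dataset names made up):
-8-
# tar/cpio style: verify the archive end to end, extract nothing
gtar -tvf /backup/home-20100401.tar > /dev/null

# zfs send style: the best I know of without a pool big enough to recv into
zfs send -R tank/home@20100401 > /backup/home-20100401.zsend
zfs recv -n -v scratch/restore < /backup/home-20100401.zsend
-8-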
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
c == Miles Nordin car...@ivy.net writes: mg == Mike Gerdts mger...@gmail.com writes: c are compatible with the goals of an archival tool: sorry, obviously I meant ``not compatible''. mg Richard Elling made an interesting observation that suggests mg that storing a zfs send data stream on tape is a quite mg reasonable thing to do. Richard's background makes me trust mg his analysis of this much more than I trust the typical person mg that says that zfs send output is poison. ssh and tape are perfect, yet whenever ZFS pools become corrupt Richard talks about scars on his knees from weak TCP checksums and lying disk drives and about creating a ``single protection domain'' of zfs checksums and redundancy instead of a bucket-brigade of fail of tcp into ssh into $blackbox_backup_Solution (likely involving unchecksummed disk storage) into SCSI/FC into ECC tapes. At worst, lying then or lying now? At best, the whole thing still strikes me as a pattern of banging a bunch of arcana into whatever shape's needed to fit the conclusion that ZFS is glorious and no further work is required to make it perfect. and there is still no way to validate a tape without extracting it, which is, last I worked with them, an optional but suggested part of $blackbox_backup_Solution (and one which, incidentally, helps with the bucket brigade problem Richard likes to point out). and the other archival problems of constraining the restore environment, and the fundamental incompatibility of goals between faithful replication and robust, future-proof archiving from my last post. pgpLLsyZQuSKJ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
k == Khyron khyron4...@gmail.com writes: k Star is probably perfect once it gets ZFS (e.g. NFS v4) ACL nope, because snapshots are lost and clones are expanded wrt their parents, and the original tree of snapshots/clones can never be restored. we are repeating, though. This is all in the archives. pgpTLTb9Ads3W.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
la == Lori Alt lori@sun.com writes: la This is no longer the case. The send stream format is now la versioned in such a way that future versions of Solaris will la be able to read send streams generated by earlier versions of la Solaris. Your memory of the thread is selective. This is only one of the several problems with it. If you are not concerned with bitflip gremlins on tape, then all the baloney about checksums and copies=2 metadata and insisting on zpool-level redundancy is just a bunch of opportunistic FUD. la The comment in the zfs(1M) manpage discouraging the la use of send streams for later restoration has been removed. The man page never warned of all the problems, nor did the si wiki. pgpCjAGUvOlWe.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
ea == erik ableson eable...@me.com writes: dc == Dennis Clarke dcla...@blastwave.org writes: rw,ro...@100.198.100.0/24, it works fine, and the NFS client can do the write without error. ea I' ve found that the NFS host based settings required the ea FQDN, and that the reverse lookup must be available in your ea DNS. I found, oddly, the @a.b.c.d/y syntax works only if the client's IP has reverse lookup. I had to add bogus hostnames to /etc/hosts for the whole /24 because if I didn't, for v3 it would reject mounts immediately, and for v4 mountd would core dump (and get restarted) which you see from the client as a mount that appears to hang. This is all using the @ip/mask syntax. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6901832 If you use hostnames instead, it makes sense that you would have to use FQDN's. If you want to rewrite mountd to allow using short hostnames, the access checking has to be done like this:
at export time: given hostname -> forward nss lookup -> list of IP's -> remember IP's
at mount time: client IP -> check against list of remembered IP's
but with fqdn's it can be:
at export time: given hostname -> remember it
at mount time: client IP -> reverse nss lookup -> check against remembered list
                         \-> forward lookup -> verify client IP among results
The second way, all the lookups happen at mount time rather than export time. This way the data in the nameservice can change without forcing you to learn and then invoke some kind of ``rescan the exported filesystems'' command or making mountd remember TTL's for its cached nss data, or any such complexity. Keep all the nameservice caching inside nscd so there is only one place to flush it! However the forward lookup is mandatory for security, not optional OCDism. Without it, anyone from any IP can access your NFS server so long as he has control of his reverse lookup, which he probably does. I hope mountd is doing that forward lookup! dc Try to use a backslash to escape those special chars like so : dc zfs set dc sharenfs=nosub\,nosuid\,rw\=hostname1\:hostname2\,root\=hostname2 dc zpoolname/zfsname/pathname wth? Commas and colons are not special characters. This is silly. pgptWVuUb6wBm.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
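the /etc/hosts workaround, roughly (subnet and names are made up, and perl only because older Solaris has no seq):
-8-
# give every address in the client /24 a bogus name so the reverse lookup succeeds
perl -e 'printf "100.198.100.%d nfsclient%d\n", $_, $_ for 1..254' >> /etc/hosts
# make sure 'files' is consulted on the hosts: line of /etc/nsswitch.conf, then
svcadm restart svc:/network/nfs/server    # so mountd picks it up
-8-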
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
dc == Dennis Clarke dcla...@blastwave.org writes: dc zfs set dc sharenfs=nosub\,nosuid\,rw\=hostname1\:hostname2\,root\=hostname2 dc zpoolname/zfsname/pathname wth? Commas and colons are not special characters. This is silly. dc Works real well. I said it was silly, not broken. It's cargo-cult. Try this: \z\f\s \s\e\t \s\h\a\r\e\n\f\s\=\n\o\s\u\b\,\n\o\s\u\i\d\,\r\w\=\h\o\s\t\n\a\m\e\1\:\h\o\s\t\n\a\m\e\2\,\r\o\o\t\=\h\o\s\t\n\a\m\e\2 \z\p\o\o\l\n\a\m\e\/\z\f\s\n\a\m\e\/\p\a\t\h\n\a\m\e works real well, too. pgp9sZc4ojaDX.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup zpool to tape
gd == Gregory Durham gregory.dur...@gmail.com writes: gd it to mount on boot I do not understand why you have a different at-boot-mounting problem with and without lofiadm: either way it's your script doing the importing explicitly, right? so just add lofiadm to your script. I guess you were exporting pools explicitly at shutdown because you didn't trust solaris to unmount the two levels of zfs in the right order? Anyway I would guess it doesn't matter because my ``back up file zpools to tape'' suggestion seems to be bogus bad advice. The other bug referenced in the one you quoted, 6915127, seems a lot more disruptive and says there are weird corruption problems with using file vdev's directly, and then there are deadlock problems with lofiadm from the two layers of zfs that haven't been ironed out yet. I guess file-based zpools do not work, and we're back to having no good plan that I can see to back up zpools to tape that preserves dedup, snapshots/clones, NFSv4 acl's, u.s.w. I assumed they did work because it looked like regression tests people were quoting and many examples depended upon them, but now it seems they don't, which explains some problems I had last month extracting an s10brand image from a .VDI. :( (iirc i got the image out using lofiadm and just assumed I was confused, banging away at things until they work and then forgetting about them. not good on me.) There is only zfs send which is made with replication in mind ( * it'll intentionally destroy the entire stream and any incremental descendents if there's a single bit-flip, which is a good feature to make sure the replication is retried if the copy's not faithful but a bad feature for tape. If ZFS rails against other filesystems for their fragile lack of metadata copies and checksums, why should the tape format be so oddly fragile that tape archives become massive gamma gremlin detectors? * and it has no scrub-like method analogous to 'tar t' or 'cpio -it' because it's assumed you'll always recv it in a situation where you've the opportunity to re-send, while a tape is something you might like to validate after transporting it or every few years. If pools need scrubbing why don't tapes? * and no partial-restore feature because it assumes if you don't have enough space on the destination for the entire dataset you'll use rsync or cpio or some other tree-granularity tool instead of the replication toolkit. a tool which does not fully exist (sparse files, 4GB files, NFSv4 ACL's), but that's a separate problem. ). how about zpools on zvol's. Does that avoid the deadlock/corruption bugs with file vdevs? It's not a workaround for the cases in the bug because they wanted to use NFS to replace iSCSI, but for backups, zvols might be okay, if they work? It's certainly possible to write them onto a tape (dd was originally meant for such things). pgpaynQ63iMAj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
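spelling out the zvol idea in case anyone wants to test it (a sketch---names, sizes, and the tape device are made up, and I have not verified the lofi/file-vdev deadlocks don't bite here too):
-8-
zfs create -V 200G tank/backvol
zpool create backpool /dev/zvol/dsk/tank/backvol
zfs send -R main/home@weekly | zfs recv -d backpool    # snapshots/clones/ACL's land inside backpool
zpool export backpool                                  # quiesce it before copying
dd if=/dev/zvol/rdsk/tank/backvol of=/dev/rmt/0n bs=1024k    # raw zvol straight to tape
-8-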
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
al == Adam Leventhal a...@eng.sun.com writes: al As always, we welcome feedback (although zfs-discuss is not al the appropriate forum), ``Please, you criticize our work in private while we compliment it in public.'' pgpyrrUQeYImd.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
gm == Gary Mills mi...@cc.umanitoba.ca writes: gm destroys the oldest snapshots and creates new ones, both gm recursively. I'd be curious: if you try taking the same snapshots non-recursively instead, does the pause go away? Because recursive snapshots are special: they're supposed to atomically synchronize the cut-point across all the filesystems involved, AIUI. I don't see that recursive destroys should be anything special though. gm Is it destroying old snapshots or creating new ones that gm causes this dead time? sort of seems like you should tell us this, not the other way around. :) Seriously though, isn't that easy to test? And I'm curious myself too. pgpnlnCUlJtvb.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
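i.e., something like this, so the only variable left is the atomic-recursive part (pool name made up):
-8-
# recursive: one synchronized cut across every descendent filesystem
zfs snapshot -r space@test-r

# non-recursive: the same set of snapshots, taken one filesystem at a time
for fs in `zfs list -H -o name -r space`; do
    zfs snapshot $fs@test-n
done
-8-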