Re: [zfs-discuss] 'cannot import 'andaman': I/O error', and failure to follow my own advice
c == Miles Nordin car...@ivy.net writes:

     c terabithia:/# zpool import andaman
     c cannot import 'andaman': I/O error
     c         Destroy and re-create the pool from
     c         a backup source.

snv_151, the proprietary release, was able to fix this.  I didn't try oi_148 first, so there's a chance it would've worked too if I'd given it a chance.

  root@solaris:~# zpool import -n -F -f 7400719929021713582
  Would be able to return andaman to its state as of April 3, 2011 03:53:23 PM PDT.
  Would discard approximately 31 seconds of transactions.

  root@solaris:~# zpool import -F -f 7400719929021713582
  Pool andaman returned to its state as of April 3, 2011 03:53:23 PM PDT.
  Discarded approximately 31 seconds of transactions.

so, ftr, seems not all 'import -F' are created equal. :)
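For anyone who lands here with the same wall, the general rewind-import sequence looks roughly like the below.  Pool name/GUID are placeholders, -n is the dry run, and some newer builds also grew a more aggressive -X rewind, so check your own zpool(1M) before leaning on any of it:

  # dry run: report what a rewind would discard, without actually importing
  zpool import -n -F -f <pool-name-or-guid>
  # if the report looks acceptable, do it for real
  zpool import -F -f <pool-name-or-guid>
  # then verify and scrub
  zpool status -v <pool>
  zpool scrub <pool>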
Re: [zfs-discuss] ZFS Going forward after Oracle - Let's get organized, let's get started.
js == Joerg Schilling joerg.schill...@fokus.fraunhofer.de writes: js This is interesting. Where is this group hosted? +1 I glance at the list after years of neglect (selfishly...after almost losing my pool), and see stuff like this: shady backroom irc-kiddie bullshit. please: names, mailing lists, urls, hg servers. Many of us have worked on legitimate open source projects before, you know. We know what one looks like, and it's not enshrouded in a tangle of passive-voice sentences and exclusive mafia language. Of course you're welcome to associate with one another however you like, and maybe the hostile mailing-list-flame tone of people like me is part of what makes you want to make all your infrastructure private. but if the goal of The ZFS Organization is to reassure people they should make new ZFS pools after the Oracle implosion and therefore fund Nexenta support (a worthy goal IMHO!), this path won't work on me nor my friends. I'm confident of that. And I would have thought by now it'd be clear brilliant developers can survive on the open internet, and the momentum's usually a lot better there (not to mention transparency/legitimacy/resiliency). good luck, I guess. pgpB7UGMl2IhD.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] 'cannot import 'andaman': I/O error', and failure to follow my own advice
I have a Solaris Express snv_130 box that imports a zpool from two iSCSI targets, and after some power problems I cannot import the pool.  When I found the machine, the pool was FAULTED with half of most mirrors showing CORRUPTED DATA and half showing UNAVAIL.  One of the two iSCSI enclosures was on, while the other was off.

When I brought the other iSCSI enclosure up, bringing all the devices in each of the seven mirror vdevs online, the box panicked.  It went into a panic loop every time it tried to import the problem pool at boot.

I disabled all the iSCSI targets that make up the problem pool and brought the box up, then saved a copy of /etc/zfs/zpool.cache and exported the UNAVAIL pool.  Then I turned the host off, brought back all the iSCSI targets, and booted without a crash, hoping I could 'zpool import' the problem pool.  (Another mirrored pool on the same pair of iSCSI enclosures came back fine and scrubbed with no errors.  shrug)

Here is what I get typing some basic commands:

-8<-
terabithia:/# zpool import
  pool: andaman
    id: 7400719929021713582
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        andaman      ONLINE
          mirror-0   ONLINE
            c3t43d0  ONLINE
            c3t48d0  ONLINE
          mirror-1   ONLINE
            c3t45d0  ONLINE
            c3t47d0  ONLINE
          mirror-2   ONLINE
            c3t52d0  ONLINE
            c3t59d0  ONLINE
          mirror-3   ONLINE
            c3t46d0  ONLINE
            c3t49d0  ONLINE
          mirror-4   ONLINE
            c3t50d0  ONLINE
            c3t44d0  ONLINE
          mirror-5   ONLINE
            c3t57d0  ONLINE
            c3t53d0  ONLINE
          mirror-6   ONLINE
            c3t54d0  ONLINE
            c3t51d0  ONLINE

terabithia:/# zpool import andaman
cannot import 'andaman': I/O error
        Destroy and re-create the pool from
        a backup source.

terabithia:/# zpool status
  pool: aboveground
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        aboveground  ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t16d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

terabithia:/# zpool import -F andaman
cannot import 'andaman': I/O error
        Destroy and re-create the pool from
        a backup source.

terabithia:/# zdb -ve andaman

Configuration for import:
        version: 22
        pool_guid: 7400719929021713582
        name: 'andaman'
        state: 0
        hostid: 2200768359
        hostname: 'terabithia.th3h.inner.chaos'
        vdev_children: 7
        vdev_tree:
            type: 'root'
            id: 0
            guid: 7400719929021713582
            children[0]:
                type: 'mirror'
                id: 0
                guid: 337393226491877361
                whole_disk: 0
                metaslab_array: 14
                metaslab_shift: 33
                ashift: 9
                asize: 1000191557632
                is_log: 0
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 1781150413433362160
                    phys_path: '/iscsi/disk@iqn.2006-11.chaos.inner.th3h.fishstick%3Asd-andaman0001,0:a'
                    whole_disk: 1
                    DTL: 91
                    path: '/dev/dsk/c3t43d0s0'
                    devid: 'id1,sd@t49455400020059100f00/a'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 7841235598547702997
                    phys_path: '/iscsi/disk@iqn.2006-11.chaos.inner.th3h%3Aoldfishstick%3Asd-andaman0001,1:a'
                    whole_disk: 1
                    DTL: 215
                    path: '/dev/dsk/c3t48d0s0'
                    devid: 'id1,sd@t494554000200880e0f00/a'
            children[1]:
                type: 'mirror'
                id: 1
                guid: 1953060080997571723
                whole_disk: 0
                metaslab_array: 210
                metaslab_shift: 33
                ashift: 9
                asize: 1000191557632
                is_log: 0
                children[0]:
                    type: 'disk'
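Tangent for anyone who finds this thread stuck in the same panic-at-boot loop: the generic version of the zpool.cache trick I used is roughly the below.  Paths are the stock Solaris ones, the /a mountpoint is what failsafe boot gives you, and the read-only import option only exists on newer builds, so check your zpool(1M) before counting on it:

  # from a failsafe shell / livecd with the root BE mounted at /a
  cp /a/etc/zfs/zpool.cache /a/etc/zfs/zpool.cache.saved   # keep a copy for zdb
  rm /a/etc/zfs/zpool.cache                                # nothing auto-imports at next boot
  # reboot normally, then try the import by hand, read-only first if your build supports it
  zpool import -o readonly=on -f andaman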
Re: [zfs-discuss] ZFS ... open source moving forward?
js == Joerg Schilling joerg.schill...@fokus.fraunhofer.de writes:

     GPLv3 might help with NetApp - Oracle pact while CDDL does not.

    js GPLv3 does not help at all with NetApp as the CDDL already
    js includes a patent grant with the maximum possible
    js coverage.

AIUI CDDL makes a user safe from Sun's patents only.  If NetApp contributed code under CDDL, then it would make users safe from NetApp patents applying to code netapp contributed, but NetApp didn't contribute any code so it does nothing.  no surprises here: Sun tries to prevent competitors from making poison contributions, which is something we should all do but is ``making the implicit grant explicit''.

GPLv3 was a response to the patent pact made between Novell and Microsoft, which if it had worked would have made Linux unfree and given control of it to Microsoft and Novell, because one would need to buy a license from Novell to use Linux, and Microsoft could have participated in setting terms for that license, which could be quite elaborate---like when RSA forced people to use the RSAREF library implementation of RSA to benefit from the limited patent grant---so these patent licenses have been used in the past not only to charge people who have source but also to take away software freedom from people who have source.  Their elaborateness can become really nefarious.

The GPLv3 attempted-protection mechanism is: if Novell negotiates any patent indemnity, it must apply to all users, not just Novell's users.  This is exactly what we should want to stay free in the shadow of the NetApp - Oracle deal, but I don't understand the legal mechanism that accomplishes it.  However I don't see anything remotely like this in CDDL, and am pretty sure although not 100% sure that I don't see it because it isn't there.  Unfortunately I do not understand it further, and I'm trying to limit the number of times I repeat myself, so welcome back to my killfile and please feel free to take the last word, but I'll only point out that I feel my understanding is more thorough than yours, Joerg, yet you are more certain your understanding is complete than I am of mine being complete, which is a big warning-sign to anyone who wants to take your blanket assertions as the end of the matter.

    js The interesting thing however is that the FSF
    js (before the GPLv3 exists) claimed that the CDDL is a bad
    js license _because_ of it's patent defense claims.  Now the FSF
    js does the same as the CDDL ;-)

If we are debating the merits of the backing organizations rather than the licenses themselves, then I think the more interesting thing is that Sun enticed a bunch of developers to trust their stewardship of the project by assigning copyright to Sun, then got bought by Oracle and became incapable of upholding their moral commitment, and changed the license to ``no source'', plus ``no commercial use of binaries, no publishing benchmarks,'' and a bunch of other completely crazy unfree boilerplate software oppression.  Your point, if it even survives an unmuddled understanding of the true patent clauses, vanishes next to that reversal.

but merits of backing organization is relevant for deciding about assigning your copyright to another or about including/striking the ``or any later version'' GPL clause.  The interaction between licenses and patents can be discussed apart from reputation, and probably should be, otherwise I would say ``nobody use CDDL because it is backed by Oracle,'' but I don't say that.
js You are obviously wrong here: The GPLv3 is definitevely js incompatible with the GPLv2 and most software does _not_ js include the or any later clause by intention. And you are writing in bad faith, uninformed, and in sentences that aren't internally consistent: GPLv2 with the clause is compatible with GPLv3 by upgrade, so it's not ``definitively'' incompatible. The official FSF-published version of GPLv2 does include the clause, so it would be ``by design'' compatible even if almost everyone struck the clause as you wrongly claim. And while it's overwhelmingly important that Linux kernel does strike the clause, still it is flatly untrue that ``most'' software does not include the clause: I gave examples that do include the clause (gcc and gnu libc and grub and all other FSF projects) while you have no examples at all, but there is no need to debate that since anyone can STFW instead of relying on a consistently unreliable party such as yourself. js OK, you just verified that you are just a troll. We need to js stop the discussion here. Did you miss the part where I said SFLC (authors of GPLv3) and Sun both advise that projects obtain copyright assignment from all developers? that this is normal, and probably a good idea? If so, you probably also missed the examples of good and bad consequences of assignment in the past? and the middle-ground offered by the ``or any later version'' clause? I am not really
Re: [zfs-discuss] ZFS ... open source moving forward?
js == Joerg Schilling joerg.schill...@fokus.fraunhofer.de delivered the following alternate reality of idealogical partisan hackery: js GPLv3 does not give you anything you don't have from CDDL js also. I think this is wrong. The patent indemnification is totally different: AIUI the CDDL makes the implicit patent license explicit and that's it, but GPLv3 does that and goes further by driving in a wedge against patent pacts, somehow. GPLv3 might help with NetApp - Oracle pact while CDDL does not. This is a big difference illustrated through a familiar and very relevant example---not sure how to do better than that, Joerg! js The GPLv3 is intentionally incompatible with the GPLv2 This is definitely wrong, if you dig into the detail more. Most GPLv2 programs include a clause ``or any later version'', so adding one GPLv3 file to them just makes the whole project GPLv3, and there's no real problem. Obviously this clause only makes sense if you trust the FSF, which I do so I include it, but Linus apparently didn't trust them so he struck the clause long ago. so GPLv3 and Apache are compatible while GPLv2 and GPLv3 are not, that is true and is designed. However GPLv2 was also designed to be upgradeable, which was absolutely the FSF's intent, to achieve compatibility, and they have done so with all their old projects like gcc and gnu libc. The usual way to accomplish license upgradeability is to delegate your copyright to the organization you trust to know the difference between ``upgrade'' and ``screw you over.'' That's the method Sun forced upon people who had to sign contributor agreements, and is also the method SFLC advises most new free software projects to adopt: don't let individual developers keep licenses, because they'll become obstinate ossified illogical partisan farts like Joerg, or will not answer email, so you can never ever change the license. FSF gives you this extra ``or any later version'' option to use, which is handy if you trust them to make your software more free in the future yet also want to keep your copyright so YOU can make it less free in the future, if you decide you want to. seems only fair to me, so long as you really did write all of it. GPLv3 is about as incompatible with GPLv2 as ``not giving any source at all'' is incompatible with CDDL. ie, if you delegated your copyright to Sun and contributed under CDDL, Sun has now ``upgraded'' your license to no-source-at-all, which is obviously CDDL-incompatible and by-design. The CDDL of course could never include an ``or any later version'' clause because it would be completely stupid: there's no reason to trust Sun/Oracle. IMHO this is a huge advantage of GPL---it's very easy to future-proof your work, provided you trust the FSF, which I'm sure Joerg does not, but many people do which is lucky for us who do. Joerg doesn't have anyone left to trust: if you donated your copyright to Sun to try to future-proof it against unexpected needed license changes, you're now screwed out of your original intent because they've altered the terms of the deal you thought you were getting. And if your clan of developers won't collectively trust anyone, you also lose because if your understanding of patents evolves in the future, your large old projects who refused-to-trust (like Linux!) are stuck with patent robustness much worse than it needs to be. pgpLNxOOb1hx9.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
ld == Linder, Doug doug.lin...@merchantlink.com writes: ld Very nice. So why isn't it in Fedora (for example)? I think it's slow and unstable? To me it's not clear yet whether it will be the first thing in the Linux world that's stable and has zfs-like capability. If ZFS were GPL it probably would have been, though. and I think I needed many other things from Solaris like zones, COMSTAR, IB, so I'll be trying to get those on Linux too before I can finally ditch these Solaris machines. so, at the time all those things are working, what will the best Linux filesystem be? maybe ZFS. ld I'll believe it when I see it in a big Linux distribution, ld supported like any other FS, and I can use it in production. ld Until then, it doesn't exist. yes. but it is not the license exactly that's keeping it out. I think the license is just annoying some of the Linux developers enough that they prefer to spend their effort elsewhere. ex., OpenBSD is also refusing to accept ZFS because of license, but in their case it is probably ``because we are forced to give source and don't want to''. I agree some of the haggling is stupid, but with all these jackmoves everywhere, saying ``I don't understand all this crap and want to code, so give me a license with a track record I can see, not the Dynacorp Public Goofylicense or something like that,'' is not a totally stupid position. I do wish people would do more than just code and try harder to learn the actual license details, though. pgpfD3JFx7B9z.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
ld == Linder, Doug doug.lin...@merchantlink.com writes: ld This list is for ZFS discussion. There are plenty of other ld places for License Wars and IP discussion. Did you miss the part where ZFS was forked by a license change? Did you miss Solaris Express 11 coming out with no source? Do you not understand everyone is looking for a place to get maintenance on their zpools without getting screwed over? and that whatever few people not too disgusted to walk away, like pjd and NetBSD and kqinfotech and so on, must worry about where to commit their patches and under what license they may use, or at least ``continue delegating to Sun, or stop?'' How can yuo call this OT at this point? ld I really don't care at all about licenses. I think you should start caring, because they affect you. Obviously your care is up to you, but you're also the one who offered to discuss it! ld Folks, I very much did not intend to start, nor do I want to ld participate in or perpetuate, any religious flame wars. yeah, but you're creating more drama by trying to cut off drama than you would by just letting people discuss. Sometimes these threads of ``excuse me but you are a flamer / no U / folks folks attention please everyone calm down / woah woah woah didn't mean to get your panties in a bunch'' is the real content-free post, not the actual disagreement which has some content in it. ld Is the issue important? Sure. Do I have time or interest to ld worry about niggly little details? No. Then you're lazy. Don't demand that others be lazy, too, because you're not only too lazy to care, but you're too lazy to skip their messages that you don't care about! ld personally very geeky about seems *hugely* important and you ld can't understand why others don't see that. Maybe it bugs you ld when people use GPL to mean open source, but the fact is ld that lots and lots of people do. It bugs me when Stallman ld tries to get everyone to use the ridiculous GNU/Linux, as if ld anyone would ever say that. It bugs me when people say I ld *could* care less. But I live with these things. If you live with them, why not live with them quietly? Listing what you don't care about is a lot less useful than talking about things that only some people care about. I think virtually no one cares to keep track of what unique things you don't care about, yet confusingly you seem to present your post as a way to avoid useless discussion. You already know others DO care about it, so? ld I regret and apologize for my callous disregard in casually ld tossing around a clearly incendiary term like GPL. no problem! But if you really regret it then you won't mind when you do it again and get corrected again. pgpIyZ1kwS2kR.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
bf == Bob Friesenhahn bfrie...@simple.dallas.tx.us writes: bf Perhaps it is better for Linux if it is GPLv2, but probably bf not if it is GPLv3. That's my understanding: GPLv3 is the one you would need to preserve software freedom under deals like NetApp-Oracle patent pact, http://www.gnu.org/licenses/rms-why-gplv3.html#patent-protection but GPLv3 is not compatible with Linux because the kernel is GPLv2 but stupidly/stubbornly deleted the ``or any later version'' language, meaning GPLv3 is not any more Linux-compatible than CDDL. however given how widely-used binary modules are to supposedly get around the license incompatibility, many might consider the GPLv3 patent protections worth more than license compatibility, if your goal is software freedom, or a predictable future for your business. pgphyRH6AbXxf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
rs == Robert Soubie robert.sou...@free.fr writes:

    rs Don't you forget that these companies also do much of their
    rs business in foreign countries (Europe, Asia) where software
    rs patenting is not allowed,

dated myth.  software patents do exist in europe, and the EPO has issued them.  Fewer are issued, and then there's more of an enforceability question because unlike the US, Europe has true federalism, but they still exist.  If you google for 'software patents europe' there is stuff explaining this on the first page.

The EU patent debate seems to me about fighting attempts to globally homogenize patents so that mountains of new patents would suddenly become valid in Europe, and companies could jurisdiction-shop so you would lose democratic control of the system's future.  It's definitely not as simple or as good as ``preserve the status quo of no software patents.''  The European status quo is already not good enough to be safe.  It's just vastly better than the future WIPO ASSO wants to bring you.

    rs where American law is not applicable,

Unfortunately I think American law is always applicable because it seems patent law lets you sue almost anyone you like---the guy who wrote it, the company that distributed it, the customer who bought it.  Only one has to be American, so American patents can be monetized with few Americans involved.  When companies are conducting business negotiations based on the threat of lawsuit rather than the result, these suits don't have to get very far for the blackmail to translate into ``value.''  If there are really European companies opting out of the American market entirely because of patents, I think that's fantastic, but it doesn't seem very plausible with software, where you want a big market more than anything.

    rs And do you really believe that this mailing list is only
    rs devoted to (US) Americans just because the products originated
    rs in the US, and the vernacular is English?

your rage against hegemony or imperialism or empire or whatever you want to whine about this week is misplaced here: if you have a problem with American attitude or with the political landscape of the world, fine, that's smart, me too, whatever, but it's got zero to do with the complication patents add to an Oracle-free ZFS.  Yeah it's really American companies doing almost all this work (sorry, proud Europe!), but anyway being European doesn't mean you can ignore American patents, because even the (unlikely?) best case of suddenly losing the entire American market while suffering no loss from a judgement is still bad enough to kill a company.

What's on-topic is:

 * when do the CDDL patent protections apply?  to deals between Oracle and Netapp?  or is it only protection against Oracle patents?  I think the latter, but then, which Oracle patents?  Suppose:

   + Oracle patents something needed for ZFS crypto

   + Oracle publishes the promised yet-to-be-delivered zfs-crypto paper that's thorough enough to write a compatible implementation

   + Oracle makes no further ZFS source releases, ever

   + Nexenta reimplements zfs-crypto and releases it CDDL with the rest of ZFS

   + Oracle sues Nexenta.  Oracle uses ``discovery'' to get an exhaustive Nexenta customer list.  Oracle sues users of Nexenta.  Oracle monetizes ``Nexenta indemnification pack'' patent licenses and blackmails Nexenta's customers.

   CDDL was meant to create a space that appeared to be safe from the last point.  But CDDL patent stuff is no help here, I think?  so, in effect, patents reduce the software freedoms given by CDDL because, once you fork whatever partial source Oracle deems fit to distribute, you suffer increasing risk of stepping onto an (Oracle-placed!) patent landmine.

 * AIUI Oracle has distributed grub with zfs patches, and grub is GPLv3.  Is this true?  If so, GPLv3 includes stuff to extend patent deals, which was added because GPLv3 was written under the ominous spectre of the Microsoft-Novell Linux indemnification deal.  Does GPLv3 grub extend any of the Netapp deal to those patented algorithms which are used within grub?  The GPLv3 is supposed to do some of this, but I don't know how much.  Is it extended only to grub users for use in grub, or can the patented stuff in grub be used anywhere by anyone who can get a copy of grub: download GPLv3 grub, then use CDDL ZFS in a Linux kmod with Oracle-provided immunity from any Netapp suit related to a ZFS patent used also in grub?  This sounds totally unrealistic to me, so I would guess the GPLv3 protection would be much less, but then what is it?  And anyway, though GPLv3 is meant to mandatorily extend private patent deals, how can any patent protection from the Netapp deal be extended when the deal is secret?  Don't you need some basis to force disclosure of the deal, and some way to define ``all relevant deals''?  If Oracle is defending
Re: [zfs-discuss] ZFS ... open source moving forward?
et == Erik Trimble erik.trim...@oracle.com writes:

    et In that case, can I be the first to say PANIC!  RUN FOR THE
    et HILLS!

Erik I thought most people already understood pushing to the public hg gate had stopped at b147, hence Illumos and OpenIndiana.  it's not that you're wrong, just that you should be in the hills by now if you started out running.

the S11 Express release without source and with its new, more-onerous license than SXCE is new dismal news, and the problems on other projects and the waves of smart people leaving might be even more dismal for opensolaris since in the past there was a lot of integration and a lot of forward progress, but what you were specifically asking about dates in hg was already included in the old bad news AFAIK.  And anyway there was never complete source code, nor source for all new work (drivers), nor source for the stable branch, which has always been a serious problem.

The good news to my view is that Linux may actually be only about one year behind (and sometimes ahead) on the non-ZFS features in Solaris.  FreeBSD is missing basically all of this, ex. jails are really not as thorough as VServer or LXC, but Linux is basically there already:

 * Xen support is better.  Oracle is sinking Solaris Xen support in favour of some old Oracle Xen kit based on Linux, I think?  which is disruptive and annoying for me, because I originally used OpenSolaris Xen to get some isolation from the churn of Linux Xen.  but it means there's a fully-free-software path that's not even less annoying a transition than what Oracle's offering through partially-free uncertain-future tools.

 * Infiniband support in Linux was always good.  They don't have a single COMSTAR system which is too bad, but they have SCST for SRP (non-IP RDMA SCSI, the COMSTAR one that people say works with VMWare), and stgt for iSER (the one that works with the Solaris initiator).

 * instead of Crossbow they have RPS and RFS, which give some performance boost with ordinary network cards, not just with 10gig ones with flow caches.  My understanding's hazy but I think, with an ordinary card, you still have to take an IPI, but it will touch hardly any of the packet on the wrong CPU, so you can still take advantage of per-core caches hot with TCP-flow-specific structures.  I'm not a serious enough developer to know whether RPS+RFS is more or less thorough than the Crossbow-branded stuff, but it was committed to mainline at about the same time as Crossbow.  (a sketch of turning it on is below, after this list.)

 * Dreamhost is already selling Linux zones based on VServer and has been for many years, so there *is* a zones alternative on Linux, and better yet, unlike the incompletely-delivered and eventually removed lx brand, on Linux you get Linux zones with Linux packages and nginx working with epoll and sendfile (on solaris, for me eventport works but sendfile does not).  There's supposedly a total rewrite of VServer in the works called LXC, so maybe that will be the truly good one.  It may take them longer to get sysadmin tools that match zonecfg/zoneadm, but the path is set.

 * LTTng is an attempt at something dtrace-like.  It's still experimental, but has the same idea of large libraries of probes, programs cannot tell if they're being traced or not, and relatively sophisticated bundled analysis tools.  http://multivax.blogspot.com/2010/11/introduction-to-linux-tracing-toolkit.html -- LTTng linux dtrace competitor

The only thing missing is ZFS.  To me it looks like a good replacement for that is years away.
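(the promised RPS/RFS sketch: on a mainline kernel new enough to have it, roughly 2.6.35+, it's just sysfs/procfs knobs.  interface name and CPU mask here are made-up examples, so adjust for your own box:

  # spread receive processing for eth0 rx queue 0 across CPUs 0-3 (mask 0xf)
  echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
  # global RFS flow table, plus this queue's share of it
  echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

I haven't benchmarked these particular numbers; they're the shape of the thing, not tuning advice.)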
I'm not excited about ocfs, or about kernel module ZFS ports taking advantage of the Linus kmod ``interpretation'' and the grub GPLv3 patent protection. Instead I'm hoping they skip this stage and style of storage and go straight to something Lustre-like that supports snapshots. I've got my eye on ceph, and on Lustre itself of course because of the IB support. ex perhaps in the end you will have 64 - 256MB of atftpd-provided initramfs which never goes away where init and sshd and libc and all the complicated filesystem-related userspace lives, so there is no more problems of running /usr/sbin/zpool off of a ZFS---you will always be able to administrate your system even if every ``disk'' is hung (or if cluster access is disrupted). and there will not be a complexity difference between a laptop with local disks and cluster storage---everything will be the full-on complicated version. I feel ZFS doesn't scale small enough for phones, nor big enough for what people are already doing in data centers, so why not give up on small completely and waste even more RAM and complexity in the laptop case? and one of the most interesting appnotes to me about ZFS is this one relling posted long ago: http://docs.sun.com/app/docs/doc/820-7821/girgb?a=view which is an extremely limited analog of what ceph and Lustre do, where compute and storage nodes do not necessarily need
Re: [zfs-discuss] ashift and vdevs
dm == David Magda dma...@ee.ryerson.ca writes:

    dm The other thing is that with the growth of SSDs, if more OS
    dm vendors support dynamic sectors, SSD makers can have
    dm different values for the sector size

okay, but if the size of whatever you're talking about is a multiple of 512, we don't actually need (or, probably, want!) any SCSI sector size monkeying around.  Just establish a minimum write size in the filesystem, and always write multiple aligned 512-sectors at once instead.  the 520-byte sectors you mentioned can't be accommodated this way, but for 4kByte it seems fine.

    dm to allow for performance changes as the technology evolves.
    dm Currently everything is hard-coded,

XFS is hardcoded.  NTFS has settable block size.  ZFS has ashift (almost).  ZFS slog is apparently hardcoded though.  so, two of those four are not hardcoded, and the two hardcoded ones are hardcoded to 4kByte.

    dm Until you're in a virtualized environment.  I believe that in
    dm the combination of NetApp and VMware, a 64K alignment is best
    dm practice last I heard.  Similarly with the various stripe widths
    dm available on traditional RAID arrays, it could be advantageous
    dm for the OS/FS to know it.

There is another setting in XFS for RAID stripe size, but I don't know what it does.  It's separate from the (unsettable) XFS block size setting.  so...this 64kByte thing might not be the same thing as what we're talking about so far...though in terms of aligning partitions it's the same, I guess.
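fwiw, the XFS stripe setting I mean is the su/sw (stripe unit / stripe width) geometry you can hand to mkfs.  A plausible example for an array with a 64 kB chunk across 8 data disks would be something like the below---device name made up, and I haven't measured how much it actually buys you:

  # tell XFS the underlying RAID geometry: 64 kB stripe unit, 8 data disks wide
  mkfs.xfs -d su=64k,sw=8 /dev/sdb1
  # an existing filesystem reports its idea of the geometry here
  xfs_info /mnt/data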
Re: [zfs-discuss] ashift and vdevs
kd == Krunal Desai mov...@gmail.com writes:

    kd http://support.microsoft.com/kb/whatever

dude.  seriously?  This is worse than a waste of time.  Don't read a URL that starts this way.

    kd Windows 7 (even with SP1) has no support for 4K-sector
    kd drives.

NTFS has 4KByte allocation units, so all you have to do is make sure the NTFS partition starts at an LBA that's a multiple of 8, and you have full performance.  Probably NTFS is the reason WD has chosen 4kByte.  Linux XFS is also locked at 4kByte sector size, because that's the VM page size and XFS cannot use any other block size than the page size.  so, 4kByte is good (except for ZFS).

    kd can you explicate further about these drives and their
    kd emulation (or lack thereof), I'd appreciate it!

further explication: all drives will have the emulation, or else you wouldn't be able to boot from them.  The world of peecees isn't as clean as you imagine.

    kd which 4K sector drives offer a jumper or other method to
    kd completely disable any form of emulation and appear to the
    kd host OS as a 4K-sector drive?

None that I know of.  It's probably simpler and less silly to leave the emulation in place forever than to start adding jumpers and modes and more secret commands.

It doesn't matter what sector size the drive presents to the host OS, because you can get the same performance character by always writing an aligned set of 8 sectors at once, which is what people are trying to force ZFS to do by adding 3 to ashift.  Whether the number is reported by some messy new invented SCSI command, input by the operator, or derived by a mini-benchmark added to format/fmthard/zpool/whatever-applies-the-label, this is done once for the life of the disk, and after that happens, whenever the OS needs this number it's gotten by issuing READ on the label.  Day-to-day, the drive doesn't need to report it.

Therefore, it is ``ability to accommodate a minimum-aligned-write-size'' which people badly want added to their operating systems, and no one sane really cares about automatic electronic reporting of true sector size.

Unfortunately (but predictably) it sounds like if you 'zpool replace' a 512-byte drive with a 4096-byte drive you are screwed.  therefore even people with 512-byte drives might want to set their ashift for 4096-byte drives right now.  This is another reason it's a waste of time to worry about reporting/querying a drive's ``true'' sector size: for a pool of redundant disks, the needed planning's more complicated than query-report-obey.

Also did anyone ever clarify whether the slog has an ashift?  or is it forced-512?  or derived from whatever vdev will eventually contain the separately-logged data?  I would expect generalized immediate Caring about that since no slogs except ACARD and DDRDrive will have 512-byte sectors.
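If anyone wants to experiment with forcing the larger ashift today, a sketch of the two routes people usually mention follows.  Neither is stock snv_134-era behaviour: the -o ashift property only exists on ZFS ports/builds that have grown it, the gnop trick is FreeBSD-specific, and the device names are placeholders, so treat this as illustration rather than a recipe:

  # route 1: where zpool has an ashift property (e.g. zfsonlinux, some newer illumos builds)
  zpool create -o ashift=12 tank mirror c3t43d0 c3t48d0

  # route 2: FreeBSD -- wrap one disk in a gnop device that advertises 4k sectors,
  # create the pool through it, then drop the gnop layer
  gnop create -S 4096 /dev/ada0
  zpool create tank mirror /dev/ada0.nop /dev/ada1
  zpool export tank
  gnop destroy /dev/ada0.nop
  zpool import tank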
Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf
t == taemun tae...@gmail.com writes: t I would note that the Seagate 2TB LP has a 0.32% Annualised t Failure Rate. bullshit. pgpsMvTxl5Ghd.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
zu == zfs user zf...@itsbeen.sent.com writes: djm == Darren J Moffat darr...@opensolaris.org writes: zu Ugh, we all know that the first rule of crytpo is that any zu proprietary, closed source, black-box crypto is crap, blah, zu blah, blah (I am not sure what the point of repeating that zu tired line is) and I am not one to give Oracle an inch but zu wtf? They just released this crap, give them a minute My educated guess would be that the other encrypted systems released papers about the algorithm either concurrently with the implementation, or sometimes BEFORE the implementation, but not after. It's just silly to think geli or dmcrypt would expect anyone to use them without explaining the algorithm and exposing it to review. Also, Darren has been working on this for THREE YEARS, and he committed it just weeks after the ``opensolaris now closed'' announcement and hg pushing stopped. so, any time in the last three years would have been a better and more reasonable time to release a paper than tomorrow, after the binary proprietary release of the implementation has happened. This would eliminate the need for my objection as well as give the crypto community time to advise Darren's design, which is something I'm surprised he didn't want as much of as possible, but so be it: he's the one doing the work, and good for him, and since based on hints he's dropped I suspect the work is quite good, I'm more interested in reviewing the work that's there than whinging about preciesly how it was done or how long it took or when I can get it. For all that, I'll gladly wait. I just think firstly that the design needs review before trust, and secondly that it's starkly enough against best practice to be borderline irresponsible to release the work at all without subjecting the design to peer review. zu anything we have seen so far from Oracle shows us is that they zu are slow to move with external communication about Solaris. yeah, well. what happened after you ``waited'' last time? When people like me were saying ``not all of opensolaris is free software. In fact the free component is shockingly small, albeit an important component,'' and ``the full development cycle from hg to livecd needs to be freed, like it is on *BSD (build.sh) and RHEL (CentOS), so that the project can be forked if, god forbid, it needs to be---forking is bad, but forkability is a key component of freedom,'' and ``it is a problem that the toolchain is proprietary'', people like you said ``just give them time.'' I think we actually did quietly get a few big chunks liberated just by waiting, but still, in the end, you gave them too much time: openindiana and illumos are now struggling to solve parts of these problems without certainty of success, are rushed because Nexenta's business depends on them, and people who have invested in the platform thinking its freedom gave it a stable future are now sitting on many terabytes of locked-in data and many man-hours of doomed scriptage. While the disaster is certainly not complete and some gradual-transition outcomes remain possible, your ``give them time'' advice is basically dead wrong, according to history. How can you say that now? I don't get it. Finally, there's a problem with the style of argument. Not everything on a mailing list is ``$ENTITY sucks/rules.'' I'm allowed to say something critical without implicitly saying ``everything Oracle does and everything they touch is wrong and evil and should be burned with torches.'' I don't really care about Oracle at all. 
What I said was much more specific, and there's no cause to wait before saying ``I will not take zfs crypto seriously so long as it's a black box.'' The right time to say that is NOW. so, no, I disagree: do not give them time. Wait for the paper, or more likely for the actual source, before using ZFS crypto. That is what you should do with your Time. djm It is a work in progress. Fine, and good. I thought it might be. In the unlikely event there was any impediment to your writing, and releasing, the paper, hopefully my complaining will be one among many things that helps remove it. Really, it is just mandatory. pgpogmN8mbJjZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Ideas for ghetto file server data reliability?
sl == Sigbjorn Lie sigbj...@nixtra.com writes: sl Do you need registered ECC, or will non-reg ECC do registered means the same thing as buffered. It has nothing to do with registering to some kind of authority---it's a register like the accumulators inside CPU's. The register allows more sticks per channel at the questionably-relevant cost of ``latency.'' Lately, more than two sticks per channel seems to require registers. Your choice of motherboard (and the memory controller implied by that choice) decides whether the memory must be registered or must be unregistered, and I don't know of any motherboards that will take both kinds (though I bet there are some out there, somewhere in history). There are other weird kinds of memory connection besides just registered and unregistered, but everything has higher latency than ``unregistered''. None of this has anything to do with ECC, though it may sometimes seem to since both registers and ECC cost money so tightly cost-constrained systems might tend to have neither, and quantities go down and profit margins get immediately jacked up once you ask for either of the two. hth. :/ pgpwc9fQAUyLZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
djm == Darren J Moffat darr...@opensolaris.org writes: djm http://blogs.sun.com/darren/entry/introducing_zfs_crypto_in_oracle djm http://blogs.sun.com/darren/entry/assued_delete_with_zfs_dataset djm http://blogs.sun.com/darren/entry/compress_encrypt_checksum_deduplicate_with Is there a URL describing the on-disk format and implementation details? djm Encryption at the application layer solves a different set of djm problems to encryption at the storage layer. black-box crypto is snake oil at any level, IMNSHO. Congrats again on finishing your project, but every other disk encryption framework I've seen taken remotely seriously has a detailed paper describing the algorithm, not just a list of features and a configuration guide. It should be a requirement for anything treated as more than a toy. I might have missed yours, or maybe it's coming soon. pgphDwX1ujOx9.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
tc == Tim Cook t...@cook.ms writes:

    tc Channeling Ethernet will not make it any faster.  Each
    tc individual connection will be limited to 1gbit.  iSCSI with
    tc mpxio may work, nfs will not.

well...probably you will run into this problem, but it's not necessarily totally unsolved.  I am just regurgitating this list again, but:

need to include the L4 port number in the hash:

  http://www.cisco.com/en/US/products/ps9336/products_tech_note09186a0080a963a9.shtml#eclb
  port-channel load-balance mixed     -- for L2 etherchannels
  mls ip cef load-sharing full        -- for L3 routing (OSPF ECMP)

nexus makes all this more complicated.  there are a few ways that seem like they'd be able to accomplish ECMP:

  FTag flow markers in ``FabricPath'' L2 forwarding
  LISP
  MPLS

the basic scheme is that the L4 hash is performed only by the edge router and used to calculate a label.  The routing protocol will either do per-hop ECMP (FabricPath / IS-IS) or possibly some kind of per-entire-path ECMP for LISP and MPLS.  unfortunately I don't understand these tools well enough to lead you further, but if you're not using infiniband and want to do 10-way ECMP this is probably where you need to look.

  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942
  feature added in snv_117: NFS client connections can be spread over
  multiple TCP connections when rpcmod:clnt_max_conns is set to a
  value > 1.  However, ``Even though the server is free to return data
  on different connections, [it does not seem to choose to actually do
  so]'' -- 6696163, fixed snv_117

  nfs:nfs3_max_threads=32 in /etc/system, which changes the default 8
  async threads per mount to 32.  This is especially helpful for NFS
  over 10Gb and sun4v

this stuff gets your NFS traffic onto multiple TCP circuits, which is the same thing iSCSI multipath would accomplish.  From there, you still need to do the cisco/??? stuff above to get TCP circuits spread across physical paths.

  http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html
  -- suspect.  it advises ``just buy 10gig'' but many other places say
  10G NIC's don't perform well in real multi-core machines unless you
  have at least as many TCP streams as cores, which is honestly kind
  of obvious.  lego-netadmin bias.
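to make the Solaris-side half of that concrete, the two tunables end up in /etc/system roughly like this.  the clnt_max_conns value is just an example, not something I've benchmarked, and you need a reboot (or at least a remount) for them to take:

  * /etc/system -- spread NFS client RPC over several TCP connections
  set rpcmod:clnt_max_conns = 8
  * more async threads per NFS mount (default is 8)
  set nfs:nfs3_max_threads = 32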
Re: [zfs-discuss] Does a zvol use the zil?
re == Richard Elling richard.ell...@gmail.com writes: re it seems the hypervisors try to do crazy things like make the re disks readonly, haha! re which is perhaps the worst thing you can do to a guest OS re because now it needs to be rebooted I might've set it up to ``pause'' the VM for most failures, and for punts like this read-only case, maybe leave it paused until someone comes along to turn it off or unpause it. But for loss of connection to an iSCSI-backed disk, I think that's wrong. I guess the truly correct failure handling would be to immediately poweroff the guest VM: pausing it tempts the sysadmin to fix the iscsi connection and unpause it, which in this case is the only real disaster-begging thing to do. One would get a lot of complaints from sysadmins who don't understand the iscsi write hole, but I think it's right. so...in that context, maybe read-only-until-reboot is actually not so dumb! For guests unknowingly getting their disks via NFS, it would make sense to pause the VM to stop (some of) its interval timer(s), (and hope you get the timer running the ATA/SCSI/... driver among the stopped ones) because the guest's disk driver won't understand NFS hard mount timeout rules---won't understand that, for certain errors, you can pass ``stale file handle'' up the stack, but for other errors you must wait forever. Instead they'll enforce a 30-second timeout like for an ATA disk. I think you could probably still avoid losing the 'write B' if the guest fired its ATA timeout with an NFS-backed disk because the writes have already been handed off to the host. It might be weird user experience in the VM manager because whatever process is doing the NFS writes will be unkillable 'D' state even if you poweroff the VM, but this weirdness is an expression of arcane reality, not a bug. It'd be better sysadmin experience to avoid the guest ATA timeout, though: pause the VM and resume so that NFS server reboots would freeze guests for a while, not require rebooting them, just like they do for nonvirtual NFSv3 clients. You would have to figure out the maximum number of seconds the guests can go without disk access, and deviously pause them before their burried / proprietary disk timeouts can fire. pgpYvvjgSY5Gl.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Does a zvol use the zil?
re == Richard Elling richard.ell...@gmail.com writes:

    re The risk here is not really different than that faced by
    re normal disk drives which have nonvolatile buffers (eg
    re virtually all HDDs and some SSDs).  This is why applications
    re can send cache flush commands when they need to ensure the
    re data is on the media.

It's probably different because of the iSCSI target reboot problem I've written about before:

   iSCSI initiator        iSCSI target         nonvolatile medium
   write A        ->
                  <-      ack A
   write B        ->
                  <-      ack B
                                        ->     [A]
                          [REBOOT]
   write C        ->      [timeout!]
   reconnect      ->
                  <-      ack Connected
   write C        ->
                  <-      ack C
   flush          ->
                                        ->     [C]
                  <-      ack Flush

in the above time chart, the initiator thinks A, B, and C are written, but in fact only A and C are written.  I regard this as a failing of imagination in the SCSI protocol, but probably with better understanding of the details than I have, the initiator could be made to provably work around the problem.  My guess has always been that no current initiators actually do, though.

I think it could happen also with a directly-attached SATA disk if you remove power from the disk without rebooting the host, so as Richard said it is not really different, except that in the real world it's much more common for an iSCSI target to lose power without the initiator's also losing power than it is for a disk to lose power without its host adapter losing power.

The ancient practice of unix filesystem design always considers cord-yanking as something happening to the entire machine, and failing disks are not the filesystem's responsibility to work around, because how could it?  This assumption should have been changed and wasn't, when we entered the era of RAID and removable disks, where the connections to disks and disks themselves are both allowed to fail.

However, when NFS was designed, the assumption *WAS* changed, and indeed NFSv2 and earlier operated always with the write cache OFF to be safe from this, just as COMSTAR does in its (default?) abysmal-performance mode (so campuses bought prestoserve cards (equivalent to a DDRDrive except much less silly because they have onboard batteries), or auspex servers with included NVRAM, which are analogous outside the NFS world to netapp/hitachi/emc FC/iSCSI targets which always have big NVRAM's so they can leave the write cache off), and NFSv3 has a commit protocol that is smart enough to replay the 'write B', which makes the nonvolatile caches less necessary (so long as you're not closing files frequently, I guess?).

I think it would be smart to design more storage systems so NFS can replace the role of iSCSI, for disk access.  In Isilon or Lustre clusters this trick is common when a node can settle with unshared access to a subtree: create an image file on the NFS/Lustre back-end and fill it with an ext3 or XFS, and writes to that inner filesystem become much faster because this rube goldberg arrangement discards the close-to-open consistency guarantee.

We might use it in the ZFS world for actual physical disk access instead of iSCSI, ex., it should be possible to NFS-export a zvol and see a share with a single file in it named 'theTarget' or something, but this file would be without read-ahead.  Better yet, to accommodate VMWare limitations, would be to export a single fake /zvol share containing all NFS-shared zvol's, and as you export zvol's their files appear within this share.  Also it should be possible to mount vdev elements over NFS without deadlocks---I know that is difficult, but VMWare does it.
Perhaps it cannot be done through the existing NFS client, but obviously it can be done somehow, and it would both solve the iSCSI target reboot problem and also allow using more kinds of proprietary storage backend---the same reasons VMWare wants to give admins a choice apply to ZFS.  When NFS is used in this way the disk image file is never closed, so the NFS server will not need a slog to give good performance: the same job is accomplished by double-caching the uncommitted data on the client so it can be replayed if the time diagram above happens.
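A minimal sketch of the NFS-instead-of-iSCSI idea, for anyone who wants to play with the file-backed flavour rather than the fake-/zvol share I'm wishing for above.  dataset name, export options, hostnames, and mount flags are all placeholders:

  # on the ZFS box: a dataset full of disk images, exported over NFS
  zfs create -o sharenfs=rw=vmhost,root=vmhost tank/images
  mkfile -n 100g /tank/images/guest0.img

  # on the (Linux-ish) hypervisor: mount it and point the guest's disk at the file
  mount -o vers=3,proto=tcp,hard zfsbox:/tank/images /images
  # then configure the VM's disk to be /images/guest0.img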
Re: [zfs-discuss] Bursty writes - why?
en == Eff Norwood sm...@jsvp.com writes:

    en We also tried SSDs as the ZIL which worked ok until they got
    en full, then performance tanked.  As I have posted before, SSDs
    en as your ZIL - don't do it!

yeah, iirc the thread went back and forth between you and I for a few days, something like this:

  you: SSD's work fine at first, then slow down, see this anandtech article.  We got bit by this.

  me: That article is two years old.  Read this other article which is one year old and explains the problem is fixed if you buy a current gen2 intel or sandforce-based SSD.

  you: Well, absent test results from you, I think we will just have to continue believing that all SSD's gradually slow down like I said, though I would love to be proved wrong.

  me: You haven't provided any test results yourself, nor even said what drive you're using.  We've both just cited anandtech, and my citation's newer than yours.

  you: I welcome further tests that prove the DDRDrive is not the only suitable ZIL, but absent these tests we have to assume I'm right that it is.

silly!

slowdowns with age:

  http://www.pcper.com/article.php?aid=669
  http://www.anandtech.com/show/2738/15

slowdowns fixed:

  http://www.anandtech.com/show/2899/8
  ``With the X25-M G2 Intel managed to virtually eliminate the random-write performance penalty on a sequentially filled drive.  In other words, if you used an X25-M G2 as a normal desktop drive, 4KB random write performance wouldn't really degrade over time.  Even without TRIM.''
  http://www.anandtech.com/show/2738/25

note this is not advice to buy sandforce for slog, because I don't know if anyone's tested that it respects flush-cache commands, and I suspect it may drop them.

summary: There's probably been major, documented shifts in the industry between when you tested and now, but no one knows because you don't even tell what you tested or how---you just spread FUD and flog the DDRDrive and then say ``do research to prove me wrong or else my hazy statement stands.''  bad science.
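for anyone who'd rather run the experiment than argue about it: attaching and detaching a candidate SSD as a slog is cheap to try.  device name below is a placeholder, and log-device removal needs a pool version that supports it (roughly zpool version 19 and up), so check yours first:

  # add the SSD as a separate intent log, hammer it with a synchronous workload, watch latency
  zpool add tank log c5t0d0
  zpool iostat -v tank 5
  # take it back out if it disappoints
  zpool remove tank c5t0d0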
Re: [zfs-discuss] nfs issues
tb == Thomas Burgess wonsl...@gmail.com writes:

    tb I'm running b134 and have been for months now, without issue.
    tb Recently i enabled 2 services to get bonjour notifications
    tb working in osx
    tb     /network/dns/multicast:default
    tb     /system/avahi-bridge-dsd:default
    tb and i added a few .service files to /etc/avahi/services/
    tb ever since doing this, nfs keeps crashing

try changing the 'hosts' key in /etc/nsswitch.conf to:

-8<-
hosts: files mdns dns
-8<-
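if editing nsswitch.conf alone doesn't take, the belt-and-braces version I'd try is roughly the below---the first FMRI is the one you quoted, the other two are just to make nscd and the NFS server re-read things, and I haven't verified they're actually needed:

  # after putting "hosts: files mdns dns" in /etc/nsswitch.conf
  svcadm restart svc:/network/dns/multicast:default
  svcadm restart svc:/system/name-service-cache:default   # nscd, if it's running
  svcadm restart svc:/network/nfs/server:default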
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
nw == Nicolas Williams nicolas.willi...@oracle.com writes:

    nw The current system fails closed

wrong.

  $ touch t0
  $ chmod 444 t0
  $ chmod A0+user:$(id -nu):write_data:allow t0
  $ ls -l t0
  -r--r--r--+  1 carton   carton         0 Oct  6 20:22 t0

now go to an NFSv3 client:

  $ ls -l t0
  -r--r--r-- 1 carton 405 0 2010-10-06 16:26 t0
  $ echo lala > t0
  $

wide open.  NFSv3 and SMB sharing the same dataset is a use-case you claim to accommodate.  This case fails open once Windows users start adding 'allow' ACL's.  It's not a corner case; it's a design that fails open.

     ever had 777 it would send a SIGWTF to any AFS-unaware graybeards

    nw A signal?!  How would that work when the entity doing a chmod
    nw is on a remote NFS client?

please find SIGWTF under 'kill -l' and you might understand what I meant.

    nw You seem to be in denial.  You continue to ignore the
    nw constraint that Windows clients must be able to fully control
    nw permissions in spite of their inability to perceive and modify
    nw file modes.

You remain unshakably certain that this is true of my proposal in spite of the fact that you've said clearly that you don't understand my proposal.  That's bad science.  It may be my fault that you don't understand it: maybe I need to write something shorter but just as expressive to fit within mailing list attention spans, or maybe my examples are unclear.  However that doesn't mean that I'm in denial, nor make you right---that just makes me annoying.

-- 
READ CAREFULLY.  By reading this fortune, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (BOGUS AGREEMENTS) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges.  You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
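to make the failure easier to reproduce and inspect on your own pool, the Solaris-side commands for poking at this are roughly the below.  the dataset name is a placeholder, and whether any aclmode value rescues the NFSv3 view is exactly the argument above, not something these commands settle:

  # show the full ACL, not just the mode-bit summary
  ls -V t0
  # strip the trailing ACE added in the transcript above (index 0)
  chmod A0- t0
  # dataset-wide knobs that decide how chmod and ACLs interact
  zfs get aclmode,aclinherit tank/home
  zfs set aclmode=groupmask tank/home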
Re: [zfs-discuss] TLER and ZFS
ag == Andrew Gabriel andrew.gabr...@oracle.com writes: ag Having now read a number of forums about these, there's a ag strong feeling WD screwed up by not providing a switch to ag disable pseudo 512b access so you can use the 4k native. this reporting lie is no different from SSD's which have 2 - 8 kB sectors on the inside and benefit from alignment. I think probably everything will report 512 byte sectors forever. If a device had a 4224-byte sector, it would make sense to report that, but I don't see a big downside to reporting 512 when it's really 4096. NAND flash often does have sectors with odd sizes like 4224, and (some of) Linux's NAND-friendly filesystems (ubifs, yaffs, nilfs) use this OOB area for filesystem structures, which are intermixed with the ECC. but in that case it's not a SCSI interface to the odd-sized sector---it's an ``mtd'' interface that supports operations like ``erase page'', ``suspend erasing'', ``erase some more''. that said I am in the ``ignore WD for now'' camp. but this isn't why. Ignore them (among other, better reasons) because they have 4k sectors at all which don't yet work well until we can teach ZFS to never write smaller than 4kB. but failure to report 4k as SCSI 4kB sector is not a problem, to my view. You can just align your partitions. pgp6jwIDoUJ9i.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
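checking the alignment is quick, for what it's worth---something like the below on a VTOC-labeled disk (device name is an example, and this uses nawk; a start sector divisible by 8 means the slice begins on a 4 kByte boundary):

  # print each slice's starting sector and whether it is 4 kByte aligned
  prtvtoc /dev/rdsk/c3t43d0s2 | nawk '$1 ~ /^[0-9]+$/ {
      a = ($4 % 8 == 0) ? "aligned" : "NOT aligned";
      print "slice", $1, "start", $4, a }'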
Re: [zfs-discuss] TLER and ZFS
dd == David Dyer-Bennet d...@dd-b.net writes: dd Richard Elling said ZFS handles the 4k real 512byte fake dd drives okay now in default setups There are two steps to handling it well. one is to align the start of partitions to 4kB, and apparently on Solaris (thanks to all the cumbersome partitioning tools) that is done. On Linux you often have to really pay attention to make this happen, depending on the partitioning tool that happens to be built into your ``distro'' or whatever. The second step is to never write anything smaller than 4kB. ex., if you want to write 0.5kB, pad it with 3.5kB of zeroes to avoid the read-modify-write penalty. AIUI that is not done yet, and zfs does sometimes want to write 0.5kB. When it's writing 128kB of course there is no penalty. For this, I think XFS and NTFS are actually better and tend not to write the small blocks, but I could be wrong. pgpn3kSSlfThy.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
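For what the second step means in practice, here is an illustration only (not what ZFS does internally): dd's conv=sync zero-fills a short 512-byte payload out to one full 4 KiB block, so the drive never has to read-modify-write a partial physical sector. payload.bin is a hypothetical 512-byte file.

    dd if=payload.bin of=padded.bin bs=4096 count=1 conv=sync
    ls -l padded.bin    # 4096 bytes: 512 bytes of data plus 3584 bytes of zero padding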
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
nw == Nicolas Williams nicolas.willi...@oracle.com writes: nw *You* stated that your proposal wouldn't allow Windows users nw full control over file permissions. me: I have a proposal you: op! OP op, wait! DOES YOUR PROPOSAL blah blah WINDOWS blah blah COMPLETELY AND EXACTLY LIKE THE CURRENT ONE. me: no, but what it does is... you: well then I don't even have to read it. It's unacceptable because $BLEH. me: untrue. My proposal handles $BLEH just fine. you: you just said it didn't! me: well, it does. Please read it. you: I read it and I don't understand it. Anyway it doesn't handle $BLEH so it's no good. This is not really working, and concision is the problem. so, I now, today, state: My proposal allows Windows users full control over file permissions. nw Yes, that may be. I encourage you to find a clearer way to nw express your proposal. So far, it's just us talking. I think I'll wait and see if anyone besides you reads it. If so, maybe they can ask questions that help me clarify it. If no one does, it's probably not interesting here anyway. pgp4wuhrA1SzN.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS crypto bug status change
dm == David Magda dma...@ee.ryerson.ca writes: dm Thank you Mr. Moffat et al. Hopefully the rest of us will be dm able to bang on this at some point. :) Thanks for the heads-up on the gossip. This etiquette seems weird, though: I don't thank Microsoft for releasing a new version of Word. I'll postpone my thanks for 2 years until the source is released, though by then who knows if I'll still be using ZFS at all. Maybe more appropriate would be: congrats on finally finishing your seven-year project, Darren! must be a huge relief. I'm glad it wasn't my project, though. If I were in Darren's place I'd have signed on to work for an open-source company, spent seven years of my life working on something, delaying it and pushing hard to make it a generation beyond other filesystem crypto, and then when I'm finally done, yoink!. That's me, though. I shouldn't speculate on someone else's situation. Maybe he signed on under different circumstances, or delayed for different reasons than feature-ambition, or cares about different things than I do. I only mean to make an example of how politics, featuresets, and IT planning interact to make an ecosystem that's got more complicated implications than just a bulleted list of features and a license with an OSI logo. -- READ CAREFULLY. By reading this fortune, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (BOGUS AGREEMENTS) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. pgpxfnP4VSj9Z.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
nw == Nicolas Williams nicolas.willi...@oracle.com writes: nw Keep in mind that Windows lacks a mode_t. We need to interop nw with Windows. If a Windows user cannot completely change file nw perms because there's a mode_t completely out of their nw reach... they'll be frustrated. well...AIUI this already works very badly, so keep that in mind, too. In AFS this is handled by most files having 777, and we could do the same if we had an AND-based system. This is both less frustrating and more self-documenting than the current system. In an AND-based system, some unix users will be able to edit the windows permissions with 'chmod A...'. In shops using older unixes where users can only set mode bits, the rule becomes ``enforced permissions are the lesser of what Unix people and Windows people apply.'' This rule is easy to understand, not frustrating, and readily encourages ad-hoc cooperation (``can you please set everything-everyone on your subtree? we'll handle it in unix.'' / ``can you please set 777 on your subtree? or 770 group windows? we want to add windows silly-sid-permissions.''). This is a big step better than existing systems with subtrees where Unix and Windows users are forced to cooperate. It would certainly work much better than the current system, where you look at your permissions and don't have any idea whether you've got more, less, or exactly the same permission as what your software is telling you: the crappy autotranslation teaches users that all bets are off. It would be nice if, under my proposal, we could delete the unix tagspace entirely: chpacl '(unix)' chmod -R A- . but unfortunately, deletion of ACL's is special-cased by Solaris's chmod to ``rewrite ACL's that match the UNIX permissions bits,'' so it would probably have to stay special-cased in a tagspace system. pgpzWtQEMyslr.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
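To make the AND rule concrete, here is a hypothetical transcript (user and filename made up); the commands are today's chmod/ls syntax, but the evaluation in the comments is the *proposed* behaviour, not what any current build does:

    $ touch report.txt
    $ chmod 444 report.txt                              # unix tag-group: read-only
    $ chmod A0+user:alice:write_data:allow report.txt   # ACL tag-group: alice may write
    # today's first-match evaluation: the allow ACE wins, alice can write
    # proposed AND evaluation: (no write in unix) AND (write in ACL) = no write,
    # so alice gets EACCES until the unix side is opened up, e.g. chmod 777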
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
Can the user in (3) fix the permissions from Windows? no, not under my proposal. but it sounds like currently people cannot ``fix'' permissions through the quirky autotranslation anyway, certainly not to the point where neither unix nor windows users are confused: windows users are always confused, and unix users don't get to see all the permissions. Now what? set the unix perms to 777 as a sign to the unix people to either (a) leave it alone, or (b) learn to use 'chmod A...'. This will actually work: it's not a hand-waving hypothetical that just doesn't play out. What I provide, which we don't have now, is a way to make:
/tub/dataset/a subtree -rwxrwxrwx in old unix [working, changeable permissions] in windows
/tub/dataset/b subtree -rw-r--r-- in old unix [everything: everyone] in windows, but unix permissions still enforced
this means: * unix writers and windows writers can cooperate even within a single dataset * an intuitive warning sign when non-native permissions are in effect, * fewer leaked-data surprises If you accept that the autotranslation between the two permissions regimes is total shit, which it is, then what I offer is the best you can hope for. My proposal also generalizes to other permissions autoconversion problems: * Future ACL translation stupidity that will happen as more bizarre ACL mechanisms are invented, or underspecified parts of current spec make different choices in different OS's. - POSIX -> NFSv4, Darwin -> NFSv4 If Apple provides a Darwin -> NFSv4 translation that's silly, a match for Darwin NFS client IP's in the share string could put these clients into a tagged ACL group. - AFP -> NFSv4 ACL's can be tagged by protocol for new weird protocols. If [new protocol]'s ACLs are a subset of NFSv4 ACL's, then they can be implemented by the bridge and apply to users who don't go through the bridge. The [new protocol] bridge will have an ACLspace all to itself, within which it can be certain nothing but itself will change ACL's, so it can rely on never having to read NFSv4 ACL's that do not match the subset it would feel inclined to write. Unix users will get an everything:everyone or 777 warning that someone else is managing the ACLspace. Yet, Unix users can descend into its private subtrees and muck around with ACL's, and the Unix changes will still get enforced. It's easy to search for all the changes made by Unix, vs all the changes made by [new protocol] bridge, and see if some are important. It's easy to delete all of them at once if someone shouldn't have been mucking around from unix, or if the [new protocol] bridge was unleashed on a dataset that wasn't dedicated to it and made a mess. This is a case where the [new protocol] bridge is using the ACL's for two related but slightly-orthogonal purposes: to enforce security, and to store metadata. My proposal separates the two. - SMB -> NFSv4, NFSv4 -> NFSv4 I get that the NFSv4 ACL's are supposed to match Windows perfectly, but if that turns out to be untrue, Linux and Windows clients could be put in separate ACL groups even though they're both, in theory, using NFSv4 ACL's. * zones running large software packages that have bizarre or misguided ACL behavior ACL's are complicated enough that a lot of programmers will get them wrong. 
If you have a large, assertion-riddled app that will shit itself if it doesn't see the ACL's it expects, or autoset or autoremove ACL's, or does other stupid things with ACL's, you can put it into a zone and configure an ACL tag on the zone, segregating its ACL-writing from the rest of the system. Yet, its restrictions are still respected. If the app were setting ACL's that don't give enough permission, it wouldn't work. but it may have hardcoded crap that stupidly opens up ACL's, or refuses to work if ACL's aren't as open as it thinks they should be. Now you can fake it out whenever it calls getacl, but set other ACL's kept secret from it and still return permission denied when you like. * (optional) a backup mechanism. If you make the choice ``global zone ignores ACLgroups with 'zoned' bit set'', then you can run backups in the global zone that won't be stopped by ACL's set by the inner zones, however you can still limit your backup process's access by adding zoned=0 ACL's. chpacl '(unix)' chmod -R A- . nw Huh? I think you are confused because you didn't read my proposal because it was too long, or the examples I wrote weren't easy to understand. however if I try to repeat it in small pieces, I think it'll just be even longer and harder to understand than the original. What's more, if you don't agree that the
[zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)
rb == Ralph Böhme ra...@rsrc.de writes: rb The Darwin kernel evaluates permissions in a first rb match paradigm, evaluating the ACL before the mode well...I think it would be better to AND them together like AFS did. In that case it doesn't make any difference in which order you do it because AND is commutative. The Darwin method you describe means one might remove permissions with chmod but still have access granted under first-match by the ACL. I just tested, and Darwin does indeed work this way. :( One way to get from NFSv4 to what I want is that you might add EVEN MORE complexity and have ``tagged ACL groups'': * all the existing ACL tools and NFS/SMB clients targeting the #(null) tag, * traditional 'chmod' unix permissions targeting the #(unix) tag. * The evaluation within a tag-group is first-match like now, * The result of each tag-group is ANDed together for the final evaluation When accomodating Darwin ACL's or Windows ACL's or Linux NFSv4 ACL's or translated POSIX ACL's, the result of the imperfect translation can be shoved into a tag-group if it's unclean. The way I would implement the userspace, tools would display all tag groups if given some new argument, but they would always be incapable of editing any tag group except #(null). Another chroot-like tool would swap a given tag-group for #(null) for all child processes: car...@awabagal:~/bar$ ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~/bar$ chmod A+owner@:write_data:deny foo car...@awabagal:~/bar$ ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(null):owner@:write_data:deny # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~/bar$ echo lala foo -bash: foo: Permission denied car...@awabagal:~/bar$ chpacl baz ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo # 0#root:owner@:write_data:deny -- #root is what's mapped to #(null) at boot # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~/bar$ chpacl '(null)' true chpacl: '(null)' is reserved. 
car...@awabagal:~/bar$ chpacl baz chmod A+owner@:read_data:deny foo car...@awabagal:~/bar$ chpacl baz ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(null):owner@:read_data:deny # 0#root:owner@:write_data:deny # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow car...@awabagal:~bar$ cat foo -bash: foo: Permission denied car...@awabagal:~bar$ chpacl baz cat foo -- current tagspace is irrelevant to ACL evaluation -bash: foo: Permission denied car...@awabagal:~/bar$ ls -v\# foo -rw-r--r-- 1 carton carton 0 Sep 29 18:31 foo 0#(null):owner@:write_data:deny # 0#baz:owner@:read_data:deny # 0#(unix):owner@:execute:deny 1#(unix):owner@:read_data/write_data/append_data/write_xattr/write_attributes /write_acl/write_owner:allow 2#(unix):group@:write_data/append_data/execute:deny 3#(unix):group@:read_data:allow 4#(unix):everyone@:write_data/append_data/write_xattr/execute/write_attributes /write_acl/write_owner:deny 5#(unix):everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize :allow
Re: [zfs-discuss] drive speeds etc
sb == Simon Breden sbre...@gmail.com writes: sb WD itself does not recommend them for 'business critical' RAID sb use The described problems with WD aren't okay for non-critical development/backup/home use either. The statement from WD is nothing but an attempt to upsell you, to differentiate the market so they can tap into the demand curve at multiple points, and to overload you with information so the question becomes ``which WD drive should I buy'' instead of ``which manufacturer's drive should I buy.'' Don't let this stuff get a foothold inside your brain. ``mixing'' drives within a stripe is a good idea because it protects you from bad batches and bad models/firmwares, which are not rare in recent experience! I always mix drives and included WD in that mix up until this latest rash of problems. ``mixing'' is only bad (for WD) because it makes it easier for you, the customer, to characterize the green performance deficit and notice the firmware bugs that are unique to the WD drives. pgpg2mRMPLVGG.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver = defrag?
dd == David Dyer-Bennet d...@dd-b.net writes: dd Sure, if only a single thread is ever writing to the disk dd store at a time. video warehousing is a reasonable use case that will have small numbers of sequential readers and writers to large files. virtual tape library is another obviously similar one. basically, things which used to be stored on tape. which are not uncommon. AIUI ZFS does not have a fragmentation problem for these cases unless you fill past 96%, though I've been trying to keep my pool below 80% because general FUD. dd This situation doesn't exist with any kind of enterprise disk dd appliance, though; there are always multiple users doing dd stuff. the point's relevant, but I'm starting to tune out every time I hear the word ``enterprise.'' seems it often decodes to: (1) ``fat sacks and no clue,'' or (2) ``i can't hear you i can't hear you i have one big hammer in my toolchest and one quick answer to all questions, and everything's perfect! perfect, I say. unless you're offering an even bigger hammer I can swap for this one, I don't want to hear it,'' or (3) ``However of course I agree that hammers come in different colors, and a wise and experienced craftsman will always choose the color of his hammer based on the color of the nail he's hitting, because the interface between hammers and nails doesn't work well otherwise. We all know here how to match hammer and nail colors, but I don't want to discuss that at all because it's a private decision to make between you and your salesdroid. ``However, in this forum here we talk about GREEN NAILS ONLY. If you are hitting green nails with red hammers and finding they go into the wood anyway then you are being very unprofessional because that nail might have been a bank transaction. --posted from opensolaris.org'' pgpqzPhCxoUuU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] performance leakage when copy huge data
ml == Mark Little marklit...@koallo.com writes: ml Just to clarify - do you mean TLER should be off or on? It should be set to ``do not have asvc_t 11 seconds and 1 io/s''. ...which is not one of the settings of the TLER knob. This isn't a problem with the TLER *setting*. TLER does not even apply unless the drive has a latent sector error. TLER does not even apply unless the drive has a latent sector error. TLER does not even apply unless the drive has a latent sector error. GOT IT? so if the drive is not defective, but is erratically having huge latency when not busy, this isn't a TLER problem. It's a drive-is-unpredictable-piece-of-junk problem. Will the problem go away if you change the TLER setting to the opposite of whatever it is? Who knows?! It shouldn't based on the claimed purpose of TLER, but in reality, maybe, maybe not, because the drive shouldn't (``shouldn't'', haha) act like that to begin with. It will be more likely to go away if you replace the drive with a different model, though. ml Storage forum on hardforum.com, the experts there seem to ml recommend NOT having TLER enabled when using ZFS as ZFS can be ml configured for its timeouts, etc, I don't believe there are any configurable timeouts in ZFS. The ZFS developers take the position that timeouts are not our problem and push all that work down the stack to the controller driver and the disk driver, which cooperate (this is two drivers, now. plus a third ``SCSI mid-layer'' perhaps, for some controllers but not others.) to implement a variety of inconsistent, silly, undocumented cargo-cult flailing timeout regimes that we all have to put up with. However they are always quite long. The ATA max timeout is 30sec, and AIUI they are all much longer than that. My new favorite thing, though, is the reference counting. OS: ``This disk/iSCSIdisk is `busy' so you can't detach it''. me: ``bullshit. YOINK, detached, now deal with it.'' IMO this area is in need of some serious bar-raising. ml and the main reason to use TLER is when using those drives ml with hardware RAID cards which will kick a drive out of the ml array if it takes longer than 10 seconds. yup. which is something the drive will not do unless it encounters an ERROR. that is the E in TLER. In other words, the feature as described prevents you from noticing and invoking warranty replacement on your about-to-fail drive. For this you pay double. Have I got that right? In any case the obvious proper place to fix this is in the RAID-on-a-card firmware, not the disk firmware, if it does even need fixing which is unclear to me. unless the disk manufacturers are going to offer a feature ``do not spend more than 1 second out of every 2 seconds `trying harder' to read marginal data, just return errors'' which would actually have real value, the only reason TLER is proper is that it can convince all you gamers to pay twice as much for a drive because they've flipped a single bit in the firmware and then shovelled a big pile of bullshit into your heads. ml Can anyone else here comment if they have had experience with ml the WD drives and ZFS and if they have TLER enabled or ml disabled? I do not have any problems with drives dropping out of ZFS using the normal TLER setting. I do have problems with slowly-failing drives fucking up the whole system. ZFS doesn't deal with them gracefully, and I have to find the bad drive and remove it by hand. All this stuff about cold spares automatically replacing and users never noticing, is largely a fantasy. 
Neither observation leads me to want TLER. however observations like this ``why did my disks suddenly slow down?'' lead me to avoid WD drives period, for ZFS or not ZFS or anything at all. Whipping up all this marketing silliness around TLER also leads me to avoid them because I know they will shovel bullshit and FUD to justify jacked prices. pgpMng48rq0w8.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
dm == David Magda dma...@ee.ryerson.ca writes: dm http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/ http://www.groklaw.net/articlebasic.php?story=20050121014650517 says when the MPL was modified to become the CDDL, clauses were removed which would have required Oracle to disclose any patent licenses it might have negotiated with NetApp covering CDDL code. The disclosure would have to be added to hg, freeze or no: ``If Contributor obtains such knowledge after the Modification is made available as described in Section 3.2, Contributor shall promptly modify the LEGAL file in all copies Contributor makes available thereafter and shall take other steps (such as notifying appropriate mailing lists or newsgroups) reasonably calculated to inform those who received the Covered Code that new knowledge has been obtained.'' This is in MPL but removed from CDDL. The groklaw poster's concern is that this is a mechanism through which Oracle could manoever to make the CDDL worthless as a guarantee of zfs users' software freedom. CDDL does implicitly grant rights to Oracle's patents, but not to negotiations for shield from NetApp's. AIUI GPLv3 is different and does not have this problem, though I don't understand it well so I could be wrong. With MPL at least we would know about the negotiations: the settlement was ``secret'' which is exactly the disaster scenario the groklaw poster warned of. I'm sorry you cannot be uninterested in licenses and ``just want to get work done.'' To me it looks like the patent situation is mostly an obstacle to getting ZFS development funded. If you used ZFS secretly in some kind of cloud service, and never told anyone about it, you could be pretty certain of getting away with it without any patent claims throughout the entire decade or so that ZFS remains relevant, but if you want to participate in a horizontally-divided market like Coraid, or otherwise share source changes, you might get sued. This regime has to be a huge drag on the industry, and it makes things really unpredictable which has to discourage investment, and it strongly favours large companies. pgpLRI59okaob.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] VM's on ZFS - 7210
en == Eff Norwood sm...@jsvp.com writes: en http://www.anandtech.com/show/2738/8 but a few pages later: http://www.anandtech.com/show/2738/25 so, as you say, ``with all major SSDs in the role of a ZIL you will eventually not be happy.'' is true, but you seem to have accidentally left out the ``EXCEPT INTEL!'' Oops! Funnier still, the EXCEPT INTEL is right there in exactly the article YOU cited. however, that's not the end of it. Searching this very mailing list for 'anandtech' I found this cited about ten times: http://www.anandtech.com/show/2899/8 anandtech does not think TRIM / dirty drives are a problem any longer. You might want to redo whatever tests you did (or else read newer anandtech articles). I've made the same mistake of passing around anandtech links without keeping up with their latest posts, but the thing is, that link debunking your ideas was posted on this list *so* *many* *times* and over such a long interval! You can also use the anandtech articles as a point of reference for how you might write up your ``extensive testing'' of ``all major'' SSD's in a way that will ``assure'' people your conclusions are correct. (HINT: list the SSD's you tested. describe the testing method. Results would be nice, too, but the first two were missing from your post. They help a lot, and do not take much time to include, though leaving them out does help FUD spread further if you are trying to promote this ``DDRDrive'' with the silly external power brick.) en I can't think of an easy way to measure pages that have not en been consumed since it's really an SSD controller function en which is obfuscated from the OS, yeah, SSD's are largely just a different way of selling proprietary software, but I guess a lot of ``hardware'' is. pgpi59M7WwDpr.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] native ZFS on Linux
aa == Anurag Agarwal anu...@kqinfotech.com writes: aa Every one being part of beta program will have access to aa source code ...and the right to redistribute it if they like, which I think is also guaranteed by the license. Yes, I agree a somewhat formal beta program could be smart for this type of software, which can lose large amounts of data, and where reproducing problems isn't easy because debugging the way analagous to other software requires shipping around multi-terabyte possibly-confidential images, so you'd like competent testers so you can skip this without becoming too frustrated. But I don't see how anything fitting the definition of ``closed'' is possible with free software. Even just asking participants, ``please don't leak our software outside the beta, even though you've the legal right to do so. If you do leak it, we'll be unhappy,'' is an implicit threat to retaliate (ex. by excluding people from further beta releases, which you'll likely be making in a continuous stream). so the word ``closed'' alone, even without any further discussion, is likely to have a chilling effect on the software freedom of the beta participants, and I think this effect is absolutely intended by you, and that it's wrong. on one hand it's sort of a fine point, but on the other for the facts on the ground it can matter quite a lot. Thanks for the effort! and for clarifying that you will always release matching source along with every binary release you make! pgpN2VocVYwL0.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] native ZFS on Linux
aa == Anurag Agarwal anu...@kqinfotech.com writes: aa * Currently we are planning to do a closed beta aa * Source code will be made available with release. CDDL violation. aa * We will be providing paid support for our binary aa releases. great, so long as your ``binary releases'' always include source that matches the release exactly. pgpOBx1yJdmLD.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS issue with ZFS
pb( == Phillip Bruce (Mindsource) v-phb...@microsoft.com writes: pb( Problem solved.. Try using FQDN on the server end and that pb( work. The client did not have to use FQDN. 1. your syntax is wrong. You must use netgroup syntax to specify an IP, otherwise it will think you mean the hostname made up of those numbers and dots as characters.
NAME              PROPERTY  VALUE
andaman/arrchive  sharenfs  rw=@10.100.100.0/23:@192.168.2.3/32
2. there's a bug in mountd. well, there are many bugs in mountd, but this is the one I ran into, which makes the netgroup syntax mostly useless: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6901832 one workaround is to give every IP reverse lookup, ex. using BIND $GENERATE or something. I just use a big /etc/hosts covering every IP to which I've exported. I suppose actually fixing mountd would be what a good sysadmin would have done: it can't be that hard. pgp6GX6Mwe4Z0.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
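For reference, point 1 set correctly looks something like the sketch below (networks taken from the output above, the access mode assumed to be rw); the '@' prefix is what tells the NFS share code you mean an address or prefix rather than a hostname spelled with digits and dots:

    zfs set sharenfs='rw=@10.100.100.0/23:@192.168.2.3/32' andaman/arrchive
    zfs get -o name,property,value sharenfs andaman/arrchive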
Re: [zfs-discuss] Opensolaris is apparently dead
gd == Garrett D'Amore garr...@nexenta.com writes: Joerg is correct that CDDL code can legally live right alongside the GPLv2 kernel code and run in the same program. gd My understanding is that no, this is not possible. GPLv2 and CDDL are incompatible: http://www.fsf.org/licensing/education/licenses/index_html/#GPLIncompatibleLicenses however Linus's ``interpretation'' of the GPL considers that 'insmod' is ``mere aggregation'' and not ``linking'', but subject to rules of ``bad taste''. Although this may sound ridiculous, there are blob drivers for wireless chips, video cards, and storage controllers relying on this ``interpretation'' for over a decade. I think a ZFS porting project could do the same and end up emitting the same warning about a ``tained'' kernel that proprietary modules do: http://lwn.net/Articles/147070/ the quickest link I found of Linus actually speaking about his ``interpretation'', his thoughts are IMHO completely muddled (which might be intentional): http://lkml.org/lkml/2003/12/3/228 thus ultimately I think the question of whether it's legal or not isn't very interesting compared to ``is it moral?'' (what some of us might care about), and ``is it likely to survive long enough and not blow back in your face fiercely enough that it's a good enough business case to get funded somehow?'' (the question all the hardware manufacturers shipping blob drivers presumably asked themselves) My own view on blob modules is: * that it's immoral, and that Linus is both taking the wrong position and doing it without authority. Even if his position is ``everyone, please let's not fight,'' in practice that is a strong position favouring GPL violation, and his squirrelyness may look like taking a soft view but in practice it throws so much sand into the debate it ends up being actually a much stronger position than saying outright, ``I think insmod is mere aggregation.'' My copyright shouldn't have to bow to your celebrity. * and secondly that it does make business sense and is unlikely to cause any problems, because no one is able to challenge his authority. Whatever is the view on binary blob modules, I think it's the same view on ZFS w.r.t. the law, but not necessarily the same view w.r.t. morality or business, because the copyright law itself is immoral according to the views of many and the business risk depends on how much you piss people off. pgpor5KF8fYq9.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
pj == Peter Jeremy peter.jer...@alcatel-lucent.com writes: gd == Garrett D'Amore garr...@nexenta.com writes: cb == C Bergström codest...@osunix.org writes: fc == Frank Cusack frank+lists/z...@linetwo.net writes: tc == Tim Cook t...@cook.ms writes: pj Given that both provide similar features, it's difficult to pj see why Oracle would continue to invest in both. So far I think the tricky parts of filesystems have been the work of 1 - 3 people. It's difficult to see why the kind of developer who's capable of advancing those filesystems would continue to work in a negative environment like this one, but maybe they will. Such a developer can get money from several places, and I've never heard of something else this crew brings to the table than money. That's a bleak outlook on their ability to actually facilitate relevant ``investment,'' but who knows! gd Oracle *will* spend more on Solaris than Sun did. I believe gd that. hahaha, yup. At least I believe their saying they will try to do it. fc all public companies are very, very greedy. yeah, it's not helpful to anthropomorphize them, nor tell human interest 1930's newsreel-hero stories about their supposedly genius and/or evil leaders, nor imagine yourself into their point of view like they are your favorite soccer team. What's needed is clear focus on the rules of collaboration, and how these rules determine the future of your own greedy schemes. cb It was a community of system administrators and nearly no cb developers. sysadmins need to care about licenses because their investment cycle in a platform is, apparently, long compared to the stability of a publicly-traded company. tc *ONE* developer from Redhat does not change the fact that tc Oracle owns the rights to the majority of the code, one developer making the tinyest change to line breaks and then asserting his copyright does change everything, if it gets committed to trunk and used as the basis for further work that can't be rolled back. gd we are in the process of some enhancements to this gd code which will make it into Illumos, but probably not into gd Oracle Solaris unless they pull from Illumos. :-) yeah, well, add your copyright to it, and thus see that it doesn't make it into Solaris 11. Without hg, there's no longer any incentive to sign over your copyright to them in exchange for getting your changes committed, so not to keep it for yourself would be negligent and silly. Good or bad, it's just reality. FWIW, the SFLC usually suggests you get copyright assignments from every member to a single trusted organization so the license can be changed someday when a change might seem obviously wise. For example, Sun was careful to get assignments from all contributors, which at one time had good hypotheticals as well as the current bad reality: they could have released their tree under Linux-compatible GPL some day if convinced. ISTR some cheap talk about this right after most of Java was released as GPL. If Sun had included some Joerg Schilling-owned pieces in there, his one or two files would become a poison pill making license change impossible. However when there is no such trusted organization around, I think copyrights held by multiple orgs like Linux has are more sustainable. Nexenta clearly isn't a ``trusted organization,'' but having a source tree copyrighted by both Nexenta and Oracle could make the terms more stable than they'd be for a tree copyrighted by either alone. 
I don't think the Announcement means much for ZFS, though: it means releases will come only every year or two, which is about the maximum pace FreeBSD can keep up with so it will actually bring Solaris and FreeBSD closer in ZFS feature-parity not further apart. However, if you were using ZFS along with things like infiniband iSER/SRP/NFS-RDMA, zones, 10gig nics with cpu-affinity-optimized TCP, xen dom0, virtualbox, dtrace, or waiting/hoping for pNFS, or if you foolishly became addicted to proprietary SunPro and Sun's debugger, then you might be annoyed or even set back a few years by the Announcement since FreeBSD has none of these things. Post-Announcement, ZFS will no longer entice people to experiment with these features, but those who listened to the last half-decade of apologist's, ``let's wait patiently and quietly. More code will be liberated, even the C compiler. Just give them time,'' those suckers have now got problems. I've got a heap of IB cards trying to convince me to bury my head in the sand or keep ``hoping'' instead of reacting. I wish I'd invested my time into an OS I could continue using under consistent terms. pgps28C1MIhcQ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
dd 2 * Copyright (C) 2007 Oracle. All rights reserved. dd 3 * dd 4 * This program is free software; you can redistribute it and/or dd 5 * modify it under the terms of the GNU General Public dd 6 * License v2 as published by the Free Software Foundation. dd http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=blob;f=fs/btrfs/root-tree.c;h=2d958be761c84556b39c60afa3b0f3fd75d6;hb=HEAD http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=blob;f=fs/btrfs/free-space-cache.c;h=f488fac04d99ea45eea93607bbf17c021b5b2207;hb=HEAD 1 /* 2 * Copyright (C) 2008 Red Hat. All rights reserved. 3 * 4 * This program is free software; you can redistribute it and/or 5 * modify it under the terms of the GNU General Public 6 * License v2 as published by the Free Software Foundation. see, that's good, and is a realistic future scenario for ZFS, AFAICT: there can be a branch that's safe to collaborate on, which cannot go into Solaris 11 and cannot be taken proprietary by Nexenta, either. pgprH3DS8ogDw.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
mg == Mike Gerdts mger...@gmail.com writes: sw == Saxon, Will will.sa...@sage.com writes: sw I think there may be very good reason to use iSCSI, if you're sw limited to gigabit but need to be able to handle higher sw throughput for a single client. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942 look at it now before it gets pulled back inside the wall. :( I think this bug was posted on zfs-discuss earlier. Please see the comments because he is not using lagg's: even with a single 10Gbit/s NIC, you cannot use the link well unless you take advantage of the multiple MSI's and L4 preclass built into the NIC. You need multiple TCP circuits between client and server so that each will fire a different MSI. He got about 3x performance using 8 connections. It sounds like NFS is already fixed for this, but requires manual tuning of clnt_max_conns and the number of reader and writer threads. mg it is rather common to have multiple 1 Gb links to mg servers going to disparate switches so as to provide mg resilience in the face of switch failures. This is not unlike mg (at a block diagram level) the architecture that you see in mg pretty much every SAN. In such a configuation, it is mg reasonable for people to expect that load balancing will mg occur. nope. spanning tree removes all loops, which means between any two points there will be only one enabled path. An L2-switched network will look into L4 headers for splitting traffic across an aggregated link (as long as it's been deliberately configured to do that---by default probably only looks to L2), but it won't do any multipath within the mesh. Even with an L3 routing protocol it usually won't do multipath unless the costs of the paths match exactly, so you'd want to build the topology to achieve this and then do all switching at layer 3 by making sure no VLAN is larger than a switch. There's actually a cisco feature to make no VLAN larger than a *port*, which I use a little bit. It's meant for CATV networks I think, or DSL networks aggregated by IP instead of ATM like maybe some European ones? but the idea is not to put edge ports into vlans any more but instead say 'ip unnumbered loopbackN', and then some black magic they have built into their DHCP forwarder adds /32 routes by watching the DHCP replies. If you don't use DHCP you can add static /32 routes yourself, and it will work. It does not help with IPv6, and also you can only use it on vlan-tagged edge ports (what? arbitrary!) but neat that it's there at all. http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html The best thing IMHO would be to use this feature on the edge ports, just as I said, but you will have to teach the servers to VLAN-tag their packets. not such a bad idea, but weird. You could also use it one hop up from the edge switches, but I think it might have problems in general removing the routes when you unplug a server, and using it one hop up could make them worse. I only use it with static routes so far, so no mobility for me: I have to keep each server plugged into its assigned port, and reconfigure switches if I move it. Once you have ``no vlan larger than 1 switch,'' if you actually need a vlan-like thing that spans multiple switches, the new word for it is 'vrf'. so, yeah, it means the server people will have to take over the job of the networking people. 
The good news is that networking people don't like spanning tree very much because it's always going wrong, so AFAICT most of them who are paying attention are already moving in this direction. pgpEDdDjwl9Ck.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
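On the clnt_max_conns tuning mentioned earlier in this message: it is an rpcmod tunable on Solaris, so the usual way to try it is an /etc/system entry, roughly as sketched below, plus whatever NFS server thread tuning your release supports; treat the value as an example, not a recommendation:

    # /etc/system -- allow more TCP connections per NFS client/server pair
    # (the default has historically been 1); takes effect after a reboot
    set rpcmod:clnt_max_conns = 8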
Re: [zfs-discuss] 1tb SATA drives
bh == Brandon High bh...@freaks.com writes: bh Recent versions no longer support enabling TLER or ERC. To bh the best of my knowledge, Samsung and Hitachi drives all bh support CCTL, which is yet another name for the same thing. once again, I have to ask, has anyone actually found these features to make a verified positive difference with ZFS? Some of those things you cannot even set on Solaris because the channel to the drive with a LSI controller isn't sufficiently transparent to support smartctl, and the settings don't survive reboots. Brandon have you actually set it yourself, or are you just aggregating forum discussion? The experience so far that I've read here has been: * if a drive goes bad completely + zfs will mark the drive unavailable after a delay that depends on the controller you're using, but with lengths like 60 seconds, 180 seconds, 2 hours, or forever. The delay is not sane or reasonable with all controllers, and even if redundancy is available ZFS will patiently wait for the controller. The delay depends on the controller driver. It's part of the Solaris code. best case zpool will freeze until the delay is up, but there are application timeouts and iSCSI initiator-target timeouts, too---getting the equivalent of an NFS hard mount is hard these days (even with NFS, in some people's experiences). + the delay is different if the system's running when the drive fails, or if it's trying to boot up. For example iSCSI will ``patiently wait'' forever for a drive to appear while booting up, but will notice after 180 seconds while running. + because the disk is compeltely bad, TLER, ERC, CCTL, whatever you call it, doesn't apply. The drive might not answer commands ever, at all. The timer is not in the drive: the drive is bad starting now, continuing forever. * if a drive goes partially bad (large and increasing numbers of latent sector errors, which for me happens more often than bad-completely): + the zpool becomes unusably slow + it stays unusably slow until you use 'iostat' or 'fmdump' to find the marginal drive and offline it + TLER, ERC, CCTL makes the slowness factor 7ms : 7000ms vs 7ms : 3ms. In other words, it's unusably slow with or without the feature. AFAICT the feature is useful as a workaround for buggy RAID card firmware and nothing else. It's a cost differentiator, and you're swallowing it hook, line and sinker. If you know otherwise please reinform me, but the discussion here so far doesn't match what I've learned about ZFS and Solaris exception handling. That said, to reword Don Marti, ``uninformed Western Digital bashing is better than no Western Digital bashing at all.'' pgpFMSCuYt2qE.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
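For anyone who wants to poke at the knob anyway, it is exposed (on drives that expose it at all) as SCT ERC through smartmontools; a sketch with a made-up device name, and with the caveats above, i.e. the command may not pass through some LSI controllers on Solaris and the setting is usually lost at power-off:

    smartctl -l scterc /dev/rdsk/c3t43d0          # show current read/write ERC timers
    smartctl -l scterc,70,70 /dev/rdsk/c3t43d0    # set both timers to 7.0 s (units of 100 ms)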
Re: [zfs-discuss] File cloning
sw == Saxon, Will will.sa...@sage.com writes: sw 'clone' vs. a 'copy' would be very easy since we have sw deduplication now dedup doesn't replace the snapshot/clone feature for the NFS-share-full-of-vmdk use case because there's no equivalent of 'zfs rollback' I'm tempted to say, ``vmware needs to remove their silly limit'' but there are takes-three-hours-to-boot problems with thousands of Solaris NFS exports so maybe their limit is not so silly after all. What is the scenario, you have? Is it something like 40 hosts with live migration among them, and 40 guests on each host? so you need 1600 filesystems mounted even though only 40 are actually in use? 'zfs set sharenfs=absorb dataset' would be my favorite answer, but lots of people have asked for such a feature, and answer is always ``wait for mirror mounts'' (which BTW are actually just-works for me on very-recent linux, even with plain 'mount host:/fs /fs', without saying 'mount -t nfs4', in spite of my earlier rant complaining they are not real). Of course NFSv4 features are no help to vmware, but hypothetically I guess mirror-mounting would work if vmware supported it, so long as they were careful not to provoke the mounting of guests not in use. The ``implicit automounter'' on which the mirror mount feature's based would avoid the boot delay of mounting 1600 filesystems. and BTW I've not been able to get the Real Automounter in Linux to do what this implicit one already can with subtrees. Why is it so hard to write a working automounter? The other thing I've never understood is, if you 'zfs rollback' an NFS-exported filesystem, what happens to all the NFS clients? It seems like this would cause much worse corruption than the worry when people give fire-and-brimstone speeches about never disabling zil-writing while using the NFS server. but it seems to mostly work anyway when I do this, so I'm probably confused about something. pgpTw9yE68txJ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] carrying on
re == Richard Elling rich...@nexenta.com writes: re we would very much like to see Oracle continue to produce re developer distributions which more closely track the source re changes. I'd rather someone else than Oracle did it. Until someone else is doing the ``building'', whatever that entails all the way from Mercurial to DVD, we will never know if the source we have is complete enough to do a fork if we need to. I realize everyone has in their heads, FORK == BAD. Yes, forks are usually bad, but the *ability to make forks* is good, because it ``decouples the investments our businesses make in OpenSolaris/ZFS from the volatility of Sun and Oracle's business cycle,'' to paraphrase some blog comment. Particularly when you are dealing with datasets so large it might cost tens of thousands to copy them into another format than ZFS, it's important to have a 2 year plan for this instead of being subject to ``I am altering the deal. Pray I don't alter it any further.'' Nexenta being stuck at b134, and secret CVE fixes, does not look good. Though yeah, it looks better than it would if Nexenta didn't exist. IMHO it's important we don't get stuck running Nexenta in the same spot we're now stuck with OpenSolaris: with a bunch of CDDL-protected source that few people know how to use in practice because the build procedure is magical and secret. This is why GPL demands you release ``all build scripts''! One good way to help make sure you've the ability to make a fork, is to get the source from one organization and the binary distribution from another. As long as they're not too collusive, you can relax and rely on one of them to complain to the other. Another way is to use a source-based distribution like Gentoo or BSD, where the distributor includes a deliverable tool that produces bootable DVD's from the revision control system, and ordinary contributors can introspect these tools and find any binary blobs that may exist. pgpf3OSDelKXh.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL SSD failed
ds == Dmitry Sorokin dmitry.soro...@bmcorp.ca writes: ds The SSD drive has failed and zpool is unavailable anymore. AIUI, 6733267 Allow a pool to be imported with a missing slog is only fixed for the case where the pool is still imported. If you export it without removing the slog first, the pool is lost. Instructions here: http://opensolaris.org/jive/thread.jspa?messageID=377018 http://github.com/pjjw/logfix/tree/master show how to ``fake out'' the lazy assertions, but you have to prepare to use the workaround before your slog fails by noting its GUID. If you don't know the GUID, then it is as Richard Elling says, ``a rather long trial-and-error process.'' Decoded from Fanboi-ese into English, the ``rather long'' process is ``finding a sha1 hash collision.'' so either UTFS or ``restore from backup.'' :( pgpyK7PHBQp9Y.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
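The GUID in question lives in the slog's vdev label, so noting it while the device is still healthy is cheap; a sketch, with a made-up device name:

    zdb -l /dev/rdsk/c3t60d0s0 | grep guid    # record the vdev guid before the slog dies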
Re: [zfs-discuss] Legality and the future of zfs...
ab == Alex Blewitt alex.blew...@gmail.com writes: 3. The quality of software inside the firewire cases varies wildly and is a big source of stability problems. (even on mac) ab It would be good if you could refrain from spreading FUD if ab you don't have experience with it. yup, my experience was with the Prolific PL-3705 chip, which was very popular for a while. it has two problems: * it doesn't auto-pick its ``ID number'' or ``address'' or something, so if you have two cases with this chip on the same bus, they won't work. go google it! * it crashes. as in, I reboot the computer but not the case, and the drive won't mount. I reboot the case but not the computer, and the drive starts working again. http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html I even upgraded the firmware to give the chinese another shot. still broken. You can easily google for other problems with firewire cases in general. The performance of the overall system is all over the place depending on the bridge chip you use. Some of them have problems with ``large'' transactions as well. Some of them lose their shit when the drive reports bad sectors, instead of passing the error along so you can usefully diagnose it---not that they're the only devices with awful exception handling in this area, but why add one more mystery? I think it was already clear I had experience from the level of detail in the other items I mentioned, though, wasn't it? Add also to all of it the cache flush suspicions from Garrett: these bridge chips have full-on ARM cores inside them and lots of buffers, which is something SAS multipliers don't have AIUI. Yeah, in a way that's slightly FUDdy but not really since IIRC the write cache problem has been verified at least on some USB cases, hasn't it? Also since the testing procedure for cache flush problems is a littlead-hoc, and a lot of people are therefore putting hardware to work without testing cache flush at all, I think it makes perfect sense to replace suspicious components with lengths of dumb wire where possible even if the suspicions aren't proved. ab I have used FW400 and FW800 on Mac systems for the last 8 ab years; the only problem was with the Oxford 911 chipset in OSX ab 10.1 days. yeah, well, if you don't want to listen, then fine, don't listen. ab It may not suit everyone's needs, and it may not be supported ab well on OpenSolaris, but it works fine on a Mac. aside from being slow unstable and expensive, yeah it works fine on Mac. But you don't really have the eSATA option on the mac unless you pay double for the ``pro'' desktop, so i can see why you'd defend your only choice of disk if you've already committed to apple. Does the Mac OS even have an interesting zfs port? Remind me why we are discussing this, again? pgpbltDPUUaLy.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Legality and the future of zfs...
ab == Alex Blewitt alex.blew...@gmail.com writes: ab All Mac Minis have FireWire - the new ones have FW800. I tried attaching just two disks to a ZFS host using firewire, and it worked very badly for me. I found: 1. The solaris firewire stack isn't as good as the Mac OS one. 2. Solaris is very obnoxious about drives it regards as ``removeable''. There are ``hot-swappable'' drives that are not considered removeable but can be removed about as easily, that are maybe handled less obnoxiously. Firewire's removeable while SAS/SATA are hot-swappable. 3. The quality of software inside the firewire cases varies wildly and is a big source of stability problems. (even on mac) The companies behind the software are sketchy and weak, while only a few large cartels make SAS expanders for example. Also, the price of these cases is ridiculously high compared to SATA world. If you go there you may as well take your wad next door and get SAS. 4. The translation between firewire and SATA is not a simple one, and is not transparent to 'smartctl' commands, or other weird things like hard disk firmware upgraders. though I guess the same is true of the lsi controllers under solaris. This problem's rampant unfortunately. 5. Firewire is slow. too slow to make 2x speed interesting. and the host chips are not that advanced so they use a lot of CPU. 6. The DTL partial-mirror-resilver doesn't work. With b130 it still doesn't work. After half a mirror goes away and comes back, scrubs always reveal CKSUM errors on the half that went away. With b71 I found if I meticulously 'zpool offline'd the disks before taking them away, the CKSUM errors didn't happen. With b130 that no longer helps. so, scratchy unreliable connections are just unworkable. Even iSCSI is not great, but firewire cases sprawled all over a desk with trippable scratchy cables is just not on. It's better to have larger cases that can be mounted in a rack, or if not that, at least cases that are heavier and fewer in number and fewer in cordage. suggest that you do not waste time with firewire. SATA, SAS, or fuckoff. None of this is an insult to your blingy designer apple iShit. It applies equally well to any hardware involving lots of tiny firewire cases. pgp6yEjqWzyNZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
np == Neil Perrin neil.per...@oracle.com writes: np The L2ARC just holds blocks that have been evicted from the np ARC due to memory pressure. The DDT is no different than any np other object (e.g. file). The other cacheable objects require pointers to stay in the ARC pointing to blocks in the L2ARC. If the DDT required this, L2ARC-ification would be pointless since DDT entries aren't much smaller than ARC-L2ARC pointers, so from what I hear it is actually special in some way though I don't know precisely what way. pgpWlNwOCvSTx.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Trouble detecting Seagate LP 2TB drives
bh == Brandon High bh...@freaks.com writes: Atom bh 32-bit kernels can't support drives over 1GB. iirc, atom desktop chips are 64-bit and recognized as 64-bit by kernel, but not recognized by grub. but I thought this got fixed. If you use 'e' in grub to alter the boot line to replace $ISADIR with 'amd64' does it come up 64-bit and work? That's the fix I recall. pgpDQNWEKjhMU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
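For anyone wanting to try that edit, roughly (the exact menu entry text varies by build, so treat this as a sketch, not gospel): at the GRUB menu press 'e' on the boot entry, select the kernel$ line, press 'e' again, and change

    kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS

to

    kernel$ /platform/i86pc/kernel/amd64/unix -B $ZFS-BOOTFS

do the same $ISADIR -> amd64 substitution on the module$ boot_archive line, then press 'b' to boot. Afterward 'isainfo -kv' should report a 64-bit kernel if the fix applies to your Atom.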
Re: [zfs-discuss] Native ZFS for Linux
gd == Garrett D'Amore garr...@nexenta.com writes: gd There are numerous people in the community that have indicated gd that they believe that such linking creates a *derivative* gd work. Donald Becker has made this claim rather forcefully. yes, I think he has a point. The reality is, as long as Linus continues insisting that his ``interpretation'' of the GPL allows loading proprietary modules like ati/nVidia/wireless/... into the Linux kernel, it looks like no one will be sued over a module. This has been holding for a few decades anyway. If everyone with standing to sue is sufficiently under Linus's thumb then you may become safe enough for it to be worth the risk. Also, if they do not distribute their ZFS port to anyone else then they're fine: quite intentionally, they can link anything they like with Linux so long as they never distribute any binaries outside their organization, just like Akamai is fine basing their entire business off GPL'd Squid source code that they've improved vastly and not shared with anyone. We may find ourselves in a position where the guys distributing this Linux ZFS module could be sued and then told ``you have lost the right to distribute the GPL-derived work,'' to which their answer is, ``fine, we do not need to distribute it anyway. We only need to use it internally,'' so confronting them is a net loss for most of the parties with standing to do the confronting. An exception is, it could be a net win for Oracle because if they could shut down zfs.ko then peopo would be forced to run Solaris to get performant ZFS, which might play out in a funny way: Q. We are the owners of foobrulator.c in Linux, a GPLv2 source file. You may not link this CDDL stuff against our foobrulator.c. You have lost the right to distribute foobrulator.c. A. Wait, don't you own the copyright to the more restrictive CDDL stuff in question? Q. Yes, we own the copyrights to both sources, but you cannot link them together. A. HAHAHA you can't be serious. Q. Mwauh hah hah. A. ... who knows. maybe it could happen. In short, * yes zfs.ko could be a little sketchy * other people are doing much sketchier things already and making a lot of money doing it * looking at the big picture is a lot more convoluted than just ``allowed'' or ``OMGillegall''. If you want your share of this money/fame of the second bullet you might push the envelope as the others have, and consider who has standing to sue whom given a specific way of building and distributing the module, and among those who have standing who has motivation to do it, and finally if they actually do then how much have you got to lose. In other words: business, instead of FUD pedantry and CYA. * in particular, if your business does not involve distributing software... :) * GPL has so much momentum that contributing to a GPL-incompatible project is a significantly less valuable use of your time than contributing to a GPL-compatible one, even and maybe especially if you do not like the GPL. Perl, Apache, BSD, and FSF are all wising up to this and making their licenses more compatible from both directions. CDDL is thus, granted obviously well-liked by some, but very disappointing and regressive to quite a few potential contributors, and this disappointment is widely-understood partly becuse of ZFS+Linux. I almost hope they do not share their port with anyone and use it only internally, and that they make some huge improvements to ZFS that they then claim cannot be given back to Solaris because of license incompatibility. 
That will send a strong message to the forces of arrogance that crafted a GPL incompatible license at such a late date. In this age of web-scale megacompanies the distinction between GPL-style freedom and BSD-style freedom is much less because operations do not require binary redistributing, but license compatibility does still matter. pgpJGNtgXx2f3.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Homegrown Hybrid Storage
pk == Pasi Kärkkäinen pa...@iki.fi writes: You're really confused, though I'm sure you're going to deny it. I don't think so. I think that it is time to reset and reboot yourself on the technology curve. FC semantics have been ported onto ethernet. This is not your grandmother's ethernet but it is capable of supporting both FCoE and normal IP traffic. The FCoE gets per-stream QOS similar to what you are used to from Fibre Channel. FCoE != iSCSI. FCoE was not being discussed in the part you're trying to contradict. If you read my entire post, I talk about FCoE at the end and say more or less ``I am talking about FCoE here only so you don't try to throw out my entire post by latching onto some corner case not applying to the OP by dragging FCoE into the mix'' which is exactly what you did. I'm guessing you fired off a reply without reading the whole thing? pk Yeah, today enterprise iSCSI vendors like Equallogic (bought pk by Dell) _recommend_ using flow control. Their iSCSI storage pk arrays are designed to work properly with flow control and pk perform well. pk Of course you need a proper (certified) switches aswell. pk Equallogic says the delays from flow control pause frames are pk shorter than tcp retransmits, so that's why they're using and pk recommending it. please have a look at the three links I posted about flow control not being used the way you think it is by any serious switch vendor, and the explanation of why this limitation is fundamental, not something that can be overcome by ``technology curve.'' It will not hurt anything to allow autonegotiation of flow control on non-broken switches so I'm not surprised they recommend it with ``certified'' known-non-broken switches, but it also will not help unless your switches have input/backplane congestion which they usually don't, or your end host is able to generate PAUSE frames for PCIe congestion which is maybe more plausible. In particular it won't help with the typical case of the ``incast'' problem in the experiment in the FAST incast paper URL I gave, because they narrowed down what was happening in their experiment to OUTPUT queue congestion, which (***MODULO FCoE*** mr ``reboot yourself on the technology curve'') never invokes ethernet flow control. HTH. ok let me try again: yes, I agree it would not be stupid to run iSCSI+TCP over a CoS with blocking storage-friendly buffer semantics if your FCoE/CEE switches can manage that, but I would like to hear of someone actually DOING it before we drag it into the discussion. I don't think that's happening in the wild so far, and it's definitely not the application for which these products have been flogged. I know people run iSCSI over IB (possibly with RDMA for moving the bulk data rather than TCP), and I know people run SCSI over FC, and of course SCSI (not iSCSI) over FCoE. Remember the original assertion was: please try FC as well as iSCSI if you can afford it. Are you guys really saying you believe people are running ***iSCSI*** over the separate HOL-blocking hop-by-hop pause frame CoS's of FCoE meshes? or are you just spewing a bunch of noxious white paper vapours at me? because AIUI people using the lossless/small-output-buffer channel of FCoE are running the FC protocol over that ``virtual channel'' of the mesh, not iSCSI, are they not? pgp7HCeOuOq4h.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
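If you want to see what your NICs have actually negotiated rather than what the white papers say, a sketch on a crossbow-era build -- link name is just an example, and the property/value names are as I remember them, so verify against your dladm(1M):

    # dladm show-linkprop -p flowctrl e1000g0
    # dladm set-linkprop -p flowctrl=bi e1000g0
    # netstat -s -P tcp | grep -i retrans

The retransmit counters are the number that actually matters for the output-queue-drop story above; the flowctrl setting only tells you whether PAUSE frames can be exchanged at all.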
Re: [zfs-discuss] Homegrown Hybrid Storage
re == Richard Elling richard.ell...@gmail.com writes: re Please don't confuse Ethernet with IP. okay, but I'm not. seriously, if you'll look into it. Did you misread where I said FC can exert back-pressure? I was contrasting with Ethernet. Ethernet output queues are either FIFO or RED, and are large compared to FC and IB. FC is buffer-credit, which HOL-blocks to prevent the small buffers from overflowing, and IB is...blocking (almost no buffer at all---about 2KB per port and bandwidth*delay product of about 1KB for the whole mesh, compared to ARISTA which has about 48MB per port, so except to pedantic IB is bufferless, ie it does not even buffer one full frame). Unlike Ethernet, both are lossless fabrics (sounds good) and have an HOL-blocking character (sounds bad). They're fundamentally different at L2, so this is not about IP. If you run IP over IB, it is still blocking and lossless. It does not magically start buffering when you use IP because the fabric is simply unable to buffer---there is no RAM in the mesh anywhere. Both L2 and L3 switches have output queues, and both L3 and L2 output queues can be FIFO or RED because the output buffer exists in the same piece of silicon of an L3 switch no matter whether it's set to forward in L2 or L3 mode, so L2 and L3 switches are like each other and unlike FC IB. This is not about IP. It's about Ethernet. a relevant congestion difference between L3 and L2 switches (confusing ethernet with IP) might be ECN, because only an L3 switch can do ECN. But I don't think anyone actually uses ECN. It's disabled by default in Solaris and, I think, all other Unixes. AFAICT my Extreme switches, a very old L3 flow-forwarding platform, are not able to flip the bit. I think 6500 can, but I'm not certain. re no back-off other than that required for the link. Since re GbE and higher speeds are all implemented as switched fabrics, re the ability of the switch to manage contention is paramount. re You can observe this on a Solaris system by looking at the NIC re flow control kstats. You're really confused, though I'm sure you're going to deny it. Ethernet flow control mostly isn't used at all, and it is never used to manage output queue congestion except in hardware that everyone agrees is defective. I almost feel like I've written all this stuff already, even the part about ECN. Ethernet flow control is never correctly used to signal output queue congestion. The ethernet signal for congestion is a dropped packet. flow control / PAUSE frames are *not* part of some magic mesh-wide mechanism by which switches ``manage'' congestion. PAUSE are used, when they're used at all, for oversubscribed backplanes: for congestion on *input*, which in Ethernet is something you want to avoid. You want to switch ethernet frames to the output port where it may or may not encounter congestion so that you don't hold up input frames headed toward other output ports. If you did hold them up, you'd have something like HOL blocking. IB takes a different approach: you simply accept the HOL blocking, but tend to design a mesh with little or no oversubscription unlike ethernet LAN's which are heavily oversubscribed on their trunk ports. so...the HOL blocking happens, but not as much as it would with a typical Ethernet topology, and it happens in a way that in practice probably increases the performance of storage networks. 
This is interesting for storage because when you try to shove a 128kByte write into an Ethernet fabric, part of it may get dropped in an output queue somewhere along the way. In IB, never will part of the write get dropped, but sometimes you can't shove it into the network---it just won't go, at L2. With Ethernet you rely on TCP to emulate this can't-shove-in condition, and it does not work perfectly in that it can introduce huge jitter and link underuse (``incast'' problem: http://www.pdl.cmu.edu/PDL-FTP/Storage/FASTIncast.pdf ), and secondly leave many kilobytes in transit within the mesh or TCP buffers, like tens of megabytes and milliseconds per hop, requiring large TCP buffers on both ends to match the bandwidth*jitter and frustrating storage QoS by queueing commands on the link instead of in the storage device, but in exchange you get from Ethernet no HOL blocking and the possibility of end-to-end network QoS. It is a fair tradeoff but arguably the wrong one for storage based on experience with iSCSI sucking so far. But the point is, looking at those ``flow control'' kstats will only warn you if your switches are shit, and shit in one particular way that even cheap switches rarely are. The metric that's relevant is how many packets are being dropped, and in what pattern (a big bucket of them at once like FIFO, or a scattering like RED), and how TCP is adapting to these drops. For this you might look at TCP stats in solaris, or at output queue drop and output queue size stats on managed switches, or simply at the overall
Re: [zfs-discuss] Homegrown Hybrid Storage
et == Erik Trimble erik.trim...@oracle.com writes: et With NFS-hosted VM disks, do the same thing: create a single et filesystem on the X4540 for each VM. previous posters pointed out there are unreasonable hard limits in vmware to the number of NFS mounts or iSCSI connections or something, so you will probably run into that snag when attempting to use the much faster snapshotting/cloning in ZFS. * Are the FSYNC speed issues with NFS resolved? et The ZIL SSDs will compensate for synchronous write issues in et NFS. okay, but sometimes for VM's I think this often doesn't matter because NFSv3 and v4 only add fsync()'s on file closings, and a virtual disk is one giant file that the client never closes. There may still be synchronous writes coming through if they don't get blocked in LVM2 inside the guest or blocked in the VM software, but whatever comes through ought to be exactly the same number of them for NFS or iSCSI, unless the vm software has different bugs in the nfs vs iscsi back-ends. the other difference is in the latest comstar which runs in sync-everything mode by default, AIUI. Or it does use that mode only when zvol-backed? Or something. I've the impression it went through many rounds of quiet changes, both in comstar and in zvol's, on its way to its present form. I've heard said here you can change the mode both from the comstar host and on the remote initiator, but I don't know how to do it or how sticky the change is, but if you didn't change and stuck with the default sync-everything I think NFS would be a lot faster. This is if we are comparing one giant .vmdk or similar on NFS, against one zvol. If we are comparing an exploded filesystem on NFS mounted through the virtual network adapter, then of course you're right again Erik. The tradeoff integrity tests are, (1) reboot the solaris storage host without rebooting the vmware hosts guests and see what happens, (2) cord-yank the vmware host. Both of these are probably more dangerous than (3) command the vm software to virtual-cord-yank the guest. * Should I go with fiber channel, or will the 4 built-in 1Gbe NIC's give me enough speed? FC has different QoS properties than Ethernet because of the buffer credit mechanism---it can exert back-pressure all the way through the fabric. same with IB, which is HOL-blocking. This is a big deal with storage, with its large blocks of bursty writes that aren't really the case for which TCP shines. I would try both and compare, if you can afford it! je IMHO Solaris Zones with LOFS mounted ZFSs gives you the je highest flexibility in all directions, probably the best je performance and least resource consumption, fine grained je resource management (CPU, memory, storage space) and less je maintainance stress etc... yeah zones are really awesome, especially combined with clones and snapshots. For once the clunky post-Unix XML crappo solaris interfaces are actually something I appreciate a little, because lots of their value comes from being able to do consistent repeatable operations on them. The problem is that the zones run Solaris instead of Linux. BrandZ never got far enough to, for example, run Apache under a 2.6-kernel-based distribution, so I don't find it useful for any real work. I do keep a CentOS 3.8 (I think?) brandz zone around, but not for anything production---just so I can try it if I think the new/weird version of a tool might be broken. 
as for native zones, the ipkg repository, and even the jucr repository, has two years old versions of everything---django/python, gcc, movabletype. Many things are missing outright, like nginx. I'm very disappointed that Solaris did not adopt an upstream package system like Dragonfly did. Gentoo or pkgsrc would have been very smart, IMHO. Even opencsw is based on Nick Moffitt's GAR system, which was an old mostly-abandoned tool for building bleeding edge Gnome on Linux. The ancient perpetually-abandoned set of packages on jucr and the crufty poorly-factored RPM-like spec files leave me with little interest in contributing to jucr myself, while if Solaris had poured the effort instead into one of these already-portable package systems like they poured it into Mercurial after adopting that, then I'd instead look into (a) contributing packages that I need most, and (b) using whatever system Solaris picked on my non-Solaris systems. This crap/marginalized build system means I need to look at a way to host Linux under Solaris, using Solaris basically just for ZFS and nothing else. The alternative is to spend heaps of time re-inventing the wheel only to end up with an environment less rich than competitors and charge twice as much for it like joyent. But, yeah, while working on Solaris I would never install anything in the global zone after discovering how easy it is to work with ipkg zones. They are really brilliant, and unlike everyone else's attempt at these
Re: [zfs-discuss] ZFS recovery tools
sl == Sigbjørn Lie sigbj...@nixtra.com writes: sl Excellent! I wish I would have known about these features when sl I was attempting to recover my pool using 2009.06/snv111. the OP tried the -F feature. It doesn't work after you've lost zpool.cache: op I was setting up a new system (osol 2009.06 and updating to op the latest version of osol/dev - snv_134 - with op deduplication) and then I tried to import my backup zpool, but op it does not work. op # zpool import -f tank1 op cannot import 'tank1': one or more devices is currently unavailable op Destroy and re-create the pool from a backup source op Any other option (-F, -X, -V, -D) and any combination of them op doesn't help either. I have been in here repeatedly warning about this incompleteness of the feature while fanbois keep saying ``we have slog recovery so don't worry.'' R., please let us know if the 'zdb -e -bcsvL zpool-name' incantation Sigbjorn suggested ends up working for you or not. pgpFHj14VBEC7.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] nfs share of nested zfs directories?
cs == Cindy Swearingen cindy.swearin...@oracle.com writes: okay wtf. Why is this thread still alive? cs The mirror mount feature It's unclear to me from this what state the feature's in: http://hub.opensolaris.org/bin/view/Project+nfs-namespace/ It sounds like mirror mounts are done but referrals are not, but I don't know. Are the client and server *both* done? I assume so, because I don't know how else it could be tested. Is the bug with 'find' fixed? It looks like it was fixed, but very recently: http://opensolaris.org/jive/message.jspa?messageID=409895#409895 and it sounds like there could be problems with other programs that have a --one-file-system option like gnutar and rsync because the fix is sort of ad-hoc---it's done by making changes to the solaris userland. Are all the features described at: http://hub.opensolaris.org/bin/download/Project+nfs-namespace/files/mm-PRS-open.html actually implemented, including automounter overrides, automatic unmounting, recursive unmounting? not sure. Are you even using NFSv4 in Linux? It's very unlikely. probably you are using NFSv3. People are reporting unresolved problems with NFSv4 with connections bouncing and not properly simulating the ``statelessness'' that allows servers to reboot when clients don't: http://mail.opensolaris.org/pipermail/nfs-discuss/2010-April/002087.html granted, ISTR some of the problems are reported by people doing goofy bullshit through firewalls, like bank admins that don't seem to understand TCP/IP and are flailing around with the blamestick because they are in a CYA environment and don't have reasonable control of their own systems. but I am not sure it's worth the trouble! AFAICT you cannot even net-boot opensolaris over NFSv4: '/' comes up mounted with NFSv3. It seems to me every time this ``I can't see subdirectories'' comes up it's from someone who doesn't understand how NFS and Unix works, doesn't know how to mount ANY filesystem much less NFS, has no idea what version of NFS he is using much less how to determine his NFSv4 idmap domain (answer is: 'cat /var/run/nfs4_domain'). The right answer is ``you need to mount the underlying filesystem. You need one mount command or mount line in /etc/{v,}fstab per one exported filesystem on the server.'' very simple, very reasonable. But the answer pitched at them is all this convoluted bleeding edge mess about mirror mounts, coming from people who don't have any experience actually USING mirror mounts, always with the caveat ``I'm not sure if your client supports BUT ...''!!! But what? Are you even sure if the feature works ANYwhere, if you've never used it yourself? It sounds like a simple feature, but it just isn't. If it actually worked the question would not even exist, so how can it be the answer? It is like ``Q. Can you please help me? / A. You might not even be here. Maybe we are not having this conversation because everything works perfectly. Let me explain to you what `working perfectly' means and then you can tell me if you are real or not.'' I would suggest you forget about this noise for the moment and write heirarchical automount maps. This works on both Linux and Solaris, except that you don't have the full value of the automounter here because you cannot refresh parts of the subtree while the parent is still mounted, which is part of what the automounter is good for. 
It's normal that an automounter won't consider new map data for things that are already mounted, but for hierarchical automounts, AFAICT you have to unmount the whole tree before any changes deep inside the tree will be refreshed from the map, which is less than ideal but reflects the ad-hoc way the automounter's corner cases were slowly semifixed, especially on Linux. There are examples of hierarchical automounts in the man page, and if you don't understand the examples then simply do not use the automounter at all. You do not even need to use the automounter. You can just put your filesystems into /etc/fstab and walk away from it. Honestly I think it is crazy that it takes you over a month simply to get one NFS subdirectory mounted inside another. This should take one hour. Please just forget about all this newfangled bullshit and mount the filesystem. see 'man mount' and just DO it! Like this in /etc/fstab on Linux:

    terabithia:/arrchive        /arrchive        nfs  rw,noacl,nodev  0 0
    terabithia:/arrchive/music  /arrchive/music  nfs  rw,noacl,nodev  0 0

*DONE*. There is no NFSv4. It is NFSv3. There is no automounter. There are no ``mirror mounts'' and no referrals. If you add more ZFS filesystems, you add more lines to /etc/fstab on every Linux client. okay? If you are afraid you are using NFSv4, stop that from happening by saying '-o vers=3' on Solaris or '-t nfs' in Linux. But if you're using Linux, you're not using NFSv4. Solaris uses v4 by default.
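To make the forcing-v3 part concrete, roughly (paths reused from the example above; exact option spelling can vary by distro and release):

    Linux:    mount -t nfs -o vers=3,noacl,nodev terabithia:/arrchive /arrchive
    Solaris:  mount -F nfs -o vers=3 terabithia:/arrchive /arrchive
    check:    nfsstat -m /arrchive

nfsstat -m shows the mount options and NFS version that were actually negotiated, which settles the ``am I even using v4?'' question in one command.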
Re: [zfs-discuss] zfs recordsize change improves performance
ai == Asif Iqbal vad...@gmail.com writes: If you disable the ZIL for locally run Oracle and you have an unscheduled outage, then it is highly probable that you will lose data. ai yep. that is why I am not doing it until we replace the ai battery no, wait please, you still need the ZIL to be on, even with the battery. disabling the cache flush command is what the guide says is allowed and sometimes helpful for people who have NVRAM's, but disabling the cache flush command and disabling the ZIL are different. Disabling the ZIL means the write can be cached in DRAM until the next txg flush and not issued to the disks at all, so even if you have a disk array with an NVRAM that effectively writes everything as if it were sync, the disk array will not even see the write until txg commit time with ZIL disabled. If you have working NVRAM, I think disabling the ZIL is likely not to give much speed-up, so if you are going to try disabling it, now when your battery is dead is the time to do it. Once the battery's fixed, theory says your testing will probably show things are just as fast with ZIL enabled. AIUI if you disable the ZIL, the database should still come back in a crash-consistent state after a cord-yank, but it will be an older state than it should be, so if you have several RDBMS behind some kind of tiered middleware the different databases won't be in sync with each other so you can lose integrity. If you have only one RDBMS I think you will lose only durability through this monkeybusiness, and integrity will survive. I'm not an expert of anything, but that's my understanding for now. pgpFapbkFrlFR.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
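For the record, these are two different knobs. A sketch of both, as I understand the tunables on builds of that vintage -- the names have a habit of changing, so verify against your own build before trusting this:

    /etc/system:
        set zfs:zfs_nocacheflush = 1    (stop issuing cache-flush to the array -- the NVRAM case the guide allows)
        set zfs:zil_disable = 1         (no ZIL at all -- the thing being warned about above)

    or live, lasting until the next boot:
        echo zil_disable/W0t1 | mdb -kw

The first one still pushes every synchronous write out to the array immediately; the second one holds it in host DRAM until the next txg, which is the whole difference.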
Re: [zfs-discuss] New SSD options
d == Don d...@blacksun.org writes: hk == Haudy Kazemi kaze0...@umn.edu writes: d You could literally split a sata cable and add in some d capacitors for just the cost of the caps themselves. no, this is no good. The energy only flows in and out of the capacitor when the voltage across it changes. In this respect they are different from batteries. It's normal to use (non-super) capacitors as you describe for filters next to things drawing power in a high-frequency noisy way, but to use them for energy storage across several seconds you need a switching supply to drain the energy from it. the step-down and voltage-pump kinds of switchers are non-isolated and might do fine, and are cheaper than full-fledged DC-DC that are isolated (meaning the input and output can float wrt each other). you can charge from 12V and supply 5V if that's cheaper. :) hope it works. hk okay, we've waited 5 seconds for additional data to arrive to hk be written. None has arrived in the last 5 seconds, so we're hk going to write what we already have to better ensure data hk integrity, yeah, I am worried about corner cases like this. ex: input power to the SSD becomes scratchy or sags, but power to the host and controller remain fine. Writes arrive continuously. The SSD sees nothing wrong with its power and continues to accept and acknowledge writes. Meanwhile you burn through your stored power hiding the sagging supply until you can't, then the SSD loses power suddenly and drops a bunch of writes on the floor. That is why I drew that complicated state diagram in which the pod disables and holds-down the SATA connection once it's running on reserve power. Probably y'all don't give a fuck about such corners though, nor do many of the manufacturers selling this stuff, so, whatever. pgpYM02z6LZ58.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
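Rough numbers, to show why a plain filter cap across the SATA power leads doesn't get you there (assuming a ~5 W SSD and the 60 seconds of hold-up discussed elsewhere in the thread; converter losses ignored):

    energy needed:           5 W * 60 s = 300 J
    energy in a capacitor:   E = 1/2 * C * V^2
    usable energy draining 12 V down to 6 V through a buck converter:
                             1/2 * C * (12^2 - 6^2) = 54 * C joules
    so C ~= 300 / 54 ~= 5.6 F  --- supercap territory.

A 470 uF filter cap charged to 12 V stores about 1/2 * 470e-6 * 144 ~= 0.034 J, roughly four orders of magnitude short, which is why you need real energy storage plus a switching supply to drain it.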
Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?
dd == David Dyer-Bennet d...@dd-b.net writes: dd Just how DOES one know something for a certainty, anyway? science. Do a test like Lutz did on X25M G2. see list archives 2010-01-10. pgpeiR4DYODbj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS memory recommendations
et == Erik Trimble erik.trim...@oracle.com writes: et No, you're reading that blog right - dedup is on a per-pool et basis. The way I'm reading that blog is that deduped data is expanded in the ARC. pgpozjcLXZlNV.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
d == Don d...@blacksun.org writes: d Since it ignores Cache Flush command and it doesn't have any d persistant buffer storage, disabling the write cache is the d best you can do. This actually brings up another question I d had: What is the risk, beyond a few seconds of lost writes, if d I lose power, there is no capacitor and the cache is not d disabled? why use a slog at all if it's not durable? You should disable the ZIL instead. Compared to a slog that ignores cache flush, disabling the ZIL will provide the same guarantees to the application w.r.t. write ordering preserved, and the same problems with NFS server reboots, replicated databases, mail servers. It'll be faster than the fake-slog. It'll be less risk of losing the pool because the slog went bad and then you accidentally exported the pool while trying to fix things. The only case where you are ahead with the fake-slog, is the host's going down because of kernel panics rather than power loss. I don't know, though, what to do about these reports of devices that almost respect cache flushes but seem to lose exactly one transaction. AFAICT this should be a works/doesntwork situation, not a continuum. pgp4xXGJ3xew4.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
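And if you do conclude a non-durable slog isn't worth having, builds from around snv_125 onward (IIRC) can remove a log device outright rather than leaving it as a liability -- a sketch, pool and device names made up:

    # zpool remove tank c4t2d0
    # zpool status tank

After the remove, the 'logs' section should be gone from the status output and synchronous writes fall back to the main pool devices (or to nothing, if you've disabled the ZIL as above).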
Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?
rsk == Roy Sigurd Karlsbakk r...@karlsbakk.net writes: dm == David Magda dma...@ee.ryerson.ca writes: tt == Travis Tabbal tra...@tabbal.net writes: rsk Disabling ZIL is, according to ZFS best practice, NOT rsk recommended. dm As mentioned, you do NOT want to run with this in production, dm but it is a quick way to check. REPEAT: I disagree. Once you associate the disasterizing and dire warnings from the developer's advice-wiki with the specific problems that ZIL-disabling causes for real sysadmins rather than abstract notions of ``POSIX'' or ``the application'', a lot more people end up wanting to disable their ZIL's. In fact, most of the SSD's sold seem to be relying on exactly the trick disabled-ZIL ZFS does for much of their high performance, if not their feasibility within their price bracket period: provide a guarantee of write ordering without durability, and many applications are just, poof, happy. If the SSD's arrange that no writes are reordered across a SYNC CACHE, but don't bother actually providing durability, end uzarZ will ``OMG windows fast and no corruption.'' -- ssd sales. The ``do-not-disable-buy-SSD!!!1!'' advice thus translates to ``buy one of these broken SSD's, and you will be basically happy. Almost everyone is. When you aren't, we can blame the SSD instead of ZFS.'' all that bottlenecked SATA traffic host-SSD is just CYA and of no real value (except for kernel panics). Now, if someone would make a Battery FOB, that gives broken SSD 60 seconds of power, then we could use the consumer crap SSD's in servers again with real value instead of CYA value. FOB should work like this:

    == RUNNING ==                   SATA port: pass; power to SSD: on
        -- input power lost -->         POWER-LOST HOLD-DOWN
    == POWER-LOST HOLD-DOWN ==      SATA port: block; SSD keeps running on the battery
        -- input power restored -->     POWER-RESTORED HOLD-DOWN
        -- 60 seconds elapsed -->       POWER OFF
    == POWER OFF ==                 power to SSD: off
        -- input power restored -->     POWER-RESTORED HOLD-DOWN
    == POWER-RESTORED HOLD-DOWN ==  SATA port: block
        -- battery recharged -->        RUNNING

The device must know when its battery has gone bad and stick itself in ``power restored hold down'' state. Knowing when the battery is bad may require more states to test the battery, but this is the general idea. I think it would be much cheaper to build an SSD with supercap, and simpler because you can assume the supercap is good forever instead of testing it. However because of ``market forces'' the FOB approach might sell for cheaper because the FOB cannot be tied to the SSD and used as a way to segment the market. If there are 2 companies making only FOB's and not making SSD's, only then competition will work like people want it to. Otherwise FOBs will be $1000 or something because only ``enterprise'' users are smart/dumb enough to demand them. Normally I would have a problem that the FOB and SSD are separable, but see, the FOB and SSD can be put together with double-sided tape: the tape only has to hold for 60 seconds after $event, and there's no way to separate the two by tripping over a cord. You can safely move SSD+FOB from one chassis to another without fearing all is lost if you jiggle the connection. I think it's okay overall. tt This risk is mostly mitigated by UPS backup and auto-shutdown tt when the UPS detects power loss, correct? no no it's about cutting off a class of failure cases and constraining ourselves to relatively sane forms of failure. We are not haggling about NO FAILURES EVAR yet.
First, for STEP 1 we isolate the insane kinds of failure that cost us days or months of data rather than just a few seconds, the kinds that call for crazy unplannable ad-hoc recovery methods like `Viktor plz help me' and ``is anyone here a Postgres data recovery expert?'' and ``is there a way I can invalidate the batch of billing auth requests I uploaded yesterday so I can rerun it without double-billing anyone?'' For STEP 1 we make the insane fail almost impossible through clever software and planning. A UPS never never ever qualifies as ``almost impossible''. Then, once that's done, we come back for STEP 2 where we try to minimize the sane failures also, and for step 2 things like UPS might be useful. For STEP 2 it makes sense to talk about percent availability, probability of failure, length of time to recover from Scenario X. but in STEP 1 all the failures are insane
Re: [zfs-discuss] ZFS memory recommendations
et == Erik Trimble erik.trim...@oracle.com writes: et frequently-accessed files from multiple VMs are in fact et identical, and thus with dedup, you'd only need to store one et copy in the cache. although it sounds counterintuitive, I thought this wasn't part of the initial release. Maybe I'm wrong altogether or maybe it got added later? http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comment-1257191094000 pgp4W7jhfu4MV.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
bh == Brandon High bh...@freaks.com writes: bh The devid for a USB device must change as it moves from port bh to port. I guess it was tl;dr the first time I said this, but: the old theory was that a USB device does not get a devid because it is marked ``removeable'' in some arcane SCSI page, for the same reason it doesn't make sense to give a CD-ROM a devid because its medium can be removed. I don't know if this has changed, or if it's even what's really going on. but like I said without the ramdisk boot option it's more important to fix this type of problem, so if someone has a workaround please share! pgpkdrT55NtZq.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
jcm == James C McPherson james.mcpher...@oracle.com writes: storage controllers are more difficult for driver support. jcm Be specific - put up, or shut up. marvell controller hangs machine when a drive is unplugged marvell controller does not support NCQ marvell driver is closed-source blob sil3124 driver has lost interrupt problems ATI SB600/SB700 AHCI driver has performance problems mpt driver has disconnect under heavy load problems that may or may not be MSI-related mpt driver is closed source blob mpt driver is not SATA framework and thus does not work with DVD-ROMS or with smartctl XXX -- smartctl does work now, with '-d sat,12'? or only AHCI works with that? USUAL SUGGESTION: use 1068e non-raid and mpt driver, live with problems USUAL OPTIMISM: lsi2008 / mega_sas, which i THINK are open source but opengrok seems to be down so I did not verify. My perception is if you are using external cards which you know work for networking and storage, then you should be alright. Am I out in left-field on this? jcm I believe you are talking through your hat. network performance problems with realtek network performance problems with nvidia nforce network working-at-all problems with broadcom bge and bnx because of the ludicrous number of chip steppings and errata closed-source blob drivers with broadcom bnx performance and working-at-all problems for atheros L1 USUAL SUGGESTION: use intel 82540 derivative. which, for an AMD board, will almost always be an external card because AMD boards are usually realtek, broadcom, or marvell for AMD chipsets, and realtek or nforce for nVidia chipsets (if anyone still uses nvidia chipsets). FAIR STATEMENT: Linux shares most of these problems except over there bnx is open source. USUAL OPTIMISM: crossbow-supported cards with L4 classifiers in the MAC other than bnx, such as 10gig ones, may be the future, much more performant, ready for CoS pause frames, and good multicore performance, and having source. god willing their quality might turn out to be more uniform but probably nobody knows yet, and they're not cheap and ubiquitous onboard yet. I'm hoping infiniband comes back and 10gig goes away, but that's probably not realistic. WELL POISONING: saying ``if you want open-source drivers go whine at the hardware vendor because they make us sign an NDA, so there's nothing we can do,'' is hogwash. (a) Sun's the one able to realistically bargain with the vendor, not users, because they bring to the table developer hours, OS support, a class of customers, trusting contacts within the vendor, and a hardware manufacturing arm that can make purchasing decisions long-term and at a motherboard component level; no user has anywhere near this insane level of bargaining power; see OpenBSD presentation and ``the OEM problem'', (b) usually only one chip works anyway, so there is no competition, (c) Linux has open source drivers for all these chips and is an existence proof that yes, you can do something about it, and (d) the competition for users is between Solaris and Linux, not between Marvell and LSI. If we want complete source for the OS we will get it faster and more reliably by going to the OS that offers it, not by whining to chip vendors. This is not flamebait but just obvious reality---so obvious that almost everyone who really cares enough to say it is already gone. HTH, HAND. pgpGJkSjxmX5x.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
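On the smartctl question above, the invocation people report working through a SAT layer is along these lines -- the device path is an example, and whether it makes it through your particular driver or an SAS expander is exactly the open question marked XXX:

    # smartctl -a -d sat,12 /dev/rdsk/c3t0d0s0

The ',12' asks for the 12-byte SAT pass-through variant; some controllers only accept that form, some only the 16-byte default, and some (closed-blob mpt setups) reportedly accept neither.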
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
bh == Brandon High bh...@freaks.com writes: bh If you boot from usb and move your rpool from one port to bh another, you can't boot. If you plug your boot sata drive into bh a different port on the motherboard, you can't bh boot. Apparently if you are missing a device from your rpool bh mirror, you can't boot. yeah, this is retarded and should get fixed. bh zpool.cache saves the device path to make importing pools bh faster. It would be nice if there was a boot flag you could bh give it to ignore the file... I've no doubt this is true but ISTR it's not related to the booting problem above because I do not think zpool.cache is used to find the root pool. It's only used for finding other pools. ISTR the root pool is found through devid's that grub reads from the label on the BIOS device it picks, and then passes to the kernel. note that zpool.cache is ON THE POOL, so it can't be used to find the pool (ok, it can---on x86 it can be sync'ed into the boot archive, and on SPARC it can be read through the PROM---but although I could be wrong ISTR this is not what's actually done). I think you'll find you CAN move drives among sata ports, just not among controller types, because the devid is a blob generated by the disk driver, and pci-ide and AHCI will yield up different devid's for the same disk. Grub never calculates a devid, just reads one from the label (reads a devid that some earlier kernel got from pci-ide or ahci and wrote into the label). so when ports and device names change, rewriting labels is helpful but not urgent. When disk drivers change, rewriting labels is urgent. yeah, the fact that ramdisk booting isn't possible with opensolaris makes this whole situation a lot more serious than it was back when SXCE was still available for download. Is there any way to make a devid-proof rescue boot option? Is there a way to make grub boot an iso image off the hard disk for example? pgp84LsPjArBH.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hard drives for ZFS NAS
eg == Emily Grettel emilygrettelis...@hotmail.com writes: eg What do people already use on their enterprise level NAS's? For a SOHO NAS similar to the one you are running, I mix manufacturer types within a redundancy set so that a model-wide manufacturing or firmware glitch like the ones of which we've had several in the last few years does not take out an entire array, and to make it easier to figure out whether weird problems in iostat are controller/driver's fault, or drive's fault. If there are not enough manufacturers with good drives on offer, I'll try to buy two different models of the same manufacturer, ex get one of them an older model number of the same drive size/featureset. Often you find two mini-generations are on offer at once. At the moment, I would not buy any WD drive because they have been changing drives' behavior without changing model numbers which makes pointless discussions like this one because the model numbers become meaningless and you cannot bind your experience to a repeatable purchasing decision other than ``do/don't buy WD''. When the dust settles from this silent-firmware-version-bumps and 4k-sector disaster, I would buy WD again because the more mfg diversity, the more bad-batch-proofing you have for wide stripes. I used to buy near-line drives but no longer do this because it's cheaper to buy two regular drives than one near-line drive, but this may be a mistake because of the whole vibration disaster. pgpRFDcJerIaG.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hard drives for ZFS NAS
bh == Brandon High bh...@freaks.com writes: bh From what I've read, the Hitachi and Samsung drives both bh support CCTL, which is in the ATA-8 spec. There's no way to bh toggle it on from OpenSolaris (yet) and it doesn't persist bh through reboot so it's not really ideal. bh Here's a patch to smartmontools that is supposed to enable bh it. It's in the SVN version 5.40 but not the current 5.39 bh release: http://www.csc.liv.ac.uk/~greg/projects/erc/ That's good to know. It would be interesting to know if the smartctl command in question can actually make it through a solaris system, and on what disk driver. AHCI and mpt are different because one is SATA framework and one isn't. I wonder also if SAS expanders cause any problems for smartctl? also, has anyone actually found this feature to have any value at all? To be clear, I do understand what the feature does. I do not need it explained to me again. but AIUI with ZFS you must remove a partially failing drive, or else the entire pool becomes slow. It does not matter if the partially-failing drive is returning commands in 30sec (the ATA maximum) or 7sec by CCTL/TLER/---you must still find and remove it, or the zpool will become pathologically slow. If there is actual experience with the feature helping ZFS, I'd be interested, but so far I think people are just echoing wikipedia shoulds and speculations, right? pgpOauVvynd3C.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
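For reference, the patched/5.40 smartctl mentioned above sets the error-recovery limit like this -- times are in tenths of a second, the device path is an example, and as noted the setting does not survive a power cycle, which is part of why it's not really ideal:

    # smartctl -l scterc,70,70 /dev/rdsk/c5t0d0s0
    # smartctl -l scterc /dev/rdsk/c5t0d0s0

The first form sets a 7.0 second read and write recovery limit; the second just queries what's currently set. Whether the command survives the Solaris driver stack or an expander is the same open question as with the other smartctl pass-through commands.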
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
bh == Brandon High bh...@freaks.com writes: bh The drive should be on the same USB port because the device bh path is saved in the zpool.cache. If you removed the bh zpool.cache, it wouldn't matter where the drive was plugged bh in. I thought it was supposed to go by devid. There was a bug a while ago that Solaris won't calculate devid for devices that say over SCSI they are ``removeable'' because, in the sense that a DynaMO or DVD-R is ``removeable'', the serial number returned by various identity commands or mode pages isn't bound to any set of stored bits, and the way devid's are used throughout Solaris means they are like a namespace or an array-of for a set of bit-stores so it's not appropriate for a DVD-R drive to have a devid. A DVD disc could have one, though---in fact a release of a pressed disc could appropriately have a non-serialized devid. However USB stick designers used to working with Microsoft don't bother to think through how the SCSI architecture should work in a sane world because they are used to reading chatty-idiot Microsoft manuals, so they fill out the page like a beaurocratic form with whatever feels appropriate and mark USB sticks ``removeable'', which according to the standard and to a sane implementer is a warning that the virtual SCSI disk attached to the virtual SCSI host adapter inside the USB pod might be soldered to removeable FLASH chips. It's quite stupid because before the OS has even determined what kind of USB device is plugged in, it already knows the device is removeable in that sense, just like it knows hot-swap SATA is removeable. USB is no more removeable, even in practical use, than SATA. (eSATA! *slap*) Even in the case of CF readers, it's probably wrong most of the time to set the removeable SCSI flag because the connection that's severable is between the virtual SCSI adapter in the ``reader'' and the virtual SCSI disk in the CF/SD/... card, while the removeable flag indicates severability between SCSI disk and storage medium. In the CF/SD/... reader case the serial number in the IDENTIFY command or mode pages will come from CF/SD/... and remain bound to the bits. The only case that might call for setting the bit is where the adapter is synthesizing a fake mode page where the removeable bit appears, but even then the bit should be clear so long as any serialized fields in other commands and mode pages are still serialized somehow (whether synthesized or not). Actual removeable in-the-scsi-standard's-sense HARD DISK drives mostly don't exist, and real removeable things in the real world attach as optical where an understanding of their removeability is embedded in the driver: ANYTHING the cd driver attaches will be treated removeable. consequently the bit is useless to the way solaris is using it, and does little more than break USB support in ways like this, but the developers refuse to let go of their dreams about what the bit was supposed to mean even though a flood of reality has guaranteed at this point their dream will never come true. I think there was some magical simon-sez flag they added to /kernel/drv/whatever.conf so the bug could be closed, so you might go hunting for that flag in which they will surely want you to encode in a baroque case-sensitive undocumented notation that ``The Microtraveler model 477217045 serial 80502813 attached to driver/hub/hub/port/function has a LYING REMOVEABLE FLAG'', but maybe you can somehow set it to '*' and rejoin reality. Still this won't help you on livecd's. 
It's probably wiser to walk away from USB unless/until there's a serious will to adopt the practical mindset needed to support it reasonably. pgpAoBbGUMwdU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is it safe to disable the swap partition?
mg == Mike Gerdts mger...@gmail.com writes: mg If Solaris is under memory pressure, [...] mg The best thing to do with processes that can be swapped out mg forever is to not run them. Many programs allocate memory they never use. Linux allows overcommitting by default (but disableable), but Solaris doesn't and can't, so on a Solaris system without swap those allocations turn into physical RAM that can never be used. At the time the never-to-be-used pages are allocated, ARC must be dumped to make room for them. With swap, pages that are allocated but never written can be backed by swap, and the ARC doesn't need to be dumped until the pages are actually written. Note that, in this hypothetical story, swap is never written at all, but it still has to be there. If you run a java vm on your ``storage server'', then you might care about this. I think the no-swap dogma is very soothing and yet very obviously wrong. If you want to get into the overcommit game, fine. If you want to play a game where you will overcommit up to the size of the ARC, well, ``meh'', but fine. Until then, though, swap makes sense. pgpA7wEb34DwB.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
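Adding swap on a ZFS-root box is cheap, for what it's worth -- a sketch, with the size and dataset name made up:

    # zfs create -V 4G rpool/swap2
    # swap -a /dev/zvol/dsk/rpool/swap2
    # swap -l
    # swap -s

swap -l and swap -s will show whether the space is ever actually written to; in the never-used-allocation story above, it won't be, but its existence is what lets the ARC stay resident.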
Re: [zfs-discuss] ZFS, NFS, and ACLs ssues
mef == Mary Ellen Fitzpatrick mfitz...@bu.edu writes: mef Is there a way to set permissions so that the /etc/auto.home mef file on the clients does not list every exported dir/mount mef point? If I understand the question right, then, no. These maps are very traditional from the earliest days of NIS and need to be managed centrally, and they should match the structure of filesystems exported from servers exactly. The new scheme of ``mirror mounts'' and ``referrals'' which does away with the global automount map and sprinkles bits of the map onto individual servers, is oft-discussed but seldom implemented, and it's not at all traditional so it's unclear to me that it's ever going to become The Way even though everyone talks about it like it's _fait accompli_. I only mention it to poison the well: if someone tries to discuss this with you, you should immediately close your ears because they are only dreaming for now, and none of it works yet. Unfortunately, autofs implementations' quality varies widely. I think Linux is on their...fourth? rewrite of the whole automount framework, and Mac OS X on at least their second if not third. I found Mac OS X's one is poor at handling nested mounts like what you're doing compared to the solaris one. The apple people sneakily altered the automounter documentation to remove all examples showing nested mounts, without actually documenting frankly the limitation which surely prompted them to alter the examples. slimey fucks. You can work around their fail using the 'net' option but this prevents assembling subtrees from several different servers. Each of your nested subtrees must be from the same server when using the 'net' option workaround because you lose the right to choose where they're mounted: http://web.ivy.net/~carton/rant/macos-automounter.html#9050149 The linux one will do nested subtrees, but I think you need to express the entire subtree as a single automount record, with a single trigger. This is different from Mac OS X with-workaround which will (provided you use 'net') miraculously assemble a view of the entire subtree from several dscl records which in theory could even be on LDAP. so, Linux will automount and unmount everything together, while Mac OS X will not. You might reasonably wish to have the mountpoints within the automounted filesystem turn into triggers themselves so that parts of the subtree are only mounted just as deeply as and only along the branch needed to satisfy the trigger---that way a subtree could be assembled from many servers, and if the map for a deep corner of the subtree were changed and pushed, clients could start obeying the changes sooner. but I think on Linux this won't work. not sure it works anywhere though. I guess it sort of works on Mac OS X with heavy caveat, but not sure about Solaris. carton -hard,intr,noacl/ cash:/export/home/ \ /VDI cash:/export/home//VDI but although Linux beats the Mac here, the linux one is shit at handling direct mounts: if you give it a subdirectory in which it owns all of the fake files, like /home, it works ok. but if you want it to for example automount /arrchive (just the one filesystem onto /arrchive from one share on one server) I found it hardly works at all. so I have /arrchive automounted on my Solaris boxes, and /remedial-automount/arrchive with symlink /archive - remedial-automount/arrchive on Linux boxes. so much for one traditional centrally-managed map. 
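For the record, a hierarchical (multi-mount) entry looks roughly like this -- the server and key names here are made up, and the master-map file is /etc/auto_master on Solaris versus /etc/auto.master on Linux:

    /etc/auto_master:
        /home    auto_home

    auto_home:
        someuser  -hard,intr,noacl  /     cash:/export/home/someuser \
                                    /VDI  cash:/export/home/someuser/VDI

After editing maps, running 'automount -v' makes the daemon re-read them, subject to the already-mounted caveat described above.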
pgppJ7OZtGw8p.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
dm == David Magda dma...@ee.ryerson.ca writes: dm Given that ZFS is always consistent on-disk, why would you dm lose a pool if you lose the ZIL and/or cache file? because of lazy assertions inside 'zpool import'. you are right there is no fundamental reason for it---it's just code that doesn't exist. If you are a developer you can probably still recover your pool, but there aren't any commands with a supported interface to do it. 'zpool.cache' doesn't contain magical information, but it allows you to pass through a different code path that doesn't include the ``BrrkBrrk, omg panic device missing, BAIL OUT HERE'' checks. I don't think squirreling away copies of zpool.cache is a great way to make your pool safe from slog failures because there may be other things about the different manual 'zpool import' codepath that you need during a disaster, like -F, which will remain inaccessible to you if you rely on some saving-your-zpool.cache hack, even if your hack ends up actually working when the time comes, which it might not. The case I think is really interesting is an HA cluster using a single-device slog made from a ramdisk on the passive node. This case would also become safer if slogs were fully disposable. pgpmcPw2Mcugv.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
re == Richard Elling richard.ell...@gmail.com writes: A failed unmirrored log device would be the permanent death of the pool. re It has also been shown that such pools are recoverable, albeit re with tedious, manual procedures required. for the 100th time, No, they're not, not if you lose zpool.cache also. pgpuUVBmI8w1p.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
re == Richard Elling richard.ell...@gmail.com writes: re a well managed system will not lose zpool.cache or any other re file. I would complain this was circular reasoning if it weren't such obvious chest-puffing bullshit. It's normal even to the extent of being a best practice to have no redundancy for rpool on systems that can tolerate gaps in availability because you can reinstall from the livecd relatively quickly. re It is disingenuous to complain about multiple failures strongly disagree. I'm quite genuine. A really common and really terrible suggestion is, ``get an SSD, and put your rpool in one slice and your slog in another.'' If you do that and lose the SSD, you've lost the whole pool. You cannot recover with 'zpool clear' or any number of -f -F -FFF flags. This common scenario doesn't require any multiple failure. Now, even among those who don't do this, people following your suggestions will not design their systems realizing the rpool and the SSD make up a redundant pair. They will not see: you can lose the rpool and import the pool IFF you have the SSD, and you can lose the SSD and force-online the pool IFF you have the rpool with the missing-slog pool already imported to it. They will instead design following the raidz/mirroring failure rules treating slog as disposable, like you've told them, and this is flat wrong. Hiding behind fuzzy glossary terms like ``multiple failures'' is useless, IMHO to the point of being deliberately obtuse. Besides that, you don't need any multiple failures---all you need to do is make the mistake of typing the perfectly reasonable command 'zpool export' in the course of trying to fix your problem, and poof, your whole pool is gone. A pool that runs fine until you try to export and re-import it, after which it is permanently lost, is a ticking time bomb. I don't think it's a good idea to run that way at all because of the flexible tools one needs to have available for maintenance in a disaster (ex., livecd of newer version with special import -F rescue-magic in it, WONT WORK. moving drives to a different controller causing them to have a different devid, WONT WORK. accumulate enough of these and not only does your toolkit get smaller and weaker, but you must move slowly and with great fear because the slightest move can make everything explode in totally unobvious ways.). If you do want to run this way, as an absolute MINIMUM, you need to discuss this cannot-import case at moments like this one so that it can influence people's designs. It seems if I say it the long way, I get ignored. If I say it the short way, you dive into every corner case. I don't know how to be any more clear, so...good luck out there, y'all. pgplz3pxj4vHy.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Secure delete?
edm == Eric D Mudama edmud...@bounceswoosh.org writes: edm How would you stripe or manage a dataset across a mix of edm devices with different geometries? the ``geometry'' discussed is 1-dimensional: sector size. The way that you do it is to align all writes, and never write anything smaller than the sector size. The rule is very simple, and you can also start or stop following it at any moment without rewriting any of the dataset and still get the full benefit. pgpj2CsEgHKlY.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
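the same rule stated as dd commands, if that helps (a sketch---the disk name is made up, and obviously don't point this at a disk you care about):
-8-
# always write whole 4kB blocks at offsets that are multiples of 4kB
dd if=/dev/zero of=/dev/rdsk/c9t9d9s0 bs=4k count=1 oseek=100    # aligned: one write, no read needed
dd if=/dev/zero of=/dev/rdsk/c9t9d9s0 bs=512 count=1 oseek=801   # 512B write: forces a read-modify-write inside the device
-8-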
Re: [zfs-discuss] Secure delete?
edm == Eric D Mudama edmud...@bounceswoosh.org writes: edm What you're suggesting is exactly what SSD vendors already do. no, it's not. You have to do it for them. edm They present a 512B standard host interface sector size, and edm perform their own translations and management inside the edm device. It is not nearly so magical! The pages are 2 - 4kB. Their size has nothing to do with the erase block size or the secret blackbox filesystem running on the SSD. It's because of the ECC, because the reed-solomon for the entire block must be recalculated if any of the block is changed. Therefore, changing 0.5kB means: for a 4kB page device: * read 4kB * write 4kB for a 2kB page device: * read 2kB * write 2kB and changing 4kB at offset integer * 4kB means: for a 4kB device: * write 4kB for a 2kB device: * write 4kB It does not matter if all devices have the same page size or not. Just write at the biggest size, or write at the appropriate size if you can. The important thing is that you write a whole page, even if you just pad with zeroes, so the controller does not have to do any reading. simple. the problem with big-sector spinning hard drives and alignment/blocksize is exactly the same problem. non-ZFS people discuss it a lot because ZFS filesystems start at integer * rather large block offset, thanks to all the disk label hocus pocus, but NTFS filesystems often start at 16065 * 0.5kB. pgpEdIwHb5RuZ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
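and if you want to see what write granularity ZFS itself promised a given pool's vdevs, the ashift in the label is the thing to look at (a sketch---pool name made up, and zdb output format shifts around between builds):
-8-
# ashift=9 means ZFS considers 512-byte writes fine;
# ashift=12 would mean every write is a whole 4kB
zdb -C tank | grep ashift
-8-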
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
jcm == James C McPherson james.mcpher...@oracle.com writes: ga == Günther Alka a...@hfg-gmuend.de writes: jcm I am amazed that you believe OpenSolaris binary distro has too jcm much desktop stuff. Most people I have come across are firmly jcm of the belief that it does not have enough. minification is stupid, anyway. It causes way more harm than good. I can understand not wanting to have weird flavour-of-the-month daemons running until you've been bothered to learn what they do, but not wanting to have their binaries on the disk is just silly. It's also annoying when some sysadmin minifies away xauth so that 'ssh -Y' doesn't work, or minifies away vi because he uses nano---his OCD becomes unreasonable bigotry taking the place of building a workable consensus platform, which is the proper task at hand when deciding what to include and how to present it. But it gets much worse when the minifiers start reaching into the packages themselves and turning off options. Ex., they will turn off the Perl/Python scripting support for some common package because they want to yank out Perl and Python to make the distribution smaller. Or they do not want to ship libX11.so, so they'll rebuild packages with X support switched off. Once they've done that, if you actually need those things, it will waste heaps more time to track down what went wrong. The existence of the knobs themselves is harmful enough, but the popular demand of idiots for this kind of knob wastes the time of the non-idiot packagers expected to provide it: they have to split the result of a single build into twenty tiny interdependent subpackages, shim dlopen() in there where it wasn't before (if it's a binary package system), and then go back and test the whole monster: wherever they drop the ball, you suffer, and while they're tossing the ball around they're spending time pandering to the damned minifiers instead of making and updating other packages which are actually useful to sane people. The insanity gets pushed further when whole packages start factoring core pieces of functionality into ``modules'', so now in Eye of Gnome, I have the ``double click on a picture to make it bigger Module.so''. I guess, if I want to make my system smaller, I can use the packaging system to remove the ability to double click on pictures and make them bigger? What the fuck? The minification fetish has spread out both directions from the packaging system and infected everything from the architecture of the source code to the user-visible menu structure of the app! Minification zealotry should stick to systems running from NOR flash like openwrt, or 1GB NAND systems like android. It's got no place on a system with disks. As a corollary, any minification based on busting a binary into .so's and then scattering the .so's into packages is stupid, because the package systems where minification makes sense are source-based and don't need that, in fact suffer from it because the split binaries contain more symbols and are larger in core and larger on disk. Just say no to minification if you're doing it because it ``feels'' right. Just knock it off. Go work on your car stereo, or develop perverted rituals with your espresso machine, instead. ga i installed opensolaris and my first impression was very ga disappointing. yeah. me, too: my first impression was ``the installer does not work at all without X11. oh, and BTW X11 does not work at all without nVidia haha, ENJOY.'' That was at least two years ago though. 
ga the gui was slow and not very intuitive and the only thing ga thats's running fine was the browser. wtf, mate? You complain the install is not minimal, but then you judge the overall system by the superficial impression its GUI makes? ga if someone will try it -its free. Is it? I don't really understand the nexenta license, which is why I don't bother with it. The opensolaris licensing is already confusing because parts of it are binary, and 'pkg' makes it very easy to install things with non-redistributable licenses, or extremely weird things like SunPro compilers that claim to have different licenses depending on what you use them for or how you define yourself as a person and include automatic agreements not to publish unfavorable benchmarks and other similar bullshit. It's admirable, important, and surprising to me that Solaris has actually managed to become a redistributable livecd with a modern package system (yeah, and where's your darwin livecd, fanboy?), but still because of the ecosystem opensolaris comes from you're constantly one enter key away from encumbering your system. If you want it to be free maybe use freebsd---then you still get ZFS but you get away from some of the lazy assertions, most of the binary disk drivers and mid-layers, and from the stupid legacy disk-labeling. FreeBSD also has a scripted build process all the way from source tree to .iso that you can run yourself.
Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?
dd == David Dyer-Bennet d...@dd-b.net writes: dd Is it possible to switch to b132 now, for example? yeah, this is not so bad. I know of two approaches: * genunix.org assembles livecd's of each bnnn tag. You can burn one, unplug from the internet, install it. It is nice to have a livecd capable of mounting whatever zpool and zfs version you are using. I'm not sure how they do this, but they do it. * see these untested but relatively safe-looking instructions (apologies to whoever posted them---i didn't write down the credit): formal IPS docs: http://dlc.sun.com/osol/docs/content/2009.06/IMGPACKAGESYS/index.html how to get a specific snv build with ips
-8-
Starting from OpenSolaris 2009.06 (snv_111b) active BE.
1) beadm create snv_111b-dev
2) beadm activate snv_111b-dev
3) reboot
4) pkg set-authority -O http://pkg.opensolaris.org/dev opensolaris.org
5) pkg install SUNWipkg
6) pkg list 'entire*'
7) beadm create snv_118
8) beadm mount snv_118 /mnt
9) pkg -R /mnt refresh
10) pkg -R /mnt install ent...@0.5.11-0.118
11) bootadm update-archive -R /mnt
12) beadm umount snv_118
13) beadm activate snv_118
14) reboot
Now you have a snv_118 development environment.
also see: http://defect.opensolaris.org/bz/show_bug.cgi?id=3436 which currently says about the same thing.
-8-
you see the bnnn is specified in line 10, ent...@0.5.11-0.nnn. There is no ``failsafe'' boot archive with opensolaris like the ramdisk-based one that was in the now-terminated SXCE, so you should make a failsafe boot option yourself by cloning a working BE and leaving that clone alone. and...make the failsafe clone new enough to understand your pool version or else it's not very useful. :) pgpxowC3Fu66n.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
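making your own failsafe is just cloning the BE you know boots, something like this (names made up):
-8-
beadm create -e snv_118 snv_118-failsafe   # clone the known-good BE and never touch it
beadm list                                 # the clone should show up as another boot entry to fall back on
-8-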
Re: [zfs-discuss] ZFS RaidZ recommendation
dm == David Magda dma...@ee.ryerson.ca writes: bf == Bob Friesenhahn bfrie...@simple.dallas.tx.us writes: dm OP may also want to look into the multi-platform pkgsrc for dm third-party open source software: +1. jucr.opensolaris.org seems to be based on RPM which is totally fail. RPM is the oldest, crappiest, most frustrating thing! packages are always frustrating but pkgsrc is designed to isolate itself from the idiosyncrasies of each host platform, through factoring. Its major weakness is upgrades, but with Solaris you can use zones and snapshots to make this a lot less painful: * run their ``bulk build'' inside a zone. The ``bulk build'' feature is like the jucr: it downloads stuff from all over the internet and builds it, generates a tree of static web pages to report its results, plus a repository of binary packages. Like jucr it does not build packages on an ordinary machine, but in a well-specified minimal environment which has installed only the packages named as build dependencies---between each package build the bulk scripts remove all not-needed packages. Thus you really need a separate machine, like a zone, for bulk building. There is a non-bulk way to build pkgsrc, but it's not as good. Except that unlike the jucr, the implementation of the bulk build is included in the pkgsrc distribution and supported and ordinary people who run pkgsrc are expected to use it themselves. * clone a zone, upgrade the packages inside it using the binary packages produced by the bulk build, and cut services over to the clone only after everything's working right. Both of these things are a bit painful with pkgsrc on normal systems and much easier with zones and ZFS. The type of upgrade that's guaranteed to work on pkgsrc is: * to take a snapshot of /usr/pkgsrc which *is* pkgsrc, all packages' build instructions, and no binaries under this tree * ``bulk build'' * replace all your current running packages with the new binary packages in the repository the bulk build made. In practice people usually rebuild less than that to upgrade a package, and it often works anyway, but if it doesn't work then you're left wondering ``is pkgsrc just broken again, or will a more thorough upgrade actually work?'' The coolest immediate trick is that you can run more than one bulk build with different starting options, ex SunPro vs gcc, 32 vs 64-bit. The first step of using pkgsrc is to ``bootstrap'' it, and during bootstrap you choose the C compiler and also whether to use host's or pkgsrc's versions of things like perl and pax and awk. You also choose prefixes for /usr /var and /etc and /var/db/pkg that will isolate all pkgsrc files from the rest of the system. In general this level of pathname flexibility is only achievable at build time, so only a source-based package system can pull off this trick. The corollary is that you can install more than one pkgsrc on a single system and choose between them with PATH. pkgsrc is generally designed to embed full pathnames of its shared libs, so this has got a good shot of working. You could have /usr/pkg64 and /usr/pkg32, or /usr/pkg-gcc and /usr/pkg-spro. pkgsrc will also build pkg_add, pkg_info, u.s.w. under /usr/pkg-gcc/bin which will point to /var/db/pkg-gcc or whatever to track what's installed, so you can have more than one pkg_add on a single system pointing to different sets of directories. 
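the multiple-prefix trick looks roughly like this (an untested sketch---the prefixes are whatever you like, and I don't remember the exact bootstrap flags, so check bootstrap --help and the mk.conf docs for the compiler setting):
-8-
cd /usr/pkgsrc/bootstrap
# one independent tree per compiler/ABI; each gets its own prefix and pkg database
./bootstrap --prefix /usr/pkg-gcc  --pkgdbdir /var/db/pkg-gcc
./bootstrap --prefix /usr/pkg-spro --pkgdbdir /var/db/pkg-spro
# pick one with PATH; each prefix carries its own pkg_add/pkg_info
PATH=/usr/pkg-gcc/bin:/usr/pkg-gcc/sbin:$PATH; export PATH
-8-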
You could also do weirder things like use different paths every time you do a bulk build, like /usr/pkg-20100130 and /usr/pkg-20100408, although it's very strange to do that so far. It would also be possible to use ugly post-Unix directory layouts, ex /pkg/marker/usr/bin and /pkg/marker/etc and /pkg/marker/var/db/pkg, and then make /pkg/marker into a ZFS that could be snapshotted and rolled back. It is odd in pkgsrc world to put /var/db/pkg tracking-database of what's installed into the same subtree as the installed stuff itself, but in the context of ZFS it makes sense to do that. However the pathnames will be fixed for a given set of binary packages, so whatever you do with the ZFS the results of bulk builds sharing a common ``bootstrap'' phase would have to stay mounted on the same directory. You cannot clone something to a new directory then add/remove packages. There was an attempt called ``pkgviews'' to do something like this, but I think it's ultimately doomed because the idea's not compartmentalized enough to work with every package. In general pkgsrc gives you a toolkit for dealing with suboptimal package trees where a lot of shit is broken. It's well-adapted to the ugly modern way we run Unixes, sealed, with only web facing the users, because you can dedicate an entire bulk build to one user-facing app. If you have an app that needs a one-line change to openldap, pkgsrc makes it easy to perform this 1-line change and rebuild 100 interdependent packages linked to your mutant library,
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
rs == Ragnar Sundblad ra...@csc.kth.se writes: rs use IPSEC to make IP address spoofing harder. IPsec with channel binding is win, but not until SA's are offloaded to the NIC and all NIC's can do IPsec AES at line rate. Until this happens you need to accept there will be some protocols used on SAN that are not on ``the Internet'' and for which your axiomatic security declarations don't apply, where the relevant features are things like doing the DNS lookup in the proper .rhosts manner and doing uRPF, minimum, and more optimistically stop adding new protocols without IPv6 support, and start adding support for multiple IP stacks / VRF's. If saying ``the only way to do any given thing is twicecrypted kerberized ipsec within dnssec namespaces'' is blocking doing these immediate plaintext things that allow a host to participate in both the internet and a SAN at once, well that's no good either. pgptkJNIK5h42.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
jr == Jeroen Roodhart j.r.roodh...@uva.nl writes: jr Running OSOL nv130. Power off the machine, removed the F20 and jr power back on. Machines boots OK and comes up normally with jr the following message in 'zpool status': yeah, but try it again and this time put rpool on the F20 as well and try to import the pool from a LiveCD: if you lose zpool.cache at this stage, your pool is toast. /end repeat mode pgpt1GZtrVxS6.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
enh == Edward Ned Harvey solar...@nedharvey.com writes: enh If you have zpool less than version 19 (when ability to remove enh log device was introduced) and you have a non-mirrored log enh device that failed, you had better treat the situation as an enh emergency. Ed the log device removal support is only good for adding a slog to try it out, then changing your mind and removing the slog (which was not possible before). It doesn't change the reliability situation one bit: pools with dead slogs are not importable. There've been threads on this for a while. It's well-discussed because it's an example of IMHO broken process of ``obviously a critical requirement but not technically part of the original RFE which is already late,'' as well as a dangerous pitfall for ZFS admins. I imagine the process works well in other cases to keep stuff granular enough that it can be prioritized effectively, but in this case it's made the slog feature significantly incomplete for a couple years and put many production systems in a precarious spot, and the whole mess was predicted before the slog feature was integrated. The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state enh You're speaking the opposite of common sense. Yeah, I'm doing it on purpose to suggest that just guessing how you feel things ought to work based on vague notions of economy isn't a good idea. enh If disabling the ZIL makes the system faster *and* less prone enh to data corruption, please explain why we don't all disable enh the ZIL? I said complying with fsync can make the system recover to a state not equal to one you might have hypothetically snapshotted in a moment leading up to the crash. Elsewhere I might've said disabling the ZIL does not make the system more prone to data corruption, *iff* you are not an NFS server. If you are, disabling the ZIL can lead to lost writes if an NFS server reboots and an NFS client does not, which can definitely cause app-level data corruption. Disabling the ZIL breaks the D requirement of ACID databases which might screw up apps that replicate, or keep databases on several separate servers in sync, and it might lead to lost mail on an MTA, but because unlike non-COW filesystems it costs nothing extra for ZFS to preserve write ordering even without fsync(), AIUI you will not get corrupted application-level data by disabling the ZIL. you just get missing data that the app has a right to expect should be there. The dire warnings written by kernel developers in the wikis of ``don't EVER disable the ZIL'' are totally ridiculous and inappropriate IMO. I think they probably just worked really hard to write the ZIL piece of ZFS, and don't want people telling their brilliant code to fuckoff just because it makes things a little slower. so we get all this ``enterprise'' snobbery and so on. ``crash consistent'' is a technical term not a common-sense term, and I may have used it incorrectly: http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html With a system that loses power on which fsync() had been in use, the files getting fsync()'ed will probably recover to more recent versions than the rest of the files, which means the recovered state achieved by yanking the cord couldn't have been emulated by cloning a snapshot and not actually having lost power. However, the app calling fsync() will expect this, so it's not supposed to lead to application-level inconsistency. 
If you test your app's recovery ability in just that way, by cloning snapshots of filesystems on which the app is actively writing and then seeing if the app can recover the clone, then you're unfortunately not testing the app quite hard enough if fsync() is involved, so yeah I guess disabling the ZIL might in theory make incorrectly-written apps less prone to data corruption. Likewise, no testing of the app on a ZFS will be aggressive enough to make the app powerfail-proof on a non-COW POSIX system because ZFS keeps more ordering than the API actually guarantees to the app. I'm repeating myself though. I wish you'd just read my posts with at least paragraph granularity instead of just picking out individual sentences and discarding everything that seems too complicated or too awkwardly stated. I'm basing this all on the ``common sense'' that to do otherwise, fsync() would have to completely ignore its filedescriptor argument. It'd have to copy the entire in-memory ZIL to the slog and behave the same as 'lockfs -fa', which I think would perform too badly compared to non-ZFS filesystems' fsync()s, and would lead to emphatic performance advice like ``segregate files that get lots of fsync()s into separate ZFS datasets from files that get high write bandwidth,'' and we don't have advice like that in the blogs/lists/wikis which makes me think it's not beneficial (the benefit would be
Re: [zfs-discuss] dedup and memory/l2arc requirements
re == Richard Elling richard.ell...@gmail.com writes: re # ptime zdb -S zwimming Simulated DDT histogram: re refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE re Total 2.63M 277G 218G 225G 3.22M 337G 263G 270G re in-core size = 2.63M * 250 = 657.5 MB Thanks, that is really useful! It'll probably make the difference between trying dedup and not, for me. It is not working for me yet. It got to this point in prstat: 6754 root 2554M 1439M sleep 600 0:03:31 1.9% zdb/106 and then ran out of memory: $ pfexec ptime zdb -S tub out of memory -- generating core dump I might add some swap I guess. I will have to try it on another machine with more RAM and less pool, and see how the size of the zdb image compares to the calculated size of DDT needed. So long as zdb is the same or a little smaller than the DDT it predicts, the tool's still useful, just sometimes it will report ``DDT too big but not sure by how much'', by coredumping/thrashing instead of finishing. pgprpk9HSdr61.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
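fwiw, if anyone else hits the same wall, swap on a zvol is the quick fix (a sketch---size and names made up):
-8-
zfs create -V 8G rpool/swap2
swap -a /dev/zvol/dsk/rpool/swap2
swap -l    # confirm the new device is listed before rerunning zdb -S
-8-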
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
enh == Edward Ned Harvey solar...@nedharvey.com writes: enh Dude, don't be so arrogant. Acting like you know what I'm enh talking about better than I do. Face it that you have enh something to learn here. funny! AIUI you are wrong and Casper is right. ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do. The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (a state not equal to a snapshot you might have hypothetically taken in the seconds leading up to the crash), because files that were recently fsync()'d may be of newer versions than files that weren't---that is, fsync() durably commits only the file it references, by copying that *part* of the in-RAM ZIL to the durable slog. fsync() is not equivalent to 'lockfs -fa' committing every file on the system (is it?). I guess I could be wrong about that. If I'm right, this isn't a bad thing because apps that call fsync() are supposed to expect the inconsistency, but it's still important for understanding what's going on. pgpUNxWo30EYO.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool split problem?
la == Lori Alt lori@oracle.com writes: la I'm only pointing out that eliminating the zpool.cache file la would not enable root pools to be split. More work is la required for that. makes sense. All the same, please do not retaliate against the bug-opener by adding a lazy-assertion to prevent rpools from being split: this type of brittleness, ex. around all the many disk-labeling programs, is a large part of what makes Solaris systems feel flakey and unwelcoming to those who've used Linux, BSD, or Mac OS X. and AFAICT there is not much of it in the ZFS boot support so far---it's an uncluttered architecture that's quite friendly to creative abuse and impatient hacking. pgpy5Ksjv18Ne.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
rm == Robert Milkowski mi...@task.gda.pl writes: rm This is not true. If ZIL device would die *while pool is rm imported* then ZFS would start using z ZIL withing a pool and rm continue to operate. what you do not say is that a pool with dead zil cannot be 'import -f'd. So, for example, if your rpool and slog are on the same SSD, and it dies, you have just lost your whole pool. pgp9E0wFxqcc4.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
rm == Robert Milkowski mi...@task.gda.pl writes: rm the reason you get better performance out of the box on Linux rm as NFS server is that it actually behaves like with disabled rm ZIL careful. Solaris people have been slinging mud at linux for things unfsd did in spite of the fact knfsd has been around for a decade. and ``has options to behave like the ZIL is disabled (sync/async in /etc/exports)'' != ``always behaves like the ZIL is disabled''. If you are certain about Linux NFS servers not preserving data for hard mounts when the server reboots even with the 'sync' option which is the default, please confirm, but otherwise I do not believe you. rm Which is an expected behavior when you break NFS requirements rm as Linux does out of the box. wrong. The default is 'sync' in /etc/exports. The default has changed, but the default is 'sync', and the whole thing is well-documented. rm What would be useful though is to be able to easily disable rm ZIL per dataset instead of OS wide switch. yeah, Linux NFS servers have that granularity for their equivalent option. pgpg1qLhwVTDs.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
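for reference, the Linux per-export granularity looks like this in /etc/exports (paths and subnet made up); the nearest Solaris equivalent right now is the system-wide zil_disable tunable, which is exactly the blunt OS-wide switch being complained about (the mdb line is from memory, so double-check it before trusting it):
-8-
# Linux /etc/exports: 'sync' is the default and honours NFS COMMIT;
# 'async' on a single export is the per-dataset "ZIL disabled" equivalent
/export/home     192.168.1.0/24(rw,sync,no_subtree_check)
/export/scratch  192.168.1.0/24(rw,async,no_subtree_check)

# Solaris, whole-OS switch, takes effect for datasets mounted afterwards:
echo zil_disable/W0t1 | mdb -kw
-8-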
Re: [zfs-discuss] zpool cannot replace a replacing device
cm == Courtney Malone court...@courtneymalone.com writes: j == Jim biainmcna...@hotmail.com writes: j Thanks for the suggestion, but have tried detaching but it j refuses reporting no valid replicas. yeah this happened to someone else also, see list archives around 2008-12-03: cm I have a 10 drive raidz, recently one of the disks appeared to cm be generating errors (this later turned out to be a cable), cm # zpool replace data 17096229131581286394 c0t2d0 cm cannot replace 17096229131581286394 with c0t2d0: cannot cm replace a replacing device cm if i try to detach it i get: cm # zpool detach data 17096229131581286394 cm cannot detach 17096229131581286394: no valid replicas pgpKVbb2twZdu.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Dedup Performance
srbi == Steve Radich, BitShop, Inc ste...@bitshop.com writes: srbi http://www.bitshop.com/Blogs/tabid/95/EntryId/78/Bug-in-OpenSolaris-SMB-Server-causes-slow-disk-i-o-always.aspx I'm having trouble understanding many things in here like ``our file move'' (moving what from where to where with what protocol?) and ``with SMB running'' (with the server enabled on Solaris, with filesystems mounted, with activity on the mountpoints? what does running mean?) and ``RAID-0/stripe reads is the slow point'' (what does this mean? How did you determine which part of the stack is limiting the observed speed? This is normally quite difficult and requires comparing several experiments, not doing just one experiment like ``a file move between zfs pools''.). What is ``bytes the negotiated protocol allows''? mtu, mss, window size? Can you show us in what tool you see one number and where you see the other number that's too big? pgpAMuI2YHJGk.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is this pool recoverable?
sn == Sriram Narayanan sri...@belenix.org writes: sn http://docs.sun.com/app/docs/doc/817-2271/ghbxs?a=view yeah, but he has no slog, and he says 'zpool clear' makes the system panic and reboot, so even from way over here that link looks useless. Patrick, maybe try a newer livecd from genunix.org like b130 or later and see if the panic is fixed so that you can import/clear/export the pool. The new livecd's also have 'zpool import -F' for Fix Harder (see manpage first). Let us know what happens. pgpT7dIOFPNUD.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Error in zfs list output?
bh == Brandon High bh...@freaks.com writes: bh I think I'm seeing an error in the output from zfs list with bh regards to snapshot space utilization. no bug. You just need to think harder about it: the space used cannot be neatly put into buckets next to each snapshot that add up to the total, just because of...math. To help understand, suppose you decide, just to fuck things up, that from now on every time you take a snapshot you take two snapshots, with exactly zero filesystem writing happening between the two. What do you want 'zfs list' to say now? What does happen if you do that is it says all snapshots use zero space. the space shown in zfs list is the amount you'd get back if you deleted this one snapshot. Yes, every time you delete a snapshot, all the numbers reshuffle. Yes, there is a whole cat's cradle of space accounting information hidden in there that does not come out through 'zfs list'. pgpzRUSk68FzY.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
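you can watch the math do this to you (a sketch---dataset name made up):
-8-
zfs snapshot tank/fs@a
zfs snapshot tank/fs@b        # nothing written in between
zfs list -t snapshot -o name,used,refer -r tank/fs
# @a and @b both show USED = 0: every block either one references is
# also referenced by the other, so destroying just one frees nothing
-8-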
Re: [zfs-discuss] ZFS/OSOL/Firewire...
k == Khyron khyron4...@gmail.com writes: k FireWire is an Apple technology, so they have a vested k interest in making sure it works well [...] They could even k have a specific chipset that they exclusively use in their k systems, yes, you keep repeating yourselves, but there are only a few firewire host chips, like ohci and lynx, and apple uses the same ones as everyone else, no magic. Why would you speak such a complicated fantasy out loud without any reason to believe it other than your imaginations? I also tried to use firewire on Solaris long ago and had a lot of problems with it, both with the driver stack in Solaris and with the embedded software inside a cheaper non-Oxford case (Prolific). I think y'all forum users should stick to SAS/SATA for external disks and avoid firewire and USB both. Realize, though, that it is not just the chip driver but the entire software stack that influences speed and reliability. Even above what you normally consider the firewire stack, above all the mid-layer and scsi emulation stuff, Mac OS X for example is rigorous about handling force-unmounting, both with umount -f and disks that go away without warning. FreeBSD OTOH has major problems with force-unmounting, panicking and waiting forever. Solaris has problems too with freezing zpool maintenance commands, access to pools unrelated to the one with the device that went away, and NFS serving anything while any zpool is frozen. This is a problem even if you don't make a habit of yanking disks because it can make diagnosing problems really difficult: what if your case, like my non-Oxford one, has a firmware bug that makes it freeze up sometimes? or a flakey power supply or loose cable? If the OS does not stay up long enough to report the case detached, and stay sane enough for you to figure out what makes it retach (waiting a while, rebooting the case, jiggling the power connector, jiggling the data connector) then you will probably never figure out what's wrong with it, as I didn't for months, while if I'd had the same broken case on a Mac I'd have realized almost immediately that it sometimes detaches itself for no reason and retaches when I cycle its power switch but not when I plug/unplug its data cable and not when I reboot the Mac, so I'd know the case had buggy firmware, while with Solaris I just get these craazy panic messages. Once your exception handling reaches a certain level of crappiness, you cannot touch anything without everything collapsing. And on Solaris all this freezing/panicking behavior depends a lot on which disk driver you're using while Mac OS X it's, meh, basically working the same for SATA, USB, Firewire, or NFS client, and also you can mount images with hdiutil over NFS without getting weird checksum errors or deadlocks like you do with file or lofiadm-backed ZFS. 
(globalsan iscsi is still a mess though, worse than all other mac disk drivers and worse than the solaris initiator) I do not like the Mac OS much because it's slow, because the hardware's overpriced and fragile, because the only people running it inside VM's are using piratebay copies, and because I distrust Apple and strongly disapprove of their master plan both in intent and practice like the way they crippled dtrace, the displayport bullshit, and their terrible developer relations like nontransparent last-minute API yanking and ``agreements'' where you even have to agree not to discuss the agreement, and in general of their honing a talent for manipulating people into exploitable corners by slowly convincing them it's okay to feel lazy and entitled. But yes they've got some things relevant to server-side storage working better than Solaris does like handling flakey disks sanely, and providing source for the stable supported version of their OS not just the development version. pgpzf9yUTzCYk.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
djm == Darren J Moffat darren.mof...@oracle.com writes: djm I've logged CR# 6936195 ZFS send stream while checksumed djm isn't fault tollerant to keep track of that. Other tar/cpio-like tools are also able to: * verify the checksums without extracting (like scrub) * verify or even extract the stream using a small userland tool that writes files using POSIX functions, so that you can build the tool on not-Solaris or extract the data onto not-ZFS. The 'zfs send' stream can't be extracted without the solaris kernel, although yes the promise that newer kernels can extract older streams is a very helpful one. For example, ufsdump | ufsrestore could move UFS data into ZFS. but zfs send | zfs recv leaves us trapped on ZFS, even though migrating/restoring ZFS data onto a pNFS or Lustre backend is a realistic desire in the near term. * partial extract Personally, I could give up the third bullet point. Admittedly the second bullet is hard to manage while still backing up zvol's, pNFS / Lustre data-node datasets, windows ACL's, properties, snapshots/clones, u.s.w., so it's kind of...if you want both vanilla and chocolate cake at once, you're both going to be unhappy. But there should at least be *a* tool that can copy from zfs to NFSv4 while preserving windows ACL's, and the tool should build on other OS's that support NFSv4 and be capable of faithfully copying one NFSv4 tree to another preserving all the magical metadata. I know it sounds like ACL-aware rsync is unrelated to your (Darren) goal of tweaking 'zfs send' to be appropriate for backups, but for example before ZFS I could make a backup on the machine with disks attached to it or on an NFS client, and get exactly the same stream out. Likewise, I could restore into an NFS client. Sticking to a clean API instead of dumping the guts of the filesystem made the old stream formats more archival. The ``I need to extract a ZFS dataset so large that my only available container is a distributed Lustre filesystem'' use-case is pretty squarely within the archival realm, is going to be urgent in a year or so if it isn't already, and is accommodated by GNUtar, cpio, Amanda (even old ufsrestore Amanda), and all the big commercial backup tools. I admit it would be pretty damn cool if someone could write a purely userland version of 'zfs send' and 'zfs recv' that interact with the outside world using only POSIX file i/o and unix pipes but produce the standard deduped-ZFS-stream format, even if the hypothetical userland tool accomplishes this by including a FUSE-like amount of ZFS code and thus being quite hard to build. However, so far I don't think the goals of a replication tool: ``make a faithful and complete copy, efficiently, or else give an error,'' are compatible with the goals of an archival tool: ``extract robustly far into the future even in non-ideal and hard to predict circumstances such as different host kernel, different destination filesystem, corrupted stream, limited restore space.'' pgpyWHuwbuWZf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
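to make the first bullet concrete: with tar you can force a full read of an archive with nowhere to extract it to; the nearest zfs-side thing I know of is a dry-run receive, and AFAIK that only parses the stream rather than checksumming every record, and it still wants a pool to point at (a sketch---filenames and dataset names made up):
-8-
# tar/cpio style: verify the archive end to end, extract nothing
gtar -tvf /backup/home-20100401.tar > /dev/null

# zfs send style: the best I know of without a pool big enough to recv into
zfs send -R tank/home@20100401 > /backup/home-20100401.zsend
zfs recv -n -v scratch/restore < /backup/home-20100401.zsend
-8-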
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
c == Miles Nordin car...@ivy.net writes: mg == Mike Gerdts mger...@gmail.com writes: c are compatible with the goals of an archival tool: sorry, obviously I meant ``not compatible''. mg Richard Elling made an interesting observation that suggests mg that storing a zfs send data stream on tape is a quite mg reasonable thing to do. Richard's background makes me trust mg his analysis of this much more than I trust the typical person mg that says that zfs send output is poison. ssh and tape are perfect, yet whenever ZFS pools become corrupt Richard talks about scars on his knees from weak TCP checksums and lying disk drives and about creating a ``single protection domain'' of zfs checksums and redundancy instead of a bucket-brigade of fail of tcp into ssh into $blackbox_backup_Solution (likely involving unchecksummed disk storage) into SCSI/FC into ECC tapes. At worst, lying then or lying now? At best, the whole thing still strikes me as a pattern of banging a bunch of arcana into whatever shape's needed to fit the conclusion that ZFS is glorious and no further work is required to make it perfect. and there is still no way to validate a tape without extracting it, which is, last I worked with them, an optional but suggested part of $blackbox_backup_Solution (and one which, incidentally, helps with the bucket brigade problem Richard likes to point out). and the other archival problems of constraining the restore environment, and the fundamental incompatibility of goals between faithful replication and robust, future-proof archiving from my last post. pgpLLsyZQuSKJ.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
k == Khyron khyron4...@gmail.com writes: k Star is probably perfect once it gets ZFS (e.g. NFS v4) ACL nope, because snapshots are lost and clones are expanded wrt their parents, and the original tree of snapshots/clones can never be restored. we are repeating, though. This is all in the archives. pgpTLTb9Ads3W.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
la == Lori Alt lori@sun.com writes: la This is no longer the case. The send stream format is now la versioned in such a way that future versions of Solaris will la be able to read send streams generated by earlier versions of la Solaris. Your memory of the thread is selective. This is only one of the several problems with it. If you are not concerned with bitflip gremlins on tape, then all the baloney about checksums and copies=2 metadata and insisting on zpool-level redundancy is just a bunch of opportunistic FUD. la The comment in the zfs(1M) manpage discouraging the la use of send streams for later restoration has been removed. The man page never warned of all the problems, nor did the si wiki. pgpCjAGUvOlWe.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
ea == erik ableson eable...@me.com writes: dc == Dennis Clarke dcla...@blastwave.org writes: rw,ro...@100.198.100.0/24, it works fine, and the NFS client can do the write without error. ea I' ve found that the NFS host based settings required the ea FQDN, and that the reverse lookup must be available in your ea DNS. I found, oddly, the @a.b.c.d/y syntax works only if the client's IP has reverse lookup. I had to add bogus hostnames to /etc/hosts for the whole /24 because if I didn't, for v3 it would reject mounts immediately, and for v4 mountd would core dump (and get restarted) which you see from the client as a mount that appears to hang. This is all using the @ip/mask syntax. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6901832 If you use hostnames instead, it makes sense that you would have to use FQDN's. If you want to rewrite mountd to allow using short hostnames, the access checking has to be done like this:
at export time: given hostname -> forward nss lookup -> list of IP's -> remember IP's
at mount time: client IP -> check against list of remembered IP's
but with fqdn's it can be:
at export time: given hostname -> remember it
at mount time: client IP -> reverse nss lookup -> check against remembered list
                         \-> forward lookup -> verify client IP among results
The second way, all the lookups happen at mount time rather than export time. This way the data in the nameservice can change without forcing you to learn and then invoke some kind of ``rescan the exported filesystems'' command or making mountd remember TTL's for its cached nss data, or any such complexity. Keep all the nameservice caching inside nscd so there is only one place to flush it! However the forward lookup is mandatory for security, not optional OCDism. Without it, anyone from any IP can access your NFS server so long as he has control of his reverse lookup, which he probably does. I hope mountd is doing that forward lookup! dc Try to use a backslash to escape those special chars like so : dc zfs set dc sharenfs=nosub\,nosuid\,rw\=hostname1\:hostname2\,root\=hostname2 dc zpoolname/zfsname/pathname wth? Commas and colons are not special characters. This is silly. pgptWVuUb6wBm.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
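the /etc/hosts workaround, roughly (subnet and names are made up, and perl only because older Solaris has no seq):
-8-
# give every address in the client /24 a bogus name so the reverse lookup succeeds
perl -e 'printf "100.198.100.%d nfsclient%d\n", $_, $_ for 1..254' >> /etc/hosts
# make sure 'files' is consulted on the hosts: line of /etc/nsswitch.conf, then
svcadm restart svc:/network/nfs/server    # so mountd picks it up
-8-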
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
dc == Dennis Clarke dcla...@blastwave.org writes: dc zfs set dc sharenfs=nosub\,nosuid\,rw\=hostname1\:hostname2\,root\=hostname2 dc zpoolname/zfsname/pathname wth? Commas and colons are not special characters. This is silly. dc Works real well. I said it was silly, not broken. It's cargo-cult. Try this: \z\f\s \s\e\t \s\h\a\r\e\n\f\s\=\n\o\s\u\b\,\n\o\s\u\i\d\,\r\w\=\h\o\s\t\n\a\m\e\1\:\h\o\s\t\n\a\m\e\2\,\r\o\o\t\=\h\o\s\t\n\a\m\e\2 \z\p\o\o\l\n\a\m\e\/\z\f\s\n\a\m\e\/\p\a\t\h\n\a\m\e works real well, too. pgp9sZc4ojaDX.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup zpool to tape
gd == Gregory Durham gregory.dur...@gmail.com writes: gd it to mount on boot I do not understand why you have a different at-boot-mounting problem with and without lofiadm: either way it's your script doing the importing explicitly, right? so just add lofiadm to your script. I guess you were exporting pools explicitly at shutdown because you didn't trust solaris to unmount the two levels of zfs in the right order? Anyway I would guess it doesn't matter because my ``back up file zpools to tape'' suggestion seems to be bogus bad advice. The other bug referenced in the one you quoted, 6915127, seems a lot more disruptive and says there are weird corruption problems with using file vdev's directly, and then there are deadlock problems with lofiadm from the two layers of zfs that haven't been ironed out yet. I guess file-based zpools do not work, and we're back to having no good plan that I can see to back up zpools to tape that preserves dedup, snapshots/clones, NFSv4 acl's, u.s.w. I assumed they did work because it looked like regression tests people were quoting and many examples depended upon them, but now it seems they don't, which explains some problems I had last month extracting an s10brand image from a .VDI. :( (iirc i got the image out using lofiadm and just assumed I was confused, banging away at things until they work and then forgetting about them. not good on me.) There is only zfs send which is made with replication in mind ( * it'll intentionally destroy the entire stream and any incremental descendents if there's a single bit-flip, which is a good feature to make sure the replication is retried if the copy's not faithful but a bad feature for tape. If ZFS rails against other filesystems for their fragile lack of metadata copies and checksums, why should the tape format be so oddly fragile that tape archives become massive gamma gremlin detectors? * and it has no scrub-like method analogous to 'tar t' or 'cpio -it' because it's assumed you'll always recv it in a situation where you've the opportunity to re-send, while a tape is something you might like to validate after transporting it or every few years. If pools need scrubbing why don't tapes? * and no partial-restore feature because it assumes if you don't have enough space on the destination for the entire dataset you'll use rsync or cpio or some other tree-granularity tool instead of the replication toolkit. a tool which does not fully exist (sparse files, 4GB files, NFSv4 ACL's), but that's a separate problem. ). how about zpools on zvol's. Does that avoid the deadlock/corruption bugs with file vdevs? It's not a workaround for the cases in the bug because they wanted to use NFS to replace iSCSI, but for backups, zvols might be okay, if they work? It's certainly possible to write them onto a tape (dd was originally meant for such things). pgpaynQ63iMAj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
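spelling out the zvol idea in case anyone wants to test it (a sketch---names, sizes, and the tape device are made up, and I have not verified the lofi/file-vdev deadlocks don't bite here too):
-8-
zfs create -V 200G tank/backvol
zpool create backpool /dev/zvol/dsk/tank/backvol
zfs send -R main/home@weekly | zfs recv -d backpool    # snapshots/clones/ACL's land inside backpool
zpool export backpool                                  # quiesce it before copying
dd if=/dev/zvol/rdsk/tank/backvol of=/dev/rmt/0n bs=1024k    # raw zvol straight to tape
-8-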
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
al == Adam Leventhal a...@eng.sun.com writes: al As always, we welcome feedback (although zfs-discuss is not al the appropriate forum), ``Please, you criticize our work in private while we compliment it in public.'' pgpyrrUQeYImd.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
gm == Gary Mills mi...@cc.umanitoba.ca writes: gm destroys the oldest snapshots and creates new ones, both gm recursively. I'd be curious: if you try taking the same snapshots non-recursively instead, does the pause go away? Because recursive snapshots are special: they're supposed to atomically synchronize the cut-point across all the filesystems involved, AIUI. I don't see that recursive destroys should be anything special though. gm Is it destroying old snapshots or creating new ones that gm causes this dead time? sort of seems like you should tell us this, not the other way around. :) Seriously though, isn't that easy to test? And I'm curious myself too. pgpnlnCUlJtvb.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
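i.e., something like this, so the only variable left is the atomic-recursive part (pool name made up):
-8-
# recursive: one synchronized cut across every descendent filesystem
zfs snapshot -r space@test-r

# non-recursive: the same set of snapshots, taken one filesystem at a time
for fs in `zfs list -H -o name -r space`; do
    zfs snapshot $fs@test-n
done
-8-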