Re: [zfs-discuss] ZFS Dedup question

2011-01-28 Thread Nicolas Williams
On Fri, Jan 28, 2011 at 01:38:11PM -0800, Igor P wrote:
 I created a zfs pool with dedup with the following settings:
 zpool create data c8t1d0
 zfs create data/shared
 zfs set dedup=on data/shared
 
 The thing I was wondering about is that it seems like ZFS only dedups
 at the file level and not the block level. When I make multiple copies
 of a file to the store I see an increase in the dedup ratio, but when I
 copy similar files the ratio stays at 1.00x.

Dedup is done at the block level, not the file level.  That two files
are similar does not mean they actually share any common blocks.  You'll
have to look more closely to determine whether they do.
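
A quick way to convince yourself of this (pool, device, and dataset
names here are purely illustrative; try it on a scratch pool, not
production):

  $ zpool create scratch c8t2d0
  $ zfs create -o dedup=on scratch/shared
  $ dd if=/dev/urandom of=/scratch/shared/a bs=128k count=1000
  $ cp /scratch/shared/a /scratch/shared/b     # byte-identical: every block dedups
  $ sync; zpool get dedupratio scratch         # roughly 2.00x now
  $ (printf X; cat /scratch/shared/a) > /scratch/shared/c
  $ sync; zpool get dedupratio scratch         # falls back toward 1.50x

The third file is "similar" -- the same data shifted by one byte -- but
with the default 128K recordsize every block boundary moves, so none of
its blocks match the originals and it contributes nothing new to dedup.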

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-18 Thread Nicolas Williams
On Tue, Jan 18, 2011 at 07:16:04AM -0800, Orvar Korvar wrote:
 BTW, I thought about this. What do you say?
 
 Assume I want to compress data and I succeed in doing so. And then I
 transfer the compressed data. So all the information I transferred is
 the compressed data. But, then you don't count all the information:
 knowledge about which algorithm was used, which number system, laws of
 math, etc. So there is a lot of other information that is implicit
 when you compress/decompress -- not just the data.
 
 So, if you add data and all implicit information you get a certain bit
 size X. Do this again on the same set of data, with another algorithm
 and you get another bit size Y. 
 
 You compress the data, using lots of implicit information. If you use
 less implicit information (simple algorithm relying on simple math),
 will X be smaller than if you use lots of implicit information
 (advanced algorithm relying on a large body of advanced math)? What
 can you say about the numbers X and Y? Advanced math requires many
 math books that you need to transfer as well.

Just as the laws of thermodynamics preclude perpetual motion machines,
so do they preclude infinite, loss-less data compression.  Yes,
thermodynamics and information theory are linked, amazingly enough.

Data compression algorithms work by identifying certain types of
patterns, then replacing the input with notes such as "pattern 1 is ...
and appears at offsets 12345 and 1234567" (I'm simplifying a lot).  Data
that has few or no observable patterns (observable by the compression
algorithm in question) will not compress, but will expand if you insist
on compressing -- randomly-generated data (e.g., the output of
/dev/urandom) will not compress at all and will expand if you insist.
Even just one bit needed to indicate whether a file is compressed or not
will mean expansion when you fail to compress and store the original
instead of the compressed version.  Data compression reduces
repetition, thus making it harder to further compress compressed data.

Try it yourself.  Try building a pipeline of all the compression tools
you have, see how many rounds of compression you can apply to typical
data before further compression fails.
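
For example, something along these lines (any compressors you have
installed will do, and substitute any sizeable text file for
/usr/dict/words -- the path varies by OS):

  $ dd if=/dev/urandom of=/tmp/rand bs=1024k count=10 2>/dev/null
  $ ls -l /tmp/rand | awk '{print $5}'           # 10485760 bytes in
  $ gzip -c /tmp/rand | wc -c                    # slightly *larger* out
  $ gzip -c /usr/dict/words | wc -c              # round 1: much smaller
  $ gzip -c /usr/dict/words | gzip -c | wc -c    # round 2: already bigger than round 1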

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-17 Thread Nicolas Williams
On Sat, Jan 15, 2011 at 10:19:23AM -0600, Bob Friesenhahn wrote:
 On Fri, 14 Jan 2011, Peter Taps wrote:
 
 Thank you for sharing the calculations. In lay terms, for Sha256,
 how many blocks of data would be needed to have one collision?
 
 Two.

Pretty funny.

In this thread some of you are treating SHA-256 as an idealized hash
function.  The odds of accidentally finding collisions in an idealized
256-bit hash function are minute because the distribution of hash
function outputs over inputs is random (or, rather, pseudo-random).

But cryptographic hash functions are generally only approximations of
idealized hash functions.  There's nothing to say that there aren't
pathological corner cases where a given hash function produces lots of
collisions that would be semantically meaningful to people -- i.e., a
set of inputs over which the outputs are not randomly distributed.  Now,
of course, we don't know of such pathological corner cases for SHA-256,
but not that long ago we didn't know of any for SHA-1 or MD5 either.

The question of whether disabling verification would improve performance
is pretty simple: if you have highly deduplicatious, _synchronous_ (or
nearly so, due to frequent fsync()s or NFS close operations) writes, and
the working set does not fit in the ARC or L2ARC, then yes, disabling
verification will help significantly, by removing an average of at least
half a disk rotation from the write latency.  The same goes for a
workload with asynchronous writes that might as well be synchronous due
to an undersized cache (relative to the workload).  Otherwise the cost
of verification should be hidden by caching.

Another way to put this would be that you should first determine that
verification is actually affecting performance, and only _then_ should
you consider disabling it.  But if you want to have the freedom to
disable verification, then you should be using SHA-256 (or switch to it
when disabling verification).

Safety features that cost nothing are not worth turning off,
so make sure their cost is significant before even thinking
of turning them off.

Similarly, the cost of SHA-256 vs. Fletcher should also be lost in the
noise if the system has enough CPU, but if the choice of hash function
could make the system CPU-bound instead of I/O-bound, then the choice of
hash function would make an impact on performance.  The choice of hash
functions will have a different performance impact than verification: a
slower hash function will affect non-deduplicatious workloads more than
highly deduplicatious workloads (since the latter will require more I/O
for verification, which will overwhelm the cost of the hash function).
Again, measure first.
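
Concretely (dataset name and intervals are illustrative, and the exact
set of accepted dedup property values varies a bit by build), the two
configurations being compared, plus a first-order way to see where the
time actually goes while the real workload runs:

  $ zfs set dedup=sha256 tank/data          # SHA-256, no verification
  $ zfs set dedup=sha256,verify tank/data   # SHA-256 plus read-back verification
  $ mpstat 5                                # CPU-bound?
  $ iostat -xn 5                            # or I/O-bound?
  $ zpool iostat -v tank 5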

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Nicolas Williams
On Fri, Jan 07, 2011 at 06:39:51AM -0800, Michael DeMan wrote:
 On Jan 7, 2011, at 6:13 AM, David Magda wrote:
  The other thing to note is that by default (with de-dupe disabled), ZFS
  uses Fletcher checksums to prevent data corruption. Add also the fact that
  all other file systems don't have any checksums, and simply rely on the fact
  that disks have a bit error rate of (at best) 10^-16.
 
 Agreed - but I think it is still missing the point of what the
 original poster was asking about.
 
 In all honesty I think the debate is a business decision - the highly
 improbable vs. certainty.

The OP seemed to be concerned that SHA-256 is particularly slow, so the
business decision here would involve a performance vs. error rate
trade-off.

Now, unless you have highly deduplicatious data, a workload with a high
cache hit ratio in the ARC for DDT entries, and a fast ZIL device, I
suspect that the I/O costs of dedup dominate the cost of the hash
function, which means: the above business trade-off is not worthwhile as
one would be trading a tiny uptick in error rates for a small uptick in
performance.  Before you even get to where you're making such a decision
you'll want to have invested in plenty of RAM, L2ARC and fast ZIL device
capacity -- and for those making such an investment I suspect that the
OP's trade-off won't seem worthwhile.

BTW, note that verification isn't guaranteed to have a zero error
rate...  Imagine that a) a block being written collides with a different
block already in the pool, b) bit rot on disk in that colliding block is
such that the on-disk block now matches the new block, and c) this is on
a mirrored vdev, such that you might get one or the other version of the
block in question, randomly.  Such an error requires monumentally bad
luck to happen at all.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-06 Thread Nicolas Williams
On Thu, Jan 06, 2011 at 11:44:31AM -0800, Peter Taps wrote:
 I have been told that the checksum value returned by Sha256 is almost
 guaranteed to be unique.

All hash functions are guaranteed to have collisions [for inputs larger
than their output, anyway].

  In fact, if Sha256 fails in some case, we
 have a bigger problem such as memory corruption, etc. Essentially,
 adding verification to sha256 is an overkill.

What makes a hash function cryptographically secure is not impossibility
of collisions, but computational difficulty of finding arbitrary
colliding input pairs, collisions for known inputs, second pre-images,
and first pre-images.  Just because you can't easily find collisions on
purpose doesn't mean that you can't accidentally find collisions.

That said, if the distribution of SHA-256 is even enough then your
chances of finding a collision by accident are so remote (one in 2^128)
that you could reasonably decide that you don't care.
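
To put a rough number on "so remote": with an idealized 256-bit hash and
n unique blocks, the usual birthday estimate for the probability of any
accidental collision in the pool is about n^2 / 2^257.  For example
(pool size purely illustrative):

  1 PB of unique 128KB blocks  ->  n ~= 2^33
  P(any collision)             ~=  (2^33)^2 / 2^257  =  2^-191

which is the sense in which, as you were told, other failure modes
(memory corruption, etc.) utterly dominate.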

 Perhaps (Sha256+NoVerification) would work 99.99% of the time. But
 (Fletcher+Verification) would work 100% of the time.

Fletcher is faster than SHA-256, so I think that must be what you're
asking about: can Fletcher+Verification be faster than
Sha256+NoVerification?  Or do you have some other goal?

Assuming I guessed correctly...  The speed of the hash function isn't
significant compared to the cost of the verification I/O, period, end of
story.  So, SHA-256 w/o verification will be faster than Fletcher +
Verification -- lots faster if you have particularly deduplicatious data
to write.  Moreover, SHA-256 + verification will likely be somewhat
faster than Fletcher + verification because SHA-256 will likely have
fewer collisions than Fletcher, and the cost of I/O dominates the cost
of the hash functions.

 Which one of the two is a better deduplication strategy?
 
 If we do not use verification with Sha256, what is the worst case
 scenario? Is it just more disk space occupied (because of failure to
 detect duplicate blocks) or there is a chance of actual data
 corruption (because two blocks were assumed to be duplicate although
 they are not)?

If you don't verify then you run the risk of corruption on collision,
NOT the risk of using too much disk space.

 Or, if I go with (Sha256+Verification), how much is the overhead of
 verification on the overall process?
 
 If I do go with verification, it seems (Fletcher+Verification) is more
 efficient than (Sha256+Verification). And both are 100% accurate in
 detecting duplicate blocks.

You're confused.  Fletcher may be faster to compute than SHA-256, but
the run-time of both is nothing compared to the latency of the disk I/O
needed for verification, which means that the hash function's rate of
collisions is more important than its computational cost.

(Now, Fletcher is thought to not be a cryptographically secure hash
function, while SHA-256 is, for now, considered cryptographically
secure.  That probably means that the distribution of Fletcher's outputs
over random inputs is not as even as that of SHA-256, which probably
means you can expect more collisions with Fletcher than with SHA-256.
Note that I made no absolute statements in the previous sentence --
that's because I've not read any studies of Fletcher's performance
relative to SHA-256, thus I'm not certain of anything stated in the
previous sentence.)

David Magda's advice is spot on.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-06 Thread Nicolas Williams
On Thu, Jan 06, 2011 at 06:07:47PM -0500, David Magda wrote:
 On Jan 6, 2011, at 15:57, Nicolas Williams wrote:
 
  Fletcher is faster than SHA-256, so I think that must be what you're
  asking about: can Fletcher+Verification be faster than
  Sha256+NoVerification?  Or do you have some other goal?
 
 Would running on recent T-series servers, which have on-die
 crypto units, help any in this regard?

Yes, particularly for larger blocks.

Hash collisions don't matter as long as ZFS verifies dups, so the real
question is: what is the false positive dup rate (i.e., the accidental
collision rate)?  But that's going to vary a lot by {hash function,
working data set}, thus it's not possible to make exact determinations,
just estimates.

For me the biggest issue is that as good as Fletcher is for a CRC, I'd
rather have a cryptographic hash function because I've seen incredibly
odd CRC failures before.  There's a famous case from within SWAN a few
years ago where a switch flipped pairs of bits such that all too often
the various CRCs that applied to the moving packets failed to detect the
bit flips, and we discovered this when an SCCS file in a clone of the ON
gate got corrupted.  Such failures (collisions) wouldn't affect dedup,
but they would mask corruption of non-deduped blocks.
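
(If that class of failure worries you too and you can spare the CPU, the
relevant knob is independent of dedup -- e.g., on an illustrative
dataset:

  $ zfs set checksum=sha256 tank/data

which gives every block, deduped or not, the stronger hash.)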

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-27 Thread Nicolas Williams
On Mon, Dec 27, 2010 at 09:06:45PM -0500, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Nicolas Williams
  
   Actually I'd say that latency has a direct relationship to IOPS because
   it's the time it takes to perform an IO that determines how many IOs
   Per Second that can be performed.
  
  Assuming you have enough synchronous writes and that you can organize
  them so as to keep the drive at max sustained sequential write
  bandwidth, then IOPS == bandwidth / logical I/O size.  Latency doesn't
 
 Ok, what we've hit here is two people using the same word to talk about
 different things.  Apples to oranges, as it were.  Both meanings of IOPS
 are ok, but context is everything.  
 
 There are drive random IOPS, which is dependent on latency and seek time,
 and there is also measured random IOPS above the filesystem layer, which is
 not always related to latency or seek time, as described above.

Clearly the application cares about _synchronous_ operations that are
meaningful to it.  In the case of an NFS application that would be
open() with O_CREAT (and particularly O_EXCL), close(), fsync() and so
on.  For a POSIX (but not NFS) application the number of synchronous
operations is smaller.  The rate of asynchronous operations is less
important to the application because those are subject to caching, thus
less predictable.  But to the filesystem the IOPS are not just about
synchronous I/O but about how many distinct I/O operations can be
completed per unit of time.  I tried to keep this clear; sorry for any
confusion.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-26 Thread Nicolas Williams
On Sat, Dec 25, 2010 at 08:37:42PM -0500, Ross Walker wrote:
 On Dec 24, 2010, at 1:21 PM, Richard Elling richard.ell...@gmail.com wrote:
 
  Latency is what matters most.  While there is a loose relationship between 
  IOPS
  and latency, you really want low latency.  For 15krpm drives, the average 
  latency
  is 2ms for zero seeks.  A decent SSD will beat that by an order of 
  magnitude.
 
 Actually I'd say that latency has a direct relationship to IOPS because it's 
 the time it takes to perform an IO that determines how many IOs Per Second 
 that can be performed.

Assuming you have enough synchronous writes and that you can organize
them so as to keep the drive at max sustained sequential write
bandwidth, then IOPS == bandwidth / logical I/O size.  Latency doesn't
enter into that formula.  Latency does remain though, and will be
noticeable to apps doing synchronous operations.

Thus with, say, 100MB/s sustained sequential write bandwidth and 2KB
average ZIL entries, you'd get 51200 logical sync write operations per
second.  The
latency for each such operation would still be 2ms (or whatever it is
for the given disk).  Since you'd likely have to batch many ZIL writes
you'd end up making the latency for some ops longer than 2ms and others
shorter, but if you can keep the drive at max sustained seq write
bandwidth then the average latency will be 2ms.

SSDs are clearly a better choice.
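
(For reference, attaching a dedicated log device -- SSD or otherwise --
is a one-liner; pool and device names are illustrative:

  $ zpool add tank log c5t0d0
  (or, mirrored:  zpool add tank log mirror c5t0d0 c6t0d0)

)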

BTW, a parallelized tar would greatly help reduce the impact of high
latency open()/close() (over NFS) operations...

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] stupid ZFS question - floating point operations

2010-12-23 Thread Nicolas Williams
On Thu, Dec 23, 2010 at 09:32:13AM +, Darren J Moffat wrote:
 On 22/12/2010 20:27, Garrett D'Amore wrote:
 That said, some operations -- and cryptographic ones in particular --
 may use floating point registers and operations because for some
 architectures (sun4u rings a bell) this can make certain expensive
 
 Well remembered!  There are sun4u optimisations that use the
 floating point unit but those only apply to the bignum code which in
 kernel is only used by RSA.
 
 operations go faster. I don't think this is the case for secure
 hash/message digest algorithms, but if you use ZFS encryption as found
 in Solaris 11 Express you might find that on certain systems these
 registers are used for performance reasons, either on the bulk crypto or
 on the keying operations. (More likely the latter, but my memory of
 these optimizations is still hazy.)
 
 RSA isn't used at all by ZFS encryption, everything is AES
 (including key wrapping) and SHA256.
 
 So those optimisations for floating point don't come into play for
 ZFS encryption.

Moreover, we have platform-specific crypto optimizations.  If there were
FPU operations that help speed up symmetric crypto on an M4000 but not
on UltraSPARC T2s, then we'd use that on the one but not on the other.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-23 Thread Nicolas Williams
On Thu, Dec 23, 2010 at 11:25:43AM +0100, Stephan Budach wrote:
 as I have learned from the discussion about which SSD to use as ZIL
 drives, I stumbled across this article, that discusses short
 stroking for increasing IOPs on SAS and SATA drives:

There was a thread on this a while back.  I forget when or the subject.
But yes, you could even use 7200 rpm drives to make a fast ZIL device.
The trick is the on-disk format, and the pseudo-device driver that you
would have to layer on top of the actual device(s) to get such
performance.  The key is that sustained sequential I/O rates for disks
can be quite large, so if you organize the disk in a log form and use
the outer tracks only, then you can pretend to have awesome write IOPS
for a disk (but NOT read IOPS).

But it's not necessarily as cheap as you might think.  You'd be making
very inefficient use of an expensive disk (in the case of a SAS 15k rpm
disk), or disks, and if plural then you are also using more ports
(oops).  Disks used this way probably also consume more power than SSDs
(OK, this part of my analysis is very iffy), and you still need to do
something about ensuring syncs to disk on power failure (such as just
disabling the cache on the disk, but this would lower performance,
increasing the cost).  When you factor all the costs in I suspect you'll
find that SSDs are priced reasonably well.  That's not to say that one
could not put together a disk-based log device that could eat SSDs'
lunch, but SSD prices would then just come down to match that -- and you
can expect SSD prices to come down anyway, as with any new
technology.

I don't mean to discourage you, just to point out that there's plenty of
work to do to make short-stroked disks as ZILs a workable reality,
while the economics of doing that work versus waiting for SSD prices to
come down don't seem appealing.  Caveat emptor: my analysis is
off-the-cuff; I could be wrong.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express

2010-12-02 Thread Nicolas Williams
On Wed, Nov 17, 2010 at 01:58:06PM -0800, Bill Sommerfeld wrote:
 On 11/17/10 12:04, Miles Nordin wrote:
 black-box crypto is snake oil at any level, IMNSHO.
 
 Absolutely.

As Darren said, much of the design has been discussed in public, and
reviewed by cryptographers.  It'd be nicer if we had a detailed paper
though.

 Congrats again on finishing your project, but every other disk
 encryption framework I've seen taken remotely seriously has a detailed
 paper describing the algorithm, not just a list of features and a
 configuration guide.  It should be a requirement for anything treated
 as more than a toy.  I might have missed yours, or maybe it's coming
 soon.
 
 In particular, the mechanism by which dedup-friendly block IV's are
 chosen based on the plaintext needs public scrutiny.  Knowing
 Darren, it's very likely that he got it right, but in crypto, all
 the details matter and if a spec detailed enough to allow for
 interoperability isn't available, it's safest to assume that some of
 the details are wrong.

Dedup + crypto does have security implications.  Specifically: it
facilitates traffic analysis, and then known- and even
chosen-plaintext attacks (if there were any practical such attacks on
the cipher).

For example, IIUC, the ratio of dedup vs.  non-dedup blocks + analysis
of dnodes and their data sizes (in blocks) + per-dnode dedup ratios can
probably be used to identify OS images, which would then help mount
known-plaintext attacks.  For a mailstore you'd be able to distinguish
mail sent or kept by a single local user vs. mail sent to and kept by
more than one local user, and by sending mail you could help mount
chosen-plaintext attacks.  And so on.

My advice would be to not bother encrypting OS images, and if you
encrypt only documents, then dedup is likely of less or no interest to
you -- in general, you may not want to bother with dedup + crypto.
However, it is fantastic that crypto and dedup can work together.
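
(For reference, the combination itself is just one create, in Solaris 11
Express syntax, with an illustrative dataset name and passphrase keying:

  $ zfs create -o encryption=on -o keysource=passphrase,prompt \
        -o dedup=on tank/docs

whether you want both on the same dataset is the trade-off described
above.)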

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express

2010-12-02 Thread Nicolas Williams
Also, when the IV is stored you can more easily look for accidental IV
re-use, and if you can find hash collisions, then you can even cause IV
re-use (if you can write to the filesystem in question).  For GCM, IV
re-use is rather fatal (for CCM it's bad, but IIRC not fatal), so I'd
not use GCM with dedup either.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-10 Thread Nicolas Williams
On Sat, Oct 09, 2010 at 09:52:51PM -0700, Richard Elling wrote:
 Are we living in the past?
 
 In the bad old days, UNIX systems spoke NFS and Windows systems spoke
 CIFS. The cost of creating a file system was expensive -- slices,
 partitions, etc.
 
 With ZFS, file systems (datasets) are relatively inexpensive.
 
 So, are we putting too many constraints into a system (ZFS) which is
 busy trying to remove constraints?  Is it reasonable to expect that
 ZPL is the only kind of file system ZFS customers need?  Is it high
 time for a ZCIFS dataset?

I don't quite understand what you mean.  ZPL is just a POSIX layer.  It
_happens_ to be used not just by the system call layer in Solaris, but
also by the SMB and NFS servers, but you could also imagine the SMB and
NFS servers using the DMU directly while maintaining on-disk
compatibility with the ZPL.  Not using the ZPL does not necessitate
having a different on-disk format, or different semantics.

Now, if you were asking about dataset properties that make a dataset
behave more like what Windows expects or more like what Unix expects,
that's different, but that wouldn't require junking the ZPL.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-06 Thread Nicolas Williams
On Wed, Oct 06, 2010 at 04:38:02PM -0400, Miles Nordin wrote:
  nw == Nicolas Williams nicolas.willi...@oracle.com writes:
 
 nw The current system fails closed 
 
 wrong.
 
 $ touch t0
 $ chmod 444 t0
 $ chmod A0+user:$(id -nu):write_data:allow t0
 $ ls -l t0
 -r--r--r--+  1 carton   carton 0 Oct  6 20:22 t0
 
 now go to an NFSv3 client:
 $ ls -l t0
 -r--r--r-- 1 carton 405 0 2010-10-06 16:26 t0
 $ echo lala > t0
 $ 
 
 wide open.

The system does what the ACL says.  The mode fails to accurately
represent the actual access because... the mode can't.  Now, we could
have chosen (and still could choose to) represent the presence of ACEs
for subjects other than owner@/group@/everyone@ by using the group bits
of the mode to represent the maximal set of permissions granted.

But I don't consider the above failing open.

 nw You seem to be in denial.  You continue to ignore the
 nw constraint that Windows clients must be able to fully control
 nw permissions in spite of their inability to perceive and modify
 nw file modes.
 
 You remain unshakably certain that this is true of my proposal in
 spite of the fact that you've said clearly that you don't understand
 my proposal.  That's bad science.

*You* stated that your proposal wouldn't allow Windows users full
control over file permissions.

 It may be my fault that you don't understand it: maybe I need to write
 something shorter but just as expressive to fit within mailing list
 attention spans, or maybe my examples are unclear.  However that
 doesn't mean that I'm in denial nor make you right---that just makes
 me annoying.

Yes, that may be.  I encourage you to find a clearer way to express your
proposal.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-06 Thread Nicolas Williams
On Wed, Oct 06, 2010 at 05:19:25PM -0400, Miles Nordin wrote:
  nw == Nicolas Williams nicolas.willi...@oracle.com writes:
 
 nw *You* stated that your proposal wouldn't allow Windows users
 nw full control over file permissions.
 
 me: I have a proposal
 
 you: op!  OP op, wait!  DOES YOUR PROPOSAL blah blah WINDOWS blah blah
  COMPLETELY AND EXACTLY LIKE THE CURRENT ONE.
 
 me: no, but what it does is...

The correct quote is:

no, not under my proposal.

That's from a post from you on September 30, 2010, with Message-Id:
oqd3ruvkf3@castrovalva.ivy.net.  That was a direct answer to a
direct question.

Now, maybe you wish to change your view.  That'd be fine.  Do not,
however, imply that I'm a liar -- not if you want to be taken
seriously.  Please re-write your proposal _clearly_ and refrain from
personal attacks.

Cheers,

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating to an aclmode-less world

2010-10-05 Thread Nicolas Williams
On Mon, Oct 04, 2010 at 04:30:05PM -0600, Cindy Swearingen wrote:
 Hi Simon,
 
 I don't think you will see much difference for these reasons:
 
 1. The CIFS server ignores the aclinherit/aclmode properties.

Because CIFS/SMB has no chmod operation :)

 2. Your aclinherit=passthrough setting overrides the aclmode
 property anyway.

aclinherit=passthrough-x is a better choice.

Also, aclinherit doesn't override aclmode.  aclinherit applies on create
and aclmode used to apply on chmod.

 3. The only difference is that if you use chmod on these files
 to manually change the permissions, you will lose the ACL values.

Right.  That only happens from NFSv3 clients [that don't instead edit
the POSIX Draft ACL translated from the ZFS ACL], from non-Windows NFSv4
clients [that don't instead edit the ACL], and from local applications
[that don't instead edit the ZFS ACL].

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-05 Thread Nicolas Williams
On Mon, Oct 04, 2010 at 02:28:18PM -0400, Miles Nordin wrote:
  nw == Nicolas Williams nicolas.willi...@oracle.com writes:
 
 nw I would think that 777 would invite chmods.  I think you are
 nw handwaving.
 
 it is how AFS worked.  Since no file on a normal unix box besides /tmp

But would the AFS experience translate into double plus happiness for us?

 ever had 777 it would send a SIGWTF to any AFS-unaware graybeards that
 stumbled onto the directory, alerting them that they needed to go
 learn something and come back.

A signal?!  How would that work when the entity doing a chmod is on a
remote NFS client?

 I understand that everything:everyone on windows doesn't send SIGWTF,
 but 777 on unix for AFS sites it did.  You realize it's not
 hypothetical, right?  AFS was actually implemented, widely, and
 there's experience with it.

Yes... but I'm skeptical about the universality of that experience's
applicability.  Specifically: I don't think it could work for us.

AFS developers had fewer constraints than Solaris developers.  It is no
surprise that they were able to find happy solutions to these sorts of
problems long ago.

OpenAFS has a Windows native client and an Explorer shell extension
(which surely handles chmod?).  However, we don't have the luxury of
telling customers to install third-party (possibly ours, whatever)
Windows native clients for protocols other than SMB, nor can we tell
them to install Explorer shell extensions.  Solaris' SMB server needs to
work out of the box and without the limitations implied by having a
separate ACL and mode (well, we have that now, but we always compute a
new mode from the new ACL when ACLs are changed).

 If they failed to act on the SIGWTF, the overall system enforced the
 tighter of the unix permissions and the AFS ACL, so it fails closed.
 The current system fails open.

The current system fails closed (by discarding the ACL and replacing it
with a new one based entirely on the new mode).

 Also AFS did no translation between unix permissions and AFS ACL's so
 it was easy to undo such a mistake when it happened: double-check the
 AFS ACL is not too wide on the directories where you see unix people
 mucking around in case the muckers were responding to a real problem,
 then set the unix modes back to 777.

Right, but with SMB in the picture we don't have this luxury.  You seem
unwilling to accept that one constraint.

 nw When chmod()ing an object... ZFS would search for the most
 nw specific matching file in .zfs/ACLs/ and, if found, would
 nw replace the chmod()ed object's ACL with that of the
 nw .zfs/ACLs/... file found.  The .inherit suffix would indicate
 nw that if the chmod() target's parent directory has inherittable
 nw ACEs then they will be groupmasked and added to the ACEs from
 nw the .zfs/ACLs/... file to produce a final ACL.
 
 This proposal, like the current situation, seems to make chmod
 configurable to act like ``not chmod'' which IMHO is exactly what's
 unpopular about the current regime.  You've tried to leave chmod

To some degree, yes.  It's different though, and might conceivably be
acceptable, though I don't think it will be (I was illustrating
potential alternatives).

But I really like one thing about it: most apps shouldn't care about ACL
contents, they should care about context-specific permissions changes.
In a directory containing shared documents the intention should
typically be "share with all these people", while in home directories
the intention should typically be "don't share with anyone" (but this
will vary; e.g., ~/.ssh/authorized_keys needs to be reachable and
readable by everyone).  Add in executable versus non-executable, and
you have a pretty complete picture -- just a few named ACLs at most,
per-dataset.

If we could replace chmod(2) with a version that takes actual names for
pre-configured ACLs, _that_ would be great.  But we can't for the same
reason that we can't remove chmod(2): it's a widely used interface.

 active on windows trees and guess at the intent of whoever invokes
 chmod, providing no warning that you're secretly doing
 ``approximately'' what he asked for rather than exactly.  Maybe that
 flies on Windows, but on Unix people expect more precision: thorough
 abstractions that survive corner cases and have good exception
 handling.

Look, mode is a pretty lame hammer -- ACLs are far, far more granular --
but it's a hammer that many apps use.  Given the lack of granularity of
modes, I think an approximation of intent is the best we can do.

Consider: both the aclmode=discard and aclmode=groupmask behaviors can be
considered to be what the user intended.  How do you know if the user
intended for other users and groups to retain access limited to the
group bits of a new mode?  You can't, not without asking the user.  So
aclmode=discard is certainly an approximation of user intent, and so
aclmode=groupmask must be considered an approximation as well.

That would

Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-04 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 08:14:24PM -0400, Miles Nordin wrote:
  Can the user in (3) fix the permissions from Windows?
 
 no, not under my proposal.

Let's give it a whirl anyway:

 but it sounds like currently people cannot ``fix'' permissions through
 the quirky autotranslation anyway, certainly not to the point where
 neither unix nor windows users are confused: windows users are always
 confused, and unix users don't get to see all the permissions.

No, that's not right.  Today you can fix permissions from any NFSv4
client that exports an NFSv4-style ACL interface to users.  You can fix
permissions from Windows.  You can fix permissions from a local Solaris
shell.  You can also fix permissions from NFSv3 clients (but you get
POSIX Draft-to-ZFS translated ACLs, which are confusing because they
tend to result in DENY ACEs being scattered all over).  You can also
chmod, but you lose your ACL if you do that.
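
For example, from a local Solaris shell (file, user, and ACL output here
are abridged and purely illustrative):

  $ chmod A+user:webservd:read_data:allow file
  $ ls -V file
  -rw-r--r--+  1 joe      staff          0 Oct  4 12:00 file
          user:webservd:r-------------:-------:allow
          owner@:rw-p--aARWcCos:-------:allow
          ...
  $ chmod 600 file            # the mode change discards the non-trivial ACL
  $ ls -V file
  -rw-------   1 joe      staff          0 Oct  4 12:00 file
          owner@:rw-p--aARWcCos:-------:allow
          ...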

  Now what?
 
 set the unix perms to 777 as a sign to the unix people to either (a)
 leave it alone, or (b) learn to use 'chmod A...'.  This will actually
 work: it's not a hand-waving hypothetical that just doesn't play out.

I would think that 777 would invite chmods.  I think you are handwaving.

 What I provide, which we don't have now, is a way to make:
 
   /tub/dataset/a subtree
 
 -rwxrwxrwx                          in old unix
 [working, changeable permissions]   in windows
 
   /tub/dataset/b subtree
 
 -rw-r--r--                          in old unix
 [everything: everyone]              in windows, but unix permissions
                                     still enforced
 
 this means:
 
  * unix writers and windows writers can cooperate even within a single
dataset
 
  * an intuitive warning sign when non-native permissions are in effect, 
 
  * fewer leaked-data surprises

I don't understand what exactly you're proposing.  You've not said
anything about how chmod is to be handled.

 If you accept that the autotranslation between the two permissions
 regimes is total shit, which it is, then what I offer is the best you
 can hope for.

If I could understand what you're proposing I might agree, who knows.
But I do think there's other possibilities, some probably better than
what you propose (whatever that is).

Here's a crazy alternative that might work (or not): allow users to
pre-configure named ACLs where the names are {owner, group, mode}.
E.g., we could have:

.zfs/ACLs/user/[group:][d|-]permissions[.inherit]
          ^             ^               ^
          |             |               |
          |             |               +-- see below
          |             +-- applies to directory or other objects
          +-- owned by user

When chmod()ing an object... ZFS would search for the most specific
matching file in .zfs/ACLs/ and, if found, would replace the chmod()ed
object's ACL with that of the .zfs/ACLs/... file found.  The .inherit
suffix would indicate that if the chmod() target's parent directory has
inheritable ACEs then they will be groupmasked and added to the ACEs
from the .zfs/ACLs/... file to produce a final ACL.

E.g., a chmod(0644) of /a/b/c/foo (say, a file owned by 'joe' with group
'staff', with /, /a, /a/b, and /a/b/c all being datasets), where c has
inheritable ACEs would cause ZFS to search for
.zfs/ACLs/joe/staff:-rw-r--r--.inherit, .zfs/ACLs/joe/-rw-r--r--.inherit,
.zfs/ACLs/joe/staff:-rw-r--r--, and .zfs/ACLs/joe/-rw-r--r--, first in
/a/b/c, then /a/b, then /a, then /.

I said this is crazy.  Is it?  I think it probably is.  This would
almost certainly prove to be a hard-to-use design.  Users would need to
be educated in order to not be surprised...  OTOH, it puts much more
control in the hands of the user.  These named ACLs could be inherited
from parent datasets as a way to avoid having to set them up too many
times.  And with the .inherit twist it probably has enough granularity
of control to be useful (particularly if users are dataset-happy).
Finally, these could even be managed remotely.

I see zero chance of such a design being adopted.

It'd be better, IMO, to go for non-POSIX-equivalent groupmasking and
translations of POSIX mode_t and POSIX Draft ACLs to ZFS ACLs.  For
example: take the current translations, remove all owner@ and group@ DENY
ACEs, then sort any remaining user DENY ACEs to be first, and any
everyone@ DENY ACEs to be last.  The results would surely be surprising
to some users, but the kinds of mode_t and POSIX Draft ACLs where
surprise is likely are rare.

That's two alternatives right there.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-09-30 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 02:55:26PM -0400, Miles Nordin wrote:
  nw == Nicolas Williams nicolas.willi...@oracle.com writes:
 nw Keep in mind that Windows lacks a mode_t.  We need to interop
 nw with Windows.  If a Windows user cannot completely change file
 nw perms because there's a mode_t completely out of their
 nw reach... they'll be frustrated.
 
 well...AIUI this already works very badly, so keep that in mind, too.
 
 In AFS this is handled by most files having 777, and we could do the
 same if we had an AND-based system.  This is both less frustrating and
 more self-documenting than the current system.
 
 In an AND-based system, some unix users will be able to edit the
 windows permissions with 'chmod A...'.  In shops using older unixes
 where users can only set mode bits, the rule becomes ``enforced
 permissions are the lesser of what Unix people and Windows people
 apply.''  This rule is easy to understand, not frustrating, and
 readily encourages ad-hoc cooperation (``can you please set
 everything-everyone on your subtree?  we'll handle it in unix.'' /
 ``can you please set 777 on your subtree?  or 770 group windows?  we
 want to add windows silly-sid-permissions.'').  This is a big step
 better than existing systems with subtrees where Unix and Windows
 users are forced to cooperate.

Consider this chronologically-ordered sequence of events:

1) File is created via Windows, gets SMB/ZFS/NFSv4-style ACL, including
   inheritable ACEs.  A mode computed from this ACL might be 664, say.

2) A Unix user does chmod(644) on that file, and one way or another this
   effectively reduces permissions otherwise granted by the ACL.

3) Another Windows user now fails to get write perm that they should
   have, so they complain, and then the owner tries to view/change the
   ACL from a Windows desktop.

Now what?

Can the user in (3) fix the permissions from Windows?  For that to be
possible the mode must implicitly get recomputed when the ACL is
modified.

What if (2) happens again?  But, OK, this is a problem no matter what,
whether we do groupmasking, discard, or keep mode separate from the ACL
and AND the two.

ZFS does, in fact, keep a separate mode, and it does recompute it when
ACLs are modified.  So this may just be a matter of doing the AND thing
and not touching the ACL on chmod.  Is that what you have in mind?

 It would certainly work much better than the current system, where you
 look at your permissions and don't have any idea whether you've got
 more, less, or exactly the same permission as what your software is
 telling you: the crappy autotranslation teaches users that all bets
 are off.

No, currently the permissions you look at reflect the ACL (with
the group bits being the max of all non-owner@ and non-everyone@ ACEs).

 It would be nice if, under my proposal, we could delete the unix
 tagspace entirely:
 
  chpacl '(unix)' chmod -R A- .

Huh?

 but unfortunately, deletion of ACL's is special-cased by Solaris's
 chmod to ``rewrite ACL's that match the UNIX permissions bits,'' so it
 would probably have to stay special-cased in a tagspace system.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-09-30 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 03:28:14PM -0500, Nicolas Williams wrote:
 Consider this chronologically-ordered sequence of events:
 
 1) File is created via Windows, gets SMB/ZFS/NFSv4-style ACL, including
    inheritable ACEs.  A mode computed from this ACL might be 664, say.
 
 2) A Unix user does chmod(644) on that file, and one way or another this
effectively reduces permissions otherwise granted by the ACL.
 
 3) Another Windows user now fails to get write perm that they should
have, so they complain, and then the owner tries to view/change the
ACL from a Windows desktop.
 
 Now what?
 
 Can the user in (3) fix the permissions from Windows?  For that to be
 possible the mode must implicitly get recomputed when the ACL is
 modified.

Also, even if in (3) the user can fix the perms from Windows because
we'd recompute the mode from the ACL, the user wouldn't be able to see
the effective ACL (as reduced by the mode_t that Windows can't see).
The only way to address that is... to do groupmasking.  And that gets us
back to the problems we had with groupmasking.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-09-30 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 08:14:24PM -0400, Miles Nordin wrote:
  Can the user in (3) fix the permissions from Windows?
 
 no, not under my proposal.

Then your proposal is a non-starter.  Support for multiple remote
filesystem access protocols is key for ZFS and Solaris.

The impedance mismatches between these various protocols means that we
need to make some trade-offs.  In this case I think the business (as
well as the engineers involved) would assert that being a good SMB
server is critical, and that being able to authoritatively edit file
permissions via SMB clients is part of what it means to be a good SMB
server.

Now, you could argue that we should bring aclmode back and let the user
choose which trade-offs to make.  And you might propose new values for
aclmode or enhancements to the groupmask setting of aclmode.

 but it sounds like currently people cannot ``fix'' permissions through
 the quirky autotranslation anyway, certainly not to the point where
 neither unix nor windows users are confused: windows users are always
 confused, and unix users don't get to see all the permissions.

Thus the current behavior is the same as the old aclmode=discard
setting.

  Now what?
 
 set the unix perms to 777 as a sign to the unix people to either (a)
 leave it alone, or (b) learn to use 'chmod A...'.  This will actually
 work: it's not a hand-waving hypothetical that just doesn't play out.

That's not an option, not for a default behavior anyways.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs proerty aclmode gone in 147?

2010-09-29 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 03:44:57AM -0700, Ralph Böhme wrote:
  On 9/28/2010 2:13 PM, Nicolas Williams wrote:
  The version of samba bundled with Solaris 10 seems to insist on
  chmod'ing stuff. I've tried all of the various

Just in case it's not clear, I did not write the quoted text.  (One can
tell from the level of quotation that an attribution is missing and that
none of my text was quoted.)

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)

2010-09-29 Thread Nicolas Williams
Keep in mind that Windows lacks a mode_t.  We need to interop with
Windows.  If a Windows user cannot completely change file perms because
there's a mode_t completely out of their reach... they'll be frustrated.

Thus an ACL-and-mode model where both are applied doesn't work.  It'd be
nice, but it won't work.

The mode has to be entirely encoded by the ACL.  But we can't resort to
interesting encoding tricks as Windows users won't understand them.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)

2010-09-29 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 03:09:22PM -0700, Ralph Böhme wrote:
  Keep in mind that Windows lacks a mode_t.  We need to
  interop with Windows.
 
 Oh my, I see. Another itch to scratch. Now at least Windows users are
 happy while I and maybe others are not.

Yes.  Pardon me for forgetting to mention this earlier.  There's so many
wrinkles here...  But this is one of the biggers; I should not have
forgotten it.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)

2010-09-29 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 05:21:51PM -0500, Nicolas Williams wrote:
 On Wed, Sep 29, 2010 at 03:09:22PM -0700, Ralph Böhme wrote:
   Keep in mind that Windows lacks a mode_t.  We need to
   interop with Windows.
  
  Oh my, I see. Another itch to scratch. Now at least Windows users are
  happy while me and mabye others are not.
 
 Yes.  Pardon me for forgetting to mention this earlier.  There's so many
 wrinkles here...  But this is one of the biggers; I should not have

s/biggers/biggest/

 forgotten it.
 
 Nico
 -- 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs proerty aclmode gone in 147?

2010-09-28 Thread Nicolas Williams
On Tue, Sep 28, 2010 at 12:18:49PM -0700, Paul B. Henson wrote:
On Sat, 25 Sep 2010, Ralph Böhme wrote:
 
  Darwin ACL model is nice and slick, the new NFSv4 one in 147 is just
  braindead. chmod resulting in ACLs being discarded is a bizarre design
  decision.
 
 Agreed. What's the point of ACLs that disappear? Sun didn't want to fix
 acl/chmod interaction, maybe one of the new OpenSolaris forks will do the
 right thing...

I've researched this enough (mainly by reading most of the ~240 or so
relevant zfs-discuss posts and several bug reports) to conclude the
following:

 - ACLs derived from POSIX mode_t and/or POSIX Draft ACLs that result in
   DENY ACEs are enormously confusing to users.

 - ACLs derived from POSIX mode_t and/or POSIX Draft ACLs that result in
   DENY ACEs are susceptible to ACL re-ordering when modified from
   Windows clients -which insist on DENY ACEs first-, leading to much
   confusion.

 - This all gets more confusing when hand-crafted ZFS inheritable ACEs
   are mixed with chmod(2)s with the old aclmode=groupmask setting.

The old aclmode=passthrough setting was dangerous and had to be removed,
period.  (Doing chmod(600) would not necessarily deny other users/groups
access -- that's very, very broken.)

That only leaves aclmode=discard and some variant of aclmode=groupmask
that is less confusing.

But here's the thing: the only time that groupmasking results in
sensible ACLs is when it doesn't require DENY ACEs, which in turn is
only when mode_t bits and/or POSIX ACLs are strictly non-increasing
(e.g., 777, 775, 771, 750, 755, 751, etcetera, would all be OK, but 757
would not be).

The problem then is this: if you have an aclmode setting that sometimes
groupmasks and sometimes discards... that'll be confusing too!

So one might wonder: can one determine user intent from the ACL prior to
the change and the mode/POSIX ACL being set, and then edit the ZFS ACL
in a way that approximates the user's intention?  I believe that would
be possible, but risky too, as the need to avoid DENY ACEs (see Windows
issue) would often result in more permissive ACLs than the user actually
intended.

Taken altogether I believe that aclmode=discard is the simplest setting
to explain and understand.  Perhaps eventually a variant of groupmasking
will be developed that is also simple to explain and understand, but
right now I very much doubt it (and yes, I've tried myself).  But much
better than that would be if we just move to a ZFS ACL world (which,
among other things, means we'll need a simple libc API for editing
ACLs).

Note, incidentally, that there's a groupmasking behavior left in ZFS at
this time: on create of objects in directories with inheritable ACEs
and with aclinherit=passthrough*.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs proerty aclmode gone in 147?

2010-09-28 Thread Nicolas Williams
On Tue, Sep 28, 2010 at 02:03:30PM -0700, Paul B. Henson wrote:
 On Tue, 28 Sep 2010, Nicolas Williams wrote:
 
  I've researched this enough (mainly by reading most of the ~240 or so
  relevant zfs-discuss posts and several bug reports)
 
 And I think some fair fraction of those posts were from me, so I'll try not
 to start rehashing old discussions ;).

:)

  That only leaves aclmode=discard and some variant of aclmode=groupmask
  that is less confusing.
 
 Or aclmode=deny, which is pretty simple, not very confusing, and basically
 the only paradigm that will prevent chmod from breaking your ACL.

That can potentially render many applications unusable.

  So one might wonder: can one determine user intent from the ACL prior to
  the change and the mode/POSIX ACL being set, and then edit the ZFS ACL
  in a way that approximates the user's intention?
 
 You're assuming the user is intentionally executing the chmod, or even
 *aware* of it happening. Probably at least 99% of the chmod calls executed
 on a file with a ZFS ACL at my site are the result of non-ACL aware legacy
 apps being stupid. In which case the *user* intent to to *leave the damn
 ACL alone* :)...

But that's not really clear.  The user is running some app.  The app
does some chmod(2)ing on behalf of the user.  The user may also use
chmod(1).  Now what?  Suppose you make chmod(1) not use chmod(2), so as
to be able to say that all chmod(2) calls are made by apps, not the
user.   But then what about scripts that use chmod(1)?

Basically, I think intent can be estimated in some cases; combined
with some simplifying assumptions (that will sometimes be wrong), such
as that security entities are all distinct and non-overlapping (as a way
to minimize the number of DENY ACEs needed), that can yield a
groupmasking algorithm that doesn't suck.  However, it'd still not be easy to
explain, and it'd still result in surprises (since the above assumption
will often be wrong, leading to more permissive ACLs than the user might
have intended!).  Seems like a lot of work for little gain, and high
support call generation rate.

  But much better than that would be if we just move to a ZFS ACL world
  (which, among other things, means we'll need a simple libc API for
  editing ACLs).
 
 Yep. And a good first step towards an ACL world would be providing a way to
 keep chmod from destroying ACLs in the current world...

I don't think that will happen...

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs proerty aclmode gone in 147?

2010-09-28 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 10:15:32AM +1300, Ian Collins wrote:
 Based on my own research, experimentation and client requests, I
 agree with all of the above.

Good to know.

 I have been re-ordering and cleaning (deny) ACEs for one client for a
 couple of years now and we haven't seen any user complaints.  In
 their environment, all ACLs started life as POSIX (from a Solaris 9
 host) and with the benefit of hindsight, I would have cleaned them
 up on import to ZFS rather than simply reading the POSIX ACL and
 writing back to ZFS.

The saddest scenario would be when you have to interop with NFSv3
clients whose users (or their apps) are POSIX ACL happy, but whose files
also need to be accessible from NFSv4, SMB, and local ZPL clients where
the users (possibly the same users, or their apps) are also ZFS ACL
happy.  Particularly if you also have Windows clients and the users edit
file ACLs there too!  Thankfully this is relatively easy to avoid
because: apps that edit ACLs are few and far between, thus easy to
remediate, and users should not really be manually setting POSIX Draft
and ZFS/NFSv4/SMB ACLs on the same files.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pools inside pools

2010-09-23 Thread Nicolas Williams
On Thu, Sep 23, 2010 at 06:58:29AM +, Markus Kovero wrote:
  What is an example of where a checksummed outside pool would not be able 
  to protect a non-checksummed inside pool?  Would an intermittent 
  RAM/motherboard/CPU failure that only corrupted the inner pool's block 
  before it was passed to the outer pool (and did not corrupt the outer 
  pool's block) be a valid example?
 
  If checksums are desirable in this scenario, then redundancy would also 
  be needed to recover from checksum failures.
 
 That is an excellent point also; what is the point of checksumming if
 you cannot recover from it? In this kind of configuration one would
 benefit performance-wise from not having to calculate checksums again.

The benefit of checksumming in the inner tunnel, as it were (the inner
pool), is to provide one more layer of protection relative to iSCSI.
But without redundancy in the inner pool you cannot recover from
failures, as you point out.  And you must have checksumming in the outer
pool, so that it can be scrubbed.

It's tempting to say that the inner pool should not checksum at all, and
that iSCSI and IPsec should be configured correctly to provide
sufficient protection to the inner pool.  Another possibility is to have
a remote ZFS protocol of sorts, but then you begin to wonder if
something like Lustre (married to ZFS) isn't better.
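
(In concrete terms -- all names illustrative, and the iSCSI plumbing
omitted -- the setup being discussed is a pool built on zvols served
from another pool, with the checksum decision made per dataset:

  $ zfs create -V 100g outer/lun0        # on the storage host, in the outer pool
  $ zpool create inner <iscsi-lun-device>   # on the client, once the LUN shows up
  $ zfs set checksum=off inner/data      # if you decide to rely on the outer pool

)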

 Checksums in outer pools effectively protect from disk issues; if
 hardware fails so that data is corrupted, isn't the outer pool's redundancy
 going to handle it for the inner pool also?

Yes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please warn a home user against OpenSolaris under VirtualBox under WinXP ; )

2010-09-22 Thread Nicolas Williams
On Wed, Sep 22, 2010 at 07:14:43AM -0700, Orvar Korvar wrote:
 There was a guy doing that: Windows as host and OpenSolaris as guest
 with raw access to his disks. He lost his 12 TB data. It turned out
 that VirtualBox dont honor the write flush flag (or something
 similar).

VirtualBox has an option to honor flushes.

Also, recent versions of ZFS can recover by throwing out the last N
transactions that were not committed fully.

 In other words, I would never ever do that. Your data is safer with
 Windows only and a Windows raid solution.
 
 Use OpenSolaris as host instead, and Win as guest.

I don't think your advice is correct.  If you're going to run production
services on VirtualBox VMs then you should enable cache flushes in VBox:

http://www.virtualbox.org/manual/ch12.html#id2692517


To enable flushing for IDE disks, issue the following command:

VBoxManage setextradata VM name
  VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush 0

The value [x] that selects the disk is 0 for the master device on the
first channel, 1 for the slave device on the first channel, 2 for the
master device on the second channel or 3 for the slave device on the
second channel.

To enable flushing for SATA disks, issue the following command:

VBoxManage setextradata VM name
  VBoxInternal/Devices/ahci/0/LUN#[x]/Config/IgnoreFlush 0

The value [x] that selects the disk can be a value between 0 and 29.


IMO VBox should have a simple toggle for this in either its disk or vm
manager UI.  And the flush commands should be honored by default.  What
VBox could do is have some radio buttons or checkboxes for indicating
the purpose of a given VM, and then derive default flush behavior from
that (e.g., test and gaming VMs need not honor flushes, dev VMs might,
and prod VMs do).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS COW and simultaneous read write of files

2010-09-22 Thread Nicolas Williams
On Wed, Sep 22, 2010 at 12:30:58PM -0600, Neil Perrin wrote:
 On 09/22/10 11:22, Moazam Raja wrote:
 Hi all, I have a ZFS question related to COW and scope.
 
 If user A is reading a file while user B is writing to the same file,
 when do the changes introduced by user B become visible to everyone?
 Is there a block level scope, or file level, or something else?
 
 Thanks!
 
 Assuming the user is using read and write against zfs files.
 ZFS has reader/writer range locking within files.
 If thread A is trying to read the same section that thread B is
 writing it will
 block until the data is written. Note, written in this case means
 written into the zfs
 cache and not to the disks. If thread A requires that changes to the
 file be stable (on disk)
 before reading it can use the little known O_RSYNC flag.

That's assuming local access (i.e., POSIX semantics).  It's different if
NFS is involved (because of NFS' close-to-open semantics).  It might be
different if SMB is involved (dunno).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Nicolas Williams
On Wed, Sep 15, 2010 at 05:18:08PM -0400, Edward Ned Harvey wrote:
 It is absolutely not difficult to avoid fragmentation on a spindle drive, at
 the level I described.  Just keep plenty of empty space in your drive, and
 you won't have a fragmentation problem.  (Except as required by COW.)  How
 on earth do you conclude this is practically impossible?

That's expensive.  It's also approaching short-stroking (which is
expensive).  Which is what Richard said (in so many words, that it's
expensive).  Can you make HDDs perform awesome?  Yes, but you'll need
lots of them, and you'll need to use them very inefficiently.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is the 1000 bit?

2010-09-14 Thread Nicolas Williams
On Tue, Sep 14, 2010 at 04:13:31PM -0400, Linder, Doug wrote:
 I recently created a test zpool (RAIDZ) on some iSCSI shares.  I made
 a few test directories and files.  When I do a listing, I see
 something I've never seen before:
 
 [r...@hostname anewdir] # ls -la
 total 6160
 drwxr-xr-x   2 root other  4 Sep 14 14:16 .
 drwxr-xr-x   4 root root   5 Sep 14 15:04 ..
 -rw--T   1 root other2097152 Sep 14 14:16 barfile1
 -rw--T   1 root other1048576 Sep 14 14:16 foofile1
 
 I looked up the T bit in the man page for ls, and it says that T
 means  The 1000 bit is turned on, and execution is off (undefined
 bit-state).  Which is as clear as mud.

It's the sticky bit.  Nowadays it's only useful on directories, and
really it's generally only used with 777 permissions.  The chmod(1) (man
-M/usr/man chmod) and chmod(2) (man -s 2 chmod)  manpages describe the
sticky bit.
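
The classic use is a world-writable directory like /tmp.  For example
(directory name hypothetical):

% chmod 1777 /export/scratch    # world-writable directory, sticky bit on
% chmod 0600 barfile1           # clears the 1000 bit, keeps rw for the owner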

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode

2010-08-27 Thread Nicolas Williams
On Sat, Aug 28, 2010 at 12:05:53PM +1200, Ian Collins wrote:
 Think of this from the perspective of an application. How would
 write failure be reported?  open(2) returns EACCES if the file can
 not be written but there isn't a corresponding return from write(2).
 Any open file descriptors would have to be updated to reflect the
 change of access and the application would end up with an unexpected
 error return (EBADF?).

EROFS.  But write(2) isn't supposed to return EROFS.  NFSv3's and v4's
write ops are allowed to return the NFS equivalent of EROFS, and so
typically NFS clients do cause write(2) to return EROFS in such cases
(but then, NFS isn't fully POSIX).

write(2) can return EIO though, and, IIRC, the BSD revoke(2) syscall
arranges for just that to be returned by write(2) calls on revoked
fildes.

IMO EROFS and EIO would both be OK.  It might be a good idea to require
a force option to make a change that would cause non-POSIX behavior.

I'd think that there's many possible ways to handle this:

a) disallow setting readonly=on on mounted datasets that are
   readonly=false;

b) disallow ... but only if there are any fildes open for write (doesn't
   matter if shared with NFS as NFS writes are allowed to return EROFS);

c) allow the change but make it take effect on next mount;

d) force umount the dataset, make the change, mount again;

e) have write(2), to fildes open for write before the change to
   readonly=on, return EROFS after the change;

f) same as (e) but only if you force the prop change;

g) have write(2), to fildes open for write before the change to
   readonly=on, return EIO after the change;

h) allow write(2)s to fildes open for write before the change to
   readonly=on;

(h) is current behavior.  (a) and (b) would be reasonable, but if EBUSY,
the user may not be able to change the property without drastic steps
(such as rebooting, if there's lots of datasets below).  (c) would be
confusing, and not that useful.  (d) would be unreasonable (plus what if
there's datasets below this one?!).  (e)...  may be reasonable if you
think that we're well outside POSIX the moment you change the readonly
prop to on.  (f) is reasonable (by forcing the change you'd be saying
that you're happy to leave POSIX land).  (h) is reasonable.

 If the application has been given permission to open a file for
 writing and this permission is unexpectedly revoked, strange things
 my happen.  The file being written would be in an inconsistent
 state.

Well, there's always the BSD revoke(2) system call.  Use it and 

 I think it is better to let write operation complete and leave the
 file in a consistent state.

There is that too.  But you could, too, just power off...  The
application should use fsync(2) (or fdatasync()) carefully to ensure
that failed write(2)s and power failures don't leave the application in
an unrecoverable state.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 64-bit vs 32-bit applications

2010-08-19 Thread Nicolas Williams
On Fri, Aug 20, 2010 at 09:23:56AM +1200, Ian Collins wrote:
 On 08/20/10 08:30 AM, Garrett D'Amore wrote:
 There is no common C++ ABI.  So you get into compatibility concerns
 between code built with different compilers (like Studio vs. g++).
 Fail.
 
 Which is why we have extern C.  Just about any Solaris driver,
 library or kernel module could be implemented in C++ behind the C
 compatibility layer and no one would notice.

Any driver C++ code would still need a C++ run-time.  Either you must
statically link it in, or you'll have a problem with multiple drivers
using different C++ run-times.  If you statically link in the run-time,
then you're bloating the text of the kernel.  If you're not then you
have a problem.  C++ is bad because of its ABI issues, really.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 64-bit vs 32-bit applications

2010-08-19 Thread Nicolas Williams
On Fri, Aug 20, 2010 at 09:38:51AM +1200, Ian Collins wrote:
 On 08/20/10 09:33 AM, Nicolas Williams wrote:
 Any driver C++ code would still need a C++ run-time.  Either you must
 statically link it in, or you'll have a problem with multiple drivers
 using different C++ run-times.  If you statically link in the run-time,
 then you're bloating the text of the kernel.  If you're not then you
 have a problem.  C++ is bad because of its ABI issues, really.
 
 You snipped the bit where I said
 
 Drivers and kernel modules are a good example; in that world you
 have to live without the runtime library (which is dynamic only).
 So you are effectively just using C++ as a superset of C with all
 the benefits that offers.
 
 So you basically lose the C++ specific parts of the standard
 library and exceptions.  But you still have the built in features of
 the language.

I'm not sure it's that easy to avoid the C++ run-time when you're
coding.  And anyways, the temptation to build classes that can be used
elsewhere becomes rather strong.  IMO C++ in the kernel is asking for
trouble.  And C++ in user-land?  Same thing: you'll end up wanting to
turn parts of your application into libraries, and then some other
developer will want to use those in their C++ app, and then you run into
the ABI issues all over again.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] User level transactional API

2010-08-12 Thread Nicolas Williams
On Thu, Aug 12, 2010 at 07:48:10PM -0500, Norm Jacobs wrote:
 For single file updates, this is commonly solved by writing data to
 a temp file and using rename(2) to move it in place when it's ready.

For anything more complicated you need... a more complicated approach.

Note that transactional API means, among other things, rollback --
easy at the whole dataset level, hard in more granular form.  Dataset-
level rollback is nowhere near granular enough for applications.

Application transactions consisting of more than one atomic filesystem
operation require application-level recovery code.  SQLite3 is a good
(though maybe extreme?) example of such an application; there are many
others.
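
The shell version of the single-file pattern Norm describes is roughly
this (a sketch; generate_report and the paths are made up, and note that
rename(2)/mv is only atomic within a single filesystem):

    tmp=/export/data/.report.$$         # temp file on the same filesystem
    generate_report > "$tmp" && mv "$tmp" /export/data/report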

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris Filesystem

2010-07-14 Thread Nicolas Williams
On Wed, Jul 14, 2010 at 03:07:59PM -0600, Beau J. Bechdol wrote:
 So not sure if this is the correct list to email to or not. I am curious to
 know on my machine I have two hard drive (c8t0d0 and c8t1d0). Can some one
 explain to me what this exactly means? What does c8 t0 and d0 actually
 mean. I might have to go back to solaris 101 to understand what this all
 means.

The 'c' is for controller, and the number that follows is one that is
assigned to the given controller (not necessarily on a first-come-
first-served 0-based basis!).  The controller number should be
considered unpredictable at install time.  Once installed it shouldn't
change, except for removable disks, where the controller number might
vary according to which slot you plugged the disk into.

The 't' is for target.

The 'd' is for disk -- think LUN.

The 'p' is for partition, and is used in Solaris on x86.

The 's' is for slice.  Slices are like partitions, but only used in
SOLARIS2 partitions, of which you're allowed no more than one per-disk.
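
Putting that together for one of your disks (the s0 is added just for
illustration):

    c8t1d0s0
    | | | `-- slice 0
    | | `---- disk/LUN 0
    | `------ target 1
    `-------- controller 8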

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Hash functions (was Re: Hashing files rapidly on ZFS)

2010-07-09 Thread Nicolas Williams
On Thu, Jul 08, 2010 at 08:42:33PM -0700, Garrett D'Amore wrote:
 On Fri, 2010-07-09 at 10:23 +1000, Peter Jeremy wrote:
  In theory, collisions happen.  In practice, given a cryptographic hash,
  if you can find two different blocks or files that produce the same
  output, please publicise it widely as you have broken that hash function.
 
 Not necessarily.  While you *should* publicize it widely, given all the
 possible text that we have, and all the other variants, its
 theoretically possibly to get likely.  Like winning a lottery where
 everyone else has a million tickets, but you only have one.
 
 Such an occurrence -- if isolated -- would not, IMO, constitute a
 'breaking' of the hash function.

A hash function is broken when we know how to create colliding
inputs.  A random collision does not a break make, though it might,
perhaps, help figure out how to break the hash function later.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-06-30 Thread Nicolas Williams
On Wed, Jun 30, 2010 at 01:35:31PM -0700, valrh...@gmail.com wrote:
 Finally, for my purposes, it doesn't seem like a ZIL is necessary? I'm
 the only user of the fileserver, so there probably won't be more than
 two or three computers, maximum, accessing stuff (and writing stuff)
 remotely.

It depends on what you're doing.

The perennial complaint about NFS is the synchronous open()/close()
operations and the fact that archivers (tar, ...) will generally unpack
archives in a single-threaded manner, which means all those synchronous
ops punctuate the archiver's performance with pauses.  This is a load
type for which ZIL devices come in quite handy.  If you write lots of
small files often and in single-threaded ways _and_ want to guarantee
you don't lose transactions, then you want a ZIL device.  (The recent
knob for controlling whether synchronous I/O gets done asynchronously
would help you if you don't care about losing a few seconds worth of
writes, assuming that feature makes it into any release of Solaris.)
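
If you decide you do want one, adding it later is trivial (pool and
device names hypothetical):

# zpool add tank log c2t0d0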

 But, from what I can gather, by spending a little under $400, I should
 substantially increase the performance of my system with dedup? Many
 thanks, again, in advance.

If you have deduplicatious data, yes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Nicolas Williams
On Wed, Jun 16, 2010 at 04:44:07PM +0200, Arne Jansen wrote:
 Please keep in mind I'm talking about a usage as ZIL, not as L2ARC or main
 pool. Because ZIL issues nearly sequential writes, due to the NVRAM-protection
 of the RAID-controller the disk can leave the write cache enabled. This means
 the disk can write essentially with full speed, meaning 150MB/s for a 15k 
 drive.
 114000 4k writes/s are 456MB/s, so 3 spindles should do.

You'd still have to flush those caches at the end of each transaction,
which would tend to come every few seconds, so you'd need to factor that
in.  You can definitely do with disk what you can do with SSDs, but not
necessarily with the same SWAP (space, wattage and price), and you'd
have a more complex system no matter what.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication and ISO files

2010-06-04 Thread Nicolas Williams
On Fri, Jun 04, 2010 at 12:37:01PM -0700, Ray Van Dolson wrote:
 On Fri, Jun 04, 2010 at 11:16:40AM -0700, Brandon High wrote:
  On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson rvandol...@esri.com wrote:
   The ISO's I'm testing with are the 32-bit and 64-bit versions of the
   RHEL5 DVD ISO's.  While both have their differences, they do contain a
   lot of similar data as well.
  
  Similar != identical.
  
  Dedup works on blocks in zfs, so unless the iso files have identical
  data aligned at 128k boundaries you won't see any savings.
  
   If I explode both ISO files and copy them to my ZFS filesystem I see
   about a 1.24x dedup ratio.
  
  Each file starts a new block, so the identical files can be deduped.
  
  -B
 
 Makes sense.  So, as someone else suggested, decreasing my block size
 may improve the deduplication ratio.
 
 recordsize I presume is the value to tweak?

Yes, but I'd not expect that much commonality between 32-bit and 64-bit
Linux ISOs...

Do the same check again with the ISOs exploded, as you say.
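
If you want to experiment anyway, it's just (dataset name hypothetical):

# zfs set recordsize=8K tank/isos

Keep in mind that recordsize only affects newly written data (so copy
the ISOs in again afterwards), and that a smaller recordsize means a
proportionally larger dedup table.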

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions about zil

2010-05-24 Thread Nicolas Williams
On Mon, May 24, 2010 at 05:48:56PM -0400, Thomas Burgess wrote:
 I recently got a new SSD (ocz vertex LE 50gb)
 
 It seems to work really well as a ZIL performance wise.  My question is, how
 safe is it?  I know it doesn't have a supercap so let's say dataloss
 occurs, is it just dataloss or is it pool loss?

Just dataloss.

 also, does the fact that i have a UPS matter?

Relative to power loss, yes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv over ssh

2010-05-20 Thread Nicolas Williams
On Thu, May 20, 2010 at 04:23:49PM -0400, Thomas Burgess wrote:
 I know i'm probably doing something REALLY stupid.but for some reason i
 can't get send/recv to work over ssh.  I just built a new media server and
 i'd like to move a few filesystem from my old server to my new server but
 for some reason i keep getting strange errors...
 
 At first i'd see something like this:
 
 pfexec: can't get real path of ``/usr/bin/zfs''
 
 or something like this:
 
 zfs: Command not found

Add /usr/sbin to your PATH or use /usr/sbin/zfs as the full path of the
zfs(1M) command.
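
E.g., something like this (host, pool and snapshot names made up):

% zfs send tank/media@moveit | ssh newserver pfexec /usr/sbin/zfs recv -d tank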

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] inodes in snapshots

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 05:33:05AM -0700, Chris Gerhard wrote:
 The reason for wanting to know is to try and find versions of a file.

No, there's no such guarantee.  The same inode and generation number
pair is extremely unlikely to be re-used, but the inode number itself is
likely to be re-used.

 If a file is renamed then the only way to know that the renamed file
 was the same as a file in a snapshot would be if the inode numbers
 matched. However for that to be reliable it would require the i-nodes
 are not reused.

There's also the crtime (creation time, not to be confused with ctime),
which you can get with ls(1).

  If they are able to be reused then when an inode number matches I
  would also have to compare the real creation time which requires
  looking at the extended attributes.

Right, that's what you'll have to do.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 07:50:13AM -0700, John Hoogerdijk wrote:
 Think about the potential problems if I don't mirror the log devices
 across the WAN.

If you don't mirror the log devices then your disaster recovery
semantics will be that you'll miss any transactions that hadn't been
committed to disk yet at the time of the disaster.  Which means that the
log devices' effects is purely local: for recovery from local power
failures (not extending to local disasters) and for acceleration.

This may or may not be acceptable to you.  If not, then mirror the log
devices.
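
Mirroring the log device is just (hypothetical names; one local device,
one remote iSCSI LUN):

# zpool add tank log mirror c2t0d0 c5t0d0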

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 02:29:24PM -0700, Don wrote:
 Since it ignores Cache Flush command and it doesn't have any
 persistent buffer storage, disabling the write cache is the best you
 can do.
 
 This actually brings up another question I had: What is the risk,
 beyond a few seconds of lost writes, if I lose power, there is no
 capacitor and the cache is not disabled?

You can lose all writes from the last committed transaction (i.e., the
one before the currently open transaction).  (You also lose writes from
the currently open transaction, but that's unavoidable in any system.)

Nowadays the system will let you know at boot time that the last
transaction was not committed properly and you'll have a chance to go
back to the previous transaction.
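
At import time that looks roughly like this (pool name hypothetical);
-F is the recovery option that discards the last few transaction groups
if that's what it takes to make the pool importable again:

# zpool import -F tank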

For me, getting much-better-than-disk performance out of an SSD with
cache disabled is enough to make that SSD worthwhile, provided the price
is right of course.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Nicolas Williams
On Thu, May 06, 2010 at 03:30:05PM -0500, Wes Felter wrote:
 On 5/6/10 5:28 AM, Robert Milkowski wrote:
 
 sync=disabled
 Synchronous requests are disabled. File system transactions
 only commit to stable storage on the next DMU transaction group
 commit which can be many seconds.
 
 Is there a way (short of DTrace) to write() some data and get
 notified when the corresponding txg is committed? Think of it as a
 poor man's group commit.

fsync(2) is it.  Of course, if you disable sync writes then there's no
way to find out for sure.  If you need to know when a write is durable,
then don't disable sync writes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-21 Thread Nicolas Williams
On Wed, Apr 21, 2010 at 10:45:24AM -0400, Edward Ned Harvey wrote:
  From: Mark Shellenbaum [mailto:mark.shellenb...@oracle.com]
  
   You can create/destroy/rename snapshots via mkdir, rmdir, mv inside
  the
   .zfs/snapshot directory, however, it will only work if you're running
  the
   command locally.  It will not work from a NFS client.
  
  
  It will work over NFS or SMB, but you will need to allow it via the
  necessary delegated administration permissions.
 
 Go on?
 I tried it over NFS and it didn't work.  So ... what are the necessary
 permissions?

See zfs(1M), search for delegate.
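
Something along these lines should do it (user and dataset names
hypothetical); snapshot, mount and destroy are the permissions that
mkdir/rmdir under .zfs/snapshot end up needing:

# zfs allow -u fred snapshot,mount,destroy tank/home/fred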

 I did it from a NFS client as root, where root maps to root.

Huh; dunno why that didn't work.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-21 Thread Nicolas Williams
On Wed, Apr 21, 2010 at 01:03:39PM -0500, Jason King wrote:
 ISTR POSIX also doesn't allow a number of features that can be turned
 on with zfs (even ignoring the current issues that prevent ZFS from
 being fully POSIX compliant today).  I think an additional option for
 the snapdir property ('directory' ?) that provides this behavior (with
 suitable warnings about posix compliance) would be reasonable.
 
 I believe it's sufficient that zfs provide the necessary options to
 act in a posix compliant manner (much like you have to set $PATH
 correctly to get POSIX conforming behavior, even though that might not
 be the default), though I'm happy to be corrected about this.

Yes, that's true.  But you couldn't rely on this behavior, whereas you
can rely on dataset roots having .zfs.  If you're going to script this,
then you'll want to rely on the current (POSIX-compliant) behavior.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-20 Thread Nicolas Williams
On Tue, Apr 20, 2010 at 04:28:02PM +, A Darren Dunham wrote:
 On Sat, Apr 17, 2010 at 09:03:33AM -0400, Edward Ned Harvey wrote:
   zfs list -t snapshot lists in time order.
  
  Good to know.  I'll keep that in mind for my zfs send scripts but it's not
  relevant for the case at hand.  Because zfs list isn't available on the
  NFS client, where the users are trying to do this sort of stuff.
 
 I'll note for comparison that the Netapp shapshots do expose this in one
 way.
 
 The actual snapshot directory access time is set to the time of the
 snapshot. That makes it visible over NFS.  Would be handy to do
 something similar in ZFS.

The .zfs/snapshot directory is most certainly available over NFS.

But note that .zfs does not appear in directory listings of dataset
roots -- you have to actually refer to it:

% ls -f|fgrep .zfs
% ls -f .zfs
.  ..  snapshot
% ls .zfs/snapshot
snapshots
% nfsstat -m $PWD
/net/.../pool/nico from ...:/pool/nico
 Flags: 
vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

%

And you can even create, rename and destroy snapshots by creating,
renaming and removing directories in .zfs/snapshot:

% mkdir .zfs/snapshot/foo
% mv .zfs/snapshot/foo .zfs/snapshot/bar
% rmdir .zfs/snapshot/bar

(All this also works locally, not just over NFS.)

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-16 Thread Nicolas Williams
On Fri, Apr 16, 2010 at 01:54:45PM -0400, Edward Ned Harvey wrote:
 If you've got nested zfs filesystems, and you're in some subdirectory where
 there's a file or something you want to rollback, it's presently difficult
 to know how far back up the tree you need to go, to find the correct .zfs
 subdirectory, and then you need to figure out the name of the snapshots
 available, and then you need to perform the restore, even after you figure
 all that out.

I've a ksh93 script that lists all the snapshotted versions of a file...
Works over NFS too.

% zfshist /usr/bin/ls
History for /usr/bin/ls (/.zfs/snapshot/*/usr/bin/ls):
-r-xr-xr-x   1 root     bin        33416 Jul  9  2008 /.zfs/snapshot/install/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-07-20:47:58/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-01-00:42:30/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-07-17-21:08:45/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-06-03-03:44:34/usr/bin/ls
% 

It's not perfect (e.g., it doesn't properly canonicalize its arguments,
so it doesn't handle symlinks and ..s in paths), but it's a start.
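
The guts of it amount to little more than this (simplified sketch; it
assumes the file lives in the pool's root dataset, which the real
script does not):

    #!/usr/bin/ksh93
    f=${1#/}        # path relative to the pool root, e.g. usr/bin/ls
    print "History for /$f (/.zfs/snapshot/*/$f):"
    ls -ltr /.zfs/snapshot/*/"$f" 2>/dev/null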

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-16 Thread Nicolas Williams
On Fri, Apr 16, 2010 at 02:19:47PM -0700, Richard Elling wrote:
 On Apr 16, 2010, at 1:37 PM, Nicolas Williams wrote:
  I've a ksh93 script that lists all the snapshotted versions of a file...
  Works over NFS too.
  
  % zfshist /usr/bin/ls
  History for /usr/bin/ls (/.zfs/snapshot/*/usr/bin/ls):
  -r-xr-xr-x   1 root     bin        33416 Jul  9  2008 /.zfs/snapshot/install/usr/bin/ls
  -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-07-20:47:58/usr/bin/ls
  -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-01-00:42:30/usr/bin/ls
  -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-07-17-21:08:45/usr/bin/ls
  -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-06-03-03:44:34/usr/bin/ls
  % 
  
  It's not perfect (e.g., it doesn't properly canonicalize its arguments,
  so it doesn't handle symlinks and ..s in paths), but it's a start.
 
 There are some interesting design challenges here.  For the general case, you 
 can't rely on the snapshot name to be in time order, so you need to sort by
 the mtime of the destination.

I'm using ls -ltr.

 It would be cool to only list files which are different.

True.  That'd not be hard.

 If you mv a file to another directory, you might want to search by filename
 or a partial directory+filename.

Or even inode number.

 Or maybe you just setup your tracker.cfg and be happy? 

Exactly.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: rm files/directories from snapshots

2010-04-16 Thread Nicolas Williams
On Fri, Apr 16, 2010 at 01:56:07PM -0400, Edward Ned Harvey wrote:
 The typical problem scenario is:  Some user or users fill up the filesystem.
 They rm some files, but disk space is not freed.  You need to destroy all
 the snapshots that contain the deleted files, before disk space is available
 again.
 
 It would be nice if you could rm files from snapshots, without needing to
 destroy the whole snapshot.
 
 Is there any existing work or solution for this?  

See the archives.  See the other replies to you already.  Short version: no.

However, a script to find all the snapshots that you'd have to delete in
order to delete some file might be useful, but really, only marginally
so: you should send your snapshots to backup and clean them out from
time to time anyways.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Rollback From ZFS Send

2010-04-06 Thread Nicolas Williams
On Tue, Apr 06, 2010 at 11:53:23AM -0400, Tony MacDoodle wrote:
 Can I rollback a snapshot that I did a zfs send on?
 
 ie: zfs send testpool/w...@april6 > /backups/w...@april6_2010

That you did a zfs send does not prevent you from rolling back to a
previous snapshot.  Similarly for zfs recv -- that you went from one
snapshot to another by zfs receiving a send does not stop you from
rolling back to an earlier snapshot.

You do need to have an earlier snapshot to rollback to, if you want to
rollback.
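
E.g. (snapshot names made up):

# zfs list -t snapshot -r testpool/ws
# zfs rollback testpool/ws@april5

-r is needed on the rollback if there are snapshots more recent than the
one you're rolling back to; it destroys them.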

Also, if you are using zfs send for backups, or for replication, and you
rollback the primary dataset, then you'll need to update your backups/
replicas accordingly.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs diff

2010-03-29 Thread Nicolas Williams
One really good use for zfs diff would be: as a way to index zfs send
backups by contents.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs diff

2010-03-29 Thread Nicolas Williams
zfs diff is incredibly cool.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send and ARC

2010-03-25 Thread Nicolas Williams
On Thu, Mar 25, 2010 at 04:23:38PM +, Darren J Moffat wrote:
 If the data is in the L2ARC that is still better than going out to
 the main pool disks to get the compressed version.

<advocate customer='devil'>

Well, one could just compress it...  If you'd otherwise put compression
in the ssh pipe (or elsewhere) then you could stop doing that.

</advocate customer='devil'>

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send and receive corruption across a WAN link?

2010-03-22 Thread Nicolas Williams
On Thu, Mar 18, 2010 at 10:38:00PM -0700, Rob wrote:
 Can a ZFS send stream become corrupt when piped between two hosts
 across a WAN link using 'ssh'?

No.  SSHv2 uses HMAC-MD5 and/or HMAC-SHA-1, depending on what gets
negotiated, for integrity protection.  The chances of random on the wire
corruption going undetected by link-layer CRCs, TCP's CRC and SSHv2's
MACs is infinitesimally small.  I suspect the chances of local bit
flips due to cosmic rays and what not are higher.

A bigger problem is that SSHv2 connections do not survive corruption on
the wire.  That is, if corruption is detected then the connection gets
aborted.  If you were zfs send'ing 1TB across a long, narrow link and
corruption hit the wire while sending the last block you'd have to
re-send the whole thing (but even then such corruption would still have
to get past link-layer and TCP checksums -- I've seen it happen, so it
is possible, but it is also unlikely).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Nicolas Williams
On Tue, Mar 02, 2010 at 11:10:52AM -0800, Bill Sommerfeld wrote:
 On 03/02/10 08:13, Fredrich Maney wrote:
 Why not do the same sort of thing and use that extra bit to flag a
 file, or directory, as being an ACL only file and will negate the rest
 of the mask? That accomplishes what Paul is looking for, without
 breaking the existing model for those that need/wish to continue to
 use it?
 
 While we're designing on the fly:

Heh.

   Another possibility would be to use an 
 additional umask bit or two to influence the mode-bit - acl interaction.

Well, I think the bit, if we must have one, belongs in the filesystem
objects that have ACLs, as opposed to processes.  There may be no umask
to apply in remote access cases, so using a process attribute is likely
to result in different behavior according to the access protocol and
client.  That might not be surprising for the CIFS case, but it
certainly would be for the NFS case.

But also I think it's the owner of an object that should decide what
happens to the object's ACL on chmod rather than random programs and
user environments.

We might need multiple bits, but we do have multiple bits to play with
in mode_t.  The main issue with adding mode_t bits is going to be: will
apps handle the appearance of new mode_t bits correctly?  I suspect that
they will, or at least that we'd consider it a bug if they didn't.  Or we
could add a new file attribute.

But given cheap datasets, why not settle for a suitable dataset property
as a starting point.  I.e., maybe we could play with aclmode a little
more.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Nicolas Williams
On Mon, Mar 01, 2010 at 09:04:58PM -0800, Paul B. Henson wrote:
 On Mon, 1 Mar 2010, Nicolas Williams wrote:
  Yes, that sounds useful.  (Group modebits could be applied to all ACEs
  that are neither owner@ nor everyone@ ACEs.)
 
 That sounds an awful lot like the POSIX mask_obj, which was the bane of my
 previous filesystem, DFS, and which, as it seems history repeats itself, I
 was also unable to get an option implemented to ignore it and allow ACL's
 to work without impediment.

Alternatively group modebits apply to only the group@ ACEs.  This could
be just yet another option.

If no modebits were to apply to ACEs with subjects other than
owner@/group@/everyone@ (what about subjects that match the file's
owner/group but aren't owner@/group@?) then there'd be no way to use
modebits as a big filter for ACLs.  This is why I proposed the above.

  If users have private primary groups then you can have them run with
  umask 007 or 002 and use set-gid and/or inheritable ACLs to ensure that
  users can share files in specific directories.  (This is one reason that
  I recommend always giving users their own private primary groups.)
 
 The only reason for the recommendation to give users their own private
 primary groups is because of the lack of flexibility of the umask/mode bits
 security model. In an environment with inheritable ACL's (that aren't
 subject to being violated by that legacy security model) there's no real
 need.

All reasons I have for it really come back to this: the idea of a
primary group and file group is an anachronism from back when ACLs (and
supplementary group memberships!) were overkill.  Think back to the days
when the AT&T labs were the only place where Unix ran and Unix had a
user base in the tens of users.  We're stuck with the notion of a
primary group (Windows seems to have it for interop with POSIX).  The
way to make the best of that situation is to give every user their own
private group.

  Alternatively we could have a new mode bit to indicate that the group
  bits of umask are to be treated as zero, or maybe assign this behavior
  to the set-gid bit on ZFS.
 
 So rather than a nice simple option granting ACL's immunity from umask/mode
 bits baggage, another attempted mapping/interaction?

You have a good idea of what is simple for your use case.  Your use
case also appears to be greatly influenced by what we could (should, do)
consider to be a bug in Samba.  Your idea of simple may not match
every one else's.  And your idea of simple might well differ if that
one application didn't use chmod() at all.

Personally I don't see a simple, non-surprising solution.  I see a set
of solutions that one could pick from.  In all cases I think we need a
way to synthesize modebits from ACLs (e.g., for objects created via
protocols that have no conception of modebits but have a conception of
ACLs) -- that's a difficult problem because any algorithm for doing that
will necessarily be lossy in many cases.

 If you only ever access ZFS via CIFS from windows clients, you can have a
 pure ACL model. Why should access via local shell or NFSv4 be a poor
 stepchild and chained down with legacy semantics that make it exceedingly
 difficult to actually use ACL's for their intended purpose?

I am certainly not advocating that.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Nicolas Williams
BTW, it should be relatively easy to implement aclmode=ignore and
aclmode=deny, if you like.

 - $SRC/common/zfs/zfs_prop.c needs to be updated to know about the new
   values of aclmode.

 - $SRC/uts/common/fs/zfs/zfs_acl.c:zfs_acl_chmod()'s callers need to be
   modified:

- in the create path if zfs_acl_chmod() gets called then you can't
  ignore nor deny the mode;
- zfs_acl_chmod_setattr() should call neither zfs_acl_node_read()
  nor zfs_acl_chmod() if aclmode=ignore or aclmode=deny
- in all other paths you zfs_acl_chmod() should do what it should do

 - $SRC/uts/common/fs/zfs/zfs_vnops.c:zfs_setattr() may need some
   updates too, e.g., to not call zfs_aclset_common() in the case of
   aclmode=ignore -- you'll probably have to play around to figure out
   what else.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-01 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 03:00:29PM -0500, Miles Nordin wrote:
  nw == Nicolas Williams nicolas.willi...@sun.com writes:
 
 nw What could we do to make it easier to use ACLs?
 
 1. how about AFS-style ones where the effective permission is the AND
of the ACL and the unix permission?  You might have to combine this

Yes, that sounds useful.  (Group modebits could be applied to all ACEs
that are neither owner@ nor everyone@ ACEs.)

with an inheritable-by-subdirectories umask setting so you could
create ACL-dominated lands of files that are all unix 777, but this
would stop clobbering difficult-to-recreate ACL's as well as
unintended information leaking.

If users have private primary groups then you can have them run with
umask 007 or 002 and use set-gid and/or inheritable ACLs to ensure that
users can share files in specific directories.  (This is one reason that
I recommend always giving users their own private primary groups.)
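
For example, for a shared project directory (group and path
hypothetical):

% chmod g+s /export/proj
% chmod A+group:proj:read_data/write_data/execute:file_inherit/dir_inherit:allow /export/proj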

Alternatively we could have a new mode bit to indicate that the group
bits of umask are to be treated as zero, or maybe assign this behavior
to the set-gid bit on ZFS.

 2. define a standard API for them, add ability to replicate them to
[...]

That'd be nice.

 Maybe we're beyond the point of no return for the first suggestion.

Why?  It can just be another value of the aclmode property.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 08:23:40AM -0800, Paul B. Henson wrote:
 So far it's been quite a struggle to deploy ACL's on an enterprise central
 file services platform with access via multiple protocols and have them
 actually be functional and reliable. I can see why the average consumer
 might give up.

Can you describe your struggles?  What could we do to make it easier to
use ACLs?  Is this about chmod [and so random apps] clobbering ACLs? or
something more fundamental about ACLs?

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 02:50:05PM -0800, Paul B. Henson wrote:
 On Fri, 26 Feb 2010, Bill Sommerfeld wrote:
 
  I believe this proposal is sound.
 
 Mere words can not express the sheer joy with which I receive this opinion
 from an @sun.com address ;).

I believe we can do a bit better.

A chmod that adds (see below) or removes one of r, w or x for owner is a
simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
modifying / replacing / adding owner@ ACEs (if there is one).  A similar
chmod affecting group bits should probably apply to group@ ACEs.  A
similar chmod affecting other should apply to any everyone@ ACEs.

For set-uid/gid and the sticky bits being set/cleared on non-directories
chmod should not affect the ACL at all.  For directories the sticky and
setgid bits may require editing the inheritable ACEs of the ACL.

 There's also the question of what to do with the non-access-control pieces
 of the legacy mode bits that have no ACL equivalent (suid, sgid, sticky
 bit, et al). I think the only way to set those is with an absolute chmod,

chmod(2) always takes an absolute mode.  ZFS would have to reconstruct
the relative change based on the previous mode... but how to know what
the previous mode was?  ZFS would have to construct one from the
owner@/group@/everyone@ + set-uid/gid + sticky bits, if any.  Best
effort will do.

 so there'd be no way to manipulate them in the current implementation
 without whacking the ACL. That's likely done relatively infrequently, those
 bits could always be set before the ACL is applied. In our current
 deployment the only one we use is sgid on directories, which is inherited,
 not directly applied.

You should probably stop using the set-gid bit on directories and use
inheritable ACLs instead...

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 04:26:43PM -0800, Paul B. Henson wrote:
 On Fri, 26 Feb 2010, Nicolas Williams wrote:
  I believe we can do a bit better.
 
  A chmod that adds (see below) or removes one of r, w or x for owner is a
  simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
  modifying / replacing / adding owner@ ACEs (if there is one).  A similar
  chmod affecting group bits should probably apply to group@ ACEs.  A
  similar chmod affecting other should apply to any everyone@ ACEs.
 
 I don't necessarily think that's better; and I believe that's approximately
 the behavior you can already get with aclmode=passthrough.
 
 If something is trying to change permissions on an object with a
 non-trivial ACL using chmod, I think it's safe to assume that's not what
 the original user who configured the ACL wants. At least, that would be
 safe to assume if the user had explicitly configured the hypothetical
 aclmode=deny or aclmode=ignore :).

Suppose you deny or ignore chmods.  Well, how would you ever set or
reset set-uid/gid and sticky bits?  chmod(2) deals only in absolute
modes, not relative changes, which means that in order to distinguish
those bits from the rwx bits the filesystem would have to know the
file's current mode bits in order to compare them to the new bits -- but
this is hard (see my other e-mail in a new sub-thread).  You'd have to
remove the ACL then chmod; oof.

 Take, for example, a problem I'm currently having on Linux clients mounting
 ZFS over NFSv4. Linux supports NFSv4, and even has a utility to manipulate
 NFSv4 ACL's that works ok (but isn't nearly as nice as the ACL integrated
 chmod command in Solaris). However, the default behavior of the linux cp
 command is to try and copy the mode bits along with the file. So, I copy
 a file into zfs over the NFSv4 mount from some local location. The file is
 created and inherits the explicitly configured ACL from the parent
 directory; the cp command then does a chmod() on it and the ACL is broken.
 That's not what I want, I configured that inheritable ACL for a reason, and
 I want it respected regardless of the permissions of the file in its
 original location.

Can you make that utility avoid the chmod?  The mode bits should come
from the open(2)/creat(2), and there should be no need to set them again
after setting the ACL.

 Another instance is an application that doesn't seem to trust creat() and
 umask to do the right thing, after creating a file it explicitly chmod's it
 to match the permissions it thinks it should have had based on the
 requested mode and the current umask. If the file inherited an explicitly
 specified non-trivial ACL, there's really nothing that can be done about
 that chmod, other than ignore or deny it, that will result in the
 permissions intended by the user who configured the ACL.

Such an app is broken.

  For set-uid/gid and the sticky bits being set/cleared on non-directories
  chmod should not affect the ACL at all.
 
 Agreed.

But see above, below.

  For directories the sticky and setgid bits may require editing the
  inheritable ACEs of the ACL.
 
 Sticky bit yes; in fact, as it affects permissions I think I'd lump that in
 to the ignore/deny category. sgid on directory though? That doesn't
 explicitly affect permission, it just potentially changes the group
 ownership of new files/directories. I suppose that indirectly affects
 permissions, as the implicit group@ ACE would be applied to a different
 group, but that's probably the intention of the person setting the sgid
 bit, and I don't think any actual ACL entry changes should occur from it.

I think both can be implemented as inheritable ACLs.

  chmod(2) always takes an absolute mode.  ZFS would have to reconstruct
  the relative change based on the previous mode...
 
 Or perhaps some interface extension allowing relative changes to the
 non-permission mode bits?

But we'd have to extend NFSv4 and get the extension adopted and
deployed.  There's no chance of such a change being made in a short
period of time -- we're talking years.

   For example, chown(2) allows you to specify -1
 for either the user or group, meaning don't change that one. mode_t is
 unsigned, so negative values won't work there, but there are a ton of
 extra bits in an unsigned int not relevant to the mode, perhaps setting one
 of them to signify only non permission related mode bits should be
 manipulated:

True, there's enough unused bits there that you could add ignore bits
(and mode4 is an unsigned 32-bit integer in NFSv4 too), but once again
you'd have to get clients and servers to understand this...

 [...]
 
 But back to ACL/chmod; I don't think there's any way to map a permission
 mode bits change via chmod to an ACL change that is guaranteed to be
 acceptable to the creator of the ACL. I think there should be some form of
 option available such that if an application is not ACL aware, it flat out
 shouldn't

Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Nicolas Williams
On Wed, Feb 24, 2010 at 02:09:42PM -0600, Bob Friesenhahn wrote:
 I have a directory here containing a million files and it has not 
 caused any strain for zfs at all although it can cause considerable 
 stress on applications.

The biggest problem is always the apps.  For example, ls by default
sorts, and if you're using a locale with a non-trivial collation (e.g.,
any UTF-8 locales) then the sort gets very expensive.
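
Two easy workarounds when you just need a listing:

% ls -f         # no sorting at all
% LC_ALL=C ls   # sort by byte value instead of locale collation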

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Nicolas Williams
On Wed, Feb 24, 2010 at 03:31:51PM -0600, Bob Friesenhahn wrote:
 With millions of such tiny files, it makes sense to put the small 
 files in a separate zfs filesystem which has its recordsize property 
 set to a size not much larger than the size of the files.  This should 
 reduce waste, resulting in reduced potential for fragmentation in the 
 rest of the pool.

Tuning the dataset recordsize down does not help in this case.  The
files are already small, so their recordsize is already small.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS 'secure erase'

2010-02-05 Thread Nicolas Williams
On Fri, Feb 05, 2010 at 03:49:15PM -0500, c.hanover wrote:
 Two things, mostly related, that I'm trying to find answers to for our
 security team.
 
 Does this scenario make sense:
 * Create a filesystem at /users/nfsshare1, user uses it for a while,
 asks for the filesystem to be deleted
 * New user asks for a filesystem and is given /users/nfsshare2.  What
 are the chances that they could use some tool or other to read
 unallocated blocks to view the previous user's data?

If the tool isn't accessing the raw disks, then the answer is no
chance.  (There's no way to access the raw disks over NFS.)

 Related to that, when files are deleted on a ZFS volume over an NFS
 share, how are they wiped out?  Are they zeroed or anything.  Same
 question for destroying ZFS filesystems, does the data lay about in
 any way?  (That's largely answered by the first scenario.)

Deleting a file does not guarantee that data blocks are released:
snapshots might exist that retain references to the data blocks of a
file that is being deleted.  Nor are blocks wiped when released.

 If the data is retrievable in any way, is there a way to a) securely
 destroy a filesystem, or b) securely erase empty space on a
 filesystem.

When ZFS crypto ships you'll be able to securely destroy encrypted
datasets.  Until then the only form of secure erasure is to destroy the
pool and then wipe the individual disks.
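
I.e., roughly (pool and disk names hypothetical; p0 addresses the whole
disk on x86, use s2 on SPARC):

# zpool destroy users
# dd if=/dev/zero of=/dev/rdsk/c1t0d0p0 bs=1024k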

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS 'secure erase'

2010-02-05 Thread Nicolas Williams
On Fri, Feb 05, 2010 at 04:41:08PM -0500, Miles Nordin wrote:
  ch == c hanover chano...@umich.edu writes:
 
 ch is there a way to a) securely destroy a filesystem,
 
 AIUI zfs crypto will include this, some day, by forgetting the key.

Right.

 but for SSD, zfs above a zvol, or zfs above a SAN that may do
 snapshots without your consent, I think it's just logically not a
 solvable problem, period, unless you have a writeable keystore
 outside the vdev structure.

IIIRC ZFS crypto will store encrypted blocks in L2ARC and ZIL, so
forgetting the key is sufficient to obtain a high degree of security.

ZFS crypto over zvols and what not presents no additional problems.
However, if your passphrase is guessable then the key might be
recoverable even after it's forgotten.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS 'secure erase'

2010-02-05 Thread Nicolas Williams
On Fri, Feb 05, 2010 at 05:08:02PM -0500, c.hanover wrote:
 In our particular case, there won't be snapshots of destroyed
 filesystems (I create the snapshots, and destroy them with the
 filesystem).

OK.

 I'm not too sure on the particulars of NFS/ZFS, but would it be
 possible to create a 1GB file without writing any data to it, and then
 use a hex editor to access the data stored on those blocks previously?

Absolutely not.

That is, you can create a 1GB file without writing to it, but it will
appear to contain all zeros.
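
Easy to see for yourself (file name hypothetical):

% mkfile -n 1g /users/nfsshare2/bigfile    # allocates no data blocks
% od -c /users/nfsshare2/bigfile | head    # nothing but zeros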

 Any chance someone could make any kind of sense of the contents
 (allocated in the same order they were before, or what have you)?

No.  See above.

 ZFS crypto will be nice when we get either NFSv4 or NFSv3 w/krb5 for
 over the wire encryption.  Until then, not much point.

You can use NFS with krb5 over the wire encryption _now_.
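
Assuming Kerberos is already set up on both ends it's just a share
option (dataset name hypothetical); krb5 gives you authentication only,
krb5i adds integrity, krb5p adds privacy (encryption) on the wire:

# zfs set sharenfs=sec=krb5p,rw users/nfsshare2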

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] unionfs help

2010-02-04 Thread Nicolas Williams
On Thu, Feb 04, 2010 at 03:19:15PM -0500, Frank Cusack wrote:
 BTW, I could just install everything in the global zone and use the
 default inheritance of /usr into each local zone to see the data.
 But then my zones are not independent portable entities; they would
 depend on some non-default software installed in the global zone.
 
 Just wanted to explain why this is valuable to me and not just some
 crazy way to do something simple.

There's no unionfs for Solaris.

(For those of you who don't know, unionfs is a BSDism and is a
pseudo-filesystem which presents the union of two underlying
filesystems, but with all changes being made only to one of the two
filesystems.  The idea is that one of the underlying filesystems cannot
be modified through the union, with all changes made through the union
being recorded in an overlay fs.  Think, for example, of unionfs-
mounting read-only media containing sources: you could cd to the mount
point and build the sources, with all intermediate files and results
placed in the overlay.)

In Frank's case, IIUC, the better solution is to avoid the need for
unionfs in the first place by not placing pkg content in directories
that one might want to be writable from zones.  If there's anything
about Perl5 (or anything else) that causes this need to arise, then I
suggest filing a bug.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] unionfs help

2010-02-04 Thread Nicolas Williams
On Thu, Feb 04, 2010 at 04:03:19PM -0500, Frank Cusack wrote:
 On 2/4/10 2:46 PM -0600 Nicolas Williams wrote:
 In Frank's case, IIUC, the better solution is to avoid the need for
 unionfs in the first place by not placing pkg content in directories
 that one might want to be writable from zones.  If there's anything
 about Perl5 (or anything else) that causes this need to arise, then I
 suggest filing a bug.
 
 Right, and thanks for chiming in.  Problem is that perl wants to install
 add-on packages in places that the coincide with the system install.
 Most stuff is limited to the site_perl directory, which is easily
 redirected, but it also has some other locations it likes to meddle with.

Maybe we need a zone_perl location.  Judicious use of the search paths
will get you out of this bind, I think.
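
E.g., something like this inside the zone (prefix hypothetical;
INSTALL_BASE is ExtUtils::MakeMaker's knob for relocated installs):

% perl Makefile.PL INSTALL_BASE=/opt/perl-site
% make install
% PERL5LIB=/opt/perl-site/lib/perl5; export PERL5LIB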

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-21 Thread Nicolas Williams
On Thu, Jan 21, 2010 at 02:11:31PM -0800, Moshe Vainer wrote:
 PS: For data that you want to mostly archive, consider using Amazon
 Web Services (AWS) S3 service. Right now there is no charge to push
 data into the cloud and its $0.15/gigabyte to keep it there. Do a
 quick (back of the napkin) calculation on what storage you can get for
 $30/month and factor in bandwidth costs (to pull the data when/if you
 need it). My napkin calculations tell me that I cannot compete
 with AWS S3 for up to 100Gb of storage available 7x24. Even the
 electric utility bill would be more than AWS charges - especially when
 you consider UPS and air conditioning. And thats not including any
 hardware (capital equipment) costs! see: http://aws.amazon.com/s3/
 
 When going the amazon route, you always need to take into account
 retrieval time/bandwidth cost.  If you were to store 100GB on Amazon -
 how fast can you get your data back, or how much would bandwidth cost
 you to retrieve it in a timely manner. It is all a matter of
 requirements of course.

Don't forget asymmetric upload/download bandwidth.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Nicolas Williams
On Thu, Dec 17, 2009 at 03:32:21PM +0100, Kjetil Torgrim Homme wrote:
 if the hash used for dedup is completely separate from the hash used for
 data protection, I don't see any downsides to computing the dedup hash
 from uncompressed data.  why isn't it?

Hash and checksum functions are slow (hash functions are slower, but
either way you'll be loading large blocks of data, which sets a floor
for cost).  Duplicating work is bad for performance.  Using the same
checksum for integrity protection and dedup is an optimization, and a
very nice one at that.  Having separate checksums would require making
blkptr_t larger, which imposes its own costs.

There's lots of trade-offs here.  Using the same checksum/hash for
integrity protection and dedup is a great solution.

If you use a non-cryptographic checksum algorithm then you'll
want to enable verification for dedup.  That's all.
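
For example (dataset name made up), verification is just another value
of the dedup property:

    zfs set dedup=verify tank/data          # dedup with byte-for-byte verification
    zfs set dedup=sha256,verify tank/data   # the same thing, spelled out

Whether a weaker checksum such as fletcher4 can be combined with
,verify depends on the build you're running.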

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file concatenation with ZFS copy-on-write

2009-12-03 Thread Nicolas Williams
On Thu, Dec 03, 2009 at 03:57:28AM -0800, Per Baatrup wrote:
 I would like to to concatenate N files into one big file taking
 advantage of ZFS copy-on-write semantics so that the file
 concatenation is done without actually copying any (large amount of)
 file content.
   cat f1 f2 f3 f4 f5 > f15
 Is this already possible when source and target are on the same ZFS
 filesystem?
 
 Am looking into the ZFS source code to understand if there are
 sufficient (private) interfaces to make a simple zcat -o f15   f1 f2
 f3 f4 f5 userland application in C code. Does anybody have advice on
 this?

There have been plenty of answers already.

Quite aside from dedup, the fact that all blocks in a file must have the
same uncompressed size means that if any of f2..f5 have different block
sizes from f1, or any of f1..f5's last blocks are partial then ZFS could
not perform this concatenation as efficiently as you wish.

In other words: dedup _is_ what you're looking for...

...but also ZFS most likely could not do any better with any other, more
specific non-dedup solution.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file concatenation with ZFS copy-on-write

2009-12-03 Thread Nicolas Williams
On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote:
 if any of f2..f5 have different block sizes from f1
 
 This restriction does not sound so bad to me if this only refers to
 changes to the blocksize of a particular ZFS filesystem or copying
 between different ZFSes in the same pool. This can properly be managed
 with a -f switch on the userland app to force the copy when it would
 fail.

Why expose such details?

If you have dedup on and if the file blocks and sizes align then

cat f1 f2 f3 f4 f5 > f6

will do the right thing and consume only space for new metadata.

If the file blocks and sizes do not align then

cat f1 f2 f3 f4 f5 > f6

will still work correctly.

Or do you mean that you want a way to do that cat ONLY if it would
consume no new space for data?  (That might actually be a good
justification for a ZFS cat command, though I think, too, that one could
script it.)
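
A rough sketch of such a script (ksh, with a 128KB recordsize assumed
and made-up file names) might look like this; it only illustrates the
check and is not a supported tool:

    #!/bin/ksh
    # Append f1..fN to $out, but only when every input except the last
    # is a whole multiple of the assumed recordsize, so the copied
    # blocks line up on block boundaries and dedup can share them.
    rs=131072
    out=$1; shift
    n=$#; i=0
    for f in "$@"; do
        i=$((i + 1))
        sz=$(wc -c < "$f")
        if [ "$i" -lt "$n" ] && [ $((sz % rs)) -ne 0 ]; then
            echo "$f is not a multiple of $rs bytes; blocks would not align" >&2
            exit 1
        fi
    done
    cat "$@" > "$out"

With dedup enabled the cat itself is what saves the space; the script
just refuses to run when the layout means it couldn't.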

 any of f1..f5's last blocks are partial
 
 Does this mean that f1,f2,f3,f4 need to be an exact multiple of the ZFS
 blocksize? This is a severe restriction that will fail unless in very
 special cases.

Say f1 is 1MB, f2 is 128KB, f3 is 510 bytes, f4 is 514 bytes, and f5 is
10MB, and the recordsize for their containing datasets is 128KB, then
the new file will consume 10MB + 128KB more than f1..f5 did, but 1MB +
128KB will be de-duplicated.

This is not really a severe restriction.  To make ZFS do better than
that would require much extra metadata and complexity in the filesystem
that users who don't need to do space-efficient file concatenation (most
users, that is) won't want to pay for.

 Is this related to the disk format or is it restriction in the
 implrmentation? (do you know where to look in the source code?).

Both.

 ...but also ZFS most likely could not do any better with any other, more
 specific non-dedup solution
 
 Probably lots of I/O traffic and digest calculation+lookups could be
 saved as we already know it will be a duplicate.  (In our case the
 files are gigabyte sizes)

ZFS hashes, and records hashes of blocks, not sub-blocks.  Look at my
above example.  To efficiently dedup the concatenation of the 10MB of f5
would require being able to have something like sub-block pointers.
Alternatively, if you want a concatenation-specific feature ZFS would
have to have a metadata notion of concatentation, but then the Unix way
of concatenating files couldn't be used for this since the necessary
context is lost in the I/O redirection.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard

2009-11-11 Thread Nicolas Williams
On Mon, Sep 07, 2009 at 09:58:19AM -0700, Richard Elling wrote:
 I only know of hole punching in the context of networking. ZFS doesn't
 do networking, so the pedantic answer is no.

But a VDEV may be an iSCSI device, thus there can be networking below
ZFS.

For some iSCSI targets (including ZVOL-based ones) a hole punching
operation can be very useful since it explicitly tells the backend that
some contiguous block of space can be released for allocation to others.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] PSARC recover files?

2009-11-10 Thread Nicolas Williams
On Tue, Nov 10, 2009 at 03:33:22PM -0600, Tim Cook wrote:
 You're telling me a scrub won't actively clean up corruption in snapshots?
 That sounds absolutely absurd to me.

Depends on how much redundancy you have in your pool.  If you have no
mirrors, no RAID-Z, and no ditto blocks for data, well, you have no
redundancy, and ZFS won't be able to recover affected files.
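
If the pool has no redundant vdevs you can at least ask for ditto
blocks for data (dataset name made up):

    zfs set copies=2 tank/precious    # applies to data written from now on

and zpool status -v will list any files a scrub found to be
unrecoverable.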

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Nicolas Williams
On Mon, Nov 02, 2009 at 12:58:32PM -0500, Dennis Clarke wrote:
 Looking at FIPS-180-3 in sections 4.1.2 and 4.1.3 I was thinking that the
 major leap from SHA256 to SHA512 was a 32-bit to 64-bit step.

ZFS doesn't have enough room in blkptr_t for 512-bit hashes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedup question

2009-11-02 Thread Nicolas Williams
On Mon, Nov 02, 2009 at 11:01:34AM -0800, Jeremy Kitchen wrote:
 forgive my ignorance, but what's the advantage of this new dedup over  
 the existing compression option?  Wouldn't full-filesystem compression  
 naturally de-dupe?

If you snapshot/clone as you go, then yes, dedup will do little for you
because you'll already have done the deduplication via snapshots and
clones.  But dedup will give you that benefit even if you don't
snapshot/clone all your data.  Not all data can be managed
hierarchically, with a single dataset at the root of a history tree.

For example, suppose you want to create two VirtualBox VMs running the
same guest OS, sharing as much on-disk storage as possible.  Before
dedup you had to: create one VM, then snapshot and clone that VM's VDI
files, use an undocumented command to change the UUID in the clones,
import them into VirtualBox, and setup the cloned VM using the cloned
VDI files.  (I know because that's how I manage my VMs; it's a pain,
really.)  With dedup you need only enable dedup and then install the two
VMs.

Clearly the dedup approach is far, far easier to use than the
snapshot/clone approach.  And since you can't always snapshot/clone...

There are many examples where snapshot/clone isn't feasible but dedup
can help.  For example: mail stores (though they can do dedup at the
application layer by using message IDs and hashes).  For example: home
directories (think of users saving documents sent via e-mail).  For
example: source code workspaces (ONNV, Xorg, Linux, whatever), where
users might not think ahead to snapshot/clone a local clone (I also tend
to maintain a local SCM clone that I then snapshot/clone to get
workspaces for bug fixes and projects; it's a pain, really).  I'm sure
there are many, many other examples.

The workspace example is particularly interesting: with the
snapshot/clone approach you get to deduplicate the _source code_, but
not the _object code_, while with dedup you get both dedup'ed
automatically.

As for compression, that helps whether you dedup or not, and it helps by
about the same factor either way -- dedup and compression are unrelated,
really.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs inotify?

2009-10-26 Thread Nicolas Williams
On Mon, Oct 26, 2009 at 08:53:50PM -0700, Anil wrote:
 I haven't tried this, but this must be very easy with dtrace. How come
 no one mentioned it yet? :) You would have to monitor some specific
 syscalls...

DTrace is not reliable in this sense: it will drop events rather than
overburden the system.  Also, system calls are not the only thing you
want to watch for -- you should really trace the VFS/fop rather than
syscalls for this.  In any case, port_create(3C) and gamin are the way
forward.

port_create(3C) is rather easy to use.  Searching the web for
PORT_SOURCE_FILE you'll find useful docs like:

http://blogs.sun.com/praks/entry/file_events_notification

which has example code too.

I do think it'd be useful to have a command-line utility in core Solaris
that uses this facility, something like the example in Prakash's blog
(which, incidentally, _works_), but perhaps a bit more complete.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can't rm file when No space left on device...

2009-10-01 Thread Nicolas Williams
On Thu, Oct 01, 2009 at 11:03:06AM -0700, Rudolf Potucek wrote:
 Hmm ... I understand this is a bug, but only in the sense that the
 message is not sufficiently descriptive. Removing the file from the
 source filesystem will not necessarily free any space because the
 blocks have to be retained in the snapshots. The same problem exists
 for zeroing the file with > file as suggested earlier.
 
 It seems like the appropriate solution would be to have a tool that
 allows removing a file from one or more snapshots at the same time as
 removing the source ... 

That would make them not really snapshots.  And such a tool would have
to fix clones too.

Snapshot and clones are great.  They are also great ways to consume too
much space.  One must do some spring cleaning once in a while.
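
The spring cleaning usually amounts to something like this (snapshot
name made up):

    zfs list -t snapshot -o name,used -s used   # which snapshots pin the most space
    zfs destroy tank/home@2009-06-01            # prune the ones you no longer need

Space pinned by a snapshot isn't returned until every snapshot and
clone referencing those blocks is gone.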

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs compression algorithm : jpeg ??

2009-09-04 Thread Nicolas Williams
On Fri, Sep 04, 2009 at 01:41:15PM -0700, Richard Elling wrote:
 On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
 We have groups generating terabytes a day of image data  from lab  
 instruments and saving them to an X4500.
 
 Wouldn't it be easier to compress at the application, or between the
 application and the archiving file system?

Especially when it comes to reading the images back!

ZFS compression is transparent.  You can't write uncompressed data then
read back compressed data.  And compression is at the block level, not
for the whole file, so even if you could read it back compressed, it
wouldn't be in a useful format.

Most people want to transfer data compressed, particularly images.  So
compressing at the application level in this case seems best to me.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] utf8only and normalization properties

2009-08-27 Thread Nicolas Williams
So, the manpage seems to have a bug in it.  The valid values for the
normalization property are:

none | formC | formD | formKC | formKD

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?

2009-08-21 Thread Nicolas Williams
On Fri, Aug 21, 2009 at 06:46:32AM -0700, Chris Murray wrote:
 Nico, what is a zero-link file, and how would I go about finding
 whether I have one? You'll have to bear with me, I'm afraid, as I'm
 still building my Solaris knowledge at the minute - I was brought up
 on Windows. I use Solaris for my storage needs now though, and slowly
 improving on my knowledge so I can move away from Windows one day  :)

I see that Mark S. thinks this may be a specific ZFS bug, and there's a
followup with instructions on how to detect if that's the case.

However, it can also be a zero-link file.  I've certainly run into that
problem before myself, on UFS and other filesystems.

A zero-link file is a file that has been removed (unlink(2)ed), but
which remains open in some process(es).  Such a file continues to
consume space until the processes that have it open are killed.

Typically you'd use pfiles(1) or lsof to find such files.
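
For example (the PID is made up):

    lsof +L1      # if you have lsof installed: open files with a link count of 0
    pfiles 1234   # or inspect a suspect process's open files with the bundled tools

Killing or restarting the offending process releases the space.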

 If it makes any difference, the problem persists after a full reboot,

Yeah, if you rebooted and there's no 14GB .nfs* files, then this is not
a zero-link file.  See the followups.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?

2009-08-18 Thread Nicolas Williams
Perhaps an open 14GB, zero-link file?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] utf8only and normalization properties

2009-08-13 Thread Nicolas Williams
On Thu, Aug 13, 2009 at 05:57:57PM -0500, Haudy Kazemi wrote:
 Therefore, if you need to interoperate with MacOS X then you should
 enable the normalization feature.
   
 Thank you for the reply. My goal is to configure the filesystem for the 
 lowest common denominator without knowing up front which clients will be 
 used. OS X and Win XP are listed because they are commonly used as 
 desktop OSes.  Ubuntu Linux is a third potential desktop OS.

Right, so set normalization=formD .
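
Note that normalization (and utf8only) can only be set when the dataset
is created, so this looks like (dataset name made up):

    zfs create -o normalization=formD -o utf8only=on tank/share

rather than a zfs set on an existing dataset.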

 The normalization property documentation says this property indicates 
 whether a file system should perform a unicode normalization of file 
 names whenever two file names are compared.  File names are always 
 stored unmodified, names are normalized as part of any comparison 
 process.  Where does the file system use filename comparisons and what 
 does it use them for?  Filename collision checking?  Sorting?

The system does filename comparisons when doing lookups
(open(/foo/bar/baz, ...) does at least three such lookups, for
example), and on create (since that involves a lookup).

Yes, this is about collisions.  Consider a file named á (that's a
with an acute accent).  There are _two_ possible encodings for that name
in UTF-8.  That means that you could have two files in the same
directory and with the same name, though they'd have different names if
you looked at the bytes that make up the names.  That would be
confusing, at the very least.

To avoid such collisions you can enable normalization.

You can find more here:

http://blogs.sun.com/nico/entry/filesystem_i18n

 Is it used for any other operation, say when returning a filename to an 
 application?  Would applications reading/writing files to a ZFS 

No, directory listings always return the filename used when the file
name was created, without any normalization.

 filesystem ever notice the difference in normalization settings as long 
 as they produce filenames that do not conflict with existing names or 
 create invalid UTF8?  The documentation says filenames are stored 
 unmodified, which sounds like things should be transparent to applications.

Applications shouldn't notice normalization being enabled.  The only
reasons to disable normalization are: a) you don't want to force the use
of UTF-8, or b) you consistently use a single normalization form and you
don't want to pay a penalty for normalizing on lookup.

(b) is probably not a problem -- the normalization code is fast if you
use all US-ASCII strings, and it's linear with the number of non-ASCII,
Unicode codepoints in file names.  But I don't have performance numbers
to share.  I think that normalization should be enabled by default if
you enable utf8only, and utf8only should probably be enabled by default
in Solaris, but that's just my personal opinion.

 (In regard to filename collision checking, if non-normalized unmodified 
 filenames are always stored on disk, and they don't conflict in 
 non-normalized form, what would the point be of normalizing the 
 filenames for a comparison?  To verify there isn't conflict in 
 normalized forms, and if there is no conflict with an existing file to 
 allow the filename to be written unmodified?)

Yes.

 The ZFS documentation doesn't list the valid values for the 
 normalization property other than 'none.  From your reply and from the 

The zfs(1M) manpage lists them:

 normalization = none | formD | formKCf

That's not all existing Unicode normalization forms, no.  The reason for
this is that we only normalize on lookup (the file names returned by
readdir are not normalized), and for that the forms C and D are
semantically equivalent, but K and non-K forms are not semantically
equivalent, so we need one K form and one non-K form.  NFD is faster
than NFC, but the K forms require a trip through form C, so NFKC is
faster than NFKD (at least if I remember correctly).  Which means that
NFD and NFKC were sufficient, and there's no reason to ever want NFC or
NFKD.

 suggest they be added to the documentation at
 http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html

Yes, that's a good point.

PS:  ZFS directories are hashed.  When normalization is enabled, the
 hash keys are normalized on create, but the hash contents are not,
 so filenames remain unnormalized.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] feature proposal

2009-07-29 Thread Nicolas Williams
On Wed, Jul 29, 2009 at 03:35:06PM +0100, Darren J Moffat wrote:
 Andriy Gapon wrote:
 What do you think about the following feature?
 
 Subdirectory is automatically a new filesystem property - an 
 administrator turns
 on this magic property of a filesystem, after that every mkdir *in the 
 root* of
 that filesystem creates a new filesystem. The new filesystems have
 default/inherited properties except for the magic property which is off.
 
 This has been brought up before and I thought there was an open CR for 
 it but I can't find it.

I'd want this to be something one could set per-directory, and I'd want
it to not be inheritable (or to have control over whether it is
inheritable).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The importance of ECC RAM for ZFS

2009-07-24 Thread Nicolas Williams
On Fri, Jul 24, 2009 at 05:01:15PM +0200, dick hoogendijk wrote:
 On Fri, 24 Jul 2009 10:44:36 -0400
 Kyle McDonald kmcdon...@egenera.com wrote:
  ... then it seems  like a shame (or a waste?)  not to equally
  protect the data both before it's given to ZFS for writing, and after
  ZFS reads it back and returns it to you.
 
 But that was not the question.
 The question was: [quote] My question is: is there any technical
 reason, in ZFS's design, that makes it particularly important for ZFS
 to require ECC RAM?

The only thing I can think of is this: if a cosmic ray flips a bit in
memory holding a ZFS transaction that's already had all its checksums
computed, but hasn't hit disk yet, then you'll have a checksum
verification failure later when you read back the affected file (or
directory).  Using ECC memory avoids that.  You still have the processor
to worry about though.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] virtualization, alignment and zfs variation stripes

2009-07-23 Thread Nicolas Williams
On Wed, Jul 22, 2009 at 02:45:52PM -0500, Bob Friesenhahn wrote:
 On Wed, 22 Jul 2009, t. johnson wrote:
 Lets say I have a simple-ish setup that uses vmware files for 
 virtual disks on an NFS share from zfs. I'm wondering how zfs' 
 variable block size comes into play? Does it make the alignment 
 problem go away? Does it make it worse? Or should we perhaps be
 
 My understanding is that zfs uses fixed block sizes except for the 
 tail block of a file, or if the filesystem has compression enabled.

For one block files, the block is variable, between 512 bytes and the
smaller of the dataset's recordsize or 128KB.  For multi-block files all
blocks are the same size, except the tail block.  But these are sizes in
file data, not actual on-disk sizes (which can be less because of
compression).

 Zfs's large blocks can definitely cause performance problems if the 
 system has insufficient memory to cache the blocks which are accessed, 
 or only part of the block is updated.

You should set the virtual disk image files' recordsize (or, rather, the
containing dataset's recordsize) to match the preferred block size of
the filesystem types (or data) that you'll put on those virtual disks.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs get faster and less expensive

2009-07-21 Thread Nicolas Williams
On Tue, Jul 21, 2009 at 02:45:57PM -0700, Richard Elling wrote:
 But to put this in perspective, you would have to *delete* 20 GBytes

Or overwrite (since overwrites turn into COW writes of new blocks
and the old blocks are released if not referred to from a snapshot).

 of data a day on a ZFS file system for 5 years (according to Intel) to
 reach the expected endurance.  I don't know many people who delete
 that much data continuously (I suspect that the satellite data vendors
 might in their staging servers... not exactly a market for SSDs)

Don't forget atime updates.  If you just read, you're still writing.

Of course, the writes from atime updates will generally be less than the
number of data blocks read, so you might have to read many more times
what you say in order to get the same effect.

(Speaking of atime updates, I run my root datasets with atime updates
disabled.  I don't have hard data, but it stands to reason that things
can go fast that way.  I also mount filesystems in VMs with atime
disabled.)
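
That's a one-liner per dataset, e.g. (dataset name made up):

    zfs set atime=off rpool/ROOT/opensolaris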

Yes, I'm picking nits; sorry.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] APPLE: ZFS need bug corrections instead of new func! Or?

2009-06-19 Thread Nicolas Williams
On Fri, Jun 19, 2009 at 04:09:29PM -0400, Miles Nordin wrote:
 Also, as I said elsewhere, there's a barrier controlled by Sun to
 getting bugs accepted.  This is a useful barrier: the bug database is
 a more useful drive toward improvement if it's not cluttered.  It also
 means, like I said, sometimes the mailing list is a more useful place
 for information.

There's two bug databases, sadly.  bugs.opensolaris.org is like you
describe, whereas defect.opensolaris.org is not.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..

2009-05-22 Thread Nicolas Williams
On Fri, May 22, 2009 at 04:40:43PM -0600, Eric D. Mudama wrote:
 As another datapoint, the 111a opensolaris preview got me ~29MB/s
 through an SSH tunnel with no tuning on a 40GB dataset.
 
 Sender was a Core2Duo E4500 reading from SSDs and receiver was a Xeon
 E5520 writing to a few mirrored 7200RPM SATA vdevs in a single pool.
 Network was a $35 8-port gigabit netgear switch.

Unfortunately SunSSH doesn't know how to grow SSHv2 channel windows
to take full advantage of the TCP bandwidth-delay product, so you could
probably have gone faster.
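
The usual pipeline is something like this (host, pool and snapshot
names made up):

    zfs send tank/data@today | ssh backuphost zfs receive -d backup

If your ssh supports a cheaper cipher, selecting one with -c may also
help move the bottleneck back to the disks.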
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs promote/destroy enhancements?

2009-04-23 Thread Nicolas Williams
On Thu, Apr 23, 2009 at 09:59:33AM -0600, Matthew Ahrens wrote:
 zfs destroy [-r] -p sounds great.
 
 I'm not a big fan of the -t template.  Do you have conflicting snapshot 
 names due to the way your (zones) software works, or are you concerned 
 about sysadmins creating these conflicting snapshots?  If it's the former, 
 would it be possible to change the zones software to avoid it?

I think the -t option -- automatic snapshot name conflict resolution --
makes a lot of sense in the context of snapshots and clones mostly
managed by a system component (zoneadm, beadm) but where users can also
create snapshots (e.g., for time slider, backups): you don't want the
users to create snapshot names that will later prevent zoneadm/beadm
destroy.  Making the users responsible for resolving such conflicts
seems not user-friendly to me.

However, if we could just avoid the conflicts in the first place then
we'd not need an option for automatic snapshot name conflict resolution.
Conflicts could be avoided by requiring that all snapshot names of a
dataset and of clones of snapshots of that dataset, and so on, be
unique.  Snapshot name uniqueness could be a property of the root
dataset of a snapshot/clone tree.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs promote/destroy enhancements?

2009-04-23 Thread Nicolas Williams
On Thu, Apr 23, 2009 at 11:25:54AM -0700, Edward Pilatowicz wrote:
 an interesting idea.  i can file an RFE on this as well, but there are a
 couple side effects to consider with this approach.
 
 setting this property would break zfs snapshot -r if there are
 multiple snapshots and clones of a single filesystem.

I agree.  You'd want to make snapshot -r compute new snapshot names in
some manner, so you're back to something like templates, which is why
I really like your proposal.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MySQL On ZFS Performance(fsync) Problem?

2009-04-15 Thread Nicolas Williams
On Wed, Apr 15, 2009 at 07:39:13PM +0200, Kees Nuyt wrote:
 On Wed, 15 Apr 2009 14:28:45 +0800, ??
 sky...@gmail.com wrote:
  I did some test  about MySQL's Insert performance 
  on ZFS,  and met a big performance problem,
  *i'm not sure what's the point*.

Q1: Did you set the filesystem's recordsize to match MySQL/InnoDB's page
size?

If not, then try doing so (and re-create/copy the DB files to ensure
they get the new recordsize).

Q2: Did you disable the ZIL?  If so then do re-enable it.
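
Concretely, something like this (dataset name made up):

    zfs set recordsize=16k tank/mysql   # InnoDB's default page size is 16KB
    # then copy/re-create the database files so they pick up the new recordsize
    grep zil_disable /etc/system        # make sure the ZIL hasn't been left disabled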

 [snip performance and config info]
 
 Is there any one can help me, 
 why fsync on zfs is so bad? 
 or other problem?
 
 My guess:
 The InnoDB engine uses copy-on-write internally.
 zfs adds another layer of copy-on-write. Both try to
 optimize localization (keep related data close on the disk).
 
 Amongst other things this fight between the two causes
 fragmentation.

I doubt that's the problem.  On ZFS fsync() would mean syncing more than
just the writes to the given file, rather: all the pending writes.  To
make that go faster ZFS has the ZIL as a way to avoid having to commit
an entire ZFS transaction group.  But even so writes to the ZIL are
synchronous.  If fsync()s are too slow even with the ZIL enabled then
you should put the ZIL on a write-biased flash device if at all
possible.

 Performance will get better if someone designs a MySQL
 storage engine which is aware of zfs and uses zfs
 copy-on-write primitives.

That may be, but I don't believe that two layers of COW will cause
problems in this case.  See my questions above.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [Fwd: ZFS user/group quotas space accounting [PSARC/2009/204 FastTrack timeout 04/08/2009]]

2009-04-01 Thread Nicolas Williams
On Wed, Apr 01, 2009 at 10:58:34AM +0200, casper@sun.com wrote:
 I know that this is one of the additional protocols developed for NFSv2 
 and NFSv3; does NFSv4 has a similar mechanism to get the quota?

Yes, NFSv4.0 and 4.1 both provide the same quota information retrieval
interface, three file/directory attributes:

 - quota_avail_hard
 - quota_avail_soft
 - quota_used

It's not clear whether the values returned for these attributes are
supposed to be specific to the credentials of the caller, but I assume
they are.  I don't know if the Solaris NFSv4 client and server support
this feature (the attributes are REQUIRED to implement in v4.1, but I'm
not sure if that's also true in v4.0).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [Fwd: ZFS user/group quotas space accounting [PSARC/2009/204 FastTrack timeout 04/08/2009]]

2009-04-01 Thread Nicolas Williams
On Wed, Apr 01, 2009 at 10:04:47AM +0100, Darren J Moffat wrote:
 If we had the .zfs/props/propname RFE implemented that would allow 
 users to see this regardless of what file sharing protocol they use.
 As well as lots of other very interesting info about the filesystem.

Indeed!
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [Fwd: ZFS user/group quotas space accounting [PSARC/2009/204 FastTrack timeout 04/08/2009]]

2009-03-31 Thread Nicolas Williams
On Tue, Mar 31, 2009 at 02:37:02PM -0500, Mike Gerdts wrote:
  The user or group is specified using one of the following forms:
  posix name (eg. ahrens)
  posix numeric id (eg. 126829)
  sid name (eg. ahr...@sun)
  sid numeric id (eg. S-1-12345-12423-125829)
 
 How does this work with zones?  Suppose in the global zone I have
 passwd entries like:

ZFS stores UIDs, GIDs and SIDs on disk.  The POSIX ID/SID <-> name
resolution happens in user-land.  As such the answer to your question is
the same as for any other operating system facility that deals with
POSIX ID/SID <-> name resolution: it depends on each zone's
configuration.

(In general kernel code never deals with user/group names directly, but
with UIDs/GIDs/SIDs.  One exception is the NFSv4 code, which upcalls to
user-land to resolve NFSv4 name@domain user/group names and vice versa.)

 jill:x:123:123:Jill Admin:/home/jill:/bin/bash
 joe:x:124:124:Joe Admin:/home/joe:/bin/bash
 
 And in a non-global zone (called bedrock) I have:
 
 fred:x:123:123:Fred Flintstone:/home/fred:/bin/bash
 barney:x:124:124:Barney Rubble:/home/barney:/bin/bash
 
 Dataset rpool/quarry is delegated to the zone bedrock.
 
 Does zfs get all rpool/quarry report the same thing whether it is
 run in the global zone or the non-global zone?

If you use the -n option, yes :)

Oh, but then, the -n option is for the new zfs {user|group}space
sub-command.  I don't think zfs get is getting that option; maybe it
should.

 Has there been any thought to using a UID resolution mechanism similar
 to that used by ps?  That is, if zfs get ... dataset is run in the
 global zone and the dataset is delegated to a non-global zone, display
 the UID rather than a possibly mistaken username.

That seems like a good idea to me.  You should send that comment to the
ARC case record (send an e-mail to psarc-...@sun.com with
PSARC/2009/204 somewhere in the Subject: header).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

