On 2012-10-24 21:58, Timothy Coalson wrote:
On Wed, Oct 24, 2012 at 6:17 AM, Robin Axelsson<
gu99r...@student.chalmers.se>  wrote:
It would be interesting to know how you would convert a raidz2 stripe to, say, a
raidz3 stripe. Let's say I'm on a raidz2 pool and want to add an extra
parity drive by converting it to a raidz3 pool. I imagine that would
be like creating a raidz1 pool on top of the leaf vdevs that constitute
the raidz2 pool plus the new leaf vdev, which results in an additional parity
drive. It doesn't sound too difficult to do. Actually, this way you
could even get raidz4 or raidz5 pools. The question, though, is how things would
pan out performance-wise; I would imagine that a 55-drive raidz25 pool is
really taxing on the CPU.

Multiple parity is more complicated than that: an additional XOR device (à
la traditional RAID-4) would end up with zeros everywhere and couldn't
reconstruct your data from an additional failure.  Look at "computing
parity" in http://en.wikipedia.org/wiki/Raid_6#RAID_6 .  While in theory the
scheme can extend to more than 3 parity blocks, it is unclear whether more than 3
would offer any serious additional benefit (using multiple raidz2 vdevs can
give you better IOPS than larger raidz3 vdevs, with little change in raw
space efficiency).  There are also combinatorial implications of multiple
bit errors in a single data chunk at high parity levels, but that is
somewhat unlikely.

XOR, you say? I didn't know that raidz used XOR for parity. I thought it used some kind of Reed-Solomon implementation, à la PAR2, at the block level to achieve the "RAID-like" functionality. From what I could read in the documentation, it was never stated that the RAID functionality was implemented like traditional hardware RAID. If XOR is the case, then I'm curious how they managed to pull off a raidz3 implementation with three-disk redundancy.

Maybe a good read through the zpool source code would help clarify things...
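For what it's worth, raidz does a bit of both: the first parity column is a plain XOR, and the higher parity columns are computed Reed-Solomon style over GF(2^8) (see vdev_raidz.c in the source). Here is a minimal sketch of RAID-6 style P+Q dual parity; the byte values, the generator choice, and the layout are illustrative, not raidz's exact on-disk format:

```python
# Minimal sketch of RAID-6 style dual parity over GF(2^8).
# P is a plain XOR; Q weights each data column by a distinct power of the
# generator g = 2, which is what makes two-erasure recovery solvable.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return p

def gf_pow(e):
    """g**e for the generator g = 2."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, 2)
    return r

def gf_inv(a):
    """Multiplicative inverse by brute force (fine for a demo)."""
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

data = [0x0A, 0xF3, 0x55, 0x10]        # one byte from each "data disk"

P = 0
Q = 0
for i, d in enumerate(data):
    P ^= d                             # raid5-style XOR parity
    Q ^= gf_mul(gf_pow(i), d)          # Reed-Solomon-style weighted parity

# Lose two data disks at once.  A second plain-XOR device could not
# distinguish their contributions, but P and Q together can:
x, y = 1, 3
Pxor, Qxor = P, Q
for i, d in enumerate(data):
    if i not in (x, y):
        Pxor ^= d
        Qxor ^= gf_mul(gf_pow(i), d)

# Solve  d_x ^ d_y = Pxor  and  g^x*d_x ^ g^y*d_y = Qxor  for d_x, d_y:
dx = gf_mul(gf_inv(gf_pow(x) ^ gf_pow(y)), Qxor ^ gf_mul(gf_pow(y), Pxor))
dy = Pxor ^ dx
assert (dx, dy) == (data[x], data[y])
```

Because every data column gets a distinct nonzero coefficient in Q, the two-unknown system is always solvable; a third parity row with yet another set of coefficients extends the same idea to three erasures, which is presumably how raidz3 does it.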


Going from raidz3 to raidz2 or from raidz2 to raidz1 sounds like a
no-brainer; you just remove one drive from the pool and force zpool to
accept the new state as "normal".

A degraded raidz2 vdev has to compute the missing block from parity on
nearly every read; this is not the normal state of raidz1.  Changing the
parity level, either up or down, has similar complications in the on-disk
structure.
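To make that read cost concrete, here is a toy sketch (made-up byte values, one byte per column) of what a degraded single-parity read involves: every read that touches the missing column has to XOR all the surviving columns back together.

```python
# Toy degraded-raidz1 read: with one data column missing, each read
# reconstructs it as the XOR of the parity column and every survivor.
cols = [bytes([0x0A, 0x0B]), bytes([0xF3, 0x00]), bytes([0x55, 0x11])]
parity = bytes(a ^ b ^ c for a, b, c in zip(*cols))

# Column 1 "fails"; a read of its data now costs an XOR across the rest:
rebuilt = bytes(p ^ a ^ c for p, a, c in zip(parity, cols[0], cols[2]))
assert rebuilt == cols[1]
```

A degraded raidz2 pays the same kind of reconstruction cost, with the extra GF arithmetic on top when two columns are gone.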

But expanding a raidz pool with additional storage while preserving the
parity structure sounds a little trickier. I don't think I have the
knowledge to write a bp rewriter (block-pointer rewrite), although I'm
reading Solaris Internals right now ;)

Unless raidz* did something radically different from raid5/6 (as in, not
having the parity blocks necessarily next to each other in the data chunk,
and having their positions recorded in the data chunk itself), the positions
of the parity and data blocks would change.  The "always consistent on
disk" approach of ZFS adds further problems, which probably make
it impossible to rewrite the re-parity'ed chunk over the old chunk, meaning
it has to find some free space every time it wants to update a chunk to the
new parity level.


What you describe here is known as unionfs on Linux, among others.
I think there were RFEs or otherwise expressed desires to implement that
in Solaris and later illumos (I campaigned for it some time ago),
but AFAIK this has not yet been done by anyone.

YES, UnionFS-like functionality is what I was talking about. It seems
to have been abandoned in favor of AuFS in the Linux and BSD worlds.
It also has features that seem a little overkill for use with zfs, such
as copy-on-write; perhaps a more minimal implementation would be a
better fit for zfs.

You could create zfs filesystems for subfolders in your "dataset" from the
separate pools, and give them mountpoints that put them into the same
directory.  You would have to balance the data allocation between the pools
manually, though.

I know that works, but I was talking about having files stored at different (hardware) locations and yet appearing in the same ... folder. I guess you are using MacOS :)


Perhaps a similar functionality can be established through an abstraction
layer behind network shares.

In Windows this functionality is called 'disk pooling', btw.

In ZFS, disk pooling is done by "creating a zpool", emphasis on singular.
Do you actually expect a large portion of your disks to go offline
suddenly?  I don't see a good way to handle this (good meaning there are no
missing files under the expected error conditions) that gets you more than
50% of your raw storage capacity (mirrors across the boundary of what you
expect to go down together).  I doubt I would like the outcome of having
some software make arbitrary decisions about which real filesystem to put
each file on, and then having one filesystem fail; so if you really expect
this, you may be happier keeping the two pools separate and deciding where
to put stuff yourself (since if you are expecting a set of disks to fail, I
expect you would have some idea as to which ones it would be, for instance
an external enclosure).

If, on the other hand, you don't expect your hardware to drop an entire set
of disks for no good reason, making them into one large storage pool and
putting your filesystem in it will share your data transparently across all
disks without needing to set anything else up.

Tim
It seems that ZFS is good at protecting data, but when things do go south, ZFS seems pretty bad at handling the situation. The more hard drives that are used in a storage pool, the higher the likelihood that something goes wrong.
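That intuition is easy to quantify. Assuming independent failures and an illustrative 5% annual failure rate per disk (both numbers are assumptions for the sake of the sketch, not measured figures):

```python
# Probability that at least one disk in an n-disk pool fails in a year,
# assuming independent failures and an illustrative 5% AFR per disk.
afr = 0.05
for n in (5, 12, 24, 55):
    p_any = 1 - (1 - afr) ** n
    print(f"{n:2d} disks: P(>=1 failure/year) = {p_any:.1%}")
```

Under these assumptions the 55-disk pool lands above a 90% chance of at least one failure per year. A failure is not data loss, of course; redundancy exists to absorb exactly this. The point is only that the number of incidents an operator has to handle grows with disk count.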

While I agree that it is not reasonable to expect all files to still be accessible if a large portion of the disks goes offline, it would at least be great if whatever happens to be on the remaining drives stayed accessible.

One way to achieve something in that direction would be to create some kind of separation in the file system, so that, say, two vdev configurations are technically independent but together constitute a common, unified storage location. It would be like the compartments of a ship: even if a few compartments are breached and take in water, the ship won't sink, because the others are intact.

_______________________________________________
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
